Using Solr in a schemaless mode
Many use cases allow us to define our index structure upfront. We can look at the data, see which parts are important, which we want to search, how we want to do it, and finally, we can create the schema.xml file that we will use. However, this is not always possible. Sometimes, you don't know the data structure before you go into production, or you know very little about it. Of course, we can use dynamic fields, but such an approach is limited. This is why the newest versions of Solr allow us to use the so-called schemaless mode, in which Solr is able to guess the type of data and create a field for it.
How to do it...
Let's assume that we don't know anything about the data and that we want to rely fully on Solr to work out the index structure.
- To do this, we start with the fields section of the schema.xml file. We need to include two fields, so our schema.xml file looks as follows:
      <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
      <field name="_version_" type="long" indexed="true" stored="true"/>
- In addition to this, we need to specify the unique identifier. We do this by including the following section in the schema.xml file:
      <uniqueKey>id</uniqueKey>
- We also need to have the field types defined. To do this, we add a section that looks as follows:
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
      <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
      <fieldType name="booleans" class="solr.BoolField" sortMissingLast="true" multiValued="true"/>
      <fieldType name="tlongs" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0" multiValued="true"/>
      <fieldType name="tdoubles" class="solr.TrieDoubleField" precisionStep="8" positionIncrementGap="0" multiValued="true"/>
      <fieldType name="tdates" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0" multiValued="true"/>
      <fieldType name="text" class="solr.TextField" positionIncrementGap="100" multiValued="true">
        <analyzer type="index">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.StandardTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
- Now, we can switch to the solrconfig.xml file to enable the so-called managed index schema. We do this by adding the following configuration snippet to the root section of the solrconfig.xml file:
      <schemaFactory class="ManagedIndexSchemaFactory">
        <bool name="mutable">true</bool>
        <str name="managedSchemaResourceName">managed-schema</str>
      </schemaFactory>
- We alter our update request handler so that it uses a custom update chain (we can just alter the section that already exists in the solrconfig.xml file):
      <requestHandler name="/update" class="solr.UpdateRequestHandler">
        <lst name="defaults">
          <str name="update.chain">add-unknown-fields</str>
        </lst>
      </requestHandler>
- Finally, we define the update request processor chain that the handler uses by adding the following section to the solrconfig.xml file:
      <updateRequestProcessorChain name="add-unknown-fields">
        <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
        <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
        <processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
        <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
        <processor class="solr.ParseDateFieldUpdateProcessorFactory">
          <arr name="format">
            <str>yyyy-MM-dd</str>
          </arr>
        </processor>
        <processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
          <str name="defaultFieldType">text</str>
          <lst name="typeMapping">
            <str name="valueClass">java.lang.Boolean</str>
            <str name="fieldType">booleans</str>
          </lst>
          <lst name="typeMapping">
            <str name="valueClass">java.util.Date</str>
            <str name="fieldType">tdates</str>
          </lst>
          <lst name="typeMapping">
            <str name="valueClass">java.lang.Long</str>
            <str name="valueClass">java.lang.Integer</str>
            <str name="fieldType">tlongs</str>
          </lst>
          <lst name="typeMapping">
            <str name="valueClass">java.lang.Number</str>
            <str name="fieldType">tdoubles</str>
          </lst>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory"/>
        <processor class="solr.RunUpdateProcessorFactory"/>
      </updateRequestProcessorChain>
Now, let's index a document that looks as follows:
    <add>
      <doc>
        <field name="id">1</field>
        <field name="title">Test document</field>
        <field name="published">2014-04-21</field>
        <field name="likes">12</field>
      </doc>
    </add>
Solr will index it without any problem, creating fields such as title, likes, and published with the proper types. We can check them by running a q=*:* query, which will result in the following response:
    <?xml version="1.0" encoding="UTF-8"?>
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
          <str name="q">*:*</str>
        </lst>
      </lst>
      <result name="response" numFound="1" start="0">
        <doc>
          <str name="id">1</str>
          <arr name="title">
            <str>Test document</str>
          </arr>
          <arr name="published">
            <date>2014-04-21T00:00:00Z</date>
          </arr>
          <arr name="likes">
            <long>12</long>
          </arr>
          <long name="_version_">1466477993631154176</long>
        </doc>
      </result>
    </response>
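The indexing and query steps above can also be scripted. The following is a minimal Python sketch, not part of the original recipe, that builds the two HTTP requests using only the standard library. The base URL http://localhost:8983/solr/collection1 is an assumption (substitute your own host and core name), and the helper names build_add_xml, build_update_request, build_query_url, and index_and_query are hypothetical. Nothing is sent over the network unless index_and_query is called against a running Solr instance.

```python
import urllib.parse
import urllib.request
from xml.sax.saxutils import escape, quoteattr

# Assumed location of the core; adjust host, port, and core name to your setup.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def build_add_xml(doc):
    """Render a dict of field name -> value as a Solr <add> XML message."""
    fields = "".join(
        "<field name=%s>%s</field>" % (quoteattr(name), escape(str(value)))
        for name, value in doc.items()
    )
    return "<add><doc>%s</doc></add>" % fields


def build_update_request(doc, base=SOLR_BASE):
    """Build (but do not send) a POST to /update that commits immediately."""
    return urllib.request.Request(
        base + "/update?commit=true",
        data=build_add_xml(doc).encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


def build_query_url(query="*:*", base=SOLR_BASE):
    """Build the URL of a /select request for the given query."""
    return base + "/select?" + urllib.parse.urlencode({"q": query})


def index_and_query(doc, base=SOLR_BASE):
    """Send the document, then return the raw response of a match-all query.

    Requires a running Solr instance at `base`.
    """
    with urllib.request.urlopen(build_update_request(doc, base)):
        pass
    with urllib.request.urlopen(build_query_url(base=base)) as resp:
        return resp.read().decode("utf-8")
```

Calling index_and_query({"id": "1", "title": "Test document", "published": "2014-04-21", "likes": "12"}) against a core configured as in this recipe should return a response similar to the one shown above.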
How it works...
We start with our index having two fields, id and _version_. The id field is used as the unique identifier; we informed Solr about this by adding the uniqueKey section in schema.xml. We will need it for functionalities such as document updates, deletes by identifiers, and so forth. The _version_ field is used by Solr internally, and is required by some Solr functionalities (such as optimistic locking); this is why we include it. The rest of the fields will be added automatically.
We also need to define the field types that we will use. Apart from the string type used by the id field and the long type used by the _version_ field, this section contains the types our documents will use. We will also refer to these types in our custom processor chain in the solrconfig.xml file.
The next thing is very important: the schema factory that we defined in solrconfig.xml, which is of the ManagedIndexSchemaFactory type (the class property is set to this value). By adding this section, we say that we want Solr to manage our schema.xml file. This means that Solr will load the schema.xml file during startup, change its name to schema.xml.bak, and then create a file called managed-schema (the value of the managedSchemaResourceName property). From this point, we shouldn't modify our index structure manually; we should either let Solr do it during indexation or add and alter fields using the schema API (we will talk about this in the Altering the index structure on a live collection recipe in Chapter 8, Using Additional Functionalities). Since I assume that we will use the schema API, I've set the mutable property to true. If we want to disallow using the schema API, we should set the mutable property to false.
Note
Note that you need to have a single schemaFactory defined, and it needs to be set to the ManagedIndexSchemaFactory type. If it is not set to this type, field discovery will not work and the indexation will result in an error.
We also need to include an update request processor chain. Since we want all index requests to use our custom chain, we add the update.chain property and set it to add-unknown-fields in the defaults section of our update request handler configuration.
Finally, the second most important thing in this recipe is our update request processor chain called add-unknown-fields (the same name that we used in the update request handler configuration). It defines several update processors that together give us the discovery of fields and their types. The solr.RemoveBlankFieldUpdateProcessorFactory processor factory removes empty fields from the documents we send for indexation. The solr.ParseBooleanFieldUpdateProcessorFactory processor factory is responsible for parsing Boolean fields; solr.ParseLongFieldUpdateProcessorFactory parses fields whose values are of the long type; solr.ParseDoubleFieldUpdateProcessorFactory parses fields whose values are of the double type; and solr.ParseDateFieldUpdateProcessorFactory parses date-based fields. For the date processor, we specify the format we want Solr to recognize (we will discuss this in more detail in the Using parsing update processors to parse data recipe in Chapter 2, Indexing Your Data).
Next, we include the solr.AddSchemaFieldsUpdateProcessorFactory processor factory, which adds the actual fields to our managed schema. We set the default field type to text by adding the defaultFieldType property. This type will be used when no other type matches the field. After the default field type definition, we see four lists called typeMapping. These sections define the field type mappings Solr will use. Each list contains at least one valueClass property and one fieldType property. The valueClass property names the Java class of a parsed value, and Solr maps values of that class to the field type defined by the fieldType property.
In our case, if Solr finds a date value (<str name="valueClass">java.util.Date</str>) in a field, it will create a new field using the tdates field type (<str name="fieldType">tdates</str>). If Solr finds a long or an integer value, it creates a new field using the tlongs field type. Of course, a field won't be created if it already exists in our managed schema. The name of the field created in our managed schema will be the same as the name of the field in the indexed document.
Finally, the solr.LogUpdateProcessorFactory processor factory tells Solr to write information about the update to the log, and the solr.RunUpdateProcessorFactory processor factory tells Solr to run the update itself.
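To make the order of the chain concrete, here is a small Python sketch that imitates it: each value is tried against the Boolean, long, double, and date parsers in that order, and the resulting value class is then looked up in a type mapping, with text as the fallback. This is a simplified model under stated assumptions, not Solr's actual parsing code: the function names are hypothetical, the regular expressions are rough approximations of what Solr accepts, the case-insensitive true/false check is assumed, and only the yyyy-MM-dd date format from our configuration is recognized.

```python
import re
from datetime import datetime

DEFAULT_FIELD_TYPE = "text"  # mirrors the defaultFieldType property


def parse_boolean(text):
    """Return True/False for 'true'/'false' (case-insensitive), else None."""
    return {"true": True, "false": False}.get(text.strip().lower())


def parse_long(text):
    """Return an int for a plain integer literal, else None."""
    return int(text) if re.fullmatch(r"[+-]?\d+", text.strip()) else None


def parse_double(text):
    """Return a float for a decimal literal with a fraction part, else None."""
    pattern = r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?"
    return float(text) if re.fullmatch(pattern, text.strip()) else None


def parse_date(text):
    """Return a datetime for a yyyy-MM-dd value, else None."""
    try:
        return datetime.strptime(text.strip(), "%Y-%m-%d")
    except ValueError:
        return None


# Same order as the parsing processors in the add-unknown-fields chain.
PARSERS = [parse_boolean, parse_long, parse_double, parse_date]

# Stand-in for the typeMapping lists. bool comes before int because
# bool is a subclass of int in Python.
TYPE_MAPPING = [
    (bool, "booleans"),
    (datetime, "tdates"),
    (int, "tlongs"),
    (float, "tdoubles"),
]


def guess_field_type(text):
    """Return the field type this simplified model would pick for a raw value."""
    for parser in PARSERS:
        value = parser(text)
        if value is not None:
            for value_class, field_type in TYPE_MAPPING:
                if isinstance(value, value_class):
                    return field_type
    return DEFAULT_FIELD_TYPE
```

For the document from this recipe, the sketch maps the title value to text, the published value to tdates, and the likes value to tlongs, which agrees with the response returned by Solr.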
As we can see, our data includes fields that we didn't specify in the schema.xml file, and the document was indexed properly, which allows us to assume that the functionality works. If you want to check what our index structure looks like after indexation, use the schema API; you can do it yourself after reading the Retrieving information about the index structure recipe in Chapter 8, Using Additional Functionalities.
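As a quick preview of such a check, the following Python sketch asks the schema API for the list of fields and reduces the answer to a name-to-type mapping. It is only an illustration under assumptions: the base URL and core name are placeholders, the helper names are hypothetical, and the code assumes that the JSON response of the /schema/fields endpoint carries a top-level fields list whose entries have name and type keys.

```python
import json
import urllib.request

# Assumed location of the core; adjust host, port, and core name to your setup.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def schema_fields_url(base=SOLR_BASE):
    """Build the URL that lists the fields of the managed schema as JSON."""
    return base + "/schema/fields?wt=json"


def field_summary(payload):
    """Map field name -> field type from a parsed schema API response."""
    return {field["name"]: field["type"] for field in payload.get("fields", [])}


def fetch_field_summary(base=SOLR_BASE):
    """Fetch and summarize the fields; requires a running Solr instance."""
    with urllib.request.urlopen(schema_fields_url(base)) as resp:
        return field_summary(json.load(resp))
```

After indexing the example document, fetch_field_summary() would be expected to list title, published, and likes next to id and _version_.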
One thing to remember is that by default, Solr is able to automatically detect field types such as Boolean, integer, float, long, double, and date.
Note
Take a look at https://cwiki.apache.org/confluence/display/solr/Schemaless+Mode for further information regarding the Solr schemaless mode.