Geospatial search
Geomatics (also known as geospatial technology or geomatics engineering) is the discipline of gathering, storing, processing, and delivering geographic, or spatially referenced, information. This information is expressed in terms of longitude (meridians, drawn as vertical lines on a map) and latitude (parallels, drawn as horizontal lines) and can be used in many ways. For instance, you may want to store every location of a company that has several offices, or sort search results by their distance from a given point. In short, geospatial search means working with coordinates anywhere on the globe.
In this section, we will learn how to:
Store geographical points in the index
Sort results by a distance from a point
Storing geographical points in the index
You might come across situations in which you need to store multiple locations of a company in the index. We could, of course, add multiple dynamic fields and remember the field names in our application, but that is not convenient. Solr can handle this situation, and the next example shows how to store pairs of values (in our case, the coordinates of a geographical point) in a single field.
Let us define three fields in the field definition section of our schema.xml file to store the company's data:
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="location" type="point" indexed="true" stored="true" multiValued="true" />
In addition to the preceding fields, we shall also have one dynamic field defined in our schema.xml file as shown:
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>
Our point type should look like this:
<fieldType name="point" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>
Now, let us look at our example data, which is stored in the geodata.xml file:
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">company</field>
    <field name="location">10,10</field>
    <field name="location">30,30</field>
  </doc>
</add>
Let us now index our data. To do so, run the following command from the exampledocs directory (where our geodata.xml file resides):
java -jar post.jar geodata.xml
With the data indexed, we can run the following query to retrieve it:
http://localhost:8080/solr/select?q=location:10,10
If you get the following response, then bingo! You have done it.
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
      <str name="q">location:10,10</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <arr name="location">
        <str>10,10</str>
        <str>30,30</str>
      </arr>
      <arr name="location_0_d">
        <double>10.0</double>
        <double>30.0</double>
      </arr>
      <arr name="location_1_d">
        <double>10.0</double>
        <double>30.0</double>
      </arr>
      <str name="name">company</str>
    </doc>
  </result>
</response>
We have four fields, one of them being a dynamic field which we have defined in our schema.xml file. The first field is the one responsible for holding the unique identifier. The second one is responsible for holding the name of the company. The third one, named location, is responsible for holding the geographical points and of course can have multiple values. The dynamic field will be used as a helper for the point type.
Then, we have the point type definition, which is based on the solr.PointType class and is defined by the following two attributes:
dimension: The number of dimensions that the field will store. In our case, as we store a pair of values, we set this attribute to 2.
subFieldSuffix: The suffix of the helper fields that store the actual values of the field. This is where our dynamic field comes into play. With this attribute, we tell Solr that our helper fields will be the dynamic fields ending with the suffix _d.
How does this type of field actually work? When we define a two-dimensional field, as we did, three fields are actually created in the index. The first field is named after the field we added in the schema.xml file, so in our case it is location. This field holds the stored value of the field, and it is only created when the stored attribute of the field is set to true.
The next two fields are based on the dynamic field. Their names follow the pattern field_0_d and field_1_d (location_0_d and location_1_d in our case). Each name consists of the field name, an _ character, the index of the value, and finally the suffix defined by the subFieldSuffix attribute of the type (_d in our case, which supplies the second _ character).
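The naming pattern just described can be sketched in a few lines of Python. This is only an illustration of how the helper field names relate to the field name, the value index, and the suffix; the function name point_subfields is hypothetical and the code is not Solr's actual implementation.

```python
def point_subfields(field_name, value, suffix="_d"):
    """Split a comma-separated point and map each coordinate to a helper field name.

    Follows the pattern <field>_<index><suffix> described in the text.
    Illustration only; this is not how Solr is implemented internally.
    """
    coordinates = [float(part) for part in value.split(",")]
    return {
        f"{field_name}_{index}{suffix}": coordinate
        for index, coordinate in enumerate(coordinates)
    }


# One of the points from geodata.xml.
subfields = point_subfields("location", "10,10")
print(subfields)
```

For the value 10,10 the sketch yields one entry for location_0_d and one for location_1_d, the same helper field names that appear in the response.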
Now, let us understand how the data is indexed. If you look at our example data file, you will see that the values in each pair are separated by a comma. That is how you add the data to the index.
Querying uses the same representation. The only difference from a standard single-valued field is that both values of the pair are passed in the query, separated by a comma.
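When the query is sent from application code rather than typed into a browser, the colon and the comma should be URL-encoded. The following is a minimal sketch using only the Python standard library; the helper name build_point_query is hypothetical, and the base URL is simply the one used in this section, so adjust it to your own setup.

```python
from urllib.parse import parse_qs, urlencode

# Base URL taken from the example in this section; host, port, and path may differ.
SOLR_SELECT_URL = "http://localhost:8080/solr/select"


def build_point_query(field, x, y):
    """Return a select URL that searches `field` for the point x,y."""
    params = {"q": f"{field}:{x},{y}"}
    # urlencode percent-encodes the colon and the comma in the q parameter.
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_point_query("location", 10, 10)
print(url)

# Decoding the query string gives back the original q value.
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["q"][0])
```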
Looking at the response, you can see that besides the location field, there are two dynamic fields (location_0_d and location_1_d) created.
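To see how the two helper fields relate to the original points, the sketch below rebuilds the points from a trimmed copy of the response. It assumes that the n-th value of each helper array belongs to the n-th point, which matches the example data but is an assumption, and the function name read_points is hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the response from this section (only the helper fields are kept).
RESPONSE = """
<response>
  <result name="response" numFound="1" start="0">
    <doc>
      <arr name="location_0_d"><double>10.0</double><double>30.0</double></arr>
      <arr name="location_1_d"><double>10.0</double><double>30.0</double></arr>
    </doc>
  </result>
</response>
"""


def read_points(doc, field="location", suffix="_d", dimension=2):
    """Rebuild coordinate tuples from the per-dimension helper arrays of one <doc>."""
    columns = []
    for index in range(dimension):
        arr = doc.find(f"arr[@name='{field}_{index}{suffix}']")
        columns.append([float(value.text) for value in arr.findall("double")])
    # Assumption: the n-th entry of every array belongs to the n-th point.
    return list(zip(*columns))


doc = ET.fromstring(RESPONSE).find("result/doc")
points = read_points(doc)
print(points)
```

Each helper array holds one dimension of every point, so pairing the entries by position gives back the two points that were indexed.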
Sort results by a distance from a point
Building on the scenario from the Storing geographical points in the index section of this chapter, imagine that you need to sort your search results by their distance from a user's location. This section shows how to do it.
Let us assume that we have the following index structure, which we have added to the field definition section of schema.xml.
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="string" indexed="true" stored="true" />
<field name="x" type="float" indexed="true" stored="true" />
<field name="y" type="float" indexed="true" stored="true" />
Here in this example, we have assumed that the user location will be provided from the application making the query.
Our example data looks like this:
<add>
  <doc>
    <field name="id">1</field>
    <field name="name">Company 1</field>
    <field name="x">56.4</field>
    <field name="y">40.2</field>
  </doc>
  <doc>
    <field name="id">2</field>
    <field name="name">Company 2</field>
    <field name="x">50.1</field>
    <field name="y">48.9</field>
  </doc>
  <doc>
    <field name="id">3</field>
    <field name="name">Company 3</field>
    <field name="x">23.18</field>
    <field name="y">39.1</field>
  </doc>
</add>
Suppose that the user of this search application is standing at latitude 0, longitude 0, the point where the equator crosses the prime meridian. Our query to find the companies and sort them in ascending order of distance from that point would be:
http://localhost:8080/solr/select?q=company&sort=dist(2,x,y,0,0)+asc
Our result would look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="q">company</str>
      <str name="sort">dist(2,x,y,0,0) asc</str>
    </lst>
  </lst>
  <result name="response" numFound="3" start="0">
    <doc>
      <str name="id">3</str>
      <str name="name">Company 3</str>
      <float name="x">23.18</float>
      <float name="y">39.1</float>
    </doc>
    <doc>
      <str name="id">1</str>
      <str name="name">Company 1</str>
      <float name="x">56.4</float>
      <float name="y">40.2</float>
    </doc>
    <doc>
      <str name="id">2</str>
      <str name="name">Company 2</str>
      <float name="x">50.1</float>
      <float name="y">48.9</float>
    </doc>
  </result>
</response>
As you can see in the index structure and the data, every company is described by four fields: the unique identifier (id), company name (name), the latitude of the company's location (x), and the longitude of the company's location (y).
To achieve the expected results, we run a standard query with a non-standard sort. The sort parameter consists of a function name, dist, which calculates the distance between points. In our example, the function (dist(2,x,y,0,0)) takes five parameters, which are:
The first parameter specifies the algorithm used to calculate the distance (the power of the norm). In our case, the value 2 tells Solr to calculate the Euclidean distance.
The second parameter, x, is the field that contains the latitude.
The third parameter, y, is the field that contains the longitude.
The fourth parameter is the latitude of the point from which the distance will be calculated (0 in our query).
The fifth parameter is the longitude of the point from which the distance will be calculated (0 in our query). Together, the last two parameters describe the point where the equator crosses the prime meridian.
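To check the ordering by hand, the Euclidean distance can be recomputed outside Solr. The following Python sketch applies the straight-line formula to the three example documents; the function names are hypothetical, and the calculation treats the coordinates as points on a flat plane, so it is not a true distance over the Earth's surface.

```python
import math

# Example documents from this section: (id, x, y).
COMPANIES = [
    ("1", 56.4, 40.2),
    ("2", 50.1, 48.9),
    ("3", 23.18, 39.1),
]


def euclidean_distance(x, y, ref_x=0.0, ref_y=0.0):
    """Straight-line distance on a flat plane, the measure selected by the value 2."""
    return math.hypot(x - ref_x, y - ref_y)


def sort_by_distance(companies, ref_x=0.0, ref_y=0.0):
    """Return the companies ordered from nearest to farthest from the reference point."""
    return sorted(
        companies,
        key=lambda company: euclidean_distance(company[1], company[2], ref_x, ref_y),
    )


for company_id, x, y in sort_by_distance(COMPANIES):
    print(company_id, round(euclidean_distance(x, y), 2))
```

The ids come out in the order 3, 1, 2, which is the order of the documents in the response.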
If you would like to explore the other functions available in Solr, see the Solr Wiki page at http://wiki.apache.org/solr/FunctionQuery