Merge branch '15-solr-based-spatial-search' into release-v2.0
commit
51a2b20501
199
README.rst
|
@ -32,30 +32,15 @@ About the components
|
|||
Spatial Search
|
||||
--------------
|
||||
|
||||
To enable the spatial query you need to add the `spatial_query` plugin to your
|
||||
ini file (See `Configuration`_). This plugin requires the `spatial_metadata`
|
||||
plugin.
|
||||
|
||||
The extension adds the following call to the CKAN search API, which returns
|
||||
datasets with an extent that intersects with the bounding box provided::
|
||||
|
||||
/api/2/search/dataset/geo?bbox={minx,miny,maxx,maxy}[&crs={srid}]
|
||||
|
||||
If the bounding box coordinates are not in the same projection as the one
|
||||
defined in the database, a CRS must be provided, in one of the following
|
||||
forms:
|
||||
|
||||
- urn:ogc:def:crs:EPSG::4326
|
||||
- EPSG:4326
|
||||
- 4326
|
||||
|
||||
As of CKAN 1.6, you can integrate your spatial query in the full CKAN
|
||||
search, via the web interface (see the `Spatial Search Widget`_) or
|
||||
via the `action API`__, e.g.::
|
||||
The spatial extension allows datasets with spatial information to be indexed so
they can be filtered via a spatial query, either through the web interface (see
the `Spatial Search Widget`_) or through the `action API`__, e.g.::
|
||||
|
||||
POST http://localhost:5000/api/action/package_search
|
||||
{
|
||||
"q": "Pollution",
|
||||
"facet": "true",
|
||||
"facet.field": "country",
|
||||
"extras": {
|
||||
"ext_bbox": "-7.535093,49.208494,3.890688,57.372349"
|
||||
}
|
||||
|
@ -63,12 +48,107 @@ via the `action API`__, e.g.::
|
|||
|
||||
__ http://docs.ckan.org/en/latest/apiv3.html
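
As an illustration of the request above, the following minimal sketch builds
(but does not send) a ``package_search`` call with an ``ext_bbox`` extra. The
helper name ``build_spatial_search_request`` is hypothetical and not part of
the extension; only the endpoint and the ``ext_bbox`` key come from the example
above.

```python
import json
from urllib.request import Request


def build_spatial_search_request(base_url, query, bbox):
    """Build a package_search request filtered by a bounding box.

    ``bbox`` is a (minx, miny, maxx, maxy) tuple. This helper is an
    illustrative assumption, not an API provided by the extension.
    """
    payload = {
        "q": query,
        # The extension reads the bounding box from the ``ext_bbox`` extra
        # as a comma-separated "minx,miny,maxx,maxy" string.
        "extras": {"ext_bbox": ",".join(str(value) for value in bbox)},
    }
    return Request(
        base_url.rstrip("/") + "/api/action/package_search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_spatial_search_request(
    "http://localhost:5000",
    "Pollution",
    (-7.535093, 49.208494, 3.890688, 57.372349),
)
```

The resulting object can be passed to ``urllib.request.urlopen`` against a
running CKAN instance.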
|
||||
|
||||
To enable the spatial query you need to add the ``spatial_query`` plugin to your
|
||||
ini file (See `Configuration`_). This plugin requires the ``spatial_metadata``
|
||||
plugin.
|
||||
|
||||
Several backends are supported for the spatial search. It is important to
understand how they differ and what setup each one requires before choosing
one. If no backend is configured, the plugin falls back to ``postgis``. The
backend is selected with the ``ckanext.spatial.search_backend`` configuration
option, e.g.::
|
||||
|
||||
ckanext.spatial.search_backend = solr
|
||||
|
||||
The following table summarizes the different spatial search backends:
|
||||
|
||||
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
| Backend                | Solr Versions | Supported geometries                | Sorting and relevance                                     | Performance with large number of datasets |
+========================+===============+=====================================+===========================================================+===========================================+
| ``solr``               | 3.1 to 4.x    | Bounding Box                        | Yes, spatial sorting combined with other query parameters | Good                                      |
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
| ``solr-spatial-field`` | 4.x           | Bounding Box, Point and Polygon (1) | Not implemented                                           | Good                                      |
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
| ``postgis``            | 1.3 to 4.x    | Bounding Box                        | Partial, only spatial sorting supported (2)               | Poor                                      |
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
|
||||
|
||||
(1) Requires JTS
|
||||
(2) Needs ``ckanext.spatial.use_postgis_sorting`` set to True
|
||||
|
||||
|
||||
We recommend using the ``solr`` backend whenever possible. Here are more
details about the available options:
|
||||
|
||||
* ``solr`` (Recommended)
|
||||
This option indexes the bounding box coordinates and area of each geometry
in regular Solr fields and uses a boost function to sort results by spatial
relevance, while keeping any other non-spatial filtering. It only supports
bounding boxes, both for the geometries to be indexed and for the input
query shape. It requires the `EDisMax`_ query parser, so it will only work
on Solr 3.1 or later (we recommend using Solr 4.x).
|
||||
|
||||
You will need to add the following fields to your Solr schema file to enable it::
|
||||
|
||||
    <fields>
        <!-- ... -->
        <field name="bbox_area" type="float" indexed="true" stored="true" />
        <field name="maxx" type="float" indexed="true" stored="true" />
        <field name="maxy" type="float" indexed="true" stored="true" />
        <field name="minx" type="float" indexed="true" stored="true" />
        <field name="miny" type="float" indexed="true" stored="true" />
    </fields>
|
||||
|
||||
|
||||
* ``solr-spatial-field``
|
||||
This option uses the `spatial field <http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4>`_
introduced in Solr 4, which can index points, rectangles and more complex
geometries (complex geometries require `JTS`_; check the Solr
documentation). Sorting has not been implemented yet. Users who need it
will have to modify the query using the ``before_search`` extension point.
|
||||
|
||||
You will need to add the following field type and field to your Solr schema
|
||||
file to enable it (Check the Solr documentation for more information on
|
||||
the different parameters, note that you don't need ``spatialContextFactory`` if
|
||||
you are not using JTS)::
|
||||
|
||||
    <types>
        <!-- ... -->
        <fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
            spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
            distErrPct="0.025"
            maxDistErr="0.000009"
            units="degrees"
        />
    </types>
    <fields>
        <!-- ... -->
        <field name="spatial_geom" type="location_rpt" indexed="true" stored="true" multiValued="true" />
    </fields>
|
||||
|
||||
* ``postgis``
|
||||
This is the original implementation of the spatial search. It does not
|
||||
require any change in the Solr schema and can run on Solr 1.x, but it is
|
||||
not as efficient as the previous ones. Basically the bounding box based
|
||||
query is performed in PostGIS first, and the ids of the matched datasets
|
||||
are added as a filter to the Solr request. This, apart from being much
|
||||
less efficient, can lead to issues on Solr due to the size of the requests (See
|
||||
`Solr configuration issues on legacy PostGIS backend`_). There is support
|
||||
for a spatial ranking on this backend (setting
|
||||
``ckanext.spatial.use_postgis_sorting`` to True on the ini file), but it
|
||||
cannot be combined with any other filtering.
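
To make the ``solr`` backend's indexing step more concrete, here is a minimal
sketch of how the five schema fields listed above could be derived from a
dataset's ``spatial`` extra. The function name ``bbox_index_fields`` is
hypothetical, and unlike the plugin (which compares two opposite corners) this
version simply takes the minimum and maximum over all ring points.

```python
import json


def bbox_index_fields(spatial_extra):
    """Return the bbox fields for a GeoJSON bounding-box polygon, or None.

    Illustrative only: a bounding box is assumed to be a Polygon with a
    single ring of five points (the first point repeated at the end).
    """
    geometry = json.loads(spatial_extra)
    rings = geometry.get("coordinates", [])
    if geometry.get("type") != "Polygon" or len(rings) != 1 or len(rings[0]) != 5:
        return None  # other geometries are not supported by this backend

    xs = [point[0] for point in rings[0]]
    ys = [point[1] for point in rings[0]]
    fields = {"minx": min(xs), "miny": min(ys), "maxx": max(xs), "maxy": max(ys)}
    fields["bbox_area"] = (fields["maxx"] - fields["minx"]) * (
        fields["maxy"] - fields["miny"]
    )
    return fields


example_extra = (
    '{"type": "Polygon", "coordinates": [[[2.05827, 49.8625], [2.05827, 55.7447], '
    "[-6.41736, 55.7447], [-6.41736, 49.8625], [2.05827, 49.8625]]]}"
)
example_fields = bbox_index_fields(example_extra)
```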
|
||||
|
||||
|
||||
|
||||
|
||||
.. _edismax: http://wiki.apache.org/solr/ExtendedDisMax
|
||||
.. _JTS: http://www.vividsolutions.com/jts/JTSHome.htm
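
The relevance ranking of the ``solr`` backend is described in the plugin code
as ``2 * X / (Q + T)``, where ``X`` is the area of the intersection between the
query box ``Q`` and the target box ``T``. The sketch below computes the same
ratio in plain Python to show how it behaves. The function ``overlap_score`` is
a hypothetical stand-in for the Solr boost function, not code from the
extension.

```python
def overlap_score(query_bbox, target_bbox):
    """Return 2 * X / (Q + T) for two (minx, miny, maxx, maxy) boxes.

    The result is 0.0 when the boxes do not overlap and 1.0 when they
    are identical.
    """
    qminx, qminy, qmaxx, qmaxy = query_bbox
    tminx, tminy, tmaxx, tmaxy = target_bbox

    # Width and height of the intersection, clamped at zero when disjoint.
    width = max(0.0, min(qmaxx, tmaxx) - max(qminx, tminx))
    height = max(0.0, min(qmaxy, tmaxy) - max(qminy, tminy))
    intersection = width * height

    query_area = (qmaxx - qminx) * (qmaxy - qminy)
    target_area = (tmaxx - tminx) * (tmaxy - tminy)
    total = query_area + target_area
    if total == 0:
        return 0.0  # both boxes are degenerate (zero area)
    return 2.0 * intersection / total


identical = overlap_score((0, 0, 2, 2), (0, 0, 2, 2))
half = overlap_score((0, 0, 2, 2), (1, 0, 3, 2))
disjoint = overlap_score((0, 0, 1, 1), (5, 5, 6, 6))
```

Datasets whose score is 0 do not intersect the query box, which is why the
plugin also uses the same function as a filter.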
|
||||
|
||||
|
||||
Geo-Indexing your datasets
|
||||
++++++++++++++++++++++++++
|
||||
|
||||
In order to make a dataset queryable by location, a special extra must
|
||||
be defined, with its key named 'spatial'. The value must be a valid GeoJSON_
|
||||
geometry, for example::
|
||||
Regardless of the backend that you are using, in order to make a dataset
|
||||
queryable by location, a special extra must be defined, with its key named
|
||||
'spatial'. The value must be a valid GeoJSON_ geometry, for example::
|
||||
|
||||
{"type":"Polygon","coordinates":[[[2.05827, 49.8625],[2.05827, 55.7447], [-6.41736, 55.7447], [-6.41736, 49.8625], [2.05827, 49.8625]]]}
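
A value like the one above can be generated from a plain bounding box. The
following sketch lists the corners counter-clockwise, starting at the lower
left one, which is the same order the harvester's extent template uses after
this change. Both helper names are hypothetical.

```python
import json


def bbox_to_spatial_extra(minx, miny, maxx, maxy):
    """Serialise a bounding box as a GeoJSON Polygon string.

    The ring is closed (first point repeated) and counter-clockwise.
    """
    ring = [
        [minx, miny],
        [maxx, miny],
        [maxx, maxy],
        [minx, maxy],
        [minx, miny],
    ]
    return json.dumps({"type": "Polygon", "coordinates": [ring]})


def signed_area(ring):
    """Shoelace formula: positive for counter-clockwise rings."""
    pairs = zip(ring, ring[1:])
    return sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs) / 2.0


spatial_extra = bbox_to_spatial_extra(-6.41736, 49.8625, 2.05827, 55.7447)
```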
|
||||
|
||||
|
@ -83,7 +163,7 @@ the information stored in the extra with the geometry table.
|
|||
|
||||
|
||||
Spatial Search Widget
|
||||
---------------------
|
||||
+++++++++++++++++++++
|
||||
|
||||
**Note**: this plugin requires CKAN 1.6 or higher.
|
||||
|
||||
|
@ -95,6 +175,56 @@ When the plugin is enabled, a map widget will be shown in the dataset search for
|
|||
where users can refine their searches by drawing an area of interest.
|
||||
|
||||
|
||||
|
||||
Solr configuration issues on legacy PostGIS backend
|
||||
+++++++++++++++++++++++++++++++++++++++++++++++++++
|
||||
|
||||
.. warning::
|
||||
|
||||
If you find any of the issues described in this section it is strongly
|
||||
suggested that you consider switching to one of the Solr based backends
|
||||
which are much more efficient. These notes are just kept for informative
|
||||
purposes.
|
||||
|
||||
|
||||
When using the spatial query functionality with this backend, an additional Solr/Lucene setting limits the number of datasets that can be searched spatially.

The setting is ``maxBooleanClauses`` in ``solrconfig.xml``, and its value is the maximum number of datasets that are spatially searchable. The default is ``1024`` and it can be increased to, say, ``16384``. For a single-core Solr set-up the file will probably be at ``/etc/solr/conf/solrconfig.xml``. For a multi-core set-up there will be several ``solrconfig.xml`` files a couple of levels below ``/etc/solr``; in that case *all* of the cores' ``solrconfig.xml`` files should use the new value.
|
||||
|
||||
Example::
|
||||
|
||||
<maxBooleanClauses>16384</maxBooleanClauses>
|
||||
|
||||
This setting is needed because the PostGIS spatial query results are fed into Solr as a Boolean expression, and the parser for it limits the number of clauses. If the search area contains more datasets than the limit (1024 by default), you will get this error::
|
||||
|
||||
Dataset search error: ('SOLR returned an error running query...
|
||||
|
||||
and in the SOLR logs you see::
|
||||
|
||||
too many boolean clauses
|
||||
...
|
||||
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses:
|
||||
maxClauseCount is set to 1024
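
The failure mode can be illustrated with a small sketch. It assumes, purely
for illustration, that the matched dataset ids are joined into a single
``OR`` filter; the exact query the extension sends may differ, and
``build_id_filter`` is a hypothetical helper.

```python
def build_id_filter(dataset_ids, max_boolean_clauses=1024):
    """Join dataset ids into one Boolean filter, one clause per id.

    Raises ValueError when the filter would exceed the clause limit,
    mimicking the point at which Solr rejects the query.
    """
    ids = list(dataset_ids)
    if len(ids) > max_boolean_clauses:
        raise ValueError(
            "too many boolean clauses: {0} > {1}".format(
                len(ids), max_boolean_clauses
            )
        )
    return "id:(" + " OR ".join(ids) + ")"


small_filter = build_id_filter(["a", "b", "c"])
```

With the default limit, a search area matching 1025 datasets would already
fail, while raising the limit to 16384 lets the same query through.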
|
||||
|
||||
|
||||
Legacy API
|
||||
++++++++++
|
||||
|
||||
The extension adds the following call to the CKAN search API, which returns
|
||||
datasets with an extent that intersects with the bounding box provided::
|
||||
|
||||
/api/2/search/dataset/geo?bbox={minx,miny,maxx,maxy}[&crs={srid}]
|
||||
|
||||
If the bounding box coordinates are not in the same projection as the one
|
||||
defined in the database, a CRS must be provided, in one of the following
|
||||
forms:
|
||||
|
||||
- urn:ogc:def:crs:EPSG::4326
|
||||
- EPSG:4326
|
||||
- 4326
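
The three accepted CRS forms all reduce to the same numeric SRID. A minimal
sketch of such a normalisation follows; the ``parse_srid`` function is
hypothetical and only covers the three forms listed above.

```python
import re

# Optional URN or "EPSG:" prefix followed by the numeric code.
_CRS_PATTERN = re.compile(r"^(?:urn:ogc:def:crs:EPSG::|EPSG:)?(\d+)$", re.IGNORECASE)


def parse_srid(crs):
    """Return the integer SRID for a CRS given in one of the three forms."""
    match = _CRS_PATTERN.match(crs.strip())
    if match is None:
        raise ValueError("Unsupported CRS: {0!r}".format(crs))
    return int(match.group(1))


srids = [parse_srid(form) for form in ("urn:ogc:def:crs:EPSG::4326", "EPSG:4326", "4326")]
```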
|
||||
|
||||
|
||||
|
||||
Dataset Extent Map
|
||||
------------------
|
||||
|
||||
|
@ -444,27 +574,6 @@ cswservice.rndlog_threshold is the percentage of interactions to store in the lo
|
|||
|
||||
|
||||
|
||||
SOLR Configuration
|
||||
------------------
|
||||
|
||||
If using Spatial Query functionality then there is an additional SOLR/Lucene setting that should be used to set the limit on number of datasets searchable with a spatial value.
|
||||
|
||||
The setting is ``maxBooleanClauses`` in the solrconfig.xml and the value is the number of datasets spatially searchable. The default is ``1024`` and this could be increased to say ``16384``. For a SOLR single core this will probably be at `/etc/solr/conf/solrconfig.xml`. For a multiple core set-up, there will be several solrconfig.xml files a couple of levels below `/etc/solr`. For that case, *ALL* of the cores' `solrconfig.xml` should have this setting at the new value.
|
||||
|
||||
Example::
|
||||
|
||||
<maxBooleanClauses>16384</maxBooleanClauses>
|
||||
|
||||
This setting is needed because PostGIS spatial query results are fed into SOLR using a Boolean expression, and the parser for that has a limit. So if your spatial area contains more than the limit (of which the default is 1024) then you will get this error::
|
||||
|
||||
Dataset search error: ('SOLR returned an error running query...
|
||||
|
||||
and in the SOLR logs you see::
|
||||
|
||||
too many boolean clauses
|
||||
...
|
||||
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses:
|
||||
maxClauseCount is set to 1024
|
||||
|
||||
|
||||
Troubleshooting
|
||||
|
|
|
@ -114,7 +114,7 @@ class SpatialHarvester(HarvesterBase):
|
|||
force_import = False
|
||||
|
||||
extent_template = Template('''
|
||||
{"type": "Polygon", "coordinates": [[[$xmin, $ymin], [$xmin, $ymax], [$xmax, $ymax], [$xmax, $ymin], [$xmin, $ymin]]]}
|
||||
{"type": "Polygon", "coordinates": [[[$xmin, $ymin], [$xmax, $ymin], [$xmax, $ymax], [$xmin, $ymax], [$xmin, $ymin]]]}
|
||||
''')
|
||||
|
||||
## IHarvester
|
||||
|
|
|
@ -6,6 +6,8 @@ from pylons import config
|
|||
from genshi.input import HTML
|
||||
from genshi.filters import Transformer
|
||||
|
||||
import shapely.geometry
|
||||
|
||||
from ckan import plugins as p
|
||||
|
||||
from ckan.lib.search import SearchError, PackageSearchQuery
|
||||
|
@ -122,6 +124,13 @@ class SpatialQuery(p.SingletonPlugin):
|
|||
|
||||
p.implements(p.IRoutes, inherit=True)
|
||||
p.implements(p.IPackageController, inherit=True)
|
||||
p.implements(p.IConfigurable, inherit=True)
|
||||
|
||||
search_backend = None
|
||||
|
||||
def configure(self, config):
|
||||
|
||||
self.search_backend = config.get('ckanext.spatial.search_backend', 'postgis')
|
||||
|
||||
def before_map(self, map):
|
||||
|
||||
|
@ -130,7 +139,67 @@ class SpatialQuery(p.SingletonPlugin):
|
|||
action='spatial_query')
|
||||
return map
|
||||
|
||||
def before_search(self,search_params):
|
||||
def before_index(self, pkg_dict):
|
||||
|
||||
if 'extras_spatial' in pkg_dict and self.search_backend in ('solr', 'solr-spatial-field'):
|
||||
try:
|
||||
geometry = json.loads(pkg_dict['extras_spatial'])
|
||||
except ValueError:
|
||||
log.error('Geometry not valid GeoJSON, not indexing')
|
||||
return pkg_dict
|
||||
|
||||
if self.search_backend == 'solr':
|
||||
# Only bbox supported for this backend
|
||||
if not (geometry['type'] == 'Polygon'
|
||||
and len(geometry['coordinates']) == 1
|
||||
and len(geometry['coordinates'][0]) == 5):
|
||||
log.error('Solr backend only supports bboxes, ignoring geometry {0}'.format(pkg_dict['extras_spatial']))
|
||||
return pkg_dict
|
||||
|
||||
coords = geometry['coordinates']
|
||||
pkg_dict['maxy'] = max(coords[0][2][1], coords[0][0][1])
|
||||
pkg_dict['miny'] = min(coords[0][2][1], coords[0][0][1])
|
||||
pkg_dict['maxx'] = max(coords[0][2][0], coords[0][0][0])
|
||||
pkg_dict['minx'] = min(coords[0][2][0], coords[0][0][0])
|
||||
pkg_dict['bbox_area'] = (pkg_dict['maxx'] - pkg_dict['minx']) * \
|
||||
(pkg_dict['maxy'] - pkg_dict['miny'])
|
||||
|
||||
elif self.search_backend == 'solr-spatial-field':
|
||||
wkt = None
|
||||
|
||||
# Check potential problems with bboxes
|
||||
if geometry['type'] == 'Polygon' \
|
||||
and len(geometry['coordinates']) == 1 \
|
||||
and len(geometry['coordinates'][0]) == 5:
|
||||
|
||||
# Check wrong bboxes (4 same points)
|
||||
xs = [p[0] for p in geometry['coordinates'][0]]
|
||||
ys = [p[1] for p in geometry['coordinates'][0]]
|
||||
|
||||
if xs.count(xs[0]) == 5 and ys.count(ys[0]) == 5:
|
||||
wkt = 'POINT({x} {y})'.format(x=xs[0], y=ys[0])
|
||||
else:
|
||||
# Check if coordinates are defined counter-clockwise,
|
||||
# otherwise we'll get wrong results from Solr
|
||||
lr = shapely.geometry.polygon.LinearRing(geometry['coordinates'][0])
|
||||
if not lr.is_ccw:
|
||||
lr.coords = list(lr.coords)[::-1]
|
||||
polygon = shapely.geometry.polygon.Polygon(lr)
|
||||
wkt = polygon.wkt
|
||||
|
||||
if not wkt:
|
||||
shape = shapely.geometry.asShape(geometry)
|
||||
if not shape.is_valid:
|
||||
log.error('Wrong geometry, not indexing')
|
||||
return pkg_dict
|
||||
wkt = shape.wkt
|
||||
|
||||
pkg_dict['spatial_geom'] = wkt
|
||||
|
||||
|
||||
return pkg_dict
|
||||
|
||||
def before_search(self, search_params):
|
||||
if 'extras' in search_params and 'ext_bbox' in search_params['extras'] \
|
||||
and search_params['extras']['ext_bbox']:
|
||||
|
||||
|
@ -138,6 +207,78 @@ class SpatialQuery(p.SingletonPlugin):
|
|||
if not bbox:
|
||||
raise SearchError('Wrong bounding box provided')
|
||||
|
||||
if self.search_backend == 'solr':
|
||||
search_params = self._params_for_solr_search(bbox, search_params)
|
||||
elif self.search_backend == 'solr-spatial-field':
|
||||
search_params = self._params_for_solr_spatial_field_search(bbox, search_params)
|
||||
elif self.search_backend == 'postgis':
|
||||
search_params = self._params_for_postgis_search(bbox, search_params)
|
||||
|
||||
return search_params
|
||||
|
||||
def _params_for_solr_search(self, bbox, search_params):
|
||||
'''
|
||||
This will add the following parameters to the query:
|
||||
|
||||
defType - edismax (We need to define EDisMax to use bf)
|
||||
bf - {function} A boost function to influence the score (thus
|
||||
influencing the sorting). The algorithm can be basically defined as:
|
||||
|
||||
2 * X / (Q + T)
|
||||
|
||||
Where X is the intersection between the query area Q and the
|
||||
target geometry T. It gives a ratio from 0 to 1 where 0 means
|
||||
no overlap at all and 1 a perfect fit
|
||||
|
||||
fq - Adds a filter that forces the value returned by the previous
|
||||
function to be between 0 and 1, effectively applying the
|
||||
spatial filter.
|
||||
|
||||
'''
|
||||
|
||||
variables = dict(
|
||||
x11=bbox['minx'],
|
||||
x12=bbox['maxx'],
|
||||
y11=bbox['miny'],
|
||||
y12=bbox['maxy'],
|
||||
x21='minx',
|
||||
x22='maxx',
|
||||
y21='miny',
|
||||
y22='maxy',
|
||||
area_search = abs(bbox['maxx'] - bbox['minx']) * abs(bbox['maxy'] - bbox['miny'])
|
||||
)
|
||||
|
||||
bf = '''div(
|
||||
mul(
|
||||
mul(max(0, sub(min({x12},{x22}) , max({x11},{x21}))),
|
||||
max(0, sub(min({y12},{y22}) , max({y11},{y21})))
|
||||
),
|
||||
2),
|
||||
add({area_search}, mul(sub({y22}, {y21}), sub({x22}, {x21})))
|
||||
)'''.format(**variables).replace('\n','').replace(' ','')
|
||||
|
||||
search_params['fq_list'] = ['{!frange incl=false l=0 u=1}%s' % bf]
|
||||
|
||||
search_params['bf'] = bf
|
||||
search_params['defType'] = 'edismax'
|
||||
|
||||
return search_params
|
||||
|
||||
def _params_for_solr_spatial_field_search(self, bbox, search_params):
|
||||
'''
|
||||
This will add an fq filter with the form:
|
||||
|
||||
+spatial_geom:"Intersects({minx} {miny} {maxx} {maxy})"
|
||||
|
||||
'''
|
||||
search_params['fq_list'] = search_params.get('fq_list', [])
|
||||
search_params['fq_list'].append('+spatial_geom:"Intersects({minx} {miny} {maxx} {maxy})"'
|
||||
.format(minx=bbox['minx'],miny=bbox['miny'],maxx=bbox['maxx'],maxy=bbox['maxy']))
|
||||
|
||||
return search_params
|
||||
|
||||
def _params_for_postgis_search(self, bbox, search_params):
|
||||
|
||||
# Note: This will be deprecated at some point in favour of the
|
||||
# Solr 4 spatial sorting capabilities
|
||||
if search_params['sort'] == 'spatial desc' and \
|
||||
|
|