Merge branch '15-solr-based-spatial-search' into release-v2.0

amercader 2013-04-12 10:56:21 +01:00
commit 51a2b20501
3 changed files with 339 additions and 89 deletions

@ -32,30 +32,15 @@ About the components
Spatial Search
--------------
To enable the spatial query you need to add the `spatial_query` plugin to your
ini file (See `Configuration`_). This plugin requires the `spatial_metadata`
plugin.
The extension adds the following call to the CKAN search API, which returns
datasets with an extent that intersects with the bounding box provided::
/api/2/search/dataset/geo?bbox={minx,miny,maxx,maxy}[&crs={srid}]
If the bounding box coordinates are not in the same projection as the one
defined in the database, a CRS must be provided, in one of the following
forms:
- urn:ogc:def:crs:EPSG::4326
- EPSG:4326
- 4326
As of CKAN 1.6, you can integrate your spatial query in the full CKAN
search, via the web interface (see the `Spatial Search Widget`_) or
via the `action API`__, e.g.::
The spatial extension allows datasets with spatial information to be indexed so
they can be filtered via a spatial query. Queries can be made either via the web
interface (see the `Spatial Search Widget`_) or via the `action API`__, e.g.::
POST http://localhost:5000/api/action/package_search
{
"q": "Pollution",
"facet": "true",
"facet.field": "country",
"extras": {
"ext_bbox": "-7.535093,49.208494,3.890688,57.372349"
}
@ -63,12 +48,107 @@ via the `action API`__, e.g.::
__ http://docs.ckan.org/en/latest/apiv3.html
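As a rough illustration of the request above, the following sketch builds the JSON body for a ``package_search`` call filtered by a bounding box. The helper name ``build_spatial_search_body`` is hypothetical and not part of the extension, and the ``minx,miny,maxx,maxy`` ordering of ``ext_bbox`` is assumed from the example values.

```python
import json


def build_spatial_search_body(query, minx, miny, maxx, maxy):
    """Return a JSON body for package_search filtered by a bounding box.

    Hypothetical helper, for illustration only.
    """
    if minx > maxx or miny > maxy:
        raise ValueError('Bounding box minimums must not exceed maximums')
    # ext_bbox is passed as a comma-separated string (assumed order:
    # minx,miny,maxx,maxy)
    bbox = ','.join(str(value) for value in (minx, miny, maxx, maxy))
    return json.dumps({'q': query, 'extras': {'ext_bbox': bbox}})


body = build_spatial_search_body(
    'Pollution', -7.535093, 49.208494, 3.890688, 57.372349)
```

The resulting string can be sent as the body of the POST request with any HTTP client.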
To enable the spatial query you need to add the ``spatial_query`` plugin to your
ini file (See `Configuration`_). This plugin requires the ``spatial_metadata``
plugin.
Several backends are supported for the spatial search. It is important
to understand their differences and the setup each one requires when choosing
which one to use. The backend is defined with the configuration option
``ckanext.spatial.search_backend``, e.g.::
ckanext.spatial.search_backend = solr
The following table summarizes the different spatial search backends:
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
| Backend | Solr Versions | Supported geometries | Sorting and relevance | Performance with large number of datasets |
+========================+===============+=====================================+===========================================================+===========================================+
| ``solr`` | 3.1 to 4.x | Bounding Box | Yes, spatial sorting combined with other query parameters | Good |
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
| ``solr-spatial-field`` | 4.x | Bounding Box, Point and Polygon (1) | Not implemented | Good |
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
| ``postgis`` | 1.3 to 4.x | Bounding Box | Partial, only spatial sorting supported (2) | Poor |
+------------------------+---------------+-------------------------------------+-----------------------------------------------------------+-------------------------------------------+
(1) Requires JTS
(2) Needs ``ckanext.spatial.use_postgis_sorting`` set to True
We recommend using the ``solr`` backend whenever possible. Here are more
details about the available options:
* ``solr`` (Recommended)
This option uses normal Solr fields to index the relevant bits of
information about the geometry and uses a boost function to
sort results by relevance, keeping any other non-spatial filtering. It only
supports bounding boxes, both for the geometries to be indexed and for the input
query shape. It requires the `EDisMax`_ query parser, so it will only work on
Solr 3.1 or later (we recommend using Solr 4.x).
You will need to add the following fields to your Solr schema file to enable it::
<fields>
<!-- ... -->
<field name="bbox_area" type="float" indexed="true" stored="true" />
<field name="maxx" type="float" indexed="true" stored="true" />
<field name="maxy" type="float" indexed="true" stored="true" />
<field name="minx" type="float" indexed="true" stored="true" />
<field name="miny" type="float" indexed="true" stored="true" />
</fields>
* ``solr-spatial-field``
This option uses the `spatial field <http://wiki.apache.org/solr/SolrAdaptersForLuceneSpatial4>`_
introduced in Solr 4, which allows points, rectangles and more
complex geometries to be indexed (complex geometries require `JTS`_, check the
documentation). Sorting has not yet been implemented; users who need it
will have to modify the query using the ``before_search`` extension point.
You will need to add the following field type and field to your Solr schema
file to enable it (Check the Solr documentation for more information on
the different parameters, note that you don't need ``spatialContextFactory`` if
you are not using JTS)::
<types>
<!-- ... -->
<fieldType name="location_rpt" class="solr.SpatialRecursivePrefixTreeFieldType"
spatialContextFactory="com.spatial4j.core.context.jts.JtsSpatialContextFactory"
distErrPct="0.025"
maxDistErr="0.000009"
units="degrees"
/>
</types>
<fields>
<!-- ... -->
<field name="spatial_geom" type="location_rpt" indexed="true" stored="true" multiValued="true" />
</fields>
* ``postgis``
This is the original implementation of the spatial search. It does not
require any change in the Solr schema and can run on Solr 1.x, but it is
not as efficient as the previous ones. The bounding box based
query is performed in PostGIS first, and the ids of the matched datasets
are added as a filter to the Solr request. This, apart from being much
less efficient, can lead to issues on Solr due to the size of the requests (see
`Solr configuration issues on legacy PostGIS backend`_). There is support
for spatial ranking on this backend (setting
``ckanext.spatial.use_postgis_sorting`` to True in the ini file), but it
cannot be combined with any other filtering.
.. _edismax: http://wiki.apache.org/solr/ExtendedDisMax
.. _JTS: http://www.vividsolutions.com/jts/JTSHome.htm
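The relevance function used by the ``solr`` backend is built in the plugin code as a Solr function query that computes ``2 * X / (Q + T)``, where ``X`` is the area of the intersection between the query box ``Q`` and the dataset box ``T``. The plain Python sketch below mirrors that arithmetic so the ranking is easier to reason about. The function name ``overlap_score`` is an assumption made for illustration and does not exist in the extension.

```python
def overlap_score(query, target):
    """Return 2 * X / (Q + T) for two (minx, miny, maxx, maxy) boxes.

    Illustrative only: the extension evaluates the equivalent expression
    inside Solr, not in Python.
    """
    q_minx, q_miny, q_maxx, q_maxy = query
    t_minx, t_miny, t_maxx, t_maxy = target

    # Width and height of the intersection, clamped to 0 when disjoint
    width = max(0.0, min(q_maxx, t_maxx) - max(q_minx, t_minx))
    height = max(0.0, min(q_maxy, t_maxy) - max(q_miny, t_miny))
    intersection = width * height

    query_area = (q_maxx - q_minx) * (q_maxy - q_miny)
    target_area = (t_maxx - t_minx) * (t_maxy - t_miny)
    total = query_area + target_area
    if total == 0:
        # Two degenerate boxes: treat as no overlap
        return 0.0
    return 2.0 * intersection / total
```

Identical boxes score 1, disjoint boxes score 0, and a dataset covering half of an equally sized query box scores 0.5.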
Geo-Indexing your datasets
++++++++++++++++++++++++++
In order to make a dataset queryable by location, a special extra must
be defined, with its key named 'spatial'. The value must be a valid GeoJSON_
geometry, for example::
Regardless of the backend that you are using, in order to make a dataset
queryable by location, a special extra must be defined, with its key named
'spatial'. The value must be a valid GeoJSON_ geometry, for example::
{"type":"Polygon","coordinates":[[[2.05827, 49.8625],[2.05827, 55.7447], [-6.41736, 55.7447], [-6.41736, 49.8625], [2.05827, 49.8625]]]}
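For datasets whose extent is a simple bounding box, the value of the ``spatial`` extra could be generated along these lines. This is only a sketch: ``bbox_to_geojson`` and ``geojson_bounds`` are hypothetical helper names, and the ring is written counter-clockwise on the assumption that this matches what the Solr-based backends expect.

```python
import json


def bbox_to_geojson(minx, miny, maxx, maxy):
    """Serialize a bounding box as a GeoJSON Polygon string."""
    # Closed ring with five points, listed counter-clockwise
    ring = [[minx, miny], [maxx, miny], [maxx, maxy],
            [minx, maxy], [minx, miny]]
    return json.dumps({'type': 'Polygon', 'coordinates': [ring]})


def geojson_bounds(value):
    """Return (minx, miny, maxx, maxy) of the outer ring of a Polygon."""
    geometry = json.loads(value)
    ring = geometry['coordinates'][0]
    xs = [point[0] for point in ring]
    ys = [point[1] for point in ring]
    return min(xs), min(ys), max(xs), max(ys)


spatial = bbox_to_geojson(-6.41736, 49.8625, 2.05827, 55.7447)
```

The string in ``spatial`` is what would be stored as the value of the extra.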
@ -83,7 +163,7 @@ the information stored in the extra with the geometry table.
Spatial Search Widget
---------------------
+++++++++++++++++++++
**Note**: this plugin requires CKAN 1.6 or higher.
@ -95,6 +175,56 @@ When the plugin is enabled, a map widget will be shown in the dataset search for
where users can refine their searches by drawing an area of interest.
Solr configuration issues on legacy PostGIS backend
+++++++++++++++++++++++++++++++++++++++++++++++++++
.. warning::
If you find any of the issues described in this section it is strongly
suggested that you consider switching to one of the Solr based backends
which are much more efficient. These notes are just kept for informative
purposes.
If you are using the Spatial Query functionality, there is an additional SOLR/Lucene setting that limits the number of datasets searchable with a spatial value.
The setting is ``maxBooleanClauses`` in solrconfig.xml, and its value is the number of datasets that are spatially searchable. The default is ``1024`` and this could be increased to, say, ``16384``. For a single-core SOLR set-up this will probably be at `/etc/solr/conf/solrconfig.xml`. For a multiple-core set-up, there will be several solrconfig.xml files a couple of levels below `/etc/solr`. In that case, *all* of the cores' `solrconfig.xml` files should have this setting at the new value.
Example::
<maxBooleanClauses>16384</maxBooleanClauses>
This setting is needed because PostGIS spatial query results are fed into SOLR using a Boolean expression, and the parser for that expression has a limit. If your search area contains more datasets than the limit (1024 by default), you will get this error::
Dataset search error: ('SOLR returned an error running query...
and in the SOLR logs you see::
too many boolean clauses
...
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024
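To see why the limit is reached, the sketch below mimics how the PostGIS backend combines the matched dataset ids into the Solr query, and checks the number of ids against ``maxBooleanClauses``. The function names are hypothetical, and comparing only the number of ids with the limit is a simplification, since the exact clause count depends on how Solr parses the rest of the query.

```python
def build_id_filter(query, package_ids):
    """Combine a text query with an OR group of dataset ids."""
    q = query.strip() or '""'
    ids = ' OR '.join('id:%s' % package_id for package_id in package_ids)
    return '%s AND (%s)' % (q, ids)


def exceeds_clause_limit(package_ids, max_boolean_clauses=1024):
    """Rough check: each matched id adds one clause to the OR group."""
    return len(package_ids) > max_boolean_clauses


matched_ids = ['dataset-%d' % number for number in range(2000)]
too_many = exceeds_clause_limit(matched_ids)
```

With 2000 matched datasets and the default limit, ``too_many`` is true, which corresponds to the situation where Solr reports ``too many boolean clauses``.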
Legacy API
++++++++++
The extension adds the following call to the CKAN search API, which returns
datasets with an extent that intersects with the bounding box provided::
/api/2/search/dataset/geo?bbox={minx,miny,maxx,maxy}[&crs={srid}]
If the bounding box coordinates are not in the same projection as the one
defined in the database, a CRS must be provided, in one of the following
forms:
- urn:ogc:def:crs:EPSG::4326
- EPSG:4326
- 4326
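The three accepted CRS forms all carry the same numeric SRID. A minimal sketch of how such a value could be normalized is shown below; ``parse_srid`` is a hypothetical helper, not the function the extension uses.

```python
def parse_srid(crs):
    """Extract the numeric SRID from one of the documented CRS forms."""
    value = str(crs).strip()
    # Longest prefix first, so the URN form is not mistaken for 'EPSG:'
    prefixes = ('urn:ogc:def:crs:EPSG::', 'EPSG:')
    for prefix in prefixes:
        if value.upper().startswith(prefix.upper()):
            value = value[len(prefix):]
            break
    if not value.isdigit():
        raise ValueError('Unsupported CRS: %r' % (crs,))
    return int(value)
```

All of ``urn:ogc:def:crs:EPSG::4326``, ``EPSG:4326`` and ``4326`` resolve to the integer 4326 with this helper.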
Dataset Extent Map
------------------
@ -444,27 +574,6 @@ cswservice.rndlog_threshold is the percentage of interactions to store in the lo
SOLR Configuration
------------------
If using Spatial Query functionality then there is an additional SOLR/Lucene setting that should be used to set the limit on number of datasets searchable with a spatial value.
The setting is ``maxBooleanClauses`` in the solrconfig.xml and the value is the number of datasets spatially searchable. The default is ``1024`` and this could be increased to say ``16384``. For a SOLR single core this will probably be at `/etc/solr/conf/solrconfig.xml`. For a multiple core set-up, there will be several solrconfig.xml files a couple of levels below `/etc/solr`. For that case, *ALL* of the cores' `solrconfig.xml` should have this setting at the new value.
Example::
<maxBooleanClauses>16384</maxBooleanClauses>
This setting is needed because PostGIS spatial query results are fed into SOLR using a Boolean expression, and the parser for that has a limit. So if your spatial area contains more than the limit (of which the default is 1024) then you will get this error::
Dataset search error: ('SOLR returned an error running query...
and in the SOLR logs you see::
too many boolean clauses
...
Caused by: org.apache.lucene.search.BooleanQuery$TooManyClauses:
maxClauseCount is set to 1024
Troubleshooting

@ -114,7 +114,7 @@ class SpatialHarvester(HarvesterBase):
force_import = False
extent_template = Template('''
{"type": "Polygon", "coordinates": [[[$xmin, $ymin], [$xmin, $ymax], [$xmax, $ymax], [$xmax, $ymin], [$xmin, $ymin]]]}
{"type": "Polygon", "coordinates": [[[$xmin, $ymin], [$xmax, $ymin], [$xmax, $ymax], [$xmin, $ymax], [$xmin, $ymin]]]}
''')
## IHarvester

@ -6,6 +6,8 @@ from pylons import config
from genshi.input import HTML
from genshi.filters import Transformer
import shapely.geometry
from ckan import plugins as p
from ckan.lib.search import SearchError, PackageSearchQuery
@ -122,6 +124,13 @@ class SpatialQuery(p.SingletonPlugin):
p.implements(p.IRoutes, inherit=True)
p.implements(p.IPackageController, inherit=True)
p.implements(p.IConfigurable, inherit=True)
search_backend = None
def configure(self, config):
self.search_backend = config.get('ckanext.spatial.search_backend', 'postgis')
def before_map(self, map):
@ -130,7 +139,67 @@ class SpatialQuery(p.SingletonPlugin):
action='spatial_query')
return map
def before_search(self,search_params):
def before_index(self, pkg_dict):
if 'extras_spatial' in pkg_dict and self.search_backend in ('solr', 'solr-spatial-field'):
try:
geometry = json.loads(pkg_dict['extras_spatial'])
except ValueError:
log.error('Geometry not valid GeoJSON, not indexing')
return pkg_dict
if self.search_backend == 'solr':
# Only bbox supported for this backend
if not (geometry['type'] == 'Polygon'
and len(geometry['coordinates']) == 1
and len(geometry['coordinates'][0]) == 5):
log.error('Solr backend only supports bboxes, ignoring geometry {0}'.format(pkg_dict['extras_spatial']))
return pkg_dict
coords = geometry['coordinates']
pkg_dict['maxy'] = max(coords[0][2][1], coords[0][0][1])
pkg_dict['miny'] = min(coords[0][2][1], coords[0][0][1])
pkg_dict['maxx'] = max(coords[0][2][0], coords[0][0][0])
pkg_dict['minx'] = min(coords[0][2][0], coords[0][0][0])
pkg_dict['bbox_area'] = (pkg_dict['maxx'] - pkg_dict['minx']) * \
(pkg_dict['maxy'] - pkg_dict['miny'])
elif self.search_backend == 'solr-spatial-field':
wkt = None
# Check potential problems with bboxes
if geometry['type'] == 'Polygon' \
and len(geometry['coordinates']) == 1 \
and len(geometry['coordinates'][0]) == 5:
# Check wrong bboxes (all five ring points identical)
xs = [point[0] for point in geometry['coordinates'][0]]
ys = [point[1] for point in geometry['coordinates'][0]]
if xs.count(xs[0]) == 5 and ys.count(ys[0]) == 5:
wkt = 'POINT({x} {y})'.format(x=xs[0], y=ys[0])
else:
# Check if coordinates are defined counter-clockwise,
# otherwise we'll get wrong results from Solr
lr = shapely.geometry.polygon.LinearRing(geometry['coordinates'][0])
if not lr.is_ccw:
lr.coords = list(lr.coords)[::-1]
polygon = shapely.geometry.polygon.Polygon(lr)
wkt = polygon.wkt
if not wkt:
shape = shapely.geometry.asShape(geometry)
if not shape.is_valid:
log.error('Wrong geometry, not indexing')
return pkg_dict
wkt = shape.wkt
pkg_dict['spatial_geom'] = wkt
return pkg_dict
def before_search(self, search_params):
if 'extras' in search_params and 'ext_bbox' in search_params['extras'] \
and search_params['extras']['ext_bbox']:
@ -138,53 +207,125 @@ class SpatialQuery(p.SingletonPlugin):
if not bbox:
raise SearchError('Wrong bounding box provided')
# Note: This will be deprecated at some point in favour of the
# Solr 4 spatial sorting capabilities
if search_params['sort'] == 'spatial desc' and \
p.toolkit.asbool(config.get('ckanext.spatial.use_postgis_sorting', 'False')):
if search_params['q'] or search_params['fq']:
raise SearchError('Spatial ranking cannot be mixed with other search parameters')
# ...because it is too inefficient to use SOLR to filter
# results and return the entire set to this class and
# after_search do the sorting and paging.
extents = bbox_query_ordered(bbox)
are_no_results = not extents
search_params['extras']['ext_rows'] = search_params['rows']
search_params['extras']['ext_start'] = search_params['start']
# this SOLR query needs to return no actual results since
# they are in the wrong order anyway. We just need this SOLR
# query to get the count and facet counts.
rows = 0
search_params['sort'] = None # SOLR should not sort.
# Store the rankings of the results for this page, so for
# after_search to construct the correctly sorted results
rows = search_params['extras']['ext_rows'] = search_params['rows']
start = search_params['extras']['ext_start'] = search_params['start']
search_params['extras']['ext_spatial'] = [
(extent.package_id, extent.spatial_ranking) \
for extent in extents[start:start+rows]]
else:
extents = bbox_query(bbox)
are_no_results = extents.count() == 0
if self.search_backend == 'solr':
search_params = self._params_for_solr_search(bbox, search_params)
elif self.search_backend == 'solr-spatial-field':
search_params = self._params_for_solr_spatial_field_search(bbox, search_params)
elif self.search_backend == 'postgis':
search_params = self._params_for_postgis_search(bbox, search_params)
if are_no_results:
# We don't need to perform the search
search_params['abort_search'] = True
else:
# We'll perform the existing search but also filtering by the ids
# of datasets within the bbox
bbox_query_ids = [extent.package_id for extent in extents]
return search_params
q = search_params.get('q','').strip() or '""'
new_q = '%s AND ' % q if q else ''
new_q += '(%s)' % ' OR '.join(['id:%s' % id for id in bbox_query_ids])
def _params_for_solr_search(self, bbox, search_params):
'''
This will add the following parameters to the query:
search_params['q'] = new_q
defType - edismax (We need to define EDisMax to use bf)
bf - {function} A boost function to influence the score (thus
influencing the sorting). The algorithm can be basically defined as:
2 * X / (Q + T)
Where X is the intersection between the query area Q and the
target geometry T. It gives a ratio from 0 to 1 where 0 means
no overlap at all and 1 a perfect fit
fq - Adds a filter that forces the value returned by the previous
function to be greater than 0 and at most 1, effectively applying the
spatial filter.
'''
variables = dict(
x11=bbox['minx'],
x12=bbox['maxx'],
y11=bbox['miny'],
y12=bbox['maxy'],
x21='minx',
x22='maxx',
y21='miny',
y22='maxy',
area_search = abs(bbox['maxx'] - bbox['minx']) * abs(bbox['maxy'] - bbox['miny'])
)
bf = '''div(
mul(
mul(max(0, sub(min({x12},{x22}) , max({x11},{x21}))),
max(0, sub(min({y12},{y22}) , max({y11},{y21})))
),
2),
add({area_search}, mul(sub({y22}, {y21}), sub({x22}, {x21})))
)'''.format(**variables).replace('\n','').replace(' ','')
search_params['fq_list'] = search_params.get('fq_list', [])
search_params['fq_list'].append('{!frange incl=false l=0 u=1}%s' % bf)
search_params['bf'] = bf
search_params['defType'] = 'edismax'
return search_params
def _params_for_solr_spatial_field_search(self, bbox, search_params):
'''
This will add an fq filter with the form:
+spatial_geom:"Intersects({minx} {miny} {maxx} {maxy})"
'''
search_params['fq_list'] = search_params.get('fq_list', [])
search_params['fq_list'].append('+spatial_geom:"Intersects({minx} {miny} {maxx} {maxy})"'
.format(minx=bbox['minx'],miny=bbox['miny'],maxx=bbox['maxx'],maxy=bbox['maxy']))
return search_params
def _params_for_postgis_search(self, bbox, search_params):
# Note: This will be deprecated at some point in favour of the
# Solr 4 spatial sorting capabilities
if search_params['sort'] == 'spatial desc' and \
p.toolkit.asbool(config.get('ckanext.spatial.use_postgis_sorting', 'False')):
if search_params['q'] or search_params['fq']:
raise SearchError('Spatial ranking cannot be mixed with other search parameters')
# ...because it is too inefficient to use SOLR to filter
# results and return the entire set to this class and
# after_search do the sorting and paging.
extents = bbox_query_ordered(bbox)
are_no_results = not extents
search_params['extras']['ext_rows'] = search_params['rows']
search_params['extras']['ext_start'] = search_params['start']
# this SOLR query needs to return no actual results since
# they are in the wrong order anyway. We just need this SOLR
# query to get the count and facet counts.
rows = 0
search_params['sort'] = None # SOLR should not sort.
# Store the rankings of the results for this page, so for
# after_search to construct the correctly sorted results
rows = search_params['extras']['ext_rows'] = search_params['rows']
start = search_params['extras']['ext_start'] = search_params['start']
search_params['extras']['ext_spatial'] = [
(extent.package_id, extent.spatial_ranking) \
for extent in extents[start:start+rows]]
else:
extents = bbox_query(bbox)
are_no_results = extents.count() == 0
if are_no_results:
# We don't need to perform the search
search_params['abort_search'] = True
else:
# We'll perform the existing search but also filtering by the ids
# of datasets within the bbox
bbox_query_ids = [extent.package_id for extent in extents]
q = search_params.get('q','').strip() or '""'
new_q = '%s AND ' % q if q else ''
new_q += '(%s)' % ' OR '.join(['id:%s' % id for id in bbox_query_ids])
search_params['q'] = new_q
return search_params
def after_search(self, search_results, search_params):
# Note: This will be deprecated at some point in favour of the
# Solr 4 spatial sorting capabilities
@ -267,12 +408,12 @@ class HarvestMetadataApi(p.SingletonPlugin):
'''
Harvest Metadata API
(previously called "InspireApi")
A way for a user to view the harvested metadata XML, either as a raw file or
styled to view in a web browser.
'''
p.implements(p.IRoutes)
def before_map(self, route_map):
controller = "ckanext.spatial.controllers.api:HarvestMetadataApiController"