CSW support

The extension supports CSW (Catalogue Service for the Web), a specification from the Open Geospatial Consortium for exposing geospatial catalogues over the web.

This support consists of:

  • The ability to import records from CSW servers with the CSW harvester. See the harvesters documentation for more details.
  • Integration with pycsw to provide a fully compliant CSW interface for harvested records. This integration is described in the following sections.

ckan-pycsw

The spatial extension offers the ckan-pycsw command, which lets you expose the spatial datasets harvested from other sources through a CSW interface. This is powered by pycsw, which fully implements the OGC CSW specification.

How it works

The current implementation is based on CKAN and pycsw being loosely integrated via the CKAN API. pycsw will generally be installed on the same server as CKAN (although it can also run on a separate one), and the synchronization command is run regularly to keep the records in the pycsw repository up to date. The command uses the CKAN API to get the identifiers of all harvested datasets and then decides which ones need to be created, updated or deleted in the pycsw repository. For those that need to be created or updated, the original harvested spatial document (i.e. ISO 19139) is requested from CKAN and then imported using pycsw's internal functions:

Harvested
datasets
   +
   |
   v
+--------+                 +---------+
|        |    CKAN API     |         |
|  CKAN  | +------------>  |  pycsw  | +------> CSW
|        |                 |         |
+--------+                 +---------+

Remember, only datasets that were harvested with the harvesters can currently be exposed via pycsw.
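
The decision step described above can be pictured as a set comparison. The following is a minimal sketch only, not the actual ckan-pycsw code: the function name plan_sync and the idea of comparing a per-dataset state marker (such as a modification date) are assumptions made for illustration.

```python
def plan_sync(ckan_records, pycsw_records):
    """Decide what to create, update or delete in the pycsw repository.

    Both arguments are dicts mapping a dataset identifier to a marker of
    its current state (for example a modification date). The marker is a
    hypothetical choice made for this sketch.
    """
    ckan_ids = set(ckan_records)
    pycsw_ids = set(pycsw_records)

    # Harvested in CKAN but not yet known to pycsw.
    to_create = ckan_ids - pycsw_ids
    # Still in pycsw but no longer present in CKAN.
    to_delete = pycsw_ids - ckan_ids
    # Present on both sides, but the state marker differs.
    to_update = {
        dataset_id
        for dataset_id in ckan_ids & pycsw_ids
        if ckan_records[dataset_id] != pycsw_records[dataset_id]
    }
    return to_create, to_update, to_delete


if __name__ == "__main__":
    ckan = {"a": "2014-01-02", "b": "2014-01-01", "c": "2014-01-05"}
    pycsw = {"b": "2014-01-01", "c": "2014-01-03", "d": "2013-12-31"}
    print(plan_sync(ckan, pycsw))
```

Only the identifiers in the first two sets would need the original spatial document to be requested from CKAN; deletions need the identifier alone.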

All necessary tasks are done with the ckan-pycsw command. To get more details of its usage, run the following:

cd /usr/lib/ckan/default/src/ckanext-spatial
python bin/ckan_pycsw.py --help

Setup

  1. Install pycsw. There are several options for this depending on your server setup; check the pycsw documentation.

    Note

    CKAN integration requires at least pycsw version 1.8.0. In general, use the latest stable version.

    The following instructions assume that CKAN was installed from a package and that you run the commands as root, but the steps are the same if you are setting it up in another location:

    cd /usr/lib/ckan/default/src
    source ../bin/activate
    
    # From now on the virtualenv should be activated
    
    git clone https://github.com/geopython/pycsw.git
    cd pycsw
    # always use the latest stable version
    git checkout 1.10.4
    pip install -e .
    python setup.py build
    python setup.py install
  2. Create a database for pycsw. In theory you can use the same database that CKAN is using, but if you want to keep them separated, use the following command to create a new one (we'll use the same default user though):

    sudo -u postgres createdb -O ckan_default pycsw -E utf-8

    It is strongly recommended that you install PostGIS in the pycsw database, so that pycsw can use its spatial functions.

  3. Configure pycsw. An example configuration file is included in the source:

    cp default-sample.cfg default.cfg

    To keep things tidy we will create a symlink to this file in the CKAN configuration directory:

    ln -s /usr/lib/ckan/default/src/pycsw/default.cfg /etc/ckan/default/pycsw.cfg

    Open the file with your favourite editor. The main settings you should tweak are server.home and repository.database:

    [server]
    home=/usr/lib/ckan/default/src/pycsw
    ...
    [repository]
    database=postgresql://ckan_default:pass@localhost/pycsw

    The rest of the options are described in the pycsw configuration documentation.

  4. Set up the pycsw table. This is done with the ckan-pycsw script (remember to have the virtualenv activated when running it):

    cd /usr/lib/ckan/default/src/ckanext-spatial
    python bin/ckan_pycsw.py setup -p /etc/ckan/default/pycsw.cfg

    At this point you should be ready to run pycsw with the wsgi script that it includes:

    cd /usr/lib/ckan/default/src/pycsw
    python csw.wsgi

    This will run pycsw at http://localhost:8000. Visiting the following URL should return the Capabilities document:

    http://localhost:8000/?service=CSW&version=2.0.2&request=GetCapabilities

  5. Load the CKAN datasets into pycsw. Again, we will use the ckan-pycsw command for this:

    cd /usr/lib/ckan/default/src/ckanext-spatial
    python bin/ckan_pycsw.py load -p /etc/ckan/default/pycsw.cfg

    When the loading is finished, check that results are returned when visiting this link:

    http://localhost:8000/?request=GetRecords&service=CSW&version=2.0.2&resultType=results&outputSchema=http://www.isotc211.org/2005/gmd&typeNames=csw:Record&elementSetName=summary

    The numberOfRecordsMatched value should match the number of harvested datasets in CKAN (minus import errors). If you run the command again, new or updated datasets will be synchronized, and datasets deleted from CKAN will be removed from pycsw as well.
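
The two checks in steps 4 and 5 can also be scripted. The sketch below builds the same kind of request URLs and reads numberOfRecordsMatched from a GetRecords response. The helper names csw_url and records_matched are hypothetical, the base URL is the development one used above, and the sketch assumes the response is a standard CSW 2.0.2 GetRecordsResponse document.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace of CSW 2.0.2 responses.
CSW_NS = "http://www.opengis.net/cat/csw/2.0.2"


def csw_url(base_url, request, **extra):
    """Build a CSW 2.0.2 key-value-pair request URL."""
    params = {"service": "CSW", "version": "2.0.2", "request": request}
    params.update(extra)
    return base_url + "?" + urlencode(params)


def records_matched(response_xml):
    """Return numberOfRecordsMatched from a GetRecords response string."""
    root = ET.fromstring(response_xml)
    results = root.find("{%s}SearchResults" % CSW_NS)
    if results is None:
        raise ValueError("no csw:SearchResults element in the response")
    return int(results.get("numberOfRecordsMatched"))


if __name__ == "__main__":
    base = "http://localhost:8000/"
    print(csw_url(base, "GetCapabilities"))
    print(csw_url(
        base,
        "GetRecords",
        resultType="results",
        outputSchema="http://www.isotc211.org/2005/gmd",
        typeNames="csw:Record",
        elementSetName="summary",
    ))
```

The body of the response (fetched, for instance, with urllib.request.urlopen) can be passed to records_matched and the result compared with the number of harvested datasets in CKAN.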

Setting Service Metadata Keywords

The CSW standard allows administrators to set CSW service metadata. These values can be set in the metadata:main section of the pycsw configuration. If you would like the CSW service metadata keywords to reflect the CKAN tags, run the following convenience command:

python ckan_pycsw.py set_keywords -p /etc/ckan/default/pycsw.cfg

Note that you must have privileges to write to the pycsw configuration file.
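
Before running the command, it may help to confirm that the configuration file is writable and to see what the metadata:main section currently contains. This is a small sketch under assumptions: the function name inspect_pycsw_config is hypothetical, and it only reads the file; it does not reproduce what set_keywords does.

```python
import configparser
import os


def inspect_pycsw_config(path):
    """Report write access and the current metadata:main options."""
    # Interpolation is disabled so that '%' characters in values
    # (for example in database URLs) are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    with open(path) as config_file:
        parser.read_file(config_file)

    section = "metadata:main"
    options = dict(parser.items(section)) if parser.has_section(section) else {}
    return {
        "writable": os.access(path, os.W_OK),
        "metadata_main": options,
    }


if __name__ == "__main__":
    # Path taken from the setup steps above.
    print(inspect_pycsw_config("/etc/ckan/default/pycsw.cfg"))
```

If "writable" is False, the command would need to be run as a user that can modify the file.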

Running it on a production site

On a production site you probably want to run the load command regularly to keep CKAN and pycsw in sync, and serve pycsw with Apache + mod_wsgi like CKAN.

  • To run the load command regularly you can set up a cron job. Run crontab -e and add the following lines:

    # m h  dom mon dow   command
    0 *  *   *   *     /usr/lib/ckan/default/bin/python /usr/lib/ckan/default/src/ckanext-spatial/bin/ckan_pycsw.py load -p /etc/ckan/default/pycsw.cfg

    This particular example runs the load command every hour. You can of course change the schedule, for instance running the command less often on very large instances. The Wikipedia page on cron has a good overview of the crontab syntax.

  • To run pycsw under Apache, check the pycsw installation documentation or follow these quick steps (they assume the paths used in the previous steps):

    • Edit /etc/apache2/sites-available/ckan_default and add the following line just before the existing WSGIScriptAlias directive:

      WSGIScriptAlias /csw /usr/lib/ckan/default/src/pycsw/csw.wsgi
    • Edit the /usr/lib/ckan/default/src/pycsw/csw.wsgi file and add these two lines just after the imports at the top of the file:

      activate_this = os.path.join('/usr/lib/ckan/default/bin/activate_this.py')
      execfile(activate_this, {"__file__":activate_this})

      These lines activate the virtualenv in which pycsw was installed.

    • Restart Apache:

      service apache2 restart

      pycsw should now be accessible at http://localhost/csw
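
The execfile call used in csw.wsgi above exists only in Python 2. If the WSGI script runs under Python 3, the same effect can be obtained with open, compile and exec. This is a sketch, not part of pycsw: the helper name activate_virtualenv is hypothetical, and it assumes the environment was created with the virtualenv tool, which provides activate_this.py (environments created with python -m venv do not).

```python
def activate_virtualenv(activate_this):
    """Run a virtualenv's activate_this.py in the current process.

    Python 3 replacement for:
        execfile(activate_this, {"__file__": activate_this})
    """
    with open(activate_this) as script:
        code = compile(script.read(), activate_this, "exec")
    # The script expects __file__ to point at itself so that it can
    # locate the environment's directories.
    exec(code, {"__file__": activate_this})


if __name__ == "__main__":
    # Path taken from the steps above.
    activate_virtualenv("/usr/lib/ckan/default/bin/activate_this.py")
```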