=============================================
ckanext-harvest - Remote harvesting extension
=============================================

This extension provides a common harvesting framework for ckan extensions
and adds a CLI and a WUI to CKAN to manage harvesting sources and jobs.

Installation
============

1. The harvest extension can use two different backends. Choose whichever
   suits your needs:

   * `RabbitMQ <http://www.rabbitmq.com/>`_: To install it, run::

       sudo apt-get install rabbitmq-server

   * `Redis <http://redis.io/>`_: To install it, run::

       sudo apt-get install redis-server

2. Install the extension into your Python environment.

   *Note:* Depending on the CKAN core version you are targeting, you will need
   to use a different branch of the extension.

   For a production site, use the `stable` branch, unless there is a specific
   branch that targets the CKAN core version that you are using.

   To target the latest CKAN core release::

     (pyenv) $ pip install -e git+https://github.com/okfn/ckanext-harvest.git@stable#egg=ckanext-harvest

   To target an old release (if a release branch exists, otherwise use `stable`)::

     (pyenv) $ pip install -e git+https://github.com/okfn/ckanext-harvest.git@release-v1.8#egg=ckanext-harvest

   To target CKAN `master`, use the extension `master` branch (i.e. no branch defined)::

     (pyenv) $ pip install -e git+https://github.com/okfn/ckanext-harvest.git#egg=ckanext-harvest

3. Install the rest of the Python modules required by the extension::

     (pyenv) $ pip install -r pip-requirements.txt

4. Make sure the CKAN configuration ini file contains the harvest main plugin, as
   well as the harvester for CKAN instances if you need it (included with the extension)::

     ckan.plugins = harvest ckan_harvester

5. Also define the backend that you are using with the ``ckan.harvest.mq.type``
   option (it defaults to ``rabbitmq``)::

     ckan.harvest.mq.type = redis

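
As a quick sanity check of the two ini settings from steps 4 and 5, the following sketch reads them back with Python's standard ``configparser``. It is only an illustration: the ``INI_TEXT`` excerpt and the ``harvest_settings`` helper are hypothetical, and it assumes the options live in the ``[app:main]`` section of your CKAN ini file.

```python
import configparser

# Hypothetical excerpt of a CKAN ini file; a real file holds many more options.
INI_TEXT = """
[app:main]
ckan.plugins = harvest ckan_harvester
ckan.harvest.mq.type = redis
"""


def harvest_settings(ini_text):
    """Return (plugins, backend) as declared in the [app:main] section."""
    # RawConfigParser avoids interpolation problems with values like %(here)s.
    parser = configparser.RawConfigParser()
    parser.read_string(ini_text)
    section = parser["app:main"]
    plugins = section.get("ckan.plugins", "").split()
    # The extension documents "rabbitmq" as the default backend.
    backend = section.get("ckan.harvest.mq.type", "rabbitmq")
    return plugins, backend


plugins, backend = harvest_settings(INI_TEXT)
print("harvest" in plugins, backend)
```

If ``harvest`` is missing from the plugin list, the harvester commands will not be available; if the backend option is absent, the sketch falls back to ``rabbitmq``, mirroring the documented default.
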

Configuration
=============

Run the following command to create the necessary tables in the database::

    paster --plugin=ckanext-harvest harvester initdb --config=mysite.ini

After installation, the harvest source listing should be available under /harvest, e.g.:

    http://localhost:5000/harvest

Command line interface
======================

The following operations can be run from the command line using the
``paster --plugin=ckanext-harvest harvester`` command::

      harvester initdb
        - creates the necessary tables in the database

      harvester source {name} {url} {type} [{title}] [{active}] [{owner_org}] [{frequency}] [{config}]
        - creates a new harvest source

      harvester rmsource {id}
        - removes (inactivates) a harvest source

      harvester sources [all]
        - lists harvest sources
          If 'all' is defined, it also shows the inactive sources

      harvester job {source-id}
        - creates a new harvest job

      harvester jobs
        - lists harvest jobs

      harvester run
        - runs harvest jobs

      harvester gather_consumer
        - starts the consumer for the gathering queue

      harvester fetch_consumer
        - starts the consumer for the fetching queue

      harvester purge_queues
        - removes all jobs from the fetch and gather queues

      harvester [-j] [--segments={segments}] import [{source-id}]
        - performs the import stage with the last fetched objects, optionally belonging to a certain source.
          Please note that no objects will be fetched from the remote server. It will only affect
          the last fetched objects already present in the database.

          If the -j flag is provided, the objects are not joined to existing datasets. This may be useful
          when importing objects for the first time.

          The --segments flag lets you define a string containing hex digits that represent which of
          the 16 harvest object segments to import, e.g. 15af will run segments 1,5,a,f

      harvester job-all
        - creates new harvest jobs for all active sources.

      harvester reindex
        - reindexes the harvest source datasets

The commands should be run with the pyenv activated and refer to your site's configuration file (mysite.ini in this example)::

      paster --plugin=ckanext-harvest harvester sources --config=mysite.ini

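
The ``--segments`` value described above is just a set of hex digits, one per segment. The sketch below shows one way to read such a string. How an object is assigned to one of the 16 segments is not stated here, so ``segment_of`` is an assumption (first hex digit of an MD5 hash of the object's guid) and may not match what the extension does.

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"


def parse_segments(segments):
    """Return the sorted list of segment digits named in a --segments string."""
    digits = set(segments.lower())
    unknown = digits - set(HEX_DIGITS)
    if unknown:
        raise ValueError("not hex digits: %s" % "".join(sorted(unknown)))
    return sorted(digits)


def segment_of(guid):
    """Assumed assignment rule: first hex digit of the MD5 of the guid."""
    return hashlib.md5(guid.encode("utf-8")).hexdigest()[0]


def is_selected(guid, segments):
    """True if the object with this guid falls in one of the chosen segments."""
    return segment_of(guid) in parse_segments(segments)


print(parse_segments("15af"))
```

Splitting the 16 segments across several invocations (for example ``0123``, ``4567``, ``89ab`` and ``cdef``) would let separate import runs each handle a disjoint subset of the objects.
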

Authorization
=============

Starting from CKAN 2.0, harvest sources behave exactly the same as datasets
(they are actually internally implemented as a dataset type). That means that
they can be searched and faceted, and that the same authorization rules can be
applied to them. The default authorization settings are based on organizations
(equivalent to the `publisher profile` found in old versions).

Have a look at the `Authorization <http://docs.ckan.org/en/latest/authorization.html>`_
documentation on CKAN core to see how to configure your instance depending on
your needs.

The CKAN harvester
===================

The plugin includes a harvester for remote CKAN instances. To use it, you need
to add the `ckan_harvester` plugin to your options file::

    ckan.plugins = harvest ckan_harvester

After adding it, a 'CKAN' option should appear in the 'New harvest source' form.

The CKAN harvesters support a number of configuration options to control their
behaviour. Those need to be defined as a JSON object in the configuration form
field. The currently supported configuration options are:

*   api_version: You can force the harvester to use either version 1 or 2 of
    the CKAN API. Default is 2.

*   default_tags: A list of tags that will be added to all harvested datasets.
    Tags don't need to previously exist.

*   default_groups: A list of groups to which the harvested datasets will be
    added. The groups must exist. Note that you must use ids or names to
    define the groups according to the API version you defined (names for version
    1, ids for version 2).

*   default_extras: A dictionary of key value pairs that will be added to the
    extras of the harvested datasets. You can use the following replacement
    strings, which are replaced before creating or updating the datasets:

    * {dataset_id}
    * {harvest_source_id}
    * {harvest_source_url} # Will be stripped of trailing forward slashes (/)
    * {harvest_source_title} # Requires CKAN 1.6
    * {harvest_job_id}
    * {harvest_object_id}

*   override_extras: Assign default extras even if they already exist in the
    remote dataset. Default is False (only non existing extras are added).

*   user: User who will run the harvesting process. Please note that this user
    needs to have permission for creating packages, and if default groups were
    defined, the user must have permission to assign packages to these groups.

*   api_key: If the remote CKAN instance has restricted access to the API, you
    can provide a CKAN API key, which will be sent in any request.

*   read_only: Create harvested packages in read-only mode. Only the user who
    performed the harvest (the one defined in the ``user`` setting or the
    'harvest' sysadmin) will be able to edit and administer the packages
    created from this harvesting source. Logged in users and visitors will
    only be able to read them.

*   force_all: By default, after the first harvesting, the harvester will gather
    only the packages modified on the remote site since the last harvesting.
    Setting this property to true will force the harvester to gather all remote
    packages regardless of the modification date. Default is False.

*   remote_groups: By default, remote groups are ignored. Setting this property
    enables the harvester to import the remote groups. There are two alternatives.
    Setting it to 'only_local' will just import groups whose name/id is already
    present in the local CKAN. Setting it to 'create' will make an attempt to
    create the groups by copying the details from the remote CKAN.

Here is an example of a configuration object (the one that must be entered in
the configuration field)::

    {
     "api_version": 1,
     "default_tags":["new-tag-1","new-tag-2"],
     "default_groups":["my-own-group"],
     "default_extras":{"new_extra":"Test","harvest_url":"{harvest_source_url}/dataset/{dataset_id}"},
     "override_extras": true,
     "user":"harvester-user",
     "api_key":"<REMOTE_API_KEY>",
     "read_only": true,
     "remote_groups": "only_local"
    }

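
To illustrate how the replacement strings in ``default_extras`` end up in a dataset, here is a minimal sketch that expands them with plain string replacement. The ``expand_extras`` helper and the sample values are hypothetical and are not the harvester's own code. The sketch only mirrors the documented behaviour, including stripping trailing slashes from the source URL.

```python
import json

# The "default_extras" part of the example configuration object.
config = json.loads("""
{
 "default_extras": {
   "new_extra": "Test",
   "harvest_url": "{harvest_source_url}/dataset/{dataset_id}"
 }
}
""")


def expand_extras(default_extras, values):
    """Replace each known {placeholder} in string values; leave the rest as is."""
    expanded = {}
    for key, value in default_extras.items():
        if isinstance(value, str):
            for name, replacement in values.items():
                value = value.replace("{%s}" % name, replacement)
        expanded[key] = value
    return expanded


# Hypothetical values for one harvested dataset.
values = {
    "harvest_source_url": "http://remote.example.org/".rstrip("/"),
    "dataset_id": "abc-123",
}

extras = expand_extras(config["default_extras"], values)
print(extras["harvest_url"])
```

Plain ``str.replace`` is used instead of ``str.format`` so that extras containing other braces are left untouched rather than raising an error.
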

The harvesting interface
========================

Extensions can implement the harvester interface to perform harvesting
operations. The harvesting process takes place in three stages:

1. The **gather** stage compiles all the resource identifiers that need to
   be fetched in the next stage (e.g. in a CSW server, it will perform a
   `GetRecords` operation).

2. The **fetch** stage gets the contents of the remote objects and stores
   them in the database (e.g. in a CSW server, it will perform n
   `GetRecordById` operations).

3. The **import** stage performs any necessary actions on the fetched
   resource (generally creating a CKAN package, but it can be anything the
   extension needs).

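
The three stages can be pictured with a toy pipeline that runs entirely in memory. Everything in this sketch is a stand-in: ``REMOTE`` plays the remote server, ``ToyHarvestObject`` replaces the real database-backed harvest objects, and the functions only mimic the division of work between the stages.

```python
# Toy stand-ins: the real extension stores jobs and objects in the database.
REMOTE = {"guid-1": "<record 1>", "guid-2": "<record 2>"}  # pretend remote server


class ToyHarvestObject(object):
    def __init__(self, guid):
        self.guid = guid
        self.content = None
        self.package_id = None


def gather_stage(remote):
    """Collect identifiers only; no content is downloaded yet."""
    return [ToyHarvestObject(guid) for guid in sorted(remote)]


def fetch_stage(remote, harvest_object):
    """Download one object's content and store it on the object."""
    content = remote.get(harvest_object.guid)
    if content is None:
        return False
    harvest_object.content = content
    return True


def import_stage(packages, harvest_object):
    """Turn the fetched content into a local package (here: a dict entry)."""
    if harvest_object.content is None:
        return False
    packages[harvest_object.guid] = harvest_object.content
    harvest_object.package_id = harvest_object.guid
    return True


packages = {}
for obj in gather_stage(REMOTE):
    if fetch_stage(REMOTE, obj):
        import_stage(packages, obj)

print(sorted(packages))
```

Keeping gathering separate from fetching means the list of identifiers can be built cheaply first, while the slower per-object downloads and imports are handled one object at a time.
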

Plugins willing to implement the harvesting interface must provide the
following methods::

    from ckan.plugins.core import SingletonPlugin, implements
    from ckanext.harvest.interfaces import IHarvester

    class MyHarvester(SingletonPlugin):
        '''
        A Test Harvester
        '''
        implements(IHarvester)

        def info(self):
            '''
            Harvesting implementations must provide this method, which will return a
            dictionary containing different descriptors of the harvester. The
            returned dictionary should contain:

            * name: machine-readable name. This will be the value stored in the
              database, and the one used by ckanext-harvest to call the appropriate
              harvester.
            * title: human-readable name. This will appear in the form's select box
              in the WUI.
            * description: a small description of what the harvester does. This will
              appear on the form as a guidance to the user.

            A complete example may be::

                {
                    'name': 'csw',
                    'title': 'CSW Server',
                    'description': "A server that implements OGC's Catalog Service "
                                   "for the Web (CSW) standard"
                }

            :returns: A dictionary with the harvester descriptors
            '''

        def validate_config(self, config):
            '''

            [optional]

            Harvesters can provide this method to validate the configuration entered in the
            form. It should return a single string, which will be stored in the database.
            Exceptions raised will be shown in the form's error messages.

            :param config: Config string coming from the form
            :returns: A string with the validated configuration options
            '''

        def get_original_url(self, harvest_object_id):
            '''

            [optional]

            This optional but highly recommended method allows harvesters to return
            the URL to the original remote document, given a Harvest Object id.
            Note that once you get the harvest object you have access to its guid as
            well as the object source, which has the URL.
            This URL will be used on error reports to help publishers link to the
            original document that has the errors. If this method is not provided
            or no URL is returned, only a link to the local copy of the remote
            document will be shown.

            Examples:
                * For a CKAN record: http://{ckan-instance}/api/rest/{guid}
                * For a WAF record: http://{waf-root}/{file-name}
                * For a CSW record: http://{csw-server}/?Request=GetElementById&Id={guid}&...

            :param harvest_object_id: HarvestObject id
            :returns: A string with the URL to the original document
            '''

        def gather_stage(self, harvest_job):
            '''
            The gather stage will receive a HarvestJob object and will be
            responsible for:
                - gathering all the necessary objects to fetch in a later
                  stage (e.g. for a CSW server, perform a GetRecords request)
                - creating the necessary HarvestObjects in the database, specifying
                  the guid and a reference to its job. The HarvestObjects need a
                  reference date with the last modified date for the resource; this
                  may need to be set in a different stage depending on the type of
                  source.
                - creating and storing any suitable HarvestGatherErrors that may
                  occur.
                - returning a list with all the ids of the created HarvestObjects.

            :param harvest_job: HarvestJob object
            :returns: A list of HarvestObject ids
            '''

        def fetch_stage(self, harvest_object):
            '''
            The fetch stage will receive a HarvestObject object and will be
            responsible for:
                - getting the contents of the remote object (e.g. for a CSW server,
                  perform a GetRecordById request).
                - saving the content in the provided HarvestObject.
                - creating and storing any suitable HarvestObjectErrors that may
                  occur.
                - returning True if everything went as expected, False otherwise.

            :param harvest_object: HarvestObject object
            :returns: True if everything went right, False if errors were found
            '''

        def import_stage(self, harvest_object):
            '''
            The import stage will receive a HarvestObject object and will be
            responsible for:
                - performing any necessary action with the fetched object (e.g.
                  create a CKAN package).
                  Note: if this stage creates or updates a package, a reference
                  to the package should be added to the HarvestObject.
                - creating the HarvestObject - Package relation (if necessary)
                - creating and storing any suitable HarvestObjectErrors that may
                  occur.
                - returning True if everything went as expected, False otherwise.

            :param harvest_object: HarvestObject object
            :returns: True if everything went right, False if errors were found
            '''

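
As an illustration of the optional ``validate_config`` hook, the sketch below checks a configuration string shaped like the CKAN harvester's JSON object. It is written as a plain function so that it runs without CKAN installed. The specific checks are assumptions chosen for the example, not the rules enforced by any shipped harvester.

```python
import json


def validate_config(config):
    """Check a configuration string and return it unchanged if it is valid."""
    if not config:
        return config
    try:
        config_obj = json.loads(config)
    except ValueError as e:
        raise ValueError("Configuration is not valid JSON: %s" % e)
    if not isinstance(config_obj, dict):
        raise ValueError("Configuration must be a JSON object")
    if "api_version" in config_obj:
        version = config_obj["api_version"]
        # bool is a subclass of int in Python, so reject true/false explicitly.
        if isinstance(version, bool) or version not in (1, 2):
            raise ValueError("api_version must be 1 or 2")
    for key in ("default_tags", "default_groups"):
        if key in config_obj and not isinstance(config_obj[key], list):
            raise ValueError("%s must be a list" % key)
    extras = config_obj.get("default_extras", {})
    if not isinstance(extras, dict):
        raise ValueError("default_extras must be a dictionary")
    return config


validated = validate_config('{"api_version": 1, "default_tags": ["new-tag-1"]}')
print(validated)
```

Inside a real harvester the same logic would live in the ``validate_config(self, config)`` method, and any exception raised would be shown in the form's error messages, as described in the docstring above.
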

See the CKAN harvester for an example of how to implement the harvesting
interface:

    ckanext-harvest/ckanext/harvest/harvesters/ckanharvester.py

Here you can also find other examples of custom harvesters:

* https://github.com/okfn/ckanext-pdeu/tree/master/ckanext/pdeu/harvesters
* https://github.com/okfn/ckanext-inspire/ckanext/inspire/harvesters.py

Running the harvest jobs
========================

The harvesting extension uses two different queues, one that handles the
gathering and another one that handles the fetching and importing. To start
the consumers, run the following command
(make sure you have your Python environment activated)::

      paster --plugin=ckanext-harvest harvester gather_consumer --config=mysite.ini

On another terminal, run the following command::

      paster --plugin=ckanext-harvest harvester fetch_consumer --config=mysite.ini

Finally, on a third console, run the following command to start any
pending harvesting jobs::

      paster --plugin=ckanext-harvest harvester run --config=mysite.ini

The ``run`` command not only starts any pending harvesting jobs, but also
flags those that are finished, allowing new jobs to be created on that particular
source and refreshing the source statistics. That means that you will need to run
this command before being able to create a new job on a source that was being
harvested. (On a production site you will typically have a cron job that runs the
command regularly, see next section.)

Setting up the harvesters on a production server
================================================

The previous approach works fine during development or debugging, but it is
not recommended for production servers. There are several possible ways of
setting up the harvesters, which will depend on your particular infrastructure
and needs. The bottom line is that the gather and fetch processes should be kept
running somehow and then the run command should be run periodically to start
any pending jobs.

The following approach is the one generally used on CKAN deployments, and it
will probably suit most users. It uses Supervisor_, a tool to monitor
processes, and a cron job to run the harvest jobs, and it assumes that you
have already installed and configured the harvesting extension (see
`Installation` if not).

Note: It is recommended to run the harvest process from a non-root user
(generally the one you are running CKAN with). Replace the user `okfn` in the
following steps with the one you are using.

1. Install Supervisor::

       sudo apt-get install supervisor

   You can check if it is running with this command::

       ps aux | grep supervisord

   You should see a line similar to this one::

       root 9224 0.0 0.3 56420 12204 ? Ss 15:52 0:00 /usr/bin/python /usr/bin/supervisord

2. Supervisor needs to have programs added to its configuration, which will
   describe the tasks that need to be monitored. These configuration files are
   stored in ``/etc/supervisor/conf.d``.

   Create a file named ``/etc/supervisor/conf.d/ckan_harvesting.conf``, and copy the following contents::

        ; ===============================
        ; ckan harvester
        ; ===============================

        [program:ckan_gather_consumer]

        command=/var/lib/ckan/std/pyenv/bin/paster --plugin=ckanext-harvest harvester gather_consumer --config=/etc/ckan/std/std.ini

        ; user that owns virtual environment.
        user=okfn

        numprocs=1
        stdout_logfile=/var/log/ckan/std/gather_consumer.log
        stderr_logfile=/var/log/ckan/std/gather_consumer.log
        autostart=true
        autorestart=true
        startsecs=10

        [program:ckan_fetch_consumer]

        command=/var/lib/ckan/std/pyenv/bin/paster --plugin=ckanext-harvest harvester fetch_consumer --config=/etc/ckan/std/std.ini

        ; user that owns virtual environment.
        user=okfn

        numprocs=1
        stdout_logfile=/var/log/ckan/std/fetch_consumer.log
        stderr_logfile=/var/log/ckan/std/fetch_consumer.log
        autostart=true
        autorestart=true
        startsecs=10

   There are a number of things that you will need to replace with your
   specific installation settings (the example above shows paths from a
   CKAN instance installed via Debian packages):

   * command: The absolute path to the paster command located in the
     Python virtual environment and the absolute path to the config
     ini file.

   * user: The unix user you are running CKAN with.

   * stdout_logfile and stderr_logfile: All output coming from the
     harvest consumers will be written to this file. Ensure that the
     necessary permissions are set up.

   The rest of the configuration options are pretty self-explanatory. Refer
   to the `Supervisor documentation <http://supervisord.org/configuration.html#program-x-section-settings>`_
   to learn more about these and other available options.

3. Start the supervisor tasks with the following commands::

       sudo supervisorctl reread
       sudo supervisorctl add ckan_gather_consumer
       sudo supervisorctl add ckan_fetch_consumer
       sudo supervisorctl start ckan_gather_consumer
       sudo supervisorctl start ckan_fetch_consumer

   To check that the processes are running, you can run::

       sudo supervisorctl status

       ckan_fetch_consumer RUNNING pid 6983, uptime 0:22:06
       ckan_gather_consumer RUNNING pid 6968, uptime 0:22:45

   Some problems you may encounter when starting the processes:

   * `ckan_gather_consumer: ERROR (no such process)`

     Double-check your supervisor configuration file and stop and restart the supervisor daemon::

         sudo service supervisor stop; sudo service supervisor start

   * `ckan_gather_consumer: ERROR (abnormal termination)`

     Something prevented the command from running properly. Have a look at the log file that
     you defined in the `stdout_logfile` option to see what happened. Common errors include:

     * `socket.error: [Errno 111] Connection refused`: RabbitMQ is not running. Start it with::

           sudo service rabbitmq-server start

4. Once we have the two consumers running and monitored, we just need to create a cron job
   that will run the `run` harvester command periodically. To do so, edit the cron table with
   the following command (it may ask you to choose an editor)::

       sudo crontab -e -u okfn

   Note that we are running this command as the same user we configured the processes to be run with
   (`okfn` in our example).

   Paste this line into your crontab, again replacing the paths to paster and the ini file with yours::

       # m h dom mon dow command
       */15 * * * * /var/lib/ckan/std/pyenv/bin/paster --plugin=ckanext-harvest harvester run --config=/etc/ckan/std/std.ini

   This particular example will check for pending jobs every fifteen minutes.
   You can of course modify this periodicity; this `Wikipedia page <http://en.wikipedia.org/wiki/Cron#CRON_expression>`_
   has a good overview of the crontab syntax.

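
To make the ``*/15`` minute field concrete, this small sketch lists the minutes of each hour at which such a step expression fires. The ``matching_minutes`` helper is purely illustrative and covers only the ``*/step`` form used above, not the full crontab syntax.

```python
def matching_minutes(step):
    """Minutes of the hour matched by a '*/step' cron minute field."""
    return [minute for minute in range(60) if minute % step == 0]


print(matching_minutes(15))
```

With a step of 15 the ``run`` command is triggered at minutes 0, 15, 30 and 45 of every hour. A smaller step picks up pending jobs sooner, at the cost of running the command more often.
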

.. _Supervisor: http://supervisord.org