=============================================
ckanext-harvest - Remote harvesting extension
=============================================

This extension provides a common harvesting framework for CKAN extensions
and adds a command line interface (CLI) and a web user interface (WUI) to
CKAN to manage harvesting sources and jobs.

Dependencies
============

The harvest extension uses message queuing to handle the different
harvesting stages.

You will need to install the RabbitMQ server::

    sudo apt-get install rabbitmq-server

The extension uses `carrot` as its messaging library::

    http://ask.github.com/carrot/


Configuration
=============

Run the following command (in the ckanext-harvest directory) to create
the necessary tables in the database::

    paster harvester initdb --config=../ckan/development.ini

The extension needs a user with sysadmin privileges to perform the
harvesting jobs. You can create such a user by running these two commands
in the ckan directory::

    paster user add harvest

    paster sysadmin add harvest

Tests
=====

To run the tests, this is the basic command::

    $ nosetests --ckan tests/

Or with PostgreSQL::

    $ nosetests --ckan --with-pylons=../ckan/test-core.ini tests/

(See the CKAN README for more information.)


Command line interface
======================

The following operations can be run from the command line using the
``paster harvester`` command::

    harvester initdb
      - creates the necessary tables in the database

    harvester source {url} {type} [{active}] [{user-id}] [{publisher-id}]
      - creates a new harvest source

    harvester rmsource {id}
      - removes (inactivates) a harvest source

    harvester sources [all]
      - lists harvest sources
        If 'all' is given, it also shows the inactive sources

    harvester job {source-id}
      - creates a new harvest job

    harvester jobs
      - lists harvest jobs

    harvester run
      - runs harvest jobs

    harvester gather_consumer
      - starts the consumer for the gathering queue

    harvester fetch_consumer
      - starts the consumer for the fetching queue

The commands should be run from the ckanext-harvest directory and expect
a development.ini file to be present. Most of the time you will specify
the config explicitly though::

    paster harvester sources --config=../ckan/development.ini

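When the commands above are driven from a script, it can help to build the
argument list in one place. The following is only a sketch: the helper name
``harvester_command`` is hypothetical and not part of ckanext-harvest, and it
only composes the subcommands and the ``--config`` option already listed in
this section without running anything.

```python
import shlex


def harvester_command(subcommand, *args, config=None):
    """Build the argv list for a ``paster harvester`` call (hypothetical helper)."""
    argv = ["paster", "harvester", subcommand, *args]
    if config is not None:
        # The config file is passed the same way as in the examples above.
        argv.append("--config=" + config)
    return argv


cmd = harvester_command("sources", "all", config="../ckan/development.ini")
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

The resulting list can be handed to ``subprocess.run`` from the
ckanext-harvest directory, assuming ``paster`` is available in the active
Python environment.
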
The CKAN harvester
==================

TODO


The harvesting interface
========================

Extensions can implement the harvester interface to perform harvesting
operations. The harvesting process takes place in three stages:

1. The **gather** stage compiles all the resource identifiers that need to
   be fetched in the next stage (e.g. in a CSW server, it will perform a
   `GetRecords` operation).

2. The **fetch** stage gets the contents of the remote objects and stores
   them in the database (e.g. in a CSW server, it will perform one
   `GetRecordById` operation per record).

3. The **import** stage performs any necessary actions on the fetched
   resource (generally creating a CKAN package, but it can be anything the
   extension needs).

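The three stages can be pictured with a small, self-contained sketch. It does
not use CKAN or the extension at all: ``REMOTE_RECORDS`` and the functions
``gather``, ``fetch``, ``import_record`` and ``harvest`` are hypothetical
stand-ins, and a plain dictionary plays the role of both the remote server
and the database.

```python
# Hypothetical in-memory stand-in for a remote catalogue: guid -> raw record.
REMOTE_RECORDS = {
    "rec-1": "<record>first</record>",
    "rec-2": "<record>second</record>",
}


def gather(remote):
    """Gather stage: collect the identifiers that need to be fetched."""
    return sorted(remote)


def fetch(remote, guid):
    """Fetch stage: get the content of one remote object."""
    return remote[guid]


def import_record(packages, guid, content):
    """Import stage: act on the fetched content (here, store a fake package)."""
    packages[guid] = {"name": guid, "raw": content}
    return True


def harvest(remote):
    """Run the three stages in order and return the imported packages."""
    packages = {}
    for guid in gather(remote):
        content = fetch(remote, guid)
        import_record(packages, guid, content)
    return packages


packages = harvest(REMOTE_RECORDS)
```

In the real extension the stages run separately and are connected through
message queues, so this loop only illustrates the order of the steps.
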
Plugins willing to implement the harvesting interface must provide the
following methods::

    from ckan.plugins.core import SingletonPlugin, implements
    from ckanext.harvest.interfaces import IHarvester

    class MyHarvester(SingletonPlugin):
        '''
        A Test Harvester
        '''
        implements(IHarvester)

        def info(self):
            '''
            Harvesting implementations must provide this method, which will
            return a dictionary containing different descriptors of the
            harvester. The returned dictionary should contain:

            * name: machine-readable name. This will be the value stored in
              the database, and the one used by ckanext-harvest to call the
              appropriate harvester.
            * title: human-readable name. This will appear in the form's
              select box in the WUI.
            * description: a small description of what the harvester does.
              This will appear on the form as guidance to the user.

            A complete example may be::

                {
                    'name': 'csw',
                    'title': 'CSW Server',
                    'description': "A server that implements OGC's Catalog "
                                   "Service for the Web (CSW) standard"
                }

            returns: A dictionary with the harvester descriptors
            '''

        def gather_stage(self, harvest_job):
            '''
            The gather stage will receive a HarvestJob object and will be
            responsible for:
                - gathering all the necessary objects to fetch in a later
                  stage (e.g. for a CSW server, perform a GetRecords request)
                - creating the necessary HarvestObjects in the database,
                  specifying the guid and a reference to its source and job.
                - creating and storing any suitable HarvestGatherErrors that
                  may occur.
                - returning a list with all the ids of the created
                  HarvestObjects.

            :param harvest_job: HarvestJob object
            :returns: A list of HarvestObject ids
            '''

        def fetch_stage(self, harvest_object):
            '''
            The fetch stage will receive a HarvestObject object and will be
            responsible for:
                - getting the contents of the remote object (e.g. for a CSW
                  server, perform a GetRecordById request).
                - saving the content in the provided HarvestObject.
                - creating and storing any suitable HarvestObjectErrors that
                  may occur.
                - returning True if everything went as expected, False
                  otherwise.

            :param harvest_object: HarvestObject object
            :returns: True if everything went right, False if errors were found
            '''

        def import_stage(self, harvest_object):
            '''
            The import stage will receive a HarvestObject object and will be
            responsible for:
                - performing any necessary action with the fetched object
                  (e.g. create a CKAN package).
                  Note: if this stage creates or updates a package, a
                  reference to the package should be added to the
                  HarvestObject.
                - creating the HarvestObject - Package relation (if necessary)
                - creating and storing any suitable HarvestObjectErrors that
                  may occur.
                - returning True if everything went as expected, False
                  otherwise.

            :param harvest_object: HarvestObject object
            :returns: True if everything went right, False if errors were found
            '''

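Since ``info()`` is expected to return a dictionary with ``name``, ``title``
and ``description``, a harvester author may want to check the dictionary
before registering the plugin. The sketch below is an illustration only: the
helper ``missing_descriptors`` and the ``REQUIRED_KEYS`` tuple are
hypothetical, not part of ckanext-harvest, and the check simply reports keys
that are absent or empty.

```python
# Keys that the info() docstring above says the dictionary should contain.
REQUIRED_KEYS = ("name", "title", "description")


def missing_descriptors(info):
    """Return the required keys that are absent or empty in an info() dict."""
    return [key for key in REQUIRED_KEYS if not info.get(key)]


# The complete example from the info() docstring.
csw_info = {
    "name": "csw",
    "title": "CSW Server",
    "description": "A server that implements OGC's Catalog Service "
                   "for the Web (CSW) standard",
}

# A deliberately incomplete dictionary, to show what the helper reports.
incomplete_info = {"name": "csw"}
```

Calling ``missing_descriptors(csw_info)`` gives an empty list, while the
incomplete dictionary yields the names of the missing descriptors.
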
See ckanext-inspire for an example of how to implement the harvesting
interface:

https://bitbucket.org/okfn/ckanext-inspire/src/


Running the harvest jobs
========================

The harvesting extension uses two different queues, one that handles the
gathering and another one that handles the fetching and importing. To start
the consumers, run the following command from the ckanext-harvest directory
(make sure you have your Python environment activated)::

    paster harvester gather_consumer --config=../ckan/development.ini

In another terminal, run the following command::

    paster harvester fetch_consumer --config=../ckan/development.ini

Finally, in a third terminal, run the following command to start any
pending harvesting jobs::

    paster harvester run --config=../ckan/development.ini

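
The way the two queues hand work to each other can be sketched with the
Python standard library. This is an assumption-laden illustration, not the
extension's code: it uses ``queue.Queue`` instead of RabbitMQ and carrot,
runs the consumers one after the other in a single process, and the function
``run_pipeline`` and its ``job_guids`` argument are hypothetical.

```python
import queue


def run_pipeline(job_guids):
    """Pass jobs through a gather queue and a fetch queue, in that order.

    ``job_guids`` maps a job id to the guids its gather stage would find.
    """
    gather_queue = queue.Queue()
    fetch_queue = queue.Queue()
    imported = []

    # Counterpart of "harvester run": send pending jobs to the gather queue.
    for job_id in job_guids:
        gather_queue.put(job_id)

    # Counterpart of the gather consumer: one fetch message per object found.
    while not gather_queue.empty():
        job_id = gather_queue.get()
        for guid in job_guids[job_id]:
            fetch_queue.put((job_id, guid))

    # Counterpart of the fetch consumer: fetch and import each object.
    while not fetch_queue.empty():
        job_id, guid = fetch_queue.get()
        imported.append(guid)

    return imported
```

For example, ``run_pipeline({"job-1": ["a", "b"], "job-2": ["c"]})`` processes
the objects of the first job before those of the second. In the real setup
the consumers are long-running processes, which is why each one is started in
its own terminal.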