=============================================
ckanext-harvest - Remote harvesting extension
=============================================
This extension provides a common harvesting framework for CKAN extensions
and adds a command line interface (CLI) and a web user interface (WUI) to
CKAN to manage harvesting sources and jobs.

Dependencies
============

The harvest extension uses message queuing to handle the different gather
stages.

You will need to install the RabbitMQ server::

    sudo apt-get install rabbitmq-server

The extension uses `carrot` as its messaging library::

    http://ask.github.com/carrot/

Configuration
=============

Run the following command (in the ckanext-harvest directory) to create
the necessary tables in the database::

    paster harvester initdb --config=../ckan/development.ini

The extension needs a user with sysadmin privileges to perform the
harvesting jobs. You can create such a user by running these two commands
in the ckan directory::

    paster user add harvest
    paster sysadmin add harvest

Tests
=====

To run the tests, this is the basic command::

    $ nosetests --ckan tests/

Or with PostgreSQL::

    $ nosetests --ckan --with-pylons=../ckan/test-core.ini tests/

(See the CKAN README for more information.)

Command line interface
======================

The following operations can be run from the command line using the
``paster harvester`` command::

    harvester initdb
      - Creates the necessary tables in the database

    harvester source {url} {type} [{active}] [{user-id}] [{publisher-id}]
      - create new harvest source

    harvester rmsource {id}
      - remove (inactivate) a harvester source

    harvester sources [all]
      - lists harvest sources
        If 'all' is defined, it also shows the Inactive sources

    harvester job {source-id}
      - create new harvest job

    harvester jobs
      - lists harvest jobs

    harvester run
      - runs harvest jobs

    harvester gather_consumer
      - starts the consumer for the gathering queue

    harvester fetch_consumer
      - starts the consumer for the fetching queue

The commands should be run from the ckanext-harvest directory and expect
a development.ini file to be present. Most of the time you will specify
the config explicitly though::

    paster harvester sources --config=../ckan/development.ini

The CKAN harvester
==================

TODO

The harvesting interface
========================

Extensions can implement the harvester interface to perform harvesting
operations. The harvesting process takes place in three stages:

1. The **gather** stage compiles all the resource identifiers that need to
   be fetched in the next stage (e.g. in a CSW server, it will perform a
   `GetRecords` operation).

2. The **fetch** stage gets the contents of the remote objects and stores
   them in the database (e.g. in a CSW server, it will perform n
   `GetRecordById` operations).

3. The **import** stage performs any necessary actions on the fetched
   resource (generally creating a CKAN package, but it can be anything the
   extension needs).

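
The three stages can be pictured with a small, self-contained simulation.
This is only a sketch under stated assumptions: ``FAKE_REMOTE``, ``gather``,
``fetch``, ``import_`` and ``harvest`` are hypothetical names, one dict stands
in for the remote server and another for the CKAN database, and none of it is
the extension's actual code.

```python
# Hypothetical stand-in for a remote catalogue: identifier -> raw record.
FAKE_REMOTE = {
    "rec-1": "<record>Rivers</record>",
    "rec-2": "<record>Roads</record>",
}


def gather(remote):
    """Gather stage: collect the identifiers to fetch in the next stage."""
    return sorted(remote)


def fetch(remote, guid):
    """Fetch stage: get the content of one remote object."""
    return {"guid": guid, "content": remote[guid]}


def import_(harvest_object, packages):
    """Import stage: act on the fetched content (here, store a 'package')."""
    packages[harvest_object["guid"]] = harvest_object["content"]
    return True


def harvest(remote):
    """Run the three stages in order and return the resulting packages."""
    packages = {}
    for guid in gather(remote):
        harvest_object = fetch(remote, guid)
        import_(harvest_object, packages)
    return packages


packages = harvest(FAKE_REMOTE)
print(sorted(packages))
```

In the extension itself the stages do not run in a single loop like this;
they are decoupled through message queues, as described under "Running the
harvest jobs".
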
Plugins that implement the harvesting interface must provide the
following methods::

    from ckan.plugins.core import SingletonPlugin, implements
    from ckanext.harvest.interfaces import IHarvester

    class MyHarvester(SingletonPlugin):
        '''
        A Test Harvester
        '''
        implements(IHarvester)

        def info(self):
            '''
            Harvesting implementations must provide this method, which will return a
            dictionary containing different descriptors of the harvester. The
            returned dictionary should contain:

            * name: machine-readable name. This will be the value stored in the
              database, and the one used by ckanext-harvest to call the appropriate
              harvester.
            * title: human-readable name. This will appear in the form's select box
              in the WUI.
            * description: a small description of what the harvester does. This will
              appear on the form as a guidance to the user.
            * form_config_interface [optional]: Harvesters willing to store configuration
              values in the database must provide this key. The only supported value is
              'Text'. This will enable the configuration text box in the form. See also
              the ``validate_config`` method.

            A complete example may be::

                {
                    'name': 'csw',
                    'title': 'CSW Server',
                    'description': "A server that implements OGC's Catalog "
                                   "Service for the Web (CSW) standard"
                }

            :returns: A dictionary with the harvester descriptors
            '''

        def validate_config(self, config):
            '''
            Harvesters can provide this method to validate the configuration entered in the
            form. It should return a single string, which will be stored in the database.
            Exceptions raised will be shown in the form's error messages.

            :returns: A string with the validated configuration options
            '''

        def gather_stage(self, harvest_job):
            '''
            The gather stage will receive a HarvestJob object and will be
            responsible for:
                - gathering all the necessary objects to fetch at a later
                  stage (e.g. for a CSW server, perform a GetRecords request)
                - creating the necessary HarvestObjects in the database, specifying
                  the guid and a reference to its source and job.
                - creating and storing any suitable HarvestGatherErrors that may
                  occur.
                - returning a list with all the ids of the created HarvestObjects.

            :param harvest_job: HarvestJob object
            :returns: A list of HarvestObject ids
            '''

        def fetch_stage(self, harvest_object):
            '''
            The fetch stage will receive a HarvestObject object and will be
            responsible for:
                - getting the contents of the remote object (e.g. for a CSW server,
                  perform a GetRecordById request).
                - saving the content in the provided HarvestObject.
                - creating and storing any suitable HarvestObjectErrors that may
                  occur.
                - returning True if everything went as expected, False otherwise.

            :param harvest_object: HarvestObject object
            :returns: True if everything went right, False if errors were found
            '''

        def import_stage(self, harvest_object):
            '''
            The import stage will receive a HarvestObject object and will be
            responsible for:
                - performing any necessary action with the fetched object (e.g.
                  create a CKAN package).
                  Note: if this stage creates or updates a package, a reference
                  to the package should be added to the HarvestObject.
                - creating the HarvestObject - Package relation (if necessary)
                - creating and storing any suitable HarvestObjectErrors that may
                  occur.
                - returning True if everything went as expected, False otherwise.

            :param harvest_object: HarvestObject object
            :returns: True if everything went right, False if errors were found
            '''

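
As an illustration of the ``validate_config`` contract (return a single string,
raise an exception to report a problem in the form), the following sketch
assumes the configuration is entered as JSON. The ``url_filter`` option is a
hypothetical key invented for this example, not something ckanext-harvest
defines, and the function is written standalone so it runs without CKAN; in a
real harvester it would be a method on the plugin class.

```python
import json


def validate_config(config):
    """Return the validated configuration as a string, or raise ValueError."""
    if not config:
        # Nothing was entered in the form, so there is nothing to validate.
        return config
    try:
        options = json.loads(config)
    except ValueError as e:
        raise ValueError("Configuration is not valid JSON: %s" % e)
    if not isinstance(options, dict):
        raise ValueError("Configuration must be a JSON object")
    # 'url_filter' is a hypothetical option used only for illustration.
    if "url_filter" in options and not isinstance(options["url_filter"], str):
        raise ValueError("url_filter must be a string")
    # Store a normalised form so equivalent inputs produce the same string.
    return json.dumps(options, sort_keys=True)
```

Returning a normalised string rather than the raw input is a design choice of
this sketch; the interface only asks for a single string.
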
See ckanext-inspire for an example of how to implement the harvesting
interface:

 https://bitbucket.org/okfn/ckanext-inspire/src/

Running the harvest jobs
========================

The harvesting extension uses two different queues, one that handles the
gathering and another one that handles the fetching and importing. To start
the consumers run the following command from the ckanext-harvest directory
(make sure you have your Python environment activated)::

    paster harvester gather_consumer --config=../ckan/development.ini

In another terminal, run the following command::

    paster harvester fetch_consumer --config=../ckan/development.ini

Finally, in a third terminal, run the following command to start any
pending harvesting jobs::

    paster harvester run --config=../ckan/development.ini
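
The flow between the two queues can be sketched with the standard library's
``queue.Queue`` standing in for the RabbitMQ queues. This is an illustration
only: the functions below are named after the commands but are not the
extension's implementation, the stage callables are hypothetical, and skipping
the import when a fetch reports failure is an assumption about the control
flow.

```python
from queue import Queue

gather_queue = Queue()  # carries job ids waiting to be gathered
fetch_queue = Queue()   # carries object ids waiting to be fetched and imported


def run(pending_jobs):
    """Like ``harvester run``: push every pending job onto the gather queue."""
    for job_id in pending_jobs:
        gather_queue.put(job_id)


def gather_consumer(gather_stage):
    """Drain the gather queue, queueing each gathered object id for fetching."""
    while not gather_queue.empty():
        job_id = gather_queue.get()
        for object_id in gather_stage(job_id):
            fetch_queue.put(object_id)


def fetch_consumer(fetch_stage, import_stage):
    """Drain the fetch queue; import only objects whose fetch succeeded."""
    imported = []
    while not fetch_queue.empty():
        object_id = fetch_queue.get()
        if fetch_stage(object_id) and import_stage(object_id):
            imported.append(object_id)
    return imported


# Hypothetical stages: each job yields two objects, and fetching "/b" fails.
run(["job-1"])
gather_consumer(lambda job_id: [job_id + "/a", job_id + "/b"])
result = fetch_consumer(lambda oid: not oid.endswith("/b"), lambda oid: True)
print(result)
```

Unlike this single-process sketch, the real consumers are long-running
processes, which is why each one is started in its own terminal.
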