diff --git a/README.rst b/README.rst
index bcc9cdb..f9c20a3 100644
--- a/README.rst
+++ b/README.rst
@@ -8,13 +8,17 @@ and adds a CLI and a WUI to CKAN to manage harvesting sources and jobs.
 Installation
 ============
 
-The harvest extension uses Message Queuing to handle the different gather
-stages.
+The harvest extension can use two different backends. You can choose whichever
+you prefer depending on your needs:
 
-You will need to install the RabbitMQ server::
+* `RabbitMQ <http://www.rabbitmq.com>`_: To install it, run::
 
     sudo apt-get install rabbitmq-server
 
+* `Redis <http://redis.io>`_: To install it, run::
+
+    sudo apt-get install redis-server
+
 Clone the repository and set up the extension::
 
     git clone https://github.com/okfn/ckanext-harvest
@@ -27,6 +31,11 @@ well as the harvester for CKAN instances (included with the extension)::
 
     ckan.plugins = harvest ckan_harvester
 
+Also define the backend that you are using with the ``ckan.harvest.mq.type``
+option (it defaults to ``rabbitmq``)::
+
+    ckan.harvest.mq.type = redis
+
 Configuration
 =============
 
@@ -35,13 +44,7 @@ Run the following command to create the necessary tables in the database::
 
     paster --plugin=ckanext-harvest harvester initdb --config=mysite.ini
 
-The extension needs a user with sysadmin privileges to perform the
-harvesting jobs. You can create such a user running this command::
-
-    paster --plugin=ckan sysadmin add harvest
-
-After installation, the harvest interface should be available under /harvest
-if you're logged in with sysadmin permissions, eg.
+After installation, the harvest source listing should be available under /harvest, e.g.:
 
     http://localhost:5000/harvest
 
@@ -55,7 +58,7 @@ The following operations can be run from the command line using the
       harvester initdb
         - Creates the necessary tables in the database
 
-      harvester source {url} {type} [{active}] [{user-id}] [{publisher-id}]
+      harvester source {url} {type} [{config}] [{active}] [{user-id}] [{publisher-id}] [{frequency}]
         - create new harvest source
 
       harvester rmsource {id}
@@ -80,16 +83,26 @@ The following operations can be run from the command line using the
       harvester fetch_consumer
         - starts the consumer for the fetching queue
 
-      harvester import [{source-id}]
-        - perform the import stage with the last fetched objects, optionally
-          belonging to a certain source.
-          Please note that no objects will be fetched from the remote server.
-          It will only affect the last fetched objects already present in the
-          database.
+      harvester purge_queues
+        - removes all jobs from the fetch and gather queues
+
+      harvester [-j] [--segments={segments}] import [{source-id}]
+        - perform the import stage with the last fetched objects, optionally belonging to a certain source.
+          Please note that no objects will be fetched from the remote server. It will only affect
+          the last fetched objects already present in the database.
+
+          If the -j flag is provided, the objects are not joined to existing datasets. This may be useful
+          when importing objects for the first time.
+
+          The --segments flag allows you to define a string of hex digits that represents which of
+          the 16 harvest object segments to import, e.g. 15af will run segments 1, 5, a and f.
 
       harvester job-all
        - create new harvest jobs for all active sources.
+      harvester reindex
+        - reindexes the harvest source datasets
+
 The commands should be run with the pyenv activated and refer to your sites
 configuration file (mysite.ini in this example)::
 
     paster --plugin=ckanext-harvest harvester sources --config=mysite.ini
 
@@ -97,8 +110,15 @@ The commands should be run with the pyenv activated and refer to your sites
 Authorization
 =============
 
-TODO
+Starting from CKAN 2.0, harvest sources behave exactly the same as datasets
+(they are actually internally implemented as a dataset type). That means they
+can be searched and faceted, and that the same authorization rules can be
+applied to them. The default authorization settings are based on organizations
+(equivalent to the `publisher profile` found in old versions).
+Have a look at the `Authorization <http://docs.ckan.org/en/latest/authorization.html>`_
+documentation on CKAN core to see how to configure your instance depending on
+your needs.
 
 The CKAN harvester
 ===================
@@ -347,11 +367,12 @@ pending harvesting jobs::
 
     paster --plugin=ckanext-harvest harvester run --config=mysite.ini
 
-Note: If you don't have the `synchronous_search` plugin loaded, you will need
-to update the search index after the harvesting in order for the packages to
-appear in search results::
-
-    paster --plugin=ckan search-index rebuild
+The ``run`` command not only starts any pending harvesting jobs, but also
+flags those that are finished, allowing new jobs to be created on that
+particular source and refreshing the source statistics. That means you will
+need to run this command before you can create a new job on a source that was
+being harvested (on a production site you will typically have a cron job that
+runs the command regularly; see next section).
 
 Setting up the harvesters on a production server
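The cron job mentioned above could be sketched as follows (a minimal crontab fragment; the 15-minute interval, the pyenv path and the ini file location are assumptions for illustration, not part of this change)::

    # m   h  dom mon dow  command
    # Run any pending harvest jobs and flag finished ones every 15 minutes
    */15  *  *   *   *    /home/okfn/pyenv/bin/paster --plugin=ckanext-harvest harvester run --config=/etc/ckan/mysite.ini

Running it frequently is harmless: if there are no pending or finished jobs the command is a no-op, and a short interval keeps the source statistics fresh.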