Update README for 2.0

General clean-up, mention redis, new auth, run command
This commit is contained in:
amercader 2013-05-14 17:00:20 +01:00
parent 1714e55110
commit b316cc26a2
1 changed file with 44 additions and 23 deletions


@ -8,13 +8,17 @@ and adds a CLI and a WUI to CKAN to manage harvesting sources and jobs.
Installation
============
The harvest extension uses message queuing to handle the different gather
stages. Two different backends are supported, so you can choose whichever
you prefer depending on your needs:
* `RabbitMQ <http://www.rabbitmq.com/>`_: To install it, run::
sudo apt-get install rabbitmq-server
* `Redis <http://redis.io/>`_: To install it, run::
sudo apt-get install redis-server
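As a quick sanity check, you can probe the default ports of both backends (Redis listens on 6379, RabbitMQ's AMQP port is 5672) to see which one is reachable. The helper below is a standalone diagnostic sketch, not part of the extension:

```python
# Diagnostic sketch: report which supported backend answers on its
# default port. Not part of ckanext-harvest itself.
import socket


def detect_backend(host="localhost"):
    """Return the first reachable backend name, or None if neither answers."""
    for name, port in (("redis", 6379), ("rabbitmq", 5672)):
        try:
            with socket.create_connection((host, port), timeout=1):
                return name
        except OSError:
            continue
    return None


print(detect_backend())
```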
Clone the repository and set up the extension::
git clone https://github.com/okfn/ckanext-harvest
@ -27,6 +31,11 @@ well as the harvester for CKAN instances (included with the extension)::
ckan.plugins = harvest ckan_harvester
Also define the backend that you are using with the ``ckan.harvest.mq.type``
option (it defaults to ``rabbitmq``)::
ckan.harvest.mq.type = redis
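To see how this option fits into the ini file, here is a minimal sketch of reading it back with the standard library. The ``[app:main]`` section name mirrors the usual CKAN config layout, and the ``rabbitmq`` fallback matches the default stated above:

```python
# Sketch: read the backend option from a CKAN-style ini file.
# The sample content mirrors the options shown in this README.
import configparser

SAMPLE_INI = """
[app:main]
ckan.plugins = harvest ckan_harvester
ckan.harvest.mq.type = redis
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_INI)

# Fall back to the documented default when the option is absent.
mq_type = parser.get("app:main", "ckan.harvest.mq.type", fallback="rabbitmq")
print(mq_type)
```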
Configuration
=============
@ -35,13 +44,7 @@ Run the following command to create the necessary tables in the database::
paster --plugin=ckanext-harvest harvester initdb --config=mysite.ini
The extension needs a user with sysadmin privileges to perform the
harvesting jobs. You can create such a user by running this command::
paster --plugin=ckan sysadmin add harvest
After installation, the harvest source listing should be available under
/harvest, e.g.:
http://localhost:5000/harvest
@ -55,7 +58,7 @@ The following operations can be run from the command line using the
harvester initdb
- Creates the necessary tables in the database
harvester source {url} {type} [{config}] [{active}] [{user-id}] [{publisher-id}] [{frequency}]
- create new harvest source
harvester rmsource {id}
@ -80,16 +83,26 @@ The following operations can be run from the command line using the
harvester fetch_consumer
- starts the consumer for the fetching queue
harvester purge_queues
- removes all jobs from fetch and gather queue
harvester [-j] [--segments={segments}] import [{source-id}]
- perform the import stage with the last fetched objects, optionally belonging to a certain source.
Please note that no objects will be fetched from the remote server. It will only affect
the last fetched objects already present in the database.
If the -j flag is provided, the objects are not joined to existing datasets. This may be useful
when importing objects for the first time.
The --segments flag lets you define a string of hex digits that represent which of
the 16 harvest object segments to import, e.g. 15af will run segments 1,5,a,f
harvester job-all
- create new harvest jobs for all active sources.
harvester reindex
- reindexes the harvest source datasets
The commands should be run with the pyenv activated and refer to your site's configuration file (mysite.ini in this example)::
paster --plugin=ckanext-harvest harvester sources --config=mysite.ini
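The ``--segments`` idea described above can be illustrated with a short sketch: split objects into 16 buckets keyed by a hex digit, then import only the selected buckets. Note that the bucketing function here (first hex digit of an MD5 of the object id) is a hypothetical stand-in for illustration, not necessarily the scheme ckanext-harvest uses internally:

```python
# Illustration of hex-digit segmentation for --segments.
# NOTE: segment_of() is a hypothetical bucketing function, used only
# to demonstrate how a segments string like "15af" selects objects.
import hashlib


def segment_of(object_id):
    """Map an object id to one of 16 segments, '0'..'f'."""
    return hashlib.md5(object_id.encode("utf-8")).hexdigest()[0]


def select_for_import(object_ids, segments="15af"):
    """Keep only the objects whose segment appears in the segments string."""
    wanted = set(segments.lower())
    return [oid for oid in object_ids if segment_of(oid) in wanted]


ids = ["dataset-%d" % i for i in range(100)]
subset = select_for_import(ids, "15af")
print(len(subset), "of", len(ids), "objects fall in segments 1,5,a,f")
```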
@ -97,8 +110,15 @@ The commands should be run with the pyenv activated and refer to your sites conf
Authorization
=============
Starting from CKAN 2.0, harvest sources behave exactly the same as datasets
(they are actually implemented internally as a dataset type). That means that
they can be searched and faceted, and that the same authorization rules can be
applied to them. The default authorization settings are based on organizations
(equivalent to the `publisher profile` found in older versions).
Have a look at the `Authorization <http://docs.ckan.org/en/latest/authorization.html>`_
documentation on CKAN core to see how to configure your instance depending on
your needs.
The CKAN harvester
===================
@ -347,11 +367,12 @@ pending harvesting jobs::
paster --plugin=ckanext-harvest harvester run --config=mysite.ini
The ``run`` command not only starts any pending harvesting jobs, but also
flags those that are finished, allowing new jobs to be created on that
particular source and refreshing the source statistics. That means that you
will need to run this command before being able to create a new job on a
source that was already being harvested (on a production site you will
typically have a cron job that runs the command regularly, see next section).
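Such a cron job could look like the entry below. The 15-minute interval and the paths are purely illustrative; adjust them to your own pyenv and config file locations:

```
# m h dom mon dow   command  -- run pending harvest jobs every 15 minutes
*/15 *  *   *   *   /usr/lib/ckan/bin/paster --plugin=ckanext-harvest harvester run --config=/etc/ckan/mysite.ini
```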
Setting up the harvesters on a production server