Update README for 2.0

General clean-up, mention redis, new auth, run command

This commit is contained in:
parent 1714e55110
commit b316cc26a2

README.rst | 67
@@ -8,13 +8,17 @@ and adds a CLI and a WUI to CKAN to manage harvesting sources and jobs.
 Installation
 ============
 
-The harvest extension uses Message Queuing to handle the different gather
-stages.
+The harvest extension can use two different backends. You can choose whichever
+you prefer depending on your needs:
 
-You will need to install the RabbitMQ server::
+* `RabbitMQ <http://www.rabbitmq.com/>`_: To install it, run::
 
   sudo apt-get install rabbitmq-server
 
+* `Redis <http://redis.io/>`_: To install it, run::
+
+  sudo apt-get install redis-server
+
 Clone the repository and set up the extension::
 
   git clone https://github.com/okfn/ckanext-harvest
@@ -27,6 +31,11 @@ well as the harvester for CKAN instances (included with the extension)::
 
   ckan.plugins = harvest ckan_harvester
 
+Also define the backend that you are using with the ``ckan.harvest.mq.type``
+option (it defaults to ``rabbitmq``)::
+
+  ckan.harvest.mq.type = redis
+
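Taken together, a minimal plugin and backend configuration in the CKAN ini file might look like this (the ``[app:main]`` section name follows the standard CKAN config layout; the values shown are illustrative and should be adjusted to your deployment):

```ini
[app:main]
# Enable the extension and the built-in CKAN-to-CKAN harvester
ckan.plugins = harvest ckan_harvester
# Pick the messaging backend: 'rabbitmq' (the default) or 'redis'
ckan.harvest.mq.type = redis
```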
 
 Configuration
 =============
@@ -35,13 +44,7 @@ Run the following command to create the necessary tables in the database::
 
   paster --plugin=ckanext-harvest harvester initdb --config=mysite.ini
 
-The extension needs a user with sysadmin privileges to perform the
-harvesting jobs. You can create such a user running this command::
-
-  paster --plugin=ckan sysadmin add harvest
-
-After installation, the harvest interface should be available under /harvest
-if you're logged in with sysadmin permissions, eg.
+After installation, the harvest source listing should be available under /harvest, eg:
 
   http://localhost:5000/harvest
 
@@ -55,7 +58,7 @@ The following operations can be run from the command line using the
 harvester initdb
   - Creates the necessary tables in the database
 
-harvester source {url} {type} [{active}] [{user-id}] [{publisher-id}]
+harvester source {url} {type} [{config}] [{active}] [{user-id}] [{publisher-id}] [{frequency}]
   - create new harvest source
 
 harvester rmsource {id}
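For illustration, creating a source with the new signature could look like the following sketch (the URL is a hypothetical example, only the two required positional arguments are shown, and the optional ones would follow in the documented order):

```
paster --plugin=ckanext-harvest harvester source \
    http://demo.ckan.org/ ckan --config=mysite.ini
```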
@@ -80,16 +83,26 @@ The following operations can be run from the command line using the
 harvester fetch_consumer
   - starts the consumer for the fetching queue
 
-harvester import [{source-id}]
-  - perform the import stage with the last fetched objects, optionally
-    belonging to a certain source.
-    Please note that no objects will be fetched from the remote server.
-    It will only affect the last fetched objects already present in the
-    database.
+harvester purge_queues
+  - removes all jobs from the fetch and gather queues
+
+harvester [-j] [--segments={segments}] import [{source-id}]
+  - perform the import stage with the last fetched objects, optionally belonging to a certain source.
+    Please note that no objects will be fetched from the remote server. It will only affect
+    the last fetched objects already present in the database.
+
+    If the -j flag is provided, the objects are not joined to existing datasets. This may be useful
+    when importing objects for the first time.
+
+    The --segments flag allows you to define a string of hex digits that represents which of
+    the 16 harvest object segments to import, e.g. 15af will run segments 1, 5, a and f.
 
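As a rough sketch of how such a segments string selects segments (assuming the 16 segments are simply indexed 0-15 by hex digit, which mirrors the ``15af`` example above rather than the extension's internal hashing):

```shell
# Expand a --segments hex string into the segment indices it selects
segments="15af"
selected=""
for (( i=0; i<${#segments}; i++ )); do
    selected="$selected $((16#${segments:i:1}))"
done
echo "selected segments:$selected"
```

Under that assumption, ``15af`` would select segments 1, 5, 10 (``a``) and 15 (``f``).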
 harvester job-all
   - create new harvest jobs for all active sources.
 
+harvester reindex
+  - reindexes the harvest source datasets
+
 The commands should be run with the pyenv activated and refer to your site's configuration file (mysite.ini in this example)::
 
   paster --plugin=ckanext-harvest harvester sources --config=mysite.ini
@@ -97,8 +110,15 @@ The commands should be run with the pyenv activated and refer to your sites conf
 Authorization
 =============
 
-TODO
+Starting from CKAN 2.0, harvest sources behave exactly the same as datasets
+(they are actually internally implemented as a dataset type). That means that
+they can be searched and faceted, and that the same authorization rules can be
+applied to them. The default authorization settings are based on organizations
+(equivalent to the `publisher profile` found in old versions).
+
+Have a look at the `Authorization <http://docs.ckan.org/en/latest/authorization.html>`_
+documentation on CKAN core to see how to configure your instance depending on
+your needs.
 
 The CKAN harvester
 ===================
@@ -347,11 +367,12 @@ pending harvesting jobs::
 
   paster --plugin=ckanext-harvest harvester run --config=mysite.ini
 
-Note: If you don't have the `synchronous_search` plugin loaded, you will need
-to update the search index after the harvesting in order for the packages to
-appear in search results::
-
-  paster --plugin=ckan search-index rebuild
+The ``run`` command not only starts any pending harvesting jobs, but also
+flags those that are finished, allowing new jobs to be created on that particular
+source, and refreshes the source statistics. That means that you will need to run
+this command before being able to create a new job on a source that was being
+harvested (on a production site you will typically have a cron job that runs the
+command regularly, see the next section).
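Such a cron job might look something like this sketch (the 15-minute interval is an assumption, and a real crontab should use absolute paths for both ``paster`` and the config file):

```
# Every 15 minutes, run pending harvest jobs and flag finished ones
*/15 * * * * paster --plugin=ckanext-harvest harvester run --config=mysite.ini
```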
 
 Setting up the harvesters on a production server