Go to file
Adrià Mercader 56d5acc867 [refactoring] Add code to handle queuing and the new IHarvester interface. Add
new commands in the CLI to start the queue consumers and fire the harvesting
process.
2011-04-06 12:45:00 +01:00
ckanext [refactoring] Add code to handle queuing and the new IHarvester interface. Add 2011-04-06 12:45:00 +01:00
public/ckanext/harvest Fix some design issues in IE. To make it look perfect, the OL LayerSwitcher control has to be modified 2011-03-23 13:38:01 +00:00
templates/ckanext/harvest [refactoring] Use the common functions in the web interface. 2011-04-05 13:39:23 +01:00
tests [controllers,tests]: Added requirement to be sysadmin for all controller operations. Provided test for this and some useful test infrastrure. 2011-03-25 17:01:26 +00:00
.hgignore Ignore swap files 2011-03-10 09:44:31 +00:00
MANIFEST.in First draft of the Harvesting extension 2011-03-09 18:56:55 +00:00
README.rst [controllers,tests]: Added requirement to be sysadmin for all controller operations. Provided test for this and some useful test infrastrure. 2011-03-25 17:01:26 +00:00
setup.cfg [controllers,tests]: Added requirement to be sysadmin for all controller operations. Provided test for this and some useful test infrastrure. 2011-03-25 17:01:26 +00:00
setup.py Moving the extension points code to package as a .deb file 2011-03-28 15:52:43 +01:00
test-core.ini [controllers,tests]: Added requirement to be sysadmin for all controller operations. Provided test for this and some useful test infrastrure. 2011-03-25 17:01:26 +00:00
test.ini [controllers,tests]: Added requirement to be sysadmin for all controller operations. Provided test for this and some useful test infrastrure. 2011-03-25 17:01:26 +00:00

README.rst

ckanext-harvest - Remote harvesting extension

This extension will contain all harvesting related code, now present in ckan core, ckanext-dgu and ckanext-csw.

Dependencies

You will need ckan installed, as well as the ckanext-dgu and ckanext-csw plugins activated.

If you want to use the spatial search API, you will need PostGIS installed and enable the spatial features of your PostgreSQL database. See the "Setting up PostGIS" section for details.

Tests

To run the tests, this is the basic command:

$ nosetests --ckan tests/

Or with postgres:

$ nosetests --ckan --with-pylons=../ckan/test-core.ini tests/

(See the Ckan README for more information.)

Configuration

The extension needs a user with sysadmin privileges to perform the harvesting jobs. You can create such a user running these two commands in the ckan directory:

paster user add harvest

paster sysadmin add harvest

The user's API key must be defined in the CKAN configuration file (.ini) in the [app:main] section:

ckan.harvesting.api_key = 4e1dac58-f642-4e54-bbc4-3ea262271fe2

The API URL used can be also defined in the ini file (it defaults to http://localhost:5000/):

ckan.api_url = <api_url>

If you are using the spatial search feature, you can define the projection in which extents are stored in the database with the following option. Use the EPSG code as an integer (e.g 4326, 4258, 27700, etc). It defaults to 4258:

ckan.harvesting.srid = 4258

Command line interface

The following operations can be run from the command line using the paster harvester command:

harvester source {url} [{user-ref} [{publisher-ref}]]     
  - create new harvest source

harvester rmsource {url}
  - remove a harvester source (and associated jobs)

harvester sources                                 
  - lists harvest sources

harvester job {source-id} [{user-ref}]
  - create new harvesting job

harvester rmjob {job-id}
  - remove a harvesting job

harvester jobs
  - lists harvesting jobs

harvester run
  - runs harvesting jobs

harvester extents
  - creates or updates the extent geometry column for packages with
    a bounding box defined in extras

The commands should be run from the ckanext-harvest directory and expect a development.ini file to be present. Most of the time you will specify the config explicitly though:

paster harvester sources --config=../ckan/development.ini

API

The extension adds the following call to the CKAN search API, which returns packages with an extent that intersects with the bounding box provided:

/api/2/search/package/geo?bbox={minx,miny,maxx,maxy}[&crs={srid}]

If the bounding box coordinates are not in the same projection as the one defined in the database, a CRS must be provided, in one of the following forms:

  • urn:ogc:def:crs:EPSG::4258
  • EPSG:4258
  • 4258

Setting up PostGIS =================

Configuration

  • Install PostGIS:

    sudo apt-get install postgresql-8.4-postgis
  • Create a new PostgreSQL database:

    sudo -u postgres createdb [database]

    (If you just want to spatially enable an exisiting database, you can ignore this point, but it's a good idea to create a template to make easier to create new databases)

  • Many of the PostGIS functions are written in the PL/pgSQL language, so we need to enable it in our database:

    sudo -u postgres createlang plpgsql [database]
  • Run the following commands. The first one will create the necessary tables and functions in the database, and the second will populate the spatial reference table:

    sudo -u postgres psql -d [database] -f /usr/share/postgresql/8.4/contrib/postgis-1.5/postgis.sql
    sudo -u postgres psql -d [database] -f /usr/share/postgresql/8.4/contrib/postgis-1.5/spatial_ref_sys.sql    
* Execute the following command to see if PostGIS was properly

installed:

sudo -u postgres psql -d [database] -c "SELECT postgis_full_version()"

You should get something like:

postgis_full_version                                          

------------------------------------------------------------------------------------------------------- POSTGIS="1.5.2" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009" LIBXML="2.7.7" USE_STATS (1 row)

Also, if you log into the database, you should see two tables, geometry_columns and spatial_ref_sys (and probably a view called geography_columns).

Note: This commands will create the two tables owned by the postgres user. You probably should make owner the user that will access the database from ckan:

ALTER TABLE spatial_ref_sys OWNER TO [your_user];
ALTER TABLE geometry_columns OWNER TO [your_user];

More information on PostGIS installation can be found here:

http://postgis.refractions.net/docs/ch02.html#PGInstall

Setting up a spatial table

To be able to store geometries and perform spatial operations, PostGIS needs to work with geometry fields. Geometry fields should always be added via the AddGeometryColumn function:

CREATE TABLE package_extent(
    package_id text PRIMARY KEY
);

ALTER TABLE package_extent OWNER TO [your_user];

SELECT AddGeometryColumn('package_extent','the_geom', 4258, 'POLYGON', 2);

This will add a geometry column in the package_extent table called the_geom, with the spatial reference system EPSG:4258. The stored geometries will be polygons, with 2 dimensions.

Have a look a the table definition, and see how PostGIS has created three constraints to ensure that the geometries follow the parameters defined in the geometry column creation:

# \d package_extent

Table "public.package_extent" Column | Type | Modifiers

------------+----------+----------- package_id | text | not null

the_geom | geometry |

Indexes:

"package_extent_pkey" PRIMARY KEY, btree (package_id)

Check constraints:

"enforce_dims_the_geom" CHECK (st_ndims(the_geom) = 2) "enforce_geotype_the_geom" CHECK (geometrytype(the_geom) = 'POLYGON'::text OR the_geom IS NULL) "enforce_srid_the_geom" CHECK (st_srid(the_geom) = 4258)