Commit Graph

79 Commits

Author SHA1 Message Date
David Read 78933fb775 [#253] Fix default_groups by saving the dicts to the config object, since saving it to the harvester object doesn't work in the real world. This is a lot more efficient than doing group_show for every dataset imported. 2016-06-27 12:01:35 +01:00
Jardel Weyrich e8f539a45e Don't let the user specify mutually exclusive configuration options:
- organizations_filter_include
- organizations_filter_exclude
2016-06-14 11:35:38 -03:00
David Read f1742fb51a Fix default_groups. It accepted a list of package_name/ids and was trying to add this to the package, but the package needs a dict. Added test. 2016-06-10 09:16:32 +00:00
David Read bfc9b8e0d9 [#249] Test and fix docs for default_tags. Needed to improve error handling when saving ValidationError in a HOE. 2016-06-09 22:11:03 +00:00
amercader 5e1512f717 Don't reuse contexts on ckan harvester
Reusing the same context on all calls can lead to hard to debug failures
like

Action function organization_show did not call its auth function

In this case that was caused because the first organization/group_show
raised a NotFound so the auth audit was still in the context. When
organization/group_show was called again at the end of
organization/group_create the auth audit exception was raised.

This commit makes sure that each call has its own context.
2016-05-23 12:20:08 +01:00
Petar Efnushev c16ecea7f0 reverted change in default groups validation 2016-05-20 20:15:54 +02:00
Petar Efnushev c154365371 Fixed creation/import of groups and organizations when harvesting from remote ckan instance 2016-05-20 16:38:48 +02:00
amercader 9dfeb154eb [#158] Tone down log message 2016-02-17 10:05:57 +00:00
David Read 84b0462979 No need to go back twice 2016-02-15 15:36:02 +00:00
David Read f22100e6c2 Merge remote-tracking branch 'origin' into 157-version-three-apify 2016-02-15 15:20:33 +00:00
David Read 4516bfe44e PEP8 and lint, extracted from PR158 2016-02-15 13:50:18 +00:00
David Read 385b369148 Error-free jobs now include ones where an object was not modified. 2016-02-15 13:16:23 +00:00
David Read f63140354d Fix logic error in previous commit 2016-02-15 12:28:46 +00:00
David Read 52c071dbe9 Improved error handling. e.g. if the site it harvests just returns errors. 2016-02-15 12:10:44 +00:00
David Read 331ad84272 Deal with worry about datasets on the remote CKAN being added/removed during harvest. 2016-02-12 18:00:00 +00:00
David Read 7096b7ddf2 Merge branch 'master' of github.com:ckan/ckanext-harvest into 157-version-three-apify 2016-02-12 16:51:26 +00:00
David Read 392c13d828 If there are no revisions then we get a 404, so deal with that better. 2015-11-23 21:36:45 +00:00
David Read 4405066fab Catch exceptions from urllib2.urlopen more comprehensively. I think 400 errors were from CKAN 0.6 or something like that - ignore now. 2015-11-23 21:26:32 +00:00
David Read ae7c500745 Merge branch 'master' into yhteentoimivuuspalvelut-job-reporting-fixes 2015-11-17 12:35:59 +00:00
Raphael Stolt 084723abb7 Catch JSONDecodeError when no JSON content 2015-11-16 10:59:18 +01:00
David Read c7fac36c1c [#107] "unchanged" response tested and related fixes
* fix "existing_package_dict" which wasn't containing metadata_modified (because of the schema in the context) so you never skipped an object.
* fix IntegrityError due to resource revision_id being harvested. No idea why this hasn't caused errors before now.
* "unchanged" is now checked in base instead of ckanharvester - makes sense. Looking at other harvesters, it's normal to return from the import_stage with the value returned from base._create_or_update_package so I've continued with that.
* "unchanged" response is now documented
* better report_status tests in test_queue2.
2015-11-03 00:22:53 +00:00
David Read e59760fefe Merge branch 'job-reporting-fixes' of https://github.com/yhteentoimivuuspalvelut/ckanext-harvest into yhteentoimivuuspalvelut-job-reporting-fixes 2015-11-02 21:25:32 +00:00
David Read 24415844e0 [#158] Fix revision_id problem in second harvest. 2015-11-02 18:13:29 +00:00
David Read b7552ba700 [#158] Try harder to use the "get datasets since time X" method of harvesting. Go back to the last completely successful harvest, rather than just consider the previous one. And that had a bug, because fetch errors were ignored, meaning one fetch error could mean that dataset never got harvested again. 2015-11-02 16:59:19 +00:00
David Read 1a680f3fd3 [#158] Fix spaces encoding broken in previous merge. Tested with data.gov.uk. 2015-10-29 17:31:04 +00:00
David Read e2ab9e58e7 Merge remote-tracking branch 'origin/master' into 157-version-three-apify
Conflicts:
	ckanext/harvest/harvesters/ckanharvester.py
2015-10-28 14:34:27 +00:00
David Read 55245b5091 [#158] PEP8/formatting. 2015-10-27 17:43:11 +00:00
David Read 2a79873855 [#158] Use package search to get all datasets. Add paging search results. Store pkg_dict from search in the object rather than request it again in fetch_stage. 2015-10-27 17:33:22 +00:00
David Read b56fae8aed Fixes and tests
* Fix extras as a list of dicts
* Fix SOLR dates syntax - needed a Z
* Basic tests for this updated ckan harvester
* Now require CKAN 2.0 to be able to save these packages in package_show form. Take advantage of this now that various imports are definitely available, such as munge_tag.
* Add back compatibility for other harvesters supplying restful-like package_dicts to _create_or_update_package

TODO add back in the ability to harvest pre 2.0 CKANs with the RESTful calls (fallback or maybe configurable)
2015-10-23 17:30:28 +00:00
David Read eb9aa17862 Include/exclude orgs functionality based on work by memaldi and ross. 2015-10-21 16:33:16 +00:00
Ross Jones 6dd40bfcf9 Changes the gather stage to use v3 API
Rather than using the revisions in v2 API this now uses the
package_search API so that we can extend it with proper filters in
future.
2015-09-10 18:53:16 +01:00
amercader 673dfc9882 [#127] Use site user on the CKAN harvester
Add missing call
2015-06-11 10:38:33 +01:00
Jari Voutilainen 859133fe36 move detecting unchanged datasets to ckanharvester and queue.py 2015-03-10 14:48:41 +02:00
Stefan Oderbolz c1bcee9684 Use str() to get the error message 2015-01-15 11:36:15 +01:00
Stefan Oderbolz 191c39ce5c Catch the more general URLError instead of HTTPError
HTTPError is a subclass of URLError, so catching URLError is enough. I
think the HTTP error code is not as important in this situation, so
catching the more generic error seems like the best solution.
2015-01-15 10:57:24 +01:00
Stefan Oderbolz b978c26e70 Use ContentFetchError instead of generic Exception 2015-01-15 00:49:11 +01:00
Stefan Oderbolz 935b9dda01 Munge group name before fetching remote group
The API call /api/2/rest/package/<id> returns the display name of the
group instead of its ID. To properly match the group, munge the name
before calling /api/2/rest/group
2015-01-15 00:44:53 +01:00
Stefan Oderbolz ef35c21e2a Improve exception handling with custom exception
1. Try whenever possible to catch specific exceptions
2. Raise custom exception where appropriate
3. Fix the exception handling in _get_group and _get_organization
2015-01-15 00:44:45 +01:00
Stefan Oderbolz 0fd38e0e54 Use _get_group as a fallback for remote orgs
First try to get a remote org from the remote Action API, if this fails
try to use the old rest api call, which works on older CKAN versions.

Only if both options fail, it's currently not possible to get the remote
organization.
2015-01-14 00:10:27 +01:00
Stefan Oderbolz f214577872 Fetch remote organization via action api
Organizations used to be returned by /api/2/rest/group, this is what the
old implementation used to fetch the information to create the remote
organization on the local instance of CKAN.

With this commit the Action API is used to fetch the same information.
2015-01-13 14:46:53 +01:00
Jari Voutilainen 97f09913cf fix job reporting all datasets deleted when actually nothing changed during last two harvests 2014-09-10 09:22:44 +03:00
amercader fbde0b8dc1 [#87] Remove remote url_type from resources
Otherwise CKAN thinks they are uploads, datastore resources, etc., which
can cause problems, e.g. when displaying the URL of the resource. We
are just linking to the remote resource URL.
2014-02-11 17:27:19 +00:00
amercader 5b677b6099 [#83] Fix key error when using default_groups 2014-02-10 13:16:58 +00:00
Stefan Oderbolz c52085006a [#61] Truly ignore harvest sources
The current implementation returns False when a harvest source is being harvested. This leads to an error on the harvesting job, which in turn tends to confuse users that have no idea of this special implementation. This fix ensures that harvest sources are still ignored, but silently.
2013-10-23 07:40:55 +02:00
amercader c18d9dc3af [#71] CKAN harvester: Add datasets to source organization
If the harvest source belongs to an organization, new datasets should be added
to it. This is already the case in the spatial harvesters.

The remote orgs logic has been kept, with the difference that if for
some reason the remote org can not be assigned, the local one is used.

If the source does not have an organization, none is added.
2013-10-22 16:24:43 +01:00
Stefan Oderbolz 8b5d70c6fe Only try to create/match an organization if there is a remote_org 2013-10-11 18:08:32 +02:00
Stefan Oderbolz dd1acd0c6b Use remote_orgs for organizations 2013-10-07 11:22:19 +02:00
Stefan Oderbolz d50eb6fca8 Harvesting of remote organisations similar to remote groups 2013-10-04 16:37:52 +02:00
amercader 01ca5c0dfd [#61] Ignore harvest sources on the CKAN harvester 2013-08-15 14:38:33 +01:00
amercader b25fffda93 [#36] Fix bug on API version checking 2013-08-15 14:37:55 +01:00
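
Commit e8f539a45e above rejects source configurations that set both organizations_filter_include and organizations_filter_exclude. The following is a minimal sketch of what such a check could look like, not the harvester's actual code. The function name `check_org_filters`, the use of a JSON string as input, and the choice of `ValueError` are assumptions made for illustration; only the two option names come from the commit message.

```python
import json


def check_org_filters(config_str):
    """Parse a harvest source config and reject conflicting org filters.

    Hypothetical helper: the name, signature and exception type are
    illustrative assumptions, not taken from ckanext-harvest.
    """
    # An empty config string is treated as "no options set".
    config = json.loads(config_str) if config_str else {}

    include = config.get("organizations_filter_include")
    exclude = config.get("organizations_filter_exclude")

    # The two options are mutually exclusive: an include list and an
    # exclude list cannot both be applied to the same harvest.
    if include and exclude:
        raise ValueError(
            "organizations_filter_include and organizations_filter_exclude "
            "cannot both be set"
        )
    return config
```

With this sketch, a config that sets only one of the two options is returned as a dict, while a config that sets both raises the error before any harvesting starts.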