* run_test - for running a whole harvest on the command-line
* job_abort - for aborting a limbo job
* source - for showing a single harvest source
* allowing a source to be specified by name in several commands
* Fix extras as a list of dicts
* Fix SOLR dates syntax - needed a Z
* Basic tests for this updated ckan harvester
* Now require CKAN 2.0 to be able to save these packages in package_show form. Take advantage of this now that we are sure various imports are definitely available, such as munge_tag.
* Add back compatibility for other harvesters supplying restful-like package_dicts to _create_or_update_package
TODO add back in the ability to harvest pre 2.0 CKANs with the RESTful calls (fallback or maybe configurable)
* Harvesters that change the name when the title changes have had a
problem when the change is small and a number was unnecessarily
appended. e.g. "Trees "->"Trees" meant _gen_new_name("Trees") returned
"trees1". Now you can specify the existing value and it will return
that if it still holds.
* Maximum dataset name length is now adhered to.
* To make a name unique, a sequential number is now added, since for
users that is more understandable and pleasant. However hex digits are
still an option, for those that want to harvest concurrently.
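The naming behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the harvester's actual code: munge_title, MAX_NAME_LENGTH, the taken_names set and the exact signature of gen_new_name are all assumptions made for the example.

```python
import re

MAX_NAME_LENGTH = 100  # assumed limit for dataset names


def munge_title(title):
    """Simplified stand-in for CKAN's title munging (assumption)."""
    name = re.sub(r"[^a-z0-9_-]+", "-", title.strip().lower())
    return name.strip("-")[:MAX_NAME_LENGTH]


def gen_new_name(title, taken_names, existing_name=None):
    """Return a unique name for title, keeping existing_name if it still holds."""
    ideal = munge_title(title)
    # Keep the current name if it is the ideal name, optionally followed
    # by a numeric suffix: a trivial title change then needs no rename.
    if existing_name and re.fullmatch(re.escape(ideal) + r"\d*", existing_name):
        return existing_name
    if ideal not in taken_names:
        return ideal
    counter = 1
    while True:
        suffix = str(counter)
        # Truncate so that name plus suffix never exceeds the maximum length.
        candidate = ideal[:MAX_NAME_LENGTH - len(suffix)] + suffix
        if candidate not in taken_names:
            return candidate
        counter += 1
```

With this sketch, gen_new_name("Trees", {"trees"}, existing_name="trees") keeps "trees", while the same call without existing_name falls through to the sequential suffix.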
If there were no harvesting jobs to run, there was always an ugly
exception message when using the paster command. This replaces the ugly
output with a proper message and uses a custom exception to allow others
to deal with this error differently.
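A rough sketch of the custom-exception pattern, assuming a hypothetical exception name and helper functions (none of these names are taken from the actual code):

```python
class NoNewHarvestJobError(Exception):
    """Raised when there are no harvest jobs waiting to run (name assumed)."""


def run_pending_jobs(pending_jobs):
    """Hypothetical runner: refuses to run when nothing is pending."""
    if not pending_jobs:
        raise NoNewHarvestJobError("There are no new harvesting jobs")
    return ["ran %s" % job for job in pending_jobs]


def command_line_run(pending_jobs):
    """Hypothetical command: turns the error into a plain message."""
    try:
        results = run_pending_jobs(pending_jobs)
    except NoNewHarvestJobError as e:
        # The command prints a friendly message instead of a traceback;
        # other callers are free to handle the exception differently.
        return str(e)
    return "\n".join(results)
```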
This avoids having to create a 'harvest' sysadmin explicitly. It will still
be used if present, but if not the site user will be used. You can also
define the user to use via a config option.
Apparently on package installs this is not well supported
from ckan.plugins.toolkit import check_ckan_version
But this works:
from ckan.plugins import toolkit
toolkit.check_ckan_version(...
Otherwise if there was eg an actual ImportError we just got
2015-03-19 12:30:08,430 DEBUG [ckanext.harvest.plugin] No auth module
for action "update"
on the log
HTTPError is a subclass of URLError, so catching URLError is enough. I
think the HTTP error code is not as important in this situation, so
catching the more generic error seems like the best solution.
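To illustrate the point, here is a small sketch using Python 3's urllib (the harvester itself may use a different module). The fetch helper and its opener parameter are assumptions for the example:

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body, or None if the request failed."""
    try:
        with opener(url) as response:
            return response.read()
    except URLError:
        # HTTPError is a subclass of URLError, so HTTP status errors
        # (404, 500, ...) are caught here as well as connection errors.
        return None
```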
The API call /api/2/rest/package/<id> returns the display name of the
group instead of its ID. To properly match the group, munge the name
before calling /api/2/rest/group
1. Try whenever possible to catch specific exceptions
2. Raise custom exception where appropriate
3. Fix the exception handling in _get_group and _get_organization
First try to get a remote org from the remote Action API, if this fails
try to use the old rest api call, which works on older CKAN versions.
If both options fail, it is currently not possible to get the remote
organization.
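The fallback order can be sketched like this. The two fetch callables and the RemoteOrgNotFound exception are hypothetical stand-ins for the real Action API and REST API requests:

```python
class RemoteOrgNotFound(Exception):
    """Hypothetical error for when neither API returns the organization."""


def get_remote_org(org_id, fetch_via_action_api, fetch_via_rest_api):
    """Try the Action API first, then the old REST API."""
    for fetch in (fetch_via_action_api, fetch_via_rest_api):
        try:
            return fetch(org_id)
        except (IOError, ValueError):
            # Network failure or unparsable response: try the next option.
            continue
    raise RemoteOrgNotFound(org_id)
```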
Organizations used to be returned by /api/2/rest/group, this is what the
old implementation used to fetch the information to create the remote
organization on the local instance of CKAN.
With this commit the Action API is used to fetch the same information.
Up until now we were relying on `for_edit` being present in the
context, but this is only added on the controllers. It's better to be
safe and remove them always. If needed (at index time) they will be
added afterwards.
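A minimal sketch of removing the harvest extras unconditionally. The list of keys and the shape of the package dict (extras as a list of key/value dicts) are assumptions for the example:

```python
# Assumed set of extras added by the harvester.
HARVEST_EXTRA_KEYS = (
    "harvest_object_id",
    "harvest_source_id",
    "harvest_source_title",
)


def strip_harvest_extras(pkg_dict):
    """Return a copy of pkg_dict without the harvest-specific extras."""
    cleaned = dict(pkg_dict)
    cleaned["extras"] = [
        extra for extra in pkg_dict.get("extras", [])
        if extra.get("key") not in HARVEST_EXTRA_KEYS
    ]
    return cleaned
```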
Otherwise CKAN thinks they are uploads, datastore resources, etc, which
can cause problems eg when displaying the URL of the resource. We
are just linking to the remote resource URL.
This was caused by a combination of the auth audit leaking and the
harvester reusing the context for the package_show and package_create
actions. If the package is not found, package_show does not call
check_access, and the auth audit does not pass. This is stored in the
context (`__auth_audit`) and is raised next time that we call
get_action (when we call package_create with the same context)
It could potentially be fixed on master, but it is probably quite rare.
Starting from 2.2 you need to explicitly flag auth functions that
allow anonymous access with the p.toolkit.auth_allow_anonymous_access
decorator. A local version of the decorator is used to ensure we only
use it on CKAN>=2.2
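One way to sketch such a local decorator is shown below. It uses feature detection on a toolkit-like object instead of a version check, which is a swapped-in technique for illustration; the factory function and the stand-in toolkit objects are assumptions, not the extension's code:

```python
from types import SimpleNamespace


def make_anonymous_access_decorator(toolkit):
    """Build a local decorator that only delegates when the toolkit has one."""
    toolkit_decorator = getattr(toolkit, "auth_allow_anonymous_access", None)

    def auth_allow_anonymous_access(auth_function):
        if toolkit_decorator is None:
            # Older CKAN: no flagging needed, return the function unchanged.
            return auth_function
        return toolkit_decorator(auth_function)

    return auth_allow_anonymous_access


def _flag(auth_function):
    """Stand-in for the real toolkit decorator (assumption)."""
    auth_function.auth_allow_anonymous_access = True
    return auth_function


old_toolkit = SimpleNamespace()  # simulates CKAN < 2.2
new_toolkit = SimpleNamespace(auth_allow_anonymous_access=_flag)
```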
Starting from 2.2, resource_update calls package_show before updating
the resource via a package_update call. The dict passed had the harvest
extras (eg harvest_object_id) added which made the update call fail due
to duplicated extra keys. To fix it we now remove any harvest extras
on after_show if there is a 'for_edit' property on the context.
Due to changes in the templates starting on 2.1 the add source button
was not showing. The whole search template has been simplified,
moving the 2.0-only code into a separate file.
Tested in 2.0, 2.1 and 2.2
The current implementation returns False when a harvest source is being harvested. This leads to an error on the harvesting job, which in turn tends to confuse users who have no idea of this special implementation. This fix ensures that harvest sources are still ignored, but silently.
If the harvest source belongs to an organization, new datasets should be added
to it. This is already the case in the spatial harvesters.
The remote orgs logic has been kept, with the difference that if for
some reason the remote org can not be assigned, the local one is used.
If the source does not have an organization, none is added.
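The resulting precedence can be summarised in a short sketch. The function name and its parameters are hypothetical and only model the decision, not the lookup of the organizations themselves:

```python
def choose_owner_org(source_org_id, remote_org_id=None, remote_org_usable=False):
    """Pick the organization for a harvested dataset.

    source_org_id: organization of the harvest source, or None.
    remote_org_id: organization reported by the remote site, or None.
    remote_org_usable: whether the remote org could be assigned locally.
    """
    if remote_org_id and remote_org_usable:
        return remote_org_id
    # Fall back to the source's organization; None means no organization.
    return source_org_id
```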
return the model in the validator instead of checking that it exists in
the validator, returning the id and then fetching it again in the action
function
It's hard for someone outside CKAN to make sure they're sending it in the format
we expect. And they'll also have to keep track of our name format, to keep in
sync whenever we change.
To fix this, we simply do what we already do when creating packages: use a
default name. In this case, the current one.
This prevents exceptions from appearing in the log from Jinja:
[error] [client 1.2.3.4] Error - <class 'jinja2.exceptions.UndefinedError'>: 'dict object' has no attribute 'status'
This is especially needed if you create a new harvest source which does not have all the optional arguments. Previously this led to a KeyError after the creation of the source. Now this simply outputs 'None'.
At some point we may want to transform these to local time at the
dictization level. We will need a library like dateutil to handle it
properly though.
Remove extras whose values are not strings (e.g. dicts, lists..) from
packages before attempting to create or update the packages on the
target site.
In CKAN 1 it was possible for the values of extras to be other types,
but in CKAN 2 they must be strings, so when harvesting from a CKAN 1 site
into a CKAN 2 site SQLAlchemy would crash when trying to create packages
with non-string extras.
The fix in this commit is to simply remove any non-string extras from
the harvested package. (Alternatively, we could try to convert them to a
string using JSON.)
Fixes #42.
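A minimal sketch of the filtering step, assuming the harvested package carries its extras as a plain dict of key to value (the function name is illustrative):

```python
def remove_non_string_extras(package_dict):
    """Return a copy of package_dict whose extras only have string values."""
    cleaned = dict(package_dict)
    cleaned["extras"] = {
        key: value
        for key, value in package_dict.get("extras", {}).items()
        if isinstance(value, str)
    }
    return cleaned
```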
If neither 'only_local' nor 'create' is used the remote groups property
needs to be removed, otherwise it causes an exception when the group is
not found.
I forgot to update one check for the api_version 1 in the code
responsible for the remote group import feature. This commit fixes that.
Signed-off-by: Konrad Reiche <konrad.reiche@fokus.fraunhofer.de>
I have added try-except clauses in order to prevent the process from
crashing if a non-parsable integer is used for the api_version option.
Signed-off-by: Konrad Reiche <konrad.reiche@fokus.fraunhofer.de>
The CKAN logic uses integers when dealing with the API version, e.g.
making checks which API version is in use. Currently, the harvester
uses strings to identify the API version. Instead of dealing with
type conversion the harvester could use integers directly.
This commit fixes okfn/ckanext-harvest#36. When the API version is
parsed from the configuration it is passed through the int() function.
This way the harvesting will still work even if a harvest source was
configured with a string API version which makes this commit backward
compatible.
Signed-off-by: Konrad Reiche <konrad.reiche@fokus.fraunhofer.de>
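The parsing described in the commits above could look roughly like this. The default value, the function name and the error message are assumptions for the example:

```python
DEFAULT_API_VERSION = 2  # assumed default


def parse_api_version(config):
    """Read api_version from a source config dict as an integer."""
    raw = config.get("api_version", DEFAULT_API_VERSION)
    try:
        # int() accepts both 2 and "2", which keeps sources that were
        # configured with a string API version working.
        return int(raw)
    except (ValueError, TypeError):
        raise ValueError("api_version must be an integer, got %r" % (raw,))
```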
The previous implementation only checked for the first source to exist and
didn't allow rerunning the migration for other sources if there was an
error. With the new one, all non-existing sources are migrated each
time.
When deleting a source, if clear_source equals true in the context,
harvest_source_clear will be called. Default is false. The UI shows a
select with the two options.
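In outline, the context flag works as in the sketch below. The delete_source and clear_source callables are hypothetical stand-ins for the real actions, and the order of the two calls is an assumption:

```python
def harvest_source_delete(context, data_dict, delete_source, clear_source):
    """Delete a source and optionally clear its harvested datasets."""
    source_id = data_dict["id"]
    delete_source(source_id)
    # clear_source defaults to False when not present in the context.
    if context.get("clear_source", False):
        clear_source(source_id)
```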