Priority to records from delegated authorities #187
Reference: D-Net/dnet-hadoop#187
When a record is aggregated from multiple sources that are each considered authoritative for minting specific PIDs, different mappings may be applied to each copy, depending on the case, which results in inconsistencies in the attribution of the field values. To overcome the issue, the idea is to include such records only once in the graph.
For the time being, this case seems to involve only Zenodo, as a delegated authority of Datacite, and the policy we are going to implement picks the version from Zenodo (as it is assumed to be richer).
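The selection policy can be illustrated with a minimal sketch in plain Java. It is not the code in this PR: the `Version` class, the `DELEGATED_AUTHORITIES` set, and the sample identifier are hypothetical names introduced only for illustration, and real graph entities carry many more fields.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

public class DelegatedAuthoritySelection {

    // Hypothetical, heavily simplified stand-in for a graph record.
    public static final class Version {
        public final String pid;
        public final String source;

        public Version(String pid, String source) {
            this.pid = pid;
            this.source = source;
        }
    }

    // Assumption: the set of delegated authorities; only Zenodo for the time being.
    public static final Set<String> DELEGATED_AUTHORITIES = Set.of("Zenodo");

    // Among versions sharing the same PID, return the one coming from a
    // delegated authority, if there is one.
    public static Optional<Version> selectDelegated(List<Version> sameId) {
        return sameId.stream()
            .filter(v -> DELEGATED_AUTHORITIES.contains(v.source))
            .findFirst();
    }

    public static void main(String[] args) {
        // Illustrative data only, not taken from the real graph.
        List<Version> versions = List.of(
            new Version("10.5281/zenodo.example", "Datacite"),
            new Version("10.5281/zenodo.example", "Zenodo"));

        String chosen = selectDelegated(versions)
            .map(v -> v.source)
            .orElse("none");
        System.out.println(chosen); // prints "Zenodo"
    }
}
```

When no delegated authority is among the sources, the result is empty and the caller would fall back to whatever handling already applies to that group.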
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together, but the graph pipeline does not currently include any such operation between the materialisation of the raw graph and the start of the deduplication workflow.
This implies that we must introduce a new grouping phase, producing a new graph materialization. The implementation of the procedure can share the same code, extended to support this additional business logic; to this end, the grouping Spark job was factored out into the `dhp-common` module.

Note that the project temporarily depends on `dhp-schemas 2.10.26-SNAPSHOT` until that version is released.
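As a rough illustration of how one grouping routine could serve both the existing merge behaviour and the new selection logic, here is a sketch using plain Java collections rather than Spark. The `Entity` class, the `group` method, and its `preferred` and `merge` parameters are assumptions made for this example and do not mirror the actual job in `dhp-common`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Predicate;

public class GroupingSketch {

    // Hypothetical stand-in for a graph entity.
    public static final class Entity {
        public final String id;
        public final String source;

        public Entity(String id, String source) {
            this.id = id;
            this.source = source;
        }
    }

    // Groups entities by identifier and emits exactly one entity per group:
    // the first one matching `preferred` if any, otherwise the merge of all.
    public static List<Entity> group(List<Entity> raw,
                                     Predicate<Entity> preferred,
                                     BinaryOperator<Entity> merge) {
        Map<String, List<Entity>> byId = new LinkedHashMap<>();
        for (Entity e : raw) {
            byId.computeIfAbsent(e.id, k -> new ArrayList<>()).add(e);
        }
        List<Entity> out = new ArrayList<>();
        for (List<Entity> sameId : byId.values()) {
            // Each group holds at least one entity, so the reduction is never empty.
            Entity chosen = sameId.stream()
                .filter(preferred)
                .findFirst()
                .orElseGet(() -> sameId.stream().reduce(merge).get());
            out.add(chosen);
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        List<Entity> raw = List.of(
            new Entity("doi:a", "Datacite"),
            new Entity("doi:a", "Zenodo"),
            new Entity("doi:b", "Crossref"),
            new Entity("doi:b", "Datacite"));

        List<Entity> grouped = group(
            raw,
            e -> e.source.equals("Zenodo"),
            (x, y) -> new Entity(x.id, x.source + "+" + y.source));

        for (Entity e : grouped) {
            System.out.println(e.id + " -> " + e.source);
        }
        // prints:
        // doi:a -> Zenodo
        // doi:b -> Crossref+Datacite
    }
}
```

Passing the selection rule as a predicate keeps the grouping code itself unchanged, which is one way to read "share the same code" above; a predicate that never matches reduces this to plain merging.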
Note: I have skipped the modifications for the provision and iis workflows.
The PR is fine with me. I think it can be integrated.