Graph stability - ID creation policy #37
Reference: D-Net/dnet-hadoop#37
Description originally from the OpenAIRE Roadmap document.
One of the main challenges we need to tackle is the “stability of the Graph”, that is, making its identifiers and records stable over time. The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variation. Records in repositories vary in content, original IDs, and PIDs, and may disappear or reappear; the same holds for the repository itself or the metadata collection it exposes. In addition, our mappings may also change and improve over time.
One of the fronts to work on regards the attribution of identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some of the authoritative sources, while considering all the other sources as by definition “unstable”. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, Zenodo (which mints DOIs from DataCite), etc. Authoritative sources' PIDs would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction.
This implies we need to introduce an ID creation policy to be used
Such a policy must define, for each main entity type, a priority level for each PID type that might be found in the original metadata records. The implementation of the policy will then build the identifier according to that priority (e.g. if a record exposes both a DOI and a HANDLE, the DOI will be preferred to build its identifier).
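The priority rule described above can be sketched roughly as follows. This is only an illustration, not the code in dhp-schemas: the class name, the priority list, the lower-case normalisation and the "<type>::<md5>" identifier layout are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a PID priority policy for one entity type.
final class PidPrioritySketch {

    // Assumed priority order for results: earlier entries win.
    static final List<String> RESULT_PID_PRIORITY = Arrays.asList("doi", "handle");

    // Returns the highest-priority PID type among those the record exposes.
    // Keys of pidsByType are assumed to be lower-case PID type names.
    static Optional<String> preferredType(Map<String, String> pidsByType) {
        return RESULT_PID_PRIORITY.stream()
            .filter(pidsByType::containsKey)
            .findFirst();
    }

    // Builds an illustrative identifier "<type>::<md5 of trimmed, lower-cased value>".
    // Empty when the record exposes no PID type known to the policy.
    static Optional<String> buildId(Map<String, String> pidsByType) {
        return preferredType(pidsByType)
            .map(type -> type + "::"
                + md5Hex(pidsByType.get(type).trim().toLowerCase(Locale.ROOT)));
    }

    // Hex-encoded MD5 digest of the UTF-8 bytes of the input.
    static String md5Hex(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

With this sketch, a record exposing both a DOI and a HANDLE gets an identifier derived from the DOI only, so the identifier does not change if the handle later disappears from the source record.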
The implementation of the policy should be placed as close as possible to the schema definition, possibly in the module dhp-schemas.
The discussion was continued in #6486 and the schema module dhp-schemas-2.7.15 includes the utility classes providing a factory class that is used to build the internal OpenAIRE identifier according to the updated policy.
The new definition assumes that OpenAIRE IDs should depend on persistent IDs when they are provided by the authority responsible for creating them. This also implies that PIDs are included in the graph according to a tighter criterion: the PID types declared in the table below are mapped as result.pid and result.instance.pid only when they are collected from the relevant PID authority datasource.
One more relevant concept is that of delegated authorities, i.e. content providers that assign PIDs on behalf of another authority. These are treated as if they were authorities in the PID creation procedure and, as of today (Sept 2021), the list is defined as follows.
There is an exception though: Handle(s) are minted by several repositories and it would be quite inconvenient (not to mention unmaintainable) to list them all, so to avoid losing them, Handles bypass the PID authority filtering rule.
In all other cases, PIDs must be included in the graph within result.instance.alternateIdentifier.
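The filtering rule, including the Handle exception, could look roughly like the sketch below. The authority map is illustrative only: it lists Crossref and DataCite for DOIs because they are named above as authoritative sources, while the actual table and the delegated authority list are not reproduced here. The class, enum and method names are hypothetical, and sending PID types absent from the map to alternateIdentifier is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the PID authority filtering rule.
final class PidRoutingSketch {

    // Where a collected PID ends up in the result record.
    enum Target { PID, ALTERNATE_IDENTIFIER }

    // Assumed mapping: PID type -> datasources acting as authority
    // (or delegated authority) for that type. Illustrative, not the real table.
    static final Map<String, Set<String>> AUTHORITIES = new HashMap<>();
    static {
        AUTHORITIES.put("doi", new HashSet<>(Arrays.asList("crossref", "datacite")));
    }

    static Target route(String pidType, String datasource) {
        String type = pidType.toLowerCase(Locale.ROOT);
        // Handles bypass the authority check: too many repositories mint them.
        if (type.equals("handle")) {
            return Target.PID;
        }
        Set<String> authorities = AUTHORITIES.getOrDefault(type, Collections.emptySet());
        // Kept as a PID only when collected from an authority for this type.
        return authorities.contains(datasource.toLowerCase(Locale.ROOT))
            ? Target.PID
            : Target.ALTERNATE_IDENTIFIER;
    }
}
```

Under these assumptions, a DOI collected from Crossref stays in result.pid, the same DOI collected from an institutional repository goes to result.instance.alternateIdentifier, and a Handle is kept as a PID whatever its source.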