Graph stability - ID creation policy #37

Closed
opened 2020-08-05 12:21:38 +02:00 by claudio.atzori · 1 comment

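Description originally from the [OpenAIRE Roadmap document](https://docs.google.com/document/d/1CEDmLS8QmyyYt7_xg7l-qAmodRhqYWVgSqWRfo9NGBk/edit?usp=sharing).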

One of the main challenges we need to tackle is the "stability of the Graph", that is, making its identifiers and records stable over time. The barriers to this scenario are many, as the Graph keeps a map of data sources that is subject to constant variation. Records in repositories vary in content, original IDs, and PIDs, and may disappear or reappear; the same holds for the repository or the metadata collection it exposes. Moreover, our mappings may also change and improve over time.

One of the fronts to work on is the attribution of identity to the objects populating the graph. The basic idea is to build the identifiers of the objects in the graph from the PIDs available in some of the authoritative sources, while considering all the other sources as "unstable" by definition. Examples of authoritative sources are Crossref and DataCite. Examples of non-authoritative ones are institutional repositories, Zenodo (which mints DOIs from DataCite), etc. The PIDs from authoritative sources would form the stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable by construction.

This implies we need to introduce an ID creation policy to be used:

  • In the mapping layer part of the graph construction process, used to import the content from the aggregator backends;
  • In the construction of the actionsets; as a consequence, all of them will need to be re-created to assign the new identity to each object/relationship.

Such a policy must define, for each main entity type, a priority level for each PID type that might be found in the original metadata records. The implementation of the policy will then build the identifier according to that priority (e.g. if a record exposes both a DOI and a HANDLE, the DOI will be preferred to build its identifier).
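The priority rule described above can be sketched as follows. This is only an illustration under stated assumptions: the priority table, the function names `pick_pid` and `build_id`, and the identifier layout (`entity|pidtype::md5`) are hypothetical and are not taken from `dhp-schemas`.

```python
import hashlib

# Hypothetical priority table: for each entity type, a lower number means a
# higher priority. The actual values would be defined by the policy.
PID_PRIORITY = {
    "publication": {"doi": 0, "handle": 1},
}


def pick_pid(entity_type, pids):
    """Return the (pid_type, value) pair with the highest priority, or None.

    `pids` is a list of (pid_type, value) tuples found in the original record.
    """
    priorities = PID_PRIORITY.get(entity_type, {})
    candidates = [(t.lower(), v) for t, v in pids if t.lower() in priorities]
    if not candidates:
        return None
    # Sort key: priority first, then the value, so the choice is deterministic
    # even when a record exposes several PIDs of the same type.
    return min(candidates, key=lambda p: (priorities[p[0]], p[1]))


def build_id(entity_type, pids, fallback_id):
    """Build an illustrative identifier from the preferred PID.

    The layout used here is an assumption made for this sketch only.
    """
    chosen = pick_pid(entity_type, pids)
    if chosen is None:
        # No usable PID: fall back to the original (unstable) record id.
        return f"{entity_type}|local::{fallback_id}"
    pid_type, value = chosen
    digest = hashlib.md5(value.strip().lower().encode("utf-8")).hexdigest()
    return f"{entity_type}|{pid_type}::{digest}"


record_pids = [("handle", "11234/5"), ("DOI", "10.1000/ABC")]
print(pick_pid("publication", record_pids))
print(build_id("publication", record_pids, "oai:example:1"))
```

Because the identifier depends only on the chosen PID, the same record keeps the same identifier regardless of the order in which its PIDs are listed or of the repository it was collected from.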

The implementation of the policy should be placed as close as possible to the schema definition, possibly in the module `dhp-schemas`.

claudio.atzori added the enhancement label 2020-08-05 12:21:38 +02:00
sandro.labruzzo was assigned by claudio.atzori 2020-08-05 12:21:38 +02:00
claudio.atzori self-assigned this 2020-08-05 12:21:38 +02:00
alessia.bardi was assigned by claudio.atzori 2020-08-05 12:21:38 +02:00
michele.artini was assigned by claudio.atzori 2020-08-05 12:21:38 +02:00
miriam.baglioni was assigned by claudio.atzori 2020-08-05 12:21:38 +02:00
michele.debonis was assigned by claudio.atzori 2020-08-05 12:21:38 +02:00
enrico.ottonello was assigned by claudio.atzori 2020-08-05 12:21:38 +02:00

The discussion continued in [#6486](https://support.openaire.eu/issues/6486), and the schema module `dhp-schemas-2.7.15` includes utility classes that provide a factory class used to build the internal OpenAIRE identifier according to the updated policy.

The new definition assumes that OpenAIRE IDs should depend on persistent IDs when they are provided by the authority responsible for creating them. This also implies that PIDs are included in the graph according to a tighter criterion: the PID types declared in the table below are mapped as `result.pid` and `result.instance.pid` only when they are collected from the corresponding PID authority datasource.

| PID Type | Authority |
| -------- | -------- |
| doi | Crossref, DataCite |
| pmc, pmid | Europe PubMed Central, PubMed Central |
| arXiv | arXiv.org e-Print Archive |

One more relevant concept is that of the _delegated authorities_, i.e. content providers that assign PIDs on behalf of another authority. These are treated as if they were authorities in the PID creation procedure and, as of today (Sept 2021), the list is defined as follows.

| PID Type | Delegated Authority |
| -------- | -------- |
| doi | Zenodo |

There is an exception though: Handles are minted by several repositories and it would be quite inconvenient (not to mention unmaintainable) to list them all. To avoid losing them, Handles bypass the PID authority filtering rule.

In all other cases, PIDs must be included in the graph within `result.instance.alternateIdentifier`.
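The filtering rule described in this comment can be sketched as follows. This is a minimal illustration, not the code shipped in `dhp-schemas`: the function name `classify_pid` is hypothetical, and matching datasources by their display name (case-insensitively) is an assumption made for the sketch.

```python
# Authority tables transcribed from the comment above. Names are stored
# casefolded so the lookup does not depend on capitalisation.
PID_AUTHORITIES = {
    "doi": {"crossref", "datacite"},
    "pmc": {"europe pubmed central", "pubmed central"},
    "pmid": {"europe pubmed central", "pubmed central"},
    "arxiv": {"arxiv.org e-print archive"},
}

# Delegated authorities (list as of Sept 2021).
DELEGATED_AUTHORITIES = {
    "doi": {"zenodo"},
}

# PID types that bypass the authority filtering rule.
BYPASS_TYPES = {"handle"}


def classify_pid(pid_type, datasource):
    """Return "pid" or "alternateIdentifier" for a PID collected from a datasource.

    "pid" stands for result.pid / result.instance.pid, while
    "alternateIdentifier" stands for result.instance.alternateIdentifier.
    """
    t = pid_type.casefold()
    if t in BYPASS_TYPES:
        return "pid"
    allowed = PID_AUTHORITIES.get(t, set()) | DELEGATED_AUTHORITIES.get(t, set())
    if datasource.casefold() in allowed:
        return "pid"
    return "alternateIdentifier"


for pid_type, source in [
    ("doi", "Crossref"),
    ("doi", "Zenodo"),
    ("doi", "Some Institutional Repository"),
    ("handle", "Some Institutional Repository"),
]:
    print(pid_type, source, "->", classify_pid(pid_type, source))
```

Keeping the authority and delegated-authority tables as separate data structures means the delegated list can be updated without touching the classification logic.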


Reference: D-Net/dnet-hadoop#37