# PIDs and identifiers
One of the challenges in keeping the contents of the OpenAIRE Graph stable is
making its identifiers and records stable over time.
There are many barriers to this, as the Graph keeps a map of data sources
that is subject to constant variation: records in repositories vary in content,
original IDs, and PIDs, and may disappear or reappear; the same holds for the
repository itself or the metadata collection it exposes.
Moreover, the mappings applied to the original contents may also change and
improve over time to catch up with the changes in the input records.
## PID Authorities
One front concerns how identity is attributed to the objects populating the
graph. The basic idea is to build the identifiers of the objects in the graph
from the PIDs available in some authoritative sources, while considering all
other sources as “unstable” by definition. Examples of authoritative sources
are Crossref and DataCite. Examples of non-authoritative ones are institutional
repositories, aggregators, etc. PIDs from the authoritative sources form the
stable OpenAIRE ID skeleton of the Graph, precisely because they are immutable
by construction.

Such a policy defines a list of data sources that are considered authoritative
for a specific type of PID they provide. Its effect is twofold:

* OpenAIRE IDs depend on persistent IDs when they are provided by the authority
  responsible for creating them;
* PIDs are included in the graph according to a strict criterion: the PID types
  declared in the table below are mapped as PIDs only when they are collected
  from the corresponding PID authority data source.

| PID Type | Authority |
|-----------|-----------------------------------------------------------------------------------------------------|
| doi | [Crossref](https://www.crossref.org), [DataCite](https://datacite.org) |
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
| uniprot | [Protein Data Bank](http://www.pdb.org/) |
| ena | [Protein Data Bank](http://www.pdb.org/) |
| pdb | [Protein Data Bank](http://www.pdb.org/) |

There is one exception: Handles are minted by many repositories. As listing
them all would not be viable, Handles bypass the PID authority filtering rule
so that they are not lost as PIDs.

In all other cases, PIDs are included in the graph as alternate identifiers.
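The filtering rule described in this section can be sketched in Python. This is only an illustration under stated assumptions, not the OpenAIRE implementation: the `PID_AUTHORITIES` mapping copies the table above, while the function name `classify_pid`, the matching of data sources by lower-cased name, and the two return labels are hypothetical choices made for the example.

```python
# Hypothetical sketch of the PID authority filtering rule (not OpenAIRE code).
# Authority names are copied from the table above and compared in lower case;
# matching data sources by name is an assumption made for illustration.
PID_AUTHORITIES = {
    "doi": {"crossref", "datacite"},
    "pmc": {"europe pubmed central", "pubmed central"},
    "pmid": {"europe pubmed central", "pubmed central"},
    "arxiv": {"arxiv.org e-print archive"},
    "uniprot": {"protein data bank"},
    "ena": {"protein data bank"},
    "pdb": {"protein data bank"},
}


def classify_pid(pid_type: str, datasource: str) -> str:
    """Return "pid" or "alternateIdentifier" for a PID collected from a source."""
    key = pid_type.lower()
    if key == "handle":
        # Handles bypass the authority filter, whatever the source is.
        return "pid"
    if datasource.lower() in PID_AUTHORITIES.get(key, set()):
        # The PID comes from a source that is authoritative for this PID type.
        return "pid"
    # Every other case (including PID types not listed in the table, which is
    # an assumption of this sketch) is treated as an alternate identifier.
    return "alternateIdentifier"
```

For instance, `classify_pid("doi", "Crossref")` yields `"pid"`, whereas the same DOI collected from an institutional repository would be classified as `"alternateIdentifier"`.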
## Delegated authorities
When a record is aggregated from multiple sources considered authoritative for
minting specific PIDs, different mappings could be applied to them and,
depending on the case, this could result in inconsistencies in the attribution
of the field values.
To overcome the issue, the idea is to include such records only once in the
graph. To do so, the concept of "delegated authorities" defines a list of
data sources that assign PIDs to their scientific products from a given PID
minter.
This "selection" can be performed when the entities in the graph sharing the
same identifier are grouped together. The list of delegated authorities
currently includes:

| Datasource delegated | Datasource delegating | PID Type |
|--------------------------------------|----------------------------------|----------|
| [Zenodo](https://zenodo.org) | [DataCite](https://datacite.org) | doi |
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
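One possible reading of this selection step is sketched below in Python. It is not the actual grouping logic: the record shape (a dict with a `datasource` key), the function name `select_record`, and the rule that the delegated data source's record is preferred when both it and the delegating source are present are all assumptions inferred from the column names of the table above.

```python
# Hypothetical sketch: choosing one record among those sharing an identifier.
# Keys are (delegating datasource, PID type); values are the delegated
# datasource, as listed in the table above (names compared in lower case).
DELEGATED_AUTHORITIES = {
    ("datacite", "doi"): "zenodo",
    ("w3id", "w3id"): "rohub",
}


def select_record(group: list, pid_type: str) -> dict:
    """Pick the record to keep from a group of records with the same identifier.

    Each record is assumed to be a dict with at least a "datasource" key.
    """
    if not group:
        raise ValueError("cannot select from an empty group")
    sources = {record["datasource"].lower() for record in group}
    for (delegating, ptype), delegated in DELEGATED_AUTHORITIES.items():
        if ptype == pid_type and delegating in sources and delegated in sources:
            # Assumption: the version from the delegated datasource is kept.
            return next(
                r for r in group if r["datasource"].lower() == delegated
            )
    # No delegation applies: fall back to the first record (arbitrary choice).
    return group[0]
```

With a group containing one record from DataCite and one from Zenodo for the same DOI, this sketch would keep the Zenodo record, so the object appears only once.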
## Identifiers in the Graph
OpenAIRE assigns an internal identifier to each object it collects.
By default, the internal identifier is generated as `sourcePrefix::md5(localId)`,
where:

* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source
  at registration time
* `localId` is the identifier assigned to the object by the data source
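As a rough illustration, the default scheme could be computed as follows in Python. The function name `default_openaire_id`, the UTF-8 encoding of the local identifier, the lower-case hexadecimal rendering of the MD5 digest, and the example prefix are assumptions of this sketch; the text above only specifies the `sourcePrefix::md5(localId)` pattern and the 12-character prefix.

```python
import hashlib


def default_openaire_id(source_prefix: str, local_id: str) -> str:
    """Build an identifier following the `sourcePrefix::md5(localId)` pattern."""
    if len(source_prefix) != 12:
        raise ValueError("sourcePrefix is expected to be 12 characters long")
    # Assumption: the local identifier is hashed as UTF-8 bytes and the digest
    # is rendered as 32 lower-case hexadecimal characters.
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


# "od______1234" is a made-up prefix used only for this example.
print(default_openaire_id("od______1234", "oai:example.org:42"))
```

Because the hash depends only on `localId`, any change in the identifier exposed by the data source produces a different internal identifier.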

After years of operation, we can say that:

* `localId` values are generally unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for
  a specific type of PID) are often wrong (e.g. a pre-print carrying the DOI of
  the published version, DOIs with typos)

Therefore, when the record is collected from an authoritative source:

* the identity of the record is built from the PID,
  as in `pidTypePrefix::md5(lowercase(doi))`
* the PID is added in a `pid` element of the data model

When the record is collected from a source that is not authoritative for any
type of PID:

* the identity of the record is built as usual from the local identifier
* the PID, if available, is added as `alternateIdentifier`

Currently, the following data sources are used as "PID authorities":

| PID Type | Prefix (12 chars) | Authority |
|----------|-----------------------|-----------------------------------------|
| doi | `doi_________` | Crossref, DataCite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
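The PID-based scheme and the prefixes in the table can be sketched in Python as follows. The helper names `pid_prefix` and `pid_based_id` are hypothetical, and so are the UTF-8 encoding and the hexadecimal rendering of the digest. Lower-casing is stated in this document for DOIs only; applying it to every PID type here is an assumption of the sketch.

```python
import hashlib

PREFIX_LENGTH = 12  # prefixes in the table above are 12 characters long


def pid_prefix(pid_type: str) -> str:
    """Pad the PID type with underscores up to 12 chars, e.g. `doi_________`."""
    if len(pid_type) > PREFIX_LENGTH:
        raise ValueError("PID type is longer than the 12-character prefix")
    return pid_type.ljust(PREFIX_LENGTH, "_")


def pid_based_id(pid_type: str, pid_value: str) -> str:
    """Build an identifier following `pidTypePrefix::md5(lowercase(pid))`."""
    # Assumption: the value is lower-cased and hashed as UTF-8 bytes,
    # whatever the PID type.
    digest = hashlib.md5(pid_value.lower().encode("utf-8")).hexdigest()
    return f"{pid_prefix(pid_type)}::{digest}"


# "10.1234/EXAMPLE" is a made-up DOI used only for this example.
print(pid_based_id("doi", "10.1234/EXAMPLE"))
```

Since the value is lower-cased before hashing, two spellings of the same DOI that differ only in letter case map to the same identifier.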

OpenAIRE also performs duplicate identification (see
the [dedicated section for details](/graph-production-workflow/deduplication)).
All duplicates are **merged** together in a **representative record**, which
must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the
identifier of one of the aggregated records).