openaire-graph-docs/docs/data-model/entities/entity-identifiers.md

39 lines
2.0 KiB
Markdown

---
sidebar_position: 8
---
# OpenAIRE entity identifier and PID mapping policy
OpenAIRE assigns internal identifiers for each object it collects.
By default, the internal identifier is generated as `sourcePrefix::md5(localId)` where:
* `sourcePrefix` is a namespace prefix of 12 chars assigned to the data source at registration time
* `localid` is the identifier assigned to the object by the data source
After years of operation, we can say that:
* `localId` are unstable
* objects can disappear from sources
* PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. pre-print with the DOI of the published version, DOIs with typos)
Therefore, when the record is collected from an authoritative source:
* the identity of the record is forged using the PID, like `pidTypePrefix::md5(lowercase(doi))`
* the PID is added in a `pid` element of the data model
When the record is collected from a source which is not authoritative for any type of PID:
* the identity of the record is forged as usual using the local identifier
* the PID, if available, is added as `alternateIdentifier`
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|---------- |------------------- |--------------------------------------- |
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
OpenAIRE also perform duplicate identification (see the [dedicated section for details](../../data-provision/deduplication/)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).