OpenAIRE entity identifier and PID mapping policy

OpenAIRE assigns an internal identifier to each object it collects. By default, the internal identifier is generated as sourcePrefix::md5(localId), where:

  • sourcePrefix is a 12-character namespace prefix assigned to the data source at registration time
  • localId is the identifier assigned to the object by the data source
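
The default scheme above can be sketched in a few lines of Python. This is only an illustration, not the OpenAIRE implementation: the function name `default_identifier`, the UTF-8 encoding of the local identifier, the hexadecimal form of the digest, and the prefix `examplerepo_` are all assumptions made for the example.

```python
import hashlib


def default_identifier(source_prefix: str, local_id: str) -> str:
    """Build an identifier of the form sourcePrefix::md5(localId).

    Assumptions (not stated in the policy text): the local identifier is
    encoded as UTF-8 before hashing and the digest is rendered as hex.
    """
    if len(source_prefix) != 12:
        raise ValueError("sourcePrefix is expected to be 12 characters long")
    digest = hashlib.md5(local_id.encode("utf-8")).hexdigest()
    return f"{source_prefix}::{digest}"


# Hypothetical data source prefix and local identifier.
record_id = default_identifier("examplerepo_", "oai:example.org:123")
print(record_id)       # 12-character prefix, "::", then a 32-character hex digest
print(len(record_id))  # 46
```

Because the identifier depends only on the prefix and the local identifier, any change to the local identifier at the source yields a different OpenAIRE identifier for the same object.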

After years of operation, we can say that:

  • local identifiers (localId) are unstable
  • objects can disappear from sources
  • PIDs provided by sources that are not PID agencies (authoritative sources for a specific type of PID) are often wrong (e.g. a pre-print carrying the DOI of the published version, or DOIs with typos)

Therefore, when the record is collected from an authoritative source:

  • the identifier of the record is built from the PID, e.g. pidTypePrefix::md5(lowercase(doi)) for a DOI
  • the PID is added in a pid element of the data model

When the record is collected from a source which is not authoritative for any type of PID:

  • the identifier of the record is built as usual from the local identifier, i.e. sourcePrefix::md5(localId)
  • the PID, if available, is added as an alternateIdentifier
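
The two cases can be sketched together as follows. This is a minimal illustration under stated assumptions, not the actual OpenAIRE code: the `Record` class, the `build_record` function and its parameters are hypothetical, the digest is assumed to be hex over UTF-8 bytes, and lowercasing is applied only to DOIs because that is the only normalization the policy text mentions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def md5_hex(value: str) -> str:
    # Assumption: UTF-8 encoding, hexadecimal digest.
    return hashlib.md5(value.encode("utf-8")).hexdigest()


@dataclass
class Record:
    """Hypothetical, heavily simplified stand-in for a data model record."""
    identifier: str
    pid: List[Tuple[str, str]] = field(default_factory=list)
    alternate_identifier: List[Tuple[str, str]] = field(default_factory=list)


def build_record(
    source_prefix: str,
    local_id: str,
    pid_type: Optional[str] = None,
    pid_value: Optional[str] = None,
    pid_type_prefix: Optional[str] = None,
    authoritative: bool = False,
) -> Record:
    has_pid = bool(pid_type and pid_value)
    if authoritative and has_pid and pid_type_prefix:
        # Authoritative source: the identifier is derived from the PID
        # and the PID is stored in the pid element.
        normalized = pid_value.lower() if pid_type == "doi" else pid_value
        return Record(
            identifier=f"{pid_type_prefix}::{md5_hex(normalized)}",
            pid=[(pid_type, pid_value)],
        )
    # Non-authoritative source: fall back to the local identifier and
    # keep the PID, if any, only as an alternate identifier.
    record = Record(identifier=f"{source_prefix}::{md5_hex(local_id)}")
    if has_pid:
        record.alternate_identifier.append((pid_type, pid_value))
    return record


from_authority = build_record(
    "examplerepo_", "rec-1", "doi", "10.1234/ABC", "doi_________", authoritative=True
)
from_repository = build_record("examplerepo_", "rec-1", "doi", "10.1234/ABC")
print(from_authority.identifier)
print(from_repository.identifier)
```

With this arrangement, a record from an authoritative source keeps the same identifier as long as its PID is unchanged, while a wrong PID supplied by a non-authoritative source never influences the identifier.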

Currently, the following data sources are used as "PID authorities":

| PID Type | Prefix (12 chars) | Authority                              |
|----------|-------------------|----------------------------------------|
| doi      | doi_________      | Crossref, Datacite, Zenodo             |
| pmc      | pmc_________      | Europe PubMed Central, PubMed Central  |
| pmid     | pmid________      | Europe PubMed Central, PubMed Central  |
| arXiv    | arXiv_______      | arXiv.org e-Print Archive              |
| handle   | handle______      | any repository                         |
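
The prefixes listed above appear to be the PID type name padded on the right with underscores up to 12 characters. The sketch below reproduces that pattern; the padding rule is inferred from the listed values rather than stated by the policy, and the names `PID_AUTHORITIES` and `pid_type_prefix` are hypothetical.

```python
# Authorities per PID type, as listed in the policy.
PID_AUTHORITIES = {
    "doi": ["Crossref", "Datacite", "Zenodo"],
    "pmc": ["Europe PubMed Central", "PubMed Central"],
    "pmid": ["Europe PubMed Central", "PubMed Central"],
    "arXiv": ["arXiv.org e-Print Archive"],
    "handle": ["any repository"],
}


def pid_type_prefix(pid_type: str, width: int = 12) -> str:
    """Right-pad the PID type with underscores to a fixed width.

    Assumption: this padding rule is inferred from the listed prefixes.
    """
    if len(pid_type) > width:
        raise ValueError(f"PID type longer than {width} characters: {pid_type!r}")
    return pid_type.ljust(width, "_")


for pid_type, authorities in PID_AUTHORITIES.items():
    print(pid_type_prefix(pid_type), "<-", ", ".join(authorities))
```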

OpenAIRE also performs duplicate identification (see the dedicated section for details). All duplicates are merged into a representative record, which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot reuse the identifier of one of the aggregated records).