Crossref & Unpaywall

This section describes the procedure used to integrate content from Crossref and Unpaywall into the OpenAIRE Graph.

Data acquisition

The complete set of Crossref records is obtained from a full data dump every month. The Unpaywall dataset is no longer updated, but its latest snapshot (Dec 2021) is used to enrich the Crossref content.

Process

The following subsections describe the processing applied to the Crossref and Unpaywall content.

Crossref filtering

Crossref records are discarded if they:

  • have a blank title, for example:
    • 10.1093/rheumatology/41.7.837
    • 10.1093/qjmed/95.7.430
    • 10.1371/journal.pone.0171434.g005
  • have one of the following publishers: "Test accounts", "CrossRef Test Account"
  • have authors matching the following invalid names: ",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na"
  • have "Addie Jackson" as author and "Elsevier BV" as publisher (empirically, these are test records)
  • do not have one of the following values in the type field: "book-section", "book", "book-chapter", "book-part", "book-series", "book-set", "book-track", "edited-book", "reference-book", "monograph", "journal-article", "dissertation", "other", "peer-review", "proceedings", "proceedings-article", "reference-entry", "report", "report-series", "standard", "standard-series", "posted-content", "dataset"
    • Examples of records discarded by this rule:
      • 10.1371/journal.pone.0171434.g005
      • 10.7554/elife.21052.049
      • 10.1371/journal.pcbi.1005379.s006

Records with type=dataset are mapped into OpenAIRE research products of type dataset. All others are mapped as OpenAIRE research products of type publication.
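
The filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the code used in production: it assumes each record is a Python dict parsed from the Crossref JSON, that `title` is a list of strings, that an author's full name is built as "given family", that the name comparison is case-insensitive, and that a single invalid author is enough to discard the record. The function and constant names are hypothetical.

```python
# Hypothetical sketch of the Crossref filtering rules described above.
INVALID_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na",
}
ALLOWED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def full_name(author):
    """Join given and family name (assumed way to build the full name)."""
    parts = (author.get("given"), author.get("family"))
    return " ".join(p for p in parts if p).strip()


def is_valid_record(rec):
    """Return True if the record survives all the filtering rules."""
    titles = rec.get("title") or []
    if not any(t and t.strip() for t in titles):
        return False  # blank title
    publisher = rec.get("publisher")
    if publisher in INVALID_PUBLISHERS:
        return False  # test publisher
    names = [full_name(a) for a in rec.get("author") or []]
    if any(n.lower() in INVALID_AUTHOR_NAMES for n in names):
        return False  # placeholder author name
    if publisher == "Elsevier BV" and "Addie Jackson" in names:
        return False  # empirically identified test records
    return rec.get("type") in ALLOWED_TYPES


def result_type(rec):
    """Map the Crossref type to the OpenAIRE research product type."""
    return "dataset" if rec.get("type") == "dataset" else "publication"
```

Keeping the three value lists as sets makes each rule a constant-time membership check and keeps the predicate close to the wording of the criteria.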

Mapping Crossref properties into the OpenAIRE Graph

Properties in OpenAIRE research products are set based on the logic described in the following table:

| OpenAIRE Research Product field path | Crossref path(s) | Notes |
|---|---|---|
| id | doi | id in the form doi_________::md5(doi) |
| dateofcollection | indexed.datetime | |
| lastupdatetimestamp | indexed.timestamp | |
| type | type | Using the dnet:result_typologies vocabulary, we look up the instance.type synonym to generate one of the following main entities:<br/>• publication<br/>• dataset |
| originalId | doi, clinical-trial-number, alternative-id | |
| pid | | The scheme tells the type of PID, the value contains the actual value |
| pid.scheme | | Default value: doi |
| pid.value | doi | The doi is normalised and lower-cased |
| maintitle | title | |
| subtitle | subtitle | |
| author | author | if available the sequence is mapped to rank and the ORCID is also mapped |
| author.name | author.given | |
| author.surname | author.family | |
| author.fullname | author.given author.family | |
| author.rank | | based on the order, starts from 1 |
| author.pid | | only if the ORCID is available |
| author.pid.id.scheme | | Default 'pending_orcid' (meaning that it is not an id confirmed by ORCID) |
| author.pid.id.value | author.ORCID | |
| author.pid.provenance.provenance | | Default 'Harvested' |
| author.pid.provenance.trust | | Default '0.9' |
| description | abstract | |
| subject | subject | with classid='keywords', i.e. no controlled vocabularies for Crossref subjects |
| publicationdate | issued.datetime or, if not available, created.datetime | |
| publisher | publisher | |
| source | source | only if the record is not of type book |
| source | concatenation of container-title.head + "ISBN: " + ISBN.head | only if the record is of type book |
| container | | It is set only for publications with information about the journal it was published in. |
| container.name | container-title.head | |
| container.issnOnline | issn-type.value | if issn-type.type='electronic' |
| container.issnPrinted | issn-type.value | if issn-type.type='print' |
| container.vol | volume | |
| container.sp | page | before '-' |
| container.ep | page | after '-' |
| instance | | One instance is created with the DOI URL |
| instance.accessright | | Values in instance.accessright.code and instance.accessright.label are set based on license and dateofacceptance:<br/>- UNKNOWN: if the license is blank<br/>- OPEN ACCESS: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see Unpaywall FAQ for details) or if OUP license, but only after 12 months from the publication date<br/>- EMBARGO: OUP license, before 12 months from the publication date<br/>- CLOSED: if there is a license not covered by the previous cases |
| instance.accessright.code | | Code from the COAR vocabulary for access right |
| instance.accessright.label | | One of: OPEN, RESTRICTED, CLOSED, EMBARGO |
| instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right |
| instance.accessright.openAccessRoute | | only if instance.accessright.value = 'OPEN ACCESS'. Default is hybrid. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| instance.license | license.URL | If there is a license.content-version='vor', then this is used. Otherwise the first license entry is used. |
| instance.pid | | The scheme tells the type of PID, the value contains the actual value |
| instance.pid.scheme | | Default value: doi |
| instance.pid.value | doi | The doi is normalised and lower-cased |
| instance.publicationdate | issued.datetime or, if not available, created.datetime | |
| instance.refereed | | set to peerReviewed only if relation.has-review.id is not empty, UNKNOWN otherwise. |
| instance.type | subtype | mapped using the OpenAIRE vocabulary for research products typologies |
| instance.url | doi | Full URL of the DOI |
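
A few of the mapping rules above (identifier, authors, container) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the production mapping: it assumes records are dicts parsed from the Crossref JSON, that DOI normalisation amounts to trimming and lower-casing, that the MD5 is computed over the normalised DOI, and that a container is created only when `container-title` is present. All function names and the shape of the output dicts are hypothetical.

```python
import hashlib

ID_PREFIX = "doi_________::"  # prefix of the id form given in the table


def normalise_doi(doi):
    """Assumed normalisation: trim whitespace and lower-case."""
    return doi.strip().lower()


def openaire_id(doi):
    """Build an id in the form doi_________::md5(doi)."""
    digest = hashlib.md5(normalise_doi(doi).encode("utf-8")).hexdigest()
    return ID_PREFIX + digest


def map_authors(crossref_authors):
    """Map Crossref authors; rank follows the list order, starting from 1."""
    mapped = []
    for rank, a in enumerate(crossref_authors, start=1):
        given, family = a.get("given"), a.get("family")
        author = {
            "name": given,
            "surname": family,
            "fullname": " ".join(p for p in (given, family) if p),
            "rank": rank,
        }
        if a.get("ORCID"):  # pid is set only if the ORCID is available
            author["pid"] = {
                "id": {"scheme": "pending_orcid", "value": a["ORCID"]},
                "provenance": {"provenance": "Harvested", "trust": "0.9"},
            }
        mapped.append(author)
    return mapped


def map_container(rec):
    """Map journal information; returns None when no container title exists."""
    titles = rec.get("container-title") or []
    if not titles:
        return None
    container = {"name": titles[0], "vol": rec.get("volume")}
    for issn in rec.get("issn-type") or []:
        if issn.get("type") == "electronic":
            container["issnOnline"] = issn.get("value")
        elif issn.get("type") == "print":
            container["issnPrinted"] = issn.get("value")
    page = rec.get("page")
    if page:
        start, sep, end = page.partition("-")
        container["sp"] = start  # part of page before '-'
        if sep:
            container["ep"] = end  # part of page after '-'
    return container
```

Because the id is derived from the normalised DOI, two records whose DOIs differ only in letter case or surrounding whitespace receive the same identifier in this sketch.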

All other fields of the JSON schema not mentioned in the table contain empty values.
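
The license selection and access-right rules from the table can be sketched as below. The table does not say how a license is recognised as CC, ACS, APA or OUP, so the URL fragments used here are placeholders chosen only for illustration. The sketch also assumes that "12 months" means one calendar year, that the embargo ends on the anniversary date itself, and that an OUP-licensed record without a publication date stays under embargo. Function names are hypothetical.

```python
from datetime import date

# Placeholder URL fragments (assumptions, not the real matching rules).
OPEN_LICENSE_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_MARKER = "academic.oup.com"


def pick_license_url(licenses):
    """Prefer the license with content-version 'vor', else the first entry."""
    if not licenses:
        return None
    for lic in licenses:
        if lic.get("content-version") == "vor":
            return lic.get("URL")
    return licenses[0].get("URL")


def one_year_after(d):
    """Same day one year later; 29 Feb falls back to 28 Feb."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def access_right(license_url, publication_date, today):
    """Return UNKNOWN, OPEN ACCESS, EMBARGO or CLOSED for a license URL."""
    if not license_url or not license_url.strip():
        return "UNKNOWN"  # blank license
    url = license_url.lower()
    if any(marker in url for marker in OPEN_LICENSE_MARKERS):
        return "OPEN ACCESS"
    if OUP_MARKER in url:
        # Assumption: without a publication date the embargo cannot be lifted.
        if publication_date and today >= one_year_after(publication_date):
            return "OPEN ACCESS"
        return "EMBARGO"
    return "CLOSED"  # a license not covered by the previous cases
```

Passing `today` as an argument rather than reading the system clock keeps the function deterministic, which makes the 12-month boundary easy to test.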

All the records from Crossref are related to the datasource with name=Crossref and id=openaire____::081b82f96300b6a6e3d282bad31cb6e2.

Possible improvements:

  • map clinical-trial-number and alternative-id in alternateIdentifiers?
  • Verify if Crossref has a property for language, country, container.issnLinking, container.iss, container.edition, container.conferenceplace and container.conferencedate
  • Different approach to set the refereed field and improve its coverage?

Links to funding available in Crossref are mapped as funding relationships (ResearchProduct -- isProducedBy -- Project) applying the following mapping:

| Funder | Grant code | Link to |
|---|---|---|
| DOI: {10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665} or name: 'European Unions Horizon 2020 research and innovation program' | series of 4-9 digits in award | Link to H2020 project |
| DOI: {10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780} | series of 4-9 digits in award | Link to FP7 project |
| DOI: 10.13039/501100000781 OR name: 'European Union's' | series of 4-9 digits in award | Link to FP7 or H2020 project |
| DOI: 10.13039/100000001 | award | Link to NSF project |
| DOI: 10.13039/501100001665 OR name: {'The French National Research Agency (ANR)', 'The French National Research Agency'} | award | Link to ANR project |
| DOI: 10.13039/501100002341 | award | Link to Academy of Finland project |
| DOI: 10.13039/501100001602 | award, removing the initial 'SFI' if present | Link to SFI project |
| DOI: 10.13039/501100000923 | award | Link to ARC project |
| DOI: 10.13039/501100000038 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (unidentified project) |
| DOI: 10.13039/501100000155 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (unidentified project) |
| DOI: 10.13039/501100000024 | award is ignored: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (unidentified project) |
| DOI: 10.13039/501100002848 OR name: 'CONICYT, Programa de Formación de Capital Humano Avanzado' | award | Link to CONICYT project |
| DOI: 10.13039/501100003448 | series of 4-9 digits in award | Link to GSRT project |
| DOI: 10.13039/501100010198 | award | Link to SGOV project |
| DOI: 10.13039/501100004564 | series of 4-9 digits in award | Link to MESTD project |
| DOI: 10.13039/501100003407 | award | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (unidentified project) is also generated |
| DOI: {10.13039/501100006588, 10.13039/501100004488} | award, removing 'Project No' and 'HRZZ' prefix, if present | Link to HRZZ or MZOS project |
| DOI: 10.13039/501100006769 | award | Link to Russian Science Foundation project |
| DOI: 10.13039/501100001711 | award after '_' and before '/' | Link to SNSF project |
| DOI: 10.13039/501100004410 | award | Link to TUBITAK project |
| DOI: 10.13039/100004440 or name: 'Wellcome Trust Masters Fellowship' | award | Link to Wellcome Trust specific project and to the unidentified project. |
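
The grant code rules in the funding table can be illustrated with a few small helpers. This sketch covers only a subset of the funders and is not the production code: it assumes that "series of 4-9 digits" means the first match of 4 to 9 consecutive digits in the award string, and that an SNSF award without '_' is used from its beginning. The helper names and the dispatch table are hypothetical.

```python
import re

# First match of 4 to 9 consecutive digits; a longer run of digits is
# truncated to its first 9 digits by this pattern (assumed behaviour).
DIGIT_SERIES = re.compile(r"\d{4,9}")


def digit_series_code(award):
    """Grant code for funders using 'series of 4-9 digits in award'."""
    match = DIGIT_SERIES.search(award)
    return match.group(0) if match else None


def sfi_code(award):
    """SFI: the award, removing the initial 'SFI' if present."""
    return award[3:] if award.startswith("SFI") else award


def snsf_code(award):
    """SNSF: the part of the award after '_' and before '/'."""
    after = award.split("_", 1)[1] if "_" in award else award
    return after.split("/", 1)[0]


# Subset of the funder table, keyed by funder DOI (illustrative only).
AWARD_NORMALISERS = {
    "10.13039/100010661": digit_series_code,  # one of the H2020 funder DOIs
    "10.13039/501100003448": digit_series_code,  # GSRT
    "10.13039/501100001602": sfi_code,  # SFI
    "10.13039/501100001711": snsf_code,  # SNSF
    "10.13039/100000001": lambda award: award,  # NSF: award used as-is
}


def project_code(funder_doi, award):
    """Return the normalised grant code, or None for funders not listed."""
    normalise = AWARD_NORMALISERS.get(funder_doi)
    return normalise(award) if normalise else None
```

For example, `project_code("10.13039/501100001711", "PP00P2_123456/1")` yields `"123456"` under these assumptions.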

Intersect Crossref with Unpaywall by DOI

The fields we consider from Unpaywall are:

  • is_oa
  • best_oa_location
  • oa_status

Crossref records that match an Unpaywall record by DOI are enriched with one additional instance with the following properties:

| OpenAIRE Research Product field path | Unpaywall field path | Notes |
|---|---|---|
| instance | | created only if is_oa and a best_oa_location is available |
| instance.accessright | | default value Open Access: we do not add instances if UnpayWall says there is no open version |
| instance.accessright.code | | Open Access code from the COAR vocabulary for access right |
| instance.accessright.label | | Always OPEN |
| instance.accessright.scheme | | Scheme that defines the code and label, i.e. the URL to the COAR vocabulary for access right |
| instance.accessright.openAccessRoute | oa_status | |
| instance.url | best_oa_location | |
| instance.license | best_oa_location.license | |
| instance.pid | | The scheme tells the type of PID, the value contains the actual value |
| instance.pid.scheme | | Default value: doi |
| instance.pid.value | doi | The doi is normalised and lower-cased |

For the definition of Unpaywall's oa_status, refer to the Unpaywall FAQ.

The record will also feature a relation to the Unpaywall data source: name="UnpayWall", id=openaire____::8ac8380272269217cb09a928c8caa993.
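
The enrichment step can be sketched as a lookup by DOI followed by the creation of one extra instance. This is a minimal illustration under several assumptions: products and Unpaywall records are plain dicts, `best_oa_location` is an object whose `url` key holds the link, the Unpaywall records are indexed by normalised DOI, and the data source is attached only when an instance is actually added. The product keys `doi`, `instances` and `datasources` are hypothetical, and the COAR code and scheme are omitted.

```python
UNPAYWALL_DATASOURCE = {
    "name": "UnpayWall",
    "id": "openaire____::8ac8380272269217cb09a928c8caa993",
}


def unpaywall_instance(doi, upw):
    """Build the extra instance, or None if there is no open version."""
    location = upw.get("best_oa_location")
    if not upw.get("is_oa") or not location:
        return None
    return {
        "accessright": {
            "label": "OPEN",  # always OPEN for Unpaywall instances
            "openAccessRoute": upw.get("oa_status"),
        },
        "url": location.get("url"),
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }


def enrich(products, unpaywall_by_doi):
    """Add one Unpaywall instance to each product matched by DOI."""
    for product in products:
        doi = product["doi"].strip().lower()
        upw = unpaywall_by_doi.get(doi)
        if upw is None:
            continue  # no intersection with Unpaywall
        instance = unpaywall_instance(doi, upw)
        if instance is None:
            continue  # Unpaywall reports no open version
        product.setdefault("instances", []).append(instance)
        product.setdefault("datasources", []).append(UNPAYWALL_DATASOURCE)
    return products
```

Indexing the Unpaywall snapshot by DOI once turns the intersection into a dictionary lookup per Crossref record; at the scale of the full dumps the same idea would be expressed as a join in a distributed processing framework.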