# Crossref & Unpaywall

This section describes the procedure used to integrate the contents from [Crossref](https://www.crossref.org) and [Unpaywall](https://unpaywall.org) in the OpenAIRE Graph.

## Data acquisition

The dataset containing all the Crossref records is obtained via a complete data dump on a monthly basis.

The Unpaywall dataset is no longer updated, but its latest snapshot (Dec 2021) is used to enrich the Crossref contents.

## Process

In the following we describe the process applied to the Crossref and Unpaywall contents.

### Crossref filtering

Records in Crossref are ruled out if they meet any of the following criteria:

* have a blank title, examples:
    * `10.1093/rheumatology/41.7.837`
    * `10.1093/qjmed/95.7.430`
    * `10.1371/journal.pone.0171434.g005`
* have one of the following publishers: `"Test accounts"`, `"CrossRef Test Account"`
    * Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
        * `10.1007/bf00344543`
        * `10.1007/bf00186154`
        * `10.1306/64ed947a-1724-11d7-8645000102c1865d`
* have authors matching the following invalid names: `",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na"`
    * Examples for `"none"` author from https://api.crossref.org/works?query.author=%22none%22
        * `10.4007/annals.2016.184.3.11`
        * `10.4007/annals.2012.176.1.6`
        * `10.2172/6393585`
    * Examples for `"test"` author from https://api.crossref.org/works?query.author=%22test%22
        * `10.5116/ijme.54ca.a5ae`
        * `10.5755/j01.ss.71.2.544`
        * `10.5755/j01.ee.22.2.319`
* have `"Addie Jackson"` as author and `"Elsevier BV"` as publisher (empirically, we consider them test records)
    * Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
        * `10.2139/ssrn.2082156`
        * `10.2139/ssrn.2202300`
        * `10.2139/ssrn.2255657`
* do not have one of the following values in the `type` field: `"book-section"`, `"book"`, `"book-chapter"`, `"book-part"`, `"book-series"`,
`"book-set"`, `"book-track"`, `"edited-book"`, `"reference-book"`, `"monograph"`, `"journal-article"`, `"dissertation"`, `"other"`, `"peer-review"`, `"proceedings"`, `"proceedings-article"`, `"reference-entry"`, `"report"`, `"report-series"`, `"standard"`, `"standard-series"`, `"posted-content"`, `"dataset"`
    * Examples:
        * `10.1371/journal.pone.0171434.g005`
        * `10.7554/elife.21052.049`
        * `10.1371/journal.pcbi.1005379.s006`

Records with `type=dataset` are mapped into OpenAIRE research products of type dataset. All others are mapped as OpenAIRE research products of type publication.

### Mapping Crossref properties into the OpenAIRE Graph

Properties in OpenAIRE research products are set based on the logic described in the following table:

| OpenAIRE Research Product field path | Crossref path(s) | Notes |
|---|---|---|
| `id` | `doi` | id in the form `doi_________::md5(doi)` |
| `dateofcollection` | `indexed.datetime` | |
| `lastupdatetimestamp` | `indexed.timestamp` | |
| `type` | `type` | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: |
| `originalId` | `doi, clinical-trial-number, alternative-id` | |
| `pid` | | The scheme tells the type of PID, the value contains the actual value |
| `pid.scheme` | | Default value: `doi` |
| `pid.value` | `doi` | The DOI is normalised and lower-cased |
| `maintitle` | `title` | |
| `subtitle` | `subtitle` | |
| `author` | `author` | if available, the sequence is mapped to rank and the ORCID is also mapped |
| `author.name` | `author.given` | |
| `author.surname` | `author.family` | |
| `author.fullname` | `author.given author.family` | |
| `author.rank` | | based on the order, starts from 1 |
| `author.pid` | | only if the ORCID is available |
| `author.pid.id.scheme` | | Default `'pending_orcid'` (meaning that it is not an id confirmed by ORCID) |
| `author.pid.id.value` | `author.ORCID` | |
| `author.pid.provenance.provenance` | | Default `'Harvested'` |
| `author.pid.provenance.trust` | | Default `'0.9'` |
| `description` | `abstract` | |
| `subject` | `subject` | with `classid='keywords'`, i.e. no controlled vocabularies for Crossref subjects |
| `publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `publisher` | `publisher` | |
| `source` | `source` | only if the record is not of type `book` |
| `source` | concatenation of `container-title.head` + `"ISBN: "` + `ISBN.head` | only if the record is of type `book` |
| `container` | | Set only for publications that carry information about the journal they were published in. |
| `container.name` | `container-title.head` | |
| `container.issnOnline` | `issn-type.value` | if `issn-type.type='electronic'` |
| `container.issnPrinted` | `issn-type.value` | if `issn-type.type='print'` |
| `container.vol` | `volume` | |
| `container.sp` | `page` | before `'-'` |
| `container.ep` | `page` | after `'-'` |
| `instance` | | One instance is created with the DOI URL |
| `instance.accessright` | | Values in `instance.accessright.code` and `instance.accessright.label` are set based on license and dateofacceptance:<br />- `UNKNOWN`: if the license is blank<br />- `OPEN ACCESS`: if the license is a CC license, an ACS license or an APA license (considered OPEN also by Unpaywall, see [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license-) for details), or if it is an OUP license and at least 12 months have passed since the publication date<br />- `EMBARGO`: if it is an OUP license and fewer than 12 months have passed since the publication date<br />- `CLOSED`: if there is a license not covered by the previous cases |
| `instance.accessright.code` | | Code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | One of: `OPEN`, `RESTRICTED`, `CLOSED`, `EMBARGO` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | | only if `instance.accessright.value = 'OPEN ACCESS'`. Default is `hybrid`. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| `instance.license` | `license.URL` | If there is a license entry with `license.content-version='vor'`, then this is used. Otherwise the first license entry is used. |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The DOI is normalised and lower-cased |
| `instance.publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `instance.refereed` | | set to `peerReviewed` only if `relation.has-review.id` is not empty, `UNKNOWN` otherwise. |
| `instance.type` | `subtype` | mapped using the [OpenAIRE vocabulary for research products typologies](https://api.openaire.eu/vocabularies/dnet:result_typologies) |
| `instance.url` | `doi` | Full URL of the DOI |

All other fields of the JSON schema not mentioned in the table contain empty values.

All the records from Crossref are related to the datasource with `name=Crossref` and `id=openaire____::081b82f96300b6a6e3d282bad31cb6e2`.

Possible improvements:

* map `clinical-trial-number` and `alternative-id` in `alternateIdentifiers`?
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
* Different approach to set the `refereed` field and improve its coverage?

### Map Crossref links to projects/funders

Links to funding available in Crossref are mapped as funding relationships (`ResearchProduct -- isProducedBy -- Project`) applying the following mapping:

| Funder | Grant code | Link to |
|---|---|---|
| DOI: `{10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}` or name: `'European Union’s Horizon 2020 research and innovation program'` | series of `4-9` digits in `award` | Link to H2020 project |
| DOI: `{10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}` | series of `4-9` digits in `award` | Link to FP7 project |
| DOI: `10.13039/501100000781` or name: `'European Union's'` | series of `4-9` digits in `award` | Link to FP7 or H2020 project |
| DOI: `10.13039/100000001` | `award` | Link to NSF project |
| DOI: `10.13039/501100001665` or name: `{'The French National Research Agency (ANR)', 'The French National Research Agency'}` | `award` | Link to ANR project |
| DOI: `10.13039/501100002341` | `award` | Link to Academy of Finland project |
| DOI: `10.13039/501100001602` | `award`, removing the initial `'SFI'` if present | Link to SFI project |
| DOI: `10.13039/501100000923` | `award` | Link to ARC project |
| DOI: `10.13039/501100000038` | `award` is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to NSERC (`unidentified` project) |
| DOI: `10.13039/501100000155` | `award` is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to SSHRC (`unidentified` project) |
| DOI: `10.13039/501100000024` | `award` is ignored: the project codes in Crossref cannot be mapped to project codes in OpenAIRE | Link to CIHR (`unidentified` project) |
| DOI: `10.13039/501100002848` or name: `'CONICYT, Programa de Formación de Capital Humano Avanzado'` | `award` | Link to CONICYT project |
| DOI: `10.13039/501100003448` | series of `4-9` digits in `award` | Link to GSRT project |
| DOI: `10.13039/501100010198` | `award` | Link to SGOV project |
| DOI: `10.13039/501100004564` | series of `4-9` digits in `award` | Link to MESTD project |
| DOI: `10.13039/501100003407` | `award` | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (`unidentified` project) is also generated |
| DOI: `{10.13039/501100006588, 10.13039/501100004488}` | `award`, removing the `'Project No'` and `'HRZZ'` prefixes, if present | Link to HRZZ or MZOS project |
| DOI: `10.13039/501100006769` | `award` | Link to Russian Science Foundation project |
| DOI: `10.13039/501100001711` | `award` after `'_'` and before `'/'` | Link to SNSF project |
| DOI: `10.13039/501100004410` | `award` | Link to TUBITAK project |
| DOI: `10.13039/100004440` or name: `Wellcome Trust Masters Fellowship` | `award` | Link to Wellcome Trust specific project and to the `unidentified` project. |

### Intersect Crossref with Unpaywall by DOI

The fields we consider from Unpaywall are:

* `is_oa`
* `best_oa_location`
* `oa_status`

The records of Crossref that intersect by DOI with Unpaywall records are enriched with one additional `instance` with the following properties:

| OpenAIRE Research Product field path | Unpaywall field path | Notes |
|---|---|---|
| `instance` | | created only if `is_oa` is true and a `best_oa_location` is available |
| `instance.accessright` | | default value `Open Access`: we do not add instances if Unpaywall says there is no open version |
| `instance.accessright.code` | | Open Access code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | Always `OPEN` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | `oa_status` | |
| `instance.url` | `best_oa_location` | |
| `instance.license` | `best_oa_location.license` | |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The DOI is normalised and lower-cased |

For the definition of Unpaywall's `oa_status` refer to the [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-).

The record will also feature a relation to the Unpaywall data source: `name="UnpayWall"`, `id=openaire____::8ac8380272269217cb09a928c8caa993`.
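
The enrichment rule in the table above can be sketched in a few lines of Python. This is only an illustration, not the actual OpenAIRE implementation. The function names `normalise_doi` and `unpaywall_instance`, the plain-dict shape of the instance, the use of the `url` sub-field of `best_oa_location`, and the COAR code `c_abf2` for open access are assumptions made for the example. The DOI normalisation is reduced to trimming and lower-casing.

```python
from typing import Any, Dict, Optional

# COAR access-right code for "open access" (assumed here for illustration).
COAR_OPEN_ACCESS_CODE = "c_abf2"


def normalise_doi(doi: str) -> str:
    """Simplified stand-in for the real DOI normalisation: trim and lower-case."""
    return doi.strip().lower()


def unpaywall_instance(doi: str, upw: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the additional instance for a Crossref record, or None.

    `upw` is a hypothetical dict holding the three Unpaywall fields
    considered in this section: is_oa, best_oa_location, oa_status.
    """
    location = upw.get("best_oa_location")
    # No instance is added unless Unpaywall reports an open version
    # together with a best OA location.
    if not upw.get("is_oa") or not location:
        return None
    return {
        "accessright": {
            "code": COAR_OPEN_ACCESS_CODE,
            "label": "OPEN",
            "openAccessRoute": upw.get("oa_status"),
        },
        "url": location.get("url"),
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": normalise_doi(doi)},
    }


# Hypothetical Unpaywall record for a made-up DOI.
record = {
    "is_oa": True,
    "oa_status": "green",
    "best_oa_location": {"url": "https://example.org/paper.pdf", "license": "cc-by"},
}
instance = unpaywall_instance(" 10.1234/ABC ", record)
```

Returning `None` for closed records keeps the caller simple: the Crossref record is left unchanged whenever Unpaywall reports no open version.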