updated changelog, doiboost dismission

This commit is contained in:
Claudio Atzori 2024-07-15 15:04:16 +02:00
parent f06b3324f9
commit b00c45af1c
5 changed files with 209 additions and 1 deletions

View File

@ -19,6 +19,23 @@ This section documents all notable changes for each graph version.
---
### v8.0.0
_Start Date: 2024-05-15 • Release Date: 2024-06-20 • Dataset release: **no**_
#### Added
- Introduced new Field of Science classifications for publications, reaching a total of ~77.2Mi publications classified
- General increase of the affiliations +20% (from 162Mi to 195Mi)
- General increase of the scientific products with ORCID identified authors +10% (from 3.09Mi to 3.39Mi)
#### Changed
- Revised deduplication configuration to better exploit resource types
- The DOIBoost dataset was superseded by the direct aggregation of its datasources: Crossref, Unpaywall, Microsoft Academic Graph, ORCID
- Relaxed Crossref publication inclusion criteria, now accepting records without author information, leading to a +15% increase (from 127Mi to 146Mi records). Included contents until April 2023
- Updated ORCID contents until April 2024
- Updated Datacite contents until April 2024
### v7.1.3
_Start Date: 2024-04-10 • Release Date: 2024-04-22 • Dataset release: **no**_

View File

@ -0,0 +1,165 @@
# Crossref & Unpaywall
This section describes the procedure used to integrate the contents from [Crossref](https://www.crossref.org) and [Unpaywall](https://unpaywall.org) in the OpenAIRE Graph.
## Data acquisition
The dataset containing all the Crossref records is obtained via a complete data dump on a monthly basis.
The Unpaywall dataset is no longer updated anymore but its latest snapshot (Dec 2021) is used to enrich the Crossref contents.
## Process
In the following we describe the process applied to the Crossref & the Unpaywall contents.
### Crossref filtering
Records in Crossref are ruled out according to the following criteria
* have blank title, examples:
* `10.1093/rheumatology/41.7.837`
* `10.1093/qjmed/95.7.430`
* `10.1371/journal.pone.0171434.g005`
* have one of the following publishers: `"Test accounts"`, `"CrossRef Test Account"`
* Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
* `10.1007/bf00344543`
* `10.1007/bf00186154`
* `10.1306/64ed947a-1724-11d7-8645000102c1865d`
* have authors matching the following invalid names: `",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na"`
* Examples for `"none"` author from https://api.crossref.org/works?query.author=%22none%22
* `10.4007/annals.2016.184.3.11`
* `10.4007/annals.2012.176.1.6`
* `10.2172/6393585`
* Examples for `"test"` author from https://api.crossref.org/works?query.author=%22test%22
* `10.5116/ijme.54ca.a5ae`
* `10.5755/j01.ss.71.2.544`
* `10.5755/j01.ee.22.2.319`
* have `"Addie Jackson"` as author and `"Elsevier BV"` as publisher (empirically we say they are test records)
* Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
* `10.2139/ssrn.2082156`
* `10.2139/ssrn.2202300`
* `10.2139/ssrn.2255657`
* have not one of the following values in the field `type` : `"book-section"`, `"book"`, `"book-chapter"`, `"book-part"`, `"book-series"`, `"book-set"`, `"book-track"`, `"edited-book"`, `"reference-book"`, `"monograph"`, `"journal-article"`, `"dissertation"`, `"other"`, `"peer-review"`, `"proceedings"`, `"proceedings-article"`, `"reference-entry"`, `"report"`, `"report-series"`, `"standard"`, `"standard-series"`, `"posted-content"`, `"dataset"`,
* Example:
* `10.1371/journal.pone.0171434.g005`
* `10.7554/elife.21052.049`
* `10.1371/journal.pcbi.1005379.s006`
Records with `type=dataset` are mapped into OpenAIRE research products of type dataset. All others are mapped as OpenAIRE research products of type publication.
### Mapping Crossref properties into the OpenAIRE Graph
Properties in OpenAIRE research products are set based on the logic described in the following table:
| OpenAIRE Research Product field path | Crossref path(s) | Notes |
|----------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `id` | `doi` | id in the form `doi_________::md5(doi)` |
| `dateofcollection` | `indexed.datetime` | |
| `lastupdatetimestamp` | `indexed.timestamp` | |
| `type` | `type` | `dataset` if the Crossref type is dataset, `publication` otherwise (based on the filtering logics described above) |
| `originalId` | `doi, clinical-trial-number, alternative-id` | |
| `pid` | | The scheme tells the type of PID, the value contains the actual value |
| `pid.scheme` | | Default value: doi |
| `pid.value` | `doi` | The doi is normalised and lower-cased |
| `maintitle` | `title` | |
| `subtitle` | `subtitle` | |
| `author` | `author` | if available the sequence is mapped to rank and the ORCID is also mapped |
| `author.name` | `author.given` | |
| `author.surname` | `author.family` | |
| `author.fullname` | `author.given author.family` | |
| `author.rank` | | based on the order, starts from 1 |
| `author.pid` | | only if the ORCID is available |
| `author.pid.id.scheme` | | Default `'pending_orcid'` (meaning that it is not an id confirmed by ORCID) |
| `author.pid.id.value` | `author.ORCID` | |
| `author.pid.provenance.provenance` | | Default 'Harvested' |
| `author.pid.provenance.trust` | | Default '0.9' |
| `description` | `abstract` | |
| `subject` | `subject` | with `classid='keywords'`, i.e. no controlled vocabularies for Crossref subjects |
| `publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `publisher` | `publisher` | |
| `source` | `source` | only if the record is not of type `book` |
| `source` | concatenation of `container-title.head` + `"ISBN: "` + `ISBN.head` | only if the record is of type `book` |
| `container` | | It is set only for publications with information about the journal it was published in. |
| `container.name` | `container-title.head` | |
| `container.issnOnline` | `issn-type.value` | if `issn-type.type='electronic'` |
| `container.issnPrinted` | `issn-type.value` | if `issn-type.type='print'` |
| `container.vol` | `volume` | |
| `container.sp` | `page` | before `'-'` |
| `container.ep` | `page` | after `'-'` |
| `instance` | | One instance is created with the DOI URL |
| `instance.accessright` | | Values in `instance.accessright.code` and `instance.accessright.label` are set based on license and dateofacceptance:<br/>- `UNKNOWN`: if the license is blank<br/>- `OPEN ACCESS`: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license-) for details) or if OUP license, but only after 12 months from the publication date<br/>- `EMBARGO`: OUP license, before 12 months from the publication date<br/>- `CLOSED`: if there is a license not covered by the previous cases |
| `instance.accessright.code` | | Code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | One of: `OPEN`, `RESTRICTED`, `CLOSED`, `EMBARGO` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | | only if `instance.accessright.value = 'OPEN ACCESS'`. Default is `hybrid`. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
| `instance.license` | `license.URL ` | If there is a `license.content-version='vor'`, then this is used. Otherwise the first license entry is used. |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
| `instance.publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
| `instance.refereed` | | set to `peerReviewed` only if `relation.has-review.id` is not empty, `UNKNOWN` otherwise. |
| `instance.type` | `subtype` | mapped using the [OpenAIRE vocabulary for research products typologies](https://api.openaire.eu/vocabularies/dnet:result_typologies) |
| `instance.url` | `doi` | Full URL of the DOI |
All other fields of the Json schema not mentioned in the table contain empty values.
All the records from Crossref are related to the datasource with `name=Crossref` and `id=openaire____::081b82f96300b6a6e3d282bad31cb6e2`
Possible improvements:
* map `clinical-trial-number` and `alternative-id` in `alternateIdentifiers`?
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
* Different approach to set the `refereed` field and improve its coverage?
### Map Crossref links to projects/funders
Links to funding available in Crossref are mapped as funding relationships (`ResearchProduct -- isProducedBy -- Project`) applying the following mapping:
| Funder | Grant code | Link to |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
| DOI: `{10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}` or name: `'European Unions Horizon 2020 research and innovation program'` | series of `4-9` digits in `award` | Link to H2020 project |
| DOI: `{10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}` | series of `4-9` digits in `award` | Link to FP7 project |
| DOI: `10.13039/501100000781` OR name: `'European Union's'` | series of `4-9` digits in `award` | Link to FP7 or H2020 project |
| DOI: `10.13039/100000001` | `award` | Link to NSF project |
| DOI: `10.13039/501100001665` OR name: `{'The French National Research Agency (ANR)', 'The French National Research Agency'}` | `award` | Link to ANR project |
| DOI: `10.13039/501100002341` | `award` | Link to Academy of Finland project |
| DOI: `10.13039/501100001602` | `award`, removing the initial 'SFI' if present | Link to SFI project |
| DOI: `10.13039/501100000923` | `award` | Link to ARC project |
| DOI: `10.13039/501100000038` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (`unidentified` project) |
| DOI: `10.13039/501100000155` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (`unidentified` project) |
| DOI: `10.13039/501100000024` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (`unidentified` project) |
| DOI: `10.13039/501100002848` OR name :`'CONICYT, Programa de Formación de Capital Humano Avanzado'` | `award` | Link to CONICYT project |
| DOI: `10.13039/501100003448` | series of `4-9` digits in award | Link to GSRT project |
| DOI: `10.13039/501100010198` | `award` | Link to SGOV project |
| DOI: `10.13039/501100004564` | series of `4-9` digits in award | Link to MESTD project |
| DOI: `10.13039/501100003407` | `award` | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (`unidentified`<br/> project) is also generated |
| DOI: `{10.13039/501100006588, 10.13039/501100004488}` | `award`, removing `'Project No'` and `'HRZZ'` prefix, if present | Link to HRZZ or MZOS project |
| DOI: `10.13039/501100006769` | `award` | Link to Russian Science Foundation project |
| DOI: `10.13039/501100001711` | `award` after `'_'` and before `'/'` | Link to SNSF project |
| DOI: `10.13039/501100004410` | `award` | Link to TUBITAK project |
| DOI: `10.10.13039/100004440` or name: `Wellcome Trust Masters Fellowship` | `award` | Link to Wellcome Trust specific project and to the `unidentified` project. |
### Intersect Crossref with UnpayWall by DOI
The fields we consider from UnpayWall are:
* `is_oa`
* `best_oa_location`
* `oa_status`
The records of Crossref that intersect by DOI with UnpayWall records are enriched with one additional `instance` with the following properties:
| OpenAIRE Research Product field path | Unpaywall field path | Notes |
|----------------------------------------|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `instance` | | created only if `is_oa` and a `best_oa_location` is available |
| `instance.accessright` | | default value `Open Access`: we do not add instances if UnpayWall says there is no open version |
| `instance.accessright.code` | | Open Access code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.label` | | Always `OPEN` |
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
| `instance.accessright.openAccessRoute` | `oa_status` | |
| `instance.url` | `best_oa_location` | |
| `instance.license` | `best_oa_location.license` | |
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
| `instance.pid.scheme` | | Default value: `doi` |
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
For the definition of UnpayWall's `oa_status` refer to the [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-)
The record will also feature a relation to the UnpayWall data source: `name="UnpayWall"`, `id=openaire____::8ac8380272269217cb09a928c8caa993`.

View File

@ -0,0 +1,14 @@
# Microsoft Academic Graph
## Data acquisition
## Process
When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:
* `PaperAbstractsInvertedIndex`: used to map the paper abstracts
* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId and then used to map the MAG author identifier.
* `Affiliations` and `PaperAuthorAffiliations`: used to generate links between publications and organisations (paper affiliations)
* `Journals` and `ConferenceInstances`: joined with `Papers_distinct` is used to map the information about the venues where the paper was published
* TO BE REMOVED `PaperUrls`: to create one instance for the OpenAIRE publication
* TO BE REMOVED `FieldsOfStudy`: to add subjects

View File

@ -0,0 +1,10 @@
# Open Researcher and Contributor ID (ORCID)
## Data acquisition
## Process
In the following we describe the process applied to the ORCID contents.
### ...

View File

@ -138,7 +138,9 @@ const sidebars = {
label: "Non-compatible sources",
link: { type: 'generated-index' },
items: [
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/doiboost', label: 'DOIBoost' },
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/crossref_unpaywall', label: 'Crossref & Unpaywall' },
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/mag', label: 'Microsoft Academic Graph' },
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/orcid', label: 'ORCID' },
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/pubmed' },
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/datacite' },
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },