updated changelog, doiboost dismission #79
|
@ -19,6 +19,23 @@ This section documents all notable changes for each graph version.
|
|||
|
||||
---
|
||||
|
||||
### v8.0.0
|
||||
_Start Date: 2024-05-15 • Release Date: 2024-06-20 • Dataset release: **no**_
|
||||
|
||||
#### Added
|
||||
|
||||
- Introduced new Field of Science classifications for publications, reaching a total of ~77.2Mi publications classified
|
||||
- Increased the number of affiliations by 20% (from 162Mi to 195Mi)
|
||||
- Increased the number of scientific products with ORCID-identified authors by 10% (from 3.09Mi to 3.39Mi)
|
||||
|
||||
#### Changed
|
||||
|
||||
- Revised deduplication configuration to better exploit resource types
|
||||
- The DOIBoost dataset was superseded by the direct aggregation of its datasources: Crossref, Unpaywall, Microsoft Academic Graph, ORCID
|
||||
- Relaxed Crossref publication inclusion criteria, now accepting records without author information, leading to a +15% increase (from 127Mi to 146Mi records). Included contents until April 2023
|
||||
- Updated ORCID contents until April 2024
|
||||
- Updated Datacite contents until April 2024
|
||||
|
||||
### v7.1.3
|
||||
_Start Date: 2024-04-10 • Release Date: 2024-04-22 • Dataset release: **no**_
|
||||
|
||||
|
|
|
@ -0,0 +1,165 @@
|
|||
# Crossref & Unpaywall
|
||||
|
||||
This section describes the procedure used to integrate the contents from [Crossref](https://www.crossref.org) and [Unpaywall](https://unpaywall.org) in the OpenAIRE Graph.
|
||||
|
||||
## Data acquisition
|
||||
|
||||
The dataset containing all the Crossref records is obtained via a complete data dump on a monthly basis.
|
||||
The Unpaywall dataset is no longer updated, but its latest snapshot (Dec 2021) is used to enrich the Crossref contents.
|
||||
|
||||
## Process
|
||||
|
||||
In the following we describe the process applied to the Crossref & the Unpaywall contents.
|
||||
|
||||
### Crossref filtering
|
||||
|
||||
Crossref records are excluded if they:
|
||||
|
||||
* have a blank title, examples:
|
||||
* `10.1093/rheumatology/41.7.837`
|
||||
* `10.1093/qjmed/95.7.430`
|
||||
* `10.1371/journal.pone.0171434.g005`
|
||||
* have one of the following publishers: `"Test accounts"`, `"CrossRef Test Account"`
|
||||
* Examples from https://api.crossref.org/works?query.publisher-name=%22Test%20accounts%22
|
||||
* `10.1007/bf00344543`
|
||||
* `10.1007/bf00186154`
|
||||
* `10.1306/64ed947a-1724-11d7-8645000102c1865d`
|
||||
* have authors matching the following invalid names: `",", "none none", "none, none", "none &na;", "(:null)", "test test test", "test test", "test", "&na; &na"`
|
||||
* Examples for `"none"` author from https://api.crossref.org/works?query.author=%22none%22
|
||||
* `10.4007/annals.2016.184.3.11`
|
||||
* `10.4007/annals.2012.176.1.6`
|
||||
* `10.2172/6393585`
|
||||
* Examples for `"test"` author from https://api.crossref.org/works?query.author=%22test%22
|
||||
* `10.5116/ijme.54ca.a5ae`
|
||||
* `10.5755/j01.ss.71.2.544`
|
||||
* `10.5755/j01.ee.22.2.319`
|
||||
* have `"Addie Jackson"` as author and `"Elsevier BV"` as publisher (identified empirically as test records)
|
||||
* Examples from https://api.crossref.org/works?query.author=Addie+Jackson&query.publisher-name=%22Elsevier%20BV%22
|
||||
* `10.2139/ssrn.2082156`
|
||||
* `10.2139/ssrn.2202300`
|
||||
* `10.2139/ssrn.2255657`
|
||||
* do not have one of the following values in the field `type`: `"book-section"`, `"book"`, `"book-chapter"`, `"book-part"`, `"book-series"`, `"book-set"`, `"book-track"`, `"edited-book"`, `"reference-book"`, `"monograph"`, `"journal-article"`, `"dissertation"`, `"other"`, `"peer-review"`, `"proceedings"`, `"proceedings-article"`, `"reference-entry"`, `"report"`, `"report-series"`, `"standard"`, `"standard-series"`, `"posted-content"`, `"dataset"`
|
||||
* Example:
|
||||
* `10.1371/journal.pone.0171434.g005`
|
||||
* `10.7554/elife.21052.049`
|
||||
* `10.1371/journal.pcbi.1005379.s006`
|
||||
|
||||
Records with `type=dataset` are mapped into OpenAIRE research products of type dataset. All others are mapped as OpenAIRE research products of type publication.
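The filtering criteria above can be sketched as a single predicate. This is a minimal illustration, not the code used in the OpenAIRE pipeline: the function names `keep_record` and `product_type` are hypothetical, and the sketch assumes each record is a dictionary shaped like a Crossref REST API work (`title` as a list of strings, `author` as a list of objects with `given` and `family`).

```python
# Minimal sketch of the Crossref filtering rules (hypothetical helper names).
INVALID_AUTHOR_NAMES = {
    ",", "none none", "none, none", "none &na;", "(:null)",
    "test test test", "test test", "test", "&na; &na",
}
TEST_PUBLISHERS = {"Test accounts", "CrossRef Test Account"}
ACCEPTED_TYPES = {
    "book-section", "book", "book-chapter", "book-part", "book-series",
    "book-set", "book-track", "edited-book", "reference-book", "monograph",
    "journal-article", "dissertation", "other", "peer-review", "proceedings",
    "proceedings-article", "reference-entry", "report", "report-series",
    "standard", "standard-series", "posted-content", "dataset",
}


def author_fullname(author: dict) -> str:
    """Join given and family name, skipping missing parts."""
    parts = (author.get("given"), author.get("family"))
    return " ".join(p for p in parts if p).strip()


def keep_record(record: dict) -> bool:
    """Return True if the record passes all the filtering criteria."""
    titles = record.get("title") or []
    if not any(t and t.strip() for t in titles):
        return False  # blank title
    publisher = record.get("publisher") or ""
    if publisher in TEST_PUBLISHERS:
        return False  # test publisher
    names = [author_fullname(a).lower() for a in record.get("author") or []]
    if any(n in INVALID_AUTHOR_NAMES for n in names):
        return False  # placeholder author name
    if publisher == "Elsevier BV" and "addie jackson" in names:
        return False  # empirically identified test records
    return record.get("type") in ACCEPTED_TYPES


def product_type(record: dict) -> str:
    """Map the Crossref type to the OpenAIRE research product type."""
    return "dataset" if record.get("type") == "dataset" else "publication"
```

Records without any author pass the author check, which matches the relaxed inclusion criteria noted in the v8.0.0 changelog.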
|
||||
|
||||
### Mapping Crossref properties into the OpenAIRE Graph
|
||||
|
||||
Properties in OpenAIRE research products are set based on the logic described in the following table:
|
||||
|
||||
| OpenAIRE Research Product field path | Crossref path(s) | Notes |
|
||||
|----------------------------------------|--------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| `id` | `doi` | id in the form `doi_________::md5(doi)` |
|
||||
| `dateofcollection` | `indexed.datetime` | |
|
||||
| `lastupdatetimestamp` | `indexed.timestamp` | |
|
||||
| `type` | `type` | `dataset` if the Crossref type is dataset, `publication` otherwise (based on the filtering logic described above) |
|
||||
| `originalId` | `doi, clinical-trial-number, alternative-id` | |
|
||||
| `pid` | | The scheme tells the type of PID, the value contains the actual value |
|
||||
| `pid.scheme` | | Default value: doi |
|
||||
| `pid.value` | `doi` | The doi is normalised and lower-cased |
|
||||
| `maintitle` | `title` | |
|
||||
| `subtitle` | `subtitle` | |
|
||||
| `author` | `author` | if available the sequence is mapped to rank and the ORCID is also mapped |
|
||||
| `author.name` | `author.given` | |
|
||||
| `author.surname` | `author.family` | |
|
||||
| `author.fullname` | `author.given author.family` | |
|
||||
| `author.rank` | | based on the order, starts from 1 |
|
||||
| `author.pid` | | only if the ORCID is available |
|
||||
| `author.pid.id.scheme` | | Default `'pending_orcid'` (meaning that it is not an id confirmed by ORCID) |
|
||||
| `author.pid.id.value` | `author.ORCID` | |
|
||||
| `author.pid.provenance.provenance` | | Default 'Harvested' |
|
||||
| `author.pid.provenance.trust` | | Default '0.9' |
|
||||
| `description` | `abstract` | |
|
||||
| `subject` | `subject` | with `classid='keywords'`, i.e. no controlled vocabularies for Crossref subjects |
|
||||
| `publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
|
||||
| `publisher` | `publisher` | |
|
||||
| `source` | `source` | only if the record is not of type `book` |
|
||||
| `source` | concatenation of `container-title.head` + `"ISBN: "` + `ISBN.head` | only if the record is of type `book` |
|
||||
| `container` | | It is set only for publications with information about the journal it was published in. |
|
||||
| `container.name` | `container-title.head` | |
|
||||
| `container.issnOnline` | `issn-type.value` | if `issn-type.type='electronic'` |
|
||||
| `container.issnPrinted` | `issn-type.value` | if `issn-type.type='print'` |
|
||||
| `container.vol` | `volume` | |
|
||||
| `container.sp` | `page` | before `'-'` |
|
||||
| `container.ep` | `page` | after `'-'` |
|
||||
| `instance` | | One instance is created with the DOI URL |
|
||||
| `instance.accessright` | | Values in `instance.accessright.code` and `instance.accessright.label` are set based on license and dateofacceptance:<br/>- `UNKNOWN`: if the license is blank<br/>- `OPEN ACCESS`: if the license is a CC license or an ACS license or an APA license (considered OPEN also by Unpaywall, see [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44002063718-what-is-an-oa-license-) for details) or if OUP license, but only after 12 months from the publication date<br/>- `EMBARGO`: OUP license, before 12 months from the publication date<br/>- `CLOSED`: if there is a license not covered by the previous cases |
|
||||
| `instance.accessright.code` | | Code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
|
||||
| `instance.accessright.label` | | One of: `OPEN`, `RESTRICTED`, `CLOSED`, `EMBARGO` |
|
||||
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
|
||||
| `instance.accessright.openAccessRoute` | | only if `instance.accessright.value = 'OPEN ACCESS'`. Default is `hybrid`. The route is fixed in subsequent phases of DOIBoost, namely when intersecting with Unpaywall and patching the hostedby via DOAJ and the Gold-ISSN list. |
|
||||
| `instance.license` | `license.URL` | If there is a `license.content-version='vor'`, then this is used. Otherwise the first license entry is used. |
|
||||
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
|
||||
| `instance.pid.scheme` | | Default value: `doi` |
|
||||
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
|
||||
| `instance.publicationdate` | `issued.datetime` or, if not available, `created.datetime` | |
|
||||
| `instance.refereed` | | set to `peerReviewed` only if `relation.has-review.id` is not empty, `UNKNOWN` otherwise. |
|
||||
| `instance.type` | `subtype` | mapped using the [OpenAIRE vocabulary for research products typologies](https://api.openaire.eu/vocabularies/dnet:result_typologies) |
|
||||
| `instance.url` | `doi` | Full URL of the DOI |
|
||||
|
||||
All other fields of the JSON schema not mentioned in the table contain empty values.
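A few rows of the mapping table lend themselves to a short illustration: the identifier, the page split, and the author mapping. The sketch below is an assumption-laden example rather than the production mapper: the function names are hypothetical, and it assumes the MD5 is computed over the trimmed, lower-cased DOI (the table only states that the PID value is normalised and lower-cased).

```python
import hashlib
from typing import Optional, Tuple


def openaire_id(doi: str) -> str:
    """Build an id of the form doi_________::md5(doi).

    ASSUMPTION: the hash is taken over the trimmed, lower-cased DOI.
    """
    normalised = doi.strip().lower()
    digest = hashlib.md5(normalised.encode("utf-8")).hexdigest()
    return "doi_________::" + digest


def split_pages(page: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Return (container.sp, container.ep): the text before and after '-'."""
    if not page:
        return None, None
    start, sep, end = page.partition("-")
    return start.strip() or None, (end.strip() or None) if sep else None


def map_authors(authors: list) -> list:
    """Map Crossref authors; rank follows the list order and starts from 1."""
    mapped = []
    for rank, author in enumerate(authors, start=1):
        given, family = author.get("given"), author.get("family")
        entry = {
            "name": given,
            "surname": family,
            "fullname": " ".join(p for p in (given, family) if p),
            "rank": rank,
        }
        orcid = author.get("ORCID")
        if orcid:  # author.pid is set only if the ORCID is available
            entry["pid"] = {
                "id": {"scheme": "pending_orcid", "value": orcid},
                "provenance": {"provenance": "Harvested", "trust": "0.9"},
            }
        mapped.append(entry)
    return mapped
```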
|
||||
|
||||
All the records from Crossref are related to the datasource with `name=Crossref` and `id=openaire____::081b82f96300b6a6e3d282bad31cb6e2`.
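The license selection and access-right rules from the mapping table can be sketched as follows. The table does not say how a license is recognised as CC, ACS, APA or OUP, so the URL substrings below are purely hypothetical placeholders, as are the function names. Treating the exact 12-month anniversary as already open is also an assumption.

```python
from datetime import date
from typing import Optional

# ASSUMPTION: hypothetical substring heuristics; the real rules may differ.
OPEN_LICENSE_MARKERS = ("creativecommons.org", "pubs.acs.org", "apa.org")
OUP_MARKER = "academic.oup.com"


def pick_license(licenses: list) -> Optional[str]:
    """Prefer the license with content-version 'vor', else the first entry."""
    for entry in licenses:
        if entry.get("content-version") == "vor":
            return entry.get("URL")
    return licenses[0].get("URL") if licenses else None


def one_year_after(d: date) -> date:
    """Return the same day one year later (29 February falls back to the 28th)."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


def access_right(license_url: Optional[str], publication_date: date, today: date) -> str:
    """Derive the access right from the license and the publication date."""
    if not license_url or not license_url.strip():
        return "UNKNOWN"
    url = license_url.lower()
    if any(marker in url for marker in OPEN_LICENSE_MARKERS):
        return "OPEN ACCESS"
    if OUP_MARKER in url:
        # OUP: embargoed for the first 12 months, open afterwards.
        if today >= one_year_after(publication_date):
            return "OPEN ACCESS"
        return "EMBARGO"
    return "CLOSED"
```

Passing `today` explicitly keeps the function deterministic, which makes the embargo boundary easy to test.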
|
||||
|
||||
Possible improvements:
|
||||
* map `clinical-trial-number` and `alternative-id` in `alternateIdentifiers`?
|
||||
* Verify if Crossref has a property for `language`, `country`, `container.issnLinking`, `container.iss`, `container.edition`, `container.conferenceplace` and `container.conferencedate`
|
||||
* Different approach to set the `refereed` field and improve its coverage?
|
||||
|
||||
### Map Crossref links to projects/funders
|
||||
|
||||
Links to funding available in Crossref are mapped as funding relationships (`ResearchProduct -- isProducedBy -- Project`) applying the following mapping:
|
||||
|
||||
| Funder | Grant code | Link to |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| DOI: `{10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}` or name: `'European Union’s Horizon 2020 research and innovation program'` | series of `4-9` digits in `award` | Link to H2020 project |
|
||||
| DOI: `{10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}` | series of `4-9` digits in `award` | Link to FP7 project |
|
||||
| DOI: `10.13039/501100000781` OR name: `'European Union's'` | series of `4-9` digits in `award` | Link to FP7 or H2020 project |
|
||||
| DOI: `10.13039/100000001` | `award` | Link to NSF project |
|
||||
| DOI: `10.13039/501100001665` OR name: `{'The French National Research Agency (ANR)', 'The French National Research Agency'}` | `award` | Link to ANR project |
|
||||
| DOI: `10.13039/501100002341` | `award` | Link to Academy of Finland project |
|
||||
| DOI: `10.13039/501100001602` | `award`, removing the initial 'SFI' if present | Link to SFI project |
|
||||
| DOI: `10.13039/501100000923` | `award` | Link to ARC project |
|
||||
| DOI: `10.13039/501100000038` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to NSERC (`unidentified` project) |
|
||||
| DOI: `10.13039/501100000155` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to SSHRC (`unidentified` project) |
|
||||
| DOI: `10.13039/501100000024` | `award` ignore: we cannot map the project codes in Crossref to project codes in OpenAIRE | Link to CIHR (`unidentified` project) |
|
||||
| DOI: `10.13039/501100002848` OR name: `'CONICYT, Programa de Formación de Capital Humano Avanzado'` | `award` | Link to CONICYT project |
|
||||
| DOI: `10.13039/501100003448` | series of `4-9` digits in `award` | Link to GSRT project |
|
||||
| DOI: `10.13039/501100010198` | `award` | Link to SGOV project |
|
||||
| DOI: `10.13039/501100004564` | series of `4-9` digits in `award` | Link to MESTD project |
|
||||
| DOI: `10.13039/501100003407` | `award` | Link to MIUR project. Since OpenAIRE has a small subset of MIUR projects, a link to the MIUR funder (`unidentified`<br/> project) is also generated |
|
||||
| DOI: `{10.13039/501100006588, 10.13039/501100004488}` | `award`, removing `'Project No'` and `'HRZZ'` prefix, if present | Link to HRZZ or MZOS project |
|
||||
| DOI: `10.13039/501100006769` | `award` | Link to Russian Science Foundation project |
|
||||
| DOI: `10.13039/501100001711` | `award` after `'_'` and before `'/'` | Link to SNSF project |
|
||||
| DOI: `10.13039/501100004410` | `award` | Link to TUBITAK project |
|
||||
| DOI: `10.13039/100004440` or name: `'Wellcome Trust Masters Fellowship'` | `award` | Link to Wellcome Trust specific project and to the `unidentified` project. |
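The grant-code rules in the table above can be illustrated with three small extractors. This is a sketch under stated assumptions: the function names are hypothetical, "series of 4-9 digits" is read as the first run of 4 to 9 digits that is not part of a longer digit run, and a delimiter is applied only when it is actually present in the `award` value.

```python
import re
from typing import Optional

# ASSUMPTION: a run of 4-9 digits not embedded in a longer run of digits.
DIGITS_4_9 = re.compile(r"(?<!\d)\d{4,9}(?!\d)")


def digits_code(award: str) -> Optional[str]:
    """First series of 4-9 digits in the award (e.g. EC, GSRT, MESTD rows)."""
    match = DIGITS_4_9.search(award)
    return match.group(0) if match else None


def sfi_code(award: str) -> str:
    """SFI rule: remove the initial 'SFI' if present."""
    award = award.strip()
    if award.startswith("SFI"):
        award = award[len("SFI"):]
    return award.strip()


def snsf_code(award: str) -> str:
    """SNSF rule: keep the part after '_' and before '/'."""
    award = award.strip()
    if "_" in award:
        award = award.split("_", 1)[1]
    return award.split("/", 1)[0]
```

The award strings below are made-up inputs for illustration only:

```python
digits_code("Grant agreement No 654024")  # "654024"
sfi_code("SFI 12/RC/2289")                # "12/RC/2289"
snsf_code("200021_156011/1")              # "156011"
```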
|
||||
|
||||
### Intersect Crossref with Unpaywall by DOI
|
||||
|
||||
The fields we consider from Unpaywall are:
|
||||
* `is_oa`
|
||||
* `best_oa_location`
|
||||
* `oa_status`
|
||||
|
||||
The Crossref records that intersect by DOI with Unpaywall records are enriched with one additional `instance` with the following properties:
|
||||
|
||||
| OpenAIRE Research Product field path | Unpaywall field path | Notes |
|
||||
|----------------------------------------|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| `instance` | | created only if `is_oa` and a `best_oa_location` is available |
|
||||
| `instance.accessright` | | default value `Open Access`: we do not add instances if Unpaywall says there is no open version |
|
||||
| `instance.accessright.code` | | Open Access code from the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
|
||||
| `instance.accessright.label` | | Always `OPEN` |
|
||||
| `instance.accessright.scheme` | | Scheme that defines the code and label, i.e. the URL to the [COAR vocabulary for access right](http://vocabularies.coar-repositories.org/documentation/access_rights/) |
|
||||
| `instance.accessright.openAccessRoute` | `oa_status` | |
|
||||
| `instance.url` | `best_oa_location` | |
|
||||
| `instance.license` | `best_oa_location.license` | |
|
||||
| `instance.pid` | | The scheme tells the type of PID, the value contains the actual value |
|
||||
| `instance.pid.scheme` | | Default value: `doi` |
|
||||
| `instance.pid.value` | `doi` | The doi is normalised and lower-cased |
|
||||
|
||||
For the definition of Unpaywall's `oa_status` refer to the [Unpaywall FAQ](https://support.unpaywall.org/support/solutions/articles/44001777288-what-do-the-types-of-oa-status-green-gold-hybrid-and-bronze-mean-).
|
||||
|
||||
The record will also feature a relation to the Unpaywall data source: `name="UnpayWall"`, `id=openaire____::8ac8380272269217cb09a928c8caa993`.
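The enrichment step described in this section can be sketched as a lookup by DOI. The function names and the in-memory dictionary are hypothetical simplifications, and the sketch assumes `best_oa_location` is an object exposing `url` and `license`. The COAR `code` and `scheme` values are left out for brevity.

```python
from typing import Optional


def unpaywall_instance(doi: str, upw: dict) -> Optional[dict]:
    """Build the extra instance, or None if Unpaywall reports no open version."""
    location = upw.get("best_oa_location")
    if not upw.get("is_oa") or not location:
        return None
    return {
        "accessright": {
            "label": "OPEN",
            "openAccessRoute": upw.get("oa_status"),
        },
        # ASSUMPTION: best_oa_location carries 'url' and 'license' keys.
        "url": location.get("url"),
        "license": location.get("license"),
        "pid": {"scheme": "doi", "value": doi.strip().lower()},
    }


def enrich(products: list, unpaywall_by_doi: dict) -> list:
    """Append one Unpaywall instance to every product with a matching DOI."""
    for product in products:
        doi = product["pid"]["value"].strip().lower()
        upw = unpaywall_by_doi.get(doi)
        instance = unpaywall_instance(doi, upw) if upw else None
        if instance:
            product.setdefault("instance", []).append(instance)
    return products
```

At the scale of the full Crossref dump this lookup would be a distributed join rather than a dictionary access; the dictionary only keeps the example self-contained.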
|
|
@ -0,0 +1,14 @@
|
|||
# Microsoft Academic Graph
|
||||
|
||||
## Data acquisition
|
||||
|
||||
## Process
|
||||
|
||||
When mapping MAG records to the OpenAIRE Graph, we consider the following MAG tables:
|
||||
* `PaperAbstractsInvertedIndex`: used to map the paper abstracts
|
||||
* `Authors`: for the authors. The MAG data is pre-processed by grouping authors by PaperId and then used to map the MAG author identifier.
|
||||
* `Affiliations` and `PaperAuthorAffiliations`: used to generate links between publications and organisations (paper affiliations)
|
||||
* `Journals` and `ConferenceInstances`: joined with `Papers_distinct`, these are used to map the information about the venues where the paper was published
|
||||
* TO BE REMOVED `PaperUrls`: to create one instance for the OpenAIRE publication
|
||||
* TO BE REMOVED `FieldsOfStudy`: to add subjects
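As an illustration of the abstract mapping, the sketch below rebuilds plain text from an inverted index. It assumes the `PaperAbstractsInvertedIndex` value is a JSON string with an `IndexLength` count and an `InvertedIndex` object mapping each word to its positions; the function name `rebuild_abstract` is hypothetical.

```python
import json


def rebuild_abstract(indexed_abstract: str) -> str:
    """Turn a word -> positions inverted index back into running text."""
    data = json.loads(indexed_abstract)
    words = [""] * data["IndexLength"]
    for word, positions in data["InvertedIndex"].items():
        for position in positions:
            if 0 <= position < len(words):  # ignore out-of-range positions
                words[position] = word
    # Skip slots that no word claimed.
    return " ".join(w for w in words if w)
```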
|
||||
|
|
@ -0,0 +1,10 @@
|
|||
# Open Researcher and Contributor ID (ORCID)
|
||||
|
||||
## Data acquisition
|
||||
|
||||
## Process
|
||||
|
||||
In the following we describe the process applied to the ORCID contents.
|
||||
|
||||
### ...
|
||||
|
|
@ -138,7 +138,9 @@ const sidebars = {
|
|||
label: "Non-compatible sources",
|
||||
link: { type: 'generated-index' },
|
||||
items: [
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/doiboost', label: 'DOIBoost' },
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/crossref_unpaywall', label: 'Crossref & Unpaywall' },
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/mag', label: 'Microsoft Academic Graph' },
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/orcid', label: 'ORCID' },
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/pubmed' },
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/datacite' },
|
||||
{ type: 'doc', id: 'graph-production-workflow/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },
|
||||
|
|