forked from D-Net/openaire-graph-docs
addressing comments from the code review
parent
e9296f1a40
commit
12263fca62
|
@ -13,11 +13,11 @@ Such a policy defines a list of data sources that are considered authoritative f
|
|||
* OpenAIRE IDs depend on persistent IDs when they are provided by the authority responsible for creating them;
|
||||
* PIDs are included in the graph according to a strict criterion: the PID types declared in the table below are mapped as PIDs only when they are collected from the respective PID authority data source.
|
||||
|
||||
| *PID Type* | *Authority* |
|
||||
|------------|-----------------------------------------------------------------------------------------------------|
|
||||
| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
|
||||
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
|
||||
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
|
||||
| PID Type | Authority |
|
||||
|-----------|-----------------------------------------------------------------------------------------------------|
|
||||
| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
|
||||
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
|
||||
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
|
||||
|
||||
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
|
||||
In all other cases, PIDs are included in the graph as alternate identifiers.
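
The rule above can be sketched as a small lookup. This is only an illustration of the policy as described here, not the actual OpenAIRE implementation: the function name, the return values, and the use of plain data source names as keys are assumptions.

```python
# Authorities per PID type, copied from the table above.
PID_AUTHORITIES = {
    "doi": {"Crossref", "Datacite"},
    "pmc": {"Europe PubMed Central", "PubMed Central"},
    "pmid": {"Europe PubMed Central", "PubMed Central"},
    "arxiv": {"arXiv.org e-Print Archive"},
}


def classify_identifier(pid_type: str, collected_from: str) -> str:
    """Return "pid" or "alternateIdentifier" (hypothetical labels)."""
    pid_type = pid_type.lower()
    # Handles bypass the PID authority filtering rule.
    if pid_type == "handle":
        return "pid"
    # Listed PID types count as PIDs only when collected from an authority.
    if collected_from in PID_AUTHORITIES.get(pid_type, set()):
        return "pid"
    # In all other cases the identifier is kept as an alternate identifier.
    return "alternateIdentifier"
```

The sketch does not model the delegated authorities listed later on this page.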
|
||||
|
@ -31,10 +31,10 @@ assigns PIDs to their scientific products from a given PID minter.
|
|||
|
||||
This "selection" can be performed when the entities in the graph sharing the same identifier are grouped together. The list of the delegated authorities currently includes
|
||||
|
||||
| *Datasource delegated* | *Datasource delegating* | *Pid Type* |
|
||||
|--------------------------------------|----------------------------------|------------|
|
||||
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
|
||||
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
|
||||
| Datasource delegated | Datasource delegating | Pid Type |
|
||||
|--------------------------------------|----------------------------------|-----------|
|
||||
| [Zenodo](https://zenodo.org) | [Datacite](https://datacite.org) | doi |
|
||||
| [RoHub](https://reliance.rohub.org/) | [W3ID](https://w3id.org/) | w3id |
|
||||
|
||||
|
||||
## Identifiers in the Graph
|
||||
|
|
|
@ -10,14 +10,14 @@ OpenAIRE materializes an open, participatory research graph (the OpenAIRE Resear
|
|||
|
||||
OpenAIRE aggregates metadata records describing objects of the research life-cycle from content providers compliant to the [OpenAIRE guidelines](https://guidelines.openaire.eu/) and from entity registries (i.e. data sources offering authoritative lists of entities, like [OpenDOAR](https://v2.sherpa.ac.uk/opendoar/), [re3data](https://www.re3data.org/), [DOAJ](https://doaj.org/), and various funder databases). After collection, metadata are transformed according to the OpenAIRE internal metadata model, which is used to generate the final OpenAIRE Research Graph, accessible from the [OpenAIRE EXPLORE portal](https://explore.openaire.eu) and the [APIs](https://graph.openaire.eu/develop/).
|
||||
|
||||
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at http://api.openaire.eu/vocabularies. Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
|
||||
The transformation process includes the application of cleaning functions whose goal is to ensure that values are harmonised according to a common format (e.g. dates as YYYY-MM-dd) and, whenever applicable, to a common controlled vocabulary. The controlled vocabularies used for cleansing are accessible at [api.openaire.eu/vocabularies](https://api.openaire.eu/vocabularies/). Each vocabulary features a set of controlled terms, each with one code, one label, and a set of synonyms. If a synonym is found as field value, the value is updated with the corresponding term.
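
As a rough illustration of the synonym lookup, the sketch below models a vocabulary as a list of terms with a code, a label, and synonyms. The class and function names are hypothetical, and the case-insensitive, whitespace-trimmed matching is an assumption; the text does not say how values are compared.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Term:
    code: str
    label: str
    synonyms: FrozenSet[str] = frozenset()


def find_term(value: str, vocabulary: List[Term]) -> Optional[Term]:
    """Return the controlled term whose synonyms contain the value, if any."""
    needle = value.strip().lower()  # assumption: lenient comparison
    for term in vocabulary:
        if any(needle == synonym.strip().lower() for synonym in term.synonyms):
            return term
    return None
```

A caller would replace the field value with the matched term and leave the value unchanged when `find_term` returns `None`.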
|
||||
Also, the OpenAIRE Research Graph is extended with other relevant scholarly communication sources that do not follow the OpenAIRE Guidelines and/or are too large to be integrated via the “normal” aggregation mechanism: DOIBoost (which merges Crossref, ORCID, Microsoft Academic Graph, and Unpaywall).
|
||||
|
||||
<p align="center">
|
||||
<img loading="lazy" alt="Aggregation" src="/img/docs/aggregation.png" width="65%" className="img_node_modules-@docusaurus-theme-classic-lib-theme-MDXComponents-Img-styles-module"/>
|
||||
</p>
|
||||
|
||||
The OpenAIRE aggregation system collects information about objects of the research life-cycle compliant to the [OpenAIRE acquisition policy](https://www.openaire.eu/content-aquisition-policy1) from [different types of data sources](https://explore.openaire.eu/search/find/dataproviders):
|
||||
The OpenAIRE aggregation system collects information about objects of the research life-cycle compliant to the [OpenAIRE acquisition policy](https://www.openaire.eu/content-acquisition-policy) from [different types of data sources](https://explore.openaire.eu/search/find/dataproviders):
|
||||
|
||||
1. Scientific literature metadata and full-texts from institutional and thematic repositories, CRIS (Common Research Information Systems), Open Access journals and publishers;
|
||||
2. Dataset metadata from data repositories and data journals;
|
||||
|
|
|
@ -4,10 +4,6 @@ DOIBoost is a dataset that combines research outputs and links among them from a
|
|||
It enriches the records available in Crossref with what is available in Unpaywall, Microsoft Academic Graph, and ORCID, intersecting all those datasets by DOI.
|
||||
As a consequence, DOIBoost does not contain any record from MAG, Unpaywall, or ORCID that doesn't provide a DOI available in Crossref.
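
The intersection by DOI behaves like a left join driven by Crossref. The following minimal sketch assumes records are plain dictionaries with a `doi` key and that DOIs are compared in lower case; the function name and the `enrichments` key are hypothetical and do not reflect the real DOIBoost code.

```python
def normalize_doi(doi: str) -> str:
    # DOIs are case-insensitive, so compare them in lower case.
    return doi.strip().lower()


def enrich_by_doi(crossref, others):
    """Attach to each Crossref record the records of other sources sharing its DOI.

    crossref: list of dicts with a "doi" key.
    others: dict mapping a source name to a list of dicts with a "doi" key.
    """
    indexes = {
        name: {normalize_doi(r["doi"]): r for r in records if r.get("doi")}
        for name, records in others.items()
    }
    enriched = []
    for record in crossref:
        doi = normalize_doi(record["doi"])
        out = dict(record)
        out["enrichments"] = {
            name: index[doi] for name, index in indexes.items() if doi in index
        }
        enriched.append(out)
    # Records of other sources whose DOI is not in Crossref never appear here.
    return enriched
```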
|
||||
|
||||
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
|
||||
|
||||
* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: [10.5281/zenodo.1441071](https://doi.org/10.5281/zenodo.1441071)
|
||||
|
||||
Each Crossref record is enriched with:
|
||||
* ORCID identifiers of authors from ORCID
|
||||
* Open Access instance (with OA color/route and license) from Unpaywall
|
||||
|
@ -29,7 +25,11 @@ The Open Access status is also set by intersecting the journal information of a
|
|||
|
||||
The construction of the DOIBoost dataset consists of the following phases:
|
||||
|
||||
## 1. Crossref filtering
|
||||
## Process
|
||||
|
||||
The following section describes the processing steps needed to build DOIBoost starting from the input data.
|
||||
|
||||
### Crossref filtering
|
||||
|
||||
Records in Crossref are ruled out according to the following criteria
|
||||
|
||||
|
@ -68,7 +68,7 @@ Records in Crossref are ruled out according to the following criteria
|
|||
|
||||
Records with `type=dataset` are mapped into OpenAIRE results of type dataset. All others are mapped as OpenAIRE results of type publication.
|
||||
|
||||
## 2. Mapping Crossref properties into the OpenAIRE Research Graph
|
||||
### Mapping Crossref properties into the OpenAIRE Research Graph
|
||||
|
||||
Properties in OpenAIRE results are set based on the logic described in the following table:
|
||||
|
||||
|
@ -133,9 +133,9 @@ Possible improvements:
|
|||
|
||||
### Map Crossref links to projects/funders
|
||||
|
||||
Links to funding available in Crossref are mapped as funding relationships (`result -- isProducedBy --> project`) applying the following mapping:
|
||||
Links to funding available in Crossref are mapped as funding relationships (`result -- isProducedBy -- project`) applying the following mapping:
|
||||
|
||||
| *funder* | *grant code* | *Link to* |
|
||||
| Funder | Grant code | Link to |
|
||||
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| DOI: `{10.13039/100010663, 10.13039/100010661, 10.13039/501100007601, 10.13039/501100000780, 10.13039/100010665}` or name: `'European Union’s Horizon 2020 research and innovation program'` | series of `4-9` digits in `award` | Link to H2020 project |
|
||||
| DOI: `{10.13039/100011199, 10.13039/100004431, 10.13039/501100004963, 10.13039/501100000780}` | series of `4-9` digits in `award` | Link to FP7 project |
|
||||
|
@ -159,7 +159,7 @@ Links to funding available in Crossref are mapped as funding relationships (`res
|
|||
| DOI: `10.13039/501100004410` | `award` | Link to TUBITAK project |
|
||||
| DOI: `10.13039/100004440` or name: `Wellcome Trust Masters Fellowship` | `award` | Link to Wellcome Trust specific project and to the `unidentified` project. |
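
For the rows that look for a "series of `4-9` digits in `award`", the extraction could be sketched with a regular expression as below. The funder DOIs are copied from the H2020 row of the table; the function name and the exact regular expression are assumptions. Matching by funder name and the handling of DOIs shared between the H2020 and FP7 rows are left out.

```python
import re

# A run of 4 to 9 digits that is not part of a longer run of digits.
GRANT_CODE = re.compile(r"(?<!\d)\d{4,9}(?!\d)")

# Funder DOIs from the H2020 row of the table above.
H2020_FUNDER_DOIS = {
    "10.13039/100010663",
    "10.13039/100010661",
    "10.13039/501100007601",
    "10.13039/501100000780",
    "10.13039/100010665",
}


def h2020_grant_codes(funder_doi, awards):
    """Return candidate H2020 grant codes found in the award strings."""
    if funder_doi not in H2020_FUNDER_DOIS:
        return []
    codes = []
    for award in awards:
        codes.extend(GRANT_CODE.findall(award))
    return codes
```

Each returned code is only a candidate: it still has to be matched against a known project before a `result -- isProducedBy -- project` relation is created.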
|
||||
|
||||
## 3. Intersect Crossref with UnpayWall by DOI
|
||||
### Intersect Crossref with UnpayWall by DOI
|
||||
|
||||
The fields we consider from UnpayWall are:
|
||||
* `is_oa`
|
||||
|
@ -168,7 +168,7 @@ The fields we consider from UnpayWall are:
|
|||
|
||||
The results of Crossref that intersect by DOI with UnpayWall records are enriched with one additional `instance` with the following properties:
|
||||
|
||||
| *OpenAIRE Result field path* | *Unpaywall field path* | *Notes* |
|
||||
| OpenAIRE Result field path | Unpaywall field path | Notes |
|
||||
|----------------------------------------|----------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| `instance` | | created only if `is_oa` and a `best_oa_location` is available |
|
||||
| `instance.accessright` | | default value `Open Access`: we do not add instances if UnpayWall says there is no open version |
|
||||
|
@ -186,23 +186,23 @@ For the definition of UnpayWall's `oa_status` refer to the [Unpaywall FAQ](https
|
|||
|
||||
The record will also feature a relation to the UnpayWall data source: `name="UnpayWall"`, `id=openaire____::8ac8380272269217cb09a928c8caa993`.
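
A minimal sketch of the condition under which the additional instance is created is shown below. Only `is_oa`, `best_oa_location`, the default access right, and the data source name and identifier come from this page; the function name and the shape of the returned dictionary are hypothetical.

```python
UNPAYWALL_DATASOURCE = {
    "name": "UnpayWall",
    "id": "openaire____::8ac8380272269217cb09a928c8caa993",
}


def unpaywall_instance(unpaywall_record):
    """Build the extra instance, or return None when no open version is known."""
    # The instance is created only if `is_oa` and a `best_oa_location` is available.
    if not unpaywall_record.get("is_oa"):
        return None
    if not unpaywall_record.get("best_oa_location"):
        return None
    return {
        "accessright": "Open Access",        # default value
        "datasource": UNPAYWALL_DATASOURCE,  # hypothetical key name
    }
```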
|
||||
|
||||
## 4. Intersect with ORCID
|
||||
### Intersect with ORCID
|
||||
|
||||
The fields we consider from ORCID are:
|
||||
* `doi`
|
||||
* `authors`, a list of authors, each with optional `name`, `surname`, `creditName`, `oid`
|
||||
|
||||
| *OpenAIRE field path* | *ORCID path* | *Notes* |
|
||||
|-------------------------------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| `pid` | `doi` | |
|
||||
| `author.name` | `capitalize(name)` | only mapped if not blank |
|
||||
| `author.surname` | `capitalize(surname)` | only mapped if not blank |
|
||||
| `author.fullname` | | if name and surname are not blank, they are concatenated (`capitalize(name) capitalize(surname)`), otherwise we use the `creditName` |
|
||||
| `author.pid` | | only if the `ORCID` is available |
|
||||
| `author.pid.id.scheme` | | Default `orcid` (meaning that it is confirmed by ORCID, in contrast to the `orcid_pending` set from Crossref and Unpaywall) |
|
||||
| `author.pid.id.value` | `oid` | |
|
||||
| `author.pid.provenance.provenance` | | Default `Harvested` |
|
||||
| `author.pid.provenance.trust` | | Default `0.9` |
|
||||
| OpenAIRE field path | ORCID path | Notes |
|
||||
|------------------------------------|-----------------------|--------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| `pid` | `doi` | |
|
||||
| `author.name` | `capitalize(name)` | only mapped if not blank |
|
||||
| `author.surname` | `capitalize(surname)` | only mapped if not blank |
|
||||
| `author.fullname` | | if name and surname are not blank, they are concatenated (`capitalize(name) capitalize(surname)`), otherwise we use the `creditName` |
|
||||
| `author.pid` | | only if the `ORCID` is available |
|
||||
| `author.pid.id.scheme` | | Default `orcid` (meaning that it is confirmed by ORCID, in contrast to the `orcid_pending` set from Crossref and Unpaywall) |
|
||||
| `author.pid.id.value` | `oid` | |
|
||||
| `author.pid.provenance.provenance` | | Default `Harvested` |
|
||||
| `author.pid.provenance.trust` | | Default `0.9` |
|
||||
|
||||
The records are enriched with the ORCID identifiers of their authors.
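
The author mapping in the table can be read as the following sketch. The input keys (`name`, `surname`, `creditName`, `oid`) and the default values come from the table; the function name and the nested dictionary layout are assumptions, and Python's `str.capitalize` is only a stand-in for the `capitalize` function mentioned there, whose exact behaviour is not described.

```python
def map_orcid_author(author):
    """Map one ORCID author entry to an OpenAIRE-like author dictionary."""
    name = (author.get("name") or "").strip()
    surname = (author.get("surname") or "").strip()
    mapped = {}
    if name:  # only mapped if not blank
        mapped["name"] = name.capitalize()
    if surname:  # only mapped if not blank
        mapped["surname"] = surname.capitalize()
    if name and surname:
        mapped["fullname"] = f"{mapped['name']} {mapped['surname']}"
    elif author.get("creditName"):
        mapped["fullname"] = author["creditName"]
    if author.get("oid"):  # only if the ORCID is available
        mapped["pid"] = {
            "id": {"scheme": "orcid", "value": author["oid"]},
            "provenance": {"provenance": "Harvested", "trust": "0.9"},
        }
    return mapped
```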
|
||||
|
||||
|
@ -216,7 +216,7 @@ Miriam will modify the process to ensure that:
|
|||
* the list of authors from Crossref always "wins"
|
||||
* the identifiers from ORCID "win"
|
||||
|
||||
## 5. Intersect with Microsoft Academic Graph
|
||||
### Intersect with Microsoft Academic Graph
|
||||
|
||||
*Important Notes*
|
||||
* Only papers with DOI are considered
|
||||
|
@ -238,10 +238,16 @@ The records are enriched with:
|
|||
* conference or journal information (in the `journal` field) TODO: or `container`, in case of the dump?
|
||||
* [TO BE REMOVED] instances with URL from MAG
|
||||
|
||||
## 6. Enrich DOIBoost3 with hosting data sources (`hostedby`) and access right information
|
||||
### Enrich DOIBoost3 with hosting data sources (`hostedby`) and access right information
|
||||
|
||||
In this phase, we intersect DOIBoost3 with a dataset composed of journals from OpenAIRE, Crossref, and the ISSN gold list. Each journal comes with its International Standard Serial Numbers (`issn`, `eissn`, `lissn`) and, when available, a flag that tells if the journal is open access. The intersection is done on the basis of the International Standard Serial Numbers. The records with a `journal.[l|e]issn` that match are enriched as follows:
|
||||
* Each instance gains the `hostedby` information corresponding to the journal
|
||||
* If the journal is open access, the access rights of the instances are also set to `Open Access` with `gold` route (because by construction, the journals we know are open are from DOAJ or Gold ISSN list)
|
||||
|
||||
The `hostedby` of records that do not match is set to the `Unknown Repository`.
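
This phase could be sketched as a lookup of the record's ISSNs in a journal index. All key names below (`journal`, `instance`, `open_access`, `oa_route`, and so on) are hypothetical and chosen only for illustration; the behaviour follows the description above.

```python
UNKNOWN_REPOSITORY = "Unknown Repository"


def enrich_hostedby(record, journals_by_issn):
    """Set `hostedby` (and access rights, when known) on every instance.

    journals_by_issn: dict mapping any ISSN of a journal to a dict such as
    {"name": ..., "open_access": True or False}.
    """
    journal = record.get("journal") or {}
    match = None
    for key in ("issn", "eissn", "lissn"):
        issn = journal.get(key)
        if issn and issn in journals_by_issn:
            match = journals_by_issn[issn]
            break
    for instance in record.get("instance", []):
        if match is None:
            instance["hostedby"] = UNKNOWN_REPOSITORY
            continue
        instance["hostedby"] = match["name"]
        if match.get("open_access"):
            instance["accessright"] = "Open Access"
            instance["oa_route"] = "gold"
    return record
```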
|
||||
|
||||
## References
|
||||
|
||||
The idea behind DOIBoost and its origin can be found in the paper (and related resources) at:
|
||||
|
||||
* La Bruzzo S., Manghi P., Mannocci A. (2019) OpenAIRE's DOIBoost - Boosting CrossRef for Research. In: Manghi P., Candela L., Silvello G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, doi:10.1007/978-3-030-11226-4_11 . Open Access version available at: [10.5281/zenodo.1441071](https://doi.org/10.5281/zenodo.1441071)
|
||||
|
|
|
@ -65,339 +65,7 @@ curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks
|
|||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
},
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"PublicationDate": "04-11-2022",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "References"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "MED"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "MT121216",
|
||||
"IDScheme": "ENA",
|
||||
"IDURL": "http://identifiers.org/ebi/ena.embl:MT121216"
|
||||
},
|
||||
"Title": "MT121216",
|
||||
"Publisher": {
|
||||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
},
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"PublicationDate": "04-11-2022",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "References"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "MED"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "KF367457",
|
||||
"IDScheme": "ENA",
|
||||
"IDURL": "http://identifiers.org/ebi/ena.embl:KF367457"
|
||||
},
|
||||
"Title": "KF367457",
|
||||
"Publisher": {
|
||||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
},
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"PublicationDate": "04-11-2022",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "References"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "MED"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "MN996532",
|
||||
"IDScheme": "ENA",
|
||||
"IDURL": "http://identifiers.org/ebi/ena.embl:MN996532"
|
||||
},
|
||||
"Title": "MN996532",
|
||||
"Publisher": {
|
||||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
},
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"PublicationDate": "04-11-2022",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "References"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "MED"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "MT072864",
|
||||
"IDScheme": "ENA",
|
||||
"IDURL": "http://identifiers.org/ebi/ena.embl:MT072864"
|
||||
},
|
||||
"Title": "MT072864",
|
||||
"Publisher": {
|
||||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"Name": "Protein Structures",
|
||||
"NameLong": "Protein structures in PDBe",
|
||||
"CategoryLinkCount": 2,
|
||||
"Section": [
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"Tags": [
|
||||
"supporting_data"
|
||||
],
|
||||
"SectionLinkCount": 2,
|
||||
"Linklist": {
|
||||
"Link": [
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"PublicationDate": "04-11-2022",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "References"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "MED"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "6VW1",
|
||||
"IDScheme": "PDB",
|
||||
"IDURL": "http://identifiers.org/pdbe/pdb:6VW1"
|
||||
},
|
||||
"Title": "6VW1",
|
||||
"Publisher": {
|
||||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
},
|
||||
{
|
||||
"ObtainedBy": "tm_accession",
|
||||
"PublicationDate": "04-11-2022",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "References"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "MED"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "2AJF",
|
||||
"IDScheme": "PDB",
|
||||
"IDURL": "http://identifiers.org/pdbe/pdb:2AJF"
|
||||
},
|
||||
"Title": "2AJF",
|
||||
"Publisher": {
|
||||
"Name": "Europe PMC"
|
||||
}
|
||||
},
|
||||
"Frequency": 1
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"Name": "Altmetric",
|
||||
"CategoryLinkCount": 1,
|
||||
"Section": [
|
||||
{
|
||||
"ObtainedBy": "ext_links",
|
||||
"Tags": [
|
||||
"altmetrics"
|
||||
],
|
||||
"SectionLinkCount": 1,
|
||||
"Linklist": {
|
||||
"Link": [
|
||||
{
|
||||
"ObtainedBy": "ext_links",
|
||||
"PublicationDate": "15-10-2020",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "IsReferencedBy"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "PMID"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "https://www.altmetric.com/details/91880755",
|
||||
"IDScheme": "URL",
|
||||
"IDURL": "https://www.altmetric.com/details/91880755"
|
||||
},
|
||||
"Title": "Characteristics of SARS-CoV-2 and COVID-19",
|
||||
"Publisher": {
|
||||
"Name": "Altmetric"
|
||||
},
|
||||
"ImageURL": "https://api.altmetric.com/v1/donut/91880755_64.png"
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
]
|
||||
},
|
||||
{
|
||||
"Name": "BioStudies: supplemental material and supporting data",
|
||||
"CategoryLinkCount": 1,
|
||||
"Section": [
|
||||
{
|
||||
"ObtainedBy": "ext_links",
|
||||
"Tags": [
|
||||
"supporting_data"
|
||||
],
|
||||
"SectionLinkCount": 1,
|
||||
"Linklist": {
|
||||
"Link": [
|
||||
{
|
||||
"ObtainedBy": "ext_links",
|
||||
"PublicationDate": "11-03-2021",
|
||||
"LinkProvider": {
|
||||
"Name": "Europe PMC"
|
||||
},
|
||||
"RelationshipType": {
|
||||
"Name": "IsReferencedBy"
|
||||
},
|
||||
"Source": {
|
||||
"Type": {
|
||||
"Name": "literature"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "33024307",
|
||||
"IDScheme": "PMID"
|
||||
}
|
||||
},
|
||||
"Target": {
|
||||
"Type": {
|
||||
"Name": "dataset"
|
||||
},
|
||||
"Identifier": {
|
||||
"ID": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true",
|
||||
"IDScheme": "URL",
|
||||
"IDURL": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true"
|
||||
},
|
||||
"Title": "Characteristics of SARS-CoV-2 and COVID-19.",
|
||||
"Publisher": {
|
||||
"Name": "BioStudies: supplemental material and supporting data"
|
||||
}
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
]
|
||||
}
|
||||
]
|
||||
}
|
||||
}
|
||||
[...]
|
||||
```
|
||||
|
||||
## Mapping
|
||||
|
@ -406,21 +74,21 @@ We filter all the target links with pid type **ena**, **pdb** or **uniprot**
|
|||
For each target we construct a Bioentity with the following mapping
|
||||
|
||||
|
||||
| *OpenAIRE Result field path* | EBI record field xpath | Notes |
|
||||
|------------------------------|----------------------------------------------------------|---------------------------------------------------------------|
|
||||
| `id` | `target/identifier/ID` and `target/identifier/IDScheme` | id in the form `SCHEMA_________::md5(pid)` |
|
||||
| `pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
|
||||
| `publicationdate` | `target/PublicationDate` | clean and normalize the format of the date to be `YYYY-mm-dd` |
|
||||
| `maintitle` | `target/Title` | |
|
||||
| **Instance Mapping** | | |
|
||||
| `instance.type` | | `Bioentity` |
|
||||
| `type` | | `Dataset` |
|
||||
| `instance.pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
|
||||
| `instance.url` | `target/identifier/IDURL` | Copy the value as it is |
|
||||
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
|
||||
| OpenAIRE Result field path | EBI record field xpath | Notes |
|
||||
|-----------------------------|----------------------------------------------------------|---------------------------------------------------------------|
|
||||
| `id` | `target/identifier/ID` and `target/identifier/IDScheme` | id in the form `SCHEMA_________::md5(pid)` |
|
||||
| `pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
|
||||
| `publicationdate` | `target/PublicationDate` | clean and normalize the format of the date to be `YYYY-mm-dd` |
|
||||
| `maintitle` | `target/Title` | |
|
||||
| **Instance Mapping** | | |
|
||||
| `instance.type` | | `Bioentity` |
|
||||
| `type` | | `Dataset` |
|
||||
| `instance.pid` | `target/identifier/ID` and `target/identifier/IDScheme` | `classid = classname = schema` |
|
||||
| `instance.url` | `target/identifier/IDURL` | Copy the value as it is |
|
||||
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
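
As an illustration of the `id` and `pid` rows, the sketch below filters a target by PID type and builds the identifier from the scheme and the MD5 of the PID. It assumes that the prefix is the lower-cased scheme padded with underscores to 12 characters (as in the `openaire____` identifiers used elsewhere in this documentation) and that the MD5 is computed on the PID string as it is; both are assumptions, as are the function names and the output layout.

```python
import hashlib

ACCEPTED_SCHEMES = {"ena", "pdb", "uniprot"}


def openaire_id(scheme: str, pid: str, width: int = 12) -> str:
    """Build `prefix::md5(pid)` with an underscore-padded prefix."""
    prefix = scheme.lower().ljust(width, "_")
    digest = hashlib.md5(pid.encode("utf-8")).hexdigest()
    return f"{prefix}::{digest}"


def target_to_bioentity(target):
    """Map a `Target` object of the EBI links response to a Bioentity."""
    identifier = target["Identifier"]
    scheme = identifier["IDScheme"].lower()
    if scheme not in ACCEPTED_SCHEMES:
        return None  # other link targets are filtered out
    return {
        "id": openaire_id(scheme, identifier["ID"]),
        "pid": {"classid": scheme, "classname": scheme, "value": identifier["ID"]},
        "maintitle": target.get("Title"),
        "type": "Dataset",
        "instance": {"type": "Bioentity", "url": identifier.get("IDURL")},
    }
```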
|
||||
|
||||
|
||||
### Relation Mapping
|
||||
| OpenAIRE Relation Semantic and inverse | Source/Target type | #Notes |
|
||||
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|
||||
|----------------------------------------|---------------------|--------------------------------------------------------------------------|
|
||||
| `IsRelatedTo` | `result/result` | we create relationships between the BioEntity and the pubmed publication |
|
||||
|
|
|
@ -5,7 +5,7 @@ This section describes the mapping implemented for [MEDLINE/PubMed](https://pubm
|
|||
## Input
|
||||
|
||||
The native data is collected from the [ftp baseline](https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/) site.
|
||||
It contains XML records compliant with the schema available at https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html.
|
||||
It contains XML records compliant with the schema available at [www.nlm.nih.gov](https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html).
|
||||
|
||||
## Incremental harvesting
|
||||
PubMed exposes an FTP entry point with the update files: [ftp baseline update](https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We collect each new file and generate the new dataset by upserting the existing items.
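
The upsert can be pictured as follows, assuming the dataset is keyed by PMID and that a record from an update file fully replaces the existing one. The function name and the `pmid` key are hypothetical.

```python
def upsert(dataset, updates):
    """Insert new records and overwrite existing ones, keyed by PMID.

    dataset: dict mapping a PMID to the current record.
    updates: iterable of records parsed from one update file, in file order.
    """
    for record in updates:
        dataset[record["pmid"]] = record
    return dataset
```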
|
||||
|
@ -14,32 +14,31 @@ Pubmed exposes an entry point FTP with all the updates for each one. [ftp baseli
|
|||
|
||||
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
|
||||
|
||||
|
||||
| *OpenAIRE Result field path* | PubMed record field xpath | Notes |
|
||||
|--------------------------------|--------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| **Publication Mapping** | | |
|
||||
| `id` | ?? | id in the form `pmid_________::md5(pmid)` |
|
||||
| `pid` | `//PMID` | `classid = classname = pmid` |
|
||||
| `publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
|
||||
| `maintitle` | `//Title` | |
|
||||
| `description` | `//AbstractText` | |
|
||||
| `language` | `//Language` | cleaning vocabulary -> dnet:languages |
|
||||
| `subjects` | `//DescriptorName` | classId, className = keyword |
|
||||
| **Author Mapping** | | |
|
||||
| `author.surname` | `//Author/LastName` | |
|
||||
| `author.name` | `//Author/ForeName` | |
|
||||
| `author.fullname` | `//Author/FullName` | Concatenation of forename + lastName if exist |
|
||||
| `author.rank` | FOR ALL AUTHORS | sequential number starting from 1 |
|
||||
| **Journal Mapping** | | |
|
||||
| `container.conferencedate` | `//Journal/PubDate` | map the date of the Journal |
|
||||
| `container.name` | `//Journal/Title` | name of the journal |
|
||||
| `container.vol` | `//Journal/Volume` | journal volume |
|
||||
| `container.issPrinted` | `//Journal/ISSN` | the journal issn |
|
||||
| `container.iss` | `//Journal/Issue` | The journal issue |
|
||||
| **Instance Mapping** | | |
|
||||
| `instance.type` | `//PublicationType` | if the article has the typology `Journal Article`, we apply this type; otherwise we look for a term that matches the vocabulary and, if none matches, we discard it |
|
||||
|`type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
|
||||
| `instance.pid` | `//PMID` | map the pmid in the pid in the instance |
|
||||
| `instance.url` | `//PMID` | creates the URL by prepending `https://pubmed.ncbi.nlm.nih.gov/` to the PMId |
|
||||
| `instance.alternateIdentifier` | `//ArticleId[./@IdType="doi"]` | |
|
||||
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
|
||||
| OpenAIRE Result field path | PubMed record field xpath | Notes |
|
||||
|--------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|
||||
| **Publication Mapping** | | |
|
||||
| `id` | `//PMID` | id in the form `pmid_________::md5(pmid)` |
|
||||
| `pid` | `//PMID` | `classid = classname = pmid` |
|
||||
| `publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
|
||||
| `maintitle` | `//Title` | |
|
||||
| `description` | `//AbstractText` | |
|
||||
| `language` | `//Language` | cleaning vocabulary -> dnet:languages |
|
||||
| `subjects` | `//DescriptorName` | classId, className = keyword |
|
||||
| **Author Mapping** | | |
|
||||
| `author.surname` | `//Author/LastName` | |
|
||||
| `author.name` | `//Author/ForeName` | |
|
||||
| `author.fullname` | `//Author/FullName` | Concatenation of forename + lastName if exist |
|
||||
| `author.rank` | FOR ALL AUTHORS | sequential number starting from 1 |
|
||||
| **Journal Mapping** | | |
|
||||
| `container.conferencedate` | `//Journal/PubDate` | map the date of the Journal |
|
||||
| `container.name` | `//Journal/Title` | name of the journal |
|
||||
| `container.vol` | `//Journal/Volume` | journal volume |
|
||||
| `container.issPrinted` | `//Journal/ISSN` | the journal issn |
|
||||
| `container.iss` | `//Journal/Issue` | The journal issue |
|
||||
| **Instance Mapping** | | |
|
||||
| `instance.type` | `//PublicationType` | if the article has the typology `Journal Article`, we apply this type; otherwise we look for a term that matches the vocabulary and, if none matches, we discard it |
|
||||
| `type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
|
||||
| `instance.pid` | `//PMID` | map the pmid in the pid in the instance |
|
||||
| `instance.url` | `//PMID` | creates the URL by prepending `https://pubmed.ncbi.nlm.nih.gov/` to the PMId |
|
||||
| `instance.alternateIdentifier` | `//ArticleId[./@IdType="doi"]` | |
|
||||
| `instance.publicationdate` | `//PubmedPubDate` | clean and normalize the format of the date to be YYYY-mm-dd |
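
A few rows of the table (PID, authors with rank, instance URL, alternate identifier) can be sketched with the standard library XML parser. The XPath expressions are taken from the table; the function name and the output layout are assumptions, and the real records contain many more elements than this sketch reads.

```python
import xml.etree.ElementTree as ET

PUBMED_URL_PREFIX = "https://pubmed.ncbi.nlm.nih.gov/"


def map_pubmed_record(xml_text):
    """Map a PubMed XML record to a partial OpenAIRE-like dictionary."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext(".//PMID")
    authors = []
    # `author.rank` is a sequential number starting from 1.
    for rank, author in enumerate(root.iter("Author"), start=1):
        forename = author.findtext("ForeName")
        lastname = author.findtext("LastName")
        authors.append({
            "name": forename,
            "surname": lastname,
            # Concatenation of forename and last name, when they exist.
            "fullname": " ".join(p for p in (forename, lastname) if p),
            "rank": rank,
        })
    return {
        "pid": {"classid": "pmid", "classname": "pmid", "value": pmid},
        "author": authors,
        "instance": {
            "url": PUBMED_URL_PREFIX + pmid,
            "alternateIdentifier": root.findtext(".//ArticleId[@IdType='doi']"),
        },
    }
```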
|