datacite tables and EBI texts

This commit is contained in:
Claudio Atzori 2022-11-08 09:18:04 +01:00
parent b5e23b9605
commit 77ad2700b2
2 changed files with 428 additions and 39 deletions

View File

@ -32,46 +32,45 @@ The metadata collection process identifies the most recent record date available
## Datacite Mapping ## Datacite Mapping
The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format. The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
| OpenAIRE Result field path | Datacite record JSON path | # Notes |
| OpenAIRE Result field path | Datacite record JSON path | # Notes | |--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | `id` | `\attributes\doi` | id in the form `doi_________::md5(doi)` |
| `id` | `\attributes\doi` | id in the form `doi_________::md5(doi)` | | <ul><li>`instance`</li> <li>`instance.type`</li></ul> | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Use the vocabulary **_dnet:publication_resource_** to find a synonym to one of these terms and get the `instance.type`. |
| <ul><li>`instance`</li> <li>`instance.type`</li></ul> | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Use the vocabulary **_dnet:publication_resource_** to find a synonym to one of these terms and get the `instance.type`. | | `type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> |
|`type` | <ul><li>`\attributes\types\resourceType`</li> <li> `\attributes\types\resourceTypeGeneral` </li> <li>`attributes\types\schemaOrg`</li></ul> | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: <ul><li>`publication`</li> <li>`dataset`</li> <li> `software`</li> <li>`otherresearchproduct`</li></ul> | | `pid` | `\attributes\doi` | `scheme = doi` |
| `pid` | `\attributes\doi` | `scheme = doi` | | `originalid` | `\attributes\doi` | |
| `originalid` | `\attributes\doi` | | | `dateofcollection` | `attributes\updated` | the timestamp is defined in milliseconds we convert to "yyyy-MM-dd'T'HH:mm:ssZ" format |
| `dateofcollection` | `attributes\updated` | the timestamp is defined in milliseconds we convert to "yyyy-MM-dd'T'HH:mm:ssZ" format | | `author` | `\attributes\creators` | Each creator field will be mapped in the author entity below the subfield. **If the record has no Creator it will be skipped** |
| `author` | `\attributes\creators` | Each creator field will be mapped in the author entity below the subfield. **If the record has no Creator it will be skipped** | | `author.fullname` | `\attributes\creators\name` | if name is not defined, we construct from given and family name |
| `author.fullname` | `\attributes\creators\name` | if name is not defined, we construct from given and family name | | `author.rank` | | Incremental index starting from 1 |
| `author.rank` | | Incremental index starting from 1 | | `author.name` | `\attributes\creators\givenName` | |
| `author.name` | `\attributes\creators\givenName` | | | `author.surname` | `\attributes\creators\familyName` | |
| `author.surname` | `\attributes\creators\familyName` | | | `author.pid` | `\attributes\creators\nameIdentifiers` | this is a list of pids associated to the creator |
| `author.pid` | `\attributes\creators\nameIdentifiers` | this is a list of pids associated to the creator | | `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | mapping with vocabulary **dnet:pid_types** |
| `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | mapping with vocabulary **dnet:pid_types** | | `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value |
| `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value | | `maintitle` | `\attributes\titles` | Titles whose title type is null or title type is Main |
| `maintitle` | `\attributes\titles` | Titles whose title type is null or title type is Main | | `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle since the title type vocabulary in OpenAIRE use the datacite title type vocabulary |
| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle since the title type vocabulary in OpenAIRE use the datacite title type vocabulary | | **date section** | | for each date in particular for DOI starting with _10.14457_ we Apply a fix thai date convert a date to ThaiBuddhistDate and reformat to local one see ticket [#6791](https://support.openaire.eu/issues/6791) |
| **date section** | | for each date in particular for DOI starting with _10.14457_ we Apply a fix thai date convert a date to ThaiBuddhistDate and reformat to local one see ticket [#6791](https://support.openaire.eu/issues/6791) | | `publicationdate` | `\attributes\dates` | where `dateType` is **issued** |
| `publicationdate` | `\attributes\dates` | where `dateType` is **issued** | | `publicationdate` | `\attributes\publicationYear` | we create this date format `01-01-publicationYear` |
| `publicationdate` | `\attributes\publicationYear` | we create this date format `01-01-publicationYear` | | `embargoenddate` | `\attributes\dates` | where `dateType` is **available** |
| `embargoenddate` | `\attributes\dates` | where `dateType` is **available** | | `subjects` | `\attributes\subject` | `scheme=keywords` |
| `subjects` | `\attributes\subject` | `scheme=keywords` | | `description` | `\attributes\descriptions` | |
| `description` | `\attributes\descriptions` | | | `publisher` | `\attributes\publisher` | |
| `publisher` | `\attributes\publisher` | | | `language` | `\attributes\language` | cleaned by using vocabulary `dnet:languages` |
| `language` | `\attributes\language` | cleaned by using vocabulary `dnet:languages` | | `publisher` | `\attributes\publisher` | |
| `publisher` | `\attributes\publisher` | | | `instance.license` | `\attributes\rightsList` | if the rights value starts with http and matches a particular regex |
| `instance.license` | `\attributes\rightsList` | if right value starts with http and matches a particular regex | | `instance.accessright` | `\attributes\rightsList` | <ul><li>if not present :`unknown`</li><li>if datasource is Figshare:`open`</li><li>If `embargo_date < today()`: OPEN</li></ul> |
| `instance.accessright` | `\attributes\rightsList` | <ul> <li>if not present :`unknown`</li><li>if datasource is _figshare_:`open`</li><li>If `embargo_date < today()`: _OPEN_ </li> </ul> |
### Mapping Relation ### Mapping Relation
| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Tartget type | #Notes | | OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Tartget type | #Notes |
|-------------------------------------------|-------------------------------|-------------------------------|---------| |----------------------------------------|---------------------------------------|----------------------|---------------------------------------------------------------------------------------------------|
| `isProducedBy` |`attributes\fundingReferences` | `Result/Project`| we must identifi if match this pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)`| | `isProducedBy` | `attributes\fundingReferences` | `Result/Project` | we must identifi if match this pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)` |
| `IsProvidedBy` | | `Result/DataSource` | Datasource is always Datacite| | `IsProvidedBy` | | `Result/DataSource` | Datasource is always Datacite |
| `IsHostedBy` | `\attributes\relationships\client\id` | `Result/DataSource` |we defined a curated map clientId/Datasource if we found a match we create an _hostedBy Relation_ | | `IsHostedBy` | `\attributes\relationships\client\id` | `Result/DataSource` | we defined a curated map clientId/Datasource if we found a match we create an _hostedBy Relation_ |
### Relation Resolution ### Relation Resolution

View File

@ -2,13 +2,403 @@
This section describes the mapping implemented for [EMBL-EBIs Protein Data Bank in Europe](https://www.ebi.ac.uk/). This section describes the mapping implemented for [EMBL-EBIs Protein Data Bank in Europe](https://www.ebi.ac.uk/).
The Europe PMC RESTful Web Service gives the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API)to retrieve data-literature links in Scholix format . The Europe PMC RESTful Web Service gives the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API) to retrieve data-literature links in Scholix format.
## how data is collected ## How the data is collected
Starting from the Pubmed collection, we exploit this API to get all the related bioentities related to a Publication with a specific PubMed identifier.
Following this request: `https://www.ebi.ac.uk/europepmc/webservices/rest/MED/$PMID/datalinks?format=json` we store for each pubmedID the links related. Starting from the Pubmed collection, the API below is used to obtain the bioentities related to publications for each PubMed identifier.
Example:
```commandline
curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks?format=json" | jq '.'
{
"version": "6.8",
"hitCount": 9,
"request": {
"id": "33024307",
"source": "MED"
},
"dataLinkList": {
"Category": [
{
"Name": "Nucleotide Sequences",
"CategoryLinkCount": 5,
"Section": [
{
"ObtainedBy": "tm_accession",
"Tags": [
"supporting_data"
],
"SectionLinkCount": 5,
"Linklist": {
"Link": [
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "AY278488",
"IDScheme": "ENA",
"IDURL": "http://identifiers.org/ebi/ena.embl:AY278488"
},
"Title": "AY278488",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
},
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "MT121216",
"IDScheme": "ENA",
"IDURL": "http://identifiers.org/ebi/ena.embl:MT121216"
},
"Title": "MT121216",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
},
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "KF367457",
"IDScheme": "ENA",
"IDURL": "http://identifiers.org/ebi/ena.embl:KF367457"
},
"Title": "KF367457",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
},
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "MN996532",
"IDScheme": "ENA",
"IDURL": "http://identifiers.org/ebi/ena.embl:MN996532"
},
"Title": "MN996532",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
},
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "MT072864",
"IDScheme": "ENA",
"IDURL": "http://identifiers.org/ebi/ena.embl:MT072864"
},
"Title": "MT072864",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
}
]
}
}
]
},
{
"Name": "Protein Structures",
"NameLong": "Protein structures in PDBe",
"CategoryLinkCount": 2,
"Section": [
{
"ObtainedBy": "tm_accession",
"Tags": [
"supporting_data"
],
"SectionLinkCount": 2,
"Linklist": {
"Link": [
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "6VW1",
"IDScheme": "PDB",
"IDURL": "http://identifiers.org/pdbe/pdb:6VW1"
},
"Title": "6VW1",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
},
{
"ObtainedBy": "tm_accession",
"PublicationDate": "04-11-2022",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "References"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "MED"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "2AJF",
"IDScheme": "PDB",
"IDURL": "http://identifiers.org/pdbe/pdb:2AJF"
},
"Title": "2AJF",
"Publisher": {
"Name": "Europe PMC"
}
},
"Frequency": 1
}
]
}
}
]
},
{
"Name": "Altmetric",
"CategoryLinkCount": 1,
"Section": [
{
"ObtainedBy": "ext_links",
"Tags": [
"altmetrics"
],
"SectionLinkCount": 1,
"Linklist": {
"Link": [
{
"ObtainedBy": "ext_links",
"PublicationDate": "15-10-2020",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "IsReferencedBy"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "PMID"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "https://www.altmetric.com/details/91880755",
"IDScheme": "URL",
"IDURL": "https://www.altmetric.com/details/91880755"
},
"Title": "Characteristics of SARS-CoV-2 and COVID-19",
"Publisher": {
"Name": "Altmetric"
},
"ImageURL": "https://api.altmetric.com/v1/donut/91880755_64.png"
}
}
]
}
}
]
},
{
"Name": "BioStudies: supplemental material and supporting data",
"CategoryLinkCount": 1,
"Section": [
{
"ObtainedBy": "ext_links",
"Tags": [
"supporting_data"
],
"SectionLinkCount": 1,
"Linklist": {
"Link": [
{
"ObtainedBy": "ext_links",
"PublicationDate": "11-03-2021",
"LinkProvider": {
"Name": "Europe PMC"
},
"RelationshipType": {
"Name": "IsReferencedBy"
},
"Source": {
"Type": {
"Name": "literature"
},
"Identifier": {
"ID": "33024307",
"IDScheme": "PMID"
}
},
"Target": {
"Type": {
"Name": "dataset"
},
"Identifier": {
"ID": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true",
"IDScheme": "URL",
"IDURL": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true"
},
"Title": "Characteristics of SARS-CoV-2 and COVID-19.",
"Publisher": {
"Name": "BioStudies: supplemental material and supporting data"
}
}
}
]
}
}
]
}
]
}
}
```
## Mapping ## Mapping
The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format. The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format.