From 77ad2700b29d84cbb0a3afa7b09f9cfd64705ffa Mon Sep 17 00:00:00 2001 From: Claudio Atzori Date: Tue, 8 Nov 2022 09:18:04 +0100 Subject: [PATCH 1/3] datacite tables and EBI texts --- docs/data-provision/aggregation/datacite.md | 69 ++-- docs/data-provision/aggregation/ebi.md | 398 +++++++++++++++++++- 2 files changed, 428 insertions(+), 39 deletions(-) diff --git a/docs/data-provision/aggregation/datacite.md b/docs/data-provision/aggregation/datacite.md index b268e14..0de7b98 100644 --- a/docs/data-provision/aggregation/datacite.md +++ b/docs/data-provision/aggregation/datacite.md @@ -32,46 +32,45 @@ The metadata collection process identifies the most recent record date available ## Datacite Mapping The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format. - -| OpenAIRE Result field path | Datacite record JSON path | # Notes | -|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `id` | `\attributes\doi` | id in the form `doi_________::md5(doi)` | -| | | Use the vocabulary **_dnet:publication_resource_** to find a synonym to one of these terms and get the `instance.type`. | -|`type` | | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: | -| `pid` | `\attributes\doi` | `scheme = doi` | -| `originalid` | `\attributes\doi` | | -| `dateofcollection` | `attributes\updated` | the timestamp is defined in milliseconds we convert to "yyyy-MM-dd'T'HH:mm:ssZ" format | -| `author` | `\attributes\creators` | Each creator field will be mapped in the author entity below the subfield. **If the record has no Creator it will be skipped** | -| `author.fullname` | `\attributes\creators\name` | if name is not defined, we construct from given and family name | -| `author.rank` | | Incremental index starting from 1 | -| `author.name` | `\attributes\creators\givenName` | | -| `author.surname` | `\attributes\creators\familyName` | | -| `author.pid` | `\attributes\creators\nameIdentifiers` | this is a list of pids associated to the creator | -| `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | mapping with vocabulary **dnet:pid_types** | -| `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value | -| `maintitle` | `\attributes\titles` | Titles whose title type is null or title type is Main | -| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle since the title type vocabulary in OpenAIRE use the datacite title type vocabulary | -| **date section** | | for each date in particular for DOI starting with _10.14457_ we Apply a fix thai date convert a date to ThaiBuddhistDate and reformat to local one see ticket [#6791](https://support.openaire.eu/issues/6791) | -| `publicationdate` | `\attributes\dates` | where `dateType` is **issued** | -| `publicationdate` | `\attributes\publicationYear` | we create this date format `01-01-publicationYear` | -| `embargoenddate` | `\attributes\dates` | where `dateType` is **available** | -| `subjects` | `\attributes\subject` | `scheme=keywords` | -| `description` | `\attributes\descriptions` | | -| `publisher` | `\attributes\publisher` | | -| `language` | `\attributes\language` | cleaned by using vocabulary `dnet:languages` | -| `publisher` | `\attributes\publisher` | | -| `instance.license` | `\attributes\rightsList` | if right value starts with http and matches a particular regex | -| `instance.accessright` | `\attributes\rightsList` | | +| OpenAIRE Result field path | Datacite record JSON path | # Notes | +|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| `id` | `\attributes\doi` | id in the form `doi_________::md5(doi)` | +| | | Use the vocabulary **_dnet:publication_resource_** to find a synonym to one of these terms and get the `instance.type`. | +| `type` | | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to generate one of the following main entities: | +| `pid` | `\attributes\doi` | `scheme = doi` | +| `originalid` | `\attributes\doi` | | +| `dateofcollection` | `attributes\updated` | the timestamp is defined in milliseconds we convert to "yyyy-MM-dd'T'HH:mm:ssZ" format | +| `author` | `\attributes\creators` | Each creator field will be mapped in the author entity below the subfield. **If the record has no Creator it will be skipped** | +| `author.fullname` | `\attributes\creators\name` | if name is not defined, we construct from given and family name | +| `author.rank` | | Incremental index starting from 1 | +| `author.name` | `\attributes\creators\givenName` | | +| `author.surname` | `\attributes\creators\familyName` | | +| `author.pid` | `\attributes\creators\nameIdentifiers` | this is a list of pids associated to the creator | +| `author.pid.scheme` | `\attributes\creators\nameIdentifiers` | mapping with vocabulary **dnet:pid_types** | +| `author.pid.value` | `\attributes\creators\nameIdentifiers/nameIdentifier` | the pid value | +| `maintitle` | `\attributes\titles` | Titles whose title type is null or title type is Main | +| `subtitle` | `\attributes\titles` | Titles whose title type is Subtitle since the title type vocabulary in OpenAIRE use the datacite title type vocabulary | +| **date section** | | for each date in particular for DOI starting with _10.14457_ we Apply a fix thai date convert a date to ThaiBuddhistDate and reformat to local one see ticket [#6791](https://support.openaire.eu/issues/6791) | +| `publicationdate` | `\attributes\dates` | where `dateType` is **issued** | +| `publicationdate` | `\attributes\publicationYear` | we create this date format `01-01-publicationYear` | +| `embargoenddate` | `\attributes\dates` | where `dateType` is **available** | +| `subjects` | `\attributes\subject` | `scheme=keywords` | +| `description` | `\attributes\descriptions` | | +| `publisher` | `\attributes\publisher` | | +| `language` | `\attributes\language` | cleaned by using vocabulary `dnet:languages` | +| `publisher` | `\attributes\publisher` | | +| `instance.license` | `\attributes\rightsList` | if the rights value starts with http and matches a particular regex | +| `instance.accessright` | `\attributes\rightsList` | | ### Mapping Relation -| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Tartget type | #Notes | -|-------------------------------------------|-------------------------------|-------------------------------|---------| -| `isProducedBy` |`attributes\fundingReferences` | `Result/Project`| we must identifi if match this pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)`| -| `IsProvidedBy` | | `Result/DataSource` | Datasource is always Datacite| -| `IsHostedBy` | `\attributes\relationships\client\id` | `Result/DataSource` |we defined a curated map clientId/Datasource if we found a match we create an _hostedBy Relation_ | +| OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Tartget type | #Notes | +|----------------------------------------|---------------------------------------|----------------------|---------------------------------------------------------------------------------------------------| +| `isProducedBy` | `attributes\fundingReferences` | `Result/Project` | we must identifi if match this pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)` | +| `IsProvidedBy` | | `Result/DataSource` | Datasource is always Datacite | +| `IsHostedBy` | `\attributes\relationships\client\id` | `Result/DataSource` | we defined a curated map clientId/Datasource if we found a match we create an _hostedBy Relation_ | ### Relation Resolution diff --git a/docs/data-provision/aggregation/ebi.md b/docs/data-provision/aggregation/ebi.md index 11a8507..fdbcc7a 100644 --- a/docs/data-provision/aggregation/ebi.md +++ b/docs/data-provision/aggregation/ebi.md @@ -2,13 +2,403 @@ This section describes the mapping implemented for [EMBL-EBIs Protein Data Bank in Europe](https://www.ebi.ac.uk/). -The Europe PMC RESTful Web Service gives the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API)to retrieve data-literature links in Scholix format . +The Europe PMC RESTful Web Service gives the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API) to retrieve data-literature links in Scholix format. -## how data is collected -Starting from the Pubmed collection, we exploit this API to get all the related bioentities related to a Publication with a specific PubMed identifier. +## How the data is collected -Following this request: `https://www.ebi.ac.uk/europepmc/webservices/rest/MED/$PMID/datalinks?format=json` we store for each pubmedID the links related. +Starting from the Pubmed collection, the API below is used to obtain the bioentities related to publications for each PubMed identifier. +Example: + +```commandline +curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks?format=json" | jq '.' +{ + "version": "6.8", + "hitCount": 9, + "request": { + "id": "33024307", + "source": "MED" + }, + "dataLinkList": { + "Category": [ + { + "Name": "Nucleotide Sequences", + "CategoryLinkCount": 5, + "Section": [ + { + "ObtainedBy": "tm_accession", + "Tags": [ + "supporting_data" + ], + "SectionLinkCount": 5, + "Linklist": { + "Link": [ + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "AY278488", + "IDScheme": "ENA", + "IDURL": "http://identifiers.org/ebi/ena.embl:AY278488" + }, + "Title": "AY278488", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + }, + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "MT121216", + "IDScheme": "ENA", + "IDURL": "http://identifiers.org/ebi/ena.embl:MT121216" + }, + "Title": "MT121216", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + }, + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "KF367457", + "IDScheme": "ENA", + "IDURL": "http://identifiers.org/ebi/ena.embl:KF367457" + }, + "Title": "KF367457", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + }, + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "MN996532", + "IDScheme": "ENA", + "IDURL": "http://identifiers.org/ebi/ena.embl:MN996532" + }, + "Title": "MN996532", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + }, + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "MT072864", + "IDScheme": "ENA", + "IDURL": "http://identifiers.org/ebi/ena.embl:MT072864" + }, + "Title": "MT072864", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + } + ] + } + } + ] + }, + { + "Name": "Protein Structures", + "NameLong": "Protein structures in PDBe", + "CategoryLinkCount": 2, + "Section": [ + { + "ObtainedBy": "tm_accession", + "Tags": [ + "supporting_data" + ], + "SectionLinkCount": 2, + "Linklist": { + "Link": [ + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "6VW1", + "IDScheme": "PDB", + "IDURL": "http://identifiers.org/pdbe/pdb:6VW1" + }, + "Title": "6VW1", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + }, + { + "ObtainedBy": "tm_accession", + "PublicationDate": "04-11-2022", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "References" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "MED" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "2AJF", + "IDScheme": "PDB", + "IDURL": "http://identifiers.org/pdbe/pdb:2AJF" + }, + "Title": "2AJF", + "Publisher": { + "Name": "Europe PMC" + } + }, + "Frequency": 1 + } + ] + } + } + ] + }, + { + "Name": "Altmetric", + "CategoryLinkCount": 1, + "Section": [ + { + "ObtainedBy": "ext_links", + "Tags": [ + "altmetrics" + ], + "SectionLinkCount": 1, + "Linklist": { + "Link": [ + { + "ObtainedBy": "ext_links", + "PublicationDate": "15-10-2020", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "IsReferencedBy" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "PMID" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "https://www.altmetric.com/details/91880755", + "IDScheme": "URL", + "IDURL": "https://www.altmetric.com/details/91880755" + }, + "Title": "Characteristics of SARS-CoV-2 and COVID-19", + "Publisher": { + "Name": "Altmetric" + }, + "ImageURL": "https://api.altmetric.com/v1/donut/91880755_64.png" + } + } + ] + } + } + ] + }, + { + "Name": "BioStudies: supplemental material and supporting data", + "CategoryLinkCount": 1, + "Section": [ + { + "ObtainedBy": "ext_links", + "Tags": [ + "supporting_data" + ], + "SectionLinkCount": 1, + "Linklist": { + "Link": [ + { + "ObtainedBy": "ext_links", + "PublicationDate": "11-03-2021", + "LinkProvider": { + "Name": "Europe PMC" + }, + "RelationshipType": { + "Name": "IsReferencedBy" + }, + "Source": { + "Type": { + "Name": "literature" + }, + "Identifier": { + "ID": "33024307", + "IDScheme": "PMID" + } + }, + "Target": { + "Type": { + "Name": "dataset" + }, + "Identifier": { + "ID": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true", + "IDScheme": "URL", + "IDURL": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true" + }, + "Title": "Characteristics of SARS-CoV-2 and COVID-19.", + "Publisher": { + "Name": "BioStudies: supplemental material and supporting data" + } + } + } + ] + } + } + ] + } + ] + } +} +``` ## Mapping The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format. From 1f1e5c8d4913f25d61b563696d5b5322e89d9f17 Mon Sep 17 00:00:00 2001 From: Claudio Atzori Date: Tue, 8 Nov 2022 09:35:10 +0100 Subject: [PATCH 2/3] minor --- docs/data-provision/aggregation/datacite.md | 5 ++++- docs/data-provision/aggregation/pubmed.md | 3 ++- 2 files changed, 6 insertions(+), 2 deletions(-) diff --git a/docs/data-provision/aggregation/datacite.md b/docs/data-provision/aggregation/datacite.md index 0de7b98..c5b549f 100644 --- a/docs/data-provision/aggregation/datacite.md +++ b/docs/data-provision/aggregation/datacite.md @@ -30,6 +30,9 @@ The collection workflow is responsible for aggregating new records. Each record The metadata collection process identifies the most recent record date available locally and uses such date to requests the records to the Datacite API, populating the **FROM_DATE_TIMESAMP** variable. The records in the API response are included in the local storage in upsert mode. ## Datacite Mapping + +### Entity Mapping + The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format. | OpenAIRE Result field path | Datacite record JSON path | # Notes | @@ -63,7 +66,7 @@ The table below describes the mapping from the XML baseline records to the OpenA | `instance.accessright` | `\attributes\rightsList` | | -### Mapping Relation +### Relation Mapping | OpenAIRE Relation Semantic and inverse | Datacite record JSON path | Source/Tartget type | #Notes | diff --git a/docs/data-provision/aggregation/pubmed.md b/docs/data-provision/aggregation/pubmed.md index a223c35..f45a974 100644 --- a/docs/data-provision/aggregation/pubmed.md +++ b/docs/data-provision/aggregation/pubmed.md @@ -9,7 +9,8 @@ It contains XML records compliant with the schema available at https://www.nlm.n ## Incremental harvesting Pubmed exposes an entry point FTP with all the updates for each one. [ftp baseline update](https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We collect the new file and generate the new dataset by upserting the existing item. -## Mapping + +## Entity Mapping The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format. From 3b3115fc38561622ac905d58b5e5adcb23fa5043 Mon Sep 17 00:00:00 2001 From: Claudio Atzori Date: Tue, 4 Oct 2022 14:00:54 +0200 Subject: [PATCH 3/3] more git ignores --- .gitignore | 2 ++ 1 file changed, 2 insertions(+) diff --git a/.gitignore b/.gitignore index b2d6de3..e005ae6 100644 --- a/.gitignore +++ b/.gitignore @@ -18,3 +18,5 @@ npm-debug.log* yarn-debug.log* yarn-error.log* + +.idea/ \ No newline at end of file