merged commit

2022-11-08 15:42:05 +01:00 · 2022-11-08 15:42:05 +01:00 · f05888e637
parent 268bb23545 92baad5acb
commit f05888e637
3 changed files with 441 additions and 36 deletions
--- a/docs/data-provision/aggregation/datacite.md
+++ b/docs/data-provision/aggregation/datacite.md
@@ -30,11 +30,13 @@ The collection workflow is responsible for aggregating new records. Each record
 The metadata collection process identifies the most recent record date available locally and uses that date to request records from the Datacite API, populating the **FROM_DATE_TIMESAMP** variable. The records in the API response are added to the local storage in upsert mode.
 ## Datacite Mapping
 ### Entity Mapping
 The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.
 | OpenAIRE Result field path                             | Datacite record JSON path                                                                                                                       | # Notes                                                                                                                                                                                                                                              |
-|--------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+|--------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
 | `id`                                                   | `\attributes\doi`                                                                                                                               | id in the form `doi_________::md5(doi)`                                                                                                                                                                                                              |
 | <ul><li>`instance`</li>  <li>`instance.type`</li></ul> | <ul><li>`\attributes\types\resourceType`</li>  <li> `\attributes\types\resourceTypeGeneral` </li>  <li>`attributes\types\schemaOrg`</li></ul>   | Use the vocabulary **_dnet:publication_resource_**  to find a synonym to one of these terms and get the `instance.type`.                                                                                                                             |
 | `type`                                                 | <ul><li>`\attributes\types\resourceType`</li>  <li> `\attributes\types\resourceTypeGeneral` </li>  <li>`attributes\types\schemaOrg`</li></ul>   | Using the **_dnet:result_typologies_** vocabulary, we look up the `instance.type` synonym to  generate one of the following main entities: <ul><li>`publication`</li>  <li>`dataset`</li> <li> `software`</li>  <li>`otherresearchproduct`</li></ul> | 
@@ -60,18 +62,30 @@ The table below describes the mapping from the XML baseline records to the OpenA
 | `publisher`                                            | `\attributes\publisher`                                                                                                                         |                                                                                                                                                                                                                                                      |
 | `language`                                             | `\attributes\language`                                                                                                                          | cleaned by using vocabulary `dnet:languages`                                                                                                                                                                                                         |
 | `publisher`                                            | `\attributes\publisher`                                                                                                                         |                                                                                                                                                                                                                                                      |
-| `instance.license`                                     | `\attributes\rightsList`                                                                                                                      | if right value starts with http and matches a particular regex                                                                                                                                                                                                                                                                                                                |
+| `instance.license`                                     | `\attributes\rightsList`                                                                                                                        | if the rights value starts with http and matches a particular regex                                                                                                                                                                                  |
-| `instance.accessright`                                 | `\attributes\rightsList`                                                                                                                      | <ul> <li>if not present :`unknown`</li><li>if datasource is _figshare_:`open`</li><li>If `embargo_date < today()`: _OPEN_ </li> </ul>                                                                                                                                                                                                                                         |
+| `instance.accessright`                                 | `\attributes\rightsList`                                                                                                                        | <ul><li>if not present :`unknown`</li><li>if datasource is Figshare:`open`</li><li>If `embargo_date < today()`: OPEN</li></ul>                                                                                                                       |
-### Mapping Relation
+### Relation Mapping
 <<<<<<< HEAD
 | OpenAIRE Relation Semantic and inverse    | Datacite record JSON path     | Source/Target type            | #Notes  |
 |-------------------------------------------|-------------------------------|-------------------------------|---------|
 | `isProducedBy`      |`attributes\fundingReferences` | `Result/Project`|  we must identify whether it matches the pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)`|
 | `IsProvidedBy`   | | `Result/DataSource` | Datasource is always Datacite|
 | `IsHostedBy`   | `\attributes\relationships\client\id` | `Result/DataSource` |we defined a curated clientId/Datasource map; if a match is found, we create a _hostedBy_ relation |
 |            |      `\attributes\relatedIdentifiers`                | result/result                 | we create relationships whenever the pid of the target is resolved on the Research Graph          |
 =======
 | OpenAIRE Relation Semantic and inverse | Datacite record JSON path             | Source/Target type   | #Notes                                                                                            |
 |----------------------------------------|---------------------------------------|----------------------|---------------------------------------------------------------------------------------------------|
 | `isProducedBy`                         | `attributes\fundingReferences`        | `Result/Project`     | we must identify whether it matches the pattern `(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)` |
 | `IsProvidedBy`                         |                                       | `Result/DataSource`  | Datasource is always Datacite                                                                     |
 | `IsHostedBy`                           | `\attributes\relationships\client\id` | `Result/DataSource`  | we defined a curated clientId/Datasource map; if a match is found, we create a _hostedBy_ relation |
 ### Relation Resolution
 >>>>>>> 92baad5acb3ecfb774510b48fee6aeeba92738df
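
Two rules from the Datacite mapping tables above can be sketched in Python: the `id` in the form `doi_________::md5(doi)` and the grant pattern listed for `isProducedBy`. This is only an illustration, not the workflow's actual code. The function names are hypothetical, and the sketch assumes that the DOI is hashed exactly as it appears in `\attributes\doi` (any normalisation before hashing is not stated here) and that the pattern is applied to a single string taken from a funding reference.

```python
import hashlib
import re
from typing import Optional

# Prefix copied from the Entity Mapping table: doi_________::md5(doi)
DOI_PREFIX = "doi_________"

# Pattern copied verbatim from the Relation Mapping table.
H2020_GRANT = re.compile(r"(info:eu-repo/grantagreement/ec/h2020/)(\d{6})(.*)")


def openaire_id(doi: str) -> str:
    """Build an identifier of the form doi_________::md5(doi).

    Assumption: the DOI is hashed as given, encoded as UTF-8.
    """
    digest = hashlib.md5(doi.encode("utf-8")).hexdigest()
    return f"{DOI_PREFIX}::{digest}"


def h2020_grant_code(funding_reference: str) -> Optional[str]:
    """Return the six-digit grant code if the string matches the pattern, else None."""
    match = H2020_GRANT.search(funding_reference)
    return match.group(2) if match else None
```

The second capture group of the pattern holds the six-digit code, so a string that does not match simply yields `None` and no `isProducedBy` relation would be derived from it in this sketch.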
--- a/docs/data-provision/aggregation/ebi.md
+++ b/docs/data-provision/aggregation/ebi.md
@@ -4,11 +4,401 @@ This section describes the mapping implemented for [EMBL-EBIs Protein Data Bank
 The Europe PMC RESTful Web Service gives the [datalinks API](https://europepmc.org/RestfulWebService#!/Europe32PMC32Articles32RESTful32API) to retrieve data-literature links in Scholix format.
-## how data is collected
+## How the data is collected
 Starting from the Pubmed collection, we exploit this API to get all the related bioentities related to a Publication with a specific PubMed identifier.
-Following this request: `https://www.ebi.ac.uk/europepmc/webservices/rest/MED/$PMID/datalinks?format=json` we store for each pubmedID the links related.
+Starting from the Pubmed collection, the API below is used to obtain the bioentities related to publications for each PubMed identifier.
 Example:
 ```commandline
 curl -s "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/33024307/datalinks?format=json" | jq '.'
 {
  "version": "6.8",
  "hitCount": 9,
  "request": {
    "id": "33024307",
    "source": "MED"
  },
  "dataLinkList": {
    "Category": [
      {
        "Name": "Nucleotide Sequences",
        "CategoryLinkCount": 5,
        "Section": [
          {
            "ObtainedBy": "tm_accession",
            "Tags": [
              "supporting_data"
            ],
            "SectionLinkCount": 5,
            "Linklist": {
              "Link": [
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "AY278488",
                      "IDScheme": "ENA",
                      "IDURL": "http://identifiers.org/ebi/ena.embl:AY278488"
                    },
                    "Title": "AY278488",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                },
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "MT121216",
                      "IDScheme": "ENA",
                      "IDURL": "http://identifiers.org/ebi/ena.embl:MT121216"
                    },
                    "Title": "MT121216",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                },
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "KF367457",
                      "IDScheme": "ENA",
                      "IDURL": "http://identifiers.org/ebi/ena.embl:KF367457"
                    },
                    "Title": "KF367457",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                },
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "MN996532",
                      "IDScheme": "ENA",
                      "IDURL": "http://identifiers.org/ebi/ena.embl:MN996532"
                    },
                    "Title": "MN996532",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                },
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "MT072864",
                      "IDScheme": "ENA",
                      "IDURL": "http://identifiers.org/ebi/ena.embl:MT072864"
                    },
                    "Title": "MT072864",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                }
              ]
            }
          }
        ]
      },
      {
        "Name": "Protein Structures",
        "NameLong": "Protein structures in PDBe",
        "CategoryLinkCount": 2,
        "Section": [
          {
            "ObtainedBy": "tm_accession",
            "Tags": [
              "supporting_data"
            ],
            "SectionLinkCount": 2,
            "Linklist": {
              "Link": [
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "6VW1",
                      "IDScheme": "PDB",
                      "IDURL": "http://identifiers.org/pdbe/pdb:6VW1"
                    },
                    "Title": "6VW1",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                },
                {
                  "ObtainedBy": "tm_accession",
                  "PublicationDate": "04-11-2022",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "References"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "MED"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "2AJF",
                      "IDScheme": "PDB",
                      "IDURL": "http://identifiers.org/pdbe/pdb:2AJF"
                    },
                    "Title": "2AJF",
                    "Publisher": {
                      "Name": "Europe PMC"
                    }
                  },
                  "Frequency": 1
                }
              ]
            }
          }
        ]
      },
      {
        "Name": "Altmetric",
        "CategoryLinkCount": 1,
        "Section": [
          {
            "ObtainedBy": "ext_links",
            "Tags": [
              "altmetrics"
            ],
            "SectionLinkCount": 1,
            "Linklist": {
              "Link": [
                {
                  "ObtainedBy": "ext_links",
                  "PublicationDate": "15-10-2020",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "IsReferencedBy"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "PMID"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "https://www.altmetric.com/details/91880755",
                      "IDScheme": "URL",
                      "IDURL": "https://www.altmetric.com/details/91880755"
                    },
                    "Title": "Characteristics of SARS-CoV-2 and COVID-19",
                    "Publisher": {
                      "Name": "Altmetric"
                    },
                    "ImageURL": "https://api.altmetric.com/v1/donut/91880755_64.png"
                  }
                }
              ]
            }
          }
        ]
      },
      {
        "Name": "BioStudies: supplemental material and supporting data",
        "CategoryLinkCount": 1,
        "Section": [
          {
            "ObtainedBy": "ext_links",
            "Tags": [
              "supporting_data"
            ],
            "SectionLinkCount": 1,
            "Linklist": {
              "Link": [
                {
                  "ObtainedBy": "ext_links",
                  "PublicationDate": "11-03-2021",
                  "LinkProvider": {
                    "Name": "Europe PMC"
                  },
                  "RelationshipType": {
                    "Name": "IsReferencedBy"
                  },
                  "Source": {
                    "Type": {
                      "Name": "literature"
                    },
                    "Identifier": {
                      "ID": "33024307",
                      "IDScheme": "PMID"
                    }
                  },
                  "Target": {
                    "Type": {
                      "Name": "dataset"
                    },
                    "Identifier": {
                      "ID": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true",
                      "IDScheme": "URL",
                      "IDURL": "http://www.ebi.ac.uk/biostudies/studies/S-EPMC7537588?xr=true"
                    },
                    "Title": "Characteristics of SARS-CoV-2 and COVID-19.",
                    "Publisher": {
                      "Name": "BioStudies: supplemental material and supporting data"
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    ]
  }
 }
 ```
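
The response above nests every link under `dataLinkList` → `Category` → `Section` → `Linklist` → `Link`. A minimal Python sketch of how such a response could be flattened into one tuple per link follows. The function names are hypothetical, only keys visible in the example are used, and no HTTP request is made: the sample is a trimmed copy of one link from the example.

```python
from typing import Any, Dict, Iterator, Tuple

# URL template taken from the curl example; {pmid} is the PubMed identifier.
DATALINKS_URL = (
    "https://www.ebi.ac.uk/europepmc/webservices/rest/MED/{pmid}/datalinks?format=json"
)


def datalinks_url(pmid: str) -> str:
    """Build the datalinks request URL for one PubMed identifier."""
    return DATALINKS_URL.format(pmid=pmid)


def iter_links(response: Dict[str, Any]) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (source_id, relation, target_scheme, target_id) for every link."""
    categories = response.get("dataLinkList", {}).get("Category", [])
    for category in categories:
        for section in category.get("Section", []):
            for link in section.get("Linklist", {}).get("Link", []):
                yield (
                    link["Source"]["Identifier"]["ID"],
                    link["RelationshipType"]["Name"],
                    link["Target"]["Identifier"]["IDScheme"],
                    link["Target"]["Identifier"]["ID"],
                )


# Trimmed copy of one "Protein Structures" link from the example response.
sample = {
    "dataLinkList": {
        "Category": [
            {
                "Name": "Protein Structures",
                "Section": [
                    {
                        "Linklist": {
                            "Link": [
                                {
                                    "RelationshipType": {"Name": "References"},
                                    "Source": {
                                        "Identifier": {"ID": "33024307", "IDScheme": "MED"}
                                    },
                                    "Target": {
                                        "Identifier": {"ID": "6VW1", "IDScheme": "PDB"}
                                    },
                                }
                            ]
                        }
                    }
                ],
            }
        ]
    }
}

links = list(iter_links(sample))
```

Using `.get` with empty defaults at the container levels means a publication without any data links produces an empty result instead of raising an error.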
 ## Mapping
 The table below describes the mapping from the EBI links records to the OpenAIRE Graph dump format.
--- a/docs/data-provision/aggregation/pubmed.md
+++ b/docs/data-provision/aggregation/pubmed.md
@@ -9,7 +9,8 @@ It contains XML records compliant with the schema available at https://www.nlm.n
 ## Incremental harvesting
 Pubmed exposes an FTP entry point with the update files: [ftp baseline update](https://ftp.ncbi.nlm.nih.gov/pubmed/updatefiles/). We collect the new files and generate the new dataset by upserting the existing items.
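
The upsert step can be illustrated with a small Python sketch. It assumes that records have already been parsed into dictionaries and are keyed by a `pmid` field; both the record shape and the function name are hypothetical, and parsing the XML update files is not shown.

```python
from typing import Dict, Iterable

Record = Dict[str, str]


def upsert(
    dataset: Dict[str, Record], updates: Iterable[Record], key: str = "pmid"
) -> Dict[str, Record]:
    """Insert new records and overwrite existing ones that share the same key."""
    merged = dict(dataset)  # copy, so the input dataset is left untouched
    for record in updates:
        merged[record[key]] = record
    return merged


existing = {"1": {"pmid": "1", "title": "old title"}}
update_file = [
    {"pmid": "1", "title": "revised title"},  # replaces the stored record
    {"pmid": "2", "title": "new record"},  # inserted as a new record
]
current = upsert(existing, update_file)
```

When several update files are applied in sequence, the last record seen for a given key is the one that remains in the dataset.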
-## Mapping
+
 ## Entity Mapping
 The table below describes the mapping from the XML baseline records to the OpenAIRE Graph dump format.