From fb5e6cd81414eef5a83689b4e7427d7c24c5f45f Mon Sep 17 00:00:00 2001
From: Claudio Atzori
Date: Thu, 9 Mar 2023 15:00:45 +0100
Subject: [PATCH] integrating Scholexplorer Bio Entity Datasource documentation (PR#48)

---
 README.md | 3 ++
 docs/changelog.md | 3 ++
 .../non-compatible-sources/uniprot.md | 39 +++++++++----------
 sidebars.js | 3 +-
 versioned_docs/version-5.2.0/changelog.md | 3 ++
 .../data-model/pids-and-identifiers.md | 9 ++++-
 .../non-compatible-sources/uniprot.md | 31 +++++++++++++++
 .../version-5.2.0-sidebars.json | 5 +++
 8 files changed, 73 insertions(+), 23 deletions(-)
 create mode 100644 versioned_docs/version-5.2.0/data-provision/aggregation/non-compatible-sources/uniprot.md

diff --git a/README.md b/README.md
index 90cc01b..690e006 100644
--- a/README.md
+++ b/README.md
@@ -60,3 +60,6 @@ When tagging a new version, the document versioning mechanism will:
 * Copy the full `docs/` folder contents into a new `versioned_docs/version-/` folder.
 * Create a versioned sidebars file based from your current sidebar configuration, saved as `versioned_sidebars/version--sidebars.json`.
 * Append the new version number to `versions.json`.
+
+Therefore, when previewing the compiled site locally with `npm run start`, make sure to view the `Next` version in the browser, as it shows the changes under `/docs`.
+To change documentation that was already versioned, edit the source files in the `versioned_docs/version-/` folder.
diff --git a/docs/changelog.md b/docs/changelog.md
index c689819..c874cb7 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -26,6 +26,9 @@ _Start Date: 2023-02-13 • Release Date: 2023-03-01 • Dump release: **n
 - Revised SDG classification: improved coverage (+600K classified DOIs)
 - General increase of the funded scientific outputs, thanks to the full text mining scanning new OpenAccess publications
+- Integrated contents from
+  - [EMBL-EBI's Protein Data Bank in Europe](/data-provision/aggregation/non-compatible-sources/ebi)
+  - [UniProtKB/Swiss-Prot](/data-provision/aggregation/non-compatible-sources/uniprot)

 #### Changed

diff --git a/docs/data-provision/aggregation/non-compatible-sources/uniprot.md b/docs/data-provision/aggregation/non-compatible-sources/uniprot.md
index e753084..47fc7bc 100644
--- a/docs/data-provision/aggregation/non-compatible-sources/uniprot.md
+++ b/docs/data-provision/aggregation/non-compatible-sources/uniprot.md
@@ -1,32 +1,31 @@
 # UniProtKB/Swiss-Prot

-this section describes the mapping implemented for [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
-The whole dump can be downloaded by [here](https://www.uniprot.org/help/downloadss) the Reviewed (Swiss-Prot).
-
-From this Dump we extract only the protein linked to a pubmed Publication.
+This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
+The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).

+From this dataset, only the protein records linked to a PubMed publication are extracted.

 ## Entity Mapping

 The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph dump format.
 You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt)

-| OpenAIRE Result field path | FASTA record field xpath| Notes|
-|--------------------------------|----------------------|---------|
-| **BIOEntity Mapping** | | |
-| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)`|
-| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the vaue is the text after `AC` |
-| `publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
-| `maintitle` | `LINE START WITH GN`|main title |
-| **Instance Mapping** | | |
-| `instance.type` | | `Bioentity` |
-| `type` | | `Dataset` |
-| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
-| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/`|
-| `instance.publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be YYYY-mm-dd |
+| OpenAIRE Result field path | FASTA record field xpath | Notes |
+|------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
+| **BIOEntity Mapping** | | |
+| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
+| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the value is the text after `AC` |
+| `publicationdate` | `LINE START WITH DT containing text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
+| `maintitle` | `LINE START WITH GN` | main title |
+| **Instance Mapping** | | |
+| `instance.type` | | `Bioentity` |
+| `type` | | `Dataset` |
+| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
+| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
+| `instance.publicationdate` | `LINE START WITH DT containing text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |


 ### Relation Mapping
-| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
-|----------------------------------------|---------------------|--------------------------------------------------------------------------|
-| `IsRelatedTo` | `LINE START WITH RX` | we create relationships between the BioEntity and the pubmed or DOI generating an unresolved target identifier |
\ No newline at end of file
+| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
+|----------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------|
+| `IsRelatedTo` | `LINE START WITH RX` | the mapping creates relationships between the BioEntity and the PubMed ID or DOI, generating an unresolved target identifier |
\ No newline at end of file
diff --git a/sidebars.js b/sidebars.js
index 0f933aa..a5c625c 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -88,8 +88,7 @@ const sidebars = {
                 { type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/pubmed' },
                 { type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/datacite' },
                 { type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },
-                { type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/uniprot', label: 'UniProtKB/Swiss-Prot' },
-
+                { type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/uniprot', label: 'UniProtKB/Swiss-Prot' }
             ]
         }
     ]
diff --git a/versioned_docs/version-5.2.0/changelog.md b/versioned_docs/version-5.2.0/changelog.md
index c689819..c874cb7 100644
--- a/versioned_docs/version-5.2.0/changelog.md
+++ b/versioned_docs/version-5.2.0/changelog.md
@@ -26,6 +26,9 @@ _Start Date: 2023-02-13 • Release Date: 2023-03-01 • Dump release: **n
 - Revised SDG classification: improved coverage (+600K classified DOIs)
 - General increase of the funded scientific outputs, thanks to the full text mining scanning new OpenAccess publications
+- Integrated contents from
+  - [EMBL-EBI's Protein Data Bank in Europe](/data-provision/aggregation/non-compatible-sources/ebi)
+  - [UniProtKB/Swiss-Prot](/data-provision/aggregation/non-compatible-sources/uniprot)

 #### Changed

diff --git a/versioned_docs/version-5.2.0/data-model/pids-and-identifiers.md b/versioned_docs/version-5.2.0/data-model/pids-and-identifiers.md
index be6731b..de31912 100644
--- a/versioned_docs/version-5.2.0/data-model/pids-and-identifiers.md
+++ b/versioned_docs/version-5.2.0/data-model/pids-and-identifiers.md
@@ -18,6 +18,10 @@ Such a policy defines a list of data sources that are considered authoritative f
 | doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
 | pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
 | arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
+| uniprot | [Protein Data Bank](http://www.pdb.org/) |
+| ena | [Protein Data Bank](http://www.pdb.org/) |
+| pdb | [Protein Data Bank](http://www.pdb.org/) |
+

 There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
 In all other cases, PIDs are be included in the graph as alternate Identifiers.
@@ -63,12 +67,15 @@ When the record is collected from a source which is not authoritative for any ty
 Currently, the following data sources are used as "PID authorities":

 | PID Type | Prefix (12 chars) | Authority |
-|-----------|------------------------|-----------------------------------------|
+|-----------|------------------------|-------------------------------------------|
 | doi | `doi_________` | Crossref, Datacite, Zenodo |
 | pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
 | pmid | `pmid________` | Europe PubMed Central, PubMed Central |
 | arXiv | `arXiv_______` | arXiv.org e-Print Archive |
 | handle | `handle______` | any repository |
+| ena | `ena_________` | EMBL-EBI |
+| pdb | `pdb_________` | EMBL-EBI |
+| uniprot | `uniprot_____` | EMBL-EBI |

 OpenAIRE also perform duplicate identification (see the [dedicated section for details](/data-provision/deduplication)).
 All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).
diff --git a/versioned_docs/version-5.2.0/data-provision/aggregation/non-compatible-sources/uniprot.md b/versioned_docs/version-5.2.0/data-provision/aggregation/non-compatible-sources/uniprot.md
new file mode 100644
index 0000000..47fc7bc
--- /dev/null
+++ b/versioned_docs/version-5.2.0/data-provision/aggregation/non-compatible-sources/uniprot.md
@@ -0,0 +1,31 @@
+# UniProtKB/Swiss-Prot
+
+This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
+The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).
+
+From this dataset, only the protein records linked to a PubMed publication are extracted.
+
+## Entity Mapping
+
+The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph dump format.
+You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt)
+
+| OpenAIRE Result field path | FASTA record field xpath | Notes |
+|------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
+| **BIOEntity Mapping** | | |
+| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
+| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the value is the text after `AC` |
+| `publicationdate` | `LINE START WITH DT containing text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
+| `maintitle` | `LINE START WITH GN` | main title |
+| **Instance Mapping** | | |
+| `instance.type` | | `Bioentity` |
+| `type` | | `Dataset` |
+| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
+| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
+| `instance.publicationdate` | `LINE START WITH DT containing text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
+
+
+### Relation Mapping
+| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
+|----------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------|
+| `IsRelatedTo` | `LINE START WITH RX` | the mapping creates relationships between the BioEntity and the PubMed ID or DOI, generating an unresolved target identifier |
\ No newline at end of file
diff --git a/versioned_sidebars/version-5.2.0-sidebars.json b/versioned_sidebars/version-5.2.0-sidebars.json
index ed84be0..382912b 100644
--- a/versioned_sidebars/version-5.2.0-sidebars.json
+++ b/versioned_sidebars/version-5.2.0-sidebars.json
@@ -128,6 +128,11 @@
               "type": "doc",
               "id": "data-provision/aggregation/non-compatible-sources/ebi",
               "label": "EMBL-EBI"
+            },
+            {
+              "type": "doc",
+              "id": "data-provision/aggregation/non-compatible-sources/uniprot",
+              "label": "UniProtKB/Swiss-Prot"
             }
           ]
         }
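
The Entity and Relation mapping tables in `uniprot.md` can be read as a small line-oriented parser over the UniProt text format. The following is a minimal Python sketch of that reading, not the OpenAIRE implementation. Several points are assumptions: the function names `pid_prefix`, `normalize_date` and `map_record` are hypothetical, `md5(id)` is taken to mean the MD5 hex digest of the accession, only the first accession on the `AC` line is used, and the `SAMPLE` record is hand-written (its entry name, date, gene name, PubMed ID and DOI are made up). The format of the unresolved target identifier is not given in the documentation, so the sketch only keeps the raw PubMed ID or DOI.

```python
import hashlib

# English month abbreviations as they appear on DT lines (avoids locale-dependent parsing).
MONTHS = {name: number for number, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# Hand-written sample in the UniProt text layout; all values except the accession are made up.
SAMPLE = """\
ID   EXAMPLE_HUMAN   Reviewed;   16 AA.
AC   A0A0C5B5G6;
DT   01-JAN-2020, integrated into UniProtKB/Swiss-Prot.
DT   01-JAN-2020, sequence version 1.
GN   Name=EXAMPLE1;
RX   PubMed=12345678; DOI=10.1000/example;
"""


def pid_prefix(pid_type, width=12):
    """Pad a PID type with underscores to the 12-character prefix, e.g. `uniprot_____`."""
    return pid_type.ljust(width, "_")


def normalize_date(raw):
    """Turn `01-JAN-2020` into `2020-01-01`."""
    day, month, year = raw.split("-")
    return f"{int(year):04d}-{MONTHS[month.upper()]:02d}-{int(day):02d}"


def map_record(text):
    """Map one text record to a dict shaped after the mapping tables."""
    accession, date, title, relations = None, None, None, []
    for line in text.splitlines():
        code, _, value = line.partition(" ")
        value = value.strip()
        if code == "AC" and accession is None:
            # Assumption: the first accession on the line is the primary one.
            accession = value.split(";")[0].strip()
        elif code == "DT" and "integrated into UniProtKB/Swiss-Prot" in value:
            date = normalize_date(value.split(",")[0].strip())
        elif code == "GN" and title is None:
            title = value.rstrip(";")
        elif code == "RX":
            for part in value.split(";"):
                key, sep, ref = part.strip().partition("=")
                if sep and key in ("PubMed", "DOI"):
                    relations.append({
                        "relClass": "IsRelatedTo",
                        "targetType": "pmid" if key == "PubMed" else "doi",
                        "target": ref,  # left unresolved, as in the Relation Mapping table
                    })
    if accession is None:
        raise ValueError("record has no AC line")
    pid = {"classid": "uniprot", "classname": "uniprot", "value": accession}
    digest = hashlib.md5(accession.encode("utf-8")).hexdigest()
    return {
        "id": f"{pid_prefix('uniprot')}::{digest}",
        "pid": pid,
        "publicationdate": date,
        "maintitle": title,
        "type": "Dataset",
        "instance": {
            "type": "Bioentity",
            "pid": pid,
            "url": "https://www.uniprot.org/uniprot/" + accession,
            "publicationdate": date,
        },
        "relations": relations,
    }


if __name__ == "__main__":
    print(map_record(SAMPLE))
```

Calling `map_record(SAMPLE)` yields one record with two `IsRelatedTo` relations, one per identifier on the `RX` line. Splitting the `RX` value on `;` is a simplification that would mis-handle a DOI containing a semicolon.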