integrating Scholexplorer Bio Entity Datasource documentation (PR#48)

Claudio Atzori 2023-03-09 15:00:45 +01:00
parent 49fc73d09d
commit fb5e6cd814
8 changed files with 73 additions and 23 deletions

View File

@ -60,3 +60,6 @@ When tagging a new version, the document versioning mechanism will:
* Copy the full `docs/` folder contents into a new `versioned_docs/version-<versionName>/` folder.
* Create a versioned sidebars file based on your current sidebar configuration, saved as `versioned_sidebars/version-<versionName>-sidebars.json`.
* Append the new version number to `versions.json`.
Therefore, when previewing the compiled site locally with `npm run start`, make sure to view the `Next` version in the browser, as it shows the changes under `/docs`.
To change documentation that was already versioned, edit the source files in the `versioned_docs/version-<versionName>/` folder.

View File

@ -26,6 +26,9 @@ _Start Date: 2023-02-13 &bull; Release Date: 2023-03-01 &bull; Dump release: **n
- Revised SDG classification: improved coverage (+600K classified DOIs)
- General increase of the funded scientific outputs, thanks to full-text mining of new OpenAccess publications
- Integrated contents from
  - [EMBL-EBI's Protein Data Bank in Europe](/data-provision/aggregation/non-compatible-sources/ebi)
  - [UniProtKB/Swiss-Prot](/data-provision/aggregation/non-compatible-sources/uniprot)
#### Changed

View File

@ -1,32 +1,31 @@
# UniProtKB/Swiss-Prot
This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).
From this dataset, only the protein records linked to a PubMed publication are extracted.
## Entity Mapping
The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph dump format.
You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt).
| OpenAIRE Result field path | FASTA record field xpath | Notes |
|----------------------------|--------------------------|-------|
| **BIOEntity Mapping** | | |
| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
| `pid` | `LINE Starts with AC` | e.g. `AC A0A0C5B5G6;`: the value is the text after `AC`, with `classid = classname = uniprot` |
| `publicationdate` | `LINE Starts with DT containing text "integrated into UniProtKB/Swiss-Prot"` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `LINE Starts with GN` | main title |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
| `instance.publicationdate` | `LINE Starts with DT containing text "integrated into UniProtKB/Swiss-Prot"` | clean and normalize the format of the date to be `YYYY-mm-dd` |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|--------------------|-------|
| `IsRelatedTo` | `LINE Starts with RX` | the mapping creates relationships between the BioEntity and the PubMed ID or DOI, generating an unresolved target identifier |

View File

@ -88,8 +88,7 @@ const sidebars = {
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/uniprot', label: 'UniProtKB/Swiss-Prot' }
]
}
]

View File

@ -26,6 +26,9 @@ _Start Date: 2023-02-13 &bull; Release Date: 2023-03-01 &bull; Dump release: **n
- Revised SDG classification: improved coverage (+600K classified DOIs)
- General increase of the funded scientific outputs, thanks to full-text mining of new OpenAccess publications
- Integrated contents from
  - [EMBL-EBI's Protein Data Bank in Europe](/data-provision/aggregation/non-compatible-sources/ebi)
  - [UniProtKB/Swiss-Prot](/data-provision/aggregation/non-compatible-sources/uniprot)
#### Changed

View File

@ -18,6 +18,10 @@ Such a policy defines a list of data sources that are considered authoritative f
| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
| uniprot | [Protein Data Bank](http://www.pdb.org/) |
| ena | [Protein Data Bank](http://www.pdb.org/) |
| pdb | [Protein Data Bank](http://www.pdb.org/) |
There is an exception though: Handles are minted by several repositories; as listing them all would not be a viable option, Handles bypass the PID authority filtering rule to avoid losing them as PIDs.
In all other cases, PIDs are included in the graph as alternate Identifiers.
@ -63,12 +67,15 @@ When the record is collected from a source which is not authoritative for any ty
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
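
The prefixes in the table above look like the PID type right-padded with underscores to 12 characters. The Python sketch below illustrates that reading, combined with the `prefix::md5(...)` identifier form given for UniProt records. The function names are hypothetical, and hashing the raw UTF-8 PID value is an assumption: the actual pipeline may normalise the value before hashing.

```python
import hashlib

PREFIX_LENGTH = 12  # "Prefix (12 chars)" in the table above


def pid_prefix(pid_type: str) -> str:
    """Right-pad the PID type with underscores up to 12 characters."""
    return pid_type.ljust(PREFIX_LENGTH, "_")


def pid_based_id(pid_type: str, pid_value: str) -> str:
    """Build '<prefix>::<md5 hex digest>' (assumption: the raw value is hashed)."""
    digest = hashlib.md5(pid_value.encode("utf-8")).hexdigest()
    return f"{pid_prefix(pid_type)}::{digest}"


if __name__ == "__main__":
    for pid_type in ("doi", "pmid", "arXiv", "handle", "ena", "pdb", "uniprot"):
        print(pid_type, "->", pid_prefix(pid_type))
    print(pid_based_id("uniprot", "A0A0C5B5G6"))
```

Note that `str.ljust` does not truncate, so a PID type longer than 12 characters would need separate handling.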
OpenAIRE also performs duplicate identification (see the [dedicated section for details](/data-provision/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated records).

View File

@ -0,0 +1,31 @@
# UniProtKB/Swiss-Prot
This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).
From this dataset, only the protein records linked to a PubMed publication are extracted.
## Entity Mapping
The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph dump format.
You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt).
| OpenAIRE Result field path | FASTA record field xpath | Notes |
|----------------------------|--------------------------|-------|
| **BIOEntity Mapping** | | |
| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
| `pid` | `LINE Starts with AC` | e.g. `AC A0A0C5B5G6;`: the value is the text after `AC`, with `classid = classname = uniprot` |
| `publicationdate` | `LINE Starts with DT containing text "integrated into UniProtKB/Swiss-Prot"` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `LINE Starts with GN` | main title |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
| `instance.publicationdate` | `LINE Starts with DT containing text "integrated into UniProtKB/Swiss-Prot"` | clean and normalize the format of the date to be `YYYY-mm-dd` |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|--------------------|-------|
| `IsRelatedTo` | `LINE Starts with RX` | the mapping creates relationships between the BioEntity and the PubMed ID or DOI, generating an unresolved target identifier |
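
To make the mapping above more concrete, here is a minimal Python sketch that reads a flat-text record line by line and fills the fields listed in the tables. It is not the OpenAIRE implementation. It assumes each line starts with a two-letter code followed by three spaces, the sample record is made up apart from the accession quoted in the table, the output key names are hypothetical, and the `GN` content is used unchanged as the title because the documentation does not say how it is cleaned.

```python
import hashlib

# Month abbreviations as they appear in DT lines (assumed upper-case English).
MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

# RX keys mapped to target identifier types (the type names are assumptions).
RX_TYPES = {"PubMed": "pmid", "DOI": "doi"}

# Abridged, made-up record: only the accession comes from the documentation.
SAMPLE = """\
AC   A0A0C5B5G6;
DT   01-JAN-2020, integrated into UniProtKB/Swiss-Prot.
DT   01-JAN-2019, sequence version 1.
GN   Name=EXAMPLE1;
RX   PubMed=12345678; DOI=10.1000/example;
"""


def normalize_date(raw):
    """Turn 'DD-MON-YYYY' into 'YYYY-mm-dd'."""
    day, month, year = raw.strip().split("-")
    return f"{int(year):04d}-{MONTHS[month.upper()]:02d}-{int(day):02d}"


def parse_record(text):
    """Extract the mapped fields from one flat-text record."""
    record = {"pid": None, "publicationdate": None,
              "maintitle": None, "relations": []}
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "AC" and record["pid"] is None:
            # The value is the text after AC, up to the first semicolon.
            record["pid"] = value.split(";")[0].strip()
        elif code == "DT" and "integrated into UniProtKB/Swiss-Prot" in value:
            record["publicationdate"] = normalize_date(value.split(",")[0])
        elif code == "GN" and record["maintitle"] is None:
            record["maintitle"] = value  # assumption: GN content used as-is
        elif code == "RX":
            for part in value.split(";"):
                key, sep, target = part.strip().partition("=")
                if sep and key in RX_TYPES:
                    record["relations"].append({
                        "relClass": "IsRelatedTo",
                        "targetType": RX_TYPES[key],
                        "target": target,  # unresolved target identifier
                    })
    if record["pid"] is not None:
        digest = hashlib.md5(record["pid"].encode("utf-8")).hexdigest()
        record["id"] = "uniprot_____::" + digest
        record["url"] = "https://www.uniprot.org/uniprot/" + record["pid"]
    return record


def linked_to_pubmed(record):
    """Only records linked to a PubMed publication are kept."""
    return any(r["targetType"] == "pmid" for r in record["relations"])


if __name__ == "__main__":
    parsed = parse_record(SAMPLE)
    if linked_to_pubmed(parsed):
        print(parsed)
```

Splitting the `RX` line on `;` is a simplification and would break on a DOI that itself contains a semicolon.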

View File

@ -128,6 +128,11 @@
"type": "doc", "type": "doc",
"id": "data-provision/aggregation/non-compatible-sources/ebi", "id": "data-provision/aggregation/non-compatible-sources/ebi",
"label": "EMBL-EBI" "label": "EMBL-EBI"
},
{
"type": "doc",
"id": "data-provision/aggregation/non-compatible-sources/uniprot",
"label": "UniProtKB/Swiss-Prot"
}
]
}