integrating Scholexplorer Bio Entity Datasource documentation (PR#48)

This commit is contained in:
Claudio Atzori 2023-03-09 15:00:45 +01:00
parent 49fc73d09d
commit fb5e6cd814
8 changed files with 73 additions and 23 deletions

View File

@ -60,3 +60,6 @@ When tagging a new version, the document versioning mechanism will:
* Copy the full `docs/` folder contents into a new `versioned_docs/version-<versionName>/` folder.
* Create a versioned sidebars file based from your current sidebar configuration, saved as `versioned_sidebars/version-<versionName>-sidebars.json`.
* Append the new version number to `versions.json`.
Therefore, when previewing the compiled site locally with `npm run start`, ensure to visualise the `Next` version on the browser as it shows the changes under `/docs`.
To change a version that was already versioned, the source files to be modified are in the `versioned_docs/version-<versionName>/` folder.

View File

@ -26,6 +26,9 @@ _Start Date: 2023-02-13 &bull; Release Date: 2023-03-01 &bull; Dump release: **n
- Revised SDG classification: improved coverage (+600K classified DOIs)
- General increase of the funded scientific outputs, thanks to the full text mining scanning new OpenAccess publications
- Integrated contents from
- [EMBL-EBIs Protein Data Bank in Europe](/data-provision/aggregation/non-compatible-sources/ebi)
- [UniProtKB/Swiss-Prot](/data-provision/aggregation/non-compatible-sources/uniprot)
#### Changed

View File

@ -1,32 +1,31 @@
# UniProtKB/Swiss-Prot
this section describes the mapping implemented for [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
The whole dump can be downloaded by [here](https://www.uniprot.org/help/downloadss) the Reviewed (Swiss-Prot).
From this Dump we extract only the protein linked to a pubmed Publication.
This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).
From this dataset, only the protein records linked to a PubMed publication are extracted.
## Entity Mapping
The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph dump format.
You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt)
| OpenAIRE Result field path | FASTA record field xpath| Notes|
|--------------------------------|----------------------|---------|
| **BIOEntity Mapping** | | |
| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)`|
| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the vaue is the text after `AC` |
| `publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `LINE START WITH GN`|main title |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/`|
| `instance.publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be YYYY-mm-dd |
| OpenAIRE Result field path | FASTA record field xpath | Notes |
|------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| **BIOEntity Mapping** | | |
| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the vaue is the text after `AC` |
| `publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `LINE START WITH GN` | main title |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
| `instance.publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be YYYY-mm-dd |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|---------------------|--------------------------------------------------------------------------|
| `IsRelatedTo` | `LINE START WITH RX` | we create relationships between the BioEntity and the pubmed or DOI generating an unresolved target identifier |
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------|
| `IsRelatedTo` | `LINE START WITH RX` | the mapping creates relationships between the BioEntity and the PubMed or DOI generating an unresolved target identifier |

View File

@ -88,8 +88,7 @@ const sidebars = {
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/pubmed' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/datacite' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/ebi', label: 'EMBL-EBI' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/uniprot', label: 'UniProtKB/Swiss-Prot' },
{ type: 'doc', id: 'data-provision/aggregation/non-compatible-sources/uniprot', label: 'UniProtKB/Swiss-Prot' }
]
}
]

View File

@ -26,6 +26,9 @@ _Start Date: 2023-02-13 &bull; Release Date: 2023-03-01 &bull; Dump release: **n
- Revised SDG classification: improved coverage (+600K classified DOIs)
- General increase of the funded scientific outputs, thanks to the full text mining scanning new OpenAccess publications
- Integrated contents from
- [EMBL-EBIs Protein Data Bank in Europe](/data-provision/aggregation/non-compatible-sources/ebi)
- [UniProtKB/Swiss-Prot](/data-provision/aggregation/non-compatible-sources/uniprot)
#### Changed

View File

@ -18,6 +18,10 @@ Such a policy defines a list of data sources that are considered authoritative f
| doi | [Crossref](https://www.crossref.org), [Datacite](https://datacite.org) |
| pmc, pmid | [Europe PubMed Central](https://europepmc.org/), [PubMed Central](https://www.ncbi.nlm.nih.gov/pmc) |
| arXiv | [arXiv.org e-Print Archive](https://arxiv.org/) |
| uniprot | [Protein Data Bank](http://www.pdb.org/) |
| ena | [Protein Data Bank](http://www.pdb.org/) |
| pdb | [Protein Data Bank](http://www.pdb.org/) |
There is an exception though: Handle(s) are minted by several repositories; as listing them all would not be a viable option, to avoid losing them as PIDs, Handles bypass the PID authority filtering rule.
In all other cases, PIDs are be included in the graph as alternate Identifiers.
@ -63,12 +67,15 @@ When the record is collected from a source which is not authoritative for any ty
Currently, the following data sources are used as "PID authorities":
| PID Type | Prefix (12 chars) | Authority |
|-----------|------------------------|-----------------------------------------|
|-----------|------------------------|-------------------------------------------|
| doi | `doi_________` | Crossref, Datacite, Zenodo |
| pmc | `pmc_________` | Europe PubMed Central, PubMed Central |
| pmid | `pmid________` | Europe PubMed Central, PubMed Central |
| arXiv | `arXiv_______` | arXiv.org e-Print Archive |
| handle | `handle______` | any repository |
| ena | `ena_________` | EMBL-EBI |
| pdb | `pdb_________` | EMBL-EBI |
| uniprot | `uniprot_____` | EMBL-EBI |
OpenAIRE also perform duplicate identification (see the [dedicated section for details](/data-provision/deduplication)).
All duplicates are **merged** together in a **representative record** which must be assigned a dedicated OpenAIRE identifier (i.e. it cannot have the identifier of one of the aggregated record).

View File

@ -0,0 +1,31 @@
# UniProtKB/Swiss-Prot
This section describes the mapping implemented to integrate metadata and links from [UniProtKB/Swiss-Prot](https://www.uniprot.org/).
The complete data dump "Reviewed (Swiss-Prot)" can be downloaded from [here](https://www.uniprot.org/help/downloads).
From this dataset, only the protein records linked to a PubMed publication are extracted.
## Entity Mapping
The table below describes the mapping from the TEXT metadata format to the OpenAIRE Graph dump format.
You can check an example of the text metadata [here](https://rest.uniprot.org/uniprotkb/A0A0C5B5G6.txt)
| OpenAIRE Result field path | FASTA record field xpath | Notes |
|------------------------------|--------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| **BIOEntity Mapping** | | |
| `id` | `LINE Starts with AC` | id in the form `uniprot_____::md5(id)` |
| `pid` | `LINE Starts with AC` | example `AC A0A0C5B5G6;` classid=classname=`uniprot` the vaue is the text after `AC` |
| `publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be `YYYY-mm-dd` |
| `maintitle` | `LINE START WITH GN` | main title |
| **Instance Mapping** | | |
| `instance.type` | | `Bioentity` |
| `type` | | `Dataset` |
| `instance.pid` | `LINE Starts with AC` | `classid = classname = uniprot` |
| `instance.url` | `pid` | prepend to the value `https://www.uniprot.org/uniprot/` |
| `instance.publicationdate` | `LINE START WITH DT containg text integrated into UniProtKB/Swiss-Prot` | clean and normalize the format of the date to be YYYY-mm-dd |
### Relation Mapping
| OpenAIRE Relation Semantic and inverse | Source/Target type | Notes |
|----------------------------------------|----------------------|--------------------------------------------------------------------------------------------------------------------------|
| `IsRelatedTo` | `LINE START WITH RX` | the mapping creates relationships between the BioEntity and the PubMed or DOI generating an unresolved target identifier |

View File

@ -128,6 +128,11 @@
"type": "doc",
"id": "data-provision/aggregation/non-compatible-sources/ebi",
"label": "EMBL-EBI"
},
{
"type": "doc",
"id": "data-provision/aggregation/non-compatible-sources/uniprot",
"label": "UniProtKB/Swiss-Prot"
}
]
}