added documentation

Sandro La Bruzzo 2023-09-28 11:02:47 +02:00
parent 1f20d0ee07
commit fa1e7ce81e
2 changed files with 99 additions and 2 deletions


@@ -1,3 +1,100 @@
# bioentities-preprocess
This repository contains scripts for preprocessing Bioentities datasources such as the Protein Data Bank and UniProt.
# How it works
# Protein Data Bank
The Protein Data Bank (PDB) is a repository of experimentally determined three-dimensional structures of proteins, nucleic acids, and other biomolecules. The PDB is a valuable resource for scientists studying the structure and function of these molecules.
This script downloads and pre-processes all proteins from the PDB FTP site. The script consists of three phases:
1. *Downloading the files*: The script first downloads all of the files from the PDB FTP site. These files are compressed in ZIP format.
2. *Extracting metadata*: The script extracts the metadata and the related publication from the compressed files.
3. *Validating the results*: The script checks whether the results contain the minimum set of information.
### Phase 1: Downloading the files
The script downloads all the files from [this URL](https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/pdb/), recursively retrieving all the molecules in that folder.
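A minimal sketch of how this phase could look, assuming the HTML directory listing is crawled with `requests` and `BeautifulSoup`; the function name and output folder are illustrative assumptions, not the repository's actual code:

```python
# Illustrative sketch of the download phase; not the repository's actual code.
import os
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/pdb/"

def download_pdb_archives(url=BASE_URL, output_dir="download"):
    """Recursively walk the HTML directory listing and save every compressed entry."""
    os.makedirs(output_dir, exist_ok=True)
    listing = requests.get(url, timeout=60)
    listing.raise_for_status()
    for link in BeautifulSoup(listing.text, "html.parser").find_all("a"):
        href = link.get("href", "")
        if not href or href.startswith(("?", "/")) or href == "../":
            continue                                       # skip sort links and the parent entry
        if href.endswith("/"):
            download_pdb_archives(url + href, output_dir)  # recurse into a subfolder
        else:
            data = requests.get(url + href, timeout=60)    # one compressed molecule file
            data.raise_for_status()
            with open(os.path.join(output_dir, href), "wb") as fh:
                fh.write(data.content)
```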
### Phase 2: Extract Metadata
The PDB format is a plain text format that is divided into three sections:
- **Header**: This section contains information about the structure, such as the name of the molecule, the experimental method used to determine the structure, and the resolution of the structure. The script takes all of its metadata from this section.
- **Coordinate data**: This section contains the coordinates of all of the atoms in the structure.
- **Additional information**: This section contains additional information about the structure, such as the sequence of the molecule and the list of ligands bound to the molecule.
**Header section**
The script extracts metadata from the header section of the PDB file, which is divided into several subsections:
- **TITLE**: This subsection contains the name of the molecule.
- **REMARK**: This subsection contains additional information about the structure, such as the experimental method used to determine the structure and the resolution of the structure.
- **SOURCE**: This subsection contains information about the source of the molecule.
- **KEYWDS**: This subsection contains keywords that can be used to search for the structure.
- **AUTHOR**: This subsection contains the names of the authors who published the structure.
- **JRNL**: Literature citation that defines the coordinate set.
The generated record will contain the following fields (a parsing sketch follows this list):
- **classification**: the typology of the molecule
- **pdb**: the Id of the molecule
- **deposition_date**: the deposition date
- **title**: name of the molecule
- **keywords**: keywords that can be used to search for the structure
- **authors**: the names of the authors who published the structure
- **PMID**: the PubMed identifier of the literature citation that defines the coordinate set
- **DOI**: the DOI of the literature citation that defines the coordinate set
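The following is a hedged sketch of how such a record could be assembled from the fixed-column header records of a PDB entry; the column slices are approximate and the function is illustrative, not the repository's extractor:

```python
# Illustrative parser for the header section of a PDB entry (simplified sketch).
def parse_pdb_header(lines):
    """Build a metadata record from the header records of a PDB file."""
    record = {"classification": "", "pdb": "", "deposition_date": "",
              "title": "", "keywords": "", "authors": "", "PMID": "", "DOI": ""}
    for line in lines:
        tag = line[:6].strip()                   # record name lives in columns 1-6
        if tag == "HEADER":
            record["classification"] = line[10:50].strip()
            record["deposition_date"] = line[50:59].strip()
            record["pdb"] = line[62:66].strip()
        elif tag == "TITLE":
            record["title"] += line[10:80].strip() + " "
        elif tag == "KEYWDS":
            record["keywords"] += line[10:79].strip() + " "
        elif tag == "AUTHOR":
            record["authors"] += line[10:79].strip()
        elif tag == "JRNL":
            sub = line[12:16].strip()            # JRNL sub-record (PMID, DOI, ...)
            if sub == "PMID":
                record["PMID"] = line[19:79].strip()
            elif sub == "DOI":
                record["DOI"] = line[19:79].strip()
    return {k: v.strip() for k, v in record.items()}

# Example usage (assuming a previously downloaded and decompressed entry):
# with open("download/pdb1abc.ent") as fh:
#     metadata = parse_pdb_header(fh.readlines())
```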
### Phase 3: Validation
The script checks whether the generated metadata contain the minimum set of information, such as the deposition date, the authors, and the relation to the article.
It does this by iterating through the generated metadata records and checking for the following fields:
- **deposition_date**: The date the protein was deposited.
- **authors**: The names of the authors who deposited the protein.
- **relation_to_article**: The relationship between the protein and the article.
If any of these fields are missing, the script will output an error message.
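As an illustration only (not the repository's exact check), a record could be validated like this, using the field names listed above:

```python
# Illustrative validation step: report records missing the required fields.
REQUIRED_FIELDS = ("deposition_date", "authors", "relation_to_article")

def validate_record(record):
    """Return the list of required fields that are missing or empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]

missing = validate_record({"deposition_date": "2023-01-01", "authors": ""})
if missing:
    print(f"ERROR: record is missing required fields: {', '.join(missing)}")
```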
# UNIPROT
This script downloads all proteins from the UNIPROT FTP site. The script then converts the proteins into a new data model that contains the following information:
- **PID**: Accession number of the protein
- **title**: The name of the protein
- **dates**: The relevant dates for the protein, such as the deposition or update date
- **organism_species**: The species from which the protein was extracted
- **references**: The references to the publications that cite the protein
## Preprocessing
The preprocessing phase of the script converts the protein files from FASTA to JSON format. The script also cleans the data by removing any unnecessary information.
## Data model
The data model for the protein data is as follows:
```python
class Protein:
    """Data model for a preprocessed UniProt protein record."""

    def __init__(self, pid, title, dates, organism_species, references):
        self.pid = pid                            # accession number of the protein
        self.title = title                        # name of the protein
        self.dates = dates                        # relevant dates (deposition, update, ...)
        self.organism_species = organism_species  # species the protein was extracted from
        self.references = references              # publications that cite the protein
```
The `pid` field is the accession number of the protein, `title` is its name, `dates` is a list of relevant dates (such as the deposition or update date), `organism_species` is the species from which the protein was extracted, and `references` is the list of publications that cite the protein.
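For illustration, a record built with this class can be serialized to JSON; the values below are placeholders, not real UniProt data:

```python
# Illustrative use of the data model above: build a record and dump it as JSON.
import json

protein = Protein(
    pid="P12345",                      # placeholder accession number
    title="Example protein name",      # placeholder title
    dates=["2023-01-01"],              # e.g. deposition or last-update date
    organism_species="Homo sapiens",
    references=["PMID:0000000"],       # placeholder citation identifier
)

# vars() turns the attribute-only object into a JSON-serializable dict.
print(json.dumps(vars(protein), indent=2))
```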
# Requirements
The scripts require Python 3 and a set of libraries that you can install with:
```
pip install -r requirements.txt
```
# Run scripts
You can run the scripts that download and preprocess the Protein Data Bank and UniProt using the following command:
```
python main.py
```
The script will generate two folders:
- **pdb_metadata**: preprocessed molecules from the Protein Data Bank
- **uniprot_metadata**: preprocessed proteins from UNIPROT


@@ -49,7 +49,7 @@ class MetadataExctractor:
        return(json.dumps(p))
-    def extract_metadata(self, input_path="download", output_path="metadata"):
+    def extract_metadata(self, input_path="download", output_path="pdb_metadata"):
        if (os.path.exists(output_path) and os.path.isdir(output_path)):
            shutil.rmtree(output_path)
        os.mkdir(output_path)