added documentation

2023-09-28 11:02:47 +02:00 · 2023-09-28 11:02:47 +02:00 · fa1e7ce81e
parent 1f20d0ee07
commit fa1e7ce81e
2 changed files with 99 additions and 2 deletions
--- a/README.md
+++ b/README.md
@@ -1,3 +1,100 @@
 # bioentities-preprocess

 This repository contains scripts for preprocessing Bioentities datasource like protein data bank, UniProt
+
+# How it works
+# Protein Data Bank
+The Protein Data Bank (PDB) is a repository of experimentally determined three-dimensional structures of proteins, nucleic acids, and other biomolecules. The PDB is a valuable resource for scientists studying the structure and function of these molecules.
+
+This script downloads and preprocesses all entries from the PDB FTP site. It runs in three phases:
+
+1. *Downloading the files*: The script first downloads all of the files from the PDB FTP site. These files are gzip-compressed (`.ent.gz`).
+2. *Extracting metadata*: The script extracts the metadata and the related publication from the compressed files.
+3. *Validating the results*: The script checks that the results contain the minimum set of information.
+
+### Phase 1: Downloading the files
+The script downloads all the files from [this URL](https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/pdb/), recursively fetching every molecule in that folder.
+
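As a rough illustration of the download phase, the sketch below builds the URL of a single entry and reads a downloaded file. It assumes the archive's usual "divided" layout, where each entry is stored as `pdb<id>.ent.gz` in a subfolder named after the two middle characters of its ID. The names `entry_url` and `read_entry_lines` are hypothetical and are not taken from this repository.

```python
import gzip

# Base folder linked above.
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/pdb/data/structures/divided/pdb/"


def entry_url(pdb_id):
    """Build the download URL for one entry (assumes the divided layout)."""
    pdb_id = pdb_id.lower()
    if len(pdb_id) != 4:
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    # Entries are grouped by the two middle characters of the ID.
    return f"{BASE_URL}{pdb_id[1:3]}/pdb{pdb_id}.ent.gz"


def read_entry_lines(path):
    """Read a downloaded, gzip-compressed entry as a list of text lines."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return handle.read().splitlines()


print(entry_url("1ABC"))
```

The actual script may walk the folder listing instead of building URLs one by one; this only shows how a single file could be located and opened.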
+### Phase 2: Extract Metadata
+The PDB format is a plain text format that is divided into three sections:
+
+- **Header**: This section contains information about the structure, such as the name of the molecule, the experimental method used to determine the structure, and the resolution of the structure. The script reads all of its metadata from this section.
+- **Coordinate data**: This section contains the coordinates of all of the atoms in the structure.
+- **Additional information**: This section contains additional information about the structure, such as the sequence of the molecule and the list of ligands bound to the molecule.
+
+
+**Header section**
+
+The script extracts metadata from the header section of the PDB file, which is divided into several subsections:
+
+- **TITLE**: This subsection contains the name of the molecule.
+- **REMARK**: This subsection contains additional information about the structure, such as the experimental method used to determine the structure and the resolution of the structure.
+- **SOURCE**: This subsection contains information about the source of the molecule.
+- **KEYWDS**: This subsection contains keywords that can be used to search for the structure.
+- **AUTHOR**: This subsection contains the names of the authors who published the structure.
+- **JRNL**: This subsection contains the literature citation that defines the coordinate set.
+
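The header records listed above can be read with fixed-column slicing, because the PDB format is column-oriented. The following is a minimal sketch, not the repository's extractor: the function name `parse_header` is hypothetical, and the column positions (record name in columns 1-6, text from column 11, `HEADER` holding classification, deposition date and ID, `JRNL` sub-records `PMID` and `DOI` in columns 13-16) follow the published PDB file format as an assumption about what the script relies on.

```python
from collections import defaultdict


def parse_header(lines):
    """Collect selected header records from PDB-format text lines."""
    parts = defaultdict(list)
    record = {}
    for line in lines:
        name = line[:6].strip()
        if name == "HEADER":
            record["classification"] = line[10:50].strip()
            record["deposition_date"] = line[50:59].strip()
            record["pdb"] = line[62:66].strip()
        elif name in ("TITLE", "KEYWDS", "AUTHOR"):
            # Multi-line records are gathered and joined afterwards.
            parts[name].append(line[10:80].strip())
        elif name == "JRNL":
            sub = line[12:16].strip()
            if sub in ("PMID", "DOI"):
                parts[sub].append(line[19:79].strip())
        elif name in ("ATOM", "HETATM"):
            break  # coordinate data starts here; the header is over
    record["title"] = " ".join(parts["TITLE"])
    keywords = " ".join(parts["KEYWDS"]).split(",")
    record["keywords"] = [k.strip() for k in keywords if k.strip()]
    authors = "".join(parts["AUTHOR"]).split(",")
    record["authors"] = [a.strip() for a in authors if a.strip()]
    record["PMID"] = "".join(parts["PMID"]) or None
    record["DOI"] = "".join(parts["DOI"]) or None
    return record


# Made-up sample lines, padded so every field sits in its expected column.
sample = [
    "HEADER".ljust(10) + "OXYGEN TRANSPORT".ljust(40) + "13-FEB-98" + "   " + "1ABC",
    "TITLE".ljust(10) + "STRUCTURE OF AN EXAMPLE PROTEIN",
    "KEYWDS".ljust(10) + "OXYGEN TRANSPORT, HEME",
    "AUTHOR".ljust(10) + "A.AUTHOR,B.AUTHOR",
    "JRNL".ljust(12) + "PMID".ljust(7) + "12345678",
    "JRNL".ljust(12) + "DOI".ljust(7) + "10.1000/EXAMPLE",
]
record = parse_header(sample)
print(record)
```

The sample values are invented for illustration and do not describe a real entry.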
+The generated record will contain the following fields:
+
+- **classification**: the typology of the molecule
+- **pdb**: the ID of the molecule
+- **deposition_date**: the deposition date
+- **title**: name of the molecule
+- **keywords**: keywords that can be used to search for the structure
+- **authors**: the names of the authors who published the structure
+- **PMID**: the PubMed identifier of the literature citation that defines the coordinate set
+- **DOI**: the DOI of the literature citation that defines the coordinate set
+
+### Phase 3: Validation
+The script checks if the generated metadata contain the minimum set of information, such as deposition date, authors, and relation to the article.
+
+The script does this by iterating through the list of generated metadata records and checking for the following fields:
+
+- **deposition_date**: The date the protein was deposited.
+- **authors**: The names of the authors who deposited the protein.
+- **relation_to_article**: The relationship between the protein and the article.
+
+If any of these fields are missing, the script will output an error message.
+
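The validation described above amounts to checking each record for empty or absent fields. Here is a minimal sketch, assuming each record is a dictionary that uses the three field names listed; the helper names `missing_fields` and `validate` are hypothetical, and the README does not say how `relation_to_article` is derived from the PMID or DOI.

```python
# Field names taken from the list above.
REQUIRED_FIELDS = ("deposition_date", "authors", "relation_to_article")


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in required if not record.get(field)]


def validate(records):
    """Return one error message per record that lacks a required field."""
    errors = []
    for record in records:
        missing = missing_fields(record)
        if missing:
            entry = record.get("pdb", "<unknown>")
            errors.append(f"{entry}: missing {', '.join(missing)}")
    return errors


# Made-up records: the first is complete, the second is not.
records = [
    {"pdb": "1ABC", "deposition_date": "13-FEB-98",
     "authors": ["A.AUTHOR"], "relation_to_article": "12345678"},
    {"pdb": "2XYZ", "deposition_date": "", "authors": []},
]
for message in validate(records):
    print(message)
```

Returning the messages rather than printing them inside the check keeps the function easy to reuse, for example to write a report file.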
+# UniProt
+This script downloads all proteins from the UniProt FTP site. It then converts the proteins into a new data model that contains the following information:
+
+- **PID**: Accession number of the protein
+- **title**: The name of the protein
+- **dates**: The relevant dates for the protein, such as the deposition or update date
+- **organism_species**: The species from which the protein was extracted
+- **references**: The references to the publication that cites the protein
+
+## Preprocessing
+
+The preprocessing phase of the script converts the protein files from FASTA to JSON format. The script also cleans the data by removing any unnecessary information.
+
+### Data model
+
+The data model for the protein data is as follows:
+```
+class Protein:
+    def __init__(self, pid, title, dates, organism_species, references):
+        self.pid = pid
+        self.title = title
+        self.dates = dates
+        self.organism_species = organism_species
+        self.references = references
+```
+The pid field is the accession number of the protein. The title field is the name of the protein. The dates field is a list of dates for the protein, such as the deposition or update date. The organism_species field is the species from which the protein was extracted. The references field is a list of references to the publication that cites the protein.
+
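To show how a record in this data model might end up as JSON, here is a small sketch. It repeats the `Protein` class from above so that it runs on its own; `protein_to_json` is a hypothetical helper, and the sample values are invented.

```python
import json


class Protein:
    def __init__(self, pid, title, dates, organism_species, references):
        self.pid = pid
        self.title = title
        self.dates = dates
        self.organism_species = organism_species
        self.references = references


def protein_to_json(protein):
    """Serialize the instance attributes of a Protein to a JSON string."""
    return json.dumps(vars(protein), sort_keys=True)


example = Protein(
    pid="P12345",
    title="Example protein",
    dates=["2023-01-01"],
    organism_species="Homo sapiens",
    references=[],
)
print(protein_to_json(example))
```

`vars()` works here because every attribute is a plain string or list; nested objects would need their own conversion step.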
+
+# Requirements
+The scripts require Python 3 and a set of libraries that you can install with:
+
+```
+pip install -r requirements.txt
+```
+
+# Run scripts
+
+You can run the scripts that download and preprocess the Protein Data Bank and UniProt data with the following command:
+```
+python main.py
+```
+
+The script will generate two folders:
+- **pdb_metadata**: preprocessed molecules from the Protein Data Bank
+- **uniprot_metadata**: preprocessed proteins from UniProt
+
--- a/pdb/pdb_metadata_extractor.py
+++ b/pdb/pdb_metadata_extractor.py
@@ -49,7 +49,7 @@ class MetadataExctractor:
            return(json.dumps(p))
        

-    def extract_metadata(self, input_path="download", output_path="metadata"):
+    def extract_metadata(self, input_path="download", output_path="pdb_metadata"):
        if (os.path.exists(output_path) and os.path.isdir(output_path)):
            shutil.rmtree(output_path)
        os.mkdir(output_path)