This repository contains the procedures for preprocessing Bioentity records from Protein Data Bank and UniProt. https://graph.openaire.eu/docs

Claudio Atzori 733042a3d1 Update 'README.md' Minors		2023-09-28 11:34:35 +02:00
pdb	added documentation	2023-09-28 11:02:47 +02:00
uniprot	imported uniprot validation and parsing scripts	2023-09-22 14:22:20 +02:00
.gitignore	imported uniprot preprocessing scripts	2023-09-21 15:08:48 +02:00
LICENSE	Initial commit	2023-09-20 12:39:57 +02:00
README.md	Update 'README.md'	2023-09-28 11:34:35 +02:00
main.py	defined main class	2023-09-22 14:24:03 +02:00
requirements.txt	imported pdb preprocessing scripts	2023-09-20 15:16:12 +02:00

README.md

bioentities-preprocess

This repository contains the procedures for preprocessing Bioentity records from Protein Data Bank and UniProt. The preprocessing prepares the data for integration into the OpenAIRE Graph. More information about the graph is available at https://graph.openaire.eu/docs.

How it works

Protein Data Bank

The Protein Data Bank (PDB) is a repository of experimentally determined three-dimensional structures of proteins, nucleic acids, and other biomolecules. The PDB is a valuable resource for scientists studying the structure and function of these molecules.

This script downloads and preprocesses all proteins from the PDB FTP site. The script consists of three phases:

Downloading the files: The script first downloads all of the files from the PDB FTP site. These files are compressed in ZIP format.
Extracting the metadata: The script extracts the metadata and the related publication from the ZIP files.
Validating the results: The script checks that the results contain the minimum set of information.

Phase 1: Downloading the files

The script downloads all the files from this URL, recursively retrieving all the molecules in the folder.
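The recursive traversal described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: `walk_remote`, the `list_dir` callable and the fake folder tree are hypothetical names introduced here. A real script would back `list_dir` with an FTP or HTTP directory listing and download each yielded path.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# One directory entry: (name, is_directory).
Entry = Tuple[str, bool]


def walk_remote(list_dir: Callable[[str], Iterable[Entry]], root: str) -> Iterator[str]:
    """Yield the path of every file found below root.

    list_dir is a hypothetical callable that returns the entries of one
    remote folder; it is injected so the traversal can run without a network.
    """
    pending: List[str] = [root.rstrip("/")]
    while pending:
        folder = pending.pop()
        for name, is_dir in list_dir(folder):
            path = f"{folder}/{name}"
            if is_dir:
                pending.append(path)  # visit the sub-folder later
            else:
                yield path  # a file that the script would download


# Invented folder tree standing in for the remote site (assumption).
FAKE_TREE: Dict[str, List[Entry]] = {
    "/structures": [("00", True), ("01", True)],
    "/structures/00": [("file_a.zip", False)],
    "/structures/01": [("file_b.zip", False), ("file_c.zip", False)],
}

files = sorted(walk_remote(lambda folder: FAKE_TREE.get(folder, []), "/structures"))
print(files)
```

Passing the listing function as a parameter keeps the traversal logic separate from the transport, so the same loop can be exercised against a local dictionary.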

Phase 2: Extract Metadata

The PDB format is a plain text format that is divided into three sections:

Header: This section contains information about the structure, such as the name of the molecule, the experimental method used to determine the structure, and the resolution of the structure. The script extracts all of its information from this section.
Coordinate data: This section contains the coordinates of all of the atoms in the structure.
Additional information: This section contains additional information about the structure, such as the sequence of the molecule and the list of ligands bound to the molecule.

Header section

The script extracts metadata from the header section of the PDB file, which is divided into several subsections:

TITLE: This subsection contains the name of the molecule.
REMARK: This subsection contains additional information about the structure, such as the experimental method used to determine the structure and the resolution of the structure.
SOURCE: This subsection contains information about the source of the molecule.
KEYWDS: This subsection contains keywords that can be used to search for the structure.
AUTHOR: This subsection contains the names of the authors who published the structure.
JRNL: This subsection contains the literature citation that defines the coordinate set.

The generated record will contain the following fields:

classification: the classification (type) of the molecule
pdb: the ID of the molecule
deposition_date: the deposition date
title: the name of the molecule
keywords: keywords that can be used to search for the structure
authors: the names of the authors who published the structure
PMID: the PubMed identifier of the literature citation that defines the coordinate set
DOI: the DOI of the literature citation that defines the coordinate set
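As an illustration of how these fields could be read from the header records, here is a minimal sketch that relies on the fixed-column layout of the PDB text format (record name in columns 1-6, values from column 11 onward, author list stored in AUTHOR records). The function name `parse_pdb_header` and the sample lines are invented for this example; the repository's own parser may work differently.

```python
def _split_list(chunks):
    """Join continuation lines and split a comma-separated list."""
    return [item.strip() for item in " ".join(chunks).split(",") if item.strip()]


def parse_pdb_header(lines):
    """Build a record with the fields listed above from PDB header lines."""
    record = {"classification": None, "pdb": None, "deposition_date": None,
              "title": "", "keywords": [], "authors": [], "PMID": None, "DOI": None}
    title, keywords, authors = [], [], []
    for line in lines:
        tag = line[:6].strip()
        if tag == "HEADER":
            record["classification"] = line[10:50].strip()   # columns 11-50
            record["deposition_date"] = line[50:59].strip()  # columns 51-59
            record["pdb"] = line[62:66].strip()              # columns 63-66
        elif tag == "TITLE":
            title.append(line[10:80].strip())
        elif tag == "KEYWDS":
            keywords.append(line[10:79].strip())
        elif tag == "AUTHOR":
            authors.append(line[10:79].strip())
        elif tag == "JRNL":
            sub_record = line[12:16].strip()                 # columns 13-16
            value = line[19:79].strip()                      # columns 20-79
            if sub_record in ("PMID", "DOI"):
                record[sub_record] = (record[sub_record] or "") + value
        elif tag in ("ATOM", "HETATM"):
            break  # coordinate section reached, the header is over
    record["title"] = " ".join(title)
    record["keywords"] = _split_list(keywords)
    record["authors"] = _split_list(authors)
    return record


# Invented sample header (not a real PDB entry), built with explicit padding
# so that every value sits in the expected columns.
SAMPLE_LINES = [
    "HEADER    " + "HYDROLASE".ljust(40) + "01-JAN-20" + "   " + "1ABC",
    "TITLE     EXAMPLE STRUCTURE OF A PROTEIN AT 2.0 ANGSTROMS",
    "TITLE    2 RESOLUTION",
    "KEYWDS    HYDROLASE, EXAMPLE",
    "AUTHOR    A.AUTHOR,B.AUTHOR",
    "JRNL".ljust(12) + "PMID".ljust(7) + "12345678",
    "JRNL".ljust(12) + "DOI".ljust(7) + "10.0000/example",
]

record = parse_pdb_header(SAMPLE_LINES)
print(record)
```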

Phase 3: Validation

The script checks if the generated metadata contain the minimum set of information, such as deposition date, authors, and relation to the article.

The script does this by iterating through the list of metadata in FASTA and checking for the following fields:

deposition_date: The date the protein was deposited.
authors: The names of the authors who deposited the protein.
relation_to_article: The relationship between the protein and the article.

If any of these fields are missing, the script will output an error message.
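A minimal sketch of such a check is shown below. It assumes that each record is a dictionary keyed by the three field names above and that an empty value counts as missing; the function names and the sample records are hypothetical.

```python
REQUIRED_FIELDS = ("deposition_date", "authors", "relation_to_article")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def validate(records):
    """Return one error message per record that lacks a required field."""
    errors = []
    for index, record in enumerate(records):
        missing = missing_fields(record)
        if missing:
            identifier = record.get("pdb", "unknown id")
            errors.append(f"record {index} ({identifier}): missing {', '.join(missing)}")
    return errors


# Invented records used only to exercise the check.
sample_records = [
    {"pdb": "1ABC", "deposition_date": "01-JAN-20",
     "authors": ["A.AUTHOR"], "relation_to_article": "10.0000/example"},
    {"pdb": "2XYZ", "deposition_date": "", "authors": ["B.AUTHOR"]},
]

errors = validate(sample_records)
for message in errors:
    print(message)
```

Collecting the messages in a list, rather than printing inside the loop, lets the caller decide whether to log them, write them to a file, or stop the run.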

UniProt

This script downloads all proteins from the UniProt FTP site. The script then converts the proteins into a new data model that contains the following information:

PID: Accession number of the protein
title: The name of the protein
dates: The relevant dates for the protein, such as the deposition or update date
organism_species: The species from which the protein was extracted
references: The references to the publications that cite the protein

Preprocessing

The preprocessing phase of the script converts the protein files from FASTA to JSON format. The script also cleans the data by removing any unnecessary information.
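The conversion step could look roughly like the sketch below, which parses a UniProt-style FASTA header line into a dictionary and serializes it as JSON. The regular expression, the function name `fasta_header_to_record` and the sample header are assumptions made for this example. The `dates` and `references` fields are left empty because the FASTA header line does not carry them, and this README does not show how the actual scripts obtain them.

```python
import json
import re

# UniProt FASTA headers follow the pattern
# >db|Accession|EntryName ProteinName OS=Organism OX=TaxonId ...
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<pid>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<title>.+?)\s+OS=(?P<organism>.+?)\s+OX=\d+"
)


def fasta_header_to_record(header_line):
    """Convert one FASTA header line into a dictionary for the data model."""
    match = HEADER_RE.match(header_line)
    if match is None:
        raise ValueError(f"unrecognized FASTA header: {header_line!r}")
    return {
        "pid": match.group("pid"),
        "title": match.group("title"),
        "dates": [],        # not available in the FASTA header (assumption)
        "organism_species": match.group("organism"),
        "references": [],   # not available in the FASTA header (assumption)
    }


# Invented header line, not a real UniProt entry.
SAMPLE_HEADER = ">sp|P00000|EXMP_HUMAN Example protein OS=Homo sapiens OX=9606 GN=EXMP PE=1 SV=1"

protein_record = fasta_header_to_record(SAMPLE_HEADER)
print(json.dumps(protein_record, indent=2))
```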

Data model

The data model for the protein data is as follows:

class Protein:
    def __init__(self, pid, title, dates, organism_species, references):
        self.pid = pid
        self.title = title
        self.dates = dates
        self.organism_species = organism_species
        self.references = references

The pid field is the accession number of the protein. The title field is the name of the protein. The dates field is a list of dates for the protein, such as the deposition or update date. The organism_species field is the species from which the protein was extracted. The references field is a list of references to the publications that cite the protein.
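As a usage illustration, the sketch below repeats the class so that it runs on its own and writes instances as one JSON object per line. The helper `to_json_line`, the one-object-per-line layout and the sample values are assumptions; the README does not specify how the scripts store their output.

```python
import io
import json


class Protein:
    def __init__(self, pid, title, dates, organism_species, references):
        self.pid = pid
        self.title = title
        self.dates = dates
        self.organism_species = organism_species
        self.references = references


def to_json_line(protein):
    """Serialize a Protein instance as a single-line JSON object."""
    return json.dumps(vars(protein), sort_keys=True)


# Invented values used only for demonstration.
example = Protein(
    pid="P00000",
    title="Example protein",
    dates=["2020-01-01"],
    organism_species="Homo sapiens",
    references=["10.0000/example"],
)

# An in-memory buffer stands in for an output file.
buffer = io.StringIO()
buffer.write(to_json_line(example) + "\n")
json_line = buffer.getvalue().strip()
print(json_line)
```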

Requirements

The scripts require Python 3 and a set of libraries that you can install using:

pip install -r requirements.txt

Run scripts

You can run the scripts, which download and preprocess the Protein Data Bank and UniProt data, using the following command: python main.py

The script will generate two folders:

pdb_metadata: preprocessed molecules from Protein Data Bank
uniprot_metadata: preprocessed proteins from UniProt