bioentities-preprocess

This repository contains the procedures for preprocessing Bioentity records from Protein Data Bank and UniProt. The preprocessing prepares the data for integration into the OpenAIRE Graph; more information about the Graph is available at https://graph.openaire.eu/docs.

How it works

Protein Data Bank

The Protein Data Bank (PDB) is a repository of experimentally determined three-dimensional structures of proteins, nucleic acids, and other biomolecules. The PDB is a valuable resource for scientists studying the structure and function of these molecules.

This script downloads and preprocesses all proteins from the PDB FTP site. The script consists of three phases:

  1. Download the files: the script first downloads all of the files from the PDB FTP site. These files are compressed in ZIP format.
  2. Extract metadata: the script extracts the metadata and the related publication from the ZIP files.
  3. Validate the results: the script checks that the results contain the minimum set of information.

Phase 1: Downloading the files

The script downloads all the files from this URL, recursively retrieving all the molecules in the folder.
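
The recursive retrieval can be sketched as follows. This is a minimal illustration, not the repository's actual code: the names `walk`, `ftp_lister`, and `download` are hypothetical, and `ftp_lister` assumes the server supports the MLSD command, which not every FTP server does.

```python
from ftplib import FTP
from typing import Callable, Iterable, Iterator, Tuple

Entry = Tuple[str, bool]  # (name, is_directory)


def walk(list_dir: Callable[[str], Iterable[Entry]], root: str) -> Iterator[str]:
    """Yield the path of every file below root, depth first."""
    for name, is_dir in list_dir(root):
        path = f"{root.rstrip('/')}/{name}"
        if is_dir:
            # Descend into the sub-folder before moving to the next entry.
            yield from walk(list_dir, path)
        else:
            yield path


def ftp_lister(ftp: FTP) -> Callable[[str], Iterable[Entry]]:
    """Adapt an open FTP connection to the list_dir interface (assumes MLSD)."""
    def list_dir(path: str) -> Iterator[Entry]:
        for name, facts in ftp.mlsd(path):
            if name in (".", ".."):
                continue  # skip the current and parent directory entries
            yield name, facts.get("type") == "dir"
    return list_dir


def download(ftp: FTP, remote_path: str, local_path: str) -> None:
    """Copy one remote file to disk in binary mode."""
    with open(local_path, "wb") as handle:
        ftp.retrbinary(f"RETR {remote_path}", handle.write)
```

Keeping the directory listing behind a plain callable means the traversal can be exercised with an in-memory folder tree, without opening a network connection.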

Phase 2: Extract Metadata

The PDB format is a plain text format that is divided into three sections:

  • Header: This section contains information about the structure, such as the name of the molecule, the experimental method used to determine the structure, and the resolution of the structure. The script extracts all of its metadata from this section.
  • Coordinate data: This section contains the coordinates of all of the atoms in the structure.
  • Additional information: This section contains additional information about the structure, such as the sequence of the molecule and the list of ligands bound to the molecule.

Header section

The script extracts metadata from the header section of the PDB file, which is divided into several subsections:

  • TITLE: This subsection contains the name of the molecule.
  • REMARK: This subsection contains additional information about the structure, such as the experimental method used to determine the structure and the resolution of the structure.
  • SOURCE: This subsection contains information about the source of the molecule.
  • KEYWDS: This subsection contains keywords that can be used to search for the structure.
  • AUTHOR: This subsection contains the names of the authors who published the structure.
  • JRNL: This subsection contains the literature citation that defines the coordinate set.

The generated record will contain the following fields:

  • classification: the typology of the molecule
  • pdb: the ID of the molecule
  • deposition_date: the deposition date
  • title: the name of the molecule
  • keywords: keywords that can be used to search for the structure
  • authors: the names of the authors who published the structure
  • PMID: the PubMed identifier of the literature citation that defines the coordinate set
  • DOI: the DOI of the literature citation that defines the coordinate set
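
As an illustration of how such a record could be assembled, here is a minimal sketch of a header parser. The function name `parse_pdb_header` is hypothetical, and the column offsets follow the fixed-width layout of the PDB file format as commonly documented (HEADER carries classification, deposition date, and ID code; JRNL carries PMID and DOI sub-records). They should be checked against the format description before being relied on, and the repository's own parser may work differently.

```python
from typing import Dict, Iterable, List, Optional


def _split_list(chunks: List[str]) -> List[str]:
    """Join continuation lines and split a comma-separated list."""
    return [item.strip() for item in " ".join(chunks).split(",") if item.strip()]


def parse_pdb_header(lines: Iterable[str]) -> Dict[str, object]:
    """Build a metadata record from the header records of a PDB-format file."""
    classification: Optional[str] = None
    pdb_id: Optional[str] = None
    deposition_date: Optional[str] = None
    pmid: Optional[str] = None
    title_parts: List[str] = []
    keyword_parts: List[str] = []
    author_parts: List[str] = []
    doi_parts: List[str] = []

    for line in lines:
        tag = line[:6].strip()
        if tag in ("ATOM", "HETATM"):
            break  # coordinate section reached, header is over
        body = line[10:80].strip()  # columns 11-80 hold the record text
        if tag == "HEADER":
            classification = line[10:50].strip()
            deposition_date = line[50:59].strip()
            pdb_id = line[62:66].strip()
        elif tag == "TITLE":
            title_parts.append(body)
        elif tag == "KEYWDS":
            keyword_parts.append(body)
        elif tag == "AUTHOR":
            author_parts.append(body)
        elif tag == "JRNL":
            sub_record = line[12:16].strip()  # e.g. AUTH, TITL, REF, PMID, DOI
            text = line[19:80].strip()
            if sub_record == "PMID":
                pmid = text
            elif sub_record == "DOI":
                doi_parts.append(text)  # a long DOI may span several lines

    return {
        "classification": classification,
        "pdb": pdb_id,
        "deposition_date": deposition_date,
        "title": " ".join(title_parts),
        "keywords": _split_list(keyword_parts),
        "authors": _split_list(author_parts),
        "PMID": pmid,
        "DOI": "".join(doi_parts) or None,
    }
```

The function takes any iterable of decoded text lines, so it can be fed an open file handle or a list of strings.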

Phase 3: Validation

The script checks if the generated metadata contain the minimum set of information, such as deposition date, authors, and relation to the article.

The script does this by iterating through the list of metadata in FASTA and checking for the following fields:

  • deposition_date: The date the protein was deposited.
  • authors: The names of the authors who deposited the protein.
  • relation_to_article: The relationship between the protein and the article.

If any of these fields is missing, the script will output an error message.
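
A check of this kind can be sketched in a few lines. The field names come from the list above; the function names and the wording of the error message are assumptions made for illustration, not the script's actual output.

```python
from typing import Dict, Iterable, List

REQUIRED_FIELDS = ("deposition_date", "authors", "relation_to_article")


def missing_fields(record: Dict[str, object]) -> List[str]:
    """Return the required fields that are absent or empty in a record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def validate(records: Iterable[Dict[str, object]]) -> List[str]:
    """Return one error message per record that lacks a required field."""
    errors = []
    for index, record in enumerate(records):
        missing = missing_fields(record)
        if missing:
            # Fall back to the position in the list when the record has no ID.
            label = record.get("pdb", index)
            errors.append(f"record {label}: missing {', '.join(missing)}")
    return errors
```

Treating empty values (an empty list of authors, an empty string) as missing is a design choice of this sketch; a stricter or looser rule may fit the real data better.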

UniProt

This script downloads all proteins from the UniProt FTP site. The script then converts the proteins into a new data model that contains the following information:

  • PID: Accession number of the protein
  • title: The name of the protein
  • dates: The relevant dates for the protein, such as the deposition or update date
  • organism_species: The species from which the protein was extracted
  • references: The references to the publications that cite the protein

Preprocessing

The preprocessing phase of the script converts the protein files from FASTA to JSON format. The script also cleans the data by removing any unnecessary information.
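
A minimal sketch of such a conversion is shown below. It assumes the usual UniProtKB FASTA header layout (`>db|accession|entry_name protein name OS=organism OX=taxon ...`) and emits one JSON object per protein. The function names are hypothetical, and because a FASTA header carries neither dates nor references, those fields are left empty here; the actual script may fill them from another source.

```python
import json
import re
from typing import Dict, Iterable, Iterator, Optional

# Assumed UniProtKB FASTA header layout; sequence lines are ignored.
HEADER_RE = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<acc>[^|]+)\|(?P<entry>\S+)\s+"
    r"(?P<name>.*?)\s+OS=(?P<os>.*?)\s+OX=\d+"
)


def parse_fasta_header(line: str) -> Optional[Dict[str, object]]:
    """Turn one FASTA header line into a record, or None if it does not match."""
    match = HEADER_RE.match(line)
    if match is None:
        return None
    return {
        "pid": match["acc"],
        "title": match["name"],
        "dates": [],        # not available in a FASTA header
        "organism_species": match["os"],
        "references": [],   # not available in a FASTA header
    }


def fasta_to_json_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield one JSON string per header line that can be parsed."""
    for line in lines:
        if line.startswith(">"):
            record = parse_fasta_header(line.rstrip("\n"))
            if record is not None:
                yield json.dumps(record)
```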

Data model

The data model for the protein data is as follows:

class Protein:
    def __init__(self, pid, title, dates, organism_species, references):
        self.pid = pid
        self.title = title
        self.dates = dates
        self.organism_species = organism_species
        self.references = references

The pid field is the accession number of the protein. The title field is the name of the protein. The dates field is a list of dates for the protein, such as the deposition or update date. The organism_species field is the species from which the protein was extracted. The references field is a list of references to the publications that cite the protein.
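
For illustration, an object of this class can be written to and read back from JSON as sketched here. The class body repeats the definition above so that the example is self-contained; the helper names `protein_to_json` and `protein_from_json` are hypothetical and are not part of the repository.

```python
import json


class Protein:
    def __init__(self, pid, title, dates, organism_species, references):
        self.pid = pid
        self.title = title
        self.dates = dates
        self.organism_species = organism_species
        self.references = references


def protein_to_json(protein: Protein) -> str:
    """Serialize the instance attributes as a JSON object."""
    return json.dumps(vars(protein), sort_keys=True)


def protein_from_json(text: str) -> Protein:
    """Rebuild a Protein from JSON produced by protein_to_json."""
    return Protein(**json.loads(text))
```

This works because the attribute names match the constructor parameters one to one; if the two ever diverge, an explicit field mapping would be needed.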

Requirements

The scripts require Python 3 and a set of libraries that you can install using:

pip install -r requirements.txt

Run scripts

You can run the scripts that download and preprocess Protein Data Bank and UniProt using the following command: python main.py

The script will generate two folders:

  • pdb_metadata: preprocessed molecules from Protein Data Bank
  • uniprot_metadata: preprocessed proteins from UniProt