PyCom

Warning

The documentation site is a work in progress.

Quick Summary

pycom.interface.PyCom.find([...])

Find proteins in the database that match the given criteria.

pycom.interface.PyCom.load_matrices(df[, ...])

Only for PyComLocal: Load the coevolution matrices into memory.

pycom.interface.PyCom.paginate(df, page[, ...])

Only for PyComLocal: Paginate a DataFrame that is generated by PyCom.find().

pycom.interface.PyCom.get_data_loader()

Returns the PyComDataLoader object that is used to load additional data into the dataframe.

pycom.interface.PyCom.get_biological_process_list()

Retrieves the list of all biological processes in the database.

pycom.interface.PyCom.get_cellular_component_list()

Retrieves the list of all cellular components in the database.

pycom.interface.PyCom.get_cofactor_list()

Retrieves the list of all cofactors in the database.

pycom.interface.PyCom.get_disease_list()

Retrieves the list of all diseases in the database.

pycom.interface.PyCom.get_developmental_stage_list()

Retrieves the list of all developmental stages in the database.

pycom.interface.PyCom.get_domain_list()

Retrieves the list of all domains in the database.

pycom.interface.PyCom.get_ligand_list()

Retrieves the list of all ligands in the database.

pycom.interface.PyCom.get_molecular_function_list()

Retrieves the list of all molecular functions in the database.

pycom.interface.PyCom.get_organism_list()

Retrieves the list of all organisms in the database.

pycom.interface.PyCom.get_ptm_list()

Retrieves the list of all post-translational modifications in the database.

Documentation

class pycom.interface.PyCom(db_path: str | None = None, mat_path: str | None = None, remote: bool = False)
abstract find(constraint_dict: dict | None = None, /, *, uniprot_id: str | None = None, sequence: str | None = None, min_length: int | None = None, max_length: int | None = None, min_helix: float | None = None, max_helix: float | None = None, min_turn: float | None = None, max_turn: float | None = None, min_strand: float | None = None, max_strand: float | None = None, organism_id: str | None = None, organism: str | None = None, cath: str | None = None, enzyme: str | None = None, has_substrate: bool | None = None, has_ptm: bool | None = None, has_pdb: bool | None = None, disease: str | None = None, disease_id: str | None = None, has_disease: bool | None = None, cofactor: str | None = None, cofactor_id: str | None = None, biological_process: str | None = None, cellular_component: str | None = None, developmental_stage: str | None = None, domain: str | None = None, ligand: str | None = None, molecular_function: str | None = None, ptm: str | None = None, page: int | None = None, per_page: int | None = None, matrix: bool | None = None, mat_format: MatrixFormat | None = None) -> DataFrame

Find proteins in the database that match the given criteria.

This function searches the database for proteins that match the given criteria. The criteria can be specified using any combination of the parameters listed below.

Use either constraint_dict or the individual parameters, not both.

Usage:
>>> from pycom import PyCom, ProteinParams
>>> pyc = PyCom(db_path='/path/on/disk/pycom.db')
Load all proteins associated with cancer:
>>> df = pyc.find(disease='cancer')
Or, equivalently, with a constraint dictionary:
>>> df = pyc.find({ProteinParams.DISEASE: 'cancer'})
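The rule that constraint_dict and the individual parameters must not be mixed can be enforced before the query is sent. The following is a minimal sketch, not part of PyCom: `checked_find` is a hypothetical wrapper name, and it assumes only that `pyc.find()` accepts the dictionary positionally and the criteria as keywords, as the signature above shows.

```python
def checked_find(pyc, constraint_dict=None, **criteria):
    """Hypothetical wrapper: call pyc.find() with exactly one calling style."""
    # Criteria left at None are treated as "not given".
    criteria = {name: value for name, value in criteria.items() if value is not None}

    # Refuse to mix the two styles, mirroring the documented rule.
    if constraint_dict is not None and criteria:
        raise ValueError("Pass either constraint_dict or keyword criteria, not both.")

    if constraint_dict is not None:
        # constraint_dict is positional-only in the documented signature.
        return pyc.find(constraint_dict)
    return pyc.find(**criteria)
```

With a real PyCom instance this would be called as `checked_find(pyc, disease='cancer')` or `checked_find(pyc, {ProteinParams.DISEASE: 'cancer'})`.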
Parameters:
  • constraint_dict – A dictionary of constraints to apply to the search {ProteinParams: value}.

  • uniprot_id – The UniProt ID of the protein.

  • sequence – The amino acid sequence of the protein to search for. (full match)

  • min_length – Minimum number of residues.

  • max_length – Maximum number of residues.

  • min_helix – Min percentage of helical structure in the protein.

  • max_helix – Max percentage of helical structure in the protein.

  • min_turn – Min percentage of turn structure in the protein.

  • max_turn – Max percentage of turn structure in the protein.

  • min_strand – Min percentage of beta strand structure in the protein.

  • max_strand – Max percentage of beta strand structure in the protein.

  • organism_id – NCBI Taxonomy ID of the genus / species of the protein. (get_organism_list())

  • organism – Taxonomic name of the genus / species of the protein. (case-insensitive, get_organism_list())

  • cath – CATH classification of the protein. (‘3.40.50.360’ or ‘3.40.*.*’ or ‘3.*’)

  • enzyme – Enzyme Commission number of the protein. (‘3.40.50.360’ or ‘3.40.*.*’ or ‘3.*’)

  • has_substrate – Whether the protein has a known substrate. (True/False)

  • has_ptm – Whether the protein has a known post-translational modification. (True/False)

  • has_pdb – Whether the protein has a known PDB structure. (True/False)

  • disease – The disease associated with the protein. (name of disease, case-insensitive [e.g. ‘cancer’])

  • disease_id – The ID of the disease associated with the protein. (‘DI-00001’, get_disease_list())

  • has_disease – Whether the protein is associated with a disease. (True/False)

  • cofactor – The cofactor associated with the protein. (name of cofactor, case-insensitive [e.g. ‘Zn(2+)’])

  • cofactor_id – The ID of the cofactor associated with the protein. (‘CHEBI:00001’, get_cofactor_list())

  • biological_process – The biological process associated with the protein. (name of process, case-insensitive, get_biological_process_list())

  • cellular_component – The cellular component associated with the protein. (name of component, case-insensitive, get_cellular_component_list())

  • developmental_stage – The developmental stage associated with the protein. (name of stage, case-insensitive, get_developmental_stage_list())

  • domain – The domain associated with the protein. (name of domain, case-insensitive, get_domain_list())

  • ligand – The ligand associated with the protein. (name of ligand, case-insensitive, get_ligand_list())

  • molecular_function – The molecular function associated with the protein. (name of function, case-insensitive, get_molecular_function_list())

  • ptm – The post-translational modification associated with the protein. (name of ptm, case-insensitive, get_ptm_list())

Specific to PyComRemote:
  • page – The page number of results to return. (1-i)

  • per_page – The number of results per page. (1-100)

  • matrix – Whether to return the coevolution matrix with the results.

  • mat_format – The format of the coevolution matrix. (MatrixFormat.NUMPY or MatrixFormat.PANDAS)
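The cath and enzyme parameters accept patterns with `*` in place of trailing levels. As an illustration of how such a pattern reads, here is a small sketch using shell-style matching from the Python standard library. This is an assumption made for illustration: `matches_classification` is a hypothetical helper, and PyCom's own matching may be implemented differently.

```python
from fnmatch import fnmatchcase


def matches_classification(code: str, pattern: str) -> bool:
    """Return True if a dotted classification code fits a '*' wildcard pattern.

    Illustration only: shell-style matching, where '*' stands for any run of
    characters. PyCom's actual matching rules are not specified here.
    """
    return fnmatchcase(code, pattern)


# A fully specified code matches itself and the broader patterns.
print(matches_classification("3.40.50.360", "3.40.50.360"))
print(matches_classification("3.40.50.360", "3.40.*.*"))
print(matches_classification("3.40.50.360", "3.*"))
# A code from a different second level does not match '3.40.*.*'.
print(matches_classification("3.30.70.100", "3.40.*.*"))
```

The first three calls report a match and the last one does not, which is the narrowing behaviour the example patterns in the parameter list suggest.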

Returns:

A pandas DataFrame containing the proteins that match the given criteria.


abstract get_biological_process_list()

Retrieves the list of all biological processes in the database.

abstract get_cellular_component_list()

Retrieves the list of all cellular components in the database.

abstract get_cofactor_list() -> DataFrame

Retrieves the list of all cofactors in the database.

abstract get_data_loader() -> PyComDataLoader

Returns the PyComDataLoader object that is used to load additional data into the dataframe.

Not implemented in PyComRemote.

Returns:

PyComDataLoader

abstract get_developmental_stage_list()

Retrieves the list of all developmental stages in the database.

abstract get_disease_list() -> DataFrame

Retrieves the list of all diseases in the database.

abstract get_domain_list()

Retrieves the list of all domains in the database.

abstract get_ligand_list()

Retrieves the list of all ligands in the database.

abstract get_molecular_function_list()

Retrieves the list of all molecular functions in the database.

abstract get_organism_list() -> DataFrame

Retrieves the list of all organisms in the database.

abstract get_ptm_list()

Retrieves the list of all post-translational modifications in the database.

abstract load_matrices(df: DataFrame, max_load: int = 1000, mat_format: MatrixFormat = <function MatrixFormat.<lambda>>) -> DataFrame

Only for PyComLocal: Load the coevolution matrices into memory.

Takes a DataFrame from PyCom.find() or PyCom.paginate() and loads the coevolution matrices into memory, into the ‘matrix’ column.

Requires the coevolution matrix file (pycom.mat) to be downloaded from https://pycom.brunel.ac.uk/downloads/

By default, this function will only load the first 1000 matrices. This can be changed by setting max_load.
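For result sets larger than the max_load limit, one option is to walk through the DataFrame page by page with PyCom.paginate() and load matrices one page at a time. The following is a minimal sketch under stated assumptions: `load_matrices_in_batches` is a hypothetical helper, not part of PyCom, and it relies only on the documented signatures of paginate() and load_matrices() and on `len(df)` giving the number of rows.

```python
import math


def load_matrices_in_batches(pyc, df, per_page=100):
    """Yield one page of df at a time with its matrices loaded.

    Hypothetical helper: pyc is expected to be a PyComLocal-style object
    offering paginate() and load_matrices() as documented.
    """
    # Number of pages needed to cover every row of df.
    n_pages = math.ceil(len(df) / per_page)

    # Pages are 1-indexed, so iterate from 1 to n_pages inclusive.
    for page in range(1, n_pages + 1):
        chunk = pyc.paginate(df, page=page, per_page=per_page)
        # Allow up to one full page of matrices per call.
        yield pyc.load_matrices(chunk, max_load=per_page)
```

Because the helper is a generator, only one page of matrices needs to be held by the caller at a time, for example in `for batch in load_matrices_in_batches(pyc, df): ...`.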

abstract static paginate(df: DataFrame, page: int, per_page: int = 100) -> DataFrame

Only for PyComLocal: Paginate a DataFrame that is generated by PyCom.find(). This is useful for using PyCom.load_matrices() on a large DataFrame.

First page is 1. By default, 100 results are returned per page; this can be changed by setting per_page.

Parameters:
  • df – The DataFrame to paginate

  • page – The page number to return

  • per_page – The number of results to return per page
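The page arithmetic described above (first page is 1, 100 results per page by default) can be illustrated with plain list slicing. This is a sketch of the indexing only, not PyCom's implementation: `paginate_rows` is a hypothetical function operating on a list rather than a DataFrame.

```python
def paginate_rows(rows, page, per_page=100):
    """Return the slice of rows that belongs to a 1-indexed page.

    Illustration of the page arithmetic only; the real paginate() works
    on a pandas DataFrame.
    """
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must both be at least 1")
    start = (page - 1) * per_page  # page 1 starts at row 0
    return rows[start:start + per_page]


rows = list(range(250))
first = paginate_rows(rows, page=1)   # rows 0..99
last = paginate_rows(rows, page=3)    # rows 200..249, a partial page
print(len(first), len(last))
```

Under this scheme a page past the end of the data is simply empty, and the last page may hold fewer than per_page rows.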