Getting started: PyCoM Local
PyCoM, the Python interface for PyCoMDB, can run both locally and remotely.
- Run locally for large-scale analyses. Requires 115GB of disk space for the database.
- Run remotely for small-scale analyses. No disk space required. Follow the tutorial 00_Getting_Started_Remotely.ipynb.
This is a crash course on how to use the local variant of the Python interface for the PyCoM database.
1. Installation
2. Initialise PyCom object
3. Supported query keywords
4. Paginate the results
5. Load coevolution matrices
6. Adding biological data to dataframe
More in-depth tutorials are available here: https://pycom.brunel.ac.uk/tutorials.html
Installation
Install the PyCom package:
pip3 install git+https://github.com/scdantu/pycom
Note: Requires Python 3.8 or higher
Download the pycom.db and pycom.mat files from https://pycom.brunel.ac.uk/downloads
Initialise PyCom object
Import the required classes and create a PyCom object:
[1]:
from pycom import PyCom, ProteinParams
pyc = PyCom(db_path='~/docs/pycom.db', mat_path='~/docs/pycom.mat')
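The paths above start with ~. Whether PyCom expands the home-directory shortcut itself is not stated here, so one cautious option is to expand and check both paths before creating the object. The sketch below does that; resolve_data_paths is a hypothetical helper written for this tutorial, not part of the pycom package.

```python
from pathlib import Path


def resolve_data_paths(db_path, mat_path):
    """Expand '~' in both paths and fail early if either file is missing.

    Hypothetical helper for illustration; it is not part of pycom.
    """
    resolved = []
    for raw in (db_path, mat_path):
        path = Path(raw).expanduser()
        if not path.is_file():
            raise FileNotFoundError(f"PyCoM data file not found: {path}")
        resolved.append(str(path))
    return tuple(resolved)


# Usage (requires the downloaded files and the pycom package):
# from pycom import PyCom
# db_path, mat_path = resolve_data_paths('~/docs/pycom.db', '~/docs/pycom.mat')
# pyc = PyCom(db_path=db_path, mat_path=mat_path)
```

Failing early with a clear message is useful here because the database file is large and a mistyped path is an easy mistake to make.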
Query the database
Query the database by passing a dictionary of keywords:
[2]:
entries = pyc.find({
    ProteinParams.ENZYME: '3.*.*.*',
    ProteinParams.DISEASE: 'cancer', # string search, case-insensitive
})
entries
[2]:
| | uniprot_id | neff | sequence_length | sequence | organism_id | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate | matrix |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P01111 | 12.817 | 189 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | 9606 | 0.349206 | 0.015873 | 0.227513 | 1 | 1 | 1 | None |
| 1 | P01112 | 12.841 | 189 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | 9606 | 0.317460 | 0.031746 | 0.359788 | 1 | 1 | 1 | None |
| 2 | P01116 | 12.626 | 189 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | 9606 | 0.375661 | 0.031746 | 0.328042 | 1 | 1 | 1 | None |
| 3 | P62070 | 12.754 | 204 | MAAAGWRDGSGQEKYRLVVVGGGGVGKSALTIQFIQSYFVTDYDPT... | 9606 | 0.299020 | 0.019608 | 0.220588 | 1 | 1 | 1 | None |
| 4 | Q9UNW1 | 9.554 | 487 | MLRAPGCLLRTSVAPAAALAAALLSSLARCSLLEPRDPVASSLSPY... | 9606 | 0.000000 | 0.000000 | 0.000000 | 0 | 0 | 1 | None |
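The enzyme pattern '3.*.*.*' in the query above selects EC class 3 (hydrolases) at any sub-level. As a rough illustration of what such a wildcard means, the sketch below matches EC numbers against a pattern with Python's fnmatch module. This is an assumption made for illustration only: it is not PyCoM's own matching code, which may behave differently, and the EC numbers are sample values.

```python
from fnmatch import fnmatchcase


def ec_matches(ec_number, pattern):
    """Return True if an EC number fits a wildcard pattern such as '3.*.*.*'."""
    return fnmatchcase(ec_number, pattern)


# Sample EC numbers, for illustration only.
ec_numbers = ['3.6.5.2', '3.1.3.16', '2.4.1.-', '1.3.1.3']
hydrolases = [ec for ec in ec_numbers if ec_matches(ec, '3.*.*.*')]
print(hydrolases)  # ['3.6.5.2', '3.1.3.16']
```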
Alternatively, query the database by passing keyword arguments:
[3]:
entries = pyc.find(
    cofactor='FAD', # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
)
entries
[3]:
| | uniprot_id | neff | sequence_length | sequence | organism_id | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate | matrix |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P11310 | 9.930 | 421 | MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... | 9606 | 0.517815 | 0.016627 | 0.180523 | 1 | 1 | 1 | None |
| 1 | Q658P3 | 9.677 | 488 | MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... | 9606 | 0.157787 | 0.000000 | 0.086066 | 1 | 1 | 0 | None |
| 2 | Q16795 | 10.997 | 377 | MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... | 9606 | 0.363395 | 0.037135 | 0.124668 | 1 | 1 | 0 | None |
| 3 | O95299 | 9.244 | 355 | MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... | 9606 | 0.000000 | 0.000000 | 0.000000 | 1 | 1 | 0 | None |
| 4 | P13804 | 8.627 | 333 | MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... | 9606 | 0.300300 | 0.027027 | 0.333333 | 1 | 1 | 0 | None |
Supported query keywords
- uniprot_id: The UniProt ID of the protein.
- sequence: The amino acid sequence of the protein to search for (full match).
- min_length / max_length: Min/max number of residues in the protein.
- min_helix / max_helix: Min/max percentage of helical structure in the protein.
- min_turn / max_turn: Min/max percentage of turn structure in the protein.
- min_strand / max_strand: Min/max percentage of beta strand structure in the protein.
- organism: Taxonomic name of the genus or species of the protein (case-insensitive). The species name or any parent taxonomic level can be used (see pyc.get_organism_list() for the full list).
  - Surround the name with : to get precise results: :homo: returns Homo sapiens & Homo sapiens neanderthalensis.
  - homo also returns homoeomma, thomomys, and hundreds of others.
- organism_id: Precise NCBI Taxonomy ID of the species of the protein (prefer to use organism instead).
- cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).
- enzyme: Enzyme Commission number of the protein (1.3.1.3 or 1.3.*.* or 1.*).
- has_substrate: Whether the protein has a known substrate (True/False).
- has_ptm: Whether the protein has a known post-translational modification (True/False).
- has_pdb: Whether the protein has a known PDB structure (True/False).
- disease: The disease associated with the protein (name of disease, case-insensitive, e.g. cancer). Use pyc.get_disease_list() for the full list.
  - cancer searches for Ovarian cancer, Lung cancer, …
- disease_id: The ID of the disease associated with the protein (e.g. DI-02205; use pyc.get_disease_list() for the full list).
- has_disease: Whether the protein is associated with a disease (True/False).
- cofactor: The cofactor associated with the protein (name of cofactor, case-insensitive, e.g. Zn(2+)).
- cofactor_id: The ID of the cofactor associated with the protein (e.g. CHEBI:00001; use pyc.get_cofactor_list() for the full list).
- biological_process: Biological process associated with the protein (e.g. antiviral defense; use pyc.get_biological_process_list() for the full list).
- cellular_component: Cellular component associated with the protein (e.g. nucleus; use pyc.get_cellular_component_list() for the full list).
- domain: Domain associated with the protein (e.g. zinc-finger; use pyc.get_domain_list() for the full list).
- ligand: Ligand associated with the protein (e.g. zinc; use pyc.get_ligand_list() for the full list).
- molecular_function: Molecular function associated with the protein (e.g. antioxidant activity; use pyc.get_molecular_function_list() for the full list).
- ptm: Post-translational modification associated with the protein (e.g. phosphoprotein; use pyc.get_ptm_list() for the full list).
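With this many keywords, a misspelled name is easy to miss. The sketch below checks a query dictionary against the keyword names listed above before it is passed to pyc.find. It is a hypothetical convenience helper, not part of pycom, and how pycom itself reacts to an unknown keyword is not covered here. The name has_pdb is spelled as in the has_pdb dataframe column.

```python
import difflib

# Keyword names taken from the list above (has_pdb as in the dataframe column).
SUPPORTED_KEYWORDS = {
    'uniprot_id', 'sequence', 'min_length', 'max_length',
    'min_helix', 'max_helix', 'min_turn', 'max_turn',
    'min_strand', 'max_strand', 'organism', 'organism_id',
    'cath', 'enzyme', 'has_substrate', 'has_ptm', 'has_pdb',
    'disease', 'disease_id', 'has_disease', 'cofactor', 'cofactor_id',
    'biological_process', 'cellular_component', 'domain', 'ligand',
    'molecular_function', 'ptm',
}


def check_keywords(query):
    """Raise ValueError for keyword names that are not in the documented list."""
    for name in query:
        if name not in SUPPORTED_KEYWORDS:
            close = difflib.get_close_matches(name, sorted(SUPPORTED_KEYWORDS), n=1)
            hint = f" Did you mean '{close[0]}'?" if close else ''
            raise ValueError(f"Unknown query keyword '{name}'.{hint}")
    return query


query = check_keywords({'cofactor': 'FAD', 'has_ptm': True})
# entries = pyc.find(**query)  # requires an initialised PyCom object
```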
Paginate the results
Before loading coevolution matrices, it is recommended to paginate the results, as the matrices can take up a lot of memory.
Here is an example of making a large query, then paginating the results:
[4]:
entries = pyc.find(max_length=20)
print(f'Found {len(entries)} entries with length <= 20')
page = pyc.paginate(entries, page=1, per_page=100) # get first n entries (default 100)
print(f'Found {len(page)} entries on page 1')
Found 2958 entries with length <= 20
Found 100 entries on page 1
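To process every result rather than only the first page, the pages can be visited one at a time. The sketch below shows the arithmetic with a plain-Python stand-in for pyc.paginate. It assumes pages are 1-indexed (page=1 is the first page, as in the call above) and that the last page may be shorter than per_page. The paginate function here is a local illustration, not the pycom implementation.

```python
import math


def paginate(items, page=1, per_page=100):
    """Return the 1-indexed `page` of `items`, with at most `per_page` entries."""
    if page < 1 or per_page < 1:
        raise ValueError('page and per_page must be positive')
    start = (page - 1) * per_page
    return items[start:start + per_page]


entries = list(range(2958))  # stand-in for the 2958 query results
n_pages = math.ceil(len(entries) / 100)

for page_number in range(1, n_pages + 1):
    page = paginate(entries, page=page_number, per_page=100)
    # With PyCoM, pyc.load_matrices(page) would go here, so that only
    # one page of matrices is held in memory at a time.

print(n_pages, len(page))  # 30 58
```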
Load coevolution matrices
Now the coevolution matrices can be loaded for the paginated results.
This loads them into the matrix column of the dataframe.
[5]:
pyc.load_matrices(page)
page.iloc[0].matrix # show the coevolution matrix for the first entry
[5]:
array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
0.00000000e+00],
[2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
4.54485416e-07],
[1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
2.98023224e-07],
[0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
2.23517418e-07],
[0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
0.00000000e+00]])
By default, the matrices are loaded as a numpy.ndarray. Different formats can be specified.
Here is an example of the matrices being loaded as Pandas DataFrames and 2D lists:
[2]:
from pycom import MatrixFormat
resultsNumpy = pyc.load_matrices(page, mat_format=MatrixFormat.NUMPY) # default
resultsPandas = pyc.load_matrices(page, mat_format=MatrixFormat.PANDAS)
resultsList = pyc.load_matrices(page, mat_format=MatrixFormat.LIST)
print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')
print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')
print(f'List: {type(resultsList.iloc[0].matrix)}')
Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>
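Once a matrix is available as a 2D list, a common next step is to rank residue pairs by score. The sketch below is one way to do that in plain Python, using the 5x5 matrix printed earlier. It assumes the matrix is symmetric (as that output suggests), that indices are 0-based positions in the sequence, and that a higher value means stronger coevolution. top_pairs is a hypothetical helper, not part of pycom.

```python
def top_pairs(matrix, k=3):
    """Return the k highest-scoring residue pairs (i, j, score) with i < j."""
    n = len(matrix)
    # Only the upper triangle is scanned, since the matrix is assumed symmetric.
    pairs = [(i, j, matrix[i][j]) for i in range(n) for j in range(i + 1, n)]
    # Sort by descending score; ties are broken by the residue indices.
    pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
    return pairs[:k]


# The 5x5 coevolution matrix shown earlier, written as a 2D list.
matrix = [
    [0.0, 2.16066837e-07, 1.56462193e-07, 0.0, 0.0],
    [2.16066837e-07, 0.0, 4.61935997e-07, 4.54485416e-07, 4.54485416e-07],
    [1.56462193e-07, 4.61935997e-07, 0.0, 2.98023224e-07, 2.98023224e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 0.0, 2.23517418e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07, 0.0],
]

for i, j, score in top_pairs(matrix):
    print(i, j, score)
```

For this matrix the highest-scoring pair is residues 1 and 2, followed by the tied pairs (1, 3) and (1, 4).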
Adding biological data to dataframe
This is supported in the local variant only!
PyCom contains a lot of additional protein annotation info. This is not loaded by default, but can be added if needed.
The lists of cofactors, diseases, and organisms can be loaded by calling:
[7]:
cofactors = pyc.get_cofactor_list()
diseases = pyc.get_disease_list()
organisms = pyc.get_organism_list()
cofactors
[7]:
| | cofactorId | cofactorName |
|---|---|---|
| 0 | CHEBI:597326 | pyridoxal 5'-phosphate |
| 1 | CHEBI:18420 | Mg(2+) |
| 2 | CHEBI:60240 | a divalent metal cation |
| 3 | CHEBI:30413 | heme |
| 4 | CHEBI:29105 | Zn(2+) |
| ... | ... | ... |
| 109 | CHEBI:61721 | chlorophyll b |
| 110 | CHEBI:73095 | divinyl chlorophyll a |
| 111 | CHEBI:73096 | divinyl chlorophyll b |
| 112 | CHEBI:57453 | (6S)-5,6,7,8-tetrahydrofolate |
| 113 | CHEBI:30402 | tungstopterin |
114 rows × 2 columns
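A list like this is handy for finding the exact name or ID to use in a cofactor or cofactor_id query. The sketch below does a case-insensitive substring lookup over a few rows copied from the table above. It works on plain tuples rather than the dataframe, and find_cofactors is a hypothetical helper, not part of pycom. With the real dataframe, list(cofactors.itertuples(index=False, name=None)) should give rows in this shape.

```python
def find_cofactors(rows, term):
    """Return (cofactor_id, cofactor_name) rows whose name contains `term`."""
    needle = term.lower()
    return [(cid, name) for cid, name in rows if needle in name.lower()]


# A few rows copied from the table above.
cofactor_rows = [
    ('CHEBI:597326', "pyridoxal 5'-phosphate"),
    ('CHEBI:18420', 'Mg(2+)'),
    ('CHEBI:29105', 'Zn(2+)'),
    ('CHEBI:61721', 'chlorophyll b'),
    ('CHEBI:73095', 'divinyl chlorophyll a'),
    ('CHEBI:73096', 'divinyl chlorophyll b'),
]

print(find_cofactors(cofactor_rows, 'ZN'))  # [('CHEBI:29105', 'Zn(2+)')]
```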
[30]:
loader = pyc.get_data_loader()
entries = pyc.find(uniprot_id='P15291')
# Add the protein's cofactors to the dataframe
entries = loader.add_cofactors(entries)
# The following functions are supported, data taken directly from UniProt
entries = loader.add_biological_processes(entries)
entries = loader.add_cath_class(entries) # Protein's CATH
entries = loader.add_coding_sequence_diversity(entries) # https://www.uniprot.org/help/keywords
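entries = loader.add_cofactors(entries) # Cofactors (already added above; the repeated call appears to be why the output has cofactor_x and cofactor_y)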
entries = loader.add_developmental_stage(entries)
entries = loader.add_diseases(entries) # The diseases associated with the protein
entries = loader.add_enzyme_commission(entries) # Protein's EC
entries = loader.add_ligand(entries) # Ligands
entries = loader.add_molecular_function(entries)
entries = loader.add_organism_name(entries)
entries = loader.add_organism_taxonomy(entries)
entries = loader.add_pdbs(entries) # Experimental PDB IDs of protein
entries = loader.add_protein_cellular_component(entries)
entries = loader.add_protein_domain(entries)
entries = loader.add_ptm(entries) # Protein's Post-translational modifications
entries = loader.add_substrates(entries) # Protein's substrates
entries.iloc[0]
[30]:
uniprot_id P15291
neff 7.854
sequence_length 398
sequence MRLREPLLSGSAAMPGASLQRACRLLVAVCALHLGVTLVYYLAGRD...
organism_id 9606
helix_frac 0.198492
turn_frac 0.030151
strand_frac 0.163317
has_ptm 1
has_pdb 1
has_substrate 1
matrix None
cofactor_x [Mn(2+)]
biological_process [Lipid metabolism]
cath_class 3.90.550.10
coding_sequence_diversity [Alternative initiation]
cofactor_y [Mn(2+)]
developmental_stage NaN
disease_name [Congenital disorder of glycosylation 2D]
disease_id [DI-00349]
enzyme_commission 2.4.1.-
ligand [Manganese, Metal-binding]
molecular_function [Glycosyltransferase, Transferase]
organism_name Homo sapiens
taxonomy [Eukaryota, Metazoa, Chordata, Craniata, Verte...
pdb_id [2AE7, 2AEC, 2AES, 2AGD, 2AH9, 2FY7, 2FYA, 2FY...
cellular_component [Cell membrane, Cell projection, Golgi apparat...
domain [Signal-anchor, Transmembrane, Transmembrane h...
ptm NaN
substrate [D-glucose + UDP-alpha-D-galactose = H(+) + la...
Name: 0, dtype: object
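In the output above, some annotation cells hold a list, some a single value, and some NaN (for example developmental_stage and ptm). When looping over annotations it can help to normalise all of these to lists first. The sketch below is one way to do that. as_list is a hypothetical helper, not part of pycom, and the sample row only copies a few values from the output.

```python
import math


def as_list(value):
    """Normalise an annotation cell to a list: NaN/None -> [], scalar -> [scalar]."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Values copied from the output above (NaN written as float('nan')).
row = {
    'ligand': ['Manganese', 'Metal-binding'],
    'ptm': float('nan'),
    'organism_name': 'Homo sapiens',
}
normalised = {key: as_list(value) for key, value in row.items()}
print(normalised)
```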