Getting started: PyCoM Local
PyCoM, the Python interface for PyCoMDB, can run both locally and remotely.
Run locally for large-scale analyses. Requires 115 GB of disk space for the database.
Run remotely for small-scale analyses. No disk space required. Follow the tutorial 00_Getting_Started_Remotely.ipynb.
This is a crash course on how to use the local variant of the Python interface for the PyCoM database.
More in-depth tutorials are available here: https://pycom.brunel.ac.uk/tutorials.html
Installation
Install the PyCom package:
pip3 install git+https://github.com/scdantu/pycom
Note: Requires Python 3.8 or higher
Download the pycom.db and pycom.mat files from https://pycom.brunel.ac.uk/downloads
[ ]:
!mkdir -p ~/docs
# note: downloads are 700MB and 114GB (!) respectively.
# if too much: consider running remote version (other tutorial!)
!wget -P ~/docs https://pycom.brunel.ac.uk/downloads/pycom.db
!wget -P ~/docs https://pycom.brunel.ac.uk/downloads/pycom.mat
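Because the downloads are large, it can help to check free disk space first. The following is a minimal sketch using only the Python standard library; the helper name has_enough_space is hypothetical and not part of PyCoM, and the 115 GB threshold is simply the figure quoted in this tutorial.

```python
import shutil
from pathlib import Path

# Disk space quoted in this tutorial for the local database (pycom.db + pycom.mat).
REQUIRED_BYTES = 115 * 10**9


def has_enough_space(target_dir, required_bytes=REQUIRED_BYTES):
    """Return True if the filesystem holding target_dir has enough free space.

    Hypothetical helper for illustration; not part of the PyCoM package.
    """
    path = Path(target_dir).expanduser()
    # Walk up to the nearest existing parent so the check also works
    # before the download directory has been created.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free >= required_bytes


# Usage:
# if not has_enough_space("~/docs"):
#     print("Not enough free space; consider the remote variant.")
```

If the check fails, the remote variant of PyCoM avoids the download entirely.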
Initialise PyCom object
Import the required classes and create a PyCom object:
[1]:
from pycom import PyCom, ProteinParams
pyc = PyCom(db_path='~/docs/pycom.db', mat_path='~/docs/pycom.mat')
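If the object fails to initialise, a missing or misplaced file is a likely cause. The sketch below resolves both paths and checks that the files exist before they are handed to PyCom. The helper resolve_data_paths is hypothetical, and whether PyCom expands `~` on its own is not stated in this tutorial, so the sketch expands it explicitly.

```python
from pathlib import Path


def resolve_data_paths(data_dir="~/docs"):
    """Return absolute paths to pycom.db and pycom.mat inside data_dir.

    Hypothetical helper for illustration; not part of the PyCoM package.
    Raises FileNotFoundError if either file is missing.
    """
    base = Path(data_dir).expanduser().resolve()
    db_path = base / "pycom.db"
    mat_path = base / "pycom.mat"
    missing = [p.name for p in (db_path, mat_path) if not p.is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {base}: {', '.join(missing)}")
    return str(db_path), str(mat_path)


# Usage (requires the downloaded files):
# db_path, mat_path = resolve_data_paths("~/docs")
# pyc = PyCom(db_path=db_path, mat_path=mat_path)
```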
Query the database
Query the database by passing a dictionary of keywords:
[2]:
entries = pyc.find({
    ProteinParams.ENZYME: '3.*.*.*',
    ProteinParams.DISEASE: 'cancer',  # string search, case-insensitive
})
entries
[2]:
 | uniprot_id | neff | sequence_length | sequence | organism_id | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate | matrix |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P01111 | 12.817 | 189 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | 9606 | 0.349206 | 0.015873 | 0.227513 | 1 | 1 | 1 | None |
1 | P01112 | 12.841 | 189 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | 9606 | 0.317460 | 0.031746 | 0.359788 | 1 | 1 | 1 | None |
2 | P01116 | 12.626 | 189 | MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... | 9606 | 0.375661 | 0.031746 | 0.328042 | 1 | 1 | 1 | None |
3 | P62070 | 12.754 | 204 | MAAAGWRDGSGQEKYRLVVVGGGGVGKSALTIQFIQSYFVTDYDPT... | 9606 | 0.299020 | 0.019608 | 0.220588 | 1 | 1 | 1 | None |
4 | Q9UNW1 | 9.554 | 487 | MLRAPGCLLRTSVAPAAALAAALLSSLARCSLLEPRDPVASSLSPY... | 9606 | 0.000000 | 0.000000 | 0.000000 | 0 | 0 | 1 | None |
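To show how a wildcard pattern such as '3.*.*.*' reads, here is a small illustration using shell-style matching from the Python standard library. This is only a sketch of the pattern semantics: how PyCoM matches patterns internally is not described in this tutorial and may differ, and the EC numbers below are arbitrary example values.

```python
from fnmatch import fnmatchcase


def matches_pattern(code, pattern):
    """Return True if a dotted classification code matches a wildcard pattern.

    Illustration only: shell-style matching, not PyCoM's own implementation.
    """
    return fnmatchcase(code, pattern)


print(matches_pattern("3.6.5.2", "3.*.*.*"))  # True
print(matches_pattern("3.6.5.2", "3.*"))      # True
print(matches_pattern("1.3.1.3", "3.*.*.*"))  # False
```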
Alternatively, query the database by passing keyword arguments:
[3]:
entries = pyc.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
)
entries
[3]:
 | uniprot_id | neff | sequence_length | sequence | organism_id | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate | matrix |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | P11310 | 9.930 | 421 | MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... | 9606 | 0.517815 | 0.016627 | 0.180523 | 1 | 1 | 1 | None |
1 | Q658P3 | 9.677 | 488 | MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... | 9606 | 0.157787 | 0.000000 | 0.086066 | 1 | 1 | 0 | None |
2 | Q16795 | 10.997 | 377 | MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... | 9606 | 0.363395 | 0.037135 | 0.124668 | 1 | 1 | 0 | None |
3 | O95299 | 9.244 | 355 | MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... | 9606 | 0.000000 | 0.000000 | 0.000000 | 1 | 1 | 0 | None |
4 | P13804 | 8.627 | 333 | MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... | 9606 | 0.300300 | 0.027027 | 0.333333 | 1 | 1 | 0 | None |
Supported query keywords
uniprot_id: The UniProt ID of the protein.
sequence: The amino acid sequence of the protein to search for. (full match)
min_length / max_length: Min/Max number of residues in the protein.
min_helix / max_helix: Min/Max percentage of helical structure in the protein.
min_turn / max_turn: Min/Max percentage of turn structure in the protein.
min_strand / max_strand: Min/Max percentage of beta strand structure in the protein.
organism: Taxonomic name of the genus / species of the protein. (case-insensitive)
  Species name or any parent taxonomic level can be used. (pyc.get_organism_list() for full list)
  Surround with : to get precise results. (:homo: returns Homo sapiens & Homo sapiens neanderthalensis)
  homo also returns homoeomma, thomomys, and hundreds of others.
organism_id: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use organism instead)
cath: CATH classification of the protein. (3.40.50.360 or 3.40.*.* or 3.*)
enzyme: Enzyme Commission number of the protein. (1.3.1.3 or 1.3.*.* or 1.*)
has_substrate: Whether the protein has a known substrate. (True/False)
has_ptm: Whether the protein has a known post-translational modification. (True/False)
has_pdb: Whether the protein has a known PDB structure. (True/False)
disease: The disease associated with the protein. (name of disease, case-insensitive, e.g. cancer)
  Use pyc.get_disease_list() for full list.
  cancer searches for Ovarian cancer, Lung cancer, …
disease_id: The ID of the disease associated with the protein. (DI-02205, see pyc.get_disease_list())
has_disease: Whether the protein is associated with a disease. (True/False)
cofactor: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. Zn(2+))
cofactor_id: The ID of the cofactor associated with the protein. (CHEBI:00001, see pyc.get_cofactor_list())
biological_process: Biological process associated with the protein. (e.g. antiviral defense, use pyc.get_biological_process_list() for full list)
cellular_component: Cellular component associated with the protein. (e.g. nucleus, use pyc.get_cellular_component_list() for full list)
domain: Domain associated with the protein. (e.g. zinc-finger, use pyc.get_domain_list() for full list)
ligand: Ligand associated with the protein. (e.g. zinc, use pyc.get_ligand_list() for full list)
molecular_function: Molecular function associated with the protein. (e.g. antioxidant activity, use pyc.get_molecular_function_list() for full list)
ptm: Post-translational modification associated with the protein. (e.g. phosphoprotein, use pyc.get_ptm_list() for full list)
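With this many keywords, a misspelled name is easy to miss. The sketch below checks keyword names against the list above before a query is run. The helper build_query is hypothetical and not part of PyCoM; whether pyc.find reports unknown keywords on its own is not covered in this tutorial.

```python
# Keyword names taken from the list above.
SUPPORTED_KEYWORDS = {
    "uniprot_id", "sequence",
    "min_length", "max_length",
    "min_helix", "max_helix",
    "min_turn", "max_turn",
    "min_strand", "max_strand",
    "organism", "organism_id",
    "cath", "enzyme",
    "has_substrate", "has_ptm",
    "has_pdb",  # spelled as in the has_pdb column of the result tables
    "disease", "disease_id", "has_disease",
    "cofactor", "cofactor_id",
    "biological_process", "cellular_component",
    "domain", "ligand", "molecular_function", "ptm",
}


def build_query(**keywords):
    """Return the keywords as a dict, rejecting names not in the list above.

    Hypothetical helper for illustration; not part of the PyCoM package.
    """
    unknown = sorted(set(keywords) - SUPPORTED_KEYWORDS)
    if unknown:
        raise ValueError(f"Unsupported query keywords: {', '.join(unknown)}")
    return dict(keywords)


query = build_query(organism=":homo:", cath="3.40.*.*", min_length=100, max_length=300)
print(query)

# Usage (requires the local database):
# entries = pyc.find(**query)
```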
Paginate the results
Before loading coevolution matrices, it is recommended to paginate the results, as the matrices can take up a lot of memory.
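As a rough guide to the memory involved, the size of a set of matrices can be estimated from the sequence_length column. This is a back-of-the-envelope sketch under two assumptions that this tutorial does not confirm: each matrix is a dense square matrix with one row and column per residue, and each value takes 8 bytes.

```python
def matrix_bytes(sequence_length, bytes_per_value=8):
    """Approximate size in bytes of one dense L x L matrix (assumed layout)."""
    return sequence_length * sequence_length * bytes_per_value


def total_megabytes(sequence_lengths, bytes_per_value=8):
    """Approximate combined size in MB for matrices of the given lengths."""
    total = sum(matrix_bytes(n, bytes_per_value) for n in sequence_lengths)
    return total / 1e6


# Example: 100 proteins of 500 residues each.
print(total_megabytes([500] * 100))  # 200.0
```

Under these assumptions the cost grows with the square of the sequence length, so a few long proteins can dominate the total.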
Here is an example of making a large query, then paginating the results:
[4]:
entries = pyc.find(max_length=20)
print(f'Found {len(entries)} entries with length <= 20')
page = pyc.paginate(entries, page=1, per_page=100) # get first n entries (default 100)
print(f'Found {len(page)} entries on page 1')
Found 2958 entries with length <= 20
Found 100 entries on page 1
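To work through every page rather than only the first, the number of pages can be derived from the result count. The sketch below is plain Python; the helper names are hypothetical, and it assumes page numbers start at 1, as in the call above.

```python
import math


def page_count(total, per_page=100):
    """Number of pages needed to cover `total` entries."""
    return math.ceil(total / per_page)


def page_numbers(total, per_page=100):
    """1-based page numbers covering `total` entries (assumed numbering)."""
    return range(1, page_count(total, per_page) + 1)


print(page_count(2958))  # 30

# Usage (requires the local database):
# for page_no in page_numbers(len(entries), per_page=100):
#     page = pyc.paginate(entries, page=page_no, per_page=100)
#     ...  # process one page at a time
```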
Load coevolution matrices
Now the coevolution matrices can be loaded for the paginated results.
This loads them into the matrix column of the dataframe.
[5]:
pyc.load_matrices(page)
page.iloc[0].matrix # show the coevolution matrix for the first entry
[5]:
array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
0.00000000e+00],
[2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
4.54485416e-07],
[1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
2.98023224e-07],
[0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
2.23517418e-07],
[0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
0.00000000e+00]])
By default, the matrices are loaded as a numpy.ndarray. Different formats can be specified.
Here is an example of the matrices being loaded as Pandas DataFrames and 2D lists:
[2]:
from pycom import MatrixFormat
resultsNumpy = pyc.load_matrices(page, mat_format=MatrixFormat.NUMPY) # default
resultsPandas = pyc.load_matrices(page, mat_format=MatrixFormat.PANDAS)
resultsList = pyc.load_matrices(page, mat_format=MatrixFormat.LIST)
print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')
print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')
print(f'List: {type(resultsList.iloc[0].matrix)}')
Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>
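Once a matrix is available as a 2D list, the strongest-scoring residue pairs can be picked out with plain Python. The sketch below reuses the values of the 5 x 5 matrix shown earlier; the helper top_pairs is hypothetical, and the indices it reports are 0-based positions in the matrix, which are assumed (not confirmed here) to follow residue order in the sequence.

```python
def top_pairs(matrix, k=3):
    """Return the k highest-scoring (i, j, score) pairs with i < j.

    Only the upper triangle is scanned, so each pair is reported once.
    """
    n = len(matrix)
    pairs = [(matrix[i][j], i, j) for i in range(n) for j in range(i + 1, n)]
    pairs.sort(key=lambda item: item[0], reverse=True)
    return [(i, j, score) for score, i, j in pairs[:k]]


# Values copied from the matrix output shown earlier in this tutorial.
example = [
    [0.0, 2.16066837e-07, 1.56462193e-07, 0.0, 0.0],
    [2.16066837e-07, 0.0, 4.61935997e-07, 4.54485416e-07, 4.54485416e-07],
    [1.56462193e-07, 4.61935997e-07, 0.0, 2.98023224e-07, 2.98023224e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 0.0, 2.23517418e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07, 0.0],
]

for i, j, score in top_pairs(example):
    print(i, j, score)
```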
Adding biological data to dataframe
This is supported in the local variant only!
PyCom contains a lot of additional protein annotation info. This is not loaded by default, but can be added if needed.
The lists of cofactors, diseases, and organisms can be loaded by calling:
[7]:
cofactors = pyc.get_cofactor_list()
diseases = pyc.get_disease_list()
organisms = pyc.get_organism_list()
cofactors
[7]:
 | cofactorId | cofactorName |
---|---|---|
0 | CHEBI:597326 | pyridoxal 5'-phosphate |
1 | CHEBI:18420 | Mg(2+) |
2 | CHEBI:60240 | a divalent metal cation |
3 | CHEBI:30413 | heme |
4 | CHEBI:29105 | Zn(2+) |
... | ... | ... |
109 | CHEBI:61721 | chlorophyll b |
110 | CHEBI:73095 | divinyl chlorophyll a |
111 | CHEBI:73096 | divinyl chlorophyll b |
112 | CHEBI:57453 | (6S)-5,6,7,8-tetrahydrofolate |
113 | CHEBI:30402 | tungstopterin |
114 rows × 2 columns
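A list like this can be used to look up the ID for a cofactor name, for example to pass it as the cofactor_id query keyword. The sketch below works on plain records so that it runs without the database; the helper find_cofactor_ids is hypothetical, and it assumes the returned object is a pandas DataFrame with the two columns shown above, in which case cofactors.to_dict("records") yields the same structure.

```python
def find_cofactor_ids(records, name_part):
    """Return IDs of cofactors whose name contains name_part (case-insensitive)."""
    needle = name_part.lower()
    return [r["cofactorId"] for r in records if needle in r["cofactorName"].lower()]


# A few rows copied from the table above. With the real data (assumed to be
# a pandas DataFrame), use: records = cofactors.to_dict("records")
records = [
    {"cofactorId": "CHEBI:597326", "cofactorName": "pyridoxal 5'-phosphate"},
    {"cofactorId": "CHEBI:18420", "cofactorName": "Mg(2+)"},
    {"cofactorId": "CHEBI:30413", "cofactorName": "heme"},
    {"cofactorId": "CHEBI:29105", "cofactorName": "Zn(2+)"},
    {"cofactorId": "CHEBI:61721", "cofactorName": "chlorophyll b"},
    {"cofactorId": "CHEBI:73096", "cofactorName": "divinyl chlorophyll b"},
]

print(find_cofactor_ids(records, "zn"))             # ['CHEBI:29105']
print(find_cofactor_ids(records, "chlorophyll b"))  # ['CHEBI:61721', 'CHEBI:73096']
```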
[30]:
loader = pyc.get_data_loader()
entries = pyc.find(uniprot_id='P15291')
# Add the protein's cofactors to the dataframe
entries = loader.add_cofactors(entries)
# The following functions are supported, data taken directly from UniProt
entries = loader.add_biological_processes(entries)
entries = loader.add_cath_class(entries) # Protein's CATH
entries = loader.add_coding_sequence_diversity(entries) # https://www.uniprot.org/help/keywords
entries = loader.add_cofactors(entries) # Cofactors (already added above, hence the cofactor_x / cofactor_y columns in the output)
entries = loader.add_developmental_stage(entries)
entries = loader.add_diseases(entries) # The diseases associated with the protein
entries = loader.add_enzyme_commission(entries) # Protein's EC
entries = loader.add_ligand(entries) # Ligands
entries = loader.add_molecular_function(entries)
entries = loader.add_organism_name(entries)
entries = loader.add_organism_taxonomy(entries)
entries = loader.add_pdbs(entries) # Experimental PDB IDs of protein
entries = loader.add_protein_cellular_component(entries)
entries = loader.add_protein_domain(entries)
entries = loader.add_ptm(entries) # Protein's Post-translational modifications
entries = loader.add_substrates(entries) # Protein's substrates
entries.iloc[0]
[30]:
uniprot_id P15291
neff 7.854
sequence_length 398
sequence MRLREPLLSGSAAMPGASLQRACRLLVAVCALHLGVTLVYYLAGRD...
organism_id 9606
helix_frac 0.198492
turn_frac 0.030151
strand_frac 0.163317
has_ptm 1
has_pdb 1
has_substrate 1
matrix None
cofactor_x [Mn(2+)]
biological_process [Lipid metabolism]
cath_class 3.90.550.10
coding_sequence_diversity [Alternative initiation]
cofactor_y [Mn(2+)]
developmental_stage NaN
disease_name [Congenital disorder of glycosylation 2D]
disease_id [DI-00349]
enzyme_commission 2.4.1.-
ligand [Manganese, Metal-binding]
molecular_function [Glycosyltransferase, Transferase]
organism_name Homo sapiens
taxonomy [Eukaryota, Metazoa, Chordata, Craniata, Verte...
pdb_id [2AE7, 2AEC, 2AES, 2AGD, 2AH9, 2FY7, 2FYA, 2FY...
cellular_component [Cell membrane, Cell projection, Golgi apparat...
domain [Signal-anchor, Transmembrane, Transmembrane h...
ptm NaN
substrate [D-glucose + UDP-alpha-D-galactose = H(+) + la...
Name: 0, dtype: object
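In the output above, some annotation cells hold lists, some hold a single value, and missing annotations appear as NaN. When iterating over annotations it can be convenient to normalise all three cases to a list. This is a minimal sketch; the helper as_list is hypothetical and not part of PyCoM.

```python
import math


def as_list(value):
    """Normalise an annotation cell to a list (empty for None or NaN)."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


print(as_list(float("nan")))    # []
print(as_list(["Mn(2+)"]))      # ['Mn(2+)']
print(as_list("Homo sapiens"))  # ['Homo sapiens']

# Usage (requires the local database):
# for ptm in as_list(entries.iloc[0]["ptm"]):
#     print(ptm)
```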