Getting started: PyCoM Remote (online)
Working with PyCom remotely is recommended only for smaller datasets: it does not support loading the additional biological features, and it is slower than a local setup.
Differences from local setup
There are slight differences in the API when using PyCom remotely.
Querying the PyCoMdb returns a paginated dataframe with max 100 entries per page, or 10 if loading matrices.
- The pyc.paginate and pyc.load_matrices methods are not available:
  - pyc.find(..., page=1, per_page=100) is used for pagination
  - pyc.find(..., matrix=True) is used for loading matrices
- The helper methods for loading additional biological data into the dataframe (pyc.data.*) are not yet available.
Initialize the PyCom class
Import the PyCom class and initialize it with remote=True:
[1]:
from pycom import PyCom, ProteinParams
import pandas as pd
pyc = PyCom(remote=True)
Query the database
Query the database by passing a dictionary of conditions:
[2]:
entries = pyc.find({
    ProteinParams.DISEASE: 'parkinson',  # string search, case-insensitive
}, page=1)
entries.head()
[2]:
| | uniprot_id | neff | sequence_length | sequence | organism_id | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O43464 | 8.095 | 458 | MAAPRAGRGAGWSLRAWRALGGIRWGRRPRLTPDLRALLTSGTSDP... | 9606 | 0.122271 | 0.034934 | 0.286026 | 1 | 1 | 1 |
| 1 | O60260 | 9.579 | 465 | MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFA... | 9606 | 0.174194 | 0.073118 | 0.273118 | 1 | 1 | 1 |
| 2 | O75787 | 8.590 | 350 | MAVFVVLLALVAGVLGNEFSILKSPGSVVFRNGNWPIPGERIPDVA... | 9606 | 0.085714 | 0.011429 | 0.000000 | 1 | 1 | 0 |
| 3 | P09936 | 7.605 | 223 | MQLKPMEINPEMLNKVLSRLGVAGQWRFVDVLGLEEESLGSVPAPA... | 9606 | 0.390135 | 0.053812 | 0.242152 | 1 | 1 | 1 |
| 4 | P31930 | 10.682 | 480 | MAASVVCRAATAGAQVLLRARRSPALLRTPALRSTATFAQALQFVP... | 9606 | 0.410417 | 0.020833 | 0.122917 | 0 | 1 | 0 |
[3]:
# tells us total number of entries in the search results
entries.attrs
[3]:
{'page': 1, 'total_pages': 1, 'total_results': 9}
Alternatively, query the database by passing keyword arguments:
[4]:
entries = pyc.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
    page=1
)
entries
[4]:
| | uniprot_id | neff | sequence_length | sequence | organism_id | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | P11310 | 9.930 | 421 | MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... | 9606 | 0.517815 | 0.016627 | 0.180523 | 1 | 1 | 1 |
| 1 | Q658P3 | 9.677 | 488 | MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... | 9606 | 0.157787 | 0.000000 | 0.086066 | 1 | 1 | 0 |
| 2 | Q16795 | 10.997 | 377 | MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... | 9606 | 0.363395 | 0.037135 | 0.124668 | 1 | 1 | 0 |
| 3 | O95299 | 9.244 | 355 | MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... | 9606 | 0.000000 | 0.000000 | 0.000000 | 1 | 1 | 0 |
| 4 | P13804 | 8.627 | 333 | MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... | 9606 | 0.300300 | 0.027027 | 0.333333 | 1 | 1 | 0 |
[5]:
entries.attrs
[5]:
{'page': 1, 'total_pages': 1, 'total_results': 5}
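The total_results value in attrs can be used to estimate how many requests a full download would take. The following is a minimal sketch and not part of PyCom: the helper name pages_needed and the two constants are hypothetical, and the page-size limits (100 entries, or 10 when loading matrices) are simply taken from the description in this guide.

```python
import math

# Page-size limits as stated in this guide (assumed to match the server)
MAX_PER_PAGE = 100
MAX_PER_PAGE_WITH_MATRICES = 10


def pages_needed(total_results, per_page=MAX_PER_PAGE, matrix=False):
    """Return how many pages are needed to fetch `total_results` entries.

    Hypothetical helper: `per_page` is clamped to the documented limit first.
    """
    limit = MAX_PER_PAGE_WITH_MATRICES if matrix else MAX_PER_PAGE
    per_page = max(1, min(per_page, limit))
    return math.ceil(total_results / per_page)


# The cofactor query above reported total_results == 5
print(pages_needed(5))                 # 1
print(pages_needed(250, matrix=True))  # 25
```

Because the limit drops to 10 entries per page when matrices are requested, the same query can need ten times as many requests.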
Supported query keywords
- uniprot_id: The UniProt ID of the protein.
- sequence: The amino acid sequence of the protein to search for (full match).
- min_length / max_length: Min/max number of residues in the protein.
- min_helix / max_helix: Min/max percentage of helical structure in the protein.
- min_turn / max_turn: Min/max percentage of turn structure in the protein.
- min_strand / max_strand: Min/max percentage of beta strand structure in the protein.
- organism: Taxonomic name of the genus or species of the protein (case-insensitive). The species name or any parent taxonomic level can be used (see pyc.get_organism_list() for the full list).
  - Surround the name with colons (:) to get precise results (:homo: returns Homo sapiens & Homo sapiens neanderthalensis).
  - homo also returns homoeomma, thomomys, and hundreds of others.
- organism_id: Precise NCBI Taxonomy ID of the species of the protein (prefer to use organism instead).
- cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).
- enzyme: Enzyme Commission number of the protein (1.3.1.3 or 1.3.*.* or 1.*).
- has_substrate: Whether the protein has a known substrate (True/False).
- has_ptm: Whether the protein has a known post-translational modification (True/False).
- has_pdb: Whether the protein has a known PDB structure (True/False).
- disease: The disease associated with the protein (name of disease, case-insensitive, e.g. cancer). Use pyc.get_disease_list() for the full list.
  - cancer searches for Ovarian cancer, Lung cancer, …
- disease_id: The ID of the disease associated with the protein (e.g. DI-02205, see pyc.get_disease_list()).
- has_disease: Whether the protein is associated with a disease (True/False).
- cofactor: The cofactor associated with the protein (name of cofactor, case-insensitive, e.g. Zn(2+)).
- cofactor_id: The ID of the cofactor associated with the protein (e.g. CHEBI:00001, see pyc.get_cofactor_list()).
- biological_process: Biological process associated with the protein (e.g. antiviral defense, use pyc.get_biological_process_list() for the full list).
- cellular_component: Cellular component associated with the protein (e.g. nucleus, use pyc.get_cellular_component_list() for the full list).
- domain: Domain associated with the protein (e.g. zinc-finger, use pyc.get_domain_list() for the full list).
- ligand: Ligand associated with the protein (e.g. zinc, use pyc.get_ligand_list() for the full list).
- molecular_function: Molecular function associated with the protein (e.g. antioxidant activity, use pyc.get_molecular_function_list() for the full list).
- ptm: Post-translational modification associated with the protein (e.g. phosphoprotein, use pyc.get_ptm_list() for the full list).
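Since every remote query is a network round trip, it can be worth catching misspelled keyword names locally first. The sketch below is an illustration only and not part of PyCom: check_query and SUPPORTED_KEYWORDS are hypothetical names, and the set is copied by hand from the list above, so it may lag behind the actual API.

```python
# Keyword names copied from the list above (assumption: the list is complete)
SUPPORTED_KEYWORDS = {
    'uniprot_id', 'sequence', 'min_length', 'max_length',
    'min_helix', 'max_helix', 'min_turn', 'max_turn',
    'min_strand', 'max_strand', 'organism', 'organism_id',
    'cath', 'enzyme', 'has_substrate', 'has_ptm', 'has_pdb',
    'disease', 'disease_id', 'has_disease', 'cofactor', 'cofactor_id',
    'biological_process', 'cellular_component', 'domain', 'ligand',
    'molecular_function', 'ptm',
}


def check_query(**query):
    """Return `query` unchanged, or raise ValueError on unknown keyword names."""
    unknown = sorted(set(query) - SUPPORTED_KEYWORDS)
    if unknown:
        raise ValueError(f'Unsupported query keyword(s): {", ".join(unknown)}')
    return query


query = check_query(organism=':homo:', cath='3.40.*.*', has_pdb=True)
# With the `pyc` object from this guide, the checked query could then be sent as:
# entries = pyc.find(**query, page=1)
```

Note that page, per_page, and matrix are options of pyc.find rather than query keywords, so they are passed separately from the checked dictionary.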
Pagination
Remote PyCom automatically paginates results. The default page size is 10 entries, which can be changed with pyc.find(..., per_page=100). The maximum page size is 100 entries, or 10 if loading matrices.
To load more entries than fit on one page, set the page parameter to the desired page number:
[6]:
page1 = pyc.find(max_length=20, page=1, per_page=100) # get first 100 entries with length <= 20
page2 = pyc.find(max_length=20, page=2, per_page=100) # get entries 101-200 with length <= 20
# pages can be concatenated
pages = pd.concat([page1, page2], ignore_index=True)
print(f'Page 1: {len(page1)} entries, Page 2: {len(page2)} entries, Total: {len(pages)} entries')
Page 1: 100 entries, Page 2: 100 entries, Total: 200 entries
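Fetching pages one by one can be wrapped in a small loop. The following is a minimal sketch rather than a PyCom feature: iter_pages is a hypothetical helper, and it assumes that each returned page exposes attrs['total_pages'] as in the outputs shown earlier.

```python
def iter_pages(find, per_page=100, **query):
    """Yield result pages from `find` until the last page has been returned.

    `find` is any callable with the signature of pyc.find, e.g. pyc.find itself.
    """
    page = 1
    while True:
        df = find(page=page, per_page=per_page, **query)
        yield df
        # Stop once the reported page count is reached (or if it is missing)
        if page >= df.attrs.get('total_pages', page):
            break
        page += 1


# Usage with the objects from this guide (requires network access):
# pages = pd.concat(list(iter_pages(pyc.find, max_length=20)), ignore_index=True)
```

A generator keeps only one page in flight at a time, so the caller can stop early or process pages as they arrive instead of downloading everything first.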
Load coevolution matrices
Coevolution matrices can be loaded by setting the matrix parameter: pyc.find(..., matrix=True).
This loads them into the matrix column of the dataframe.
[7]:
results = pyc.find(max_length=20, page=1, matrix=True)
results.iloc[0].matrix # show the coevolution matrix for the first entry
[7]:
array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
0.00000000e+00],
[2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
4.54485416e-07],
[1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
2.98023224e-07],
[0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
2.23517418e-07],
[0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
0.00000000e+00]])
By default, each matrix is loaded as a numpy.ndarray. Other formats can be requested with the mat_format parameter.
Here is an example of the matrices being loaded as NumPy arrays, Pandas DataFrames, and 2D lists:
[9]:
from pycom import MatrixFormat
resultsNumpy = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.NUMPY)
resultsPandas = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.PANDAS)
resultsList = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.LIST)
print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')
print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')
print(f'List: {type(resultsList.iloc[0].matrix)}')
Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>
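Once a matrix is loaded, a common next step is to pick out the strongest residue pairs. The sketch below is an illustration under stated assumptions, not a PyCom function: top_pairs is a hypothetical helper, it assumes the matrix is symmetric (as the example output earlier suggests) and that larger values mean a stronger coevolution signal, and it works on the LIST or NUMPY formats with zero-based residue indices.

```python
def top_pairs(matrix, k=3):
    """Return the k highest-scoring residue pairs as (i, j, score) with i < j."""
    n = len(matrix)
    pairs = [
        (i, j, float(matrix[i][j]))
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only, assuming symmetry
    ]
    # Stable sort: tied scores keep their (i, j) order
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:k]


# The 5x5 matrix shown earlier, written as a 2D list
example = [
    [0.0, 2.16066837e-07, 1.56462193e-07, 0.0, 0.0],
    [2.16066837e-07, 0.0, 4.61935997e-07, 4.54485416e-07, 4.54485416e-07],
    [1.56462193e-07, 4.61935997e-07, 0.0, 2.98023224e-07, 2.98023224e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 0.0, 2.23517418e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07, 0.0],
]
print(top_pairs(example, k=1))  # [(1, 2, 4.61935997e-07)]
```

With the dataframe from this guide, the same helper could be applied to a single entry, for example top_pairs(resultsList.iloc[0].matrix).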