Getting started: PyCoM Remote (online)

Working with PyCoM remotely is recommended only for smaller datasets: it is slower than a local setup and does not support loading the additional biological features.

Differences from local setup
Initialize the PyCom class
Query the database
Supported query keywords
Pagination
Load coevolution matrices

Differences from local setup

The API differs slightly when using PyCoM remotely.

Querying the PyCoMdb returns a paginated dataframe with at most 100 entries per page, or 10 if loading matrices.
The pyc.paginate and pyc.load_matrices methods are not available. Instead:
- pyc.find(..., page=1, per_page=100) is used for pagination
- pyc.find(..., matrix=True) is used for loading matrices
The helper methods for loading additional biological data into the dataframe (pyc.data.*) are not yet available.

Initialize the PyCom class

Import the PyCom class and initialize it with remote=True:

[1]:

from pycom import PyCom, ProteinParams
import pandas as pd

pyc = PyCom(remote=True)

Query the database

Query the database by passing a dictionary of conditions:

[2]:

entries = pyc.find({
    ProteinParams.DISEASE: 'parkinson',  # string search, case-insensitive
}, page=1)

entries.head()

[2]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate
0	O43464	8.095	458	MAAPRAGRGAGWSLRAWRALGGIRWGRRPRLTPDLRALLTSGTSDP...	9606	0.122271	0.034934	0.286026	1	1	1
1	O60260	9.579	465	MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFA...	9606	0.174194	0.073118	0.273118	1	1	1
2	O75787	8.590	350	MAVFVVLLALVAGVLGNEFSILKSPGSVVFRNGNWPIPGERIPDVA...	9606	0.085714	0.011429	0.000000	1	1	0
3	P09936	7.605	223	MQLKPMEINPEMLNKVLSRLGVAGQWRFVDVLGLEEESLGSVPAPA...	9606	0.390135	0.053812	0.242152	1	1	1
4	P31930	10.682	480	MAASVVCRAATAGAQVLLRARRSPALLRTPALRSTATFAQALQFVP...	9606	0.410417	0.020833	0.122917	0	1	0

[3]:

# tells us total number of entries in the search results
entries.attrs

[3]:

{'page': 1, 'total_pages': 1, 'total_results': 9}

Alternatively, query the database by passing keyword arguments:

[4]:

entries = pyc.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
    page=1
)

entries

[4]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate
0	P11310	9.930	421	MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK...	9606	0.517815	0.016627	0.180523	1	1	1
1	Q658P3	9.677	488	MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR...	9606	0.157787	0.000000	0.086066	1	1	0
2	Q16795	10.997	377	MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG...	9606	0.363395	0.037135	0.124668	1	1	0
3	O95299	9.244	355	MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG...	9606	0.000000	0.000000	0.000000	1	1	0
4	P13804	8.627	333	MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR...	9606	0.300300	0.027027	0.333333	1	1	0

[5]:

entries.attrs

[5]:

{'page': 1, 'total_pages': 1, 'total_results': 5}
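Because each call returns a single page, collecting a larger result set means requesting the pages one after another. The following is a minimal sketch, not part of PyCoM: iter_pages is a hypothetical helper that relies only on the page keyword and on the total_pages entry of attrs shown above, and assumes both behave the same for every query.

```python
from typing import Any, Callable, Iterator


def iter_pages(find: Callable[..., Any], *args: Any, **query: Any) -> Iterator[Any]:
    """Yield every result page for one query, starting at page 1.

    `find` is expected to behave like pyc.find in remote mode: it accepts a
    `page` keyword and returns an object whose `.attrs` dict holds
    'total_pages' (assumption based on the attrs output shown above).
    """
    page = 1
    while True:
        result = find(*args, page=page, **query)
        yield result
        # Fall back to a single page if the key is missing.
        total_pages = result.attrs.get('total_pages', 1)
        if page >= total_pages:
            break
        page += 1
```

Typical usage would be pages = list(iter_pages(pyc.find, cofactor='FAD', has_ptm=True)), optionally followed by pd.concat(pages, ignore_index=True) to obtain one dataframe. Since remote queries are slow, this is best kept to searches with few pages.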

Supported query keywords

uniprot_id: The UniProt ID of the protein.
sequence: The amino acid sequence of the protein to search for. (full match)
min_length / max_length: Min/Max number of residues in the protein.
min_helix / max_helix: Min/Max percentage of helical structure in the protein.
min_turn / max_turn: Min/Max percentage of turn structure in the protein.
min_strand / max_strand: Min/Max percentage of beta strand structure in the protein.
organism: Taxonomic name of the genus / species of the protein. (case-insensitive)
- Species name or any parent taxonomic level can be used. (pyc.get_organism_list() for full list)
- Surround with : to get precise results
  - :homo: returns Homo sapiens & Homo sapiens neanderthalensis
  - homo also returns homoeomma, thomomys, and hundreds of others
organism_id: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use organism instead)
cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).
enzyme: Enzyme Commission number of the protein. (1.3.1.3 or 1.3.*.* or 1.*).
has_substrate: Whether the protein has a known substrate. (True/False)
has_ptm: Whether the protein has a known post-translational modification. (True/False)
has_pdb: Whether the protein has a known PDB structure. (True/False)
disease: The disease associated with the protein. (name of disease, case-insensitive, e.g. cancer)
- Use pyc.get_disease_list() for full list.
- cancer searches for Ovarian cancer, Lung cancer, …
disease_id: The ID of the disease associated with the protein. (e.g. DI-02205, use pyc.get_disease_list() for full list)
has_disease: Whether the protein is associated with a disease. (True/False)
cofactor: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. Zn(2+))
cofactor_id: The ID of the cofactor associated with the protein. (CHEBI:00001, get_cofactor_list())
biological_process: Biological process associated with the protein. (e.g. antiviral defense, use pyc.get_biological_process_list() for full list)
cellular_component: Cellular component associated with the protein. (e.g. nucleus, use pyc.get_cellular_component_list() for full list)
domain: Domain associated with the protein. (e.g. zinc-finger, use pyc.get_domain_list() for full list)
ligand: Ligand associated with the protein. (e.g. zinc, use pyc.get_ligand_list() for full list)
molecular_function: Molecular function associated with the protein. (e.g. antioxidant activity, use pyc.get_molecular_function_list() for full list)
ptm: Post-translational modification associated with the protein. (e.g. phosphoprotein, use pyc.get_ptm_list() for full list)
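Several of the keywords above can be combined in one call. The sketch below is illustrative only: find_human_oxidoreductases is a hypothetical wrapper, the length limits are arbitrary, and it assumes that combined conditions all have to match. It uses the colon-delimited organism form and the EC wildcard form described in the list (EC class 1 covers oxidoreductases).

```python
def find_human_oxidoreductases(pyc, page=1):
    """Query one page of Homo proteins in EC class 1 with a known PDB structure.

    `pyc` is expected to be a PyCom instance created with remote=True.
    """
    return pyc.find(
        organism=':homo:',  # colons restrict the match to the taxon "Homo"
        enzyme='1.*',       # wildcard over all EC 1.x.x.x entries
        has_pdb=True,
        min_length=100,     # arbitrary example bounds on residue count
        max_length=300,
        page=page,
    )
```

Calling find_human_oxidoreductases(pyc) would return the first page of matches as a dataframe, with paging information in its attrs.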

Load coevolution matrices

Coevolution matrices can be loaded by setting the matrix parameter: pyc.find(..., matrix=True).

This loads them into the matrix column of the dataframe.

[7]:

results = pyc.find(max_length=20, page=1, matrix=True)

results.iloc[0].matrix  # show the coevolution matrix for the first entry

[7]:

array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
        0.00000000e+00],
       [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
        4.54485416e-07],
       [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
        2.98023224e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
        2.23517418e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
        0.00000000e+00]])

By default, the matrices are loaded as a numpy.ndarray. Different formats can be specified.

Here is an example of the matrices being loaded as NumPy arrays, Pandas DataFrames, and 2D lists:

[9]:

from pycom import MatrixFormat

# Same query three times, differing only in the requested matrix format
results_numpy = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.NUMPY)
results_pandas = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.PANDAS)
results_list = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.LIST)

print(f'Numpy: {type(results_numpy.iloc[0].matrix)}')
print(f'Pandas: {type(results_pandas.iloc[0].matrix)}')
print(f'List: {type(results_list.iloc[0].matrix)}')

Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>
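Matrices in the list format can be processed without extra dependencies. As a minimal sketch, the hypothetical helper top_pairs below (not part of PyCoM) ranks residue pairs by their coevolution score. It assumes the matrix is square and symmetric, as in the output shown earlier, and that row and column i refer to the same 0-based sequence position.

```python
def top_pairs(matrix, k=3):
    """Return the k highest-scoring (i, j, score) pairs with i < j.

    Only the upper triangle is scanned, so each pair of a symmetric
    matrix is reported once and the diagonal is skipped.
    """
    n = len(matrix)
    pairs = [
        (matrix[i][j], i, j)
        for i in range(n)
        for j in range(i + 1, n)
    ]
    # Sort by score, highest first; ties keep their scan order.
    pairs.sort(key=lambda item: item[0], reverse=True)
    return [(i, j, score) for score, i, j in pairs[:k]]
```

With results from a MatrixFormat.LIST query, top_pairs(results.iloc[0].matrix) would list the three most strongly coupled position pairs of the first entry.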