Getting started: PyCoM Remote (online)

Working with PyCoM remotely is recommended only for smaller datasets: it does not support loading the biological features, and it is slower than a local setup.

  1. Differences from local setup

  2. Initialize the PyCom class

  3. Query the database

  4. Supported query keywords

  5. Pagination

  6. Load coevolution matrices

Differences from local setup

There are slight differences in the API when using PyCom remotely.

  • Querying the PyCoMdb returns a paginated dataframe with at most 100 entries per page, or 10 if loading matrices.

  • The pyc.paginate and pyc.load_matrices methods are not available. Instead:

    • pyc.find(..., page=1, per_page=100) is used for pagination

    • pyc.find(..., matrix=True) is used for loading matrices

  • The helper methods for loading additional biological data into the dataframe (pyc.data.*) are not yet available.

Initialize the PyCom class

Import the PyCom class and initialize it with remote=True:

[1]:
from pycom import PyCom, ProteinParams
import pandas as pd

pyc = PyCom(remote=True)

Query the database

Query the database by passing a dictionary of conditions:

[2]:
entries = pyc.find({
    ProteinParams.DISEASE: 'parkinson',  # string search, case-insensitive
}, page=1)

entries.head()
[2]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate
0 O43464 8.095 458 MAAPRAGRGAGWSLRAWRALGGIRWGRRPRLTPDLRALLTSGTSDP... 9606 0.122271 0.034934 0.286026 1 1 1
1 O60260 9.579 465 MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFA... 9606 0.174194 0.073118 0.273118 1 1 1
2 O75787 8.590 350 MAVFVVLLALVAGVLGNEFSILKSPGSVVFRNGNWPIPGERIPDVA... 9606 0.085714 0.011429 0.000000 1 1 0
3 P09936 7.605 223 MQLKPMEINPEMLNKVLSRLGVAGQWRFVDVLGLEEESLGSVPAPA... 9606 0.390135 0.053812 0.242152 1 1 1
4 P31930 10.682 480 MAASVVCRAATAGAQVLLRARRSPALLRTPALRSTATFAQALQFVP... 9606 0.410417 0.020833 0.122917 0 1 0
[3]:
# attrs holds the current page, the number of pages, and the total number of search results
entries.attrs
[3]:
{'page': 1, 'total_pages': 1, 'total_results': 9}

Alternatively, query the database by passing keyword arguments:

[4]:
entries = pyc.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
    page=1
)

entries
[4]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate
0 P11310 9.930 421 MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... 9606 0.517815 0.016627 0.180523 1 1 1
1 Q658P3 9.677 488 MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... 9606 0.157787 0.000000 0.086066 1 1 0
2 Q16795 10.997 377 MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... 9606 0.363395 0.037135 0.124668 1 1 0
3 O95299 9.244 355 MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... 9606 0.000000 0.000000 0.000000 1 1 0
4 P13804 8.627 333 MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... 9606 0.300300 0.027027 0.333333 1 1 0
[5]:
entries.attrs
[5]:
{'page': 1, 'total_pages': 1, 'total_results': 5}

Supported query keywords

  • uniprot_id: The UniProt ID of the protein.

  • sequence: The amino acid sequence of the protein to search for. (full match)

  • min_length / max_length: Min/Max number of residues in the protein.

  • min_helix / max_helix: Min/Max percentage of helical structure in the protein.

  • min_turn / max_turn: Min/Max percentage of turn structure in the protein.

  • min_strand / max_strand: Min/Max percentage of beta strand structure in the protein.

  • organism: Taxonomic name of the genus / species of the protein. (case-insensitive)

    • Species name or any parent taxonomic level can be used. (use pyc.get_organism_list() for the full list)

    • Surround with : to get precise results

      • :homo: returns Homo sapiens & Homo sapiens neanderthalensis

      • homo also returns homoeomma, thomomys, and hundreds of others

  • organism_id: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use organism instead)

  • cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).

  • enzyme: Enzyme Commission number of the protein. (1.3.1.3 or 1.3.*.* or 1.*).

  • has_substrate: Whether the protein has a known substrate. (True/False)

  • has_ptm: Whether the protein has a known post-translational modification. (True/False)

  • has_pdb: Whether the protein has a known PDB structure. (True/False)

  • disease: The disease associated with the protein. (name of disease, case-insensitive, e.g. cancer)

    • Use pyc.get_disease_list() for full list.

    • cancer searches for Ovarian cancer, Lung cancer, …

  • disease_id: The ID of the disease associated with the protein. (e.g. DI-02205, use pyc.get_disease_list() for the full list)

  • has_disease: Whether the protein is associated with a disease. (True/False)

  • cofactor: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. Zn(2+))

  • cofactor_id: The ID of the cofactor associated with the protein. (e.g. CHEBI:00001, use pyc.get_cofactor_list() for the full list)

  • biological_process: Biological process associated with the protein. (e.g. antiviral defense, use pyc.get_biological_process_list() for the full list)

  • cellular_component: Cellular component associated with the protein. (e.g. nucleus, use pyc.get_cellular_component_list() for the full list)

  • domain: Domain associated with the protein. (e.g. zinc-finger, use pyc.get_domain_list() for the full list)

  • ligand: Ligand associated with the protein. (e.g. zinc, use pyc.get_ligand_list() for the full list)

  • molecular_function: Molecular function associated with the protein. (e.g. antioxidant activity, use pyc.get_molecular_function_list() for the full list)

  • ptm: Post-translational modification associated with the protein. (e.g. phosphoprotein, use pyc.get_ptm_list() for the full list)
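Because a misspelled keyword is easy to miss, it can help to check a query against the list above before sending it. The following is a minimal sketch: build_query and SUPPORTED_KEYWORDS are hypothetical names introduced here for illustration and are not part of PyCoM. The keyword names are copied from the list above.

```python
# Keyword names copied from the list above. The helper itself is a
# hypothetical illustration and not part of the PyCoM API.
SUPPORTED_KEYWORDS = {
    "uniprot_id", "sequence",
    "min_length", "max_length",
    "min_helix", "max_helix",
    "min_turn", "max_turn",
    "min_strand", "max_strand",
    "organism", "organism_id",
    "cath", "enzyme",
    "has_substrate", "has_ptm", "has_pdb",
    "disease", "disease_id", "has_disease",
    "cofactor", "cofactor_id",
    "biological_process", "cellular_component",
    "domain", "ligand", "molecular_function", "ptm",
}


def build_query(**conditions):
    """Return the conditions as a dict, rejecting keywords not listed above."""
    unknown = sorted(set(conditions) - SUPPORTED_KEYWORDS)
    if unknown:
        raise ValueError(f"Unsupported query keyword(s): {', '.join(unknown)}")
    return dict(conditions)


query = build_query(
    organism=":homo:",  # surrounded by ':' for a precise taxonomic match
    enzyme="1.3.*.*",   # wildcard over the last two EC levels
    has_pdb=True,
    max_length=300,
)
print(query)
```

With a remote instance, the checked dictionary could then be unpacked into keyword arguments, for example pyc.find(**query, page=1).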

Pagination

Remote PyCom automatically paginates results. The default page size is 10 entries, but it can be changed with pyc.find(..., per_page=100). The maximum page size is 100 entries, or 10 if loading matrices.

To load more entries than fit on one page, set the page parameter to the desired page number:

[6]:
page1 = pyc.find(max_length=20, page=1, per_page=100)  # get first 100 entries with length <= 20
page2 = pyc.find(max_length=20, page=2, per_page=100)  # get entries 101-200 with length <= 20

# pages can be concatenated
pages = pd.concat([page1, page2], ignore_index=True)

print(f'Page 1: {len(page1)} entries, Page 2: {len(page2)} entries, Total: {len(pages)} entries')
Page 1: 100 entries, Page 2: 100 entries, Total: 200 entries
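When the number of pages is not known in advance, the total_pages value in the attrs of the returned dataframe can drive a loop. The sketch below assumes only what is shown above: that find accepts page and per_page and returns a dataframe whose attrs contain total_pages. fetch_all_pages is a hypothetical helper, and fake_find is an offline stand-in for pyc.find so that the example runs without network access.

```python
import pandas as pd


def fetch_all_pages(find, per_page=100, **conditions):
    """Call find(...) page by page and return one concatenated DataFrame.

    `find` is any callable with the same interface as pyc.find: it accepts
    page= and per_page= and returns a DataFrame whose .attrs contains
    'total_pages'.
    """
    first = find(page=1, per_page=per_page, **conditions)
    total_pages = first.attrs.get("total_pages", 1)
    pages = [first]
    for page in range(2, total_pages + 1):
        pages.append(find(page=page, per_page=per_page, **conditions))
    return pd.concat(pages, ignore_index=True)


# Offline stand-in for pyc.find (hypothetical data: 250 made-up IDs).
def fake_find(page=1, per_page=100, **conditions):
    all_ids = [f"ID{i:03d}" for i in range(250)]
    start = (page - 1) * per_page
    df = pd.DataFrame({"uniprot_id": all_ids[start:start + per_page]})
    df.attrs = {
        "page": page,
        "total_pages": -(-len(all_ids) // per_page),  # ceiling division
        "total_results": len(all_ids),
    }
    return df


everything = fetch_all_pages(fake_find, per_page=100, max_length=20)
print(len(everything))  # 250
```

With a remote instance, the call would presumably be fetch_all_pages(pyc.find, per_page=100, max_length=20). Each page is a separate request, so large result sets remain slow.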

Load coevolution matrices

Coevolution matrices can be loaded by setting the matrix parameter: pyc.find(..., matrix=True).

This loads them into the matrix column of the dataframe.

[7]:
results = pyc.find(max_length=20, page=1, matrix=True)

results.iloc[0].matrix  # show the coevolution matrix for the first entry
[7]:
array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
        0.00000000e+00],
       [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
        4.54485416e-07],
       [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
        2.98023224e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
        2.23517418e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
        0.00000000e+00]])
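As an illustration of working with such a matrix, the sketch below ranks position pairs by their score. It assumes that the matrix is symmetric (as the one printed above is) and that row and column indices correspond to zero-based sequence positions; the second point is an assumption, not something stated here. The values are rounded copies of the output above, and top_pairs is a hypothetical helper.

```python
import numpy as np

# Values copied (rounded) from the 5x5 matrix printed above.
matrix = np.array([
    [0.0,       2.161e-07, 1.565e-07, 0.0,       0.0],
    [2.161e-07, 0.0,       4.619e-07, 4.545e-07, 4.545e-07],
    [1.565e-07, 4.619e-07, 0.0,       2.980e-07, 2.980e-07],
    [0.0,       4.545e-07, 2.980e-07, 0.0,       2.235e-07],
    [0.0,       4.545e-07, 2.980e-07, 2.235e-07, 0.0],
])


def top_pairs(mat, k=3):
    """Return the k highest-scoring (i, j, score) pairs with i < j."""
    # Upper triangle without the diagonal, so each pair is counted once.
    rows, cols = np.triu_indices(mat.shape[0], k=1)
    scores = mat[rows, cols]
    order = np.argsort(scores)[::-1][:k]  # indices of the largest scores first
    return [(int(rows[n]), int(cols[n]), float(scores[n])) for n in order]


for i, j, score in top_pairs(matrix, k=3):
    print(f"positions {i} and {j}: {score:.3e}")
```

Only the upper triangle is scanned, which relies on the symmetry assumption. Pairs with equal scores may appear in either order.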

By default, the matrices are loaded as a numpy.ndarray. Other formats can be requested with the mat_format parameter.

Here is an example of the matrices being loaded as NumPy arrays, pandas DataFrames, and 2D lists:

[9]:
from pycom import MatrixFormat

resultsNumpy = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.NUMPY)
resultsPandas = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.PANDAS)
resultsList = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.LIST)

print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')
print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')
print(f'List: {type(resultsList.iloc[0].matrix)}')
Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>