Getting started: PyCoM Local

PyCoM, the Python interface for PyCoMDB, can be run both locally and remotely.

  • Run locally for large-scale analyses. Requires 115GB of disk space for the database.

  • Run remotely for small-scale analyses. No disk space required. Follow the tutorial 00_Getting_Started_Remotely.ipynb.

This is a crash course on how to use the local variant of the Python interface for the PyCoM database.

  1. Installation

  2. Initialise PyCom object

  3. Supported query keywords

  4. Paginate the results

  5. Load coevolution matrices

  6. Adding biological data to dataframe

More in-depth tutorials are available here: https://pycom.brunel.ac.uk/tutorials.html

Installation

Install the PyCom package:

pip3 install git+https://github.com/scdantu/pycom

Note: Requires Python 3.8 or higher

Download the pycom.db and pycom.mat files from https://pycom.brunel.ac.uk/downloads

Initialise PyCom object

Import the required classes and create a PyCom object:

[1]:
from pycom import PyCom, ProteinParams

pyc = PyCom(db_path='~/docs/pycom.db', mat_path='~/docs/pycom.mat')
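Both files must be downloaded before the object is created, so a quick path check can give a clearer error than a failed database open. The following is a minimal sketch using only the standard library; `resolve_data_file` is a hypothetical helper and not part of the pycom package, and the paths in the usage comments are the ones from the cell above.

```python
from pathlib import Path


def resolve_data_file(path):
    """Expand '~' and return the absolute path, or raise if the file is missing.

    Hypothetical helper for illustration; not part of the pycom package.
    """
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f'Expected a PyCom data file at {resolved}')
    return resolved


# Usage (paths as in the cell above):
# db_path = resolve_data_file('~/docs/pycom.db')
# mat_path = resolve_data_file('~/docs/pycom.mat')
# pyc = PyCom(db_path=str(db_path), mat_path=str(mat_path))
```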

Query the database

Query the database by passing a dictionary of keywords:

[2]:
entries = pyc.find({
    ProteinParams.ENZYME: '3.*.*.*',
    ProteinParams.DISEASE: 'cancer',  # string search, case-insensitive
})

entries
[2]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate matrix
0 P01111 12.817 189 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.349206 0.015873 0.227513 1 1 1 None
1 P01112 12.841 189 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.317460 0.031746 0.359788 1 1 1 None
2 P01116 12.626 189 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.375661 0.031746 0.328042 1 1 1 None
3 P62070 12.754 204 MAAAGWRDGSGQEKYRLVVVGGGGVGKSALTIQFIQSYFVTDYDPT... 9606 0.299020 0.019608 0.220588 1 1 1 None
4 Q9UNW1 9.554 487 MLRAPGCLLRTSVAPAAALAAALLSSLARCSLLEPRDPVASSLSPY... 9606 0.000000 0.000000 0.000000 0 0 1 None

Alternatively, query the database by passing keyword arguments:

[3]:
entries = pyc.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
)

entries
[3]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate matrix
0 P11310 9.930 421 MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... 9606 0.517815 0.016627 0.180523 1 1 1 None
1 Q658P3 9.677 488 MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... 9606 0.157787 0.000000 0.086066 1 1 0 None
2 Q16795 10.997 377 MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... 9606 0.363395 0.037135 0.124668 1 1 0 None
3 O95299 9.244 355 MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... 9606 0.000000 0.000000 0.000000 1 1 0 None
4 P13804 8.627 333 MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... 9606 0.300300 0.027027 0.333333 1 1 0 None

Supported query keywords

  • uniprot_id: The UniProt ID of the protein.

  • sequence: The amino acid sequence of the protein to search for. (full match)

  • min_length / max_length: Min/Max number of residues in the protein.

  • min_helix / max_helix: Min/Max percentage of helical structure in the protein.

  • min_turn / max_turn: Min/Max percentage of turn structure in the protein.

  • min_strand / max_strand: Min/Max percentage of beta strand structure in the protein.

  • organism: Taxonomic name of the genus / species of the protein. (case-insensitive)

    • Species name or any parent taxonomic level can be used. (use pyc.get_organism_list() for full list)

    • Surround with : to get precise results

      • :homo: returns Homo sapiens & Homo sapiens neanderthalensis

      • homo also returns homoeomma, thomomys, and hundreds of others

  • organism_id: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use organism instead)

  • cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).

  • enzyme: Enzyme Commission number of the protein. (1.3.1.3 or 1.3.*.* or 1.*).

  • has_substrate: Whether the protein has a known substrate. (True/False)

  • has_ptm: Whether the protein has a known post-translational modification. (True/False)

  • has_pdb: Whether the protein has a known PDB structure. (True/False)

  • disease: The disease associated with the protein. (name of disease, case-insensitive, e.g. cancer)

    • Use pyc.get_disease_list() for full list.

    • cancer searches for Ovarian cancer, Lung cancer, …

  • disease_id: The ID of the disease associated with the protein. (e.g. DI-02205, use pyc.get_disease_list() for full list)

  • has_disease: Whether the protein is associated with a disease. (True/False)

  • cofactor: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. Zn(2+))

  • cofactor_id: The ID of the cofactor associated with the protein. (e.g. CHEBI:00001, use pyc.get_cofactor_list() for full list)

  • biological_process: Biological process associated with the protein. (e.g. antiviral defense, use pyc.get_biological_process_list() for full list)

  • cellular_component: Cellular component associated with the protein. (e.g. nucleus, use pyc.get_cellular_component_list() for full list)

  • domain: Domain associated with the protein. (e.g. zinc-finger, use pyc.get_domain_list() for full list)

  • ligand: Ligand associated with the protein. (e.g. zinc, use pyc.get_ligand_list() for full list)

  • molecular_function: Molecular function associated with the protein. (e.g. antioxidant activity, use pyc.get_molecular_function_list() for full list)

  • ptm: Post-translational modification associated with the protein. (e.g. phosphoprotein, use pyc.get_ptm_list() for full list)
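The wildcard forms accepted by cath and enzyme can be read as shell-style patterns. The sketch below uses Python's `fnmatch` module purely to illustrate which classification strings a pattern such as `3.*.*.*` would cover. It is an assumption that PyCom's own matching behaves the same way, and `matches_classification` is a hypothetical helper, not part of the package.

```python
from fnmatch import fnmatchcase


def matches_classification(code, pattern):
    """Return True if a dotted classification code fits a wildcard pattern.

    Illustration only; PyCom's internal matching may differ.
    """
    return fnmatchcase(code, pattern)


print(matches_classification('3.4.21.5', '3.*.*.*'))  # True
print(matches_classification('1.3.1.3', '1.3.*.*'))   # True
print(matches_classification('2.4.1.38', '3.*.*.*'))  # False
```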

Paginate the results

Before loading coevolution matrices, it is recommended to paginate the results, as the matrices can take up a lot of memory.

Here is an example of making a large query, then paginating the results:

[4]:
entries = pyc.find(max_length=20)
print(f'Found {len(entries)} entries with length <= 20')

page = pyc.paginate(entries, page=1, per_page=100)  # first page of results (per_page defaults to 100)
print(f'Found {len(page)} entries on page 1')
Found 2958 entries with length <= 20
Found 100 entries on page 1
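To plan a loop over all pages, it helps to know how many pages a query produces and how large the last one is. The sketch below is plain arithmetic on the numbers printed above (2958 entries, 100 per page). It assumes pages are numbered from 1, as in the `page=1` call above; `page_bounds` is a hypothetical helper and does not describe how `pyc.paginate` is implemented.

```python
import math


def page_bounds(total, page, per_page=100):
    """Return (start, stop) row offsets for a 1-based page number."""
    if page < 1 or per_page < 1:
        raise ValueError('page and per_page must be positive')
    start = min((page - 1) * per_page, total)
    stop = min(start + per_page, total)
    return start, stop


total = 2958  # number of entries found by the query above
n_pages = math.ceil(total / 100)

print(n_pages)                      # 30
print(page_bounds(total, 1))        # (0, 100)
print(page_bounds(total, n_pages))  # (2900, 2958)
```

With these numbers, the last page holds 58 entries, so matrices for the full result set could be loaded in 30 batches of at most 100 entries each.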

Load coevolution matrices

Now the coevolution matrices can be loaded for the paginated results.

This loads them into the matrix column of the dataframe.

[5]:
pyc.load_matrices(page)

page.iloc[0].matrix  # show the coevolution matrix for the first entry
[5]:
array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
        0.00000000e+00],
       [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
        4.54485416e-07],
       [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
        2.98023224e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
        2.23517418e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
        0.00000000e+00]])

By default, the matrices are loaded as a numpy.ndarray. Different formats can be specified.

Here is an example of the matrices being loaded as Pandas DataFrames and 2d-lists:

[2]:
from pycom import MatrixFormat

resultsNumpy = pyc.load_matrices(page, mat_format=MatrixFormat.NUMPY)  # default
resultsPandas = pyc.load_matrices(page, mat_format=MatrixFormat.PANDAS)
resultsList = pyc.load_matrices(page, mat_format=MatrixFormat.LIST)

print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')
print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')
print(f'List: {type(resultsList.iloc[0].matrix)}')
Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>
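Once a matrix is loaded, a common next step is to pick out the position pairs with the largest values. The sketch below works on a nested list, as returned with MatrixFormat.LIST, and uses the 5x5 matrix printed earlier as input. `top_pairs` is a hypothetical helper; it assumes the matrix is symmetric (as the printed example is), and the indices it reports are 0-based row and column positions.

```python
def top_pairs(matrix, n=3):
    """Return the n largest off-diagonal entries as (i, j, score) tuples.

    Only the upper triangle (i < j) is scanned, which assumes a symmetric matrix.
    """
    size = len(matrix)
    pairs = [
        (i, j, float(matrix[i][j]))
        for i in range(size)
        for j in range(i + 1, size)
    ]
    # Stable sort: ties keep their original (row-major) order
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs[:n]


# The 5x5 matrix printed in the example above
example = [
    [0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00, 0.00000000e+00],
    [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07, 4.54485416e-07],
    [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07, 2.98023224e-07],
    [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00, 2.23517418e-07],
    [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07, 0.00000000e+00],
]

for i, j, score in top_pairs(example):
    print(i, j, score)
```

For this matrix the three pairs reported are (1, 2), (1, 3) and (1, 4).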

Adding biological data to dataframe

This is supported in the local variant only!

PyCom contains a lot of additional protein annotation info. This is not loaded by default, but can be added if needed.

The lists of cofactors, diseases, and organisms can be loaded by calling:

[7]:
cofactors = pyc.get_cofactor_list()
diseases = pyc.get_disease_list()
organisms = pyc.get_organism_list()

cofactors
[7]:
cofactorId cofactorName
0 CHEBI:597326 pyridoxal 5'-phosphate
1 CHEBI:18420 Mg(2+)
2 CHEBI:60240 a divalent metal cation
3 CHEBI:30413 heme
4 CHEBI:29105 Zn(2+)
... ... ...
109 CHEBI:61721 chlorophyll b
110 CHEBI:73095 divinyl chlorophyll a
111 CHEBI:73096 divinyl chlorophyll b
112 CHEBI:57453 (6S)-5,6,7,8-tetrahydrofolate
113 CHEBI:30402 tungstopterin

114 rows × 2 columns

[30]:
loader = pyc.get_data_loader()

entries = pyc.find(uniprot_id='P15291')

# Add the protein's cofactors to the dataframe
entries = loader.add_cofactors(entries)

# The following functions are supported, data taken directly from UniProt
entries = loader.add_biological_processes(entries)
entries = loader.add_cath_class(entries)  # Protein's CATH
entries = loader.add_coding_sequence_diversity(entries)  # https://www.uniprot.org/help/keywords
entries = loader.add_cofactors(entries)  # Cofactors
entries = loader.add_developmental_stage(entries)
entries = loader.add_diseases(entries)  # The diseases associated with the protein
entries = loader.add_enzyme_commission(entries)  # Protein's EC
entries = loader.add_ligand(entries)  # Ligands
entries = loader.add_molecular_function(entries)
entries = loader.add_organism_name(entries)
entries = loader.add_organism_taxonomy(entries)
entries = loader.add_pdbs(entries)  # Experimental PDB IDs of protein
entries = loader.add_protein_cellular_component(entries)
entries = loader.add_protein_domain(entries)
entries = loader.add_ptm(entries)  # Protein's Post-translational modifications
entries = loader.add_substrates(entries)  # Protein's substrates

entries.iloc[0]
[30]:
uniprot_id                                                              P15291
neff                                                                     7.854
sequence_length                                                            398
sequence                     MRLREPLLSGSAAMPGASLQRACRLLVAVCALHLGVTLVYYLAGRD...
organism_id                                                               9606
helix_frac                                                            0.198492
turn_frac                                                             0.030151
strand_frac                                                           0.163317
has_ptm                                                                      1
has_pdb                                                                      1
has_substrate                                                                1
matrix                                                                    None
cofactor_x                                                            [Mn(2+)]
biological_process                                          [Lipid metabolism]
cath_class                                                         3.90.550.10
coding_sequence_diversity                             [Alternative initiation]
cofactor_y                                                            [Mn(2+)]
developmental_stage                                                        NaN
disease_name                         [Congenital disorder of glycosylation 2D]
disease_id                                                          [DI-00349]
enzyme_commission                                                      2.4.1.-
ligand                                              [Manganese, Metal-binding]
molecular_function                          [Glycosyltransferase, Transferase]
organism_name                                                     Homo sapiens
taxonomy                     [Eukaryota, Metazoa, Chordata, Craniata, Verte...
pdb_id                       [2AE7, 2AEC, 2AES, 2AGD, 2AH9, 2FY7, 2FYA, 2FY...
cellular_component           [Cell membrane, Cell projection, Golgi apparat...
domain                       [Signal-anchor, Transmembrane, Transmembrane h...
ptm                                                                        NaN
substrate                    [D-glucose + UDP-alpha-D-galactose = H(+) + la...
Name: 0, dtype: object
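In the output above, some annotation columns hold lists, some hold a single string, and missing annotations appear as NaN. When counting or iterating over annotations it can be convenient to normalise all of these to lists first. The following is a minimal sketch; `as_list` is a hypothetical helper, and it assumes that list-valued cells are plain Python lists or tuples, which the printed output suggests but does not confirm.

```python
import math


def as_list(value):
    """Return an annotation value as a list, mapping missing values to []."""
    if value is None:
        return []
    if isinstance(value, float) and math.isnan(value):
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


# Values copied from the record printed above
record = {
    'ligand': ['Manganese', 'Metal-binding'],
    'ptm': float('nan'),
    'organism_name': 'Homo sapiens',
}

counts = {key: len(as_list(val)) for key, val in record.items()}
print(counts)  # {'ligand': 2, 'ptm': 0, 'organism_name': 1}
```

On a dataframe, the same helper could be applied per column, for example `entries['ptm'].apply(as_list)`.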