Getting started: PyCoM Local

PyCoM, the Python interface for PyCoMDB, can run both locally and remotely:
- Run locally for large-scale analyses. Requires 115GB of disk space for the database.
- Run remotely for small-scale analyses. No disk space required. Follow the tutorial 00_Getting_Started_Remotely.ipynb.

This is a crash course on how to use the local variant of the Python interface for the PyCoM database:
1. Installation
2. Initialise PyCom object
3. Supported query keywords
4. Paginate the results
5. Load coevolution matrices
6. Adding biological data to dataframe

More in-depth tutorials are available here: https://pycom.brunel.ac.uk/tutorials.html

Installation

Install the PyCom package:

pip3 install git+https://github.com/scdantu/pycom

Note: Requires Python 3.8 or higher

Download the pycom.db and pycom.mat files from https://pycom.brunel.ac.uk/downloads

[ ]:

!mkdir -p ~/docs
# note: downloads are 700MB and 114GB (!) respectively.
# if too much: consider running remote version (other tutorial!)
!wget -P ~/docs https://pycom.brunel.ac.uk/downloads/pycom.db
!wget -P ~/docs https://pycom.brunel.ac.uk/downloads/pycom.mat
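
Because the two downloads are large, it can help to confirm that the target drive has enough free space before starting. The following is a minimal sketch using only the standard library. The helper name `has_free_space` and the 115 GB threshold (taken from the figure quoted at the top of this tutorial) are illustrative assumptions and not part of PyCoM.

```python
import os
import shutil
from pathlib import Path

REQUIRED_GB = 115  # figure quoted in this tutorial; adjust if the downloads page differs


def has_free_space(directory, required_gb=REQUIRED_GB):
    """Return True if the drive holding `directory` has at least `required_gb` GB free."""
    path = Path(os.path.expanduser(directory))
    # Walk up to the nearest existing parent so the check also works before `mkdir`.
    while not path.exists():
        path = path.parent
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


print(has_free_space('~/docs'))
```

If this prints `False`, the remote variant described in the other tutorial is probably the better choice.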

Initialise PyCom object

Import the required classes and create a PyCom object:

[1]:

from pycom import PyCom, ProteinParams

pyc = PyCom(db_path='~/docs/pycom.db', mat_path='~/docs/pycom.mat')
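
The paths above start with `~`. Whether `PyCom` expands the tilde internally is not stated in this tutorial, so a cautious option is to expand and check both paths before creating the object. The helper `resolve_data_file` below is hypothetical and not part of PyCoM; it is only a sketch of that check.

```python
import os


def resolve_data_file(path):
    """Expand `~` and return an absolute path, raising if the file is missing."""
    full_path = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full_path):
        raise FileNotFoundError(f'Expected data file not found: {full_path}')
    return full_path


# Usage (assumes both files were downloaded to ~/docs as shown earlier):
# pyc = PyCom(db_path=resolve_data_file('~/docs/pycom.db'),
#             mat_path=resolve_data_file('~/docs/pycom.mat'))
```

Failing early with a clear message is usually easier to debug than an error raised later from inside a database or matrix read.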

Query the database

Query the database by passing a dictionary of keywords:

[2]:

entries = pyc.find({
    ProteinParams.ENZYME: '3.*.*.*',
    ProteinParams.DISEASE: 'cancer',  # string search, case-insensitive
})

entries

[2]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate	matrix
0	P01111	12.817	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.349206	0.015873	0.227513	1	1	1	None
1	P01112	12.841	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.317460	0.031746	0.359788	1	1	1	None
2	P01116	12.626	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.375661	0.031746	0.328042	1	1	1	None
3	P62070	12.754	204	MAAAGWRDGSGQEKYRLVVVGGGGVGKSALTIQFIQSYFVTDYDPT...	9606	0.299020	0.019608	0.220588	1	1	1	None
4	Q9UNW1	9.554	487	MLRAPGCLLRTSVAPAAALAAALLSSLARCSLLEPRDPVASSLSPY...	9606	0.000000	0.000000	0.000000	0	0	1	None
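
The `'3.*.*.*'` pattern in the query above selects every enzyme in EC class 3. How PyCoM evaluates the wildcard internally is not documented here; the sketch below only illustrates the glob-style semantics the pattern suggests, using the standard library's `fnmatchcase`. The list of EC numbers and the helper `matches_ec` are illustrative assumptions.

```python
from fnmatch import fnmatchcase

# Illustrative EC numbers, not a query result.
ec_numbers = ['3.6.5.2', '2.4.1.-', '1.3.1.3', '3.4.21.5']


def matches_ec(ec_number, pattern):
    """Return True if `ec_number` fits a glob-style pattern such as '3.*.*.*'."""
    return fnmatchcase(ec_number, pattern)


hydrolases = [ec for ec in ec_numbers if matches_ec(ec, '3.*.*.*')]
print(hydrolases)  # ['3.6.5.2', '3.4.21.5']
```

Note that the literal `3.` prefix keeps the pattern from matching unrelated classes; only the starred positions are free.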

Alternatively, query the database by passing keyword arguments:

[3]:

entries = pyc.find(
    cofactor='FAD',  # string search, case-insensitive
    has_ptm=True,
    has_disease=True,
)

entries

[3]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate	matrix
0	P11310	9.930	421	MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK...	9606	0.517815	0.016627	0.180523	1	1	1	None
1	Q658P3	9.677	488	MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR...	9606	0.157787	0.000000	0.086066	1	1	0	None
2	Q16795	10.997	377	MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG...	9606	0.363395	0.037135	0.124668	1	1	0	None
3	O95299	9.244	355	MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG...	9606	0.000000	0.000000	0.000000	1	1	0	None
4	P13804	8.627	333	MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR...	9606	0.300300	0.027027	0.333333	1	1	0	None

Supported query keywords

uniprot_id: The UniProt ID of the protein.
sequence: The amino acid sequence of the protein to search for. (full match)
min_length / max_length: Min/Max number of residues in the protein.
min_helix / max_helix: Min/Max percentage of helical structure in the protein.
min_turn / max_turn: Min/Max percentage of turn structure in the protein.
min_strand / max_strand: Min/Max percentage of beta strand structure in the protein.
organism: Taxonomic name of the genus / species of the protein. (case-insensitive)
- Species name or any parent taxonomic level can be used. (pyc.get_organism_list() for full list)
- Surround with : to get precise results
  - :homo: returns Homo sapiens & Homo sapiens neanderthalensis
  - homo also returns homoeomma, thomomys, and hundreds of others
organism_id: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use organism instead)
cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).
enzyme: Enzyme Commission number of the protein. (1.3.1.3 or 1.3.*.* or 1.*).
has_substrate: Whether the protein has a known substrate. (True/False)
has_ptm: Whether the protein has a known post-translational modification. (True/False)
has_pdb: Whether the protein has a known PDB structure. (True/False)
disease: The disease associated with the protein. (name of disease, case-insensitive, e.g. cancer)
- Use pyc.get_disease_list() for full list.
- cancer searches for Ovarian cancer, Lung cancer, …
disease_id: The ID of the disease associated with the protein. (DI-02205, get_disease_list())
has_disease: Whether the protein is associated with a disease. (True/False)
cofactor: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. Zn(2+))
cofactor_id: The ID of the cofactor associated with the protein. (CHEBI:00001, get_cofactor_list())
biological_process: Biological process associated with the protein. (e.g. antiviral defense, use pyc.get_biological_process_list() for full list)
cellular_component: Cellular component associated with the protein. (e.g. nucleus, use pyc.get_cellular_component_list() for full list)
domain: Domain associated with the protein. (e.g. zinc-finger, use pyc.get_domain_list() for full list)
ligand: Ligand associated with the protein. (e.g. zinc, use pyc.get_ligand_list() for full list)
molecular_function: Molecular function associated with the protein. (e.g. antioxidant activity, use pyc.get_molecular_function_list() for full list)
ptm: Post-translational modification associated with the protein. (e.g. phosphoprotein, use pyc.get_ptm_list() for full list)
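
The difference between `homo` and `:homo:` in the organism keyword is easier to see with a small example. The sketch below assumes, purely for illustration, that each lineage is stored as a colon-delimited lowercase string and searched by substring; PyCoM's actual storage is not described in this tutorial. The shortened lineages and the helpers `lineage_key` and `search` are hypothetical.

```python
# Shortened, illustrative lineages (not taken from the PyCoM database).
lineages = {
    'Homo sapiens': ['Hominidae', 'Homo', 'Homo sapiens'],
    'Homo sapiens neanderthalensis': [
        'Hominidae', 'Homo', 'Homo sapiens', 'Homo sapiens neanderthalensis',
    ],
    'Homoeomma': ['Theraphosidae', 'Homoeomma'],
    'Thomomys bottae': ['Geomyidae', 'Thomomys', 'Thomomys bottae'],
}


def lineage_key(levels):
    """Join taxonomic levels into ':level:level:...:' in lowercase."""
    return ':' + ':'.join(levels).lower() + ':'


def search(query):
    """Case-insensitive substring search over the delimited lineage strings."""
    needle = query.lower()
    return [name for name, levels in lineages.items() if needle in lineage_key(levels)]


print(search('homo'))    # all four entries match
print(search(':homo:'))  # only the two entries under the genus Homo
```

Under this assumption, the surrounding colons force the query to match a whole taxonomic level rather than any fragment of a name.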

Paginate the results

Before loading coevolution matrices, it is recommended to paginate the results, as the matrices can take up a lot of memory.

Here is an example of making a large query, then paginating the results:

[4]:

entries = pyc.find(max_length=20)
print(f'Found {len(entries)} entries with length <= 20')

page = pyc.paginate(entries, page=1, per_page=100)  # get first n entries (default 100)
print(f'Found {len(page)} entries on page 1')

Found 2958 entries with length <= 20
Found 100 entries on page 1
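
The arithmetic behind pagination is simple slicing. The sketch below reproduces it on a plain list so the page boundaries are easy to check; it is not PyCoM's implementation, and the standalone functions `paginate` and `iter_pages` are illustrative. Pages are 1-indexed here to match the `page=1` call above.

```python
import math


def paginate(items, page=1, per_page=100):
    """Return the slice of `items` belonging to a 1-indexed page."""
    if page < 1 or per_page < 1:
        raise ValueError('page and per_page must be positive')
    start = (page - 1) * per_page
    return items[start:start + per_page]


def iter_pages(items, per_page=100):
    """Yield every page in order, so large results can be processed in chunks."""
    total_pages = math.ceil(len(items) / per_page)
    for page in range(1, total_pages + 1):
        yield paginate(items, page, per_page)


entries = list(range(2958))  # stand-in for the 2958 query results
pages = list(iter_pages(entries))
print(len(pages), len(pages[0]), len(pages[-1]))  # 30 100 58
```

With 2958 entries and 100 per page there are 30 pages, the last holding 58 entries. Processing one page at a time keeps only that page's matrices in memory.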

Load coevolution matrices

Now the coevolution matrices can be loaded for the paginated results.

This loads them into the matrix column of the dataframe.

[5]:

pyc.load_matrices(page)

page.iloc[0].matrix  # show the coevolution matrix for the first entry

[5]:

array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,
        0.00000000e+00],
       [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,
        4.54485416e-07],
       [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,
        2.98023224e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,
        2.23517418e-07],
       [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,
        0.00000000e+00]])
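
The matrix shown above is symmetric with a zero diagonal, so each residue pair only needs to be read once, from the upper triangle. The sketch below checks this and ranks the highest-scoring pairs, using the same values written out as nested lists so that it runs without extra packages. The helpers `is_symmetric` and `top_pairs` are illustrative and not part of PyCoM; indices are zero-based.

```python
matrix = [
    [0.0, 2.16066837e-07, 1.56462193e-07, 0.0, 0.0],
    [2.16066837e-07, 0.0, 4.61935997e-07, 4.54485416e-07, 4.54485416e-07],
    [1.56462193e-07, 4.61935997e-07, 0.0, 2.98023224e-07, 2.98023224e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 0.0, 2.23517418e-07],
    [0.0, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07, 0.0],
]


def is_symmetric(m, tol=1e-12):
    """Return True if m[i][j] equals m[j][i] (within `tol`) for every pair."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(i + 1, n))


def top_pairs(m, k=3):
    """Return the k highest-scoring (i, j, score) pairs from the upper triangle."""
    n = len(m)
    pairs = [(i, j, m[i][j]) for i in range(n) for j in range(i + 1, n)]
    # Highest score first; ties are broken by residue indices for a stable order.
    pairs.sort(key=lambda p: (-p[2], p[0], p[1]))
    return pairs[:k]


print(is_symmetric(matrix))  # True
print(top_pairs(matrix, k=1))  # [(1, 2, 4.61935997e-07)]
```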

By default, the matrices are loaded as a numpy.ndarray. Different formats can be specified.

Here is an example of the matrices being loaded as Pandas DataFrames and 2d-lists:

[2]:

from pycom import MatrixFormat

resultsNumpy = pyc.load_matrices(page, mat_format=MatrixFormat.NUMPY)  # default
resultsPandas = pyc.load_matrices(page, mat_format=MatrixFormat.PANDAS)
resultsList = pyc.load_matrices(page, mat_format=MatrixFormat.LIST)

print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')
print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')
print(f'List: {type(resultsList.iloc[0].matrix)}')

Numpy: <class 'numpy.ndarray'>
Pandas: <class 'pandas.core.frame.DataFrame'>
List: <class 'list'>

Adding biological data to dataframe

This is supported in the local variant only!

PyCom contains a lot of additional protein annotation info. This is not loaded by default, but can be added if needed.

The list of cofactors, diseases, and organisms can be loaded by calling:

[7]:

cofactors = pyc.get_cofactor_list()
diseases = pyc.get_disease_list()
organisms = pyc.get_organism_list()

cofactors

[7]:

	cofactorId	cofactorName
0	CHEBI:597326	pyridoxal 5'-phosphate
1	CHEBI:18420	Mg(2+)
2	CHEBI:60240	a divalent metal cation
3	CHEBI:30413	heme
4	CHEBI:29105	Zn(2+)
...	...	...
109	CHEBI:61721	chlorophyll b
110	CHEBI:73095	divinyl chlorophyll a
111	CHEBI:73096	divinyl chlorophyll b
112	CHEBI:57453	(6S)-5,6,7,8-tetrahydrofolate
113	CHEBI:30402	tungstopterin

114 rows × 2 columns
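
A list like this is handy for turning a human-readable name into the ID expected by the `cofactor_id` keyword. The sketch below builds a case-insensitive lookup from the first five rows shown above, held as plain tuples so it runs on its own. The helper `lookup_cofactor_id` is hypothetical, not a PyCoM function.

```python
# First five rows of the cofactor table shown above, as (id, name) tuples.
cofactor_rows = [
    ('CHEBI:597326', "pyridoxal 5'-phosphate"),
    ('CHEBI:18420', 'Mg(2+)'),
    ('CHEBI:60240', 'a divalent metal cation'),
    ('CHEBI:30413', 'heme'),
    ('CHEBI:29105', 'Zn(2+)'),
]

# Map lowercase names to IDs so lookups ignore case.
id_by_name = {name.lower(): cofactor_id for cofactor_id, name in cofactor_rows}


def lookup_cofactor_id(name):
    """Return the ChEBI ID for a cofactor name, or None if it is not listed."""
    return id_by_name.get(name.strip().lower())


print(lookup_cofactor_id('zn(2+)'))  # CHEBI:29105
```

The returned ID could then be passed to a query, for example `pyc.find(cofactor_id=lookup_cofactor_id('Zn(2+)'))`.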

[30]:

loader = pyc.get_data_loader()

entries = pyc.find(uniprot_id='P15291')

# Add the protein's cofactors to the dataframe
entries = loader.add_cofactors(entries)

# The following functions are supported, data taken directly from UniProt
entries = loader.add_biological_processes(entries)
entries = loader.add_cath_class(entries)  # Protein's CATH
entries = loader.add_coding_sequence_diversity(entries)  # https://www.uniprot.org/help/keywords
entries = loader.add_cofactors(entries)  # Cofactors (second call; hence cofactor_x / cofactor_y in the output)
entries = loader.add_developmental_stage(entries)
entries = loader.add_diseases(entries)  # The diseases associated with the protein
entries = loader.add_enzyme_commission(entries)  # Protein's EC
entries = loader.add_ligand(entries)  # Ligands
entries = loader.add_molecular_function(entries)
entries = loader.add_organism_name(entries)
entries = loader.add_organism_taxonomy(entries)
entries = loader.add_pdbs(entries)  # Experimental PDB IDs of protein
entries = loader.add_protein_cellular_component(entries)
entries = loader.add_protein_domain(entries)
entries = loader.add_ptm(entries)  # Protein's Post-translational modifications
entries = loader.add_substrates(entries)  # Protein's substrates

entries.iloc[0]

[30]:

uniprot_id                                                              P15291
neff                                                                     7.854
sequence_length                                                            398
sequence                     MRLREPLLSGSAAMPGASLQRACRLLVAVCALHLGVTLVYYLAGRD...
organism_id                                                               9606
helix_frac                                                            0.198492
turn_frac                                                             0.030151
strand_frac                                                           0.163317
has_ptm                                                                      1
has_pdb                                                                      1
has_substrate                                                                1
matrix                                                                    None
cofactor_x                                                            [Mn(2+)]
biological_process                                          [Lipid metabolism]
cath_class                                                         3.90.550.10
coding_sequence_diversity                             [Alternative initiation]
cofactor_y                                                            [Mn(2+)]
developmental_stage                                                        NaN
disease_name                         [Congenital disorder of glycosylation 2D]
disease_id                                                          [DI-00349]
enzyme_commission                                                      2.4.1.-
ligand                                              [Manganese, Metal-binding]
molecular_function                          [Glycosyltransferase, Transferase]
organism_name                                                     Homo sapiens
taxonomy                     [Eukaryota, Metazoa, Chordata, Craniata, Verte...
pdb_id                       [2AE7, 2AEC, 2AES, 2AGD, 2AH9, 2FY7, 2FYA, 2FY...
cellular_component           [Cell membrane, Cell projection, Golgi apparat...
domain                       [Signal-anchor, Transmembrane, Transmembrane h...
ptm                                                                        NaN
substrate                    [D-glucose + UDP-alpha-D-galactose = H(+) + la...
Name: 0, dtype: object
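
In the output above, annotation fields hold either a list of values, a single value, or `NaN` when nothing is recorded (for example `ptm` and `developmental_stage`). Code that loops over annotations is simpler if every field is first normalised to a list. The sketch below shows one way to do that on a plain dictionary copied from part of the record; the helper `as_list` is illustrative and not part of PyCoM.

```python
import math

# A few fields copied from the record shown above.
record = {
    'cofactor': ['Mn(2+)'],
    'ligand': ['Manganese', 'Metal-binding'],
    'organism_name': 'Homo sapiens',
    'ptm': float('nan'),
    'developmental_stage': float('nan'),
}


def as_list(value):
    """Normalise an annotation value: NaN -> [], list/tuple -> list, scalar -> [scalar]."""
    if isinstance(value, float) and math.isnan(value):
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


cleaned = {field: as_list(value) for field, value in record.items()}
print(cleaned['ptm'])     # []
print(cleaned['ligand'])  # ['Manganese', 'Metal-binding']
```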