Tutorial 1: Workflow of PyCoM

In this tutorial, with a small example, you will learn how to create a workflow with the local version of PyCoM, using the steps listed below:

Setup
Initalise pycom objects
Create a query dictionary
Save and retrieve progress
Analyse search results
Add biological features
Some statistics
Coevolution matrix analysis
Help on UniProt Controlled Vocabulary

Setup

The assumption is that you have completed the installation and downloaded the database. For help on this please look at the quick guide here.

Initalise pycom objects

First, lets import all the libraries and classes we need from pycom, pandas, matplotlib, and numpy

[14]:

# importing all usefull classes from PyCoM
from pycom import PyCom, ProteinParams,CoMAnalysis
import pandas as pd
import numpy as np
# matplotlib; useful for plotting later
import matplotlib
import matplotlib.pyplot as plt
#setting matplotlib parameters
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['font.family'] = "sans-serif"
matplotlib.rcParams['font.sans-serif'] = "Arial"

[15]:

#set the path to the database
database_folder_path="/Volumes/mason/Work/Sarath/Research/pycom/"
#matrix file name and path
file_matrix_db = database_folder_path+"pycom.mat"
#protein database file name and path
file_protein_db= database_folder_path+"pycom.db"

[16]:

obj_pycom = PyCom(db_path=file_protein_db, mat_path=file_matrix_db)

Create a query dictionary

To query the database, we need to create a dictionary object query_parameters using the keywords for our choice of properties. For the full list of keywords please check

Empty ``query_parameters`` will return information on all the ~457,000 proteins in the database

To query the database, we need to create a dictionary object query_parameters using the keywords for our choice of properties. For the full list of keywords please check

Empty ``query_parameters`` will return information on all the ~457,000 proteins in the database

[17]:

#creating empty query dictionary
query_parameters={}

[18]:

# Here we are asking for all the proteins that match the enzyme class 3 and have been associated with the disease cancer.
query_parameters={ProteinParams.DISEASE:"cancer",
                  ProteinParams.ENZYME: '3.*',
                  ProteinParams.MIN_LENGTH: 100,
                  ProteinParams.MAX_LENGTH: 200,
                 }

Executing the query with the parameters defined in the above cell using the pycom object obj_pycom find() function will return a pandas dataframe with the search results containing information about all the proteins which match our query.

[19]:

entries_data_frame=obj_pycom.find(query_parameters)

Save and retrieve progress

We can save and retreive our progress by saving our dataframe with information on our favourite proteins by saving it to a csv file.

Save the query to a csv file

To avoid rerunning the query we can cave the progress to a csv file.

[20]:

entries_data_frame.to_csv("output/DB_Query_Results.csv",index=False)

Read query data from csv file

Retrieving our progress from the csv file.

[21]:

#entries_data_frame=pd.read_csv("output/DB_Query_Results.csv")

Analyse search results

The search returns a pandas data frame with proteins matching the query critiera with following information for each protein:

uniprot_id: Uniprot ID
neff: Depth of the sequence alignment \(N_{eff}\)
sequence_length: Sequence length
sequence: protein sequence
organism_id: Organism ID
helic_frac, turn_frac, strand_frac: helix, turn, and strand structure fraction
has_ptm: Has a PTM Yes/No
has_pdb: Has a PDB structure Yes/No
has_substrate: Has a substrate for biological activity Yes/No
matrix: coevolution matrix column is empty because at this stage we would still want you to check the search results and if required filter them based on any of the biological properties before loading the matrices.

First, look what columns we have and their names.

[22]:

entries_data_frame.head()

[22]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate	matrix
0	P01111	12.817	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.349206	0.015873	0.227513	1	1	1	None
1	P01112	12.841	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.317460	0.031746	0.359788	1	1	1	None
2	P01116	12.626	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.375661	0.031746	0.328042	1	1	1	None

describe() function from pandas can be used to get a summary of all the features:

[23]:

entries_data_frame.describe(include="all")

[23]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate	matrix
count	3	3.000000	3.0	3	3	3.000000	3.000000	3.000000	3.0	3.0	3.0	0
unique	3	NaN	NaN	3	1	NaN	NaN	NaN	NaN	NaN	NaN	0
top	P01111	NaN	NaN	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	NaN	NaN	NaN	NaN	NaN	NaN	NaN
freq	1	NaN	NaN	1	3	NaN	NaN	NaN	NaN	NaN	NaN	NaN
mean	NaN	12.761333	189.0	NaN	NaN	0.347443	0.026455	0.305115	1.0	1.0	1.0	NaN
std	NaN	0.117815	0.0	NaN	NaN	0.029141	0.009164	0.069054	0.0	0.0	0.0	NaN
min	NaN	12.626000	189.0	NaN	NaN	0.317460	0.015873	0.227513	1.0	1.0	1.0	NaN
25%	NaN	12.721500	189.0	NaN	NaN	0.333333	0.023810	0.277778	1.0	1.0	1.0	NaN
50%	NaN	12.817000	189.0	NaN	NaN	0.349206	0.031746	0.328042	1.0	1.0	1.0	NaN
75%	NaN	12.829000	189.0	NaN	NaN	0.362434	0.031746	0.343915	1.0	1.0	1.0	NaN
max	NaN	12.841000	189.0	NaN	NaN	0.375661	0.031746	0.359788	1.0	1.0	1.0	NaN

Get counts of categorical data in the column, for example: * number of proteins with a known PDB structure * number of proteins with a known PTM

[24]:

entries_data_frame['has_pdb'].value_counts()

[24]:

has_pdb
1    3
Name: count, dtype: int64

Find number of unique elements in the column, for example number of unique organisms:

[25]:

entries_data_frame['organism_id'].unique()

[25]:

array(['9606'], dtype=object)

All the sequences are from the same organism, 9606 i.e. from Homo sapiens. Full list is available from UniProt

Some statistics on the numerical column, for example neff:

[26]:

entries_data_frame["neff"].describe()

[26]:

count     3.000000
mean     12.761333
std       0.117815
min      12.626000
25%      12.721500
50%      12.817000
75%      12.829000
max      12.841000
Name: neff, dtype: float64

We can also use other functions to get some of the information

[27]:

entries_data_frame["neff"].min()

[27]:

12.626

[28]:

entries_data_frame["neff"].mean()

[28]:

12.761333333333333

Add biological features

Initialise the object loader class and then call each add function

Add Enzyme Classification
Add CATH Class
Add Co-factors
Add PTM
Add Diseases

For a protein entry if the requested data (EC/CATH/Cofactors…) does not exist, corresponding entry in that column will be nan. We can filter such rows as shown further below.

Please note the dataloader functions will not work with remote version of PyCoM

[29]:

#initialise the object for data loader class
obj_data_loader=obj_pycom.get_data_loader()

[30]:

#add enzyme commission data to the dataframe
entries_data_frame=obj_data_loader.add_enzyme_commission(entries_data_frame,force_single_entry=False)
#add CATH data to the dataframe
entries_data_frame=obj_data_loader.add_cath_class(entries_data_frame,force_single_entry=False)
#add CATH data to the dataframe
entries_data_frame=obj_data_loader.add_pdbs(entries_data_frame,force_single_entry=False)
#get list of all cofactors for each protein
entries_data_frame=obj_data_loader.add_cofactors(entries_data_frame,force_single_entry=False)
#get list of all PTM's for each protein
entries_data_frame=obj_data_loader.add_ptm(entries_data_frame,force_single_entry=False)
#get list of all diseases for each protein
entries_data_frame=obj_data_loader.add_diseases(entries_data_frame,force_single_entry=False)

[31]:

#get substrates for the proteins
entries_data_frame=obj_data_loader.add_ligand(entries_data_frame,force_single_entry=False)

Save the progress to a csv file

As we have added a lot of information to our dataframe, let’s save our progress so that we can restart from this point, in future, if required.

[32]:

entries_data_frame.to_csv("output/DB_Query_Results_With_Details.csv",index=False)

Some Statistics

Let’s look at some statistics for all the columns in the dataframe. Below are some examples of how you can do some fun things with the dataframe.

[33]:

#include=all will also include columns with 'nan' entries
entries_data_frame.describe(include="all")

[33]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate	matrix	enzyme_commission	cath_class	pdb_id	cofactor	ptm	disease_name	disease_id	ligand
count	3	3.000000	3.0	3	3	3.000000	3.000000	3.000000	3.0	3.0	3.0	0	3	3	3	0	0	3	3	3
unique	3	NaN	NaN	3	1	NaN	NaN	NaN	NaN	NaN	NaN	0	1	1	3	0	0	3	3	1
top	P01111	NaN	NaN	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	NaN	NaN	NaN	NaN	NaN	NaN	NaN	[3.6.5.2]	[3.40.50.300]	[2N9C, 3CON, 5UHV, 6E6H, 6MPP, 6ULI, 6ULK, 6UL...	NaN	NaN	[Leukemia, juvenile myelomonocytic, Noonan syn...	[DI-01851, DI-02558, DI-03381, DI-04099, DI-04...	[GTP-binding, Nucleotide-binding]
freq	1	NaN	NaN	1	3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	3	3	1	NaN	NaN	1	1	3
mean	NaN	12.761333	189.0	NaN	NaN	0.347443	0.026455	0.305115	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
std	NaN	0.117815	0.0	NaN	NaN	0.029141	0.009164	0.069054	0.0	0.0	0.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
min	NaN	12.626000	189.0	NaN	NaN	0.317460	0.015873	0.227513	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
25%	NaN	12.721500	189.0	NaN	NaN	0.333333	0.023810	0.277778	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
50%	NaN	12.817000	189.0	NaN	NaN	0.349206	0.031746	0.328042	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75%	NaN	12.829000	189.0	NaN	NaN	0.362434	0.031746	0.343915	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
max	NaN	12.841000	189.0	NaN	NaN	0.375661	0.031746	0.359788	1.0	1.0	1.0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

Find unique ligands and count them:

[34]:

entries_data_frame['ligand'].value_counts()

[34]:

ligand
[GTP-binding, Nucleotide-binding]    3
Name: count, dtype: int64

[35]:

# Filter the search results where we have a ligand interacting with the protein
df_results_with_ligand=entries_data_frame[entries_data_frame['ligand'].notna()]

Count columns without ‘nan’ entries

[36]:

#Number of proteins with pdb data
df_results_with_ligand["pdb_id"].notna().sum()

[36]:

Coevolution matrix analysis

Load the matrix

Lets get the coevolution matrix for the filtered dataframe df_results_with_ligand using the load_matrices() from obj_pycom

[37]:

df_results_with_ligand=obj_pycom.load_matrices(df_results_with_ligand)

[38]:

df_results_with_ligand

[38]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate	matrix	enzyme_commission	cath_class	pdb_id	cofactor	ptm	disease_name	disease_id	ligand
0	P01111	12.817	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.349206	0.015873	0.227513	1	1	1	[[0.0, 0.5163763761520386, 0.3219393491744995,...	[3.6.5.2]	[3.40.50.300]	[2N9C, 3CON, 5UHV, 6E6H, 6MPP, 6ULI, 6ULK, 6UL...	NaN	NaN	[Leukemia, juvenile myelomonocytic, Noonan syn...	[DI-01851, DI-02558, DI-03381, DI-04099, DI-04...	[GTP-binding, Nucleotide-binding]
1	P01112	12.841	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.317460	0.031746	0.359788	1	1	1	[[0.0, 0.5560339689254761, 0.34521734714508057...	[3.6.5.2]	[3.40.50.300]	[121P, 1AA9, 1AGP, 1BKD, 1CLU, 1CRP, 1CRQ, 1CR...	NaN	NaN	[Costello syndrome, Congenital myopathy with e...	[DI-01437, DI-01411, DI-04532, DI-02612, DI-03...	[GTP-binding, Nucleotide-binding]
2	P01116	12.626	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.375661	0.031746	0.328042	1	1	1	[[0.0, 0.38467222452163696, 0.3104382753372192...	[3.6.5.2]	[3.40.50.300]	[1D8D, 1D8E, 1KZO, 1KZP, 1N4P, 1N4Q, 1N4R, 1N4...	NaN	NaN	[Leukemia, acute myelogenous, Leukemia, juveni...	[DI-01171, DI-01851, DI-02073, DI-02971, DI-03...	[GTP-binding, Nucleotide-binding]

Normalise/Scale the matrix

Scaled matrix (:math:`S_{i}`): Coevolution Matrices (\(C_{i}\)) have to be scaled by average \(\langle{C_{i}}\rangle\), all values < \(\langle{C_{i}}\rangle\) are set to 0.

Normalised matrix (:math:`N_{i}`): For comparing scaled coevolution scores across multiple proteins, we can normalise the values of all matrices \({S_{i}...S_{n}}\), by dividing them by the \(\max({S_{i}...S_{n}})\).

These operations can be performed by using object from CoMAnalysis class.

[39]:

#initialise CoMAnalysis class object
obj_com_analysis=CoMAnalysis()

[40]:

df_results_with_ligand_matrix=obj_com_analysis.scale_and_normalise_coevolution_matrices(df_results_with_ligand)

[41]:

df_results_with_ligand_matrix.head()

[41]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	...	enzyme_commission	cath_class	pdb_id	cofactor	ptm	disease_name	disease_id	ligand	matrix_S	matrix_N
0	P01111	12.817	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.349206	0.015873	0.227513	1	1	...	[3.6.5.2]	[3.40.50.300]	[2N9C, 3CON, 5UHV, 6E6H, 6MPP, 6ULI, 6ULK, 6UL...	NaN	NaN	[Leukemia, juvenile myelomonocytic, Noonan syn...	[DI-01851, DI-02558, DI-03381, DI-04099, DI-04...	[GTP-binding, Nucleotide-binding]	[[0.0, 2.4990853333930714, 1.5580765172867141,...	[[0.0, 0.18836677642207234, 0.1174389073708615...
1	P01112	12.841	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.317460	0.031746	0.359788	1	1	...	[3.6.5.2]	[3.40.50.300]	[121P, 1AA9, 1AGP, 1BKD, 1CLU, 1CRP, 1CRQ, 1CR...	NaN	NaN	[Costello syndrome, Congenital myopathy with e...	[DI-01437, DI-01411, DI-04532, DI-02612, DI-03...	[GTP-binding, Nucleotide-binding]	[[0.0, 2.4841366766214805, 1.5422925960913767,...	[[0.0, 0.18724003206873668, 0.1162492055567066...
2	P01116	12.626	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.375661	0.031746	0.328042	1	1	...	[3.6.5.2]	[3.40.50.300]	[1D8D, 1D8E, 1KZO, 1KZP, 1N4P, 1N4Q, 1N4R, 1N4...	NaN	NaN	[Leukemia, acute myelogenous, Leukemia, juveni...	[DI-01171, DI-01851, DI-02073, DI-02971, DI-03...	[GTP-binding, Nucleotide-binding]	[[0.0, 1.8285180440964264, 1.4756510916226846,...	[[0.0, 0.1378232447662731, 0.1112261496390277,...

3 rows × 22 columns

matrix_S column contains the \(S_{i}\) matrix and the matrix_N column contains \(N_{i}\)

Plot the matrix and save it

Lets plot the matrix for the first (index is 0) protein in the dataframe.

[42]:

plt.imshow(df_results_with_ligand_matrix.loc[0,'matrix_S'],cmap='Blues')
plt.colorbar()
file_name="output/%s_Scaled_Matrix.png"%(df_results_with_ligand_matrix.loc[0,'uniprot_id'])
plt.savefig(file_name,dpi=300)

../_images/tutorials_01_Workflow_56_0.png

[43]:

plt.imshow(df_results_with_ligand_matrix.loc[0,'matrix_N'],cmap='Blues')
plt.colorbar()
file_name="output/%s_Normalised_Matrix.png"%(df_results_with_ligand_matrix.loc[0,'uniprot_id'])
plt.savefig(file_name,dpi=300)

../_images/tutorials_01_Workflow_57_0.png

Save the list of top coevolution pairs

Residues with scores close to 1 have the strongest evolutionary relationship, and close to 0 may not be covarying during evolution.

The top scoring residues are sorted in descending order and saved to an ASCII file for further interpretation.

[44]:

obj_com_analysis.save_top_scoring_residue_pairs(df_results_with_ligand_matrix,data_folder="output",matrix_type="matrix_N",res_gap=5,percentile=95)

Saved file : output/P01111_Pairs.txt
Saved file : output/P01112_Pairs.txt
Saved file : output/P01116_Pairs.txt

Help on UniProt Controlled Vocabulary

For each biological feature category, UniProtKB/Swiss-Prot has a curated list of keywords. To search using those keywords or ID’s and help you find them we have some helper functions that will help you find them:

Function get_cofactor_list() from pycom will get you list of cofactors. You can either use the cofactorId or the cofactorName.

[45]:

# list of cofactors
cofactors = obj_pycom.get_cofactor_list()
cofactors

[45]:

	cofactorId	cofactorName
0	CHEBI:597326	pyridoxal 5'-phosphate
1	CHEBI:18420	Mg(2+)
2	CHEBI:60240	a divalent metal cation
3	CHEBI:30413	heme
4	CHEBI:29105	Zn(2+)
...	...	...
109	CHEBI:61721	chlorophyll b
110	CHEBI:73095	divinyl chlorophyll a
111	CHEBI:73096	divinyl chlorophyll b
112	CHEBI:57453	(6S)-5,6,7,8-tetrahydrofolate
113	CHEBI:30402	tungstopterin

114 rows × 2 columns

Function get_disease_list() from pycom will get you list of diseases. You can either use the diseaseId or the diseaseName.

[46]:

# list of diseases
diseases = obj_pycom.get_disease_list()
diseases

[46]:

	diseaseId	diseaseName
0	DI-04420	Intellectual developmental disorder, autosomal...
1	DI-00085	Alzheimer disease 1
2	DI-00097	Cerebral amyloid angiopathy, APP-related
3	DI-00262	Chanarin-Dorfman syndrome
4	DI-01042	Spastic paraplegia 42, autosomal dominant
...	...	...
6039	DI-05800	Wieacker-Wolff syndrome, female-restricted
6040	DI-01041	Spastic paraplegia 33, autosomal dominant
6041	DI-05703	Neurodevelopmental disorder with dysmorphic fa...
6042	DI-06050	Intellectual developmental disorder, autosomal...
6043	DI-04662	Paget disease of bone 6

6044 rows × 2 columns

Function get_organism_list() from pycom will get you list of diseases. You can either use the organismId or the nameScientific or nameCommon or any categories in the taxonomy.

[47]:

# list of organisms
organisms = obj_pycom.get_organism_list()
organisms

[47]:

	organismId	nameScientific	nameCommon	taxonomy
0	561445	African swine fever virus (isolate Pig/Kenya/K...	ASFV	:Viruses:Varidnaviria:Bamfordvirae:Nucleocytov...
1	10500	African swine fever virus (isolate Tick/Malawi...	ASFV	:Viruses:Varidnaviria:Bamfordvirae:Nucleocytov...
2	561443	African swine fever virus (isolate Tick/South ...	ASFV	:Viruses:Varidnaviria:Bamfordvirae:Nucleocytov...
3	561444	African swine fever virus (isolate Warthog/Nam...	ASFV	:Viruses:Varidnaviria:Bamfordvirae:Nucleocytov...
4	10498	African swine fever virus (strain Badajoz 1971...	Ba71V	:Viruses:Varidnaviria:Bamfordvirae:Nucleocytov...
...	...	...	...	...
14316	31581	Rotavirus A (isolate RVA/Pig/Australia/TFR-41/...	RV-A	:Viruses:Riboviria:Orthornavirae:Duplornaviric...
14317	31579	Rotavirus A (isolate RVA/Pig/Australia/BEN144/...	RV-A	:Viruses:Riboviria:Orthornavirae:Duplornaviric...
14318	10918	Rotavirus A (strain RVA/Pig/Russia/K/1987)	RV-A	:Viruses:Riboviria:Orthornavirae:Duplornaviric...
14319	31580	Rotavirus A (isolate RVA/Pig/Australia/BMI-1/1...	RV-A	:Viruses:Riboviria:Orthornavirae:Duplornaviric...
14320	47664	Populus tremula x Populus tremuloides	Hybrid aspen	:Eukaryota:Viridiplantae:Streptophyta:Embryoph...

14321 rows × 4 columns

[48]:

#full list of helper functions to get searchable terms on other biological categories
obj_pycom.get_biological_process_list()
obj_pycom.get_cellular_component_list()
obj_pycom.get_developmental_stage_list()
obj_pycom.get_domain_list()
obj_pycom.get_ligand_list()
obj_pycom.get_ptm_list()
obj_pycom.get_molecular_function_list()

[48]:

	name
0	Acetylcholine receptor inhibiting toxin
1	Actin-binding
2	Activator
3	Acyltransferase
4	Allosteric enzyme
...	...
191	Viral short tail ejection system
192	Viral exotoxin
193	Chloride channel impairing toxin
194	Proton-gated sodium channel impairing toxin
195	Translocase

196 rows × 1 columns