Tutorial 2: Use cases

Here are some more examples on how you can query the PyCoMdb using either a keyword or combination of keywords, followed by one example of how to interpret coevolution matrix data.

Query by:

UniProtID
Sequence
EC/CATH
Disease
Multiple Keywords
Interpreting coevolution matrix

Import all the essential PyCoM classes…

[1]:

from pycom import PyCom, ProteinParams,CoMAnalysis
import matplotlib.pyplot as plt

Setup the paths to the local database files if you are using PyCoM (local) i.e. with PyCoMdb on your computer.

[ ]:

# Set the paths to the Database files
# obj_pycom = PyCom(db_path='~/pycom.db', mat_path='/Volumes/SuperSpeedBoi/pycom.mat')

or you can use PyCoM with the remote version of PyCoMdb, uncomment or comment the code as per the right scenario.

[2]:

obj_pycom = PyCom(remote=True)

UniProtID

Using the UniProtID

[3]:

query_parameters={ProteinParams.ID:"Q65209"}
# use for remote mode
entries_data_frame=obj_pycom.find(query_parameters,page=1)
# entries_data_frame=obj_pycom.find(query_parameters)
entries_data_frame

[3]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate
0	Q65209	6.062	141	MGNKESKYLEMCSEEAWLNIPNIFKCIFIRKLFYNKWLKYQEKKLK...	10498	0.0	0.0	0.0	0	0	0

Sequence

Using the complete protein sequence, using partial protein sequence will not yield any or incorrect result.

[4]:

entries_data_frame=obj_pycom.find({
   ProteinParams.SEQUENCE:"MGNKESKYLEMCSEEAWLNIPNIFKCIFIRKLFYNKWLKYQEKKLK"
}, page=1)

entries_data_frame

[4]:

EC and CATH

Using a combination of Enzyme Commission number and CATH ID.

[5]:

entries_data_frame=obj_pycom.find({
   ProteinParams.CATH:"1.*",
   ProteinParams.ENZYME:"3.2.*"
}, page=1)

entries_data_frame

[5]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_pdb	has_substrate
0	A0A0S2UQQ5	8.393	380	MNITGKGAYDTGTYANLFQRSGYREDEIKARLEQTWNDLFYGDEHT...	198119	0.505263	0.052632	0.131579	1	1
1	A0A168WVR6	8.987	382	MDESNLLQGISMIDLRSPDAILSDYAKRYAHLLPELSPQLQQRMAY...	1081940	0.000000	0.000000	0.000000	0	1
2	A0R567	8.089	293	MSISPVELLSWYDHARRDLPWRRPGVSAWQILVSEFMLQQTPVSRV...	246196	0.000000	0.000000	0.000000	0	1
3	A1A048	8.357	379	MTNATDTNKTLGESMFAQCGYAQDAIDKRVSQVWHEIFEGPNKFYW...	367928	0.000000	0.000000	0.000000	0	1
4	A1D1W1	7.956	493	MHLPSLSVALALVSSSLALPQTVLPESDVSSHAAAVKEAFSHAWDG...	331117	0.000000	0.000000	0.000000	0	1
5	A3DBX3	10.164	334	MKNRVISLLMASLLLVLSVIVAPFYKAEAATVVNTPFVAVFSNFDS...	203119	0.038922	0.008982	0.365269	1	1
6	A3DC29	8.316	477	MKNVKKRVGVVLLILAVLGVYMLAMPANTVSAAGVPFNTKYPYGPT...	203119	0.379455	0.031447	0.058700	1	1
7	A7M7B9	8.207	281	MCAAAPRGGGRAARRLGAATAGSRVPSAAPRYSRRTRRVPIAYEAE...	9031	0.000000	0.000000	0.000000	0	1
8	A8GG79	9.140	366	MIDLREDTWTLQLYAQRYKGLSPKNSRELQLRMEYDPLKPNLPTSG...	399741	0.571038	0.035519	0.027322	1	1
9	B0KTG8	9.132	371	MIDLRSPNALLSDYVERYAHLSPEPSRQLQQRMDYNVRADAPAEPA...	76869	0.000000	0.000000	0.000000	0	1

We have found 153 proteins that match our search criteria i.e., proteins from CATH class 1.* and EC class 3.2.*.

Disease

Getting proteins linked to a particular disease.

[6]:

entries_data_frame = obj_pycom.find({
   ProteinParams.DISEASE:"Cancer"
}, page=1)

entries_data_frame

[6]:

	uniprot_id	neff	sequence_length	sequence	organism_id	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate
0	O00358	5.065	373	MTAESGPPPPQPEVLATVKEERGETAAGAGVPGEATGRGAGGRRRK...	9606	0.000000	0.000000	0.000000	1	0	0
1	O15105	6.671	426	MFRTKRSALVRRLWRSRAPGGEDEEEGAGGGGGGGELRGEGATDSR...	9606	0.000000	0.000000	0.014085	1	1	0
2	O43502	9.997	376	MRGKTFRFEMQRDLVSFPLSPAVRVKLVSAGFQTAEELLEVKPSEL...	9606	0.000000	0.000000	0.000000	0	0	0
3	O43542	10.237	346	MDLDLLDLNPRIIAAIKKAKLKSVKEVLHFSGPDLKRLTNLSSPEV...	9606	0.000000	0.000000	0.000000	0	0	0
4	O75771	10.597	328	MGVLRVGLCPGLTEEMIQLLRSHRIKTVVDLVSADLEEVAQKCGLS...	9606	0.109756	0.000000	0.000000	0	1	0
5	P01111	12.817	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.349206	0.015873	0.227513	1	1	1
6	P01112	12.841	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.317460	0.031746	0.359788	1	1	1
7	P01116	12.626	189	MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...	9606	0.375661	0.031746	0.328042	1	1	1
8	P04637	5.749	393	MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLS...	9606	0.206107	0.066158	0.262087	1	1	0
9	P08118	7.621	114	MNVLLGSVVIFATFVTLCNASCYFIPNEGVPGDSTRKCMDLKGNKH...	9606	0.026316	0.026316	0.421053	0	1	0

That’s 29 proteins with known association to cancer.

Multiple Keywords

[7]:

entries_data_frame = obj_pycom.find({
   ProteinParams.CATH:"1.*",
   ProteinParams.DISEASE:"Cancer"
}, page=1)

entries_data_frame

[7]:

	uniprot_id	neff	sequence_length	sequence	organism_id	has_ptm
0	O00358	5.065	373	MTAESGPPPPQPEVLATVKEERGETAAGAGVPGEATGRGAGGRRRK...	9606	1
1	P10914	5.137	325	MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKHGW...	9606	1
2	Q96BI1	11.891	424	MQGARAPRDQGRSPGRMSALGRSSVILLTYVLAATELTCLFMQFSI...	9606	0

We found 3 proteins from CATH class 1.* i.e., mainly alpha helical proteins. For help on CATH Heirarchy.

Interpreting coevolution matrix

Now let’s get the coevolution matrices for these 3 proteins and understand how to interpret the data using one of the proteins O00358.

[8]:

# Rerun the last query, but this time load the coevolution matrices (by setting matrix=True)

entries_data_frame = obj_pycom.find({
   ProteinParams.CATH:"1.*",
   ProteinParams.DISEASE:"Cancer"
}, page=1, matrix=True)

entries_data_frame.iloc[0]

[8]:

uniprot_id                                                    O00358
neff                                                           5.065
sequence_length                                                  373
sequence           MTAESGPPPPQPEVLATVKEERGETAAGAGVPGEATGRGAGGRRRK...
organism_id                                                     9606
helix_frac                                                       0.0
turn_frac                                                        0.0
strand_frac                                                      0.0
has_ptm                                                            1
has_pdb                                                            0
has_substrate                                                      0
matrix             [[0.0, 0.05772481486201286, 0.0409208163619041...
Name: 0, dtype: object

Coevolution matrix (\(C_{i}\)) for each protein has coevolution scores \(c_{ij}\) for all residue pairs i & j.

Residue pairs with coevolution score \(c_{ij} \geq \langle C_{i}\rangle\) are considered significant, therefore \(C_{i}\) is scaled by average \(\langle C_{i}\rangle\) as shown below:

\(S_{i} = \frac{C_{i}}{\langle C_{i}\rangle}\)

Note: All \(c_{ij} \leq \langle C_{i}\rangle\) are set to 0

To compare coevolution scores between two or more proteins, \(S_{i}\) are normalised (\(N_{i}\)) as shown below:

\(N_{i} = \frac{S_{i}}{max\langle {S_{i}...S_{N}} \rangle}\)

[9]:

# scale and normalise the coevolution matrices and add them to the dataframe
obj_com_analysis = CoMAnalysis()
entries_data_frame = obj_com_analysis.scale_and_normalise_coevolution_matrices(entries_data_frame)

# We have 3 proteins, so we will have 3 matrices
entries_data_frame['uniprot_id']

[9]:

0    O00358
1    P10914
2    Q96BI1
Name: uniprot_id, dtype: object

For the first protein, lets get the top scoring residues pairs from the 90th percentile:

[10]:

#for matrix in entries_data_frame['matrix_S']:
df_top_scoring_residues = obj_com_analysis.get_top_scoring_residues(entries_data_frame['matrix_S'][0],percentile=90)
df_top_scoring_residues

[10]:

	ResA	ResB	coevolution_score
0	81	134	3.114167
1	333	360	3.075198
2	1	8	2.830141
3	320	367	2.802745
4	232	345	2.790903
...	...	...	...
6069	13	27	1.406554
6070	196	281	1.406419
6071	70	172	1.406395
6072	194	249	1.406223
6073	185	224	1.406221

6074 rows × 3 columns

Residues 81 and 134 have the strongest coevolution signal/score of 3.11, i.e., there is a very high chance that a mutation in one has a resulted in a change in the other residue. For example studies please refer to: 1. Bai et al. 2. Hopf et al. 3. Akere et al.

It is also very interesting to check which residue is the most coevolving, i.e., involved in most number of strongly coevolving pairs and which residue is the least coevolving, i.e., is involved in the least number of pairs.

[11]:

df_residue_frequencies=obj_com_analysis.get_residue_frequencies(df_top_scoring_residues)

df_residue_frequencies

[11]:

	residueID	count
0	81	39
1	333	46
2	1	17
3	320	35
4	232	34
...	...	...
364	365	29
365	355	8
366	353	12
367	354	8
368	131	14

369 rows × 2 columns

[12]:

# saving the residue frequencies to a csv file
fname = "output/csv/" + entries_data_frame['uniprot_id'][0] + "_res_freqs.csv"
df_residue_frequencies.to_csv(fname,index=False)

[13]:

#saving the top scoring residue pairs to a csv file
obj_com_analysis.save_top_scoring_residue_pairs(entries_data_frame,matrix_type='matrix_S',data_folder="output/csv")

Let us compare if the range of the coevolution scores for the top 10% of the residue pairs is different or similar for the three proteins.

[14]:

list_top_residue_stats = []
for uniprotid,matrix in zip(entries_data_frame['uniprot_id'],entries_data_frame['matrix_S']):
    df_residue_pairs_scores = obj_com_analysis.get_top_scoring_residues(matrix,percentile=90)
    list_top_residue_stats.append(df_residue_pairs_scores['coevolution_score'])

[15]:

my_color="#6495ED"
ticks_font=12
labels_font=14

[16]:

plt.figure(figsize=(4,3))
plt.boxplot(list_top_residue_stats)
plt.xticks(range(1, len(list_top_residue_stats) + 1), entries_data_frame['uniprot_id'])
plt.xlabel("UniProt ID",fontsize=labels_font)
plt.ylabel("Scaled Coevolution Score ",fontsize=labels_font-2)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("output/png/04_box_plot.png",dpi=300,transparent=True)

../_images/tutorials_02_UseCases_33_0.png

[17]:

plt.figure(figsize=(4,3))
plt.bar(
   range(0,len(list_top_residue_stats),1),
   height=[len(pair_list) for pair_list in list_top_residue_stats],
   color=my_color,
   tick_label=entries_data_frame['uniprot_id']
)

plt.xlabel("UniProt ID",fontsize=labels_font)
plt.ylabel("Residue pairs (#)",fontsize=labels_font)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("output/png/04_bar_plot.png",dpi=300,transparent=True)

../_images/tutorials_02_UseCases_34_0.png

Based on the box plot the range of scaled coevolution scores is different between the three proteins and the bar plot, the number of residue pairs is also different.