Tutorial 2: Use cases

Here are some more examples on how you can query the PyCoMdb using either a keyword or combination of keywords, followed by one example of how to interpret coevolution matrix data.

Query by:

  1. UniProtID

  2. Sequence

  3. EC/CATH

  4. Disease

  5. Multiple Keywords

  6. Interpreting coevolution matrix

Import all the essential PyCoM classes…

[1]:
from pycom import PyCom, ProteinParams,CoMAnalysis
import matplotlib.pyplot as plt

Setup the paths to the local database files if you are using PyCoM (local) i.e. with PyCoMdb on your computer.

[ ]:
# Set the paths to the Database files
# obj_pycom = PyCom(db_path='~/pycom.db', mat_path='/Volumes/SuperSpeedBoi/pycom.mat')

or you can use PyCoM with the remote version of PyCoMdb, uncomment or comment the code as per the right scenario.

[2]:
obj_pycom = PyCom(remote=True)

UniProtID

Using the UniProtID

[3]:
query_parameters={ProteinParams.ID:"Q65209"}
# use for remote mode
entries_data_frame=obj_pycom.find(query_parameters,page=1)
# entries_data_frame=obj_pycom.find(query_parameters)
entries_data_frame
[3]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate
0 Q65209 6.062 141 MGNKESKYLEMCSEEAWLNIPNIFKCIFIRKLFYNKWLKYQEKKLK... 10498 0.0 0.0 0.0 0 0 0

Sequence

Using the complete protein sequence, using partial protein sequence will not yield any or incorrect result.

[4]:
entries_data_frame=obj_pycom.find({
   ProteinParams.SEQUENCE:"MGNKESKYLEMCSEEAWLNIPNIFKCIFIRKLFYNKWLKYQEKKLK"
}, page=1)

entries_data_frame
[4]:

EC and CATH

Using a combination of Enzyme Commission number and CATH ID.

[5]:
entries_data_frame=obj_pycom.find({
   ProteinParams.CATH:"1.*",
   ProteinParams.ENZYME:"3.2.*"
}, page=1)

entries_data_frame
[5]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate
0 A0A0S2UQQ5 8.393 380 MNITGKGAYDTGTYANLFQRSGYREDEIKARLEQTWNDLFYGDEHT... 198119 0.505263 0.052632 0.131579 0 1 1
1 A0A168WVR6 8.987 382 MDESNLLQGISMIDLRSPDAILSDYAKRYAHLLPELSPQLQQRMAY... 1081940 0.000000 0.000000 0.000000 0 0 1
2 A0R567 8.089 293 MSISPVELLSWYDHARRDLPWRRPGVSAWQILVSEFMLQQTPVSRV... 246196 0.000000 0.000000 0.000000 0 0 1
3 A1A048 8.357 379 MTNATDTNKTLGESMFAQCGYAQDAIDKRVSQVWHEIFEGPNKFYW... 367928 0.000000 0.000000 0.000000 0 0 1
4 A1D1W1 7.956 493 MHLPSLSVALALVSSSLALPQTVLPESDVSSHAAAVKEAFSHAWDG... 331117 0.000000 0.000000 0.000000 0 0 1
5 A3DBX3 10.164 334 MKNRVISLLMASLLLVLSVIVAPFYKAEAATVVNTPFVAVFSNFDS... 203119 0.038922 0.008982 0.365269 0 1 1
6 A3DC29 8.316 477 MKNVKKRVGVVLLILAVLGVYMLAMPANTVSAAGVPFNTKYPYGPT... 203119 0.379455 0.031447 0.058700 0 1 1
7 A7M7B9 8.207 281 MCAAAPRGGGRAARRLGAATAGSRVPSAAPRYSRRTRRVPIAYEAE... 9031 0.000000 0.000000 0.000000 0 0 1
8 A8GG79 9.140 366 MIDLREDTWTLQLYAQRYKGLSPKNSRELQLRMEYDPLKPNLPTSG... 399741 0.571038 0.035519 0.027322 0 1 1
9 B0KTG8 9.132 371 MIDLRSPNALLSDYVERYAHLSPEPSRQLQQRMDYNVRADAPAEPA... 76869 0.000000 0.000000 0.000000 0 0 1

We have found 153 proteins that match our search criteria i.e., proteins from CATH class 1.* and EC class 3.2.*.

Disease

Getting proteins linked to a particular disease.

[6]:
entries_data_frame = obj_pycom.find({
   ProteinParams.DISEASE:"Cancer"
}, page=1)

entries_data_frame
[6]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate
0 O00358 5.065 373 MTAESGPPPPQPEVLATVKEERGETAAGAGVPGEATGRGAGGRRRK... 9606 0.000000 0.000000 0.000000 1 0 0
1 O15105 6.671 426 MFRTKRSALVRRLWRSRAPGGEDEEEGAGGGGGGGELRGEGATDSR... 9606 0.000000 0.000000 0.014085 1 1 0
2 O43502 9.997 376 MRGKTFRFEMQRDLVSFPLSPAVRVKLVSAGFQTAEELLEVKPSEL... 9606 0.000000 0.000000 0.000000 0 0 0
3 O43542 10.237 346 MDLDLLDLNPRIIAAIKKAKLKSVKEVLHFSGPDLKRLTNLSSPEV... 9606 0.000000 0.000000 0.000000 0 0 0
4 O75771 10.597 328 MGVLRVGLCPGLTEEMIQLLRSHRIKTVVDLVSADLEEVAQKCGLS... 9606 0.109756 0.000000 0.000000 0 1 0
5 P01111 12.817 189 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.349206 0.015873 0.227513 1 1 1
6 P01112 12.841 189 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.317460 0.031746 0.359788 1 1 1
7 P01116 12.626 189 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.375661 0.031746 0.328042 1 1 1
8 P04637 5.749 393 MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLS... 9606 0.206107 0.066158 0.262087 1 1 0
9 P08118 7.621 114 MNVLLGSVVIFATFVTLCNASCYFIPNEGVPGDSTRKCMDLKGNKH... 9606 0.026316 0.026316 0.421053 0 1 0

That’s 29 proteins with known association to cancer.

Multiple Keywords

[7]:
entries_data_frame = obj_pycom.find({
   ProteinParams.CATH:"1.*",
   ProteinParams.DISEASE:"Cancer"
}, page=1)

entries_data_frame
[7]:
uniprot_id neff sequence_length sequence organism_id helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate
0 O00358 5.065 373 MTAESGPPPPQPEVLATVKEERGETAAGAGVPGEATGRGAGGRRRK... 9606 0.0 0.0 0.0 1 0 0
1 P10914 5.137 325 MPITRMRMRPWLEMQINSNQIPGLIWINKEEMIFQIPWKHAAKHGW... 9606 0.0 0.0 0.0 1 0 0
2 Q96BI1 11.891 424 MQGARAPRDQGRSPGRMSALGRSSVILLTYVLAATELTCLFMQFSI... 9606 0.0 0.0 0.0 0 0 0

We found 3 proteins from CATH class 1.* i.e., mainly alpha helical proteins. For help on CATH Heirarchy.

Interpreting coevolution matrix

Now let’s get the coevolution matrices for these 3 proteins and understand how to interpret the data using one of the proteins O00358.

[8]:
# Rerun the last query, but this time load the coevolution matrices (by setting matrix=True)

entries_data_frame = obj_pycom.find({
   ProteinParams.CATH:"1.*",
   ProteinParams.DISEASE:"Cancer"
}, page=1, matrix=True)

entries_data_frame.iloc[0]
[8]:
uniprot_id                                                    O00358
neff                                                           5.065
sequence_length                                                  373
sequence           MTAESGPPPPQPEVLATVKEERGETAAGAGVPGEATGRGAGGRRRK...
organism_id                                                     9606
helix_frac                                                       0.0
turn_frac                                                        0.0
strand_frac                                                      0.0
has_ptm                                                            1
has_pdb                                                            0
has_substrate                                                      0
matrix             [[0.0, 0.05772481486201286, 0.0409208163619041...
Name: 0, dtype: object

Coevolution matrix (\(C_{i}\)) for each protein has coevolution scores \(c_{ij}\) for all residue pairs i & j.

Residue pairs with coevolution score \(c_{ij} \geq \langle C_{i}\rangle\) are considered significant, therefore \(C_{i}\) is scaled by average \(\langle C_{i}\rangle\) as shown below:

\(S_{i} = \frac{C_{i}}{\langle C_{i}\rangle}\)

Note: All \(c_{ij} \leq \langle C_{i}\rangle\) are set to 0

To compare coevolution scores between two or more proteins, \(S_{i}\) are normalised (\(N_{i}\)) as shown below:

\(N_{i} = \frac{S_{i}}{max\langle {S_{i}...S_{N}} \rangle}\)

[9]:
# scale and normalise the coevolution matrices and add them to the dataframe
obj_com_analysis = CoMAnalysis()
entries_data_frame = obj_com_analysis.scale_and_normalise_coevolution_matrices(entries_data_frame)

# We have 3 proteins, so we will have 3 matrices
entries_data_frame['uniprot_id']
[9]:
0    O00358
1    P10914
2    Q96BI1
Name: uniprot_id, dtype: object

For the first protein, lets get the top scoring residues pairs from the 90th percentile:

[10]:
#for matrix in entries_data_frame['matrix_S']:
df_top_scoring_residues = obj_com_analysis.get_top_scoring_residues(entries_data_frame['matrix_S'][0],percentile=90)
df_top_scoring_residues
[10]:
ResA ResB coevolution_score
0 81 134 3.114167
1 333 360 3.075198
2 1 8 2.830141
3 320 367 2.802745
4 232 345 2.790903
... ... ... ...
6069 13 27 1.406554
6070 196 281 1.406419
6071 70 172 1.406395
6072 194 249 1.406223
6073 185 224 1.406221

6074 rows × 3 columns

Residues 81 and 134 have the strongest coevolution signal/score of 3.11, i.e., there is a very high chance that a mutation in one has a resulted in a change in the other residue. For example studies please refer to: 1. Bai et al. 2. Hopf et al. 3. Akere et al.

It is also very interesting to check which residue is the most coevolving, i.e., involved in most number of strongly coevolving pairs and which residue is the least coevolving, i.e., is involved in the least number of pairs.

[11]:
df_residue_frequencies=obj_com_analysis.get_residue_frequencies(df_top_scoring_residues)

df_residue_frequencies
[11]:
residueID count
0 81 39
1 333 46
2 1 17
3 320 35
4 232 34
... ... ...
364 365 29
365 355 8
366 353 12
367 354 8
368 131 14

369 rows × 2 columns

[12]:
# saving the residue frequencies to a csv file
fname = "output/csv/" + entries_data_frame['uniprot_id'][0] + "_res_freqs.csv"
df_residue_frequencies.to_csv(fname,index=False)
[13]:
#saving the top scoring residue pairs to a csv file
obj_com_analysis.save_top_scoring_residue_pairs(entries_data_frame,matrix_type='matrix_S',data_folder="output/csv")

Let us compare if the range of the coevolution scores for the top 10% of the residue pairs is different or similar for the three proteins.

[14]:
list_top_residue_stats = []
for uniprotid,matrix in zip(entries_data_frame['uniprot_id'],entries_data_frame['matrix_S']):
    df_residue_pairs_scores = obj_com_analysis.get_top_scoring_residues(matrix,percentile=90)
    list_top_residue_stats.append(df_residue_pairs_scores['coevolution_score'])
[15]:
my_color="#6495ED"
ticks_font=12
labels_font=14
[16]:
plt.figure(figsize=(4,3))
plt.boxplot(list_top_residue_stats)
plt.xticks(range(1, len(list_top_residue_stats) + 1), entries_data_frame['uniprot_id'])
plt.xlabel("UniProt ID",fontsize=labels_font)
plt.ylabel("Scaled Coevolution Score ",fontsize=labels_font-2)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("output/png/04_box_plot.png",dpi=300,transparent=True)
../_images/tutorials_02_UseCases_33_0.png
[17]:
plt.figure(figsize=(4,3))
plt.bar(
   range(0,len(list_top_residue_stats),1),
   height=[len(pair_list) for pair_list in list_top_residue_stats],
   color=my_color,
   tick_label=entries_data_frame['uniprot_id']
)

plt.xlabel("UniProt ID",fontsize=labels_font)
plt.ylabel("Residue pairs (#)",fontsize=labels_font)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("output/png/04_bar_plot.png",dpi=300,transparent=True)
../_images/tutorials_02_UseCases_34_0.png

Based on the box plot the range of scaled coevolution scores is different between the three proteins and the bar plot, the number of residue pairs is also different.