Tutorial 4: Generate images from the paper
[1]:
# import the useful classes from PyCoM
from pycom import PyCom, ProteinParams, CoMAnalysis
import pandas as pd
import numpy as np
# matplotlib
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['font.family'] = "sans-serif"
matplotlib.rcParams['font.sans-serif'] = "Arial"
[5]:
#set the path to the database
database_folder_path="/Volumes/mason/Work/Sarath/Research/pycom/"
file_matrix_db = database_folder_path+"pycom.mat"
file_protein_db= database_folder_path+"pycom.db"
my_color="#6495ED"
[6]:
obj_pycom = PyCom(db_path=file_protein_db, mat_path=file_matrix_db)
[2]:
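# Alternative to the local database above: connect to the remote PyCom API (run only one of the two cells)
obj_pycom = PyCom(remote=True)
Construct your query (it's empty because we want all information)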
[7]:
# An empty query applies no constraints, so find() returns every protein in the database.
query_parameters={}
# Executing the query (next cell) returns a pandas dataframe with information about all the proteins which match the query.
Run the query and summarise the resulting dataframe:
[8]:
entries_data_frame=obj_pycom.find(query_parameters)
/Users/sdantu/Work/pyc_wspace/pycom/pycom/pycom/interface/_find_helper.py:19: UserWarning: No constraints were passed to find(). This will return all proteins in the database.
warn('No constraints were passed to find(). This will return all proteins in the database.')
[9]:
entries_data_frame.describe()
[9]:
| | neff | sequence_length | helix_frac | turn_frac | strand_frac | has_ptm | has_pdb | has_substrate |
|---|---|---|---|---|---|---|---|---|
| count | 457622.000000 | 457622.000000 | 457622.000000 | 457622.000000 | 457622.000000 | 457622.000000 | 457622.000000 | 457622.000000 |
| mean | 8.397407 | 251.278734 | 0.013926 | 0.001262 | 0.009162 | 0.064923 | 0.050238 | 0.427045 |
| std | 2.498266 | 124.627642 | 0.076069 | 0.008348 | 0.052192 | 0.246389 | 0.218436 | 0.494649 |
| min | 1.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 6.928000 | 147.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 8.621000 | 243.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 10.176000 | 351.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| max | 17.205000 | 500.000000 | 0.956522 | 0.542857 | 0.816901 | 1.000000 | 1.000000 | 1.000000 |
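The `has_ptm`, `has_pdb` and `has_substrate` columns look like 0/1 flags (minimum 0, maximum 1, quartiles of 0 or 1 in the summary above). Under that assumption, the mean of a column times the row count approximates how many proteins carry the annotation. The sketch below simply redoes that arithmetic with numbers copied from the summary; the results are approximate because the means are rounded to six decimals.

```python
# Values copied from the describe() output above.
n_rows = 457622
flag_means = {
    "has_ptm": 0.064923,
    "has_pdb": 0.050238,
    "has_substrate": 0.427045,
}

# For a 0/1 column, mean * count recovers the number of rows equal to 1
# (assumption: these columns only hold 0/1 or False/True values).
approx_counts = {name: round(mean * n_rows) for name, mean in flag_means.items()}

for name, count in approx_counts.items():
    print(f"{name}: ~{count} proteins")
```

On the dataframe itself, `entries_data_frame["has_pdb"].sum()` should give the exact figure.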
Save the query to a csv file
[ ]:
entries_data_frame.to_csv("Full_DB_Query.csv",index=False)
Read query data from csv file
[ ]:
# Uncomment to reload the query saved above instead of querying the database again
#entries_data_frame=pd.read_csv("Full_DB_Query.csv")
Find unique entries in a column:
[ ]:
entries_data_frame['has_ptm'].unique()
[ ]:
entries_data_frame['has_pdb'].value_counts()
[ ]:
entries_data_frame["neff"].min()
Supported query keywords (`pyc` below refers to the PyCom object, `obj_pycom` in this notebook):
- `uniprot_id`: The UniProt ID of the protein.
- `sequence`: The amino acid sequence of the protein to search for (full match).
- `min_length` / `max_length`: Min/max number of residues in the protein.
- `min_helix` / `max_helix`: Min/max percentage of helical structure in the protein.
- `min_turn` / `max_turn`: Min/max percentage of turn structure in the protein.
- `min_strand` / `max_strand`: Min/max percentage of beta strand structure in the protein.
- `organism`: Taxonomic name of the genus/species of the protein (case-insensitive). The species name or any parent taxonomic level can be used (see `pyc.get_organism_list()` for the full list). Surround the name with `:` to get precise results: `:homo:` returns `Homo sapiens` and `Homo sapiens neanderthalensis`, whereas `homo` also returns homoeomma, thomomys, and hundreds of others.
- `organism_id`: Precise NCBI Taxonomy ID of the species of the protein (prefer to use `organism` instead).
- `cath`: CATH classification of the protein (`3.40.50.360` or `3.40.*.*` or `3.*`).
- `enzyme`: Enzyme Commission number of the protein (`1.3.1.3` or `1.3.*.*` or `1.*`).
- `has_substrate`: Whether the protein has a known substrate (`True` / `False`).
- `has_ptm`: Whether the protein has a known post-translational modification (`True` / `False`).
- `has_pdb`: Whether the protein has a known PDB structure (`True` / `False`).
- `disease`: The disease associated with the protein (name of disease, case-insensitive, e.g. `cancer`). Use `pyc.get_disease_list()` for the full list. `cancer` searches for `Ovarian cancer`, `Lung cancer`, …
- `disease_id`: The ID of the disease associated with the protein (`DI-02205`, see `get_disease_list()`).
- `has_disease`: Whether the protein is associated with a disease (`True` / `False`).
- `cofactor`: The cofactor associated with the protein (name of cofactor, case-insensitive, e.g. `Zn(2+)`).
- `cofactor_id`: The ID of the cofactor associated with the protein (`CHEBI:00001`, see `get_cofactor_list()`).
- `biological_process`: Biological process associated with the protein (e.g. `antiviral defense`, use `pyc.get_biological_process_list()` for the full list).
- `cellular_component`: Cellular component associated with the protein (e.g. `nucleus`, use `pyc.get_cellular_component_list()` for the full list).
- `domain`: Domain associated with the protein (e.g. `zinc-finger`, use `pyc.get_domain_list()` for the full list).
- `ligand`: Ligand associated with the protein (e.g. `zinc`, use `pyc.get_ligand_list()` for the full list).
- `molecular_function`: Molecular function associated with the protein (e.g. `antioxidant activity`, use `pyc.get_molecular_function_list()` for the full list).
- `ptm`: Post-translational modification associated with the protein (e.g. `phosphoprotein`, use `pyc.get_ptm_list()` for the full list).
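To make the keyword list concrete, here is a minimal sketch of a non-empty query dictionary. The `check_query` helper and the `SUPPORTED_KEYWORDS` set are hypothetical conveniences written for this example (they are not part of PyCoM); they only catch misspelled keys locally before a query is sent. The PDB flag is written as `has_pdb`, matching the dataframe column name.

```python
# Keywords copied from the list above; used only for a local sanity check.
SUPPORTED_KEYWORDS = {
    "uniprot_id", "sequence", "min_length", "max_length",
    "min_helix", "max_helix", "min_turn", "max_turn",
    "min_strand", "max_strand", "organism", "organism_id",
    "cath", "enzyme", "has_substrate", "has_ptm", "has_pdb",
    "disease", "disease_id", "has_disease", "cofactor", "cofactor_id",
    "biological_process", "cellular_component", "domain", "ligand",
    "molecular_function", "ptm",
}

def check_query(query):
    """Hypothetical helper: raise if a query uses a keyword not listed above."""
    unknown = sorted(set(query) - SUPPORTED_KEYWORDS)
    if unknown:
        raise KeyError(f"Unsupported query keyword(s): {', '.join(unknown)}")
    return query

# Example: enzymes in EC class 3 associated with cancer, at most 300 residues.
query_parameters = check_query({
    "enzyme": "3.*",
    "disease": "cancer",
    "max_length": 300,
})
print(query_parameters)

# With a PyCom object available, the query would then be run as in this tutorial:
# entries_data_frame = obj_pycom.find(query_parameters)
```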
Set the font sizes shared by all plots below:
[ ]:
# plotting parameters
ticks_font=12
labels_font=14
Plot N_eff
[ ]:
# raw string so that the backslash in the LaTeX label is not treated as an escape sequence
xlabel=r'$\mathrm{N}_{eff}$'
ylabel="Count"
plt.figure(figsize=(4,3))
neff_hist=plt.hist(entries_data_frame["neff"],bins=50,color=my_color,cumulative=False,density=False,alpha=0.5)
plt.xlabel(xlabel,fontsize=labels_font)
plt.ylabel(ylabel,fontsize=labels_font)
plt.xticks(np.arange(0,18,2),fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(linestyle="--",lw=1)
plt.tight_layout()
plt.savefig("Neff.png",dpi=300,transparent=True)
Plot Sequence length distribution
[ ]:
xlabel='Sequence length'
ylabel="Count"
plt.figure(figsize=(4,3))
seq_hist=plt.hist(entries_data_frame["sequence_length"],bins=50,color=my_color,cumulative=False,density=False,alpha=0.5)
plt.xlabel(xlabel,fontsize=labels_font)
plt.ylabel(ylabel,fontsize=labels_font)
plt.xticks(np.arange(0,550,100),fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(linestyle="--",lw=1)
plt.tight_layout()
plt.savefig("seq_len.png",dpi=300,transparent=True)
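The two histogram cells above differ only in the column, axis label, tick positions and output file. As a sketch, the shared styling could be wrapped in one function; `save_histogram` is a hypothetical helper name, and the data below are synthetic stand-ins rather than values from the database.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display; omit in a notebook
import matplotlib.pyplot as plt
import numpy as np

def save_histogram(values, xlabel, out_path, xticks=None, color="#6495ED",
                   labels_font=14, ticks_font=12):
    """Hypothetical helper wrapping the histogram styling used in this tutorial."""
    fig, ax = plt.subplots(figsize=(4, 3))
    counts, _, _ = ax.hist(values, bins=50, color=color, alpha=0.5)
    ax.set_xlabel(xlabel, fontsize=labels_font)
    ax.set_ylabel("Count", fontsize=labels_font)
    if xticks is not None:
        ax.set_xticks(xticks)
    ax.tick_params(labelsize=ticks_font)
    ax.grid(linestyle="--", lw=1)
    fig.tight_layout()
    fig.savefig(out_path, dpi=300, transparent=True)
    plt.close(fig)
    return counts

# Synthetic stand-in data; replace with entries_data_frame["sequence_length"].
rng = np.random.default_rng(0)
fake_lengths = rng.integers(5, 501, size=1000)
out_file = os.path.join(tempfile.gettempdir(), "seq_len_sketch.png")
counts = save_histogram(fake_lengths, "Sequence length", out_file,
                        xticks=np.arange(0, 550, 100))
```

The function returns the per-bin counts, so the same call can feed both the figure and any follow-up checks.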
[ ]:
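#This is not informative
# entries_has_pdb was not defined earlier; it is assumed here to be the entries with has_pdb set
entries_has_pdb=entries_data_frame[entries_data_frame["has_pdb"]==1]
xlabel='Secondary structure (%)'
ylabel="Count"
helix_hist=plt.hist(entries_has_pdb["helix_frac"]*100,bins=50,color="#6495ED",cumulative=False,density=False,alpha=0.4)
strand_hist=plt.hist(entries_has_pdb["strand_frac"]*100,bins=50,color="#FFBF00",cumulative=False,density=False,alpha=0.4)
# the turn histogram must use turn_frac (the original plotted helix_frac a second time)
turn_hist=plt.hist(entries_has_pdb["turn_frac"]*100,bins=50,color="#DE3163",cumulative=False,density=False,alpha=0.4)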
plt.xlabel(xlabel,fontsize=labels_font)
plt.ylabel(ylabel,fontsize=labels_font)
plt.xticks(np.arange(0,110,10),fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.savefig("sstruc.png",dpi=300,transparent=True)
Get columns in the dataframe
[ ]:
entries_data_frame.head()
Add biological features to the dataframe for each protein
Initialise the data loader object and then call each add function:
- Add Enzyme Classification
- Add CATH Class
- Add PDB IDs
- Add Co-factors
- Add PTM
- Add Diseases
[ ]:
obj_data_loader=obj_pycom.get_data_loader()
entries_data_frame=obj_data_loader.add_enzyme_commission(entries_data_frame,force_single_entry=False)
entries_data_frame=obj_data_loader.add_cath_class(entries_data_frame,force_single_entry=False)
[ ]:
entries_data_frame=obj_data_loader.add_pdbs(entries_data_frame,force_single_entry=False)
[ ]:
entries_data_frame=obj_data_loader.add_cofactors(entries_data_frame,force_single_entry=False)
[ ]:
entries_data_frame=obj_data_loader.add_ptm(entries_data_frame,force_single_entry=False)
[ ]:
entries_data_frame=obj_data_loader.add_diseases(entries_data_frame,force_single_entry=False)
Save the progress to a csv file
[ ]:
entries_data_frame.to_csv("Full_DB_With_Details.csv",index=False)
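One caveat when saving at this stage: the grouping code later in this tutorial calls `len()` on and indexes into each `enzyme_commission` / `cath_class` entry, which suggests these columns hold Python lists. A CSV file stores such a list as its string representation, so after reloading the entry is a plain string. The sketch below illustrates the round trip with the standard `csv` module and made-up rows, and parses the strings back with `ast.literal_eval`; it is an illustration under that assumption, not a description of how PyCoM stores the data.

```python
import ast
import csv
import io

# Toy rows standing in for the dataframe; the IDs and values are made up.
rows = [
    {"uniprot_id": "X00001", "enzyme_commission": ["3.1.1.1", "2.7.1.1"]},
    {"uniprot_id": "X00002", "enzyme_commission": ["1.1.1.1"]},
]

# Write to an in-memory CSV; each list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["uniprot_id", "enzyme_commission"])
writer.writeheader()
writer.writerows(rows)

# Read back: every field is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(type(loaded[0]["enzyme_commission"]))  # str, not list

# Parse the string back into a list before counting classes.
for row in loaded:
    row["enzyme_commission"] = ast.literal_eval(row["enzyme_commission"])
```

With pandas, applying `ast.literal_eval` to the non-null cells of the affected columns after `pd.read_csv` would be the analogous step.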
[ ]:
# number of distinct sequences (unique().sum() would concatenate the strings instead of counting them)
entries_data_frame['sequence'].nunique()
Number of entries with PDB files available
[ ]:
entries_data_frame["pdb_id"].notna().sum()
[ ]:
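# ec_full is not defined in this notebook; count the entries with an EC annotation instead
entries_data_frame["enzyme_commission"].notna().sum()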
[ ]:
entries_data_frame["cofactor"].notna().sum()
[ ]:
entries_data_frame["cath_class"].notna().sum()
[ ]:
entries_with_ec_data=entries_data_frame[entries_data_frame["enzyme_commission"].notna()]
entries_with_cath_data=entries_data_frame[entries_data_frame["cath_class"].notna()]
[ ]:
entries_with_cath_data["cath_class"].isna().value_counts()
[ ]:
def group_data_by_class(data,data_type=1):
    """Count entries per top-level class (data_type=1: EC, data_type=2: CATH)."""
    global dict_group_data
    dict_group_data={}
    data_class="enzyme_commission"
    if(data_type==2):
        data_class="cath_class"
    for i_data in data[data_class]:
        # each entry is a list of dotted classification strings, e.g. ["3.1.1.1"];
        # a protein with several classifications contributes one count per classification
        for j in i_data:
            update_dict_group(j.split('.')[0])
    # sort by class id so that plots are drawn in a stable order
    dict_group_data=dict(sorted(dict_group_data.items()))
    return dict_group_data
def update_dict_group(classid):
    """Increment the counter of one class in the shared dictionary."""
    global dict_group_data
    if(classid in dict_group_data):
        dict_group_data[classid]=dict_group_data[classid]+1
    else:
        dict_group_data[classid]=1
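The same tally can be written without global state. The following is a minimal sketch using `collections.Counter`, assuming (as the code above does) that each entry is a list of dotted classification strings; `count_top_level_classes` is a hypothetical name and the input is toy data.

```python
from collections import Counter

def count_top_level_classes(entries):
    """Count the first field of each dotted classification string.

    `entries` is an iterable of lists such as ["3.1.1.1", "2.7.1.1"];
    None or empty entries are skipped.
    """
    counts = Counter()
    for classifications in entries:
        if not classifications:
            continue
        for classification in classifications:
            counts[classification.split(".")[0]] += 1
    return dict(sorted(counts.items()))

# Toy input, not real database content.
example = [["3.1.1.1"], ["2.7.1.1", "3.4.21.4"], [], ["1.1.1.1"]]
print(count_top_level_classes(example))  # {'1': 1, '2': 1, '3': 2}
```

It could be called as `count_top_level_classes(entries_with_ec_data["enzyme_commission"])`, provided the column still holds lists.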
[ ]:
ec_numbers=group_data_by_class(entries_with_ec_data,data_type=1)
ec_numbers
[ ]:
cath_numbers=group_data_by_class(entries_with_cath_data,data_type=2)
cath_numbers
[ ]:
entries_with_cath_data["cath_class"].isna().sum()
[ ]:
entries_with_cath_data
[ ]:
ec_pie=plt.pie(ec_numbers.values(),
labels=ec_numbers.keys(),
autopct='%1.1f%%',
textprops=dict(color="w",fontsize=10),
startangle=90)
plt.savefig("ecdata.png",dpi=300,transparent=True)
[ ]:
cath_pie=plt.pie(cath_numbers.values(),
labels=cath_numbers.keys(),
autopct='%1.1f%%',
textprops=dict(color="black",fontsize=10),
startangle=90)
plt.tight_layout()
plt.savefig("cathdata.png",dpi=300,transparent=True)
[ ]:
plt.figure(figsize=(4,3))
plt.bar(cath_numbers.keys(),
height=cath_numbers.values(),
color=my_color
)
plt.xlabel("CATH Class",fontsize=labels_font)
plt.ylabel("Count",fontsize=labels_font)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("cathdata.png",dpi=300,transparent=True)
[ ]:
plt.figure(figsize=(4,3))
plt.bar(ec_numbers.keys(),
height=ec_numbers.values(),
color=my_color,
)
plt.xlabel("Enzyme Commission Class",fontsize=labels_font)
plt.ylabel("Count",fontsize=labels_font)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("ecdata.png",dpi=300,transparent=True)
[ ]:
dis=entries_data_frame[entries_data_frame['enzyme_commission'].notna()]
[ ]:
dis["enzyme_commission"].unique()