Tutorial 4: Generate images from the paper

[1]:

# importing all usefull classes from PyCoM
from pycom import PyCom, ProteinParams,CoMAnalysis
import pandas as pd
import numpy as np
# matplotlib
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['pdf.fonttype'] = 42
matplotlib.rcParams['font.family'] = "sans-serif"
matplotlib.rcParams['font.sans-serif'] = "Arial"

[5]:

#set the path to the database
database_folder_path="/Volumes/mason/Work/Sarath/Research/pycom/"
file_matrix_db = database_folder_path+"pycom.mat"
file_protein_db= database_folder_path+"pycom.db"
my_color="#6495ED"

[6]:

obj_pycom = PyCom(db_path=file_protein_db, mat_path=file_matrix_db)

[2]:

obj_pycom = PyCom(remote=True)

Construct your query (its empty as I want all information)

[7]:

# Here we are asking for all the proteins that match the enzyme class 3 and have been associated with the disease cancer.
query_parameters={}
# executing the query returns a pandas dataframe with information about all the proteins which match the query

Finding out dimensions of the dataframe:

[8]:

entries_data_frame=obj_pycom.find(query_parameters)

/Users/sdantu/Work/pyc_wspace/pycom/pycom/pycom/interface/_find_helper.py:19: UserWarning: No constraints were passed to find(). This will return all proteins in the database.
  warn('No constraints were passed to find(). This will return all proteins in the database.')

[9]:

entries_data_frame.describe()

[9]:

	neff	sequence_length	helix_frac	turn_frac	strand_frac	has_ptm	has_pdb	has_substrate
count	457622.000000	457622.000000	457622.000000	457622.000000	457622.000000	457622.000000	457622.000000	457622.000000
mean	8.397407	251.278734	0.013926	0.001262	0.009162	0.064923	0.050238	0.427045
std	2.498266	124.627642	0.076069	0.008348	0.052192	0.246389	0.218436	0.494649
min	1.000000	5.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
25%	6.928000	147.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
50%	8.621000	243.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	10.176000	351.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000
max	17.205000	500.000000	0.956522	0.542857	0.816901	1.000000	1.000000	1.000000

Save the query to a csv file

[ ]:

entries_data_frame.to_csv("Full_DB_Query.csv",index=False)

Read query data from csv file

[ ]:

#entries_data_frame=pd.read_csv("Full_DB.csv")

Find unique entries in a column:

[ ]:

entries_data_frame['has_ptm'].unique()

[ ]:

entries_data_frame['has_pdb'].value_counts()

[ ]:

entries_data_frame["neff"].min()

Supported query keywords:

uniprot_id: The UniProt ID of the protein.
sequence: The amino acid sequence of protein to search for. (full match)
min_length / max_length: Min/Max number of residues in the protein.
min_helix / max_helix: Min/Max percentage of helical structure in the protein.
min_turn / max_turn: Min/Max percentage of turn structure in the protein.
min_strand / max_strand: Min/Max percentage of beta strand structure in the protein.
organism: Taxonomic name of the genus / species of the protein. (case-insensitive)
- Species name or any parent taxonomic level can be used. (pyc.get_organism_list() for full list)
- Surround with : to get precise results
  - :homo: returns Homo sapiens & Homo sapiens neanderthalensis)
  - homo also returns homoeomma, thomomys, and hundreds others
organism_id: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use organism instead)
cath: CATH classification of the protein (3.40.50.360 or 3.40.*.* or 3.*).
enzyme: Enzyme Commission number of the protein. (1.3.1.3 or 1.3.*.* or 1.*).
has_substrate: Whether the protein has a known substrate. (True/False)
has_ptm: Whether the protein has a known post-translational modification. (True/False)
has_pbd: Whether the protein has a known PDB structure. (True/False)
disease: The disease associated with the protein. (name of disease, case-insensitive, e.g cancer)
- Use pyc.get_disease_list() for full list.
- cancer searches for Ovarian cancer, Lung cancer, …
disease_id: The ID of the disease associated with the protein. (DI-02205, get_disease_list()
has_disease: Whether the protein is associated with a disease. (True/False)
cofactor: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g Zn(2+)])
cofactor_id: The ID of the cofactor associated with the protein. (CHEBI:00001, get_cofactor_list())
biological_process: Biological process associated with the protein. (e.g antiviral defense, use pyc.get_biological_process_list() for full list)
cellular_component: Cellular component associated with the protein. (e.g nucleus, use pyc.get_cellular_component_list() for full list
domain: Domain associated with the protein. (e.g zinc-finger, use pyc.get_domain_list() for full list)
ligand: Ligand associated with the protein. (e.g zinc, use pyc.get_ligand_list() for full list
molecular_function: Molecular function associated with the protein. (e.g antioxidant activity, use pyc.get_molecular_function_list() for full list
ptm: Post-translational modification associated with the protein. (e.g phosphoprotein, use pyc.get_ptm_list() for full list

Here is an example of making a large query, then paginating the results:

[ ]:

# plotting parameters
ticks_font=12
labels_font=14

Plot N_eff

[ ]:

xlabel='$\mathrm{N}_{eff}$'
ylabel="Count"
plt.figure(figsize=(4,3))
neff_hist=plt.hist(entries_data_frame["neff"],bins=50,color=my_color,cumulative=False,density=False,alpha=0.5)
plt.xlabel(xlabel,fontsize=labels_font)
plt.ylabel(ylabel,fontsize=labels_font)
plt.xticks(np.arange(0,18,2),fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(linestyle="--",lw=1)
plt.tight_layout()
plt.savefig("Neff.png",dpi=300,transparent=True)

Plot Sequence length distribution

[ ]:

xlabel='Sequence length'
ylabel="Count"
plt.figure(figsize=(4,3))
seq_hist=plt.hist(entries_data_frame["sequence_length"],bins=50,color=my_color,cumulative=False,density=False,alpha=0.5)
plt.xlabel(xlabel,fontsize=labels_font)
plt.ylabel(ylabel,fontsize=labels_font)
plt.xticks(np.arange(0,550,100),fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(linestyle="--",lw=1)
plt.tight_layout()
plt.savefig("seq_len.png",dpi=300,transparent=True)

[ ]:

#This is not informative

xlabel='Secondary structure (%)'
ylabel="Count"
helix_hist=plt.hist(entries_has_pdb["helix_frac"]*100,bins=50,color="#6495ED",cumulative=False,density=False,alpha=0.4)
strand_hist=plt.hist(entries_has_pdb["strand_frac"]*100,bins=50,color="#FFBF00",cumulative=False,density=False,alpha=0.4)
turn_hist=plt.hist(entries_has_pdb["helix_frac"]*100,bins=50,color="#DE3163",cumulative=False,density=False,alpha=0.4)

plt.xlabel(xlabel,fontsize=labels_font)
plt.ylabel(ylabel,fontsize=labels_font)
plt.xticks(np.arange(0,110,10),fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.savefig("sstruc.png",dpi=300,transparent=True)

Get columns in the dataframe

[ ]:

entries_data_frame.head()

Add biological features to the dataframe for each protein

Initialise the object loader class and then call each add function

Add Enzyme Classification
Add CATH Class
Add Co-factors
Add PTM
Add Diseases

[ ]:

obj_data_loader=obj_pycom.get_data_loader()
entries_data_frame=obj_data_loader.add_enzyme_commission(entries_data_frame,force_single_entry=False)
entries_data_frame=obj_data_loader.add_cath_class(entries_data_frame,force_single_entry=False)

[ ]:

entries_data_frame=obj_data_loader.add_pdbs(entries_data_frame,force_single_entry=False)

[ ]:

entries_data_frame=obj_data_loader.add_cofactors(entries_data_frame,force_single_entry=False)

[ ]:

entries_data_frame=obj_data_loader.add_ptm(entries_data_frame,force_single_entry=False)

[ ]:

entries_data_frame=obj_data_loader.add_diseases(entries_data_frame,force_single_entry=False)

Save the progress to a csv file

[ ]:

entries_data_frame.to_csv("Full_DB_With_Details.csv",index=False)

[ ]:

entries_data_frame['sequence'].unique().sum()

Number of entries with PDB files available

[ ]:

entries_data_frame["pdb_id"].notna().sum()

[ ]:

ec_full.describe(include="all")

[ ]:

entries_data_frame["cofactor"].notna().sum()

[ ]:

entries_data_frame["cath_class"].notna().sum()

[ ]:

entries_with_ec_data=entries_data_frame[entries_data_frame["enzyme_commission"].notna()]
entries_with_cath_data=entries_data_frame[entries_data_frame["cath_class"].notna()]

[ ]:

entries_with_cath_data["cath_class"].isna().value_counts()

[ ]:

from collections import OrderedDict

def group_data_by_class(data,data_type=1):
    global dict_group_data
    dict_group_data={}

    data_class="enzyme_commission"
    if(data_type==1):
        data_class="enzyme_commission"
    if(data_type==2):
        data_class="cath_class"

    for i_data in data[data_class]:
        n=len(i_data)
        if(n==1):
            classid=i_data[0].split('.')[0]
            update_dict_group(classid)
        if(n>1):
            for j in i_data:
                update_dict_group(j[0].split('.')[0])
    dict_group_data=dict(sorted(dict_group_data.items()))
    return dict_group_data
def update_dict_group(classid):
    global dict_group_data

    if(classid in dict_group_data.keys()):
        dict_group_data[classid]=dict_group_data[classid]+1
    else:
        dict_group_data[classid]=1

[ ]:

ec_numbers=group_data_by_class(entries_with_ec_data,data_type=1)
ec_numbers

[ ]:

cath_numbers=group_data_by_class(entries_with_cath_data,data_type=2)
cath_numbers

[ ]:

entries_with_cath_data["cath_class"].isna().sum()

[ ]:

entries_with_cath_data

[ ]:

ec_pie=plt.pie(ec_numbers.values(),
               labels=ec_numbers.keys(),
               autopct='%1.1f%%',
               textprops=dict(color="w",fontsize=10),
               startangle=90)
plt.savefig("ecdata.png",dpi=300,transparent=True)

[ ]:

cath_pie=plt.pie(cath_numbers.values(),
               labels=cath_numbers.keys(),
               autopct='%1.1f%%',
               textprops=dict(color="black",fontsize=10),
               startangle=90)
plt.tight_layout()
plt.savefig("cathdata.png",dpi=300,transparent=True)

[ ]:

plt.figure(figsize=(4,3))
plt.bar(cath_numbers.keys(),
                 height=cath_numbers.values(),
                 color=my_color
                )
plt.xlabel("CATH Class",fontsize=labels_font)
plt.ylabel("Count",fontsize=labels_font)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("cathdata.png",dpi=300,transparent=True)

[ ]:

plt.figure(figsize=(4,3))
plt.bar(ec_numbers.keys(),
        height=ec_numbers.values(),
        color=my_color,
       )
plt.xlabel("Enzyme Commission Class",fontsize=labels_font)
plt.ylabel("Count",fontsize=labels_font)
plt.xticks(fontsize=ticks_font)
plt.yticks(fontsize=ticks_font)
plt.grid(axis='y',ls="--",lw=1)
plt.tight_layout()
plt.savefig("ecdata.png",dpi=300,transparent=True)

[ ]:

dis=entries_data_frame[entries_data_frame['enzyme_commission'].notna()]

[ ]:

dis["enzyme_commission"].unique()

[ ]:

Protein