{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Tutorial 1: Workflow of PyCoM\n",
"\n",
"In this tutorial, you will use a small example to learn how to build a workflow with the local version of PyCoM, following the steps listed below:\n",
"\n",
"1. [Setup](#setup)\n",
"2. [Initialise pycom objects](#initialise-pycom-objects)\n",
"3. [Create a query dictionary](#create-a-query-dictionary)\n",
"4. [Save and retrieve progress](#save-and-retrieve-progress)\n",
"5. [Analyse search results](#analyse-search-results)\n",
"6. [Add biological features](#add-biological-features)\n",
"7. [Some statistics](#some-statistics)\n",
"8. [Coevolution matrix analysis](#coevolution-matrix-analysis)\n",
"9. [Help on UniProt Controlled Vocabulary](#help-on-uniprot-controlled-vocabulary)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Setup\n",
"\n",
"This tutorial assumes that you have completed the [installation](https://pycom.brunel.ac.uk/install.html) and [downloaded](https://pycom.brunel.ac.uk/database.html) the database. For help with either step, please see the [quick guide](https://pycom.brunel.ac.uk/gettingstarted.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialise pycom objects\n",
"First, let's import all the libraries and classes we need from pycom, pandas, matplotlib, and numpy."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"# import the classes we need from PyCoM\n",
"from pycom import PyCom, ProteinParams, CoMAnalysis\n",
"import pandas as pd\n",
"import numpy as np\n",
"# matplotlib; useful for plotting later\n",
"import matplotlib\n",
"import matplotlib.pyplot as plt\n",
"#setting matplotlib parameters\n",
"matplotlib.rcParams['pdf.fonttype'] = 42\n",
"matplotlib.rcParams['font.family'] = \"sans-serif\"\n",
"matplotlib.rcParams['font.sans-serif'] = \"Arial\""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"#set the path to the folder containing the database files (replace with the path to your own copy)\n",
"database_folder_path=\"/Volumes/mason/Work/Sarath/Research/pycom/\"\n",
"#matrix file name and path\n",
"file_matrix_db = database_folder_path+\"pycom.mat\"\n",
"#protein database file name and path\n",
"file_protein_db= database_folder_path+\"pycom.db\""
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"obj_pycom = PyCom(db_path=file_protein_db, mat_path=file_matrix_db)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create a query dictionary\n",
"\n",
"To query the database, we need to create a dictionary object `query_parameters` using the keywords for the properties we are interested in. For the full list of keywords, please check the PyCoM documentation.\n",
"\n",
"**An empty `query_parameters` dictionary will return information on all of the ~457,000 proteins in the database.**"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"#creating empty query dictionary\n",
"query_parameters={}"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"# Here we ask for all proteins in enzyme class 3 that are associated with cancer\n",
"# and have a sequence length between 100 and 200 residues.\n",
"query_parameters={ProteinParams.DISEASE:\"cancer\",\n",
" ProteinParams.ENZYME: '3.*',\n",
" ProteinParams.MIN_LENGTH: 100,\n",
" ProteinParams.MAX_LENGTH: 200,\n",
" }"
]
},
{
"cell_type": "markdown",
"metadata": {
"jp-MarkdownHeadingCollapsed": true,
"tags": []
},
"source": [
"Calling the `find()` function of the pycom object `obj_pycom` with the parameters defined in the cell above returns a pandas dataframe containing information about all the proteins that match our query."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"entries_data_frame=obj_pycom.find(query_parameters)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Save and retrieve progress\n",
"\n",
"We can save our progress by writing the dataframe with information on our favourite proteins to a csv file, and retrieve it later by reading that file back."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Save the query to a csv file\n",
"\n",
"To avoid rerunning the query, we can save the progress to a csv file."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"entries_data_frame.to_csv(\"output/DB_Query_Results.csv\",index=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Read query data from csv file\n",
"\n",
"To retrieve our progress, read the csv file back into a dataframe (uncomment the line below)."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"#entries_data_frame=pd.read_csv(\"output/DB_Query_Results.csv\")"
]
},
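{
"cell_type": "markdown",
"metadata": {},
"source": [
"The save-and-reload round trip above can be sketched on a toy dataframe. This is a minimal sketch and not part of the PyCoM API: the temporary folder, the file name `toy_results.csv` and the two-row dataframe are made up for illustration. The sketch creates the output folder first, because pandas `to_csv()` does not create missing folders."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import tempfile\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for the search results (hypothetical two-row subset)\n",
"toy_df = pd.DataFrame({\"uniprot_id\": [\"P01111\", \"P01112\"], \"neff\": [12.817, 12.841]})\n",
"\n",
"# to_csv() does not create missing folders, so create the output folder first\n",
"out_dir = os.path.join(tempfile.mkdtemp(), \"output\")\n",
"os.makedirs(out_dir, exist_ok=True)\n",
"csv_path = os.path.join(out_dir, \"toy_results.csv\")\n",
"\n",
"# Save without the index column, then read the file back\n",
"toy_df.to_csv(csv_path, index=False)\n",
"reloaded_df = pd.read_csv(csv_path)\n",
"print(reloaded_df)"
]
},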
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Analyse search results\n",
"\n",
"The search returns a pandas dataframe of the proteins matching the query criteria, with the following information for each protein:\n",
"\n",
"* uniprot_id: UniProt ID\n",
"* neff: Depth of the sequence alignment $N_{eff}$\n",
"* sequence_length: Sequence length\n",
"* sequence: Protein sequence\n",
"* organism_id: Organism ID\n",
"* helix_frac, turn_frac, strand_frac: Helix, turn, and strand structure fraction\n",
"* has_ptm: Has a PTM (Yes/No)\n",
"* has_pdb: Has a PDB structure (Yes/No)\n",
"* has_substrate: Has a substrate for biological activity (Yes/No)\n",
"* matrix: Coevolution matrix. This column is empty at this stage, so that you can first check the search results and, if required, filter them by any of the biological properties before loading the matrices.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's look at which columns we have and what they are called."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" uniprot_id neff sequence_length \\\n",
"0 P01111 12.817 189 \n",
"1 P01112 12.841 189 \n",
"2 P01116 12.626 189 \n",
"\n",
" sequence organism_id helix_frac \\\n",
"0 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.349206 \n",
"1 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.317460 \n",
"2 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.375661 \n",
"\n",
" turn_frac strand_frac has_ptm has_pdb has_substrate matrix \n",
"0 0.015873 0.227513 1 1 1 None \n",
"1 0.031746 0.359788 1 1 1 None \n",
"2 0.031746 0.328042 1 1 1 None "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `describe()` function from pandas can be used to get a summary of all the features:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" uniprot_id neff sequence_length \\\n",
"count 3 3.000000 3.0 \n",
"unique 3 NaN NaN \n",
"top P01111 NaN NaN \n",
"freq 1 NaN NaN \n",
"mean NaN 12.761333 189.0 \n",
"std NaN 0.117815 0.0 \n",
"min NaN 12.626000 189.0 \n",
"25% NaN 12.721500 189.0 \n",
"50% NaN 12.817000 189.0 \n",
"75% NaN 12.829000 189.0 \n",
"max NaN 12.841000 189.0 \n",
"\n",
" sequence organism_id \\\n",
"count 3 3 \n",
"unique 3 1 \n",
"top MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 \n",
"freq 1 3 \n",
"mean NaN NaN \n",
"std NaN NaN \n",
"min NaN NaN \n",
"25% NaN NaN \n",
"50% NaN NaN \n",
"75% NaN NaN \n",
"max NaN NaN \n",
"\n",
" helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate \\\n",
"count 3.000000 3.000000 3.000000 3.0 3.0 3.0 \n",
"unique NaN NaN NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN NaN NaN \n",
"mean 0.347443 0.026455 0.305115 1.0 1.0 1.0 \n",
"std 0.029141 0.009164 0.069054 0.0 0.0 0.0 \n",
"min 0.317460 0.015873 0.227513 1.0 1.0 1.0 \n",
"25% 0.333333 0.023810 0.277778 1.0 1.0 1.0 \n",
"50% 0.349206 0.031746 0.328042 1.0 1.0 1.0 \n",
"75% 0.362434 0.031746 0.343915 1.0 1.0 1.0 \n",
"max 0.375661 0.031746 0.359788 1.0 1.0 1.0 \n",
"\n",
" matrix \n",
"count 0 \n",
"unique 0 \n",
"top NaN \n",
"freq NaN \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame.describe(include=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the counts of categorical data in a column, for example:\n",
"* number of proteins with a known PDB structure\n",
"* number of proteins with a known PTM"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"has_pdb\n",
"1 3\n",
"Name: count, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame['has_pdb'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find the unique elements in a column, for example the unique organisms:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['9606'], dtype=object)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame['organism_id'].unique()"
]
},
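{
"cell_type": "markdown",
"metadata": {},
"source": [
"`unique()` returns the distinct values themselves. When only the number of distinct values is needed, pandas also provides `nunique()`. A minimal sketch on a toy column follows; the organism IDs are illustrative and are not query results."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy organism_id column; the IDs are illustrative only\n",
"organism_ids = pd.Series([\"9606\", \"9606\", \"10090\"], name=\"organism_id\")\n",
"\n",
"print(organism_ids.unique())        # the distinct values\n",
"print(organism_ids.nunique())       # the number of distinct values\n",
"print(organism_ids.value_counts())  # how often each value occurs"
]
},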
{
"cell_type": "markdown",
"metadata": {},
"source": [
"All the sequences are from the same organism: `9606`, i.e. `Homo sapiens`. The full list of organisms is available from [UniProt](https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/docs/speclist.txt)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get some statistics on a numerical column, for example `neff`:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 3.000000\n",
"mean 12.761333\n",
"std 0.117815\n",
"min 12.626000\n",
"25% 12.721500\n",
"50% 12.817000\n",
"75% 12.829000\n",
"max 12.841000\n",
"Name: neff, dtype: float64"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame[\"neff\"].describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use other functions, such as `min()` and `mean()`, to get individual statistics:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12.626"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame[\"neff\"].min()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"12.761333333333333"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame[\"neff\"].mean()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Add biological features\n",
"\n",
"Initialise the data loader object and then call each `add` function:\n",
"\n",
"1. Add Enzyme Classification \n",
"2. Add CATH Class\n",
"3. Add PDB IDs\n",
"4. Add Co-factors\n",
"5. Add PTM\n",
"6. Add Diseases\n",
"7. Add Ligands\n",
"\n",
"If the requested data (EC/CATH/Cofactors...) does not exist for a protein entry, the corresponding entry in that column will be `nan`. We can filter out such rows, as shown further below.\n",
"\n",
"**Please note that the data loader functions will not work with the remote version of PyCoM.**"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"#initialise the object for data loader class\n",
"obj_data_loader=obj_pycom.get_data_loader()"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"#add enzyme commission data to the dataframe\n",
"entries_data_frame=obj_data_loader.add_enzyme_commission(entries_data_frame,force_single_entry=False)\n",
"#add CATH data to the dataframe\n",
"entries_data_frame=obj_data_loader.add_cath_class(entries_data_frame,force_single_entry=False)\n",
"#add PDB IDs to the dataframe\n",
"entries_data_frame=obj_data_loader.add_pdbs(entries_data_frame,force_single_entry=False)\n",
"#get list of all cofactors for each protein\n",
"entries_data_frame=obj_data_loader.add_cofactors(entries_data_frame,force_single_entry=False)\n",
"#get list of all PTMs for each protein\n",
"entries_data_frame=obj_data_loader.add_ptm(entries_data_frame,force_single_entry=False)\n",
"#get list of all diseases for each protein\n",
"entries_data_frame=obj_data_loader.add_diseases(entries_data_frame,force_single_entry=False)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"#get substrates for the proteins\n",
"entries_data_frame=obj_data_loader.add_ligand(entries_data_frame,force_single_entry=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Save the progress to a csv file\n",
"\n",
"As we have added a lot of information to our dataframe, let's save our progress so that we can restart from this point in the future if required."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"entries_data_frame.to_csv(\"output/DB_Query_Results_With_Details.csv\",index=False)"
]
},
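{
"cell_type": "markdown",
"metadata": {},
"source": [
"One caveat when reloading this file: if a column holds Python lists (as the `pdb_id` and `ligand` columns appear to), the lists are written to csv as text, so `read_csv()` returns strings rather than lists. A minimal sketch of restoring them with `ast.literal_eval` follows. It assumes list-valued entries and uses a made-up one-row dataframe and an in-memory buffer instead of a real file."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import io\n",
"import pandas as pd\n",
"\n",
"# Toy dataframe with one list-valued column (hypothetical subset of the real data)\n",
"toy_df = pd.DataFrame({\"uniprot_id\": [\"P01111\"], \"pdb_id\": [[\"2N9C\", \"3CON\"]]})\n",
"\n",
"# Write to an in-memory buffer and read it back, as a stand-in for a csv file\n",
"buffer = io.StringIO()\n",
"toy_df.to_csv(buffer, index=False)\n",
"buffer.seek(0)\n",
"reloaded_df = pd.read_csv(buffer)\n",
"print(type(reloaded_df.loc[0, \"pdb_id\"]))  # a string, not a list\n",
"\n",
"# Convert the text back into lists; non-string entries such as NaN are left untouched\n",
"reloaded_df[\"pdb_id\"] = reloaded_df[\"pdb_id\"].apply(\n",
"    lambda value: ast.literal_eval(value) if isinstance(value, str) else value\n",
")\n",
"print(reloaded_df.loc[0, \"pdb_id\"])"
]
},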
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Some Statistics\n",
"\n",
"Let's look at some statistics for all the columns in the dataframe. Below are some examples of how you can do some fun things with the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" uniprot_id neff sequence_length \\\n",
"count 3 3.000000 3.0 \n",
"unique 3 NaN NaN \n",
"top P01111 NaN NaN \n",
"freq 1 NaN NaN \n",
"mean NaN 12.761333 189.0 \n",
"std NaN 0.117815 0.0 \n",
"min NaN 12.626000 189.0 \n",
"25% NaN 12.721500 189.0 \n",
"50% NaN 12.817000 189.0 \n",
"75% NaN 12.829000 189.0 \n",
"max NaN 12.841000 189.0 \n",
"\n",
" sequence organism_id \\\n",
"count 3 3 \n",
"unique 3 1 \n",
"top MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 \n",
"freq 1 3 \n",
"mean NaN NaN \n",
"std NaN NaN \n",
"min NaN NaN \n",
"25% NaN NaN \n",
"50% NaN NaN \n",
"75% NaN NaN \n",
"max NaN NaN \n",
"\n",
" helix_frac turn_frac strand_frac has_ptm has_pdb has_substrate \\\n",
"count 3.000000 3.000000 3.000000 3.0 3.0 3.0 \n",
"unique NaN NaN NaN NaN NaN NaN \n",
"top NaN NaN NaN NaN NaN NaN \n",
"freq NaN NaN NaN NaN NaN NaN \n",
"mean 0.347443 0.026455 0.305115 1.0 1.0 1.0 \n",
"std 0.029141 0.009164 0.069054 0.0 0.0 0.0 \n",
"min 0.317460 0.015873 0.227513 1.0 1.0 1.0 \n",
"25% 0.333333 0.023810 0.277778 1.0 1.0 1.0 \n",
"50% 0.349206 0.031746 0.328042 1.0 1.0 1.0 \n",
"75% 0.362434 0.031746 0.343915 1.0 1.0 1.0 \n",
"max 0.375661 0.031746 0.359788 1.0 1.0 1.0 \n",
"\n",
" matrix enzyme_commission cath_class \\\n",
"count 0 3 3 \n",
"unique 0 1 1 \n",
"top NaN [3.6.5.2] [3.40.50.300] \n",
"freq NaN 3 3 \n",
"mean NaN NaN NaN \n",
"std NaN NaN NaN \n",
"min NaN NaN NaN \n",
"25% NaN NaN NaN \n",
"50% NaN NaN NaN \n",
"75% NaN NaN NaN \n",
"max NaN NaN NaN \n",
"\n",
" pdb_id cofactor ptm \\\n",
"count 3 0 0 \n",
"unique 3 0 0 \n",
"top [2N9C, 3CON, 5UHV, 6E6H, 6MPP, 6ULI, 6ULK, 6UL... NaN NaN \n",
"freq 1 NaN NaN \n",
"mean NaN NaN NaN \n",
"std NaN NaN NaN \n",
"min NaN NaN NaN \n",
"25% NaN NaN NaN \n",
"50% NaN NaN NaN \n",
"75% NaN NaN NaN \n",
"max NaN NaN NaN \n",
"\n",
" disease_name \\\n",
"count 3 \n",
"unique 3 \n",
"top [Leukemia, juvenile myelomonocytic, Noonan syn... \n",
"freq 1 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN \n",
"\n",
" disease_id \\\n",
"count 3 \n",
"unique 3 \n",
"top [DI-01851, DI-02558, DI-03381, DI-04099, DI-04... \n",
"freq 1 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN \n",
"\n",
" ligand \n",
"count 3 \n",
"unique 1 \n",
"top [GTP-binding, Nucleotide-binding] \n",
"freq 3 \n",
"mean NaN \n",
"std NaN \n",
"min NaN \n",
"25% NaN \n",
"50% NaN \n",
"75% NaN \n",
"max NaN "
]
},
"execution_count": 33,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#include=\"all\" summarises every column, including non-numeric ones and those with 'nan' entries\n",
"entries_data_frame.describe(include=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Find unique ligands and count them:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"ligand\n",
"[GTP-binding, Nucleotide-binding] 3\n",
"Name: count, dtype: int64"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries_data_frame['ligand'].value_counts()"
]
},
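{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each `ligand` entry is a list, so the count above is per list rather than per individual ligand. A minimal sketch of counting single ligand keywords with the pandas `explode()` function follows. It assumes the column holds list-like entries, as the output above suggests, and the toy rows are made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Toy ligand column; each entry is a list of keywords and missing data is NaN\n",
"ligands = pd.Series(\n",
"    [[\"GTP-binding\", \"Nucleotide-binding\"], [\"Nucleotide-binding\"], float(\"nan\")],\n",
"    name=\"ligand\",\n",
")\n",
"\n",
"# explode() gives every list element its own row; dropna() removes the missing entries\n",
"ligand_counts = ligands.explode().dropna().value_counts()\n",
"print(ligand_counts)"
]
},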
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [],
"source": [
"# Filter the search results where we have a ligand interacting with the protein\n",
"df_results_with_ligand=entries_data_frame[entries_data_frame['ligand'].notna()]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Count the rows without 'nan' entries in a column:"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Number of proteins with pdb data\n",
"df_results_with_ligand[\"pdb_id\"].notna().sum()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Coevolution matrix analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Load the matrix\n",
"Let's get the coevolution matrices for the filtered dataframe `df_results_with_ligand` using the `load_matrices()` function of `obj_pycom`."
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"df_results_with_ligand=obj_pycom.load_matrices(df_results_with_ligand)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" uniprot_id neff sequence_length \\\n",
"0 P01111 12.817 189 \n",
"1 P01112 12.841 189 \n",
"2 P01116 12.626 189 \n",
"\n",
" sequence organism_id helix_frac \\\n",
"0 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.349206 \n",
"1 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.317460 \n",
"2 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.375661 \n",
"\n",
" turn_frac strand_frac has_ptm has_pdb has_substrate \\\n",
"0 0.015873 0.227513 1 1 1 \n",
"1 0.031746 0.359788 1 1 1 \n",
"2 0.031746 0.328042 1 1 1 \n",
"\n",
" matrix enzyme_commission \\\n",
"0 [[0.0, 0.5163763761520386, 0.3219393491744995,... [3.6.5.2] \n",
"1 [[0.0, 0.5560339689254761, 0.34521734714508057... [3.6.5.2] \n",
"2 [[0.0, 0.38467222452163696, 0.3104382753372192... [3.6.5.2] \n",
"\n",
" cath_class pdb_id cofactor \\\n",
"0 [3.40.50.300] [2N9C, 3CON, 5UHV, 6E6H, 6MPP, 6ULI, 6ULK, 6UL... NaN \n",
"1 [3.40.50.300] [121P, 1AA9, 1AGP, 1BKD, 1CLU, 1CRP, 1CRQ, 1CR... NaN \n",
"2 [3.40.50.300] [1D8D, 1D8E, 1KZO, 1KZP, 1N4P, 1N4Q, 1N4R, 1N4... NaN \n",
"\n",
" ptm disease_name \\\n",
"0 NaN [Leukemia, juvenile myelomonocytic, Noonan syn... \n",
"1 NaN [Costello syndrome, Congenital myopathy with e... \n",
"2 NaN [Leukemia, acute myelogenous, Leukemia, juveni... \n",
"\n",
" disease_id \\\n",
"0 [DI-01851, DI-02558, DI-03381, DI-04099, DI-04... \n",
"1 [DI-01437, DI-01411, DI-04532, DI-02612, DI-03... \n",
"2 [DI-01171, DI-01851, DI-02073, DI-02971, DI-03... \n",
"\n",
" ligand \n",
"0 [GTP-binding, Nucleotide-binding] \n",
"1 [GTP-binding, Nucleotide-binding] \n",
"2 [GTP-binding, Nucleotide-binding] "
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_results_with_ligand"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Normalise/Scale the matrix"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Scaled matrix ($S_{i}$)**: Each coevolution matrix ($C_{i}$) is scaled by its average $\\langle{C_{i}}\\rangle$, and all values < $\\langle{C_{i}}\\rangle$ are set to 0.\n",
"\n",
"**Normalised matrix ($N_{i}$)**: To compare scaled coevolution scores across multiple proteins, we can normalise the values of all matrices $S_{1}, \\dots, S_{n}$ by dividing them by $\\max(S_{1}, \\dots, S_{n})$.\n",
"\n",
"These operations can be performed using an object of the `CoMAnalysis` class."
]
},
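{
"cell_type": "markdown",
"metadata": {},
"source": [
"The two definitions above can be sketched in NumPy. This is only an illustration under stated assumptions and not the implementation used by `CoMAnalysis`: it assumes that the average is taken over all entries of a matrix, that entries below the average are set to 0, and that the remaining entries are divided by the average. The function names and toy matrices are made up."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def scale_matrix(com_matrix):\n",
"    \"\"\"Divide a matrix by its mean; entries below the mean become 0.\"\"\"\n",
"    mean_value = com_matrix.mean()  # assumption: mean over all entries\n",
"    return np.where(com_matrix < mean_value, 0.0, com_matrix / mean_value)\n",
"\n",
"def normalise_matrices(scaled_matrices):\n",
"    \"\"\"Divide every scaled matrix by the largest value found in any of them.\"\"\"\n",
"    global_max = max(matrix.max() for matrix in scaled_matrices)\n",
"    return [matrix / global_max for matrix in scaled_matrices]\n",
"\n",
"# Two toy symmetric matrices standing in for C_1 and C_2\n",
"toy_matrices = [\n",
"    np.array([[0.0, 0.5, 0.1], [0.5, 0.0, 0.3], [0.1, 0.3, 0.0]]),\n",
"    np.array([[0.0, 0.2, 0.8], [0.2, 0.0, 0.4], [0.8, 0.4, 0.0]]),\n",
"]\n",
"scaled = [scale_matrix(matrix) for matrix in toy_matrices]\n",
"normalised = normalise_matrices(scaled)\n",
"print(normalised[0])\n",
"print(normalised[1])"
]
},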
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"#initialise CoMAnalysis class object\n",
"obj_com_analysis=CoMAnalysis()"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"df_results_with_ligand_matrix=obj_com_analysis.scale_and_normalise_coevolution_matrices(df_results_with_ligand)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" uniprot_id neff sequence_length \\\n",
"0 P01111 12.817 189 \n",
"1 P01112 12.841 189 \n",
"2 P01116 12.626 189 \n",
"\n",
" sequence organism_id helix_frac \\\n",
"0 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.349206 \n",
"1 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.317460 \n",
"2 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.375661 \n",
"\n",
" turn_frac strand_frac has_ptm has_pdb ... enzyme_commission \\\n",
"0 0.015873 0.227513 1 1 ... [3.6.5.2] \n",
"1 0.031746 0.359788 1 1 ... [3.6.5.2] \n",
"2 0.031746 0.328042 1 1 ... [3.6.5.2] \n",
"\n",
" cath_class pdb_id cofactor \\\n",
"0 [3.40.50.300] [2N9C, 3CON, 5UHV, 6E6H, 6MPP, 6ULI, 6ULK, 6UL... NaN \n",
"1 [3.40.50.300] [121P, 1AA9, 1AGP, 1BKD, 1CLU, 1CRP, 1CRQ, 1CR... NaN \n",
"2 [3.40.50.300] [1D8D, 1D8E, 1KZO, 1KZP, 1N4P, 1N4Q, 1N4R, 1N4... NaN \n",
"\n",
" ptm disease_name \\\n",
"0 NaN [Leukemia, juvenile myelomonocytic, Noonan syn... \n",
"1 NaN [Costello syndrome, Congenital myopathy with e... \n",
"2 NaN [Leukemia, acute myelogenous, Leukemia, juveni... \n",
"\n",
" disease_id \\\n",
"0 [DI-01851, DI-02558, DI-03381, DI-04099, DI-04... \n",
"1 [DI-01437, DI-01411, DI-04532, DI-02612, DI-03... \n",
"2 [DI-01171, DI-01851, DI-02073, DI-02971, DI-03... \n",
"\n",
" ligand \\\n",
"0 [GTP-binding, Nucleotide-binding] \n",
"1 [GTP-binding, Nucleotide-binding] \n",
"2 [GTP-binding, Nucleotide-binding] \n",
"\n",
" matrix_S \\\n",
"0 [[0.0, 2.4990853333930714, 1.5580765172867141,... \n",
"1 [[0.0, 2.4841366766214805, 1.5422925960913767,... \n",
"2 [[0.0, 1.8285180440964264, 1.4756510916226846,... \n",
"\n",
" matrix_N \n",
"0 [[0.0, 0.18836677642207234, 0.1174389073708615... \n",
"1 [[0.0, 0.18724003206873668, 0.1162492055567066... \n",
"2 [[0.0, 0.1378232447662731, 0.1112261496390277,... \n",
"\n",
"[3 rows x 22 columns]"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_results_with_ligand_matrix.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `matrix_S` column contains the scaled coevolution matrix $S_{i}$, and the `matrix_N` column contains the normalised coevolution matrix $N_{i}$."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Plot the matrix and save it\n",
"\n",
"Let's plot the matrix for the first protein (index `0`) in the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.imshow(df_results_with_ligand_matrix.loc[0,'matrix_S'],cmap='Blues')\n",
"plt.colorbar()\n",
"file_name=\"output/%s_Scaled_Matrix.png\"%(df_results_with_ligand_matrix.loc[0,'uniprot_id'])\n",
"plt.savefig(file_name,dpi=300)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.imshow(df_results_with_ligand_matrix.loc[0,'matrix_N'],cmap='Blues')\n",
"plt.colorbar()\n",
"file_name=\"output/%s_Normalised_Matrix.png\"%(df_results_with_ligand_matrix.loc[0,'uniprot_id'])\n",
"plt.savefig(file_name,dpi=300)"
]
},
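{
"cell_type": "markdown",
"metadata": {},
"source": [
"The scaled and normalised matrices can also be drawn side by side with labelled axes, which makes them easier to compare. The cell below is a minimal sketch that uses a small random symmetric matrix as a stand-in so that it runs on its own. The names `toy_S` and `toy_N` and the max-based rescaling are assumptions for illustration only and are not PyCoM's definition of $N_{i}$. In this notebook you could pass `df_results_with_ligand_matrix.loc[0,'matrix_S']` and `df_results_with_ligand_matrix.loc[0,'matrix_N']` instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Toy stand-in for a coevolution matrix (symmetric, zero diagonal).\n",
"rng = np.random.default_rng(0)\n",
"raw = rng.random((30, 30))\n",
"toy_S = (raw + raw.T) / 2\n",
"np.fill_diagonal(toy_S, 0.0)\n",
"# Assumption for illustration only: rescale to [0, 1] by the maximum value.\n",
"toy_N = toy_S / toy_S.max()\n",
"\n",
"# One panel per matrix, each with its own colour bar and axis labels.\n",
"fig, axes = plt.subplots(1, 2, figsize=(10, 4))\n",
"for ax, mat, title in zip(axes, (toy_S, toy_N), ('matrix_S (toy)', 'matrix_N (toy)')):\n",
"    im = ax.imshow(mat, cmap='Blues')\n",
"    ax.set_title(title)\n",
"    ax.set_xlabel('Residue index')\n",
"    ax.set_ylabel('Residue index')\n",
"    fig.colorbar(im, ax=ax)\n",
"fig.tight_layout()\n",
"plt.show()"
]
},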
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Save the list of top coevolution pairs\n",
"\n",
"Residue pairs with scores close to 1 have the strongest evolutionary relationship, whereas pairs with scores close to 0 may not be covarying during evolution.\n",
"\n",
"The top-scoring residue pairs are sorted in descending order of score and saved to an ASCII file for further interpretation."
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Saved file : output/P01111_Pairs.txt\n",
"Saved file : output/P01112_Pairs.txt\n",
"Saved file : output/P01116_Pairs.txt\n"
]
}
],
"source": [
"obj_com_analysis.save_top_scoring_residue_pairs(df_results_with_ligand_matrix,data_folder=\"output\",matrix_type=\"matrix_N\",res_gap=5,percentile=95)"
]
},
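{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the idea behind the saved pair lists concrete, the cell below sketches one way to pick top-scoring residue pairs from a normalised matrix with NumPy. It is an illustration, not PyCoM's implementation: `top_pairs` is a hypothetical helper, and it assumes that `res_gap` is a minimum sequence separation between the two residues, that `percentile` is a cutoff over the remaining scores, and that residues are numbered from 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"def top_pairs(matrix, res_gap=5, percentile=95):\n",
"    # Hypothetical helper (not part of PyCoM): returns (residue_i, residue_j, score)\n",
"    # tuples sorted by descending score.\n",
"    m = np.asarray(matrix, dtype=float)\n",
"    # Upper-triangle pairs with j - i >= res_gap (assumed meaning of res_gap).\n",
"    i, j = np.triu_indices(m.shape[0], k=res_gap)\n",
"    scores = m[i, j]\n",
"    # Keep only pairs at or above the requested percentile of those scores.\n",
"    keep = scores >= np.percentile(scores, percentile)\n",
"    order = np.argsort(scores[keep])[::-1]\n",
"    # Report 1-based residue numbers (assumption).\n",
"    return [(int(a) + 1, int(b) + 1, float(s))\n",
"            for a, b, s in zip(i[keep][order], j[keep][order], scores[keep][order])]\n",
"\n",
"# Toy symmetric matrix: weak random background plus one strongly coupled pair\n",
"# planted at residues 3 and 12.\n",
"rng = np.random.default_rng(0)\n",
"raw = rng.random((15, 15)) * 0.2\n",
"toy = (raw + raw.T) / 2\n",
"toy[2, 11] = toy[11, 2] = 0.9\n",
"np.fill_diagonal(toy, 0.0)\n",
"\n",
"# The planted pair has the highest score, so it is listed first.\n",
"for res_i, res_j, score in top_pairs(toy):\n",
"    print(res_i, res_j, round(score, 3))\n",
"\n",
"# In this notebook, the same sketch could be applied to the first protein with:\n",
"# top_pairs(df_results_with_ligand_matrix.loc[0, 'matrix_N'])"
]
},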
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Help on UniProt Controlled Vocabulary \n",
"\n",
"For each biological feature category, UniProtKB/Swiss-Prot has a curated list of [keywords](https://www.uniprot.org/help/controlled_vocabulary). You can search using those keywords or their IDs; the helper functions below list the available terms:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function `get_cofactor_list()` from `pycom` returns the list of cofactors. You can search using either the `cofactorId` or the `cofactorName`."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" cofactorId cofactorName\n",
"0 CHEBI:597326 pyridoxal 5'-phosphate\n",
"1 CHEBI:18420 Mg(2+)\n",
"2 CHEBI:60240 a divalent metal cation\n",
"3 CHEBI:30413 heme\n",
"4 CHEBI:29105 Zn(2+)\n",
".. ... ...\n",
"109 CHEBI:61721 chlorophyll b\n",
"110 CHEBI:73095 divinyl chlorophyll a\n",
"111 CHEBI:73096 divinyl chlorophyll b\n",
"112 CHEBI:57453 (6S)-5,6,7,8-tetrahydrofolate\n",
"113 CHEBI:30402 tungstopterin\n",
"\n",
"[114 rows x 2 columns]"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# list of cofactors\n",
"cofactors = obj_pycom.get_cofactor_list()\n",
"cofactors"
]
},
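{
"cell_type": "markdown",
"metadata": {},
"source": [
"The tables returned by these helper functions appear to be pandas dataframes (judging by the output above), so the usual pandas string methods can be used to look up a term. The sketch below rebuilds three rows of the cofactor table by hand so that it runs on its own; in this notebook you would filter `cofactors` directly. The name `toy_cofactors` is hypothetical. Passing `regex=False` matters here because names such as `Mg(2+)` contain regular-expression metacharacters."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"# Three rows copied from the cofactor table shown above.\n",
"toy_cofactors = pd.DataFrame({\n",
"    'cofactorId': ['CHEBI:18420', 'CHEBI:30413', 'CHEBI:29105'],\n",
"    'cofactorName': ['Mg(2+)', 'heme', 'Zn(2+)'],\n",
"})\n",
"\n",
"# Literal (non-regex) substring match on the cofactor name.\n",
"mask = toy_cofactors['cofactorName'].str.contains('(2+)', regex=False)\n",
"hits = toy_cofactors[mask]\n",
"print(hits)\n",
"\n",
"# The matching IDs can then be collected as a plain Python list.\n",
"print(hits['cofactorId'].tolist())"
]
},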
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function `get_disease_list()` from `pycom` returns the list of diseases. You can search using either the `diseaseId` or the `diseaseName`."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" diseaseId diseaseName\n",
"0 DI-04420 Intellectual developmental disorder, autosomal...\n",
"1 DI-00085 Alzheimer disease 1\n",
"2 DI-00097 Cerebral amyloid angiopathy, APP-related\n",
"3 DI-00262 Chanarin-Dorfman syndrome\n",
"4 DI-01042 Spastic paraplegia 42, autosomal dominant\n",
"... ... ...\n",
"6039 DI-05800 Wieacker-Wolff syndrome, female-restricted\n",
"6040 DI-01041 Spastic paraplegia 33, autosomal dominant\n",
"6041 DI-05703 Neurodevelopmental disorder with dysmorphic fa...\n",
"6042 DI-06050 Intellectual developmental disorder, autosomal...\n",
"6043 DI-04662 Paget disease of bone 6\n",
"\n",
"[6044 rows x 2 columns]"
]
},
"execution_count": 46,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# list of diseases\n",
"diseases = obj_pycom.get_disease_list()\n",
"diseases\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Function `get_organism_list()` from `pycom` returns the list of organisms. You can search using the `organismId`, the `nameScientific`, the `nameCommon`, or any category in the `taxonomy`."
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" organismId nameScientific \\\n",
"0 561445 African swine fever virus (isolate Pig/Kenya/K... \n",
"1 10500 African swine fever virus (isolate Tick/Malawi... \n",
"2 561443 African swine fever virus (isolate Tick/South ... \n",
"3 561444 African swine fever virus (isolate Warthog/Nam... \n",
"4 10498 African swine fever virus (strain Badajoz 1971... \n",
"... ... ... \n",
"14316 31581 Rotavirus A (isolate RVA/Pig/Australia/TFR-41/... \n",
"14317 31579 Rotavirus A (isolate RVA/Pig/Australia/BEN144/... \n",
"14318 10918 Rotavirus A (strain RVA/Pig/Russia/K/1987) \n",
"14319 31580 Rotavirus A (isolate RVA/Pig/Australia/BMI-1/1... \n",
"14320 47664 Populus tremula x Populus tremuloides \n",
"\n",
" nameCommon taxonomy \n",
"0 ASFV :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... \n",
"1 ASFV :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... \n",
"2 ASFV :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... \n",
"3 ASFV :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... \n",
"4 Ba71V :Viruses:Varidnaviria:Bamfordvirae:Nucleocytov... \n",
"... ... ... \n",
"14316 RV-A :Viruses:Riboviria:Orthornavirae:Duplornaviric... \n",
"14317 RV-A :Viruses:Riboviria:Orthornavirae:Duplornaviric... \n",
"14318 RV-A :Viruses:Riboviria:Orthornavirae:Duplornaviric... \n",
"14319 RV-A :Viruses:Riboviria:Orthornavirae:Duplornaviric... \n",
"14320 Hybrid aspen :Eukaryota:Viridiplantae:Streptophyta:Embryoph... \n",
"\n",
"[14321 rows x 4 columns]"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# list of organisms\n",
"organisms = obj_pycom.get_organism_list()\n",
"organisms"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
" name\n",
"0 Acetylcholine receptor inhibiting toxin\n",
"1 Actin-binding\n",
"2 Activator\n",
"3 Acyltransferase\n",
"4 Allosteric enzyme\n",
".. ...\n",
"191 Viral short tail ejection system\n",
"192 Viral exotoxin\n",
"193 Chloride channel impairing toxin\n",
"194 Proton-gated sodium channel impairing toxin\n",
"195 Translocase\n",
"\n",
"[196 rows x 1 columns]"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
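],
"source": [
"# Full list of helper functions that return searchable terms for the other biological categories.\n",
"# Only the output of the last call (molecular functions) is displayed; run the others on their own to view them.\n",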
"obj_pycom.get_biological_process_list()\n",
"obj_pycom.get_cellular_component_list()\n",
"obj_pycom.get_developmental_stage_list()\n",
"obj_pycom.get_domain_list()\n",
"obj_pycom.get_ligand_list()\n",
"obj_pycom.get_ptm_list()\n",
"obj_pycom.get_molecular_function_list()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3-conda (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}