{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Getting started: PyCoM Local\n", "\n", "PyCoM, the Python interface for PyCoMDB, can run both locally and remotely.\n", "- Run locally for large-scale analyses. Requires 115 GB of disk space for the database.\n", "- Run remotely for small-scale analyses. No disk space required. Follow the tutorial `00_Getting_Started_Remotely.ipynb` instead.\n", "\n", "This is a crash course on how to use the local variant of the Python interface for the PyCoM database.\n", "1. [Installation](#installation)\n", "2. [Initialise PyCom object](#initialise-pycom-object)\n", "3. [Query the database](#query-the-database)\n", "4. [Supported query keywords](#supported-query-keywords)\n", "5. [Paginate the results](#paginate-the-results)\n", "6. [Load coevolution matrices](#load-coevolution-matrices)\n", "7. [Adding biological data to dataframe](#adding-biological-data-to-dataframe)\n", "\n", "More in-depth tutorials are available here: [https://pycom.brunel.ac.uk/tutorials.html](https://pycom.brunel.ac.uk/tutorials.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Installation\n", "\n", "Install the PyCom package:\n", "\n", "`pip3 install git+https://github.com/scdantu/pycom`\n", "\n", "Note: **Requires Python 3.8 or higher**\n", "\n", "Download the `pycom.db` and `pycom.mat` files from [https://pycom.brunel.ac.uk/downloads](https://pycom.brunel.ac.uk/downloads)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Initialise PyCom object\n", "\n", "Import the required classes and create a PyCom object:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "ExecuteTime": { "end_time": "2023-07-15T16:43:49.476196Z", "start_time": "2023-07-15T16:43:49.178452Z" } }, "outputs": [], "source": [ "from pycom import PyCom, ProteinParams\n", "\n", "pyc = PyCom(db_path='~/docs/pycom.db', mat_path='~/docs/pycom.mat')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Query the database\n", "\n", "Query the database by 
passing a dictionary of keywords:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2023-07-15T16:43:49.543273Z", "start_time": "2023-07-15T16:43:49.477430Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>uniprot_id</th><th>neff</th><th>sequence_length</th><th>sequence</th><th>organism_id</th><th>helix_frac</th><th>turn_frac</th><th>strand_frac</th><th>has_ptm</th><th>has_pdb</th><th>has_substrate</th><th>matrix</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>P01111</td><td>12.817</td><td>189</td><td>MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...</td><td>9606</td><td>0.349206</td><td>0.015873</td><td>0.227513</td><td>1</td><td>1</td><td>1</td><td>None</td></tr>\n",
"    <tr><th>1</th><td>P01112</td><td>12.841</td><td>189</td><td>MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...</td><td>9606</td><td>0.317460</td><td>0.031746</td><td>0.359788</td><td>1</td><td>1</td><td>1</td><td>None</td></tr>\n",
"    <tr><th>2</th><td>P01116</td><td>12.626</td><td>189</td><td>MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI...</td><td>9606</td><td>0.375661</td><td>0.031746</td><td>0.328042</td><td>1</td><td>1</td><td>1</td><td>None</td></tr>\n",
"    <tr><th>3</th><td>P62070</td><td>12.754</td><td>204</td><td>MAAAGWRDGSGQEKYRLVVVGGGGVGKSALTIQFIQSYFVTDYDPT...</td><td>9606</td><td>0.299020</td><td>0.019608</td><td>0.220588</td><td>1</td><td>1</td><td>1</td><td>None</td></tr>\n",
"    <tr><th>4</th><td>Q9UNW1</td><td>9.554</td><td>487</td><td>MLRAPGCLLRTSVAPAAALAAALLSSLARCSLLEPRDPVASSLSPY...</td><td>9606</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>0</td><td>0</td><td>1</td><td>None</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " uniprot_id neff sequence_length \\\n", "0 P01111 12.817 189 \n", "1 P01112 12.841 189 \n", "2 P01116 12.626 189 \n", "3 P62070 12.754 204 \n", "4 Q9UNW1 9.554 487 \n", "\n", " sequence organism_id helix_frac \\\n", "0 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.349206 \n", "1 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.317460 \n", "2 MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVI... 9606 0.375661 \n", "3 MAAAGWRDGSGQEKYRLVVVGGGGVGKSALTIQFIQSYFVTDYDPT... 9606 0.299020 \n", "4 MLRAPGCLLRTSVAPAAALAAALLSSLARCSLLEPRDPVASSLSPY... 9606 0.000000 \n", "\n", " turn_frac strand_frac has_ptm has_pdb has_substrate matrix \n", "0 0.015873 0.227513 1 1 1 None \n", "1 0.031746 0.359788 1 1 1 None \n", "2 0.031746 0.328042 1 1 1 None \n", "3 0.019608 0.220588 1 1 1 None \n", "4 0.000000 0.000000 0 0 1 None " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "entries = pyc.find({\n", " ProteinParams.ENZYME: '3.*.*.*',\n", " ProteinParams.DISEASE: 'cancer', # string search, case-insensitive\n", "})\n", "\n", "entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Alternatively, query the database by passing keyword arguments:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "ExecuteTime": { "end_time": "2023-07-15T16:43:49.742472Z", "start_time": "2023-07-15T16:43:49.544060Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>uniprot_id</th><th>neff</th><th>sequence_length</th><th>sequence</th><th>organism_id</th><th>helix_frac</th><th>turn_frac</th><th>strand_frac</th><th>has_ptm</th><th>has_pdb</th><th>has_substrate</th><th>matrix</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>P11310</td><td>9.930</td><td>421</td><td>MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK...</td><td>9606</td><td>0.517815</td><td>0.016627</td><td>0.180523</td><td>1</td><td>1</td><td>1</td><td>None</td></tr>\n",
"    <tr><th>1</th><td>Q658P3</td><td>9.677</td><td>488</td><td>MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR...</td><td>9606</td><td>0.157787</td><td>0.000000</td><td>0.086066</td><td>1</td><td>1</td><td>0</td><td>None</td></tr>\n",
"    <tr><th>2</th><td>Q16795</td><td>10.997</td><td>377</td><td>MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG...</td><td>9606</td><td>0.363395</td><td>0.037135</td><td>0.124668</td><td>1</td><td>1</td><td>0</td><td>None</td></tr>\n",
"    <tr><th>3</th><td>O95299</td><td>9.244</td><td>355</td><td>MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG...</td><td>9606</td><td>0.000000</td><td>0.000000</td><td>0.000000</td><td>1</td><td>1</td><td>0</td><td>None</td></tr>\n",
"    <tr><th>4</th><td>P13804</td><td>8.627</td><td>333</td><td>MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR...</td><td>9606</td><td>0.300300</td><td>0.027027</td><td>0.333333</td><td>1</td><td>1</td><td>0</td><td>None</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " uniprot_id neff sequence_length \\\n", "0 P11310 9.930 421 \n", "1 Q658P3 9.677 488 \n", "2 Q16795 10.997 377 \n", "3 O95299 9.244 355 \n", "4 P13804 8.627 333 \n", "\n", " sequence organism_id helix_frac \\\n", "0 MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... 9606 0.517815 \n", "1 MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... 9606 0.157787 \n", "2 MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... 9606 0.363395 \n", "3 MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... 9606 0.000000 \n", "4 MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... 9606 0.300300 \n", "\n", " turn_frac strand_frac has_ptm has_pdb has_substrate matrix \n", "0 0.016627 0.180523 1 1 1 None \n", "1 0.000000 0.086066 1 1 0 None \n", "2 0.037135 0.124668 1 1 0 None \n", "3 0.000000 0.000000 1 1 0 None \n", "4 0.027027 0.333333 1 1 0 None " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "entries = pyc.find(\n", " cofactor='FAD', # string search, case-insensitive\n", " has_ptm=True,\n", " has_disease=True,\n", ")\n", "\n", "entries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Supported query keywords\n", "* `uniprot_id`: The UniProt ID of the protein.\n", "* `sequence`: The amino acid sequence of protein to search for. (full match)\n", "* `min_length` / `max_length`: Min/Max number of residues in the protein.\n", "* `min_helix` / `max_helix`: Min/Max percentage of helical structure in the protein.\n", "* `min_turn` / `max_turn`: Min/Max percentage of turn structure in the protein.\n", "* `min_strand` / `max_strand`: Min/Max percentage of beta strand structure in the protein.\n", "* `organism`: Taxonomic name of the genus / species of the protein. (case-insensitive)\n", " * Species name or any parent taxonomic level can be used. 
(use `pyc.get_organism_list()` for the full list)\n", " * Surround with `:` to get precise results\n", " * `:homo:` returns `Homo sapiens` and `Homo sapiens neanderthalensis`\n", " * `homo` also returns **homo**eomma, t**homo**mys, and *hundreds* of others\n", "* `organism_id`: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use `organism` instead)\n", "* `cath`: CATH classification of the protein. (`3.40.50.360` or `3.40.*.*` or `3.*`)\n", "* `enzyme`: Enzyme Commission number of the protein. (`1.3.1.3` or `1.3.*.*` or `1.*`)\n", "* `has_substrate`: Whether the protein has a known substrate. (`True`/`False`)\n", "* `has_ptm`: Whether the protein has a known post-translational modification. (`True`/`False`)\n", "* `has_pdb`: Whether the protein has a known PDB structure. (`True`/`False`)\n", "* `disease`: The disease associated with the protein. (name of disease, case-insensitive, e.g. `cancer`)\n", " * Use `pyc.get_disease_list()` for the full list.\n", " * `cancer` matches `Ovarian cancer`, `Lung cancer`, ...\n", "* `disease_id`: The ID of the disease associated with the protein. (e.g. `DI-02205`, use `pyc.get_disease_list()` for the full list)\n", "* `has_disease`: Whether the protein is associated with a disease. (`True`/`False`)\n", "* `cofactor`: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g. `Zn(2+)`)\n", "* `cofactor_id`: The ID of the cofactor associated with the protein. (e.g. `CHEBI:00001`, use `pyc.get_cofactor_list()` for the full list)\n", "* `biological_process`: Biological process associated with the protein. (e.g. `antiviral defense`, use `pyc.get_biological_process_list()` for the full list)\n", "* `cellular_component`: Cellular component associated with the protein. (e.g. `nucleus`, use `pyc.get_cellular_component_list()` for the full list)\n", "* `domain`: Domain associated with the protein. (e.g. `zinc-finger`, use `pyc.get_domain_list()` for the full list)\n", "* `ligand`: Ligand associated with the protein. 
(e.g. `zinc`, use `pyc.get_ligand_list()` for the full list)\n", "* `molecular_function`: Molecular function associated with the protein. (e.g. `antioxidant activity`, use `pyc.get_molecular_function_list()` for the full list)\n", "* `ptm`: Post-translational modification associated with the protein. (e.g. `phosphoprotein`, use `pyc.get_ptm_list()` for the full list)\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Paginate the results\n", "\n", "Coevolution matrices can take up a lot of memory, so it is recommended to paginate the results before loading them.\n", "\n", "Here is an example of making a large query, then paginating the results:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "ExecuteTime": { "end_time": "2023-07-15T16:43:49.766123Z", "start_time": "2023-07-15T16:43:49.742886Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 2958 entries with length <= 20\n", "Found 100 entries on page 1\n" ] } ], "source": [ "entries = pyc.find(max_length=20)\n", "print(f'Found {len(entries)} entries with length <= 20')\n", "\n", "page = pyc.paginate(entries, page=1, per_page=100) # get the first page of results (per_page defaults to 100)\n", "print(f'Found {len(page)} entries on page 1')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Load coevolution matrices\n", "\n", "Now the coevolution matrices can be loaded for the paginated results.\n", "\n", "This loads them into the `matrix` column of the dataframe." 
] }, { "cell_type": "code", "execution_count": 5, "metadata": { "ExecuteTime": { "end_time": "2023-07-15T16:43:49.885056Z", "start_time": "2023-07-15T16:43:49.782638Z" } }, "outputs": [ { "data": { "text/plain": [ "array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,\n", " 0.00000000e+00],\n", " [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,\n", " 4.54485416e-07],\n", " [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,\n", " 2.98023224e-07],\n", " [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,\n", " 2.23517418e-07],\n", " [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,\n", " 0.00000000e+00]])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pyc.load_matrices(page)\n", "\n", "page.iloc[0].matrix # show the coevolution matrix for the first entry" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, the matrices are loaded as a `numpy.ndarray`. 
Different formats can be specified.\n", "\n", "Here is an example of the matrices being loaded as Pandas DataFrames and 2D lists:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "ExecuteTime": { "end_time": "2024-02-01T14:27:07.547109Z", "start_time": "2024-02-01T14:27:07.384618Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Numpy: <class 'numpy.ndarray'>\n", "Pandas: <class 'pandas.core.frame.DataFrame'>\n", "List: <class 'list'>\n" ] } ], "source": [ "from pycom import MatrixFormat\n", "\n", "resultsNumpy = pyc.load_matrices(page, mat_format=MatrixFormat.NUMPY) # default\n", "resultsPandas = pyc.load_matrices(page, mat_format=MatrixFormat.PANDAS)\n", "resultsList = pyc.load_matrices(page, mat_format=MatrixFormat.LIST)\n", "\n", "print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')\n", "print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')\n", "print(f'List: {type(resultsList.iloc[0].matrix)}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Adding biological data to dataframe\n", "\n", "**This is supported in the local variant only!**\n", "\n", "PyCom contains a lot of additional protein annotation data. This is not loaded by default, but can be added if needed.\n", "\n", "The lists of cofactors, diseases, and organisms can be loaded by calling:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "ExecuteTime": { "end_time": "2023-07-15T16:43:49.927594Z", "start_time": "2023-07-15T16:43:49.911153Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>cofactorId</th><th>cofactorName</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>CHEBI:597326</td><td>pyridoxal 5'-phosphate</td></tr>\n",
"    <tr><th>1</th><td>CHEBI:18420</td><td>Mg(2+)</td></tr>\n",
"    <tr><th>2</th><td>CHEBI:60240</td><td>a divalent metal cation</td></tr>\n",
"    <tr><th>3</th><td>CHEBI:30413</td><td>heme</td></tr>\n",
"    <tr><th>4</th><td>CHEBI:29105</td><td>Zn(2+)</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>109</th><td>CHEBI:61721</td><td>chlorophyll b</td></tr>\n",
"    <tr><th>110</th><td>CHEBI:73095</td><td>divinyl chlorophyll a</td></tr>\n",
"    <tr><th>111</th><td>CHEBI:73096</td><td>divinyl chlorophyll b</td></tr>\n",
"    <tr><th>112</th><td>CHEBI:57453</td><td>(6S)-5,6,7,8-tetrahydrofolate</td></tr>\n",
"    <tr><th>113</th><td>CHEBI:30402</td><td>tungstopterin</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>114 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " cofactorId cofactorName\n", "0 CHEBI:597326 pyridoxal 5'-phosphate\n", "1 CHEBI:18420 Mg(2+)\n", "2 CHEBI:60240 a divalent metal cation\n", "3 CHEBI:30413 heme\n", "4 CHEBI:29105 Zn(2+)\n", ".. ... ...\n", "109 CHEBI:61721 chlorophyll b\n", "110 CHEBI:73095 divinyl chlorophyll a\n", "111 CHEBI:73096 divinyl chlorophyll b\n", "112 CHEBI:57453 (6S)-5,6,7,8-tetrahydrofolate\n", "113 CHEBI:30402 tungstopterin\n", "\n", "[114 rows x 2 columns]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cofactors = pyc.get_cofactor_list()\n", "diseases = pyc.get_disease_list()\n", "organisms = pyc.get_organism_list()\n", "\n", "cofactors" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "ExecuteTime": { "end_time": "2024-02-01T14:47:02.321024Z", "start_time": "2024-02-01T14:46:47.586969Z" } }, "outputs": [ { "data": { "text/plain": [ "uniprot_id P15291\n", "neff 7.854\n", "sequence_length 398\n", "sequence MRLREPLLSGSAAMPGASLQRACRLLVAVCALHLGVTLVYYLAGRD...\n", "organism_id 9606\n", "helix_frac 0.198492\n", "turn_frac 0.030151\n", "strand_frac 0.163317\n", "has_ptm 1\n", "has_pdb 1\n", "has_substrate 1\n", "matrix None\n", "cofactor_x [Mn(2+)]\n", "biological_process [Lipid metabolism]\n", "cath_class 3.90.550.10\n", "coding_sequence_diversity [Alternative initiation]\n", "cofactor_y [Mn(2+)]\n", "developmental_stage NaN\n", "disease_name [Congenital disorder of glycosylation 2D]\n", "disease_id [DI-00349]\n", "enzyme_commission 2.4.1.-\n", "ligand [Manganese, Metal-binding]\n", "molecular_function [Glycosyltransferase, Transferase]\n", "organism_name Homo sapiens\n", "taxonomy [Eukaryota, Metazoa, Chordata, Craniata, Verte...\n", "pdb_id [2AE7, 2AEC, 2AES, 2AGD, 2AH9, 2FY7, 2FYA, 2FY...\n", "cellular_component [Cell membrane, Cell projection, Golgi apparat...\n", "domain [Signal-anchor, Transmembrane, Transmembrane h...\n", "ptm NaN\n", "substrate [D-glucose + UDP-alpha-D-galactose = H(+) + la...\n", 
"Name: 0, dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "loader = pyc.get_data_loader()\n", "\n", "entries = pyc.find(uniprot_id='P15291')\n", "\n", "# Add the protein's cofactors to the dataframe\n", "entries = loader.add_cofactors(entries)\n", "\n", "# The following functions are supported, data taken directly from UniProt\n", "entries = loader.add_biological_processes(entries)\n", "entries = loader.add_cath_class(entries) # Protein's CATH\n", "entries = loader.add_coding_sequence_diversity(entries) # https://www.uniprot.org/help/keywords\n", "entries = loader.add_cofactors(entries) # Cofactors\n", "entries = loader.add_developmental_stage(entries)\n", "entries = loader.add_diseases(entries) # The diseases associated with the protein\n", "entries = loader.add_enzyme_commission(entries) # Protein's EC\n", "entries = loader.add_ligand(entries) # Ligands\n", "entries = loader.add_molecular_function(entries)\n", "entries = loader.add_organism_name(entries)\n", "entries = loader.add_organism_taxonomy(entries)\n", "entries = loader.add_pdbs(entries) # Experimental PDB IDs of protein\n", "entries = loader.add_protein_cellular_component(entries)\n", "entries = loader.add_protein_domain(entries)\n", "entries = loader.add_ptm(entries) # Protein's Post-translational modifications\n", "entries = loader.add_substrates(entries) # Protein's substrates\n", "\n", "entries.iloc[0]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 1 }