{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Getting started: PyCoM Remote (online)\n",
"\n",
"Working with PyCom remotely is encouraged only with smaller datasets as it does not support loading of the biological features and it is slow in comparison.\n",
"\n",
"1. [Differences from local setup](#differences-from-local-setup)\n",
"2. [Initialize the PyCom class](#initialize-the-pycom-class)\n",
"4. [Query the database](#query-the-database)\n",
"5. [Supported query keywords](#supported-query-keywords)\n",
"5. [Pagination](#pagination)\n",
"6. [Load coevolution matrices](#load-coevolution-matrices)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Differences from local setup\n",
"\n",
"There are slight differences in the API when using PyCom remotely.\n",
"\n",
"* Querying the PyCoMdb returns a paginated dataframe with max 100 entries per page, or 10 if loading matrices.\n",
"* The `pyc.paginate` and `pyc.load_matrices` methods are **not** available\n",
" * `pyc.find(..., page=1, per_page=100)` is used for pagination\n",
" * `pyc.find(..., matrix=True)` is used for loading matrices\n",
"* The helper methods for loading additional biological data into the dataframe (`pyc.data.*`) are **not yet** available."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize the PyCom class\n",
"Import the PyCom class and initialize it with `remote=True`:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.439751Z",
"start_time": "2024-02-01T15:42:28.898165Z"
}
},
"outputs": [],
"source": [
"from pycom import PyCom, ProteinParams\n",
"import pandas as pd\n",
"\n",
"pyc = PyCom(remote=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Query the database\n",
"\n",
"Query the database by passing a dictionary of conditions:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.613424Z",
"start_time": "2024-02-01T15:42:29.454493Z"
}
},
"outputs": [
{
"data": {
"text/plain": " uniprot_id neff sequence_length \\\n0 O43464 8.095 458 \n1 O60260 9.579 465 \n2 O75787 8.590 350 \n3 P09936 7.605 223 \n4 P31930 10.682 480 \n\n sequence organism_id helix_frac \\\n0 MAAPRAGRGAGWSLRAWRALGGIRWGRRPRLTPDLRALLTSGTSDP... 9606 0.122271 \n1 MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFA... 9606 0.174194 \n2 MAVFVVLLALVAGVLGNEFSILKSPGSVVFRNGNWPIPGERIPDVA... 9606 0.085714 \n3 MQLKPMEINPEMLNKVLSRLGVAGQWRFVDVLGLEEESLGSVPAPA... 9606 0.390135 \n4 MAASVVCRAATAGAQVLLRARRSPALLRTPALRSTATFAQALQFVP... 9606 0.410417 \n\n turn_frac strand_frac has_ptm has_pdb has_substrate \n0 0.034934 0.286026 1 1 1 \n1 0.073118 0.273118 1 1 1 \n2 0.011429 0.000000 1 1 0 \n3 0.053812 0.242152 1 1 1 \n4 0.020833 0.122917 0 1 0 ",
"text/html": "
\n\n
\n \n \n | \n uniprot_id | \n neff | \n sequence_length | \n sequence | \n organism_id | \n helix_frac | \n turn_frac | \n strand_frac | \n has_ptm | \n has_pdb | \n has_substrate | \n
\n \n \n \n 0 | \n O43464 | \n 8.095 | \n 458 | \n MAAPRAGRGAGWSLRAWRALGGIRWGRRPRLTPDLRALLTSGTSDP... | \n 9606 | \n 0.122271 | \n 0.034934 | \n 0.286026 | \n 1 | \n 1 | \n 1 | \n
\n \n 1 | \n O60260 | \n 9.579 | \n 465 | \n MIVFVRFNSSHGFPVEVDSDTSIFQLKEVVAKRQGVPADQLRVIFA... | \n 9606 | \n 0.174194 | \n 0.073118 | \n 0.273118 | \n 1 | \n 1 | \n 1 | \n
\n \n 2 | \n O75787 | \n 8.590 | \n 350 | \n MAVFVVLLALVAGVLGNEFSILKSPGSVVFRNGNWPIPGERIPDVA... | \n 9606 | \n 0.085714 | \n 0.011429 | \n 0.000000 | \n 1 | \n 1 | \n 0 | \n
\n \n 3 | \n P09936 | \n 7.605 | \n 223 | \n MQLKPMEINPEMLNKVLSRLGVAGQWRFVDVLGLEEESLGSVPAPA... | \n 9606 | \n 0.390135 | \n 0.053812 | \n 0.242152 | \n 1 | \n 1 | \n 1 | \n
\n \n 4 | \n P31930 | \n 10.682 | \n 480 | \n MAASVVCRAATAGAQVLLRARRSPALLRTPALRSTATFAQALQFVP... | \n 9606 | \n 0.410417 | \n 0.020833 | \n 0.122917 | \n 0 | \n 1 | \n 0 | \n
\n \n
\n
"
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries = pyc.find({\n",
" ProteinParams.DISEASE: 'parkinson', # string search, case-insensitive\n",
"}, page=1)\n",
"\n",
"entries.head()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.618536Z",
"start_time": "2024-02-01T15:42:29.614729Z"
}
},
"outputs": [
{
"data": {
"text/plain": "{'page': 1, 'total_pages': 1, 'total_results': 9}"
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# tells us total number of entries in the search results\n",
"entries.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, query the database by passing keyword arguments:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.775073Z",
"start_time": "2024-02-01T15:42:29.620458Z"
}
},
"outputs": [
{
"data": {
"text/plain": " uniprot_id neff sequence_length \\\n0 P11310 9.930 421 \n1 Q658P3 9.677 488 \n2 Q16795 10.997 377 \n3 O95299 9.244 355 \n4 P13804 8.627 333 \n\n sequence organism_id helix_frac \\\n0 MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... 9606 0.517815 \n1 MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... 9606 0.157787 \n2 MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... 9606 0.363395 \n3 MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... 9606 0.000000 \n4 MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... 9606 0.300300 \n\n turn_frac strand_frac has_ptm has_pdb has_substrate \n0 0.016627 0.180523 1 1 1 \n1 0.000000 0.086066 1 1 0 \n2 0.037135 0.124668 1 1 0 \n3 0.000000 0.000000 1 1 0 \n4 0.027027 0.333333 1 1 0 ",
"text/html": "\n\n
\n \n \n | \n uniprot_id | \n neff | \n sequence_length | \n sequence | \n organism_id | \n helix_frac | \n turn_frac | \n strand_frac | \n has_ptm | \n has_pdb | \n has_substrate | \n
\n \n \n \n 0 | \n P11310 | \n 9.930 | \n 421 | \n MAAGFGRCCRVLRSISRFHWRSQHTKANRQREPGLGFSFEFTEQQK... | \n 9606 | \n 0.517815 | \n 0.016627 | \n 0.180523 | \n 1 | \n 1 | \n 1 | \n
\n \n 1 | \n Q658P3 | \n 9.677 | \n 488 | \n MPEEMDKPLISLHLVDSDSSLAKVPDEAPKVGILGSGDFARSLATR... | \n 9606 | \n 0.157787 | \n 0.000000 | \n 0.086066 | \n 1 | \n 1 | \n 0 | \n
\n \n 2 | \n Q16795 | \n 10.997 | \n 377 | \n MAAAAQSRVVRVLSMSRSAITAIATSVCHGPPCRQLHHALMPHGKG... | \n 9606 | \n 0.363395 | \n 0.037135 | \n 0.124668 | \n 1 | \n 1 | \n 0 | \n
\n \n 3 | \n O95299 | \n 9.244 | \n 355 | \n MALRLLKLAATSASARVVAAGAQRVRGIHSSVQCKLRYGMWHFLLG... | \n 9606 | \n 0.000000 | \n 0.000000 | \n 0.000000 | \n 1 | \n 1 | \n 0 | \n
\n \n 4 | \n P13804 | \n 8.627 | \n 333 | \n MFRAAAPGQLRRAASLLRFQSTLVIAEHANDSLAPITLNTITAATR... | \n 9606 | \n 0.300300 | \n 0.027027 | \n 0.333333 | \n 1 | \n 1 | \n 0 | \n
\n \n
\n
"
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries = pyc.find(\n",
" cofactor='FAD', # string search, case-insensitive\n",
" has_ptm=True,\n",
" has_disease=True,\n",
" page=1\n",
")\n",
"\n",
"entries"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.778237Z",
"start_time": "2024-02-01T15:42:29.776210Z"
}
},
"outputs": [
{
"data": {
"text/plain": "{'page': 1, 'total_pages': 1, 'total_results': 5}"
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"entries.attrs"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Supported query keywords\n",
"* `uniprot_id`: The UniProt ID of the protein.\n",
"* `sequence`: The amino acid sequence of protein to search for. (full match)\n",
"* `min_length` / `max_length`: Min/Max number of residues in the protein.\n",
"* `min_helix` / `max_helix`: Min/Max percentage of helical structure in the protein.\n",
"* `min_turn` / `max_turn`: Min/Max percentage of turn structure in the protein.\n",
"* `min_strand` / `max_strand`: Min/Max percentage of beta strand structure in the protein.\n",
"* `organism`: Taxonomic name of the genus / species of the protein. (case-insensitive)\n",
" * Species name or any parent taxonomic level can be used. (`pyc.get_organism_list()` for full list)\n",
" * Surround with `:` to get precise results\n",
" * `:homo:` returns `Homo sapiens` & `Homo sapiens neanderthalensis`)\n",
" * `homo` also returns **homo**eomma, t**homo**mys, and *hundreds* others\n",
"* `organism_id`: Precise NCBI Taxonomy ID of the species of the protein. (prefer to use `organism` instead)\n",
"* `cath`: CATH classification of the protein (`3.40.50.360` or `3.40.*.*` or `3.*`).\n",
"* `enzyme`: Enzyme Commission number of the protein. (`1.3.1.3` or `1.3.*.*` or `1.*`).\n",
"* `has_substrate`: Whether the protein has a known substrate. (`True`/`False`)\n",
"* `has_ptm`: Whether the protein has a known post-translational modification. (`True`/`False`)\n",
"* `has_pdb`: Whether the protein has a known PDB structure. (`True`/`False`)\n",
"* `disease`: The disease associated with the protein. (name of disease, case-insensitive, e.g `cancer`)\n",
" * Use `pyc.get_disease_list()` for full list.\n",
" * `cancer` searches for `Ovarian cancer`, `Lung cancer`, ...\n",
"* `disease_id`: The ID of the disease associated with the protein. (`DI-02205`, get_disease_list()\n",
"* `has_disease`: Whether the protein is associated with a disease. (`True`/`False`)\n",
"* `cofactor`: The cofactor associated with the protein. (name of cofactor, case-insensitive, e.g `Zn(2+)`])\n",
"* `cofactor_id`: The ID of the cofactor associated with the protein. (`CHEBI:00001`, get_cofactor_list())\n",
"* `biological_process`: Biological process associated with the protein. (e.g `antiviral defense`, use `pyc.get_biological_process_list()` for full list)\n",
"* `cellular_component`: Cellular component associated with the protein. (e.g `nucleus`, use `pyc.get_cellular_component_list()` for full list\n",
"* `domain`: Domain associated with the protein. (e.g `zinc-finger`, use `pyc.get_domain_list()` for full list)\n",
"* `ligand`: Ligand associated with the protein. (e.g `zinc`, use `pyc.get_ligand_list()` for full list\n",
"* `molecular_function`: Molecular function associated with the protein. (e.g `antioxidant activity`, use `pyc.get_molecular_function_list()` for full list\n",
"* `ptm`: Post-translational modification associated with the protein. (e.g `phosphoprotein`, use `pyc.get_ptm_list()` for full list\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pagination\n",
"\n",
"Remote PyCom automatically paginates results. The default page size is 10 entries, but can be changed with `pyc.find(..., per_page=100)`.\n",
"The maximum page size is 100 entries, or 10 if loading matrices.\n",
"\n",
"When loading more entries than the page size, just set the `page` parameter to the page number:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.877192Z",
"start_time": "2024-02-01T15:42:29.779019Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Page 1: 100 entries, Page 2: 100 entries, Total: 200 entries\n"
]
}
],
"source": [
"page1 = pyc.find(max_length=20, page=1, per_page=100) # get first 100 entries with length <= 20\n",
"page2 = pyc.find(max_length=20, page=2, per_page=100) # get entries 101-200 with length <= 20\n",
"\n",
"# pages can be concatenated\n",
"\n",
"pages = pd.concat([page1, page2], ignore_index=True)\n",
"\n",
"print(f'Page 1: {len(page1)} entries, Page 2: {len(page2)} entries, Total: {len(pages)} entries')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Load coevolution matrices\n",
"\n",
"Coevolution matrices can be loaded by setting the `matrix` param: `pyc.find(..., matrix=True)`.\n",
"\n",
"This loads them into the `matrix` column of the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:29.941035Z",
"start_time": "2024-02-01T15:42:29.877211Z"
}
},
"outputs": [
{
"data": {
"text/plain": "array([[0.00000000e+00, 2.16066837e-07, 1.56462193e-07, 0.00000000e+00,\n 0.00000000e+00],\n [2.16066837e-07, 0.00000000e+00, 4.61935997e-07, 4.54485416e-07,\n 4.54485416e-07],\n [1.56462193e-07, 4.61935997e-07, 0.00000000e+00, 2.98023224e-07,\n 2.98023224e-07],\n [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 0.00000000e+00,\n 2.23517418e-07],\n [0.00000000e+00, 4.54485416e-07, 2.98023224e-07, 2.23517418e-07,\n 0.00000000e+00]])"
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"results = pyc.find(max_length=20, page=1, matrix=True)\n",
"\n",
"results.iloc[0].matrix # show the coevolution matrix for the first entry"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By default, the matrices are loaded as a `numpy.ndarray`. Different formats can be specified.\n",
"\n",
"Here is an example of the matrices being loaded as Pandas DataFrames and 2d-lists:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2024-02-01T15:42:46.470353Z",
"start_time": "2024-02-01T15:42:46.293812Z"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Numpy: \n",
"Pandas: \n",
"List: \n"
]
}
],
"source": [
"from pycom import MatrixFormat\n",
"\n",
"resultsNumpy = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.NUMPY)\n",
"resultsPandas = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.PANDAS)\n",
"resultsList = pyc.find(max_length=20, page=1, matrix=True, mat_format=MatrixFormat.LIST)\n",
"\n",
"print(f'Numpy: {type(resultsNumpy.iloc[0].matrix)}')\n",
"print(f'Pandas: {type(resultsPandas.iloc[0].matrix)}')\n",
"print(f'List: {type(resultsList.iloc[0].matrix)}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3-conda (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}