{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "be136a57-32fc-4fb3-acb5-c94ec734adfc",
   "metadata": {},
   "source": [
    "# Identify the specificity of an ABC transporter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12ef9647-e124-4586-8653-c8ace9345cd9",
   "metadata": {},
   "source": [
    "In this example, we show how to use PyOpal to align a putative ABC transporter to reference sequences from the [Transporter Classification Database](https://tcdb.org/download.php) to identify their substrate specificity.\n",
    "\n",
    "ATP-binding cassette (ABC) form a superfamily of transporter proteins found in all domains of life. Bacterial ABC transporters are involved in the translocation of substrates such nutrients, ions, or toxins. In bacteria, they are expressed as multi-protein system composed of 4 structural units:\n",
    "\n",
    "![abc.jpg](https://ars.els-cdn.com/content/image/1-s2.0-S0882401022003473-gr1_lrg.jpg)\n",
    "\n",
    "*Figure from* [Akhtar & Turner. (2022)](https://www.sciencedirect.com/science/article/pii/S0882401022003473).\n",
    "\n",
    "The transmembrane domain (TMD) exhibits high variability to accomodate for specific substrate. The TCDB contains proteins belonging to the ABC superfamily and lists their substrate specificity. Using Opal, we can find the substrate specificity for a putative ABC transporter by aligning against the database.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8bae9451-9ada-4ba9-a045-a9044000e1d4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pyopal\n",
    "pyopal.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "15c5c176-39db-4463-b9fe-380c08b19a53",
   "metadata": {},
   "source": [
    "## Building an ABC transporter database\n",
    "\n",
    "Let's start by building a database of ABC transporters, which corresponds to the [3.A.1](https://tcdb.org/search/result.php?tc=3.A.1) superfamily in the TCDB. We can download the protein sequences directly from the [Download](https://tcdb.org/download.php) page of the TCDB, and use [Biopython](https://biopython.org) to parse the records. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4174791e-3772-410e-b36c-c417bd69f5d5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import io\n",
    "from urllib.request import urlopen\n",
    "\n",
    "import Bio.SeqIO\n",
    "\n",
    "records = []\n",
    "with urlopen(\"https://tcdb.org/public/tcdb\") as res:\n",
    "    reader = io.TextIOWrapper(res)\n",
    "    for record in Bio.SeqIO.parse(reader, \"fasta\"):\n",
    "        tcid = record.id.split(\"|\")[-1]\n",
    "        if tcid.startswith(\"3.A.1\"):\n",
    "            records.append(record)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0fa46e56-e00e-4acb-9a87-7284d56013cb",
   "metadata": {},
   "source": [
    "At this stage, we have a list of `SeqRecord` objects. We can create a PyOpal database from these by extracting the raw sequence from each record:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bf77d791-d047-4eb1-baab-c72e29d7caef",
   "metadata": {},
   "outputs": [],
   "source": [
    "database = pyopal.Database([bytes(record.seq) for record in records])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "48c766c9-7928-417c-bbf7-7e6330e53e0e",
   "metadata": {},
   "source": [
    "`database` now contains the sequences encoded and stored to allow fast querying with PyOpal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6ba00f9-ca1c-4a6b-8cc9-cd40b36d7500",
   "metadata": {},
   "outputs": [],
   "source": [
    "import statistics \n",
    "\n",
    "print(\"sequences:\", len(database))\n",
    "print(\"average length:\", statistics.mean(database.lengths))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7504237-ff44-4a15-890f-1371db49a436",
   "metadata": {},
   "source": [
    "## Searching the best hit for a query"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c043dc84-9bcc-47ae-aca8-40e5fe6fcc76",
   "metadata": {},
   "source": [
    "Consider we have a query sequence with unknown function that was annotated with an ABC transporter activity: for instance Cj1587c ([CAL35684.1](https://www.ncbi.nlm.nih.gov/protein/CAL35684.1)), a putative ABC transporter from [Campylobacter jejuni](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=192222) which has been annotated with the [PF00005](https://www.ncbi.nlm.nih.gov/protein/AAT77331.1) domain. We can download the sequence directly from the NCBI Protein database:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a1f1a2f5-dc98-473a-9e62-e25d85218252",
   "metadata": {},
   "outputs": [],
   "source": [
    "import textwrap\n",
    "\n",
    "url = \"https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=protein&report=fasta&id=112360883\"\n",
    "with urlopen(url) as res:\n",
    "    reader = io.TextIOWrapper(res)\n",
    "    query = Bio.SeqIO.read(reader, \"fasta\")\n",
    "\n",
    "print(f\">{query.id}\")\n",
    "print(*textwrap.wrap(str(query.seq)), sep=\"\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5594ecf-60ff-4d66-b90f-52ffb056c9d9",
   "metadata": {},
   "source": [
    "After having built the database, we can align the query using the fast SIMD code from Opal. In addition, independent chunks of the library can be processed in parallel using multithreading for increased performance. This is handled automatically by the `pyopal.align` method. Since we are only interested in getting the most similar sequence, we can use the `score` mode (the default) to avoid the more expensive alignment computation, and find the result with the highest score. Because we are comparing full sequences, we want to use the `nw` (Needleman-Wunsch) algorithm to instruct PyOpal to compute global alignments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "428f22fa-76fc-4aa0-aaed-5535a5abcba4",
   "metadata": {},
   "outputs": [],
   "source": [
    "results = list(pyopal.align(bytes(query.seq), database, algorithm=\"nw\"))\n",
    "best = max(results, key=lambda result: result.score)\n",
    "best"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "876ede9c-5857-4a37-9c4a-ade0cb9aa2cf",
   "metadata": {},
   "source": [
    "PyOpal databases do not store any metadata about the target sequences, and the result only contain the index of the target in the database. We can use this index to recover the record corresponding to the highest scoring sequence:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f413981d-155f-47a3-9f29-68a0242a66f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(records[best.target_index].description)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "86879865-eae9-4b75-a4d3-af30ef13a1c6",
   "metadata": {},
   "source": [
    "## Checking all scores"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3e48b62a-b35f-4a79-b47b-eaad954e3dfb",
   "metadata": {},
   "source": [
    "To make sure we are not selecting a false-positive, we may want to visualize the distribution of scores over the database. Let's sort the results by ascending score:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "40441d09-cdf5-4b6f-95a6-47681c6835d1",
   "metadata": {},
   "outputs": [],
   "source": [
    "results.sort(key=lambda result: result.score, reverse=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4ca615be-6175-43d5-80af-e43560f608ce",
   "metadata": {},
   "source": [
    "We can plot the scores of the top 30 scoring sequences to have an idea of the score distribution:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "04801611-df56-4b5d-8618-f4f31d83b62b",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "\n",
    "plt.plot([result.score for result in results[:30]], 'o-')\n",
    "plt.ylabel(\"Score\")\n",
    "plt.xlabel(\"Target\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9008f407-330a-4b15-bf2e-04802f9abe07",
   "metadata": {},
   "source": [
    "It seems like there is only a single very high scoring hits in our scores, so we can safely use it for the rest of the analysis."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c2b8c97e-569a-4e43-919d-d58700c143b0",
   "metadata": {},
   "source": [
    "## Aligning to the best hit"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6e2a2b18-4b49-4916-aedc-532d7b26e3a6",
   "metadata": {},
   "source": [
    "If we want to get the alignments for our best sequences, we have to call `pyopal.align` again in `full` mode. But because we already know which sequences we are interested in, we don't have to re-align against the full database, only against the proteins of interest. We can extract a subset of a database by index using the `extract` method, and run the alignment only on that subset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31c30f33-1f21-437f-978f-a6cb6bfccb90",
   "metadata": {},
   "outputs": [],
   "source": [
    "subdb = database.extract([results[0].target_index])\n",
    "alignment = next(pyopal.align(bytes(query.seq), subdb, mode=\"full\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17b89cf3-9976-4c26-9577-8e2410aab557",
   "metadata": {},
   "source": [
    "The alignment can be rendered from the `FullResult` from PyOpal with the following function:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96922ed1-2484-4b71-a56c-826efdb58eaf",
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_alignment(query_seq, target_seq, alignment):\n",
    "    i = j = 0   \n",
    "    qline = []\n",
    "    aline = []\n",
    "    tline = []\n",
    "\n",
    "    for c in alignment:\n",
    "        if c == \"M\" or c == \"X\":\n",
    "            qline.append(query_seq[i])\n",
    "            tline.append(target_seq[j])\n",
    "            aline.append(\"|\" if c == \"M\" else \".\")\n",
    "            i += 1\n",
    "            j += 1\n",
    "        elif c == \"I\":\n",
    "            qline.append(\"-\")\n",
    "            tline.append(target_seq[j])\n",
    "            aline.append(\" \")\n",
    "            j += 1\n",
    "        elif c == \"D\":\n",
    "            qline.append(query_seq[i])\n",
    "            tline.append(\"-\")\n",
    "            aline.append(\" \")\n",
    "            i += 1\n",
    "    \n",
    "    print(\"\".join(qline))\n",
    "    print(\"\".join(aline))\n",
    "    print(\"\".join(tline))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd56cbee-4b11-4760-a54d-a37bbbf054f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\">\", records[best.target_index].id)\n",
    "print_alignment(str(query.seq), str(records[best.target_index].seq), alignment.alignment)\n",
    "print()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "07e3c863-092f-499e-9902-eb0cf92b7b91",
   "metadata": {},
   "source": [
    "## Getting the specificity"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ec3b8365-94f9-4d08-a20e-18f76a296faf",
   "metadata": {},
   "source": [
    "Now that we know the TCDB protein most similar to our query, we can check its substrate specificity. First we download the mapping from the TCDB, and extract it to a dictionary:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "849af36b-7382-4fd2-863d-5442d92841f7",
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import collections\n",
    "\n",
    "substrates = {}\n",
    "with urlopen(\"https://tcdb.org/cgi-bin/substrates/getSubstrates.py\") as res:\n",
    "    reader = csv.reader(io.TextIOWrapper(res), dialect=\"excel-tab\")\n",
    "    for row in reader:\n",
    "        substrates[row[0]] = row[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "53c6a7d1-8555-484e-9735-669fd6f01c56",
   "metadata": {},
   "source": [
    "Now we can get the substrate specificity of our best hit:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "eb378651-caac-43be-85fd-0761142167f3",
   "metadata": {},
   "outputs": [],
   "source": [
    "best_tcid = records[best.target_index].id.split(\"|\")[-1]\n",
    "substrates[best_tcid]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bda1a0ed-32c3-4876-95a3-5a972580a3ea",
   "metadata": {},
   "source": [
    "It seems like our mysterious ABC transporter is [microcin](https://en.wikipedia.org/wiki/Microcin) transporter! This is consistent with the annotation from KEGG, which annotated [Cj1587c](https://www.genome.jp/dbget-bin/www_bget?cje:Cj1587c) as [K06159](https://www.genome.jp/entry/K06159) (multidrug/microcin transport system)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}