Identify the specificity of an ABC transporter#

In this example, we show how to use PyOpal to align a putative ABC transporter to reference sequences from the Transporter Classification Database to identify their substrate specificity.

ATP-binding cassette (ABC) form a superfamily of transporter proteins found in all domains of life. Bacterial ABC transporters are involved in the translocation of substrates such nutrients, ions, or toxins. In bacteria, they are expressed as multi-protein system composed of 4 structural units:

abc.jpg

Figure from Akhtar & Turner. (2022).

The transmembrane domain (TMD) exhibits high variability to accomodate for specific substrate. The TCDB contains proteins belonging to the ABC superfamily and lists their substrate specificity. Using Opal, we can find the substrate specificity for a putative ABC transporter by aligning against the database.

[1]:
import pyopal
pyopal.__version__
[1]:
'0.7.3'

Building an ABC transporter database#

Let’s start by building a database of ABC transporters, which corresponds to the 3.A.1 superfamily in the TCDB. We can download the protein sequences directly from the Download page of the TCDB, and use Biopython to parse the records.

[2]:
import io
from urllib.request import urlopen

import Bio.SeqIO

records = []
with urlopen("https://tcdb.org/public/tcdb") as res:
    reader = io.TextIOWrapper(res)
    for record in Bio.SeqIO.parse(reader, "fasta"):
        tcid = record.id.split("|")[-1]
        if tcid.startswith("3.A.1"):
            records.append(record)

At this stage, we have a list of SeqRecord objects. We can create a PyOpal database from these by extracting the raw sequence from each record:

[3]:
database = pyopal.Database([bytes(record.seq) for record in records])

database now contains the sequences encoded and stored to allow fast querying with PyOpal.

[4]:
import statistics

print("sequences:", len(database))
print("average length:", statistics.mean(database.lengths))
sequences: 2709
average length: 462.62827611664824

Searching the best hit for a query#

Consider we have a query sequence with unknown function that was annotated with an ABC transporter activity: for instance Cj1587c (CAL35684.1), a putative ABC transporter from Campylobacter jejuni which has been annotated with the PF00005 domain. We can download the sequence directly from the NCBI Protein database:

[5]:
import textwrap

url = "https://www.ncbi.nlm.nih.gov/sviewer/viewer.cgi?save=file&db=protein&report=fasta&id=112360883"
with urlopen(url) as res:
    reader = io.TextIOWrapper(res)
    query = Bio.SeqIO.read(reader, "fasta")

print(f">{query.id}")
print(*textwrap.wrap(str(query.seq)), sep="\n")
>CAL35684.1
MPFIMELLKQNKLKLISFLLFSFITSAVGVLTLVFINDYLLKNAQNIPIFYFIVLLLIFFISSTIVELGL
SIFGQNFIFKMQRRVVKQILDTPLLRVAKVGKARILASLGSDVRNISFGLLRLPDFLQSSILILCTSVYL
CYLSPQIFALCVVWIMVIFITNNFLMMKVYQYFRKARENDDALQNNYQNILDGHKELLINRDRAKLYYED
EFENNARLKKKNSTLGNLFNNLSNNWTNVTLLALVGVEFYLALKFEWASVADATTIALSILFLRTPLVSM
IGSFPTLLLAKIALDKIAKLELDDYIEGFKKTHYISEWKKISFRNTQFAYEENFHLNPVNIELKKGELVF
LIGKNGSGKSTFCMLLTGLFKPSKGGIYVDDTLIDDKNLDEYRSLISAVFSDFHLFTKTLNKENFASEEK
IAFWLEFLELKNKTSVKDNELTLTKLSTGQKKRLAMLIALLEERDILVLDEWAADQDPVFRRFFYKKLLP
LLKEQGKTIFAITHDDAYFDSADRIFLAQNGEISELKGENIKELAKNLVEKFD

After having built the database, we can align the query using the fast SIMD code from Opal. In addition, independent chunks of the library can be processed in parallel using multithreading for increased performance. This is handled automatically by the pyopal.align method. Since we are only interested in getting the most similar sequence, we can use the score mode (the default) to avoid the more expensive alignment computation, and find the result with the highest score. Because we are comparing full sequences, we want to use the nw (Needleman-Wunsch) algorithm to instruct PyOpal to compute global alignments.

[6]:
results = list(pyopal.align(bytes(query.seq), database, algorithm="nw"))
best = max(results, key=lambda result: result.score)
best
[6]:
ScoreResult(1136, score=1535)

PyOpal databases do not store any metadata about the target sequences, and the result only contain the index of the target in the database. We can use this index to recover the record corresponding to the highest scoring sequence:

[7]:
print(records[best.target_index].description)
gnl|TC-DB|P33941|3.A.1.113.3 ABC transporter ATP-binding protein yojI - Escherichia coli.

Checking all scores#

To make sure we are not selecting a false-positive, we may want to visualize the distribution of scores over the database. Let’s sort the results by ascending score:

[8]:
results.sort(key=lambda result: result.score, reverse=True)

We can plot the scores of the top 30 scoring sequences to have an idea of the score distribution:

[9]:
import matplotlib.pyplot as plt

plt.plot([result.score for result in results[:30]], 'o-')
plt.ylabel("Score")
plt.xlabel("Target")
plt.show()
../_images/examples_abc_20_0.svg

It seems like there is only a single very high scoring hits in our scores, so we can safely use it for the rest of the analysis.

Aligning to the best hit#

If we want to get the alignments for our best sequences, we have to call pyopal.align again in full mode. But because we already know which sequences we are interested in, we don’t have to re-align against the full database, only against the proteins of interest. We can extract a subset of a database by index using the extract method, and run the alignment only on that subset:

[10]:
subdb = database.extract([results[0].target_index])
alignment = next(pyopal.align(bytes(query.seq), subdb, mode="full"))

The alignment can be rendered from the FullResult from PyOpal with the following function:

[11]:
def print_alignment(query_seq, target_seq, alignment):
    i = j = 0
    qline = []
    aline = []
    tline = []

    for c in alignment:
        if c == "M" or c == "X":
            qline.append(query_seq[i])
            tline.append(target_seq[j])
            aline.append("|" if c == "M" else ".")
            i += 1
            j += 1
        elif c == "I":
            qline.append("-")
            tline.append(target_seq[j])
            aline.append(" ")
            j += 1
        elif c == "D":
            qline.append(query_seq[i])
            tline.append("-")
            aline.append(" ")
            i += 1

    print("".join(qline))
    print("".join(aline))
    print("".join(tline))
[12]:
print(">", records[best.target_index].id)
print_alignment(str(query.seq), str(records[best.target_index].seq), alignment.alignment)
print()
> gnl|TC-DB|P33941|3.A.1.113.3
MPFI----MELLKQNKLKL-ISFLLFSFITS-AVGVLTLVFINDYLLK---N-AQNIPI-FYFIVLLLIFFI-SSTIVELGLSIFGQNFIF-KMQRRVVKQILDTPLLRVAKVGKARILASLGSDVRNISFGLLRLPDFLQSSIL--I-LCTSV-YLCYLSPQIF--ALCVVWIMVIFI-TNNFLMMKVYQY--FRKARENDDALQNNYQNILDGHKELLINR-----DRAKL--YYEDE-FENNARLKKKNSTLGNLFN-NLSNNWTN-VTLL-ALVGVE--FYLALK--FEWASVADATTIALSILFLRTPLVSMIGSFPTLLLAKIALDKIAKLE-LDDYIEGF-KKTHYISEWKKI-SFRNTQFA-YEENFHLNPVNIELKKGELVFLIGKNGSGKSTFCMLLTGLFKPSKGGIYVDDTLID--D--K-NLDEYRSLISAVFSDFHLFTKT-LNKE-N--FASEEKI-AF-----WLEFL-ELKNKTSVKDNELTLTKLSTGQKKRLAMLIALLEERDILVLDEWAADQDPVFRRFFYKKLLPLLKEQGKTIFAITHDDAYFDSAD-RIFLAQNGEISELKGENI-KELAKN
||||    .|.....||.. | |....|.|. | ..|||..|...|..   . |  .|. .|||.  ...|. ..|.|...|..|.... | ...|.||||...|....|.|..||.|.||||||.....|||...| .| || |  | .|| . ||.... | |  .| |.|.|...| |...||...  |  ||..|.|....|..|.|.|||..||.||.     ||..|  |.... .|   .. |    | .||. ||.|...| | .| | |..|  |.  ||  .|| | |..|..|||||||.|..|..||||.|..|..|..|..|... .| |.... . ..|.....|| | |||..| |...|.|...|.|||.||||.|||||||..||||||..|..|.|     |.|.|..  .  | ||.|.||||.|..||...|..| ..|. .  ||   || ..     |||.| .. |  | |   |||.|||||.|.|.||.|||||..||||||||||.|||.||..||||..|.|||||||.|||.||..|||. | ..||..|||.||. ...| ...|..
MELLVLVWRQYRWPFISVMAL-SLASAALGIGL-IAFINQRLIETADTSLLVLP--EFLGLLLLL--MAVTLGSQLALTTLGHHFVYRL-RSEFIKRILDTHVERIEQLGSASLLAGLTSDVRNITIAFVRLPELVQ-GI-IL-TIGSAAYL-WMLSGKML-L-VTAIW-MAITIWGGFVLVARVYKHM--ATLRETEDKLYTDFQTVLEGRKELTLNRERAEYVFNNLYIPDAQEYRHHIIR---AD-T----F-HLSAVNWSNIMMLGA-IGLV-FWMANSLG--WADTNVA-A-TYSLTLLFLRTPLLSAVGALPTLLTAQVAFNKLNKFALAPFKAE-FPRPQAF-PNWQTLELRNVT-FAYQDNAFSVGPINLTIKRGELLFLIGGNGSGKSTLAMLLTGLYQPQSGEI-----LLDGKPVSGEQPEDYRKLFSAVFTDVWLFDQLLGPEGKPANPQLVEK---WLAQLKMAHKLELSNGRI-V--N-L---KLSKGQKKRVALLLALAEERDIILLDEWAADQDPHFRREFYQVLLPLMQEMGKTIFAISHDDHYFIHADRL-LEMRNGQLSELTGEE-RDAASRDAVAR

Getting the specificity#

Now that we know the TCDB protein most similar to our query, we can check its substrate specificity. First we download the mapping from the TCDB, and extract it to a dictionary:

[13]:
import csv
import collections

substrates = {}
with urlopen("https://tcdb.org/cgi-bin/substrates/getSubstrates.py") as res:
    reader = csv.reader(io.TextIOWrapper(res), dialect="excel-tab")
    for row in reader:
        substrates[row[0]] = row[1]

Now we can get the substrate specificity of our best hit:

[14]:
best_tcid = records[best.target_index].id.split("|")[-1]
substrates[best_tcid]
[14]:
'CHEBI:64627;microcin'

It seems like our mysterious ABC transporter is microcin transporter! This is consistent with the annotation from KEGG, which annotated Cj1587c as K06159 (multidrug/microcin transport system).