I was analyzing a short, relatively uninteresting section of gyrase B (GyrB), a fairly common gene present in all bacteria and a few eukaryotic organisms. My goal was to design a pair of primers I could use to amplify, by polymerase chain reaction (PCR), a specific E. coli strain I was attempting to characterize further as part of my dissertation work. The software package I had been using for the alignments is called Vector NTI. It’s a fairly clunky and somewhat dated program, but I was familiar with it and only needed its most basic functions. Over the course of the day I had imported over 150 complete genome sequences from various E. coli strains from the NCBI database. Some of the data consisted of unassembled contigs from the Sequence Read Archive (SRA). This was rather annoying, as I would need to assemble and partially annotate the contigs before I could use them.

A different software package was required for these assemblies, and it was extremely computationally intensive. One assembly might take two to three hours depending on the quality of the input data. While the assemblies were running, my computer slowed to a crawl as the processors worked in overdrive to analyze all the data. The end result was further delays in an already time-intensive process. In the meantime I would BLAST my initial proposed primer sequences to make sure they would detect the E. coli I was interested in without cross-reacting with anything else. Usually I had only one or two single nucleotide polymorphisms (SNPs) that I could exploit for my purposes.

It was tedious work, time-consuming, and not very fun. It was a process I had repeated on numerous occasions for other genes in other target organisms. There was no reason to suspect that this time would be any different. After a few hundred iterations I would no doubt hit upon at least a few candidate primer sets. At that point I would need to confirm the in silico findings by empirical screening in the lab across a subset of our fairly vast collection of bacteria. Very basic stuff.

At some point I noticed something in the data I had never seen before. There was a large segment of sequence containing a seemingly out-of-place string of bases. I had seen enough of this particular section of the gene to know this was highly unusual. It was present in only one of the genomes I was examining. There did not seem to be anything out of the ordinary about this particular strain of E. coli, which made me even more curious.

I quickly converted the DNA sequence to RNA so I could translate it into the corresponding protein using the RNA codon table I always kept handy. The RNA sequence was as follows: AAAAUUCUUCUUAAAAUUCUUCUUAAAAUUCUUCUU… and so on. The 12-mer tandem repeat continued for 144 bases in total. The amino acids it coded for were as follows: Lysine-Isoleucine-Leucine-Leucine. Just as the bases in DNA are abbreviated by letters, each amino acid has a one-letter abbreviation as well. Lysine is K, Isoleucine is I, and Leucine is L. When fully translated to abbreviated protein code, the sequence read KILLKILLKILLKILL… and so on.
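For anyone who wants to check the translation, it can be sketched in a few lines of Python. This is only a minimal sketch, not what I ran that day: the function names `transcribe` and `translate` are hypothetical, the codon table holds just the three codons that appear in the repeat (a real table has 64), and it assumes the DNA I was reading was the coding strand in frame.

```python
# Partial RNA codon table: only the three codons present in the repeat.
# Assumption for illustration; a full table would list all 64 codons.
CODON_TABLE = {
    "AAA": "K",  # lysine
    "AUU": "I",  # isoleucine
    "CUU": "L",  # leucine
}


def transcribe(dna: str) -> str:
    """Convert a coding-strand DNA string to RNA by swapping T for U."""
    return dna.upper().replace("T", "U")


def translate(rna: str) -> str:
    """Translate RNA three bases at a time, starting at the first base.

    Codons missing from the partial table are shown as '?'.
    Trailing bases that do not fill a whole codon are ignored.
    """
    usable = len(rna) - len(rna) % 3
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)


dna_repeat = "AAAATTCTTCTT"        # the 12-mer written as DNA
segment = transcribe(dna_repeat * 12)  # 12 copies = 144 bases
protein = translate(segment)

print(len(segment), len(protein))
print(protein)
```

Twelve copies of the 12-mer give the 144 bases, which translate to 48 amino acids: the same four letters, twelve times over.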

My mind went numb at that point, and I don’t remember much after that. Eventually I came back to reality, and there was blood on my hands. A gore-covered knife lay on the ground at my feet. What had I done? And, more importantly, why had I done it?