DNA 16/07/2020
What is it?
DNA (Deoxyribonucleic Acid) is the physical mechanism by which many organisms
store instructions for their development and function, and this genetic code is responsible for traits
such as eye colour and many others in humans. Skipping the detailed chemistry, many will be familiar
with the idea of DNA as the double helix; two chains of molecules linked together and forming
helical twists. The molecules forming DNA in humans go by the names of
Adenine (A), Thymine (T), Guanine (G), and Cytosine (C) and are themselves complicated molecular structures.
For our purpose we will take it as given that DNA is formed of these molecules in two chains: one chain
which encodes actual information, and a second chain which exists to aid stability. This second chain is
formed by two simple rules (in standard DNA): if the first chain has an A at a particular position
then the second has a T (and vice versa), and if the first chain has a G or a C then the second
has a C or a G, respectively, in the same position. So an example of a valid pair of chains is
$$ ACACTCAGCGGT $$
$$ TGTGAGTCGCCA $$
and an invalid example is
$$ GCGGTTGATTAA $$
$$ C{\color{red}C}CCAAC{\color{red}G}AATT $$
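The two pairing rules are simple enough to check mechanically. Here is a minimal Python sketch (the names `PAIR`, `complement` and `is_valid_pair` are made up for illustration and are not taken from the DNAText.jl code) which builds the second chain from the first and tests the two examples above.

```python
# Pairing rules for standard DNA: A pairs with T, and G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand: str) -> str:
    """Build the second chain, position by position, from the first."""
    return "".join(PAIR[base] for base in strand)


def is_valid_pair(first: str, second: str) -> bool:
    """True when every position of `second` pairs correctly with `first`."""
    return len(first) == len(second) and complement(first) == second


print(complement("ACACTCAGCGGT"))                     # TGTGAGTCGCCA
print(is_valid_pair("GCGGTTGATTAA", "CCCCAACGAATT"))  # False
```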
The Maths: Number Systems and Bases
Mathematically speaking the interesting part about DNA is that we have exactly four characters to
play with. Consider computers: in the computer world all information (including numbers) is represented
with just a sequence of \(1\)s and \(0\)s; ON and OFF signals in an electrical circuit. The language of
computers is Binary. So then what, mathematically speaking, is the language of DNA? Well there
are four characters, so it's "four-ary", more commonly known as Quaternary numbers. What we
are dealing with here are different number systems, distinguished by their base.
Let's take a step back a second and think about the numbers we are all familiar with, the decimal
numbers \(0,1,2,3,4,5,6,7,8,9,10,11,\ldots\). We call these base-\(10\), because there are \(10\)
digits (\(0\) to \(9\)) that make up how we write down any number (we of course also use a \(.\) to
represent fractional numbers, but this is a notational convenience). So let's throw down a few equations
$$ 0 = 0\times 10^{0}, \enspace\enspace 6 = 6\times 10^{0}, \enspace\enspace 27 = 2\times 10^{1}+7\times 10^{0}, \enspace\enspace 172 = 1\times 10^{2} + 7\times 10^{1} + 2\times 10^{0} $$
These appear obvious, and they are, but there is a pattern here. Notice we are writing base-\(10\) numbers
and we see \(10\) appearing in every number. Specifically we can write any base-\(10\) number
as a sum of powers of \(10\), each multiplied by just one of the \(10\) digits \(0,1,2,3,4,5,6,7,8,9\),
and really our notation \(325\) is a shorthand for "3 lots of 100, 2 lots of 10, and 5 lots of 1". This may seem
too obvious to remark on, so what's the point? The point is that for any base we can
generalise this to write numbers in that base. In Binary we have the two characters \(0\) and \(1\), and instead
of powers of \(10\) a number will be a sum of powers of \(2\), each multiplied by either a \(0\) or a \(1\), e.g.
$$ 101_{2} = 1\times 2^2 + 0\times 2^{1} + 1\times 2^{0} = 5 = 5\times 10^{0} = 5_{10} $$
where I'm now writing a subscript \(x_{b}\) to say that the number \(x\) is to be understood as
being written in base \(b\). In maths we like to generalise everything, so we can now write down
any (integer for now) number in any base as the sum
$$x_{b} = d_{N-1}d_{N-2}\cdots d_{1}d_{0} = d_{N-1}\times b^{N-1} + d_{N-2}\times b^{N-2} + \cdots + d_{1}\times b^{1} +d_{0}\times b^{0} = \sum_{i=0}^{N-1}d_{i}b^{i} $$
where \(x_{b}\) is a number in base \(b\) with \(N\) digits \(d_{0},d_{1},d_{2},\ldots, d_{N-1}\). For
fractional values we can relax the lower bound of the sum index \(i\) to be \(-M\), where \(M\) is the number
of digits after the point; this naturally gives negative powers of \(b\) and so fractions.
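To see the sum in action, here is a small Python sketch that evaluates a list of digits in a given base. The function name `value_of` is just an illustrative choice, the digits are given most significant first (as we write them on paper), and only the integer case is handled.

```python
def value_of(digits, base):
    """Evaluate the sum of d_i * base**i for digits written most significant first."""
    n = len(digits)
    # The digit at list position k is d_{n-1-k}, so it multiplies base**(n-1-k).
    return sum(d * base ** (n - 1 - k) for k, d in enumerate(digits))


print(value_of([1, 0, 1], 2))   # 5
print(value_of([1, 7, 2], 10))  # 172
```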
Armed with this we can return to DNA. We already noticed that DNA has four characters, therefore
we can think of each character as a digit in a base four number system. Identifying \(A=0,C=1,G=2,T=3\) we
can write a string of DNA as a number
$$ ATCGTGC \equiv 0312321_{4} = 3513_{10}$$
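This identification can be checked with a few lines of Python. The sketch below (the names `DIGIT` and `dna_to_int` are illustrative only) reads the strand left to right, multiplying the running total by four and adding the next digit, which is just the sum of powers of four written in nested form.

```python
# The digit assigned to each DNA character: A=0, C=1, G=2, T=3.
DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def dna_to_int(strand: str) -> int:
    """Read a DNA string as a base-4 number, most significant digit first."""
    value = 0
    for base in strand:
        value = value * 4 + DIGIT[base]
    return value


print(dna_to_int("ATCGTGC"))  # 3513
```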
Since we have represented DNA in base four, we can convert it to any other base. Tricks
exist for conversion, and I'm not going to write these down in detail, but essentially to go from base \(10\) to base \(b\) we can
repeatedly divide by \(b\) and keep the remainders, e.g. for \(b=4\) (with all numbers in base \(10\))
$$ 143 / 4 = 35 \text{ remainder } 3 $$
$$ 35 / 4 = 8 \text{ remainder } 3 $$
$$ 8 / 4 = 2 \text{ remainder } 0 $$
$$ 2 / 4 = 0 \text{ remainder } 2 $$
and, reading the remainders from last to first (you can check for yourself),
$$ 143_{10} = 2033_{4} $$
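The division and remainder steps translate directly into a loop. The following Python sketch (with the hypothetical name `to_base`, and assuming a non-negative integer and a base between 2 and 10 so that each digit is a single character) collects the remainders and then reverses them.

```python
def to_base(n: int, base: int) -> str:
    """Convert a non-negative integer to a digit string in the given base (2 to 10)."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, remainder = divmod(n, base)  # one division-with-remainder step
        digits.append(str(remainder))
    # The remainders come out least significant first, so reverse them.
    return "".join(reversed(digits))


print(to_base(143, 4))  # 2033
print(to_base(5, 2))    # 101
```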
Text to DNA and Back Again
Now that we have learned all of this what can we do with it? Well, one fun application is
that since computers use base \(2\), DNA can be written in base \(4\), and we can
convert between bases, we can write DNA as Binary or Binary as DNA. More interestingly since computers
often display text, text must be represented as (Binary) numbers somehow, so can we convert text
into a DNA sequence? Yes!
Unfortunately there is one slight bump in the road: yes, text is stored as Binary
by computers, but for humans who program computers it is often more convenient to
write those numbers in base-16! Of course, we can just ignore that for this post since we know
how to convert bases (in the future I may even write about base-16, AKA Hexadecimal, numbers).
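As a quick illustration of the same character in three bases, here is a Python sketch for the letter "d". It assumes the character fits in a single byte (true for plain ASCII text); since \(4 = 2^{2}\), each pair of Binary digits becomes exactly one base-\(4\) digit.

```python
code = ord("d")              # the number the computer stores for "d": 100

bits = format(code, "08b")   # eight Binary digits
hexa = format(code, "02x")   # two Hexadecimal digits

# Group the bits in pairs; each pair is one base-4 digit.
quaternary = "".join(str(int(bits[i:i + 2], 2)) for i in range(0, 8, 2))

print(bits)        # 01100100
print(hexa)        # 64
print(quaternary)  # 1210
```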
For now though, before playing with the code,
there is one point to clear up. We can convert from text (Binary/Hex) to base-\(4\) easily enough, and we know how base pairs form so
the second DNA strand will be automatically taken care of, but how can we decode the information-containing strand? How do we
know when a sequence of DNA should be read as a character? Since there are more than four letters we want to encode, a single letter
is going to be formed of multiple base-\(4\) digits. DNA also has this problem: it exists to be read, so DNA evolved a solution (which is also
studied in the theory of codes like Morse code): stop sequences. A stop sequence tells us that the characters we have read up to
now should be interpreted as a single piece of information (a letter in our case); DNA itself also includes start sequences for extra
safety. One DNA stop sequence is \(TAG\), which is what we will be using. Let's look at an example. Say (arbitrarily) that the
sequences \(ATAC, GCAT, AC, CTG\) represent the letters "dna!". To encode this as a DNA sequence (with our stop characters so
that we can decode it later) we simply write down the sequence
$$ \text{"dna!" } \equiv ATAC{\color{green}TAG}GCAT{\color{green}TAG}AC{\color{green}TAG}CTG{\color{green}TAG}$$
But this introduces a new problem: what if a letter is represented by \(TAGC\) or \(GTAG\)? Programmatically
we can handle this by giving every letter exactly four DNA characters (unlike the arbitrary example above) and
reading the DNA sequence in fours: we read a set of four DNA characters and
convert it to a letter, then we skip the next three characters, which should be \(TAG\), and read the next four, until the
end of the sequence. We can even check
that the three characters after a letter are \(TAG\) to make sure the DNA sequence is not corrupted somehow.
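The read-four, skip-three scheme can be sketched in Python as follows. This is only an illustration under stated assumptions, not the DNAText.jl implementation: the names `encode` and `decode` are made up, the text is assumed to be plain ASCII so each character is one byte (four base-\(4\) digits), and the digit assignment is \(A=0,C=1,G=2,T=3\) as above.

```python
BASES = "ACGT"   # position in this string is the digit: A=0, C=1, G=2, T=3
STOP = "TAG"     # stop sequence written after every letter


def encode(text: str) -> str:
    """Write each ASCII character as four DNA characters followed by TAG."""
    out = []
    for byte in text.encode("ascii"):
        # Take the byte two bits at a time, most significant pair first.
        letters = "".join(BASES[(byte >> shift) & 3] for shift in (6, 4, 2, 0))
        out.append(letters + STOP)
    return "".join(out)


def decode(dna: str) -> str:
    """Read four DNA characters, check the next three are TAG, and repeat."""
    if len(dna) % 7 != 0:
        raise ValueError("length is not a multiple of 7")
    chars = []
    for i in range(0, len(dna), 7):
        letters, stop = dna[i:i + 4], dna[i + 4:i + 7]
        if stop != STOP:
            raise ValueError(f"expected {STOP} at position {i + 4}")
        value = 0
        for base in letters:
            value = value * 4 + BASES.index(base)  # unknown characters raise ValueError
        chars.append(chr(value))
    return "".join(chars)


print(encode("d"))             # CGCATAG
print(decode(encode("dna!")))  # dna!
```

Because the reader always knows where a letter ends and where a stop sequence begins, a letter that happens to contain \(TAG\) causes no confusion.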
The grand finale then is a program which we can use to re-write any sequence of text as a strand of DNA! As mentioned
before, I've implemented this in Julia (DNAText.jl [Julia]),
and also on this site I've made a JavaScript application you can try for yourself!