Archive for May, 2013

Typogenetics in F# – Part I

Author’s note: I originally wrote this back in 2009, but never got around to publishing it.

Gödel, Escher, Bach: an Eternal Golden Braid is Douglas Hofstadter’s tour de force of formal systems and the ‘strange loops’ that can arise within them.

Gödel, Escher, Bach

Gödel, Escher, Bach

Such strange loops can, it is claimed, give meaning to symbols where none existed. Written in 1979 it has become classic in its time and a must read for anybody with an interest in computation, computability or the nature of intelligence. If you haven’t read GEB, you should order it now, and take a relaxing vacation for a week or so, where you’ll have the time and mental capacity to internalize its contents, before reading on…

In Chapter XVI of GEB, Self-Ref and Self-Rep, Hofstadter introduces a simple game called Typogenetics — short for Typographical Genetics — which is inspired by the real-world workings of molecular biology. Although vastly simplified, typogenetics captures some essential aspects of cell biology whilst jetissoning the messy biochemical reality. In essence the model is this: Strands consisting of a sequence of bases are translated through a fixed coding into the sequence of amino acids in an enzyme. Enzymes then in turn act upon strands, manipulating the sequence of bases in the strand, copying to new strands and splitting strands. These new strands can also be expressed into enzymes, which act upon them to create new strands, and so on.

In parallel with ploughing through GEB, I’ve been experimenting with F# for software development. First-class functions, partial application and pattern matching — all core features of the F# language — are useful abilities when symbolic manipulation is called for, and the techniques I’ve picked up from Expert F# seemed directly applicable to an functional software simulation of the Typogenetics system.

The typogenetic code in F#

The typogenetic alphabet has only four letters, A,C, G and T – which represent the four chemical bases of classical genetics, Adenine, Cytosine, Guanine and Thyamine.

The structure of DNA. From the U.S. National Library of Medicine.

The structure of DNA. From the U.S. National Library of Medicine.

We can represent this in F# as a discrimiated union,

type Base =
    | Adenine
    | Cytosine
    | Guanine
    | Thyamine

However, writing out literal strands of bases as lists of Bases would soon become tedious, so we provide a convenience function strand, which constructs a list of bases from a character string,

let strand s =
    let fromLetter c =
        match c with
        | 'A' -> Adenine
        | 'C' -> Cytosine
        | 'G' -> Guanine
        | 'T' -> Thyamine
        |  _  -> failwith("Invalid Base code") fromLetter s |> Seq.toList

which can be used thus in F# Interactive,

val it : Base list =
  [Thyamine; Adenine; Guanine; Adenine; Thyamine; Cytosine; Cytosine; Adenine;
   Guanine; Thyamine; Cytosine; Cytosine; Adenine; Cytosine; Adenine; Thyamine;
   Cytosine; Guanine; Adenine]
F# offers special short syntax for functions which consist only of a match, using the function keyword. Using this the above function can be shortened to,

let strand s =
    let fromLetter = function
        | 'A' -> Adenine
        | 'C' -> Cytosine
        | 'G' -> Guanine
        | 'T' -> Thyamine
        |  _  -> failwith("Invalid Base code") fromLetter s |> Seq.toList

Enzymes are sequences of amino acids and in typogenetics there are 15 different amino acids from which these sequences can be constructed. Each amino acid has a three letter name. This can also be represented in F# as a discriminated union,

type AminoAcid =
    | Cut
    | Del
    | Swi
    | Mvr
    | Mvl
    | Cop
    | Off
    | Ina
    | Inc
    | Ing
    | Int
    | Rpy
    | Rpu
    | Lpy
    | Lpu

When strands of bases are transcribed into enzymes, the units of coding are called codons — each consist of duplet (2-tuple) of consecutive bases on the strand, starting from the beginning of the strand. Since a stand may have an odd number of bases, a singlet may be left over at the end of the strand. We represent duplets and singlets in F# as a discriminated union, with parameterised type-constructors taking one Base in the case of Singlet, and two in the case of Duplet.

type Codon =
    | Duplet  of Base * Base
    | Singlet of Base

We can now use pattern matching, using the function sugar, to convert a list of Bases representing a strand into a list of Codons.

let rec codons = function
    | x0::x1::xs -> Duplet(x0, x1)::(codons xs)
    | x::xs      -> Singlet(x)::(codons xs)
    | []         -> []

We can test these functions in F# interactive,

> strand "TAGATCCAGTCCACATCGA" |> codons;;
val it : Codon list =
  [Duplet (Thyamine,Adenine); Duplet (Guanine,Adenine);
   Duplet (Thyamine,Cytosine); Duplet (Cytosine,Adenine);
   Duplet (Guanine,Thyamine); Duplet (Cytosine,Cytosine);
   Duplet (Adenine,Cytosine); Duplet (Adenine,Thyamine);
   Duplet (Cytosine,Guanine); Singlet Adenine]

Each codon duplet in typogenetics codes for a specific amino acid. Trailing singlets are ignored. The genetic code is defined by the following matrix.

An Adenine, Adenine pair is defined as a ‘punctuation mark’ which marks the boundary between two genes. Each each gene translates into a single enzyme.

The following matrix specifies the coding:

The typogenetic code

The typogenetic code

This information can be encoded by the following function,

let decode = function
    | Duplet(Adenine,  Adenine)  -> None
    | Duplet(Adenine,  Cytosine) -> Some(Cut)
    | Duplet(Adenine,  Guanine)  -> Some(Del)
    | Duplet(Adenine,  Thyamine) -> Some(Swi)
    | Duplet(Cytosine, Adenine)  -> Some(Mvr)
    | Duplet(Cytosine, Cytosine) -> Some(Mvl)
    | Duplet(Cytosine, Guanine)  -> Some(Cop)
    | Duplet(Cytosine, Thyamine) -> Some(Off)
    | Duplet(Guanine,  Adenine)  -> Some(Ina)
    | Duplet(Guanine,  Cytosine) -> Some(Inc)
    | Duplet(Guanine,  Guanine)  -> Some(Ing)
    | Duplet(Guanine,  Thyamine) -> Some(Int)
    | Duplet(Thyamine, Adenine)  -> Some(Rpy)
    | Duplet(Thyamine, Cytosine) -> Some(Rpu)
    | Duplet(Thyamine, Guanine)  -> Some(Lpy)
    | Duplet(Thyamine, Thyamine) -> Some(Lpu)
    | Singlet(_)                 -> None

Since two of the codons do not code for an amino acid, an option type is used to wrap the return value.
The transcript function, shown below, applies this decoding to the list of codons to produce a list of amino-acid elements — in fact option AminoAcid elements — since the result may contain None values.

let transcript = decode

Referring back to the decode function, we can see that None is returned for the Duplet(Adenine, Adenine) punctuation mark and for any Singlet. We now need to split the list of amino acids into separate enzymes on the None delimiters using this function,

let split delimiter list =
    let rec split' list' =
        match list' with
        | []                       -> [[]]
        | x::xs when x = delimiter -> [] :: split' xs
        | x::xs                    -> let sxs = split' xs
                                      (x :: sxs.Head) :: sxs.Tail
    split' list

used like this,

> strand "TAGATCCAGTAACCACATCGA" |> codons |> transcript |> split None;;
val it : AminoAcid option list list =
  [[Some Rpy; Some Ina; Some Rpu; Some Mvr; Some Int];
   [Some Mvl; Some Cut; Some Swi; Some Cop]; []]

As you can see, this may result in empty lists, which we can filter out using the standard List.filter function

strand "TAGATCCAGTAACCACATCGA" |> codons |> transcript |> split None
                               |> List.filter (fun x -> not x.IsEmpty);;

val enzymes : AminoAcid option list list =
  [[Some Rpy; Some Ina; Some Rpu; Some Mvr; Some Int];
   [Some Mvl; Some Cut; Some Swi; Some Cop]]

and then we unwrap the element values from the now unnecessary option types with the help of a small function,

> let unpack list = Option.get list

> strand "TAGATCCAGTAACCACATCGA" |> codons |> transcript |> split None
                                 |> List.filter (fun x -> not x.IsEmpty)
                                 |> unpack;;
val it : AminoAcid list list =
  [[Rpy; Ina; Rpu; Mvr; Int]; [Mvl; Cut; Swi; Cop]]

giving us a tidy list of the enzymes, each of which is in turn a chain of amino acids, produced by translation of the genes in the original strand of bases. The single strand in our example contains two “genes” which express two different enzymes. Finally, we wrap all this up in a single function called express as we’ll be needing this gene expression machinery later.

Next time, we’ll look at how to model binding of the enzymes we have expressed back to strands of bases, the next step in closing the Strange Loop of typogenetics.

Categories: .NET, computing, F#, software, Uncategorized Tags: