Archive for the ‘Genetics’ Category

Error Control Coding in Biology Implies Design, Part 4 (of 5)

Friday, December 12th, 2008

Keith McPherson

Photo of KeithMcPhersonKeith McPherson received his Master of Science in Electrical Engineering from Georgia Institute of Technology in 1993, and currently works as an electrical engineer in Melbourne, FL, in the fields of communications and signal processing.

 --------------------------------------

Anyone up for tic-tac-toe and genetics? It’s not exactly a game, but grab a cup of coffee and let’s explore the intriguing design in genetic systems.

In parts 1, 2, and 3 of this series, we observed three features of genetic systems that suggest design. In this article we look at isomorphic systems to find yet another analogy in the genetic code.

Analogy: Genetic Code-Like (GCL) Binary Representation

Recall the genetic code mapping table discussed in part 1 and part 2 of this series. This table describes the mapping between the 64 codons and the 20 amino acids.

Researchers in Venice, Italy, have identified a specific and unique number system and have used it to mathematically model the genetic code. In this work, the 64 codons and the 20 amino acids are assigned to numerical elements within the system, referred to as the genetic code-like (GCL) binary representation. The GCL binary representation is a unique model that holds true to the specific redundancy features in the natural genetic code mapping.1 It is a mathematical model of the underlining physical/chemical processes related to genetic information processing—a so-called structural isomorphism.

An isomorphism is a one-to-one correspondence between the elements of two sets such that the result of an operation on elements of one set corresponds to the result of the analogous operation on their images in the other set. If two sets are isomorphic with respect to certain properties, then those properties that are true of one of the sets must also be true of the other.

For example, a six-sided die and a bag from which a number 1 through 6 is chosen are isomorphic. As another example, tic-tac-toe and the “game of 15” are isomorphic. In the game of 15, players take turns saying a number between 1 and 9. Numbers may not be repeated. Both players aim to say three numbers that add up to 15. Although perhaps not obvious, the defining characteristics of this number game are identical to those of tic-tac-toe. It turns out that both games are based on the well-known (to mathematicians) magic square.

The GCL binary representation and the genetic code are also isomorphic systems (sets). So, characteristics that are true of the GCL binary representation must also be true of the genetic code. (See here and here for more details of isomorphic systems.)

What are the characteristics of the GCL binary representation? The European researchers noted that this mathematical model exhibits:2

  • Palindromic symmetry
  • Parity symmetry
  • Organized redundancy
  • A rich mathematical structure

Such elegant symmetry, organization, and structure speak of a code that has been designed for a purpose—no mere afterthought of evolutionary chance events.

Also, the GCL binary representation makes possible the existence of error detection/correction codes that operate along the strands of DNA. A parity code (as discussed in part 3 of this series) is one example of such a technique.

If a parity code or similar technique functions along strands of DNA using the GCL binary representation, then dependence must exist in the genetic data along the strand. In other words, the data must be correlated to some degree.

Assuming the GCL binary representation and using two different robust statistical analysis methods, the research team discovered significant short- and long-range correlation peaks in actual DNA sequences. These results confirm that actual DNA sequences using this specific model satisfy a basic prerequisite for such error-minimizing techniques.

The team noted that “an error-control mechanism implies the organization of the redundancy in a mathematically structured way,” and that “[t]he genetic code exhibits a strong mathematical structure that is difficult to put in relation with biological advantages other than error correction.”

Thus, scientific advance has uncovered a peculiar and unique mathematical model that accounts for the key properties of the genetic code. This model exhibits symmetry, organized redundancy, and a mathematical structure that would be vital for the existence of error-coding techniques operating along the DNA strands. Actual DNA data tested using this model gives a strong hint that such further error-coding techniques may very well exist and provides impetus for future study in this area. Purposeful creation seems a reasonable conclusion.

Part 5 (the last entry) of this series will discuss the impact of these analogies on William Paley’s watchmaker argument.

Notes/References:

  1. It describes exactly the 1st level of degeneracy (i.e., redundancy) in the natural genetic code (i.e., the number of codons which map to specific amino acids), and gives deep insight into the 2nd level of degeneracy (i.e., the association between specific codons and specific amino acids).

  2. A full treatment of how the GCL binary representation displays these features is very technical and difficult to communicate without active use of visual aids. Nevertheless, this note is a brief attempt. See here for the necessary visual aid. Table 3 shows the actual GCL binary representation of the natural genetic code. The numerical elements in the model are associated with biochemical elements in the genetic code. The whole numbers 0–23 along the table edges map to amino acids. The 6-bit binary strings map to codons. The degeneracy number is shown in the center of the table, along with the amino acids coded for. The degeneracy number indicates how many different codons code for each particular amino acid. A careful inspection will show that all 64 codons are present, along with all 20 amino acids. Palindromic symmetry is seen as a reflection through the middle of the table in a left to right fashion. Palindromic amino acids are always associated in pairs. For example, tryptophan (Trp) and methionine (Met) are a pair of palindromic amino acids. Note that if the 6-bit codewords are folded on top of each other through the middle of the table, they form bitwise negated pairs, reflecting a very peculiar mathematical structure for palindromic symmetry. Also note the symmetry evident in the parity, again through the middle of the table. The light entries are odd parity (odd number of 1’s) and the dark entries are even parity (even number of 1’s). The interested reader is referred to the journal paper for further details.

Error Control Coding in Biology Implies Design, Part 3 (of 5)

Friday, December 5th, 2008

Keith McPherson

Photo of KeithMcPhersonKeith McPherson received his Master of Science in Electrical Engineering from Georgia Institute of Technology in 1993, and currently works as an electrical engineer in Melbourne, FL, in the fields of communications and signal processing.


   ------------------------------------

Parts 1 and 2 of this series observed that biological genetic systems function as information-processing systems, and a case was made for coding techniques that protect the genetic data. As a specific example, the genetic code appears designed to minimize the effects of errors in a way that is directly analogous to Gray codes. Gray codes are commonly used by engineers to protect data processed by many modern digital communications systems.

We now turn our attention to another analogy.

Analogy: Complementary Base Pairing Parity Code

A useful concept to have in mind to better appreciate this analogy is Hamming distance.1 Good codes reduce the probability of error by increasing the minimum Hamming distance between codewords relative to the distance that would have been obtained if no code (or a less powerful code) was used. This minimum distance of a code is like the weakest link in a chain and, therefore, characterizes the code’s strength.

As the code’s minimum distance is increased, it is easier for a recipient to detect a message with no errors. Consider a scavenger hunt game where you “hide” something for a toddler to find, for example, a large blue ball among a pile of smaller white balls. The intent is to make the target blue ball stand out among the others. This exercise is roughly analogous to high Hamming distance. The large blue ball (the intended message) among a number of smaller white balls (errors) exhibits a relatively large dissimilarity, and so the blue ball is easily visible and detectable.

On the other hand, as the code’s minimum distance is decreased the recipient is more likely to make errors in message detection. Now consider an adult scavenger hunt where objects are “hidden” in plain sight because they blend in so nicely in their surroundings. It is difficult to “see” an object you are looking for, even if you are staring straight at it. The intent here is to make the object blend in with the objects around it. This is roughly analogous to low Hamming distance. The target object (the intended message) is hidden among very similar objects (errors) and exhibits a relatively large similarity. Thus, the desired object is not easily visible and detectable.

In digital communications perhaps one of the simplest examples of an error-detecting code (see here and here) is an even parity code. This code is used on a binary message frame (i.e., a sequence of binary digits, 1’s and 0’s). In this code, one parity bit is added to a message frame and its value is chosen to “round” the frame “value” out, to make the message stand out more among the possibilities by increasing the code’s minimum distance (like the example with the blue ball). This allows the recipient to more easily detect that an error has occurred if it detects a “non-round value” (i.e., a white ball). Values that are “round” or “non-round” have precise mathematical definitions in coding theory. The main point is that all parity codes, and the even parity code in particular, impart a precise mathematical structure to the protected (coded) data. This mathematical structure increases the minimum distance between valid codewords and allows for more robust error detection. (See here for more information on parity codes used in engineering.)

Recall that the DNA is a double-strand structure, specifically a double helix. And the four nucleotide bases in the DNA chemical alphabet are A, C, G, and T. Nucleotides A and T are complementary, as are G and C, and these pairings are the basis for the double-stranded structure, where each strand carries the same information as the other strand. Research into the chemical bonds at work between these complementary base pairs reveals that the natural nucleotide alphabet has been chosen to minimize the probability that a given nucleotide on one strand will be incorrectly paired with a partner on the opposite strand. More specifically, a researcher found that the nucleotides used for the DNA chemical alphabet actually form an even parity code. (See research work here and here.)

A convention was used to consistently assign binary values (i.e., 1 or 0) to certain features associated with the four nucleotides that comprise the chemical alphabet in DNA. The relevant features are the relative size of the nucleotide, and its donor-acceptor pattern. The donor-acceptor pattern is relevant for hydrogen bonding. Hydrogen bonds are formed between a nucleotide and its partner on the opposite strand. Careful observation of the resulting binary values reveals that they form an even bit parity code. To be more precise, the relative size of a nucleotide is related to its hydrogen donor-acceptor (D/A) pattern as a parity bit.2

In nature, there are actually 16 nucleotides. Why did nature settle on these specific four, and why only four? At first glance, one may impugn a designer’s inefficiency because there are more nucleotides that could have been used to increase the size of the alphabet, leading to a more efficient genetic code and protein synthesis mechanism. In fact, there has been speculation along these lines. The researcher used the same binary convention to determine the binary representation for the other 12 nucleotides. Upon close inspection, he found that the 16 total nucleotides can be arranged using this framework as eight belonging to the even parity set, and eight belonging to the odd parity set, where the natural alphabet uses a subset of the even parity nucleotides. “Nucleotide Hamming distance” is maximized as a result, as is typical for parity codes, leading to a robust mechanism for error minimization.

The resulting specific four nucleotides emerge as optimal. From this perspective, the genetic machinery is directly analogous to a 1-bit, even-parity code decoder as used routinely in engineering applications. Such a decoder is the optimal way to recover the intended message when a parity code has been used.

The parity code model is an interpretation that readily flows from the relevant chemical bonds that bind the complementary nucleotide pairs. It is a way to mathematically represent or express what is happening at a chemical level. The researcher comments that:

The purine-pyrimidine and hydrogen donor-acceptor patterns governing nucleotide recognition are shown to correspond formally to a digital error-detecting (parity) code, suggesting that factors other than physiochemical issues alone shaped the natural nucleotide alphabet…When this error-coding approach is coupled with chemical constraints, the natural alphabet of A, C, G, and T emerges as the optimal solution for nucleotides.3

In summary, we have seen that an error-detecting code (a parity code) is at work to minimize incorrect bonding between nucleotide pairs on the complementary strands of DNA. For DNA replication to be accurate it is critical that the strands be the true complement of each other. We furthermore note that the specific code used by DNA, an even parity code, is a mainstay in modern communications systems.

The next article in this series will examine another coding analogy between modern digital communications systems and the genetic information-processing system.

Notes/References:

  1. See here and here for more information on Hamming distance.

  2. Refer to Figure 1 here. A and G are larger nucleotides called purines, C and T are smaller nucleotides called pyrimidines. Lone pairs are rich in electrons and participate in weak bonding with hydrogen atoms to form hydrogen bonds between complementary pairs. Hydrogen atoms are also referred to as hydrogen donors, and lone pairs as hydrogen acceptors. A binary “1” was assigned for hydrogen (i.e., hydrogen donors, D). A binary “0” was assigned to “lone pairs” (i.e., hydrogen acceptors, A). A binary “1” was assigned to the smaller nucleotides (i.e., pyrimidines). A binary “0” was assigned to the larger nucleotides (i.e., purines).

  3. Dónall A. Mac Dónaill, “A Parity Code interpretation of Nucleotide Alphabet Composition,” ChemComm 18 (2002): 2062-63.

Error Control Coding in Biology Implies Design, Part 2 (of 5)

Friday, November 28th, 2008

Keith McPherson

Photo of KeithMcPhersonKeith McPherson received his Master of Science in Electrical Engineering from Georgia Institute of Technology in 1993, and currently works as an electrical engineer in Melbourne, FL, in the fields of communications and signal processing.

      -------------------------------

In part 1 of this series we learned how the genetic system is an information-processing system, and outlined several reasons why we could expect to find coding techniques in play to protect the genetic data. Such coding techniques are known and used by engineers to protect the data processed by many modern digital communications systems.

We now turn our attention to a few analogies of such coding techniques.

Analogy: Optimality of the Genetic Code and Gray Mapping

The first analogy will show from a qualitative and quantitative perspective that the genetic code is in fact an optimal (or near optimal) mapping from codons to amino acids. (See here for a table describing the genetic code.)

The genetic code seems optimized to the specific nucleotide error probabilities quite well, as is the case for a good code from an engineering perspective. For example, the first and third nucleotides of a codon (see here) are more likely to be misread during translation, and this error appears to be taken into account in the genetic code mapping. These most common errors, or mutations, translate the desired codon into a codon that codes either for the same amino acid, or for an amino acid that has very similar physicochemical properties, thus minimizing function loss. This is similar to Gray codes used in digital communications.

More specifically, the genetic code seems to be specifically designed to code for the same or very similar amino acids for the most common types of substitution mutations (errors), thereby minimizing protein function loss. In like fashion, Gray codes used in engineering are specifically designed to code for the most similar bit patterns for the most common types of symbol errors, thereby minimizing information loss.

I noticed the similarity to Gray coding after reading a paper by Dr. Fazale Rana in 2002. The Gray code interpretation was highlighted by Manish Gupta in a paper published in 2006. Gupta plotted the 64 codons used in the genetic code in terms of nucleotide distance (see Figure 3 here), and remarked on the correspondence to Gray codes used in engineering. The concept of nucleotide distance and the illustrated plot establishes the validity of the Gray map interpretation.

Recall from part 1 that many genetic code mappings are possible due to the high level of redundancy. Therefore, from a qualitative perspective, and from an engineering perspective, the genetic code is superb, perhaps much better than one may expect from a naturalistic perspective.

Recent work shows just how remarkable the natural code is. (See here and here.) Researchers studying the error-minimizing properties of the genetic code noticed that prior work concluded that the natural code ranked in the top 0.02 percent for efficiency, but that the prior work overlooked bias in mutations.1 When this bias is taken into account, the natural code makes a radical leap forward from the top 0.02 percent to literally one in a million.

Dr. Fazale Rana comments further on the error-minimizing properties of the genetic code:

The genetic code’s error-minimization properties are actually more dramatic than these results indicate. When researchers calculated the error-minimization capacity of one million randomly generated genetic codes, they discovered that the error-minimization values formed a distribution where the naturally occurring genetic code’s capacity occurred outside the distribution. Researchers estimate the existence of 1018 possible genetic codes possessing the same type and degree of redundancy as the universal genetic code. All of these codes fall within the error-minimization distribution. This finding means that of 1018 possible genetic codes, few, if any, have an error-minimization capacity that approaches the code found universally in nature. 2

In summary, qualitative and quantitative evidence suggests that the natural genetic code is highly optimized and, in fact, tuned to the most common type of errors (mutations). In addition, this work highlights an underlying analogy between the genetic system and modern communications systems—the so-called Gray code.

The next article in this series will look at another coding analogy between modern digital communications systems and the genetic information-processing system.

Notes/References:

  1. Bias includes the fact that not all codons are equally mistranslated to other codons, and that certain nucleotide positions within the codon are more prone to error. Purine/purine and pyrimidine/pyrimidine errors (transition mutations) are more common than purine/pyrimidine errors (transversion mutations), and the ranking of the positions is 3rd, 1st, and 2nd in terms of being more error prone.

  2. Fazale Rana, “FYI: I.D. in DNA; Deciphering Design in the Genetic Code,” Facts for Faith, Quarter 1, 2002, 14-23.