"By information I mean the specification of the amino acid sequence in the protein... Information means here the precise determination of sequence, either of bases in the nucleic acid or in amino acid residues in the protein."
- Francis Crick
Yesterday, I got hit by a bus. A big, fat, red, speeding bus. I'll try to describe what that felt like.
I have always thought of the genome of any organism as merely a sequence of characters akin to a human language. I fear that this perspective is astronomically simplistic and barely scratches the surface of what DNA really is. The analogy breaks down because DNA encodes (and it really does encode, not just code) the instructions required mainly to synthesise protein products. Therefore, unlike its linguistic counterpart, it has a physicality that is wholly unaccounted for. In fact, I fear that the comparison detracts from DNA, because we are tempted to conclude that it is inferior due to its lack of semantic value. However, I would argue that this absence of semantics masks the presence of something that language lacks entirely: physicality. It is this property that confers on DNA vastly more intelligence than I at first supposed.
Sequences of characters used to convey meaning are purely imaginary. They have no physicality at all. With enough effort and persuasion it is possible to completely restructure human literal language in such a way as to be unrecognisable from its prior form. This is because the nature of literal language is bound by convention: society (an emergent entity, according to F. Hayek) agrees on the use of a set of symbols which are paired with their mental counterparts. The mental language therefore acquires an external (non-mental) but still non-physical representation. Regardless, this external form has no material consequences; it is merely representational. The best illustration of how empty this representation can be is the Rosetta Stone, which was instrumental in deciphering 'dead' languages: once the convention is forgotten, the symbols alone convey nothing.
But DNA is a physical language. It has physical consequences. The sequence of bases is essential for material processes, so much so that a single point mutation can completely ablate the intended function. The closest analogy for how this works is 3D printing.
3D printing works by converting digital representations into physical objects by assembling the final result one layer at a time. The 3D printer converts the 3D object into a sequence of contours, which it then proceeds to trace in physical space using some malleable medium. Most 3D printers can print objects stored in a file format called stereolithography format or STL. STL files are very simple: they store the object using 3D points arranged as a sequence of triangles. However, the comparison to DNA breaks down because of the intermediate processing step which creates the contours for actual printing. With DNA there is no such step; the sequences are simply read as-is and downstream processing modifies the product into the final form.
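To make the 'sequence of triangles' concrete, here is a minimal Python sketch that writes a tetrahedron in the ASCII flavour of STL. The function names (`facet_normal`, `to_ascii_stl`) and the tetrahedron itself are my own illustration and not part of any library; real-world files are often stored in the binary variant of the format instead.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]
Triangle = Tuple[Point, Point, Point]


def facet_normal(a: Point, b: Point, c: Point) -> Point:
    """Unit normal of triangle (a, b, c) via the cross product (b - a) x (c - a)."""
    u = tuple(b[i] - a[i] for i in range(3))
    v = tuple(c[i] - a[i] for i in range(3))
    n = (
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    )
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:  # degenerate triangle: no meaningful normal
        return (0.0, 0.0, 0.0)
    return tuple(x / length for x in n)


def to_ascii_stl(name: str, triangles: List[Triangle]) -> str:
    """Serialise triangles as ASCII STL text (hypothetical helper for illustration)."""
    lines = [f"solid {name}"]
    for a, b, c in triangles:
        nx, ny, nz = facet_normal(a, b, c)
        lines.append(f"  facet normal {nx:.6g} {ny:.6g} {nz:.6g}")
        lines.append("    outer loop")
        for x, y, z in (a, b, c):
            lines.append(f"      vertex {x:.6g} {y:.6g} {z:.6g}")
        lines.append("    endloop")
        lines.append("  endfacet")
    lines.append(f"endsolid {name}")
    return "\n".join(lines) + "\n"


# Four corner points and the four faces of a tetrahedron,
# ordered so that each normal points away from the solid.
O, X, Y, Z = (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)
TETRAHEDRON: List[Triangle] = [(O, Y, X), (O, X, Z), (O, Z, Y), (X, Y, Z)]

if __name__ == "__main__":
    print(to_ascii_stl("tetra", TETRAHEDRON))
```

Nothing in this text tells the printer how to move. The slicing into contours happens afterwards, and that is the intermediate step which has no counterpart in DNA.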
As a physical language, the genomic sequence conveys a set of intended physical interactions; that is, once the final product (proteomic or otherwise) is complete, its physical form exhibits a set of physical behaviours through which it mediates its action. Every biophysical event is the result of such spatiotemporal interactions, which must occur with atomic precision in order to complete successfully. This ranges from simple linear proteins to gigantic macromolecular complexes consisting of tens to thousands of proteomic and non-proteomic actors. For example, the spliceosome, a macromolecular machine which excises unwanted subsequences (introns) from RNA transcripts of the genome, assembles and disassembles at its reaction sites in such a highly 'coordinated' fashion, and with such surprising speed and regularity, that it behaves like magic.
This is a good point at which to highlight that the final function a genomic sequence performs requires the encoded content to be decoded. For a protein-coding transcript (the working copy made from the gene) the main decoding involves interpreting the quaternary sequence of bases (A, C, G and T in DNA, with U replacing T in the RNA transcript) as an amino acid sequence, read three bases at a time. The resulting sequence is constructed from an alphabet of 20 'characters'. Within this arrangement are well-defined functional domains which exert the required behaviour. Furthermore, the precise sequence confers upon the protein the ability to somehow fold into its three-dimensional functional form and work correctly. It may also need to coalesce with many other similar proteins to be useful. To my untrained mind, these functional domains may actually constitute the 'useful words' that bear the encoded physical meaning.
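The decoding step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it uses the standard genetic code with one-letter amino acid codes, assumes the sequence is already in the correct reading frame, and stops at the first stop codon. The names `CODON_TABLE` and `translate` are mine, not from any library.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# '*' marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(sequence: str) -> str:
    """Translate a coding sequence, three bases at a time, until a stop codon.

    Accepts DNA (T) or RNA (U) letters. Any other character raises KeyError.
    """
    seq = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":  # stop codon: the product ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCAAGTAA"))  # MAK
    print(translate("ATGGCCTAGTAA"))  # MA (one base changed: AAG became TAG)
```

The second call shows what a point mutation can do: a single changed base turns a lysine codon into a stop codon and the product is cut short.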
The main point here is that the processing pipeline, in order to deliver the final useful protein product (or whatever else is produced), needs to act faithfully without ever really knowing what the eventual product will be. For example, splicing must be precise even if there is nothing functional immediately present in the unspliced form. While we can readily discern high-frequency sequence patterns (visualised as sequence logos) at splice sites, these are statistical tendencies rather than exact rules, meaning that the splicing machinery must somehow overcome potential errors and still deliver useful products. A useful analogy would be meaningfully editing the base64-encoded characters of a zipped binary stream, an almost impossible task. And yet the subcellular (not even cellular) machinery is able to accomplish this, not as an end in itself, but merely as a means to support life capable of observing it all.
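The base64 analogy can be tried out directly. The sketch below, using Python's standard `zlib` and `base64` modules, changes each character of the encoded text in turn and counts how many of those edits still decompress to the original message. The function name `survey_single_edits` and the sample sentence are my own choices, and it only tests blind edits, which is a far weaker challenge than a meaningful one.

```python
import base64
import zlib
from typing import Tuple


def survey_single_edits(payload: bytes) -> Tuple[int, int]:
    """Edit each base64 character once; return (intact, broken) counts.

    'intact' means the edited text still decompresses to the original payload.
    'broken' means it fails to decode or decompress, or yields different bytes.
    """
    encoded = base64.b64encode(zlib.compress(payload)).decode("ascii")
    intact = broken = 0
    for i, ch in enumerate(encoded):
        if ch == "=":  # padding carries no data, so skip it
            continue
        replacement = "A" if ch != "A" else "B"
        edited = encoded[:i] + replacement + encoded[i + 1:]
        try:
            result = zlib.decompress(base64.b64decode(edited))
        except (zlib.error, ValueError):
            broken += 1
            continue
        if result == payload:
            intact += 1
        else:
            broken += 1
    return intact, broken


if __name__ == "__main__":
    message = b"The quick brown fox jumps over the lazy dog"
    intact, broken = survey_single_edits(message)
    print(f"edits that left the message intact: {intact}")
    print(f"edits that broke it: {broken}")
```

I would expect nearly every edit to land in the broken column, since the compressed stream carries a checksum over its contents. Making an edit that changes the message in a chosen, useful way is harder still.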
If that were all there was to it, it would be remarkable. But we have hardly scratched the surface. Every DNA processing machine is also encoded in DNA. In other words, the physical instructions used to produce protein X, as well as the physical instructions used to produce the machines that produce protein X, are held in the same source. DNA as an information library also stores the instructions on how to build the library. It is a self-referencing and self-contained source of information.
DNA is not merely a 'language of the cell', for by saying this we severely trivialise it: it encodes not just information but physically contingent, self-manufacturing instructions in several layers of complexity. In other words, DNA is a superlanguage unlike anything we have in all creation.
I believe I have barely articulated what being hit by a bus really means, but I hope that in the shadows of this article you can at least spy a sense of what it feels like.