A Taxonomy of Data

Posted 7 months ago | Originally written on 2 Oct 2023

When we talk about data, we have in mind the raw digital representation associated with one or more events. We must distinguish this from knowledge and information, neither of which I will attempt to define here.

There are three main types of data:

  1. Media.
  2. Measurements.
  3. Sequences.

The type of data has a profound impact on how one works with it. In this article, I am only referring to digital data. Analogue counterparts retain the same distinctions but are handled in analogue ways.

One interesting consequence of this taxonomy is the functional distinction between the different types. Sequences are used to convey knowledge (or information), media are used to convey perceptual experiences, and measurements are used to convey physics. For example, a film is a combination of audiovisual media that audiences consume, but the substance of the film is first encapsulated in the script, score, and song, all of which are sequences. The film production process is an orderly set of tasks used to create both visual and aural recordings (media).

Often, advanced computer algorithms are used to try to extract sequences from media, e.g. OCR, image recognition and speech-to-text. In other words, they try to extract knowledge (information) from perceptual data. More recently, the emergence of new technologies such as large language models (LLMs) has enabled sensible analysis of sequence data itself.

On the other hand, 2D and 3D computer graphics operate on measurements (spaces), which are then converted into media (images) through a rendering process.
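As a rough illustration of that last point, the sketch below takes a handful of measurements (a circle's centre and radius) and converts them into a tiny character "image". The function name `render_circle` and the choice of `#` and `.` as pixels are my own assumptions for the example; real renderers are far more involved.

```python
def render_circle(cx, cy, radius, width, height):
    """Turn measurements (centre and radius) into media (a grid of pixels).

    Hypothetical helper for illustration only. Each pixel is tested at its
    centre point (x + 0.5, y + 0.5) against the circle's equation.
    """
    rows = []
    for y in range(height):
        row = ""
        for x in range(width):
            dx = x + 0.5 - cx
            dy = y + 0.5 - cy
            inside = dx * dx + dy * dy <= radius * radius
            row += "#" if inside else "."
        rows.append(row)
    return rows


# Three numbers describe the shape; the rendered image needs width * height pixels.
image = render_circle(cx=4.0, cy=4.0, radius=3.0, width=8, height=8)
for row in image:
    print(row)
```

The measurements stay compact and exact, while the rendered output only exists to be looked at.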

Media

Media is any form of data destined for perception. In effect, media is used to transfer experiences. Images, video and audio are the best examples of media. I would not treat text as a form of media because there is nothing perceptual about text in and of itself. For example, the word 'two' bears no semantics of the idea of the number 2 in and of itself. We know this implicitly because the idea of 2 is expressed in various forms across what we refer to as languages.

Because media are intended for perception, it makes sense to refer to the fidelity of the data, i.e. the perceptual acuity they support. A high-fidelity image or sound is one which is as close as possible to its analogue counterpart.

There are several challenges in working with media:

  • Fidelity is proportional to bulk. The higher the quality of the data, the more voluminous it is. This has implications for storage, transfer, processing and interaction.
  • The semantic value of the data is in the mind of the percipient. For an image to be semantically processed, e.g. labelling objects in the scene, a human observer would first have to somehow convey what the semantic entities are.
  • They are subject to the perceptual peculiarities of the percipient. For example, I am made to understand that colour does not really exist. It is possible to introduce illusions not present in the data.
  • While media are resilient to lossy compression, they can also suffer from artefacts because of it.
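The trade-off between fidelity and bulk, and the error that lossy storage introduces, can be sketched with a toy example. Here a sine wave stands in for an audio signal and is stored at different bit depths. The helper names `quantise` and `max_error` are my own, and uniform quantisation is only a simple stand-in for real lossy codecs, which work quite differently.

```python
import math


def quantise(samples, bits):
    """Snap samples in [-1.0, 1.0] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)
    return [round((s + 1.0) / step) * step - 1.0 for s in samples]


def max_error(original, approx):
    """Largest absolute difference between the original and its approximation."""
    return max(abs(a - b) for a, b in zip(original, approx))


# One cycle of a sine wave as a stand-in for a recorded signal (an assumption).
signal = [math.sin(2 * math.pi * i / 1000) for i in range(1000)]

for bits in (4, 8, 16):
    approx = quantise(signal, bits)
    storage_bits = len(signal) * bits  # bulk grows linearly with bit depth
    print(bits, storage_bits, max_error(signal, approx))
```

Fewer bits per sample means less to store but a larger deviation from the original, which is the kind of loss that shows up perceptually as artefacts.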

Measurements

Measurements are numerical values, either discrete or continuous, that indicate physical quantities. By this token, measurements are the result of capturing some physical state. They can be as familiar as the count of an entity and as obscure as measures of nuclear effects. As with media, measurements bear the notion of fidelity, but in relation to how representative they are of the physical phenomenon. This fidelity is referred to as accuracy or resolution: the degree to which the measures can capture fine-grained detail. Counting measures have a higher tendency to be accurate but suffer the most when errors do occur.
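A minimal sketch of what resolution means in practice: the same physical quantity recorded by instruments with different smallest steps. The function `measure` is a hypothetical stand-in for an instrument, and I use standard gravity (9.80665 m/s^2) as the quantity being measured.

```python
def measure(true_value, resolution):
    """Simulate an instrument that can only report multiples of `resolution`."""
    return round(true_value / resolution) * resolution


true_value = 9.80665  # standard gravity in m/s^2

for resolution in (1.0, 0.1, 0.001):
    reading = measure(true_value, resolution)
    error = abs(reading - true_value)
    # A finer resolution captures more of the fine-grained detail.
    print(resolution, reading, error)
```

In this simplified model the error of a reading never exceeds half the resolution, so a finer instrument gives a more representative value.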

One interesting measurement is time, which is a fundamental measure usually associated with all other forms of data.

Sequences

Any assembly of symbols in a definite order constitutes a sequence. In contrast to media and measurements, sequences capture logical knowledge, i.e. a sequence of representations about some logical (not physical) state. This applies to text, DNA and proteins, as well as more exotic forms such as hashes and cryptographic keys. The essential property of sequences is their reliance on integrity: any modification can render them into junk. For example, point mutations in genomic sequences can spell the difference between life and death. For this reason, sequences must be preserved exactly even when compressed (lossless compression).
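Both points, sensitivity to a single change and the need for lossless compression, can be sketched with Python's standard `hashlib` and `zlib` modules. The sequence below is an arbitrary string of nucleotide letters chosen for the example, not a real gene, and `hamming_distance` is my own helper.

```python
import hashlib
import zlib


def hamming_distance(a, b):
    """Count the positions at which two equal-length byte strings differ."""
    return sum(x != y for x, y in zip(a, b))


# Arbitrary nucleotide letters, repeated; illustrative only, not a real gene.
original = b"ACGTTGCAAGCTTAGGCTAACGT" * 4

# A single "point mutation": the C at index 10 becomes a T.
mutated = original[:10] + b"T" + original[11:]

print(hamming_distance(original, mutated))
print(hashlib.sha256(original).hexdigest())
print(hashlib.sha256(mutated).hexdigest())

# Lossless compression must give back exactly the same sequence.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)
print(len(original), len(compressed), restored == original)
```

A one-symbol change yields a completely different digest, which is why checksums are a common way to verify that a sequence has survived storage or transfer intact.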

As with media, the semantic content exists outside of the data. Text only means something within a linguistic framework. Similarly, DNA and protein sequences only have meaning biologically, and this occurs independently of the perceived sequence.