Data Science vs. Statistics

Posted 7 months, 3 weeks ago | Originally written on 2 Oct 2023

Data science has become such a prominent field that it is almost impossible to get a definition that resonates across multiple sources. Typically, there will be a mention of statistics, data, coding, machine learning and, quite often, AI, but occasionally sources toss in 'scientific computing' and 'the scientific method' for good measure. It seems to me that we have yet to reach a clear consensus on this, and the obvious consequence is that the field's impact will be blunted.

I believe the emphasis should be on 'data', as the name suggests. But we have to be careful, because there are several types of data, each of which has a bearing on how one works with it. I distinguish between three main types of data:

  • media: data intended for perception (visual, aural, haptic etc.);
  • measurements: digital estimates of analog physical quantities (air pressure, wind speed, altitude, temperature etc.);
  • sequences: collations of symbols (text, DNA etc.).

I cannot think of any other type of data, but if one comes to mind, please drop me a line.

Next, we need to contrast the work that would be done by a statistician with that done by a data scientist.

The role of a statistician is to impose specific theoretical frameworks on either a set of observations (usually measurements) or the results of a controlled experiment in order to infer something about a population. It is assumed to be infeasible to perform the measurement or experiment on the whole population, so an appropriate (often random) sample is assessed for the measures of interest instead. Consequently, the statistician will always attach measures of confidence to the resulting statistics.
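To make that workflow concrete, here is a minimal sketch in Python using only the standard library. The population, its parameters and the sample size are all invented for illustration: we draw a random sample from a population assumed too large to measure exhaustively, estimate its mean, and attach a 95% confidence interval (using the normal approximation) to that estimate.

```python
import math
import random
import statistics

random.seed(42)  # for reproducibility

# A hypothetical population we pretend is infeasible to measure exhaustively.
population = [random.gauss(mu=20.0, sigma=5.0) for _ in range(100_000)]

# The statistician draws a random sample...
sample = random.sample(population, k=100)

# ...and reports an estimate together with a measure of confidence.
mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(len(sample))  # standard error
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem   # 95% CI, normal approx.

print(f"estimated mean: {mean:.2f} (95% CI: {ci_low:.2f} to {ci_high:.2f})")
```

The point is not the arithmetic but the posture: the sample stands in for the population, and the answer is never just a number, it is a number plus an admission of uncertainty.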

However, with data science, practitioners often have access to such vast collections of data that they can be assumed to be complete. Such aggregations of 'big data' can provide new types of insights, which can be more reliable and obviate the need for sampling. For this reason, the tools employed to handle such data must be automated (think, programmable), the models can scale far beyond what any single mind can grasp, and their insights can be remarkably accurate. For example, training a deep learning algorithm on every available book is no longer a statistical exercise, in much the same way that a census accurately accounting for every inhabitant of a country is not one.

In data science proper, practitioners are at liberty to dispense with classical statistical techniques in favour of machine learning, whereby appropriate algorithms exploit the vast swathes of data to learn any inherent relationships, which can then be applied to future incarnations of the same data.

This is the data science approach.