Data visualisation is a key piece of the analysis process. At Mango, we consider the ability to create compelling visualisations to be sufficiently important that we include it as one of the core attributes of a data scientist on our data science radar.
Visualisation is important for communicating the results of an analysis to stakeholders, but it also forms a crucial part of the exploratory process. In this stage of analysis, the basic characteristics of the data are examined and explored.
The real value of data analyses lies in accurate insights, and mistakes in this early stage can lead to the realisation of the favourite adage of many statistics and computer science professors: “garbage in, garbage out”.
Whilst it can be tempting to jump straight into fitting complex models to the data, overlooking exploratory data analysis can mean violating the assumptions of the model being fitted, which reduces the accuracy and usefulness of any conclusions drawn later.
This point was demonstrated in a beautifully simple way by statistician Francis Anscombe, who in 1973 designed a set of four small datasets, each showing a distinct pattern of results. Although the four datasets comprising Anscombe’s Quartet have identical or near-identical means, variances, correlations between variables, and linear regression lines, they look very different when plotted, which highlights the inadequacy of relying on simple summary statistics alone in exploratory data analysis.
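To make the point concrete, the summary statistics can be computed directly. The following is a minimal sketch in Python rather than the R used for the Shiny app; the data are the values from Anscombe's 1973 quartet, while the names `QUARTET`, `summarise` and `SUMMARIES` are our own and purely illustrative.

```python
from math import sqrt

# Anscombe's quartet: datasets I to III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(x, y):
    """Return means, sample variances, correlation and least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squared deviations and cross-products about the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": sxx / (n - 1),
        "var_y": syy / (n - 1),
        "correlation": sxy / sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarise(x, y) for name, (x, y) in QUARTET.items()}

for name, stats in SUMMARIES.items():
    rounded = {key: round(value, 2) for key, value in stats.items()}
    print(name, rounded)
```

Printed to two decimal places, the four rows are very nearly indistinguishable, with each fitted line coming out at roughly y = 3 + 0.5x. Nothing in this output hints at the curve, the outlier or the single high-leverage point that a scatter plot of each dataset reveals.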
The accompanying Shiny app allows you to view various aspects of each of the four datasets. The beauty of Shiny’s interactive nature is that you can quickly switch between the datasets to get an in-depth understanding of their similarities and differences.
The code for the Shiny app is available on GitHub.