2.1 Data Exploration

Until very recently, most data problems in human endeavours have been linked to engineering (that is to say, to the design of objects and machines) and to the sciences (namely, the formulation of theories and falsification of hypotheses).

For instance, engineers may equip their machines with sensors and use the data that they collect to assess and evaluate the machines’ behaviours under various controlled conditions and, ultimately, to improve their functionality.

Scientists, on the other hand, typically collect data through experimental design to test the validity of their theories. But scientific experiments are expensive,2 and they generate relatively few data points.

As data scientist Damian Mingle puts it, however, modern data analysis is a different beast:

Discovery is no longer limited by the collection and processing of data, but rather management, analysis, and visualization. [22]

In the 21\(^{\text{st}}\) century, not only is there more data to collect and analyze, but it overwhelmingly comes in a digital format (as opposed to the traditional analog paper format) and is mostly derived from observations (rather than generated by designed experiments). Data problems are still solved empirically, theoretically, and through computation and simulation, as has been the case historically,3 but also via data exploration and data visualization.

So what can actually be done with the data, once it has been collected and processed? We think of

  • analysis as the collection of processes by which we extract actionable insights from the data, and

  • visualization as the process of presenting data, calculations, and analysis outputs in a visual format.

Visualization of data prior to analysis (data exploration) can help simplify the analytical process; visualization following analysis (communication) allows for the analysis results to be presented to various stakeholders (see Figure 2.1).

In this chapter, we focus on the role of data visualization prior to analysis and on how to represent multi-dimensional observations on 2D surfaces (such as a poster or the pages of a report or dashboard).


Figure 2.1: The (messy) analytical process is iterative and allows for multiple false starts. Visualization plays a major role in the data exploration and communication stages.

2.1.1 Pre-Analysis Uses

Prior to the analysis of the data proper, it is paramount for the data to be explored and for basic questions to be asked (and answered):

  • what system does the data represent, in terms of objects, attributes, relationships?

  • how does it represent this system? in other words, what is the data model?

  • where does the data come from? who collected it and processed it? when did this take place? for what purpose?

  • assuming that the data comes in a flat file format, what do the rows represent? what about the columns?

  • is there enough information (or metadata) to answer these questions? where could more information be found?

In the data exploration context, data visualization is used to set the stage by helping analysts:

  • detect invalid entries and outliers;

  • shape the data transformations (binning, standardization, Box-Cox transformations, dimension reduction, etc.);

  • get a sense for the data (data analysis as an art form, exploratory analysis), and

  • identify hidden data structures (clustering, associations, patterns which may inform the next stage of analysis, etc.).
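Some of the transformations mentioned above can be sketched in base R; the vector x below is synthetic and purely illustrative (it is not drawn from any dataset discussed in this chapter):

```r
# synthetic sample, for illustration only
x <- c(0.5, 1.2, 3.8, 7.4, 15.9, 42.0)

# standardization: rescale to mean 0 and standard deviation 1
z <- (x - mean(x)) / sd(x)

# binning: assign each value to one of 3 equal-width intervals
bins <- cut(x, breaks = 3, labels = c("low", "mid", "high"))

# log transform, the lambda = 0 member of the Box-Cox family
x_log <- log(x)
```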



Figure 2.2: Pre-analysis uses of data visualization.

2.1.2 Data Exploration in Action: the Algae Bloom Dataset

Consider the algae blooms dataset consisting of 4 variables (Cl, NO3, NH4, season) and 340 observations, found in the UCI Machine Learning Repository [24], [25].

algae_blooms <- read.csv("Data/algae_blooms.csv", stringsAsFactors = TRUE)
algae_blooms <- algae_blooms[,c("season","Cl","NO3","NH4")]
str(algae_blooms)
'data.frame':   340 obs. of  4 variables:
 $ season: Factor w/ 4 levels "autumn","spring",..: 4 2 1 2 1 4 3 1 4 4 ...
 $ Cl    : num  60.8 57.8 40 77.4 55.4 ...
 $ NO3   : num  6.24 1.29 5.33 2.3 10.42 ...
 $ NH4   : num  578 370 346.7 98.2 233.7 ...

The first few observations are as follows:

season      Cl     NO3      NH4
winter  60.800   6.238  578.000
spring  57.750   1.288  370.000
autumn  40.020   5.330  346.667
spring  77.364   2.302   98.182
autumn  55.350  10.416  233.700
winter  65.750   9.248  430.000

The dataset is summarized below:

    season         Cl                NO3               NH4
 autumn:80   Min.   :  0.222   Min.   : 0.000   Min.   :    5.00
 spring:84   1st Qu.: 10.994   1st Qu.: 1.147   1st Qu.:   37.86
 summer:86   Median : 32.470   Median : 2.356   Median :  107.36
 winter:90   Mean   : 42.517   Mean   : 3.121   Mean   :  471.73
             3rd Qu.: 57.750   3rd Qu.: 4.147   3rd Qu.:  244.90
             Max.   :391.500   Max.   :45.650   Max.   :24064.00
             NA's   :16        NA's   : 2       NA's   :    2

Is it possible to determine what system the data represent from the summary alone? Or where it comes from, why it was collected, and so forth? One would need a fair amount of clairvoyance to answer these questions without metadata. Give it a try now; what can you come up with?

As it happens, the algae blooms dataset is a collection of chemical, biological, and physical characteristics related to samples of European rivers taken over a one-year period (prior to 1999), with the goal of “protecting rivers and streams by monitoring chemical concentrations and algae communities” [25].

With this context in hand, we understand that Cl, NO3, and NH4 represent, respectively, the concentrations of chloride, nitrate, and ammonium in various European river samples, collected over the four seasons of a one-year period.

The numerical summary above provides us with a number of items of interest:

  1. the distribution of samples during the year seems fairly uniform, with nearly a quarter of the samples collected in each season;

  2. there are, respectively, 16, 2, and 2 observations for which the levels of Cl, NO3, and NH4 are unavailable (although no information is available to indicate whether any of the observations have multiple missing values);

  3. all available measurements are non-negative, as befits concentration levels for various chemical compounds;

  4. the measurement ranges for each numerical variable have different magnitudes (\(\approx\) 400 for Cl, 50 for NO3, and 25000 for NH4);

  5. the jump between the 3rd quartile and the maximum measurement is of one order of magnitude for Cl and NO3, but of two orders of magnitude for NH4;

  6. and so on.
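Point 5 can be verified directly from the quartiles and maxima reported in the numerical summary:

```r
# ratio of the maximum to the 3rd quartile, per variable,
# using the values reported in the numerical summary above
round(c(Cl = 391.500/57.750, NO3 = 45.650/4.147, NH4 = 24064.00/244.90))
#>  Cl NO3 NH4
#>   7  11  98
```

The maximum is roughly 7 and 11 times the 3rd quartile for Cl and NO3 (one order of magnitude), but nearly 100 times for NH4 (two orders of magnitude).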

While a fair amount of insight can be derived from that particular numerical summary, a number of questions remain unanswered. Let us explore this dataset further.

Can we get a more sophisticated understanding than the one provided by the numerical summary? In Figure 2.3, for instance, we see that 2 of the dataset's instances have exactly 3 missing values (Cl, NO3, and NH4); the 14 remaining observations with missing values are those for which only Cl is unavailable.

algae_blooms$NAs = is.na(algae_blooms$Cl) + 
  is.na(algae_blooms$NO3) + is.na(algae_blooms$NH4)
algae_blooms$index = 1:nrow(algae_blooms)

ggplot2::ggplot(algae_blooms, ggplot2::aes(x=index, y=NAs)) +
  ggplot2::geom_point(ggplot2::aes(colour=NAs)) + ggplot2::theme_classic() + 
  ggplot2::xlab("Observation") + ggplot2::ylab("Number of missing values")

Figure 2.3: Number of missing values per case in the algae blooms dataset.

The same chart also shows that there are two contiguous blocks of observations with missing values (around observations 66 and 225, roughly speaking). This suggests that there could have been data collection issues (a batch of measurement slips might have been misplaced, or a student intern might have misunderstood how to test the water samples, say), but as no information is available about the process, we can at best offer possible explanations for this pattern, which may, in the final analysis, prove to be nothing more than an artefact of the sorting process.
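One simple way to expose such contiguous blocks programmatically is run-length encoding of the per-observation missing-value counts. The sketch below uses a small synthetic vector (not the real counts) to illustrate the idea:

```r
# synthetic missing-value counts, for illustration only
NAs <- c(0, 0, 1, 1, 1, 0, 0, 0, 3, 3, 0)

# run-length encoding of the "incomplete observation" indicator
runs <- rle(NAs > 0)

# lengths of the runs of incomplete observations
runs$lengths[runs$values]
#> [1] 3 2
```

Two runs of lengths 3 and 2, i.e., two contiguous blocks of incomplete observations.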

We can also expand our understanding of the various measurements by plotting univariate distributions instead of relying on their respective 6-point numerical summaries. For instance, the non-missing values of Cl (of which there are 324) range from \(0.222\) to \(391.500\), with a mean level of \(42.517\) and a median of \(32.470\). What does any of this mean, in practice?

From the value of the median, we know that half the measurements fall between \(0.222\) and \(32.470\), and half fall between \(32.470\) and \(391.500\). The observations in the upper half are spread over a much longer interval than those in the lower half, so we would expect the measurements to be denser in the low-level regime than in the high-level one. This is borne out by the histogram of Cl measurements (Figure 2.4).

ggplot2::ggplot(algae_blooms,ggplot2::aes(x=Cl)) +   
  ggplot2::geom_histogram() + ggplot2::geom_rug() +                                
  ggplot2::xlab("Chloride (Cl-)") + ggplot2::theme_classic()

Figure 2.4: Histogram of Cl levels in the algae blooms dataset (extract), with rug chart.

The numerical summary hints at the presence of outliers in these measurements (since the median is substantially smaller than the mean), and the visual display provides a clearer picture: the measurements above 250 are quite likely to be outliers (either due to measurement errors or because they come from particularly unrepresentative sampling sites). The small clusters of observations around the 150 and 200 marks are also suspicious, but they could simply reflect the reality of the measurements in the field. Perhaps these samples were taken downstream of some chemical factory, say?
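One standard heuristic (not used in the text, but common practice) to corroborate this visual impression is Tukey's 1.5 IQR rule, applied here to the Cl quartiles reported in the numerical summary:

```r
# Cl quartiles, taken from the numerical summary above
q1 <- 10.994
q3 <- 57.750

# Tukey's rule flags values beyond 1.5 interquartile ranges of the quartiles
upper_fence <- q3 + 1.5 * (q3 - q1)
upper_fence
#> [1] 127.884
```

Any Cl measurement above roughly 128 would be flagged as a candidate outlier, which is consistent with treating the values above 250 as suspect.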

ggplot2::ggplot(algae_blooms,ggplot2::aes(x=Cl,fill=season)) +   
  ggplot2::geom_histogram() +  
  ggplot2::facet_wrap(.~season) + 
  ggplot2::xlab("Chloride (Cl-)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms,ggplot2::aes(x=NO3)) +   
  ggplot2::geom_histogram() +                                
  ggplot2::xlab("Nitrate (NO3-)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms,ggplot2::aes(x=NO3,fill=season)) +   
  ggplot2::geom_histogram() +  
  ggplot2::facet_wrap(.~season) + 
  ggplot2::xlab("Nitrate (NO3-)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms,ggplot2::aes(x=NH4)) +   
  ggplot2::geom_histogram() +                                
  ggplot2::xlab("Ammonium (NH4+)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms,ggplot2::aes(x=NH4,fill=season)) +   
  ggplot2::geom_histogram() +  
  ggplot2::facet_wrap(.~season) + 
  ggplot2::xlab("Ammonium (NH4+)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms,ggplot2::aes(size=NH4, x=Cl,y=NO3)) + 
  ggplot2::geom_point() + ggplot2::theme_classic()

algae_blooms.2 = dplyr::filter(algae_blooms,NO3<20 & Cl<250)

ggplot2::ggplot(algae_blooms.2,ggplot2::aes(x=NH4)) +   
  ggplot2::geom_histogram() +                                
  ggplot2::xlab("Ammonium (NH4+)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms.2,ggplot2::aes(x=NH4,fill=season)) +   
  ggplot2::geom_histogram() +  
  ggplot2::facet_wrap(.~season) + 
  ggplot2::xlab("Ammonium (NH4+)") + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms.2,ggplot2::aes(size=NH4, x=Cl,y=NO3)) + 
  ggplot2::geom_point() + ggplot2::theme_classic()

ggplot2::ggplot(algae_blooms.2,ggplot2::aes(size=NH4,x=Cl,y=NO3,fill=season)) +   
  ggplot2::geom_point(pch=21) +  
  ggplot2::facet_wrap(.~season) + ggplot2::theme_classic()

Without observation-specific context, it is nearly impossible to gauge how likely it is that such explanations hold, but at the very least, the charts highlight potential problem areas that any eventual analysis will have to address.


[21] A. Knapp, “How Much Does it Cost to Find a Higgs Boson?” Forbes, Jul. 2012.
[22] D. Mingle (@DamianMingle), Twitter.
[23] Leadership and Success, “Leadership Journey: Richard Feynman.”
[24] D. Dua and C. Graff, “UCI Machine Learning Repository.” University of California, Irvine, School of Information and Computer Sciences, 2017.
[25] J. Strackeljan, “COIL Competition Dataset.” ERUDIT, 1999.

  2. The cost of finding the Higgs boson at CERN’s Large Hadron Collider was estimated at 13.25 billion USD in 2012 [21].↩︎

  3. There were exceptions, of course. Richard Feynman, for instance, was known to “solve problems by putting himself in the place of an atom or an electron, essentially asking himself what he would do if he were an atomic or subatomic particle” [23].↩︎