5.5 Basic Visualizations in R

Whenever we analyze data, the first thing we should do is look at it:

  • for each variable, what are the most common values?

  • How much variability is present?

  • Are there any unusual observations?

Producing graphics for data analysis is relatively simple. Producing graphics for publication is relatively more complex and typically requires a great deal of tweaking to achieve the desired appearance. Base R provides a number of functions for visualizing data. In this section, we will provide sufficient guidance so that most desired effects can be achieved, but further investigation of the documentation and experimentation will doubtless be necessary for specific needs (advanced visualization functionality is provided by ggplot2).

For now, let’s discuss some of the most common basic methods.

5.5.1 Scatterplots

The most common plotting function in R is the plot() function. It is a generic function, meaning, it calls various methods according to the type of the object passed which is passed to it. In the simplest case, we can pass in a vector and we get a scatter plot of magnitude vs index.

More generally, we can pass in two vectors and a scatter plot of the points formed by matching coordinates is displayed.

For example, the command plot(c(1,2),c(3,5)) would plot the points \((1,3)\) and \((2,5)\).

plot(c(1,2),c(3,5))

Here is a more concrete example showing how to plot the graph of the sine function in the range from \(-\pi\) to \(\pi\).

x <- seq(-pi,pi,0.1)
plot(x, sin(x))

We can add a title to our plot with the parameter main. Similarly, xlab and ylab can be used to label the x-axis and y-axis respectively.

The curve is made up of circular black points. This is the default setting for shape and colour. This can be changed by using the argument type.

It accepts the following strings (with given effect)

  • p – points

  • l – lines

  • b – both points and lines

  • c – empty points joined by lines

  • o – overplotted points and lines

  • s – stair steps

  • h – histogram-like vertical lines

  • n – does not produce any points or lines

Similarly, we can specify the colour using the argument col.

plot(x, sin(x),
 main=expression("Graph of the sine function, over [-" * pi * "," * pi * "]"),
 ylab="sin(x)",
 type="l",
 col="blue")

Calling plot() multiple times will have the effect of plotting the current graph on the same window, replacing the previous one.

However, we may sometimes wish to overlay the plots in order to compare the results. This is made possible by the functions lines() and points(), which add lines and points respectively, to the existing plot.

plot(x, sin(x),
 main="Overlaying Graphs",
 ylab="",
 type="l",
 col="blue")
lines(x,cos(x), col="red")
legend("topleft",
      c("sin(x)","cos(x)"),
      fill=c("blue","red")
)

The legend() function allows for the appropriate display in the plot.

5.5.2 Barplots

Barplots can be created in R using the barplot() function.

We can supply a vector or matrix to this function, and it will display a bar chart with bar heights equal to the magnitude of the elements in the vector.

Let us suppose that we have a vector of maximum temperatures for seven days, as follows.

max.temp <- c(22, 27, 26, 24, 23, 26, 31)

We can make a bar chart out of this data using a simple command.

barplot(max.temp)

This function can take on a number of arguments (details are available in the help file, which can be queried by entering ?barplot in the console).

Frequently-used arguments include:

  • main to specify the title

  • xlab and ylab to provide labels for the axes

  • names.arg to provide a name for each bar

  • col to define colour, etc.

We can also transpose the plot to have horizontal bars by providing the argument horiz = TRUE, such as in the following:

barplot(max.temp,
  main = "Maximum Temperatures in a Week",
  xlab = "Degree Celsius",
  ylab = "Day",
  names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  col = "darkred",
  horiz = TRUE)

At times, we may be interested in displaying the count or magnitude for categories in the data. For instance, consider the following vector of age measurements for 21 first-year college students.

age <- c(22,17,18,18,17,18,19,18,16,18,18,18,18,18,19,19,20,17,18,17,31)

Calling barplot(age) does not provide the required plot.

barplot(age)

Indeed, this call plots 21 bars with heights corresponding to students’ age – the display does not provide the frequency count for each age.

The values can be quickly found using the table() function, as shown below.

table(age)
age
16 17 18 19 20 22 31 
 1  4 10  3  1  1  1 

Plotting this data will produce the required barplot (in the code, the argument densityis used to determine the texture of the fill).

barplot(table(age),
  main="Age Count of 21 Students",
  xlab="Age",
  ylab="Count",
  border="red",
  col="blue",
  density=10
)

The chart respects the relative proportions, but R cannot tell that there should be (empty) categories for age=21 and between 22 and 31.

This can be remedied by first setting new factor levels for the age measurements, and re-running the code.

age=factor(age,levels=c(15:31))
barplot(table(age),
  main="Age Count of 21 Students",
  xlab="Age",
  ylab="Count",
  border="red",
  col="blue",
  density=10
)

5.5.3 Histograms

Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the \(x\)-axis and displaying the frequency of scores in each bin on the \(y\)-axis.

We can create histograms in R with hist():

  • the option freq=FALSE creates a plot based on probability densities rather than frequencies;

  • the breaks option controls the number of bins: the default produces equally spaced breaks when defining the cells of the histogram.

For illustrative purposes, we will use several of the variables from the Motor Trend Car Road Tests (mtcars) dataset provided in the base R installation.

Here are four variations on a histogram.

par(mfrow=c(2,2))
hist(mtcars$mpg)    
hist(mtcars$mpg,
     breaks=12,
     col="red",
     xlab="Miles Per Gallon",
     main="Colored histogram with 12 bins")
hist(mtcars$mpg,
     freq=FALSE,
     breaks=12,
     col="red",
     xlab="Miles Per Gallon",
main="Histogram, rug plot, density curve")
rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col="blue", lwd=2)
x <- mtcars$mpg
h<-hist(x,
        breaks=12,
        col="red",
        xlab="Miles Per Gallon",
        main="Histogram with normal curve and box")
xfit<-seq(min(x), max(x), length=40)
yfit<-dnorm(xfit, mean=mean(x), sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
box()

The first histogram demonstrates the default plot with no specified options: five bins are created, and the default axis labels and titles are printed.

In the second histogram, 12 bins have been specified, as well as a red fill for the bars, and more attractive (enticing?) and informative labels and title.

The third histogram uses the same colours, number of bins, labels, and titles as the previous plot but adds a density curve and rug-plot overlay. The density curve is a kernel density estimate and is described in the next section. It provides a smoother description of the distribution of scores. The call to function lines() overlays this curve in blue (and a width twice the default line thickness); a rug plot is a one-dimensional representation of the actual data values.

The fourth histogram is similar to the second but with a superposed normal curve and a box around the figure. The code for superposing the normal curve comes from a suggestion posted to the R-help mailing list by Peter Dalgaard. The surrounding box is produced by the box() function.

5.5.4 Curves

Given an expression for a function \(y(x)\), we can plot the values of \(y\) for various values of \(x\) in a given range.

This can be accomplished using the R function curve().

We can plot the simple polynomial function \(y=3x^{2}+x\) in the range \(x=[1,10]\) as follows:

curve(3*x^2 + x, from=1, to=10, n=300, xlab="xvalue",
      ylab="yvalue", col="blue", lwd=2, 
      main="Plot of (3x^2 + x)")

The important parameters of the curve() function are:

  • the first parameter is the mathematical expression of the function to plot, written in the format for writing mathematical operations in or LaTeX;

  • two numeric parameters (from and to) that represent the endpoints of the range of \(x\);

  • the integer n that represents the number of equally-spaced values of \(x\) between the from and to points;

  • the other parameters (xlab, ylab, col, lwd, main) have their usual meaning.

5.5.5 Boxplots

A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary:

  • the minimum \(Q_0\);

  • the lower quartile \(Q_1\) (25th percentile);

  • the median \(Q_2\) (50th percentile);

  • the upper quartile \(Q_3\) (75th percentile), and

  • the maximum \(Q_4\).

The belt of the boxplot is found at \(Q_2\), its box boundaries at \(Q_1\) and \(Q_3\), and its whiskers at \(5/2Q_1-3/2Q_3\) and \(5/2Q_3-3/2Q_1\).

boxplot(mtcars$mpg, main="Box plot", ylab="Miles per Gallon")

For normally distributed variables, it also displays observations which may be identified as potential outliers (which it to say, values outside the box-and-whiskers range \([5/2Q_1-3/2Q_3,5/2Q_3-3/2Q_1]\)).

boxplot(c(mtcars$mpg,35), main="Box plot", ylab="Miles per Gallon")

5.5.6 Examples

Let’s take a look at some other basic R plots.

Scatterplots

On a scatter plot, the features to study depend on the scale of interest:

  • for “big” patterns, look at:
    • form and direction
    • strength
  • for “small” patterns, look at:
    • deviations from the pattern
    • outliers

We start with the swiss built-in R dataset.

str(swiss) # structure of the swiss dataset
pairs(swiss) # scatter plot matrix for the swiss dataset

'data.frame':   47 obs. of  6 variables:
 $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
 $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
 $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
 $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
 $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
 $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

Let’s focus on one specific pair: Fertility vs. Education.

# raw plot
plot(swiss$Fertility, swiss$Education)

The same plot can be prettified and made more informative:

# add a title and axis labels
plot(swiss$Fertility, swiss$Education, xlab="Fertility", ylab="Education",
     main="Education vs Fertility (by province), Switzerland, 1888", 
     las=1)

# add the line of best fit (in red)
abline(lm(swiss$Education~swiss$Fertility), col="red", lwd=2.5) 

# add the smoothing lowess curve (in blue)
lines(lowess(swiss$Fertility,swiss$Education), col="blue", lwd=2.5) 

# add a legend
legend(75,50, c("Best Fit","Lowess"), lty=c(1,1),
       lwd=c(2.5,2.5),col=c("red","blue")) 

Compare that graph with the one found below:

plot(swiss$Education, xlab="Province", ylab="Education", 
     main="Education by Province, Switzerland, 1888", las=1)
abline(lm(swiss$Education~row(swiss)[,1]), col="red", lwd=2.5) 
lines(swiss$Education)
lines(lowess(row(swiss)[,1],swiss$Education), col="blue", lwd=2.5) 
legend(5,52, c("Best Fit","Lowess"), lty=c(1,1),
       lwd=c(2.5,2.5),col=c("red","blue")) 

Even though R will not balk at producing this graph, it is a nonsense graph. Why, exactly?

The main take-away of this example is that being able to produce a graph doesn’t guarantee that it will be useful or meaningful in any way.

Let’s do some thing similar for the infamous built-in iris dataset.

str(iris) # structure of the dataset
summary(iris) # information on the distributions for each feature
# ?iris # information on the dataset itself
'data.frame':   150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
       Species  
 setosa    :50  
 versicolor:50  
 virginica :50  
                
                
                

The function pairs provides a scatter plot matrix of the dataset; the option lower.panel=NULL removes the lower panels (due to redundancy).

pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, 
      lower.panel=NULL, labels=c("SL","SW","PL","PW"), 
      font.labels=2, cex.labels=4.5) 

We can compare the sepal width and length variables in a manner similar to what we did with the swiss dataset.

plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal Length", 
     ylab="Sepal Width", main="Sepal Width vs Sepal Length, 
     Anderson's Iris Dataset", las=1,
     bg=c("yellow","black","green")[unclass(iris$Species)])
abline(lm(iris$Sepal.Width~iris$Sepal.Length), col="red", lwd=2.5) 
lines(lowess(iris$Sepal.Length,iris$Sepal.Width), col="blue", lwd=2.5) 
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1),
       lwd=c(2.5,2.5),col=c("red","blue")) 

There does not seem to be a very strong relationship between these variables. What can we say about sepal length and petal length?

plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", 
     ylab="Petal Length", main="Sepal Width vs Petal Length, 
     Anderson's Iris Dataset", las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5) 
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5) 
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1),
       lwd=c(2.5,2.5),col=c("red","blue")) 

Visually, the relationship is striking: the line seems to have a slope of 1! But notice that the axes are unevenly scaled, and have been cutoff away from the origin. The following graph gives a better idea of the situation.

plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", 
     ylab="Petal Length", main="Sepal Width vs Petal Length, 
     Anderson's Iris Dataset", xlim=c(0,8), ylim=c(0,8), las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5) 
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5) 
legend(2,7, c("Best Fit","Lowess"), lty=c(1,1),
       lwd=c(2.5,2.5),col=c("red","blue")) 

A relationship is still present, but it is affine, not linear as could have been guessed by naively looking at the original graph.

Colour can also be used to highlight various data elements. Here, we colour each observation differently according to its species:

plot(iris$Sepal.Length, iris$Sepal.Width, pch=21,
     bg=c("red","green3","blue")[unclass(iris$Species)], 
     main="Anderson's Iris Data -- Sepal Length vs. Sepal Width", xlim=)

This can be done on all scatterplots concurrently using pairs.

pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21, 
      bg = c("red", "green3", "blue")[unclass(iris$Species)],
      lower.panel=NULL, labels=c("SL","SW","PL","PW"), 
      font.labels=2, cex.labels=4.5)

Histograms and Bar Charts

In histograms or frequency charts, users should stay on the lookout for bin size effects. For instance, what does the distribution of the Education variable in the swiss dataset look like?

hist(swiss$Education)   # default number of bins

hist(swiss$Education, breaks=10)   # with 10 bins

hist(swiss$Education, breaks=20)   # with 20 bins

The distribution pattern is distintcly different with 10 and 20 bins. Don’t get too carried away, however: too many bins may end up masking trends if the dataset isn’t large enough.

We can also look for best fits for various parametric distributions, with the default number of bins:

hist(swiss$Education, freq=FALSE, xlab="Education",
     main="Education Distribution, by Province, Switzerland, 1888",
     col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", 
      lwd=4) # fit a gamma distribution
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", 
      lwd=4) # fit an exponential distribution
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), 
      add=TRUE, col="green3", lwd=4) # fit a normal distribution
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
       lwd=c(4,4),col=c("darkblue","black", "green3")) 

We can also view the histograms with 10 bins:

hist(swiss$Education, breaks=10, freq=FALSE,xlab="Education",main="Education 
     Distribution, by Province, Switzerland, 1888", col="firebrick1", 
     ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, 
      col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, 
      col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), 
      add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
       lwd=c(4,4),col=c("darkblue","black", "green3")) 

or with 20 bins:

hist(swiss$Education, breaks=20, freq=FALSE, xlab="Education",
     main="Education Distribution, by Province, Switzerland, 1888",
     col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, 
      col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, 
      col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, 
      col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
       lwd=c(4,4),col=c("darkblue","black", "green3")) 

With a small number of bins, the exponential distribution seems like a good fit, visually. With a larger number of bins, neither of the three families seems particularly well-advised.

And now, can you figure out what is happening with these visualizations of the iris dataset?

hist(iris$Sepal.Length, freq=FALSE, xlab="Sepal.Length",
     main="Sepal.Length Distribution", col="firebrick1", ylim=c(0,0.15))

What happens above if we replace freq=FALSE with freq=TRUE?

Let’s look at another feature, now.

hist(iris$Sepal.Width, freq=FALSE, xlab="Sepal.Width",
     main="Sepal.Width Distribution", col="firebrick1", ylim=c(0,0.15))

Histograms (1D data representations) can be combined with scatterplots (2D data representations) to provide marginal information (this is done through a custom function based on R-blogger.com’s scatterhist() function).

scatterhist = function(x, y, xlab="", ylab=""){
  zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
  layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
  xhist = hist(x, plot=FALSE)
  yhist = hist(y, plot=FALSE)
  top = max(c(xhist$counts, yhist$counts))
  par(mar=c(3,3,1,1))
  plot(x,y)
  par(mar=c(0,3,1,1))
  barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
  par(mar=c(3,0,1,1))
  barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
  par(oma=c(3,3,0,0))
  mtext(xlab, side=1, line=1, outer=TRUE, adj=0, 
    at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
  mtext(ylab, side=2, line=1, outer=TRUE, adj=0, 
    at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}

ds = iris
with(ds, scatterhist(iris$Sepal.Length, iris$Sepal.Width, 
                     xlab="Sepal.Length", ylab="Sepal.Width"))

Bubble Charts

Bubble charts are a neat way to show at least 3 variables on the same 2D display. The location of the bubbles’ centre takes care of 2 variables: the size, colour, and shape of the bubbles can also be used to represent different data elements.

For this first example, we’ll take a look at some Statistics Canada 2011 demographic data regarding Canada’s census metropolitan areas (CMA) and census agglomerations (CA).

can.2011=read.csv("data/Canada2011.csv", head=TRUE) 
head(can.2011) 
str(can.2011) 
summary(can.2011[,3:12], maxsum=13) 
  Geographic.code     Geographic.name Province   Region Type pop_2011
1               1          St. John's       NL Atlantic  CMA   196966
2               5         Bay Roberts       NL Atlantic   CA    10871
3              10 Grand Falls-Windsor       NL Atlantic   CA    13725
4              15        Corner Brook       NL Atlantic   CA    27202
5             105       Charlottetown       PE Atlantic   CA    64487
6             110          Summerside       PE Atlantic   CA    16488
  log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011
1     5.294391            20           84542                  78960
2     4.036269           147            4601                   4218
3     4.137512           128            6134                   5723
4     4.434601            94           11697                  11110
5     4.809472            52           28864                  26192
6     4.217168           120            7323                   6941
  occ_rate_2011 med_total_income_2011
1     0.9339736                 33420
2     0.9167572                 24700
3     0.9329964                 26920
4     0.9498162                 27430
5     0.9074279                 30110
6     0.9478356                 27250
'data.frame':   147 obs. of  12 variables:
 $ Geographic.code       : int  1 5 10 15 105 110 205 210 215 220 ...
 $ Geographic.name       : chr  "St. John's" "Bay Roberts" "Grand Falls-Windsor" "Corner Brook" ...
 $ Province              : chr  "NL" "NL" "NL" "NL" ...
 $ Region                : chr  "Atlantic" "Atlantic" "Atlantic" "Atlantic" ...
 $ Type                  : chr  "CMA" "CA" "CA" "CA" ...
 $ pop_2011              : int  196966 10871 13725 27202 64487 16488 390328 26359 45888 35809 ...
 $ log_pop_2011          : num  5.29 4.04 4.14 4.43 4.81 ...
 $ pop_rank_2011         : int  20 147 128 94 52 120 13 97 67 78 ...
 $ priv_dwell_2011       : int  84542 4601 6134 11697 28864 7323 177295 11941 21708 16788 ...
 $ occ_private_dwell_2011: int  78960 4218 5723 11110 26192 6941 165153 11123 19492 15256 ...
 $ occ_rate_2011         : num  0.934 0.917 0.933 0.95 0.907 ...
 $ med_total_income_2011 : int  33420 24700 26920 27430 30110 27250 33400 25500 26710 26540 ...
   Province            Region              Type              pop_2011      
 Length:147         Length:147         Length:147         Min.   :  10871  
 Class :character   Class :character   Class :character   1st Qu.:  18429  
 Mode  :character   Mode  :character   Mode  :character   Median :  40077  
                                                          Mean   : 186632  
                                                          3rd Qu.:  98388  
                                                          Max.   :5583064  
                                                                           
  log_pop_2011   pop_rank_2011   priv_dwell_2011   occ_private_dwell_2011
 Min.   :4.036   Min.   :  1.0   Min.   :   4601   Min.   :   4218       
 1st Qu.:4.265   1st Qu.: 37.5   1st Qu.:   8292   1st Qu.:   7701       
 Median :4.603   Median : 74.0   Median :  17428   Median :  16709       
 Mean   :4.720   Mean   : 74.0   Mean   :  78691   Mean   :  74130       
 3rd Qu.:4.993   3rd Qu.:110.5   3rd Qu.:  44674   3rd Qu.:  41142       
 Max.   :6.747   Max.   :147.0   Max.   :2079459   Max.   :1989705       
                                                                         
 occ_rate_2011    med_total_income_2011
 Min.   :0.6492   Min.   :22980        
 1st Qu.:0.9173   1st Qu.:27630        
 Median :0.9348   Median :29910        
 Mean   :0.9276   Mean   :31311        
 3rd Qu.:0.9475   3rd Qu.:33255        
 Max.   :0.9723   Max.   :73030        
                  NA's   :5            

Before jumping aboard the bubble chart train, let’s see what the dataset looks like in the scatterplot framework for 5 of the variables, grouped along regions.

pairs(can.2011[,c(7,9,10,11,12)])

It’s … underwhelming, if that’s a word.78. There are some interesting tidbits, but nothing jumps as being meaningful beyond a first pass.

Can anything else be derived using bubble charts? We use median income as the radius for the bubbles, and focus on occupancy rates and population.

radius.med.income.2011<-sqrt(can.2011$med_total_income_2011/pi)
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011,
        circles=radius.med.income.2011, inches=0.45, 
        fg="white", bg="red", xlab="Population (log)", 
        ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")

Clearly, an increase in population seems to be associated with (and not necessarily a cause of) a rise in occupancy rates. But the median income seems to have very little correlation with either of the other two variables. Perhaps such a correlation is hidden by the default unit used to draw the bubbles? Let’s shrink it from 0.45 to 0.25 and see if anything pops out.

symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, 
        inches=0.25, fg="white", bg="red",
        xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")

Not much to it. But surely there would be a relationship in these quantities if we included the CMA/CA’s region?

symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011, 
        inches=0.15, fg="white",
        bg=c("red","blue","black","green","yellow","violet")[factor(
          can.2011$Region)],
        xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011) - colours represent
      oeographical regions.")

What do you think?

Perhaps the CA distort the full picture (given that they are small and more numerous).

Let’s analyze the subset of CMAs instead.

can.2011.CMA=read.csv("data/Canada2011_CMA.csv", head=TRUE) # import the data
str(can.2011.CMA) #dataset structure (notice the number of observations)
summary(can.2011.CMA[,3:12], maxsum=13) # feature by feature distribution
'data.frame':   33 obs. of  15 variables:
 $ Geographic.code       : int  1 205 305 310 408 421 433 442 462 505 ...
 $ Geographic.name       : chr  "St. John's" "Halifax" "Moncton" "Saint John" ...
 $ Province              : chr  "NL" "NS" "NB" "NB" ...
 $ Region                : chr  "Atlantic" "Atlantic" "Atlantic" "Atlantic" ...
 $ Type                  : chr  "CMA" "CMA" "CMA" "CMA" ...
 $ pop_2011              : int  196966 390328 138644 127761 157790 765706 201890 151773 3824221 1236324 ...
 $ log_pop_2011          : num  5.29 5.59 5.14 5.11 5.2 ...
 $ fem_pop_2011          : int  103587 204579 70901 65834 79770 394935 104026 78105 1971520 647083 ...
 $ prop_fem_2011         : num  52.6 52.4 51.1 51.5 50.6 ...
 $ pop_rank_2011         : int  20 13 29 31 26 7 19 27 2 4 ...
 $ priv_dwell_2011       : int  84542 177295 62403 56775 73766 361447 99913 74837 1696210 526627 ...
 $ occ_private_dwell_2011: int  78960 165153 58294 52281 69507 345892 91099 70138 1613260 498636 ...
 $ occ_rate_2011         : num  0.934 0.932 0.934 0.921 0.942 ...
 $ med_unemployment_2011 : num  6.6 6.2 7.45 6.75 7.8 5.15 6.4 8.9 8.1 6.3 ...
 $ med_total_income_2011 : int  33420 33400 30690 29910 29560 33760 27620 27050 28870 39170 ...
   Province            Region              Type              pop_2011      
 Length:33          Length:33          Length:33          Min.   : 118975  
 Class :character   Class :character   Class :character   1st Qu.: 159561  
 Mode  :character   Mode  :character   Mode  :character   Median : 260600  
                                                          Mean   : 700710  
                                                          3rd Qu.: 721053  
                                                          Max.   :5583064  
  log_pop_2011    fem_pop_2011     prop_fem_2011   pop_rank_2011
 Min.   :5.075   Min.   :  63260   Min.   :50.55   Min.   : 1   
 1st Qu.:5.203   1st Qu.:  83243   1st Qu.:51.55   1st Qu.: 9   
 Median :5.416   Median : 135561   Median :52.17   Median :17   
 Mean   :5.553   Mean   : 365093   Mean   :52.05   Mean   :17   
 3rd Qu.:5.858   3rd Qu.: 378690   3rd Qu.:52.41   3rd Qu.:25   
 Max.   :6.747   Max.   :2947971   Max.   :53.17   Max.   :33   
 priv_dwell_2011   occ_private_dwell_2011
 Min.   :  53730   Min.   :  48848       
 1st Qu.:  72817   1st Qu.:  67767       
 Median : 110314   Median : 104237       
 Mean   : 291519   Mean   : 275587       
 3rd Qu.: 294150   3rd Qu.: 282186       
 Max.   :2079459   Max.   :1989705       

We can use the median income for the bubbles radius again, but this time we’ll look at population and unemployment.

radius.med.income.2011.CMA<-sqrt(can.2011.CMA$med_total_income_2011/pi)
symbols(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,
        circles=radius.med.income.2011.CMA, inches=0.25, fg="white",
        bg=c("red","blue","gray","green","yellow")[factor(can.2011.CMA$Region)],
        ylab="Population (log)", xlab="Unemployment Rate")
title("Total Median Income, by CMA (2011)")
text(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,
     can.2011.CMA$Geographic.code, cex=0.5)

Part of the problem is that median income seems to be roughly uniform among CMAs. What if we used rank statistics instead? Switch the radius to population rank, say?

radius.pop.rank.2011.CMA<-sqrt(can.2011.CMA$pop_rank_2011/pi)
symbols(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,
        circles=radius.pop.rank.2011.CMA, inches=0.25, fg="white",
        bg=c("red","blue","gray","green","yellow")[factor(can.2011.CMA$Region)],
        ylab="Median Income", xlab="Unemployment Rate")
title("Population Rank, by CMA and CA (2011)")
text(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,
     can.2011.CMA$Geographic.code, cex=0.5)

There’s a bit more structure in there, isn’t there? Can you figure out to what regions the colours correspond? (Hint: look at the first digit of the bubble labels, and the frequency of the colours).


  1. We know it’s not ’cause we looked it up. It’s one of the skills we learned in grade school.↩︎