5.5 Basic Visualizations in R
Whenever we analyze data, the first thing we should do is look at it:
for each variable, what are the most common values?
How much variability is present?
Are there any unusual observations?
Producing graphics for data analysis is relatively simple. Producing graphics for publication is relatively more complex and typically
requires a great deal of tweaking to achieve the desired appearance.
Base R
provides a number of functions for visualizing data. In this section, we will provide sufficient guidance so that most desired effects can be achieved, but further investigation of the documentation and experimentation will doubtless be necessary for specific needs (advanced visualization functionality is provided by ggplot2
).
For now, let’s discuss some of the most common basic methods.
5.5.1 Scatterplots
The most common plotting function in R is the plot()
function. It is a generic function, meaning, it calls various methods according to the type of the object passed which is passed to it. In the simplest case, we can pass in a vector and we get a scatter plot of magnitude vs index.
More generally, we can pass in two vectors and a scatter plot of the points formed by matching coordinates is displayed.
For example, the command plot(c(1,2),c(3,5))
would plot the points \((1,3)\) and \((2,5)\).
Here is a more concrete example showing how to plot the graph of the sine function in the range from \(-\pi\) to \(\pi\).
We can add a title to our plot with the parameter main. Similarly,
xlab
and ylab
can be used to label the x-axis and y-axis
respectively.
The curve is made up of circular black points. This is the default setting for shape and colour. This can be changed by using the argument type
.
It accepts the following strings (with given effect)
p
– pointsl
– linesb
– both points and linesc
– empty points joined by lineso
– overplotted points and liness
– stair stepsh
– histogram-like vertical linesn
– does not produce any points or lines
Similarly, we can specify the colour using the argument col
.
plot(x, sin(x),
main=expression("Graph of the sine function, over [-" * pi * "," * pi * "]"),
ylab="sin(x)",
type="l",
col="blue")
Calling plot()
multiple times will have the effect of plotting the current graph on the same window, replacing the previous one.
However, we may sometimes wish to overlay the plots in order to compare the results. This is made possible by the functions lines()
and points()
, which add lines and points respectively, to the existing plot.
plot(x, sin(x),
main="Overlaying Graphs",
ylab="",
type="l",
col="blue")
lines(x,cos(x), col="red")
legend("topleft",
c("sin(x)","cos(x)"),
fill=c("blue","red")
)
The legend()
function allows for the appropriate display in the plot.
5.5.2 Barplots
Barplots can be created in R
using the barplot()
function.
We can supply a vector or matrix to this function, and it will display a bar chart with bar heights equal to the magnitude of the elements in the vector.
Let us suppose that we have a vector of maximum temperatures for seven days, as follows.
We can make a bar chart out of this data using a simple command.
This function can take on a number of arguments (details are available in the help file, which can be queried by entering ?barplot
in the console).
Frequently-used arguments include:
main
to specify the titlexlab
andylab
to provide labels for the axesnames.arg
to provide a name for each barcol
to define colour, etc.
We can also transpose the plot to have horizontal bars by providing the argument horiz = TRUE
, such as in the following:
barplot(max.temp,
main = "Maximum Temperatures in a Week",
xlab = "Degree Celsius",
ylab = "Day",
names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
col = "darkred",
horiz = TRUE)
At times, we may be interested in displaying the count or magnitude for categories in the data. For instance, consider the following vector of age measurements for 21 first-year college students.
Calling barplot(age)
does not provide the required plot.
Indeed, this call plots 21 bars with heights corresponding to students’ age – the display does not provide the frequency count for each age.
The values can be quickly found using the table()
function, as shown below.
age
16 17 18 19 20 22 31
1 4 10 3 1 1 1
Plotting this data will produce the required barplot (in the code, the argument density
is used to determine the texture of the fill).
barplot(table(age),
main="Age Count of 21 Students",
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10
)
The chart respects the relative proportions, but R
cannot tell that there should be (empty) categories for age=21
and between 22 and 31.
This can be remedied by first setting new factor levels for the age measurements, and re-running the code.
age=factor(age,levels=c(15:31))
barplot(table(age),
main="Age Count of 21 Students",
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10
)
5.5.3 Histograms
Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the \(x\)-axis and displaying the frequency of scores in each bin on the \(y\)-axis.
We can create histograms in R
with hist()
:
the option
freq=FALSE
creates a plot based on probability densities rather than frequencies;the
breaks
option controls the number of bins: the default produces equally spaced breaks when defining the cells of the histogram.
For illustrative purposes, we will use several of the
variables from the Motor Trend Car Road Tests (mtcars
) dataset
provided in the base R
installation.
Here are four variations on a histogram.
par(mfrow=c(2,2))
hist(mtcars$mpg)
hist(mtcars$mpg,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Colored histogram with 12 bins")
hist(mtcars$mpg,
freq=FALSE,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Histogram, rug plot, density curve")
rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col="blue", lwd=2)
x <- mtcars$mpg
h<-hist(x,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Histogram with normal curve and box")
xfit<-seq(min(x), max(x), length=40)
yfit<-dnorm(xfit, mean=mean(x), sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x)
lines(xfit, yfit, col="blue", lwd=2)
box()
The first histogram demonstrates the default plot with no specified options: five bins are created, and the default axis labels and titles are printed.
In the second histogram, 12 bins have been specified, as well as a red fill for the bars, and more attractive (enticing?) and informative labels and title.
The third histogram uses the same colours, number of bins, labels, and titles as the previous plot but adds a density curve and rug-plot overlay. The density curve is a kernel density estimate and is described in the next section. It provides a smoother description of the distribution of scores. The call to function lines()
overlays this curve in blue (and a width twice the default line thickness); a rug plot is a one-dimensional representation of the actual data values.
The fourth histogram is similar to the second but with a superposed normal curve and a box around the figure. The code for superposing the normal curve comes from a suggestion posted to the R-help mailing list by Peter Dalgaard. The surrounding box is produced by the box()
function.
5.5.4 Curves
Given an expression for a function \(y(x)\), we can plot the values of \(y\) for various values of \(x\) in a given range.
This can be accomplished using the R
function curve()
.
We can plot the simple polynomial function \(y=3x^{2}+x\) in the range \(x=[1,10]\) as follows:
curve(3*x^2 + x, from=1, to=10, n=300, xlab="xvalue",
ylab="yvalue", col="blue", lwd=2,
main="Plot of (3x^2 + x)")
The important parameters of the curve()
function are:
the first parameter is the mathematical expression of the function to plot, written in the format for writing mathematical operations in or LaTeX;
two numeric parameters (
from
andto
) that represent the endpoints of the range of \(x\);the integer
n
that represents the number of equally-spaced values of \(x\) between thefrom
andto
points;the other parameters (
xlab
,ylab
,col
,lwd
,main
) have their usual meaning.
5.5.5 Boxplots
A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary:
the minimum \(Q_0\);
the lower quartile \(Q_1\) (25th percentile);
the median \(Q_2\) (50th percentile);
the upper quartile \(Q_3\) (75th percentile), and
the maximum \(Q_4\).
The belt of the boxplot is found at \(Q_2\), its box boundaries at \(Q_1\) and \(Q_3\), and its whiskers at \(5/2Q_1-3/2Q_3\) and \(5/2Q_3-3/2Q_1\).
For normally distributed variables, it also displays observations which may be identified as potential outliers (which it to say, values outside the box-and-whiskers range \([5/2Q_1-3/2Q_3,5/2Q_3-3/2Q_1]\)).
5.5.6 Examples
Let’s take a look at some other basic R
plots.
Scatterplots
On a scatter plot, the features to study depend on the scale of interest:
- for “big” patterns, look at:
- form and direction
- strength
- for “small” patterns, look at:
- deviations from the pattern
- outliers
We start with the swiss
built-in R
dataset.
str(swiss) # structure of the swiss dataset
pairs(swiss) # scatter plot matrix for the swiss dataset
'data.frame': 47 obs. of 6 variables:
$ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
$ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
$ Education : int 12 9 5 7 15 7 7 8 7 13 ...
$ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
$ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
Let’s focus on one specific pair: Fertility vs. Education.
The same plot can be prettified and made more informative:
# add a title and axis labels
plot(swiss$Fertility, swiss$Education, xlab="Fertility", ylab="Education",
main="Education vs Fertility (by province), Switzerland, 1888",
las=1)
# add the line of best fit (in red)
abline(lm(swiss$Education~swiss$Fertility), col="red", lwd=2.5)
# add the smoothing lowess curve (in blue)
lines(lowess(swiss$Fertility,swiss$Education), col="blue", lwd=2.5)
# add a legend
legend(75,50, c("Best Fit","Lowess"), lty=c(1,1),
lwd=c(2.5,2.5),col=c("red","blue"))
Compare that graph with the one found below:
plot(swiss$Education, xlab="Province", ylab="Education",
main="Education by Province, Switzerland, 1888", las=1)
abline(lm(swiss$Education~row(swiss)[,1]), col="red", lwd=2.5)
lines(swiss$Education)
lines(lowess(row(swiss)[,1],swiss$Education), col="blue", lwd=2.5)
legend(5,52, c("Best Fit","Lowess"), lty=c(1,1),
lwd=c(2.5,2.5),col=c("red","blue"))
Even though R
will not balk at producing this graph, it is a nonsense graph. Why, exactly?
The main take-away of this example is that being able to produce a graph doesn’t guarantee that it will be useful or meaningful in any way.
Let’s do some thing similar for the infamous built-in iris
dataset.
str(iris) # structure of the dataset
summary(iris) # information on the distributions for each feature
# ?iris # information on the dataset itself
'data.frame': 150 obs. of 5 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
$ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
The function pairs
provides a scatter plot matrix of the dataset; the option lower.panel=NULL
removes the lower panels (due to redundancy).
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21,
lower.panel=NULL, labels=c("SL","SW","PL","PW"),
font.labels=2, cex.labels=4.5)
We can compare the sepal width and length variables in a manner similar to what we did with the swiss
dataset.
plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal Length",
ylab="Sepal Width", main="Sepal Width vs Sepal Length,
Anderson's Iris Dataset", las=1,
bg=c("yellow","black","green")[unclass(iris$Species)])
abline(lm(iris$Sepal.Width~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Sepal.Width), col="blue", lwd=2.5)
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1),
lwd=c(2.5,2.5),col=c("red","blue"))
There does not seem to be a very strong relationship between these variables. What can we say about sepal length and petal length?
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length",
ylab="Petal Length", main="Sepal Width vs Petal Length,
Anderson's Iris Dataset", las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5)
legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1),
lwd=c(2.5,2.5),col=c("red","blue"))
Visually, the relationship is striking: the line seems to have a slope of 1! But notice that the axes are unevenly scaled, and have been cutoff away from the origin. The following graph gives a better idea of the situation.
plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length",
ylab="Petal Length", main="Sepal Width vs Petal Length,
Anderson's Iris Dataset", xlim=c(0,8), ylim=c(0,8), las=1)
abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5)
lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5)
legend(2,7, c("Best Fit","Lowess"), lty=c(1,1),
lwd=c(2.5,2.5),col=c("red","blue"))
A relationship is still present, but it is affine, not linear as could have been guessed by naively looking at the original graph.
Colour can also be used to highlight various data elements. Here, we colour each observation differently according to its species:
plot(iris$Sepal.Length, iris$Sepal.Width, pch=21,
bg=c("red","green3","blue")[unclass(iris$Species)],
main="Anderson's Iris Data -- Sepal Length vs. Sepal Width", xlim=)
This can be done on all scatterplots concurrently using pairs
.
pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21,
bg = c("red", "green3", "blue")[unclass(iris$Species)],
lower.panel=NULL, labels=c("SL","SW","PL","PW"),
font.labels=2, cex.labels=4.5)
Histograms and Bar Charts
In histograms or frequency charts, users should stay on the lookout for bin size effects. For instance, what does the distribution of the Education
variable in the swiss
dataset look like?
The distribution pattern is distintcly different with 10 and 20 bins. Don’t get too carried away, however: too many bins may end up masking trends if the dataset isn’t large enough.
We can also look for best fits for various parametric distributions, with the default number of bins:
hist(swiss$Education, freq=FALSE, xlab="Education",
main="Education Distribution, by Province, Switzerland, 1888",
col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue",
lwd=4) # fit a gamma distribution
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black",
lwd=4) # fit an exponential distribution
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)),
add=TRUE, col="green3", lwd=4) # fit a normal distribution
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
lwd=c(4,4),col=c("darkblue","black", "green3"))
We can also view the histograms with 10 bins:
hist(swiss$Education, breaks=10, freq=FALSE,xlab="Education",main="Education
Distribution, by Province, Switzerland, 1888", col="firebrick1",
ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE,
col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE,
col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)),
add=TRUE, col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
lwd=c(4,4),col=c("darkblue","black", "green3"))
or with 20 bins:
hist(swiss$Education, breaks=20, freq=FALSE, xlab="Education",
main="Education Distribution, by Province, Switzerland, 1888",
col="firebrick1", ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE,
col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE,
col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE,
col="green3", lwd=4)
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
lwd=c(4,4),col=c("darkblue","black", "green3"))
With a small number of bins, the exponential distribution seems like a good fit, visually. With a larger number of bins, neither of the three families seems particularly well-advised.
And now, can you figure out what is happening with these visualizations of the iris
dataset?
hist(iris$Sepal.Length, freq=FALSE, xlab="Sepal.Length",
main="Sepal.Length Distribution", col="firebrick1", ylim=c(0,0.15))
What happens above if we replace freq=FALSE
with freq=TRUE
?
Let’s look at another feature, now.
hist(iris$Sepal.Width, freq=FALSE, xlab="Sepal.Width",
main="Sepal.Width Distribution", col="firebrick1", ylim=c(0,0.15))
Histograms (1D data representations) can be combined with scatterplots (2D data representations) to provide marginal information (this is done through a custom function based on R-blogger.com
’s scatterhist()
function).
scatterhist = function(x, y, xlab="", ylab=""){
zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE)
layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5))
xhist = hist(x, plot=FALSE)
yhist = hist(y, plot=FALSE)
top = max(c(xhist$counts, yhist$counts))
par(mar=c(3,3,1,1))
plot(x,y)
par(mar=c(0,3,1,1))
barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE)
par(oma=c(3,3,0,0))
mtext(xlab, side=1, line=1, outer=TRUE, adj=0,
at=.8 * (mean(x) - min(x))/(max(x)-min(x)))
mtext(ylab, side=2, line=1, outer=TRUE, adj=0,
at=(.8 * (mean(y) - min(y))/(max(y) - min(y))))
}
ds = iris
with(ds, scatterhist(iris$Sepal.Length, iris$Sepal.Width,
xlab="Sepal.Length", ylab="Sepal.Width"))
Bubble Charts
Bubble charts are a neat way to show at least 3 variables on the same 2D display. The location of the bubbles’ centre takes care of 2 variables: the size, colour, and shape of the bubbles can also be used to represent different data elements.
For this first example, we’ll take a look at some Statistics Canada 2011 demographic data regarding Canada’s census metropolitan areas (CMA) and census agglomerations (CA).
can.2011=read.csv("data/Canada2011.csv", head=TRUE)
head(can.2011)
str(can.2011)
summary(can.2011[,3:12], maxsum=13)
Geographic.code Geographic.name Province Region Type pop_2011
1 1 St. John's NL Atlantic CMA 196966
2 5 Bay Roberts NL Atlantic CA 10871
3 10 Grand Falls-Windsor NL Atlantic CA 13725
4 15 Corner Brook NL Atlantic CA 27202
5 105 Charlottetown PE Atlantic CA 64487
6 110 Summerside PE Atlantic CA 16488
log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011
1 5.294391 20 84542 78960
2 4.036269 147 4601 4218
3 4.137512 128 6134 5723
4 4.434601 94 11697 11110
5 4.809472 52 28864 26192
6 4.217168 120 7323 6941
occ_rate_2011 med_total_income_2011
1 0.9339736 33420
2 0.9167572 24700
3 0.9329964 26920
4 0.9498162 27430
5 0.9074279 30110
6 0.9478356 27250
'data.frame': 147 obs. of 12 variables:
$ Geographic.code : int 1 5 10 15 105 110 205 210 215 220 ...
$ Geographic.name : chr "St. John's" "Bay Roberts" "Grand Falls-Windsor" "Corner Brook" ...
$ Province : chr "NL" "NL" "NL" "NL" ...
$ Region : chr "Atlantic" "Atlantic" "Atlantic" "Atlantic" ...
$ Type : chr "CMA" "CA" "CA" "CA" ...
$ pop_2011 : int 196966 10871 13725 27202 64487 16488 390328 26359 45888 35809 ...
$ log_pop_2011 : num 5.29 4.04 4.14 4.43 4.81 ...
$ pop_rank_2011 : int 20 147 128 94 52 120 13 97 67 78 ...
$ priv_dwell_2011 : int 84542 4601 6134 11697 28864 7323 177295 11941 21708 16788 ...
$ occ_private_dwell_2011: int 78960 4218 5723 11110 26192 6941 165153 11123 19492 15256 ...
$ occ_rate_2011 : num 0.934 0.917 0.933 0.95 0.907 ...
$ med_total_income_2011 : int 33420 24700 26920 27430 30110 27250 33400 25500 26710 26540 ...
Province Region Type pop_2011
Length:147 Length:147 Length:147 Min. : 10871
Class :character Class :character Class :character 1st Qu.: 18429
Mode :character Mode :character Mode :character Median : 40077
Mean : 186632
3rd Qu.: 98388
Max. :5583064
log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011
Min. :4.036 Min. : 1.0 Min. : 4601 Min. : 4218
1st Qu.:4.265 1st Qu.: 37.5 1st Qu.: 8292 1st Qu.: 7701
Median :4.603 Median : 74.0 Median : 17428 Median : 16709
Mean :4.720 Mean : 74.0 Mean : 78691 Mean : 74130
3rd Qu.:4.993 3rd Qu.:110.5 3rd Qu.: 44674 3rd Qu.: 41142
Max. :6.747 Max. :147.0 Max. :2079459 Max. :1989705
occ_rate_2011 med_total_income_2011
Min. :0.6492 Min. :22980
1st Qu.:0.9173 1st Qu.:27630
Median :0.9348 Median :29910
Mean :0.9276 Mean :31311
3rd Qu.:0.9475 3rd Qu.:33255
Max. :0.9723 Max. :73030
NA's :5
Before jumping aboard the bubble chart train, let’s see what the dataset looks like in the scatterplot framework for 5 of the variables, grouped along regions.
It’s … underwhelming, if that’s a word.78. There are some interesting tidbits, but nothing jumps as being meaningful beyond a first pass.
Can anything else be derived using bubble charts? We use median income
as the radius for the bubbles, and focus on occupancy rates and population.
radius.med.income.2011<-sqrt(can.2011$med_total_income_2011/pi)
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011,
circles=radius.med.income.2011, inches=0.45,
fg="white", bg="red", xlab="Population (log)",
ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
Clearly, an increase in population seems to be associated with (and not necessarily a cause of) a rise in occupancy rates. But the median income seems to have very little correlation with either of the other two variables. Perhaps such a correlation is hidden by the default unit used to draw the bubbles? Let’s shrink it from 0.45 to 0.25 and see if anything pops out.
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011,
inches=0.25, fg="white", bg="red",
xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)")
Not much to it. But surely there would be a relationship in these quantities if we included the CMA/CA’s region?
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011,
inches=0.15, fg="white",
bg=c("red","blue","black","green","yellow","violet")[factor(
can.2011$Region)],
xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011) - colours represent
oeographical regions.")
What do you think?
Perhaps the CA distort the full picture (given that they are small and more numerous).
Let’s analyze the subset of CMAs instead.
can.2011.CMA=read.csv("data/Canada2011_CMA.csv", head=TRUE) # import the data
str(can.2011.CMA) #dataset structure (notice the number of observations)
summary(can.2011.CMA[,3:12], maxsum=13) # feature by feature distribution
'data.frame': 33 obs. of 15 variables:
$ Geographic.code : int 1 205 305 310 408 421 433 442 462 505 ...
$ Geographic.name : chr "St. John's" "Halifax" "Moncton" "Saint John" ...
$ Province : chr "NL" "NS" "NB" "NB" ...
$ Region : chr "Atlantic" "Atlantic" "Atlantic" "Atlantic" ...
$ Type : chr "CMA" "CMA" "CMA" "CMA" ...
$ pop_2011 : int 196966 390328 138644 127761 157790 765706 201890 151773 3824221 1236324 ...
$ log_pop_2011 : num 5.29 5.59 5.14 5.11 5.2 ...
$ fem_pop_2011 : int 103587 204579 70901 65834 79770 394935 104026 78105 1971520 647083 ...
$ prop_fem_2011 : num 52.6 52.4 51.1 51.5 50.6 ...
$ pop_rank_2011 : int 20 13 29 31 26 7 19 27 2 4 ...
$ priv_dwell_2011 : int 84542 177295 62403 56775 73766 361447 99913 74837 1696210 526627 ...
$ occ_private_dwell_2011: int 78960 165153 58294 52281 69507 345892 91099 70138 1613260 498636 ...
$ occ_rate_2011 : num 0.934 0.932 0.934 0.921 0.942 ...
$ med_unemployment_2011 : num 6.6 6.2 7.45 6.75 7.8 5.15 6.4 8.9 8.1 6.3 ...
$ med_total_income_2011 : int 33420 33400 30690 29910 29560 33760 27620 27050 28870 39170 ...
Province Region Type pop_2011
Length:33 Length:33 Length:33 Min. : 118975
Class :character Class :character Class :character 1st Qu.: 159561
Mode :character Mode :character Mode :character Median : 260600
Mean : 700710
3rd Qu.: 721053
Max. :5583064
log_pop_2011 fem_pop_2011 prop_fem_2011 pop_rank_2011
Min. :5.075 Min. : 63260 Min. :50.55 Min. : 1
1st Qu.:5.203 1st Qu.: 83243 1st Qu.:51.55 1st Qu.: 9
Median :5.416 Median : 135561 Median :52.17 Median :17
Mean :5.553 Mean : 365093 Mean :52.05 Mean :17
3rd Qu.:5.858 3rd Qu.: 378690 3rd Qu.:52.41 3rd Qu.:25
Max. :6.747 Max. :2947971 Max. :53.17 Max. :33
priv_dwell_2011 occ_private_dwell_2011
Min. : 53730 Min. : 48848
1st Qu.: 72817 1st Qu.: 67767
Median : 110314 Median : 104237
Mean : 291519 Mean : 275587
3rd Qu.: 294150 3rd Qu.: 282186
Max. :2079459 Max. :1989705
We can use the median income for the bubbles radius again, but this time we’ll look at population and unemployment.
radius.med.income.2011.CMA<-sqrt(can.2011.CMA$med_total_income_2011/pi)
symbols(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,
circles=radius.med.income.2011.CMA, inches=0.25, fg="white",
bg=c("red","blue","gray","green","yellow")[factor(can.2011.CMA$Region)],
ylab="Population (log)", xlab="Unemployment Rate")
title("Total Median Income, by CMA (2011)")
text(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,
can.2011.CMA$Geographic.code, cex=0.5)
Part of the problem is that median income seems to be roughly uniform among CMAs. What if we used rank statistics instead? Switch the radius to population rank, say?
radius.pop.rank.2011.CMA<-sqrt(can.2011.CMA$pop_rank_2011/pi)
symbols(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,
circles=radius.pop.rank.2011.CMA, inches=0.25, fg="white",
bg=c("red","blue","gray","green","yellow")[factor(can.2011.CMA$Region)],
ylab="Median Income", xlab="Unemployment Rate")
title("Population Rank, by CMA and CA (2011)")
text(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,
can.2011.CMA$Geographic.code, cex=0.5)
There’s a bit more structure in there, isn’t there? Can you figure out to what regions the colours correspond? (Hint: look at the first digit of the bubble labels, and the frequency of the colours).
We know it’s not ’cause we looked it up. It’s one of the skills we learned in grade school.↩︎