## 5.5 Basic Visualizations in R

Whenever we analyze data, the first thing we should do is look at it:

• for each variable, what are the most common values?

• How much variability is present?

• Are there any unusual observations?

Producing graphics for data analysis is relatively simple. Producing graphics for publication is relatively more complex and typically requires a great deal of tweaking to achieve the desired appearance. Base R provides a number of functions for visualizing data. In this section, we will provide sufficient guidance so that most desired effects can be achieved, but further investigation of the documentation and experimentation will doubtless be necessary for specific needs (advanced visualization functionality is provided by ggplot2).

For now, let’s discuss some of the most common basic methods.

### 5.5.1 Scatterplots

The most common plotting function in R is the plot() function. It is a generic function, meaning, it calls various methods according to the type of the object passed which is passed to it. In the simplest case, we can pass in a vector and we get a scatter plot of magnitude vs index.

More generally, we can pass in two vectors and a scatter plot of the points formed by matching coordinates is displayed.

For example, the command plot(c(1,2),c(3,5)) would plot the points $$(1,3)$$ and $$(2,5)$$.

plot(c(1,2),c(3,5)) Here is a more concrete example showing how to plot the graph of the sine function in the range from $$-\pi$$ to $$\pi$$.

x <- seq(-pi,pi,0.1)
plot(x, sin(x)) We can add a title to our plot with the parameter main. Similarly, xlab and ylab can be used to label the x-axis and y-axis respectively.

The curve is made up of circular black points. This is the default setting for shape and colour. This can be changed by using the argument type.

It accepts the following strings (with given effect)

• p – points

• l – lines

• b – both points and lines

• c – empty points joined by lines

• o – overplotted points and lines

• s – stair steps

• h – histogram-like vertical lines

• n – does not produce any points or lines

Similarly, we can specify the colour using the argument col.

plot(x, sin(x),
main=expression("Graph of the sine function, over [-" * pi * "," * pi * "]"),
ylab="sin(x)",
type="l",
col="blue") Calling plot() multiple times will have the effect of plotting the current graph on the same window, replacing the previous one.

However, we may sometimes wish to overlay the plots in order to compare the results. This is made possible by the functions lines() and points(), which add lines and points respectively, to the existing plot.

plot(x, sin(x),
main="Overlaying Graphs",
ylab="",
type="l",
col="blue")
lines(x,cos(x), col="red")
legend("topleft",
c("sin(x)","cos(x)"),
fill=c("blue","red")
) The legend() function allows for the appropriate display in the plot.

### 5.5.2 Barplots

Barplots can be created in R using the barplot() function.

We can supply a vector or matrix to this function, and it will display a bar chart with bar heights equal to the magnitude of the elements in the vector.

Let us suppose that we have a vector of maximum temperatures for seven days, as follows.

max.temp <- c(22, 27, 26, 24, 23, 26, 31)

We can make a bar chart out of this data using a simple command.

barplot(max.temp) This function can take on a number of arguments (details are available in the help file, which can be queried by entering ?barplot in the console).

Frequently-used arguments include:

• main to specify the title

• xlab and ylab to provide labels for the axes

• names.arg to provide a name for each bar

• col to define colour, etc.

We can also transpose the plot to have horizontal bars by providing the argument horiz = TRUE, such as in the following:

barplot(max.temp,
main = "Maximum Temperatures in a Week",
xlab = "Degree Celsius",
ylab = "Day",
names.arg = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
col = "darkred",
horiz = TRUE) At times, we may be interested in displaying the count or magnitude for categories in the data. For instance, consider the following vector of age measurements for 21 first-year college students.

age <- c(22,17,18,18,17,18,19,18,16,18,18,18,18,18,19,19,20,17,18,17,31)

Calling barplot(age) does not provide the required plot.

barplot(age) Indeed, this call plots 21 bars with heights corresponding to students’ age – the display does not provide the frequency count for each age.

The values can be quickly found using the table() function, as shown below.

table(age)
age
16 17 18 19 20 22 31
1  4 10  3  1  1  1 

Plotting this data will produce the required barplot (in the code, the argument densityis used to determine the texture of the fill).

barplot(table(age),
main="Age Count of 21 Students",
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10
) The chart respects the relative proportions, but R cannot tell that there should be (empty) categories for age=21 and between 22 and 31.

This can be remedied by first setting new factor levels for the age measurements, and re-running the code.

age=factor(age,levels=c(15:31))
barplot(table(age),
main="Age Count of 21 Students",
xlab="Age",
ylab="Count",
border="red",
col="blue",
density=10
) ### 5.5.3 Histograms

Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the $$x$$-axis and displaying the frequency of scores in each bin on the $$y$$-axis.

We can create histograms in R with hist():

• the option freq=FALSE creates a plot based on probability densities rather than frequencies;

• the breaks option controls the number of bins: the default produces equally spaced breaks when defining the cells of the histogram.

For illustrative purposes, we will use several of the variables from the Motor Trend Car Road Tests (mtcars) dataset provided in the base R installation.

Here are four variations on a histogram.

par(mfrow=c(2,2))
hist(mtcars$mpg) hist(mtcars$mpg,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Colored histogram with 12 bins")
hist(mtcars$mpg, freq=FALSE, breaks=12, col="red", xlab="Miles Per Gallon", main="Histogram, rug plot, density curve") rug(jitter(mtcars$mpg))
lines(density(mtcars$mpg), col="blue", lwd=2) x <- mtcars$mpg
h<-hist(x,
breaks=12,
col="red",
xlab="Miles Per Gallon",
main="Histogram with normal curve and box")
xfit<-seq(min(x), max(x), length=40)
yfit<-dnorm(xfit, mean=mean(x), sd=sd(x))
yfit <- yfit*diff(h$mids[1:2])*length(x) lines(xfit, yfit, col="blue", lwd=2) box() The first histogram demonstrates the default plot with no specified options: five bins are created, and the default axis labels and titles are printed. In the second histogram, 12 bins have been specified, as well as a red fill for the bars, and more attractive (enticing?) and informative labels and title. The third histogram uses the same colours, number of bins, labels, and titles as the previous plot but adds a density curve and rug-plot overlay. The density curve is a kernel density estimate and is described in the next section. It provides a smoother description of the distribution of scores. The call to function lines() overlays this curve in blue (and a width twice the default line thickness); a rug plot is a one-dimensional representation of the actual data values. The fourth histogram is similar to the second but with a superposed normal curve and a box around the figure. The code for superposing the normal curve comes from a suggestion posted to the R-help mailing list by Peter Dalgaard. The surrounding box is produced by the box() function. ### 5.5.4 Curves Given an expression for a function $$y(x)$$, we can plot the values of $$y$$ for various values of $$x$$ in a given range. This can be accomplished using the R function curve(). We can plot the simple polynomial function $$y=3x^{2}+x$$ in the range $$x=[1,10]$$ as follows: curve(3*x^2 + x, from=1, to=10, n=300, xlab="xvalue", ylab="yvalue", col="blue", lwd=2, main="Plot of (3x^2 + x)") The important parameters of the curve() function are: • the first parameter is the mathematical expression of the function to plot, written in the format for writing mathematical operations in or LaTeX; • two numeric parameters (from and to) that represent the endpoints of the range of $$x$$; • the integer n that represents the number of equally-spaced values of $$x$$ between the from and to points; • the other parameters (xlab, ylab, col, lwd, main) have their usual meaning. ### 5.5.5 Boxplots A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary: • the minimum $$Q_0$$; • the lower quartile $$Q_1$$ (25th percentile); • the median $$Q_2$$ (50th percentile); • the upper quartile $$Q_3$$ (75th percentile), and • the maximum $$Q_4$$. The belt of the boxplot is found at $$Q_2$$, its box boundaries at $$Q_1$$ and $$Q_3$$, and its whiskers at $$5/2Q_1-3/2Q_3$$ and $$5/2Q_3-3/2Q_1$$. boxplot(mtcars$mpg, main="Box plot", ylab="Miles per Gallon") For normally distributed variables, it also displays observations which may be identified as potential outliers (which it to say, values outside the box-and-whiskers range $$[5/2Q_1-3/2Q_3,5/2Q_3-3/2Q_1]$$).

boxplot(c(mtcars$mpg,35), main="Box plot", ylab="Miles per Gallon") ### 5.5.6 Examples Let’s take a look at some other basic R plots. #### Scatterplots On a scatter plot, the features to study depend on the scale of interest: • for “big” patterns, look at: • form and direction • strength • for “small” patterns, look at: • deviations from the pattern • outliers We start with the swiss built-in R dataset. str(swiss) # structure of the swiss dataset pairs(swiss) # scatter plot matrix for the swiss dataset 'data.frame': 47 obs. of 6 variables:$ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
$Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...$ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
$Education : int 12 9 5 7 15 7 7 8 7 13 ...$ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
$Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ... Let’s focus on one specific pair: Fertility vs. Education. # raw plot plot(swiss$Fertility, swiss$Education) The same plot can be prettified and made more informative: # add a title and axis labels plot(swiss$Fertility, swiss$Education, xlab="Fertility", ylab="Education", main="Education vs Fertility (by province), Switzerland, 1888", las=1) # add the line of best fit (in red) abline(lm(swiss$Education~swiss$Fertility), col="red", lwd=2.5) # add the smoothing lowess curve (in blue) lines(lowess(swiss$Fertility,swiss$Education), col="blue", lwd=2.5) # add a legend legend(75,50, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue")) Compare that graph with the one found below: plot(swiss$Education, xlab="Province", ylab="Education",
main="Education by Province, Switzerland, 1888", las=1)
abline(lm(swiss$Education~row(swiss)[,1]), col="red", lwd=2.5) lines(swiss$Education)
lines(lowess(row(swiss)[,1],swiss$Education), col="blue", lwd=2.5) legend(5,52, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue")) Even though R will not balk at producing this graph, it is a nonsense graph. Why, exactly? The main take-away of this example is that being able to produce a graph doesn’t guarantee that it will be useful or meaningful in any way. Let’s do some thing similar for the infamous built-in iris dataset. str(iris) # structure of the dataset summary(iris) # information on the distributions for each feature # ?iris # information on the dataset itself 'data.frame': 150 obs. of 5 variables:$ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...$ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
Median :5.800   Median :3.000   Median :4.350   Median :1.300
Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199
3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
Species
setosa    :50
versicolor:50
virginica :50



The function pairs provides a scatter plot matrix of the dataset; the option lower.panel=NULL removes the lower panels (due to redundancy).

pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21,
lower.panel=NULL, labels=c("SL","SW","PL","PW"),
font.labels=2, cex.labels=4.5) We can compare the sepal width and length variables in a manner similar to what we did with the swiss dataset.

plot(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal Length",
ylab="Sepal Width", main="Sepal Width vs Sepal Length,
Anderson's Iris Dataset", las=1,
bg=c("yellow","black","green")[unclass(iris$Species)]) abline(lm(iris$Sepal.Width~iris$Sepal.Length), col="red", lwd=2.5) lines(lowess(iris$Sepal.Length,iris$Sepal.Width), col="blue", lwd=2.5) legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue")) There does not seem to be a very strong relationship between these variables. What can we say about sepal length and petal length? plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", las=1) abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5) lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5) legend(7,4.35, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue")) Visually, the relationship is striking: the line seems to have a slope of 1! But notice that the axes are unevenly scaled, and have been cutoff away from the origin. The following graph gives a better idea of the situation. plot(iris$Sepal.Length, iris$Petal.Length, xlab="Sepal Length", ylab="Petal Length", main="Sepal Width vs Petal Length, Anderson's Iris Dataset", xlim=c(0,8), ylim=c(0,8), las=1) abline(lm(iris$Petal.Length~iris$Sepal.Length), col="red", lwd=2.5) lines(lowess(iris$Sepal.Length,iris$Petal.Length), col="blue", lwd=2.5) legend(2,7, c("Best Fit","Lowess"), lty=c(1,1), lwd=c(2.5,2.5),col=c("red","blue")) A relationship is still present, but it is affine, not linear as could have been guessed by naively looking at the original graph. Colour can also be used to highlight various data elements. Here, we colour each observation differently according to its species: plot(iris$Sepal.Length, iris$Sepal.Width, pch=21, bg=c("red","green3","blue")[unclass(iris$Species)],
main="Anderson's Iris Data -- Sepal Length vs. Sepal Width", xlim=) This can be done on all scatterplots concurrently using pairs.

pairs(iris[1:4], main = "Anderson's Iris Data", pch = 21,
bg = c("red", "green3", "blue")[unclass(iris$Species)], lower.panel=NULL, labels=c("SL","SW","PL","PW"), font.labels=2, cex.labels=4.5) #### Histograms and Bar Charts In histograms or frequency charts, users should stay on the lookout for bin size effects. For instance, what does the distribution of the Education variable in the swiss dataset look like? hist(swiss$Education)   # default number of bins hist(swiss$Education, breaks=10) # with 10 bins hist(swiss$Education, breaks=20)   # with 20 bins The distribution pattern is distintcly different with 10 and 20 bins. Don’t get too carried away, however: too many bins may end up masking trends if the dataset isn’t large enough.

We can also look for best fits for various parametric distributions, with the default number of bins:

hist(swiss$Education, freq=FALSE, xlab="Education", main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15)) curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue",
lwd=4) # fit a gamma distribution
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4) # fit an exponential distribution curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4) # fit a normal distribution legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3")) We can also view the histograms with 10 bins: hist(swiss$Education, breaks=10, freq=FALSE,xlab="Education",main="Education
Distribution, by Province, Switzerland, 1888", col="firebrick1",
ylim=c(0,0.15))
curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE, col="darkblue", lwd=4) curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE,
col="black", lwd=4)
curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)),
legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1),
lwd=c(4,4),col=c("darkblue","black", "green3")) or with 20 bins:

hist(swiss$Education, breaks=20, freq=FALSE, xlab="Education", main="Education Distribution, by Province, Switzerland, 1888", col="firebrick1", ylim=c(0,0.15)) curve(dgamma(x,shape=mean(swiss$Education)),add=TRUE,
col="darkblue", lwd=4)
curve(dexp(x,rate=1/mean(swiss$Education)),add=TRUE, col="black", lwd=4) curve(dnorm(x,mean=mean(swiss$Education),sd=sd(swiss$Education)), add=TRUE, col="green3", lwd=4) legend(40,0.05, c("Gamma","Exponential","Normal"), lty=c(1,1), lwd=c(4,4),col=c("darkblue","black", "green3")) With a small number of bins, the exponential distribution seems like a good fit, visually. With a larger number of bins, neither of the three families seems particularly well-advised. And now, can you figure out what is happening with these visualizations of the iris dataset? hist(iris$Sepal.Length, freq=FALSE, xlab="Sepal.Length",
main="Sepal.Length Distribution", col="firebrick1", ylim=c(0,0.15)) What happens above if we replace freq=FALSE with freq=TRUE?

Let’s look at another feature, now.

hist(iris$Sepal.Width, freq=FALSE, xlab="Sepal.Width", main="Sepal.Width Distribution", col="firebrick1", ylim=c(0,0.15)) Histograms (1D data representations) can be combined with scatterplots (2D data representations) to provide marginal information (this is done through a custom function based on R-blogger.com’s scatterhist() function). scatterhist = function(x, y, xlab="", ylab=""){ zones=matrix(c(2,0,1,3), ncol=2, byrow=TRUE) layout(zones, widths=c(4/5,1/5), heights=c(1/5,4/5)) xhist = hist(x, plot=FALSE) yhist = hist(y, plot=FALSE) top = max(c(xhist$counts, yhist$counts)) par(mar=c(3,3,1,1)) plot(x,y) par(mar=c(0,3,1,1)) barplot(xhist$counts, axes=FALSE, ylim=c(0, top), space=0)
par(mar=c(3,0,1,1))
barplot(yhist$counts, axes=FALSE, xlim=c(0, top), space=0, horiz=TRUE) par(oma=c(3,3,0,0)) mtext(xlab, side=1, line=1, outer=TRUE, adj=0, at=.8 * (mean(x) - min(x))/(max(x)-min(x))) mtext(ylab, side=2, line=1, outer=TRUE, adj=0, at=(.8 * (mean(y) - min(y))/(max(y) - min(y)))) } ds = iris with(ds, scatterhist(iris$Sepal.Length, iris$Sepal.Width, xlab="Sepal.Length", ylab="Sepal.Width")) #### Bubble Charts Bubble charts are a neat way to show at least 3 variables on the same 2D display. The location of the bubbles’ centre takes care of 2 variables: the size, colour, and shape of the bubbles can also be used to represent different data elements. For this first example, we’ll take a look at some Statistics Canada 2011 demographic data regarding Canada’s census metropolitan areas (CMA) and census agglomerations (CA). can.2011=read.csv("data/Canada2011.csv", head=TRUE) head(can.2011) str(can.2011) summary(can.2011[,3:12], maxsum=13)   Geographic.code Geographic.name Province Region Type pop_2011 1 1 St. John's NL Atlantic CMA 196966 2 5 Bay Roberts NL Atlantic CA 10871 3 10 Grand Falls-Windsor NL Atlantic CA 13725 4 15 Corner Brook NL Atlantic CA 27202 5 105 Charlottetown PE Atlantic CA 64487 6 110 Summerside PE Atlantic CA 16488 log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011 1 5.294391 20 84542 78960 2 4.036269 147 4601 4218 3 4.137512 128 6134 5723 4 4.434601 94 11697 11110 5 4.809472 52 28864 26192 6 4.217168 120 7323 6941 occ_rate_2011 med_total_income_2011 1 0.9339736 33420 2 0.9167572 24700 3 0.9329964 26920 4 0.9498162 27430 5 0.9074279 30110 6 0.9478356 27250 'data.frame': 147 obs. of 12 variables:$ Geographic.code       : int  1 5 10 15 105 110 205 210 215 220 ...
$Geographic.name : chr "St. John's" "Bay Roberts" "Grand Falls-Windsor" "Corner Brook" ...$ Province              : chr  "NL" "NL" "NL" "NL" ...
$Region : chr "Atlantic" "Atlantic" "Atlantic" "Atlantic" ...$ Type                  : chr  "CMA" "CA" "CA" "CA" ...
$pop_2011 : int 196966 10871 13725 27202 64487 16488 390328 26359 45888 35809 ...$ log_pop_2011          : num  5.29 4.04 4.14 4.43 4.81 ...
$pop_rank_2011 : int 20 147 128 94 52 120 13 97 67 78 ...$ priv_dwell_2011       : int  84542 4601 6134 11697 28864 7323 177295 11941 21708 16788 ...
$occ_private_dwell_2011: int 78960 4218 5723 11110 26192 6941 165153 11123 19492 15256 ...$ occ_rate_2011         : num  0.934 0.917 0.933 0.95 0.907 ...
$med_total_income_2011 : int 33420 24700 26920 27430 30110 27250 33400 25500 26710 26540 ... Province Region Type pop_2011 Length:147 Length:147 Length:147 Min. : 10871 Class :character Class :character Class :character 1st Qu.: 18429 Mode :character Mode :character Mode :character Median : 40077 Mean : 186632 3rd Qu.: 98388 Max. :5583064 log_pop_2011 pop_rank_2011 priv_dwell_2011 occ_private_dwell_2011 Min. :4.036 Min. : 1.0 Min. : 4601 Min. : 4218 1st Qu.:4.265 1st Qu.: 37.5 1st Qu.: 8292 1st Qu.: 7701 Median :4.603 Median : 74.0 Median : 17428 Median : 16709 Mean :4.720 Mean : 74.0 Mean : 78691 Mean : 74130 3rd Qu.:4.993 3rd Qu.:110.5 3rd Qu.: 44674 3rd Qu.: 41142 Max. :6.747 Max. :147.0 Max. :2079459 Max. :1989705 occ_rate_2011 med_total_income_2011 Min. :0.6492 Min. :22980 1st Qu.:0.9173 1st Qu.:27630 Median :0.9348 Median :29910 Mean :0.9276 Mean :31311 3rd Qu.:0.9475 3rd Qu.:33255 Max. :0.9723 Max. :73030 NA's :5  Before jumping aboard the bubble chart train, let’s see what the dataset looks like in the scatterplot framework for 5 of the variables, grouped along regions. pairs(can.2011[,c(7,9,10,11,12)]) It’s … underwhelming, if that’s a word.78. There are some interesting tidbits, but nothing jumps as being meaningful beyond a first pass. Can anything else be derived using bubble charts? We use median income as the radius for the bubbles, and focus on occupancy rates and population. radius.med.income.2011<-sqrt(can.2011$med_total_income_2011/pi)
symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011,
fg="white", bg="red", xlab="Population (log)",
ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)") Clearly, an increase in population seems to be associated with (and not necessarily a cause of) a rise in occupancy rates. But the median income seems to have very little correlation with either of the other two variables. Perhaps such a correlation is hidden by the default unit used to draw the bubbles? Let’s shrink it from 0.45 to 0.25 and see if anything pops out.

symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011,
inches=0.25, fg="white", bg="red",
xlab="Population (log)", ylab="Occupancy Rate")
title("Total Median Income, by CMA and CA (2011)") Not much to it. But surely there would be a relationship in these quantities if we included the CMA/CA’s region?

symbols(can.2011$log_pop_2011, can.2011$occ_rate_2011, circles=radius.med.income.2011,
inches=0.15, fg="white",
bg=c("red","blue","black","green","yellow","violet")[factor(
can.2011$Region)], xlab="Population (log)", ylab="Occupancy Rate") title("Total Median Income, by CMA and CA (2011) - colours represent oeographical regions.") What do you think? Perhaps the CA distort the full picture (given that they are small and more numerous). Let’s analyze the subset of CMAs instead. can.2011.CMA=read.csv("data/Canada2011_CMA.csv", head=TRUE) # import the data str(can.2011.CMA) #dataset structure (notice the number of observations) summary(can.2011.CMA[,3:12], maxsum=13) # feature by feature distribution 'data.frame': 33 obs. of 15 variables:$ Geographic.code       : int  1 205 305 310 408 421 433 442 462 505 ...
$Geographic.name : chr "St. John's" "Halifax" "Moncton" "Saint John" ...$ Province              : chr  "NL" "NS" "NB" "NB" ...
$Region : chr "Atlantic" "Atlantic" "Atlantic" "Atlantic" ...$ Type                  : chr  "CMA" "CMA" "CMA" "CMA" ...
$pop_2011 : int 196966 390328 138644 127761 157790 765706 201890 151773 3824221 1236324 ...$ log_pop_2011          : num  5.29 5.59 5.14 5.11 5.2 ...
$fem_pop_2011 : int 103587 204579 70901 65834 79770 394935 104026 78105 1971520 647083 ...$ prop_fem_2011         : num  52.6 52.4 51.1 51.5 50.6 ...
$pop_rank_2011 : int 20 13 29 31 26 7 19 27 2 4 ...$ priv_dwell_2011       : int  84542 177295 62403 56775 73766 361447 99913 74837 1696210 526627 ...
$occ_private_dwell_2011: int 78960 165153 58294 52281 69507 345892 91099 70138 1613260 498636 ...$ occ_rate_2011         : num  0.934 0.932 0.934 0.921 0.942 ...
$med_unemployment_2011 : num 6.6 6.2 7.45 6.75 7.8 5.15 6.4 8.9 8.1 6.3 ...$ med_total_income_2011 : int  33420 33400 30690 29910 29560 33760 27620 27050 28870 39170 ...
Province            Region              Type              pop_2011
Length:33          Length:33          Length:33          Min.   : 118975
Class :character   Class :character   Class :character   1st Qu.: 159561
Mode  :character   Mode  :character   Mode  :character   Median : 260600
Mean   : 700710
3rd Qu.: 721053
Max.   :5583064
log_pop_2011    fem_pop_2011     prop_fem_2011   pop_rank_2011
Min.   :5.075   Min.   :  63260   Min.   :50.55   Min.   : 1
1st Qu.:5.203   1st Qu.:  83243   1st Qu.:51.55   1st Qu.: 9
Median :5.416   Median : 135561   Median :52.17   Median :17
Mean   :5.553   Mean   : 365093   Mean   :52.05   Mean   :17
3rd Qu.:5.858   3rd Qu.: 378690   3rd Qu.:52.41   3rd Qu.:25
Max.   :6.747   Max.   :2947971   Max.   :53.17   Max.   :33
priv_dwell_2011   occ_private_dwell_2011
Min.   :  53730   Min.   :  48848
1st Qu.:  72817   1st Qu.:  67767
Median : 110314   Median : 104237
Mean   : 291519   Mean   : 275587
3rd Qu.: 294150   3rd Qu.: 282186
Max.   :2079459   Max.   :1989705       

We can use the median income for the bubbles radius again, but this time we’ll look at population and unemployment.

radius.med.income.2011.CMA<-sqrt(can.2011.CMA$med_total_income_2011/pi) symbols(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011, circles=radius.med.income.2011.CMA, inches=0.25, fg="white", bg=c("red","blue","gray","green","yellow")[factor(can.2011.CMA$Region)],
ylab="Population (log)", xlab="Unemployment Rate")
title("Total Median Income, by CMA (2011)")
text(can.2011.CMA$log_pop_2011, can.2011.CMA$med_unemployment_2011,
can.2011.CMA$Geographic.code, cex=0.5) Part of the problem is that median income seems to be roughly uniform among CMAs. What if we used rank statistics instead? Switch the radius to population rank, say? radius.pop.rank.2011.CMA<-sqrt(can.2011.CMA$pop_rank_2011/pi)
symbols(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011,
bg=c("red","blue","gray","green","yellow")[factor(can.2011.CMA$Region)], ylab="Median Income", xlab="Unemployment Rate") title("Population Rank, by CMA and CA (2011)") text(can.2011.CMA$med_total_income_2011, can.2011.CMA$med_unemployment_2011, can.2011.CMA$Geographic.code, cex=0.5) 