We facilitated a mini course on R learning remotely at the University of Dayton in Summer 2020. To the extent possible, the content of the meetings is recorded here.
You can use the following single-character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund 2018):
A: Toggles between showing the current slide and all slides (helpful for printing all pages)
B: Makes fonts larger
c: Shows the table of contents
S: Makes fonts smaller
In the first session, we will have a short overview of the interfaces of R and RStudio. Then we will talk about basic syntax and data types in R.
In this section, we introduce some basic syntax in R.
The # symbol is used to add comments and notes to your code. On any line of code, anything after # will not be executed.
## [1] 8
## [1] -2
## [1] -2
## [1] 15
## [1] 0.6
## [1] 0.6
Note: you should find that 3/5 and 3 / 5 generate the same result. This is because blank spaces in the code are generally ignored.
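The outputs above come from simple arithmetic expressions. A minimal sketch of the kind of expressions that produce such results (the exact numbers are assumptions based on the output shown):

3 + 5   # addition, returns 8
3 - 5   # subtraction, returns -2
3 * 5   # multiplication, returns 15
3/5     # division, returns 0.6
3 / 5   # blank spaces are ignored, still 0.6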
## [1] 8
## [1] 3
## [1] 8
## [1] 3
What’s the difference between them?
## [1] 8
## Error in eval(expr, envir, enclos): object 'x' not found
## [1] 8
## [1] 1 4 7 9 19
In the first case, \(x\) is an argument in the function mean(), while the second case assigns the vector \((1, 4, 7, 9, 19)\) to \(x\) and then finds its mean.
We should use <- as an assignment operator and = for function arguments!
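A minimal sketch of the difference, using the vector \((1, 4, 7, 9, 19)\) from the example above (the exact calls are assumptions consistent with the output shown):

mean(x = c(1, 4, 7, 9, 19))   # returns 8; here x is only the name of the argument
x                             # Error: object 'x' not found
mean(x <- c(1, 4, 7, 9, 19))  # returns 8; <- also creates x in the workspace
x                             # returns 1 4 7 9 19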
Parentheses, ( ), are used to call functions; brackets, [ ], are used to obtain values in a data structure; curly brackets, { }, are used to denote a block of code in a function or in a conditional statement.
Here, we give examples about the use of ( ) and [ ]. The use of curly brackets will be introduced later.
w <- c(39, 61, 9, 17, 25, 56, 47, 62, 71, 100, 1, 42) # c() combines objects
median(w) # find the median of w
## [1] 44.5
## [1] 9
## [1] 39 61
## [1] 61 9 17
## [1] 61 25 62
## [1] 39 61 9 17 56 47 62 71 100 1 42
## [1] 39 9 17 25 47 1 42
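The indexing output above can be reproduced with brackets; a sketch (the exact index choices are assumptions consistent with the output):

w[3]                   # the third element: 9
w[1:2]                 # the first two elements: 39 61
w[2:4]                 # elements 2 through 4
w[c(2, 5, 8)]          # elements 2, 5, and 8
w[-5]                  # all elements except the fifth
w[-c(2, 6, 8, 9, 10)]  # drop elements 2, 6, 8, 9, and 10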
Note: c() can concatenate more than just vectors. We will talk about this later.
In this section, we introduce the basic data types in R.
## [1] "character"
## [1] 5
Note: When defining strings, double quotes " " and single quotes ' ' are interchangeable, but double quotes are preferred (and character constants are printed using double quotes), so single quotes are normally only used to delimit character constants containing double quotes (R Documentation 2020).
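A minimal sketch of defining a character object and checking it (using the string "Hello" is an assumption consistent with the output above):

s <- "Hello"   # double quotes are preferred
class(s)       # "character"
nchar(s)       # number of characters: 5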
If we want to combine two strings into one string, we can use the paste() or paste0() functions.
## [1] "Hello World!"
## [1] "Hello,World!"
## [1] "Hello, World!"
## [1] "Hello , World!"
## [1] "HelloWorld!"
These two functions could be very useful. Here we give one example.
allfiles1 <- paste("file_", 1:5)
allfiles2 <- paste("file_", 1:5, collapse = "_")
allfiles3 <- paste("file", 1:5, sep = "_")
allfiles1
## [1] "file_ 1" "file_ 2" "file_ 3" "file_ 4" "file_ 5"
## [1] "file_ 1_file_ 2_file_ 3_file_ 4_file_ 5"
## [1] "file_1" "file_2" "file_3" "file_4" "file_5"
A factor object is used to store categorical / qualitative variables.
## [1] A C B B- A C+ D A- B+ C- B
## Levels: A A- B B- B+ C C- C+ D
## [1] "factor"
## [1] "F" "M"
## [1] 11
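A sketch of creating factor objects like those shown above (the exact vectors are assumptions consistent with the output):

grades <- factor(c("A", "C", "B", "B-", "A", "C+", "D", "A-", "B+", "C-", "B"))
grades            # prints the values and their levels
class(grades)     # "factor"
gender <- factor(c("M", "F", "F", "M", "M", "M", "F", "M", "F"))
levels(gender)    # "F" "M"
length(grades)    # 11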
A numeric object is used to store numeric data in R.
## [1] "numeric"
## [1] "numeric"
## [1] 15.7
## [1] 6
## [1] -3.13
## [1] -3.13 6.00
## [1] -3 2 6 -2 4 3 1 0 4
## [1] -3 3 6 -1 5 3 1 0 4
## [1] -4 2 6 -2 4 2 1 0 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.130 0.000 2.470 1.744 3.850 6.000
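The numeric summaries above can be produced with calls like the following, where x2 is the numeric vector defined in the next code chunk (the specific functions are assumptions consistent with the output):

x2 <- c(-3.13, 2.47, 6, -1.5, 4.29, 2.72, 1, 0, 3.85)
class(x2)     # "numeric"
sum(x2)       # 15.7
max(x2)       # 6
min(x2)       # -3.13
range(x2)     # -3.13 6.00
round(x2)     # round to the nearest integer
ceiling(x2)   # round up
floor(x2)     # round down
summary(x2)   # five-number summary plus the mean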
An integer object is used to store numeric data without decimals.
x2 <- c(-3.13, 2.47, 6, -1.5, 4.29, 2.72, 1, 0, 3.85)
x3 <- as.integer(x2) # truncate toward zero, keeping only the integer part
x3
## [1] -3 2 6 -1 4 2 1 0 3
## [1] "integer"
A logical object contains only two values: TRUE or FALSE.
## [1] FALSE
## [1] FALSE
## [1] TRUE
## [1] "logical"
A complex object is used to store complex values.
## [1] NaN
## Error in eval(expr, envir, enclos): object 'i' not found
## [1] 0+1i
## [1] "complex"
## [1] 3-5i
There are some situations that we may want to create an empty vector. Here is a simple example.
x <- c()
y1 <- vector("character", length=3)
y2 <- character(3)
z1 <- vector("numeric", 5)
z2 <- numeric(5)
w <- rep(NA, 2)
x
## NULL
## [1] "" "" ""
## [1] "" "" ""
## [1] 0 0 0 0 0
## [1] 0 0 0 0 0
## [1] NA NA
We can use the as.*() family of functions (for example, as.integer(), as.character(), as.numeric(), as.factor()) to transform data types.
## [1] "integer"
## [1] "3" "5"
## [1] 1
## [1] 0
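A sketch of conversions that produce output like the lines above (the exact inputs are assumptions):

class(as.integer(3.7))   # "integer"
as.character(c(3, 5))    # "3" "5"
as.numeric(TRUE)         # 1
as.numeric(FALSE)        # 0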
gender <- factor(c("M", "F", "F", "M", "M", "M", "F", "M", "F"))
as.numeric(gender) # transform levels in the categorical variables to numbers
## [1] 2 1 1 2 2 2 1 2 1
Note: the as.numeric() function can transform a logical object to numeric values: TRUE becomes 1 and FALSE becomes 0.
In this section, we introduce some useful commands regarding strings and vectors.
toupper() and tolower() functions change the case of characters of a string.
## [1] "HELLO, THE WORLD!"
## [1] "hello, the world!"
## [1] "Hello, the World!" "Good to see you!"
## [1] "HELLO, THE WORLD!" "GOOD TO SEE YOU!"
The substring() function is used to obtain parts of a string.
Usage: substring(x, first, last)
## [1] "the"
The print() function prints its argument and returns it invisibly.
## [1] "Hello"
## [1] 1
## [1] 1
## [1] 1 3 6
## [1] "Hello!" "1" "3" "6"
## [1] "Hello!" "1" "3" "6"
Note: If we want to print several objects together, we need to combine them first.
grep() finds the pattern in a string and returns the indices (positions)
Usage: grep(pattern, string, value=FALSE)
## [1] "xyz" "xyz" "zxy"
## [1] 1 2 5
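A sketch of grep() calls consistent with the output above (the middle elements of the example vector are assumptions):

strings <- c("xyz", "xyz", "abc", "def", "zxy")
grep("xy", strings, value = TRUE)   # "xyz" "xyz" "zxy"
grep("xy", strings)                 # positions: 1 2 5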
gsub() finds the pattern in a string and replaces every occurrence of the pattern with the replacement.
Usage: gsub(pattern, replacement, string)
## [1] A C B B- A C+ D A- B+ C- B
## Levels: A A- B B- B+ C C- C+ D
## [1] "90" "C" "B" "B-" "90" "C+" "D" "90-" "B+" "C-" "B"
## [1] M F F M M M F M F
## Levels: F M
## [1] "M" "Girl" "Girl" "M" "M" "M" "Girl" "M" "Girl"
## [1] "Boy" "F" "F" "Boy" "Boy" "Boy" "F" "Boy" "F"
## [1] "ABcddefdae"
In this session, we will introduce two basic data structures (matrix and data frame), installing and loading packages, and importing and writing data files.
A matrix is a rectangular array of numbers or other mathematical objects for which operations such as addition and multiplication are defined (Matrix).
For example,
\[M=\left(\begin{array}{cccc} a_{11} & a_{12} & \ldots & a_{1n}\\ a_{21} & a_{22} & \ldots & a_{2n}\\ \vdots & \vdots & \ddots &\vdots\\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{array}\right)\]
is called an \(m\times n\) matrix (\(m\) rows and \(n\) columns), where \(m\times n\) is the dimension of \(M\). In addition, \(M[i,j]\) is the element in the \(i^{th}\) row and the \(j^{th}\) column.
Usage: matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL)
We can define matrices directly with numbers assigned in each element.
## [,1] [,2]
## [1,] 7 10
## [2,] 8 11
## [3,] 9 12
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
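A sketch of matrix() calls that produce output like the two matrices above (the object names and exact arguments are assumptions consistent with the output):

# Example 1
M1 <- matrix(7:12, nrow = 3, ncol = 2)              # filled by column
M2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE) # filled by row
M1
M2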
# Example 2
C <- matrix(nrow=2,ncol=3)
C[1,1] <- 1
C[1,2] <- 3
C[1,3] <- 5
C[2,1] <- 4
C[2,2] <- 7
C[2,3] <- 9
C
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 4 7 9
In the following, we show how to extract rows or columns in a matrix.
## [1] 1 4 7 10
## [1] 4 5 6
## [1] 4 7
Then, the following code chunk shows how to exclude rows or columns in a matrix.
## [,1] [,2]
## [1,] 1 7
## [2,] 2 8
## [3,] 3 9
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [,1] [,2] [,3] [,4]
## [1,] 2 5 8 11
## [2,] 3 6 9 12
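The extraction and exclusion output above is consistent with indexing a 3-by-4 matrix filled with 1 to 12; a sketch (the matrix name and exact indices are assumptions):

A <- matrix(1:12, nrow = 3, ncol = 4)
A[1, ]          # first row: 1 4 7 10
A[, 2]          # second column: 4 5 6
A[1, 2:3]       # row 1, columns 2 and 3: 4 7
A[, -c(2, 4)]   # exclude columns 2 and 4
A[-3, ]         # exclude row 3
A[-1, ]         # exclude row 1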
There are two multiplication operators for matrices in R. The first operator, *, performs a simple element-by-element multiplication. The second operator, %*%, performs matrix multiplication between two matrices.
## [,1] [,2]
## [1,] 2 6
## [2,] 4 8
## [,1] [,2]
## [1,] 1 9
## [2,] 4 16
## [,1] [,2]
## [1,] 7 15
## [2,] 10 22
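A sketch consistent with the output above (the matrix name D is an assumption):

D <- matrix(1:4, nrow = 2)   # columns are (1, 2) and (3, 4)
2 * D      # multiply every element by 2
D * D      # element-by-element multiplication
D %*% D    # matrix multiplication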
One should note that vector and matrix are different data types in R, though a vector can be seen as a special case of a matrix.
We can use dim() function in R to check the dimension of a matrix or data frame (we will talk about it soon).
a <- 1:3
b <- as.matrix(a) # use as.matrix() to transform a vector to a matrix
dim(a) # check the dimension of a
## NULL
## [1] 3 1
In general, datasets in R are stored as data frames. The structure of a data frame is similar to a matrix, but all columns of a matrix must have the same data type, while a data frame can contain different data types in different columns. A data frame has column and row names.
Usage: data.frame(…, row.names = NULL, check.rows = FALSE, check.names = TRUE, fix.empty.names = TRUE, stringsAsFactors = FALSE)
# Creating Data Frames
names <- c('David', 'John', 'Mary')
quiz.1 <- c(89, 93, 85)
quiz.2 <- c(91, 88, 90)
Grade <- data.frame(names, quiz.1, quiz.2, stringsAsFactors = TRUE)
Grade
## names quiz.1 quiz.2
## 1 David 89 91
## 2 John 93 88
## 3 Mary 85 90
## 'data.frame': 3 obs. of 3 variables:
## $ names : Factor w/ 3 levels "David","John",..: 1 2 3
## $ quiz.1: num 89 93 85
## $ quiz.2: num 91 88 90
We can use the $, [, or [[ operators to access columns of a data frame. Here is a simple example.
## [1] David John Mary
## Levels: David John Mary
## [1] David John Mary
## Levels: David John Mary
## [1] David John Mary
## Levels: David John Mary
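A sketch of the three access methods (the exact calls are assumptions consistent with the output above):

Grade$names        # by name with $
Grade[["names"]]   # by name with [[
Grade[, "names"]   # by name with [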
We can use the colnames() function to obtain all column names of the data.
## [1] "names" "quiz.1" "quiz.2"
## [1] "names" "Quiz_1" "Quiz_2"
Elements in a data frame can be obtained like a matrix by providing row and column indices.
## [1] 85
## [1] Mary
## Levels: David John Mary
## [1] 91 88
## Quiz_1 Quiz_2
## 2 93 88
Usage: merge(x, y, by.x, by.y)
Grade1 <- data.frame(students = c('David', 'Gabby', 'Mary'),
quiz_3=c(88, 92, 85),
stringsAsFactors=TRUE)
Grade1
## students quiz_3
## 1 David 88
## 2 Gabby 92
## 3 Mary 85
## names Quiz_1 Quiz_2 quiz_3
## 1 David 89 91 88
## 2 Mary 85 90 85
Note: by default, merging two data frames returns only the rows found in both x and y. If we would like to keep all rows of the first data frame, we need the additional argument all.x = TRUE.
## names Quiz_1 Quiz_2 quiz_3
## 1 David 89 91 88
## 2 John 93 88 NA
## 3 Mary 85 90 85
When we include the rows with no matching rows in the other data frame, NA will be assigned to the corresponding positions.
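A sketch of the two merge() calls consistent with the output above (the argument values are assumptions):

merge(Grade, Grade1, by.x = "names", by.y = "students")                # rows found in both
merge(Grade, Grade1, by.x = "names", by.y = "students", all.x = TRUE)  # keep all rows of Grade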
The commonly used units that people adopt to share code in R are packages. In general, a package contains code, data, documentation, tests, etc. Most people upload their packages to CRAN, the Comprehensive R Archive Network, while a few people share their code on GitHub or other websites. It is recommended that you ONLY download packages from CRAN since these packages are well maintained.
In order to import packages in RStudio, you need to
know the name of the package.
download the package. Here, we introduce two basic methods:
– Click the Packages tab in RStudio (bottom-right pane), then click Install. Under Install from:, select Repository (CRAN), type the name of the package in the box under Packages (separate multiple names with a space or comma), and click Install.
Note: we should leave Install dependencies checked so R will download any additional packages needed in order to use some functions or data in the package you are currently downloading.
– In the Console window, run install.packages("package_name").
Note: It is essential to put the quotation marks around the package’s name.
Note:
Sometimes, warning messages appear in the Console when installing certain packages, indicating that the package was built under an older version of R. In general, these warnings can be ignored since the packages are usually still compatible with newer versions of R.
You only need to install a package once, the first time you need it. After that, you can load the package whenever you need it.
The main difference between library() and require() functions is library() returns an error if the package doesn’t exist while require() returns FALSE and gives a warning.
In this section, we introduce two methods of importing data from some commonly used formats and writing files.
Since there are many file types, we will focus on two commonly used ones: text files and comma-separated value (CSV) files. We will use the package readr, which is included in tidyverse, as it provides a fast and convenient way to read rectangular data (e.g., csv, tsv, and fwf).
readr provides the following functions to read files of the supported file types:
Some common arguments in these functions:
- file: can be either a path to a file, a connection, or literal data
- col_names: can be either TRUE, FALSE, or a character vector of column names
In general, these functions work well. We pass the path to a file, and we obtain a tibble, which is a modern reimagining of the data frame. It is much easier to navigate, view, and manipulate the contents of data using a tibble, as every row corresponds to an observation and every column corresponds to a variable.
The following two code chunks give examples of reading data files. The first data file can be downloaded here: bike_sharing_data.csv.
library(tidyverse)
df1 <- read_csv("../data/bike_sharing_data.csv")
head(df1) # use head() to read the first six rows of the data
## # A tibble: 6 x 12
## datetime season holiday workingday weather temp atemp humidity windspeed
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/201~ 1 0 0 1 9.84 14.4 81 0
## 2 1/1/201~ 1 0 0 1 9.02 13.6 80 0
## 3 1/1/201~ 1 0 0 1 9.02 13.6 80 0
## 4 1/1/201~ 1 0 0 1 9.84 14.4 75 0
## 5 1/1/201~ 1 0 0 1 9.84 14.4 75 0
## 6 1/1/201~ 1 0 0 2 9.84 12.9 75 6.00
## # ... with 3 more variables: casual <dbl>, registered <dbl>, count <dbl>
## Rows: 17,379
## Columns: 12
## $ datetime <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/...
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weather <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,...
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
Note: glimpse() is a function included in tidyverse.
The second data file: Iris.Data can be obtained from UCI Machine Learning Repository.
The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant. It has 5 variables: sepal_length, sepal_width, petal_length, petal_width, and class.
df2 <- read_delim("../data/iris.Data", delim=",", col_names = c("sepal_length", "sepal_width", "petal_length", "petal_width", "class"))
glimpse(df2)
## Rows: 150
## Columns: 5
## $ sepal_length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
## $ sepal_width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
## $ petal_length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
## $ petal_width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
## $ class <chr> "Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-set...
Note: In many programming languages such as C, C++, Java, MATLAB, Python, Perl, and R, a backslash, \, works as an escape character in strings. So in these languages, we need to use either a forward slash, /, or a double backslash, \\, in the string in order to represent a single backslash in a path.
Similarly, readr provides the following functions to write files:
Some Common arguments in these functions:
In this session, we will use the data diamonds which contains prices of over 50,000 round cut diamonds to study how to make the following graphical displays. This data is included in tidyverse.
Here is a list of common arguments:
A bar chart is a graphical display suitable for a general audience. Here we study the distribution of the quality of the cut in the data.
Usage: barplot(height, …)
barplot(table(diamonds$cut), col="blue", main="Distribution of Diamond Cut", horiz=TRUE, xlab="Number of Diamonds")
Note:
One can use names to change the names appearing under each bar. For example, names = c("F", "G", "VG", "P", "I").
We can use RGB color code to assign colors.
barplot(table(diamonds$cut), col="#69b3a2", main="Distribution of Diamond Cut", xlab="Number of Diamonds", names=c("F", "G", "VG", "P", "I"))
Similarly, we can use a pie chart to study the distribution of the diamond color, from D (best) to J (worst).
The following code chunk shows an advanced setting.
H <- table(diamonds$color)
percent <- round(100*H/sum(H), 1) # calculate percentages
pie_labels <- paste(percent, "%", sep="") # include %
pie(H, main="My Best Piechart", labels=pie_labels, col = 2:8)
legend("topright", c("D","E","F","G","H","I","J"), cex=0.8, fill=2:8)
Tip: Use color palette to choose colors.
A histogram is used when we want to study the distribution of a quantitative variable. Here we study the distribution of the prices in the data.
We can find that the distribution of the price is unimodal (one peak), skewed to the right, and has no outliers.
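A minimal sketch of a histogram of diamond prices using the base hist() function (the labels and color are assumptions):

hist(diamonds$price, col = "lightblue", main = "Distribution of Diamond Prices", xlab = "Price (US dollars)")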
Here we talk about another graphical display that can be used to study the distribution of a quantitative variable: the box-and-whisker plot (boxplot).
In general, a boxplot is used when we want to compare the distributions of a quantitative variable across several groups. In the following, we study the distribution of diamond prices across the different cut qualities.
In order to know how this can be done, we need to know how to define a formula in R.
Usage: A ~ B
boxplot(diamonds$price ~ diamonds$cut, main="Distribution of Price of Diamonds among the Quality of the Cut", xlab="Quality", ylab="Price", col=11:15, cex.lab=1.25, cex.axis=1.25)
We can use the argument data to indicate that variables used are from a given data.
boxplot(price ~ cut + color, data = diamonds, main="Distribution of Price of Diamonds among the Quality of the Cut", xlab="Quality", ylab="Price", cex.lab=1.25, cex.axis=1.25)
The above plot is only for demonstration purposes. We can see that not all category names are shown on the plot, which should be improved.
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Here we study the relationship between the diamond price and its weight (carat).
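A minimal sketch of the scatterplot using the base plot() function (the labels and point settings are assumptions):

plot(diamonds$carat, diamonds$price, xlab = "Weight (carat)", ylab = "Price (US dollars)", main = "Diamond Price versus Weight", pch = 20, cex = 0.5)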
When we want to show how a quantitative variable changes over a period of time, a line plot can be used. Line plots can also be used to compare changes over the same period of time for several groups. Since the diamonds dataset is not time series data, it is not appropriate for a line plot. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13 to July 22 (The Weather Channel).
In order to graph a line plot, we need to know two additional arguments
type: “p” to draw only points; “l” to draw only lines; “o” to draw both points and lines
lty: line types. 0=blank, 1=solid, 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash
Date <- 13:22
Dayton_OH <- c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91)
Houston_TX <- c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91)
Denver_CO <- c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96)
Fargo_ND <- c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
df <- data.frame(Date, Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)
plot(Date, Dayton_OH, type="o", col="blue", xlab="Date in July", ylab="Highest Temperature", ylim=c(80, 100))
lines(Date, Houston_TX, type="o", col="red")
lines(Date, Denver_CO, type="o", col="purple")
lines(Date, Fargo_ND, type="o", col="darkgreen")
In this session, we will talk about data manipulation using the R package tidyverse. This package contains a collection of R packages that help us do data management and exploration. The key packages in tidyverse are:
In this session, we will focus on the following key functions in dplyr using the bike_sharing_data.csv.
In order to handle data processing well in data science, it is essential to know how to use pipes. Pipes are a great tool for presenting a sequence of multiple operations and therefore increase the readability of the code. The pipe, %>%, is from the package magrittr and is loaded automatically when tidyverse is loaded.
The logic when using a pipe: object %>% function1() %>% function2() ...
First, we load the package, check conflict functions and import the bike_sharing_data.csv.
library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- read_csv("../data/bike_sharing_data.csv")
Now we need to understand each variable before we move on. In this data, we have 12 variables (UCI Machine Learning Repository).
datetime: date and time of the event
season: 1:spring, 2:summer, 3:fall, 4:winter
holiday: 1: holiday, 0: not holiday
workingday: 1: neither weekend nor holiday, 0: otherwise
weather
1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
temp: Normalized temperature in Celsius
atemp: Normalized feeling temperature in Celsius
humidity: Normalized humidity
windspeed: Normalized wind speed
casual: count of casual users
registered: count of registered users
count: count of total rental bikes including casual and registered
arrange() is used when we want to sort a dataset by a variable. If more variables are specified, the variable entered first takes priority over those entered later. The following code chunk gives an example that sorts the bike sharing data by temperature and humidity so we can examine the distribution of count in the data.
df1 <- df %>% arrange(temp, humidity)
df1[1:10,c("datetime","temp", "humidity", "count")] # print the first 10 rows to check the result
## # A tibble: 10 x 4
## datetime temp humidity count
## <chr> <dbl> <dbl> <dbl>
## 1 1/4/2012 2:00 0.82 34 1
## 2 1/4/2012 3:00 0.82 34 1
## 3 1/4/2012 4:00 0.82 41 2
## 4 1/4/2012 5:00 0.82 41 14
## 5 1/4/2012 6:00 0.82 41 59
## 6 1/22/2011 6:00 0.82 44 4
## 7 1/22/2011 7:00 0.82 44 13
## 8 1/22/2011 8:00 0.82 44 28
## 9 1/4/2012 7:00 0.82 44 152
## 10 1/4/2012 8:00 0.82 44 315
A potential question we may be interested in: is there any association between the count of bike sharing and the temperature, the humidity, or both?
Note:
filter() is used when we want to keep the rows of a dataset that satisfy a logical condition. If we would like to study how popular the bike sharing program (count) is during holidays, we can use filter() to keep only the rows for holidays.
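A sketch of calls consistent with the output below (the object name df_holiday and the exact code are assumptions):

table(df$holiday)                         # how many rows are holidays
df_holiday <- df %>% filter(holiday == 1) # keep only holiday rows
df_holiday$count[1:10]                    # first 10 counts on holidays
summary(df_holiday$count)                 # summary for holidays
summary(df$count)                         # summary for the original data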
##
## 0 1
## 16879 500
## [1] 17 16 8 2 3 1 5 13 33 47
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 120.00 168.37 265.25 712.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 42 141 187 277 977
What did you find by comparing the summary statistics between holidays’ data and the original data?
Here is an advanced example that shows how we can keep the data for holidays in season 2 where the temperature is higher than 20 degrees.
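A sketch of filtering with multiple conditions (the object names and exact code are assumptions consistent with the output below):

df_h2 <- df %>% filter(holiday == 1, season == 2, temp > 20)
dim(df_h2)
summary(df_h2$count)       # holidays in season 2 with temp > 20
summary(df_holiday$count)  # all holidays, for comparison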
## [1] 81 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 84.0 243.0 229.2 337.0 712.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 120.00 168.37 265.25 712.00
Or we can keep the data for holidays in season 2 or season 3 where the temperature is higher than 20 degrees.
## [1] 177 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 60.0 229.0 233.9 375.0 712.0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.0 84.0 243.0 229.2 337.0 712.0
group_by() is used to group rows by one or more variables, giving priority to the variable entered first. For example, if we would like to study the weather effect on the bike sharing program, we can group rows by weather categories.
##
## 1 2 3 4
## 11413 4544 1419 3
## # A tibble: 17,379 x 12
## # Groups: weather [4]
## datetime season holiday workingday weather temp atemp humidity windspeed
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1/1/201~ 1 0 0 1 9.84 14.4 81 0
## 2 1/1/201~ 1 0 0 1 9.02 13.6 80 0
## 3 1/1/201~ 1 0 0 1 9.02 13.6 80 0
## 4 1/1/201~ 1 0 0 1 9.84 14.4 75 0
## 5 1/1/201~ 1 0 0 1 9.84 14.4 75 0
## 6 1/1/201~ 1 0 0 2 9.84 12.9 75 6.00
## 7 1/1/201~ 1 0 0 1 9.02 13.6 80 0
## 8 1/1/201~ 1 0 0 1 8.2 12.9 86 0
## 9 1/1/201~ 1 0 0 1 9.84 14.4 75 0
## 10 1/1/201~ 1 0 0 1 13.1 17.4 76 0
## # ... with 17,369 more rows, and 3 more variables: casual <dbl>,
## # registered <dbl>, count <dbl>
The result shows the original data but indicates a grouping variable, weather, in our example. In general, the summarize() function is used together with group_by() after we group rows for some purpose. For example, we can study the averages of the quantitative measures (temp, atemp, humidity, windspeed) under different weather conditions.
df6 <- df %>% group_by(weather) %>% summarize(ave_temp = mean(temp),
ave_atemp = mean(atemp),
ave_humidity = mean(humidity),
ave_windspeed = mean(windspeed),
cases = n())
df6
## # A tibble: 4 x 6
## weather ave_temp ave_atemp ave_humidity ave_windspeed cases
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 21.0 24.4 57.4 12.8 11413
## 2 2 19.5 22.8 69.9 12.1 4544
## 3 3 18.7 21.8 82.8 14.7 1419
## 4 4 7.65 9.35 88.3 13.7 3
mutate() is used when we would like to add a new variable / column using the other variables in the data. The following code chunk shows how we convert the temperature (temp) and feeling temperature (atemp) from Celsius to Fahrenheit using the equation \[F = \frac{9}{5}\times C+32\] in the data and add two columns to the dataset.
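A sketch of the mutate() call consistent with the output below (the object name df_f is an assumption; the new column names follow the output):

df_f <- df %>% mutate(F_temp = 9/5 * temp + 32,
                      F_atemp = 9/5 * atemp + 32)
glimpse(df_f)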
## Rows: 17,379
## Columns: 14
## $ datetime <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/...
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weather <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,...
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
## $ F_temp <dbl> 49.712, 48.236, 48.236, 49.712, 49.712, 49.712, 48.236, ...
## $ F_atemp <dbl> 57.911, 56.543, 56.543, 57.911, 57.911, 55.184, 56.543, ...
We also can use a logical statement to add a new variable in the data.
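A sketch of adding a variable with a logical statement, continuing from df_f in the previous sketch (the cutoff of 30 degrees and the exact code are assumptions; the output below shows a column temp_level with values "high" and "low"):

df_f2 <- df_f %>% mutate(temp_level = ifelse(temp > 30, "high", "low"))
glimpse(df_f2)
table(df_f2$temp_level)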
## Rows: 17,379
## Columns: 15
## $ datetime <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/...
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weather <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,...
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
## $ F_temp <dbl> 49.712, 48.236, 48.236, 49.712, 49.712, 49.712, 48.236, ...
## $ F_atemp <dbl> 57.911, 56.543, 56.543, 57.911, 57.911, 55.184, 56.543, ...
## $ temp_level <chr> "low", "low", "low", "low", "low", "low", "low", "low", ...
##
## high low
## 2685 14694
select() is used when we would like to keep only selected variables in the data. For example, if we would like to focus on studying the quantitative variables in the bike sharing data, we can use the select() function to keep all quantitative variables.
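A sketch of the select() call (the object name and exact code are assumptions consistent with the output below):

df_num <- df %>% select(temp, atemp, humidity, windspeed, casual, registered, count)
# equivalently: df %>% select(-c(datetime, season, holiday, workingday, weather))
glimpse(df_num)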
## Rows: 17,379
## Columns: 7
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
This action is equivalent to dropping other variables in the data.
## Rows: 17,379
## Columns: 7
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
In this subsection, we talk about other useful functions.
drop_na() is used when we would like to drop rows containing missing values. Since the bike sharing data has no missing values, the R built-in dataset airquality is used here. First, we can use the sum() and is.na() functions together to check if there are any missing values in the data.
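A sketch of the missing-value checks (the exact code is an assumption consistent with the output below):

sum(is.na(df))          # bike sharing data: 0 missing values
sum(is.na(airquality))  # airquality: 44 missing values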
## [1] 0
## [1] 44
We can find that there are 44 missing values in the data. Now we can use drop_na() to drop rows containing missing values.
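A sketch of the drop_na() step (the object name aq is an assumption):

aq <- airquality %>% drop_na()
sum(is.na(aq))   # 0 missing values remain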
## [1] 0
Here we introduce a family of mutate related functions: mutate_all(), mutate_at(), mutate_if().
The following code chunk shows how we can scale each variable (\(\frac{x-\bar{x}}{s_x}\), \(\bar{x}\) is the sample mean and \(s_x\) is the corresponding standard deviation) in the data airquality.
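A minimal sketch of scaling every column of airquality with mutate_all() (dropping the missing values first is an assumption):

airquality %>% drop_na() %>% mutate_all(scale)   # each column becomes (x - mean)/sd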
The following code chunk shows how we can transform the data types of season, holiday, workingday, and weather to factor in the bike sharing data.
## Rows: 17,379
## Columns: 12
## $ datetime <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/...
## $ season <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ holiday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ workingday <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weather <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,...
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
df11 <- df %>% mutate_at(c("season", "holiday", "workingday", "weather"), as.factor)
glimpse(df11) # check the new data structure
## Rows: 17,379
## Columns: 12
## $ datetime <chr> "1/1/2011 0:00", "1/1/2011 1:00", "1/1/2011 2:00", "1/1/...
## $ season <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ holiday <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ workingday <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ weather <fct> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3,...
## $ temp <dbl> 9.84, 9.02, 9.02, 9.84, 9.84, 9.84, 9.02, 8.20, 9.84, 13...
## $ atemp <dbl> 14.395, 13.635, 13.635, 14.395, 14.395, 12.880, 13.635, ...
## $ humidity <dbl> 81, 80, 80, 75, 75, 75, 80, 86, 75, 76, 76, 81, 77, 72, ...
## $ windspeed <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 6.0032, 0.0000, ...
## $ casual <dbl> 3, 8, 5, 3, 0, 0, 2, 1, 1, 8, 12, 26, 29, 47, 35, 40, 41...
## $ registered <dbl> 13, 32, 27, 10, 1, 1, 0, 2, 7, 6, 24, 30, 55, 47, 71, 70...
## $ count <dbl> 16, 40, 32, 13, 1, 1, 2, 3, 8, 14, 36, 56, 84, 94, 106, ...
The following code chunk shows how we can scale the quantitative variables in the bike sharing data and round values to the first decimal place.
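A sketch using mutate_if() on the factor-converted data df11 from the previous chunk (the exact code is an assumption consistent with the output below):

df11 %>% mutate_if(is.numeric, ~ round(scale(.), 1))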
## # A tibble: 17,379 x 12
## datetime season holiday workingday weather temp[,1] atemp[,1] humidity[,1]
## <chr> <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl>
## 1 1/1/201~ 1 0 0 1 -1.3 -1.1 0.9
## 2 1/1/201~ 1 0 0 1 -1.4 -1.2 0.9
## 3 1/1/201~ 1 0 0 1 -1.4 -1.2 0.9
## 4 1/1/201~ 1 0 0 1 -1.3 -1.1 0.6
## 5 1/1/201~ 1 0 0 1 -1.3 -1.1 0.6
## 6 1/1/201~ 1 0 0 2 -1.3 -1.3 0.6
## 7 1/1/201~ 1 0 0 1 -1.4 -1.2 0.9
## 8 1/1/201~ 1 0 0 1 -1.5 -1.3 1.2
## 9 1/1/201~ 1 0 0 1 -1.3 -1.1 0.6
## 10 1/1/201~ 1 0 0 1 -0.9 -0.7 0.7
## # ... with 17,369 more rows, and 4 more variables: windspeed[,1] <dbl>,
## # casual[,1] <dbl>, registered[,1] <dbl>, count[,1] <dbl>
Note: Why scale or standardize values?
“How unusual is a value/observation?” The answer depends on the units of measurement.
Variables measured at different scales don’t contribute equally to the analysis.
In this session, we will introduce one data exploration package: DataExplorer and two data visualization packages: ggplot2 and plotly.
We will use the secondary data posted by The COVID Tracking Project, a volunteer group organized by The Atlantic that assembles COVID-19 data. The data contain daily summary COVID-19 information for the United States. First, we load the necessary packages and import the data.
library(tidyverse)
library(DataExplorer)
df_states <- read_csv("https://covidtracking.com/api/v1/states/daily.csv")
glimpse(df_states)
## Rows: 7,857
## Columns: 41
## $ date <dbl> 20200723, 20200723, 20200723, 20200723, 20...
## $ state <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO", ...
## $ positive <dbl> 2684, 74212, 36259, 0, 152944, 425616, 416...
## $ negative <dbl> 186825, 545315, 410221, 1037, 669769, 6352...
## $ pending <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 26...
## $ hospitalizedCurrently <dbl> 36, 1547, 480, NA, 2966, 8820, 351, 72, 91...
## $ hospitalizedCumulative <dbl> NA, 8995, 2361, NA, 7236, NA, 6133, 10712,...
## $ inIcuCurrently <dbl> NA, NA, NA, NA, 851, 2284, NA, NA, 22, 7, ...
## $ inIcuCumulative <dbl> NA, 1043, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ onVentilatorCurrently <dbl> 1, NA, 107, NA, 617, NA, NA, NA, 9, NA, NA...
## $ onVentilatorCumulative <dbl> NA, 553, 329, NA, NA, NA, NA, NA, NA, NA, ...
## $ recovered <dbl> 787, 32510, 28864, NA, 19737, NA, 5095, 85...
## $ dataQualityGrade <chr> "A", "B", "A+", "C", "A+", "B", "A", "B", ...
## $ lastUpdateEt <chr> "7/23/2020 00:00", "7/23/2020 11:00", "7/2...
## $ dateModified <dttm> 2020-07-23 00:00:00, 2020-07-23 11:00:00,...
## $ checkTimeEt <chr> "07/22 20:00", "07/23 07:00", "07/23 10:46...
## $ death <dbl> 19, 1397, 386, 0, 3063, 8027, 1643, 4410, ...
## $ hospitalized <dbl> NA, 8995, 2361, NA, 7236, NA, 6133, 10712,...
## $ dateChecked <dttm> 2020-07-23 00:00:00, 2020-07-23 11:00:00,...
## $ totalTestsViral <dbl> 189509, 618011, 445467, NA, 822713, 677830...
## $ positiveTestsViral <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 48...
## $ negativeTestsViral <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 35...
## $ positiveCasesViral <dbl> 2684, 72696, 36259, 0, 137710, 425616, 385...
## $ deathConfirmed <dbl> 19, 1357, NA, NA, 2431, NA, NA, 3530, NA, ...
## $ deathProbable <dbl> NA, 40, NA, NA, 152, NA, NA, 880, NA, 58, ...
## $ fips <chr> "02", "01", "05", "60", "04", "06", "08", ...
## $ positiveIncrease <dbl> 65, 2399, 1013, 0, 2335, 12040, 639, 9, 42...
## $ negativeIncrease <dbl> 4111, 7640, 5241, 0, 6397, 101845, 7370, 1...
## $ total <dbl> 189509, 619527, 446480, 1037, 822713, 6778...
## $ totalTestResults <dbl> 189509, 619527, 446480, 1037, 822713, 6778...
## $ totalTestResultsIncrease <dbl> 4176, 10039, 6254, 0, 8732, 113885, 8009, ...
## $ posNeg <dbl> 189509, 619527, 446480, 1037, 822713, 6778...
## $ deathIncrease <dbl> 0, 33, 6, 0, 89, 157, 0, 4, 1, 2, 173, 25,...
## $ hospitalizedIncrease <dbl> 0, 457, 44, 0, 189, 0, 23, 58, 0, 0, 403, ...
## $ hash <chr> "250b2f86b7f497e40057c76c9c34280febdd150b"...
## $ commercialScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeRegularScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ negativeScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ positiveScore <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ score <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ grade <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
The dataset contains 7857 observations of 41 variables up to Thu Jul 23 19:49:04 2020.
In data science, it is important to get to know your data before advanced modeling or further analysis. We should understand what the data are about, what variables we have, the size of the data, how many missing values, what is the data type of each variable, any possible relationships between variables and anything unusual or interesting in the data.
First, we check the basic description for the COVID-19 data using the function plot_intro() in the package DataExplorer.
Then, we study the distribution of missing values in the COVID-19 data using the function plot_missing() in the package DataExplorer.
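A minimal sketch of these two calls (the exact arguments are assumptions):

plot_intro(df_states)     # basic description: rows, columns, missing values, etc.
plot_missing(df_states)   # percentage of missing values for each variable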
Since there are 41 variables and some of them are beyond our interest here, we will focus on the following variables for the data exploration, as many of the other variables depend on some of these.
df_states <- df_states %>% select(c(date, state, totalTestResultsIncrease, positiveIncrease, negativeIncrease, deathIncrease, hospitalizedIncrease, death))
glimpse(df_states)
## Rows: 7,857
## Columns: 8
## $ date <dbl> 20200723, 20200723, 20200723, 20200723, 20...
## $ state <chr> "AK", "AL", "AR", "AS", "AZ", "CA", "CO", ...
## $ totalTestResultsIncrease <dbl> 4176, 10039, 6254, 0, 8732, 113885, 8009, ...
## $ positiveIncrease <dbl> 65, 2399, 1013, 0, 2335, 12040, 639, 9, 42...
## $ negativeIncrease <dbl> 4111, 7640, 5241, 0, 6397, 101845, 7370, 1...
## $ deathIncrease <dbl> 0, 33, 6, 0, 89, 157, 0, 4, 1, 2, 173, 25,...
## $ hospitalizedIncrease <dbl> 0, 457, 44, 0, 189, 0, 23, 58, 0, 0, 403, ...
## $ death <dbl> 19, 1397, 386, 0, 3063, 8027, 1643, 4410, ...
Now, we study the frequency distribution of all categorical variables in the data using the function plot_bar() in the package DataExplorer.
Since we only have one categorical variable: state in the data, the above figure shows the frequency distribution of state in the COVID-19 data.
The following code shows the distribution of the sum of deathIncrease by state.
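A minimal sketch of the two bar charts (the exact arguments are assumptions):

plot_bar(df_states)                          # frequency of each categorical variable (state)
plot_bar(df_states, with = "deathIncrease")  # bars show the sum of deathIncrease by state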
Next, we study the distribution of all quantitative variables in the data using the function plot_histogram() in the package DataExplorer.
We study the distributions of positiveIncrease and deathIncrease with respect to states individually using the function plot_boxplot() in the package DataExplorer.
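A minimal sketch of these calls (the exact arguments are assumptions):

plot_histogram(df_states)   # histogram of each quantitative variable
plot_boxplot(df_states %>% select(state, positiveIncrease, deathIncrease), by = "state")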
We can study the association between any quantitative variables with a given response variable in the data using the function plot_scatterplot() in the package DataExplorer. Here, we study the association between death and other quantitative variables in the COVID-19 data. In order to reduce the running time, we only sample 1000 rows from the data.
Note: the variable date is stored as a number here but is not truly a quantitative variable.
plot_scatterplot(df_states %>% filter(state=="OH") %>% select(-c(state)) %>% drop_na(), by = "death", ncol=2)
The above figure only shows the association between death and other quantitative variables in the Ohio COVID-19 data.
We can check the correlation of all quantitative variables in the data using the function plot_correlation() in the package DataExplorer.
If you are new to data exploration and have no idea where to start, the create_report() function in the package DataExplorer can help create a data exploration report for the data.
create_report(df_states %>% filter(!(state %in% c("DC", "AS", "GU", "MP", "PR", "VI"))), output_file = "report.html", output_dir = "I:/Shared drives/R Short Course 2020 Summer/code")
Note: Use help(“create_report”) to find the usage of create_report().
While we can use the built-in functions in the base package in R to obtain plots, the package ggplot2 creates advanced graphs with simple and flexible commands.
We will continue using the secondary COVID-19 data to show the use of different graphical displays. First, we group the data by the variable state and find the summary information for each state.
df_DV <- df_states %>% group_by(state) %>% summarize(
Positive = sum(positiveIncrease),
Negative = sum(negativeIncrease),
Death = sum(deathIncrease),
Hospitalized = sum(hospitalizedIncrease))
DT::datatable(df_DV)
The basic idea of creating plots using ggplot2 is to specify each component of the following and combine them with +.
Note:
We only list some key components here.
See Modify Components of A Theme and Complete Themes for more details about the use of theme.
ggplot() function plays an important role in data visualization as it is very flexible for plotting many different types of graphic displays.
The logic when using ggplot() function is: ggplot(data, mapping) + geom_function().
First, we study the distribution of confirmed cases of COVID-19 in the United States up to 2020-07-23.
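A minimal sketch of the basic histogram (the exact code is an assumption; an improved version follows):

ggplot(data = df_DV, aes(x = Positive)) + geom_histogram()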
The following code chunk shows an improved version of the histogram above.
p1 <- ggplot(data = df_DV, aes(x = Positive)) + geom_histogram(fill="#20B2AA") +
labs(x = "Confirmed Cases", title = "Distribution of Confirmed Cases of COVID-19 in the United States") +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12))
p1
Then, we study the total number of confirmed cases in each state in the COVID-19 data.
options(scipen=10000) # put a high number so that R doesn't switch numbers to scientific notation
p2 <- ggplot(data = df_DV, aes(x = state, y = Positive)) + geom_bar(stat = "identity")
p2
Now we study the distribution of daily increased confirmed cases of each state using boxplots.
p3 <- ggplot(data = df_states, aes(x = state, y = positiveIncrease)) +
geom_boxplot() +
labs(title = "Distribution of Daily Increased Confirmed Cases of Each State", y = "Confirmed Cases") +
theme(axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14))
p3
In addition, the relationship between total number of death and confirmed cases in each state is studied using a scatterplot.
p4 <- ggplot(data = df_DV, aes(x = Positive, y = Death)) + geom_point(col="#20B2AA") +
labs(y = "Death in Each State", x = "Confirmed Cases of COVID-19 in Each State")
p4
We can see that the relationship between these two variables is roughly linear. If we would like to fit a simple linear model using these two variables, lm() can be used. The following code chunk shows how we can fit a simple linear model of total deaths on confirmed cases for each state and add the fitted line to the scatterplot.
\[\widehat{Death} = b_0 + b_1 \times Positive\]
df_DV$pred.death <- predict(lm(Death ~ Positive, data = df_DV)) # add the prediction from the linear model
p5 <- ggplot(data = df_DV, aes(x = Positive, y = Death)) + geom_point(col="#20B2AA")
p5 + geom_line(aes(y = pred.death))
Since the only categorical variable in this data is state, which contains 56 categories, it is not ideal for demonstrating facets. Here, we use the dataset diamonds from the R package tidyverse to show how we can split one plot into several panels by a categorical variable, as sketched below.
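A minimal sketch of faceting by a categorical variable (the specific aesthetics are assumptions):

ggplot(data = diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~ cut)   # one panel per cut quality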
At the end of this section, we talk about how we can use the ggplot() function to draw a map. In order to do this, we need another R package: maps.
The following steps are recommended when we would like to create a United States map with the corresponding information in the data:
Obtain the longitude and latitude of each state using the map_data() function. You can use this function to obtain the longitude and latitude of each state or each county in the United States.
Add the columns from our data (df_DV) to the state map data using the left_join() function (a mutating join). We need to make sure that the names for each state are consistent in both datasets. (For example, our computers don’t know that OH is the same as Ohio or ohio.)
Use ggplot() and geom_polygon() functions to obtain a basic map first.
Improve the map you obtain in the previous step.
library(maps)
# Retrieve the states map data
state_map <- map_data("state")
# We need to match the state column in our data with the "region" column in state_map
df_DV <- df_DV %>% mutate(region = tolower(state.name[match(df_DV$state, state.abb)]))
# merge state_map data with df_DV data
death_map <- left_join(state_map, df_DV, by = "region")
# Create the map
p6 <- ggplot(death_map, aes(x = long, y = lat, group = group)) +
geom_polygon(aes(fill = Death), color = "white")
p6
In the following, we show an improved version of the map from the above map.
state_location <- data.frame(state = state.abb, long = state.center$x, lat = state.center$y)
new_death_map <- left_join(state_location, df_DV, by = "state")
p7 <- ggplot(death_map, aes(x = long, y = lat)) +
geom_polygon(aes(group = group, fill = Death), color = "black")
p7 + geom_text(aes(label = paste0(state, "\n ", Death)), data = new_death_map, color = "black", size = 3, fontface=3, hjust=0.5, vjust=0.5) +
scale_fill_continuous(low ="#ade8f4", high = "#0077b6") +
theme(legend.position="none", axis.text.x = element_blank(),
axis.text.y = element_blank(),
axis.title.x = element_blank(),
axis.title.y = element_blank())
The R package plotly makes it very easy to create interactive graphic displays once we already know how to use ggplot() to create graphs.
The following code chunk shows the interactive plots corresponding to the figures we have created in the previous section.
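A minimal sketch of turning existing ggplot objects into interactive plots (which plots were converted is an assumption; p1 and p4 were created above):

library(plotly)
ggplotly(p1)   # interactive histogram
ggplotly(p4)   # interactive scatterplot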
In this session, we will introduce the basic data type: list, the use of conditional statements, creation of functions, and loops.
When data contains several elements that have different lengths / dimensions or types, a list can be used to store the information. The following code chunk shows how we can create a list using list() function.
states <- sample(state.abb, 5) # sample 5 states from the United States
names <- c('David', 'John', 'Mary')
quiz.1 <- c(89, 93, 85)
quiz.2 <- c(91, 88, 90)
Grade <- data.frame(names, quiz.1, quiz.2, stringsAsFactors = TRUE)
Y <- list(states, Grade, "Hello", 3)
Y
## [[1]]
## [1] "MD" "ID" "KY" "WV" "UT"
##
## [[2]]
## names quiz.1 quiz.2
## 1 David 89 91
## 2 John 93 88
## 3 Mary 85 90
##
## [[3]]
## [1] "Hello"
##
## [[4]]
## [1] 3
We can use names() to assign a name to each component in the list.
Then we can obtain values in the list very easily.
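A sketch of naming and accessing the list components (the component names are assumptions consistent with the output below):

names(Y) <- c("states", "grades", "greeting", "number")
Y$states             # access by name with $
Y[["states"]]        # access by name with [[
Y$grades$quiz.1      # a column of the data frame stored in the list
Y[[2]]$quiz.1        # the same column, accessed by position
Y$grades$quiz.2[3]   # a single value: 90
Y[[2]][3, 3]         # the same value, accessed by indices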
## [1] "MD" "ID" "KY" "WV" "UT"
## [1] "MD" "ID" "KY" "WV" "UT"
## [1] 89 93 85
## [1] 89 93 85
## [1] 90
## [1] 90
Conditional statements can be very helpful when we only want to execute commands under certain conditions.
if statement:
if (condition) {
cmd
}
The following code chunk shows a simple example that examines whether a number is odd.
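A sketch of that code chunk (the value 11 matches the output below; the if statement mirrors the improved version shown later):

x <- 11
if (x %% 2 == 1){
  print(paste(x, "is an odd number."))
}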
## [1] "11 is an odd number."
if … else statement:
if (condition) {
cmd1
} else {
cmd2
}
Here is an improved version of the previous example.
x <- 11
if (x%%2 == 1){
print(paste(x, "is an odd number."))
}else{
print(paste(x, "is not an odd number."))
}
## [1] "11 is an odd number."
When the logical statement is simple and short, it is convenient to use ifelse() function.
ifelse statement:
ifelse(test, true_value, false_value)
In the following code chunk, we sample 10 values from 1, 2, …, 100 with replacement and check if these 10 values are odd or even numbers.
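A sketch of the sampling and the check (since sample() is random, the exact output will vary):

x <- sample(1:100, size = 10, replace = TRUE)
ifelse(x %% 2 == 0, "Even", "Odd")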
## [1] "Even" "Odd" "Odd" "Even" "Even" "Odd" "Odd" "Even" "Even" "Even"
While we can use lots of functions from R packages, there are situations where we may want to define a function ourselves. Using functions also makes our code well organized and readable.
Function Structure:
function_name <- function(arg1, arg2, …) {
function body
}
In the following code chunk, we define a function to check if a value is an even number or odd number.
check_numbers <- function(n){
if (n%%2 == 1){
return(paste(n, "is an odd number."))
}else{
return(paste(n, "is an even number."))
}
}
check_numbers(10)
## [1] "10 is an even number."
Question: Can a function call itself?
Yes!! It is called a recursive function (a function that calls itself).
All recursive algorithms have the following steps:
Base step (when to stop)
Work toward Base step
Recursive call
In the following code chunk, we write a recursive function that computes \(a^n\) if \(n\) is a nonnegative integer and \(a\) is nonzero.
# Example (recursive power function)
pow <- function(a, n){
if (n == 0){
return(1)
} else {
return(a*pow(a,n-1))
}
}
pow(2,10) # find 2^10
## [1] 1024
Here is another example that computes the \(n\)th Fibonacci Number using a recursive function.
# Example (Find the nth Fibonacci Number)
Fib <- function(n){
if (n == 0 || n == 1){
return(n)
} else{
return(Fib(n-1)+Fib(n-2))
}
}
Fib(20) # find the 20th Fibonacci number
## [1] 6765
Loops can be used when repetitive operations are needed. Using loops helps to make the code more readable. However, programming in R is particularly slow when multiple loops are used. Writing pseudo code to plan your algorithm may help to reduce the number of loops needed in the computation.
for() loops are used for general purposes when we know the number of iterations needed in advance.
for loop statement:
for (variable in sequence) {
statements
}
The following code chunk shows how we can find the answer of \(1\times 2 + 2\times 3 + \cdots + 398 \times 399 + 399\times 400\) using a for() loop.
# set up the timer
t0 <- proc.time()
total <- 0
for (i in seq(1,399)){
total <- total+i * (i+1)
}
total
## [1] 21333200
proc.time() - t0 # stop the timer and report the time used
## user system elapsed
## 0.02 0.00 0.02
A while() loop is recommended when the number of iterations in the loop is unknown. For example, if we want to find a numerical solution of \(\sin(x) = x + 1\) with an error less than 0.0001, we can use a while() loop to find the solution.
while loop statement:
while (condition) {
statements
}
# Example (Application of Fix point theorem)
# We want to find a solution for sin(x)-1-x=0
# Define the function f(x) = sin(x)-1
find_solution <- function(x) sin(x)-1
precision <- 0.0001
step <- 0
x0 <- 0.5
error <- 1
# the while loop runs while the error is higher than the precision and the step count is less than 100
while (error > precision && step < 100){
xn <- find_solution(x0)
error <- abs(xn-x0)
x0 <- xn
step <- step + 1
}
if (step == 100){
print('We can not find the root with 100 iterations.')
}else{
print(paste('The root of sinx = x + 1 is about ', x0))
print(paste('It takes ', step, 'iterations.'))
}
## [1] "The root of sinx = x + 1 is about -1.93458022062346"
## [1] "It takes 11 iterations."
In this section, we show two advanced examples of computing fractals from iterated function systems (Fern Fractal).
The first example shows a fractal fern.
a <- c(0, 0.85, 0.2, -0.15)
b <- c(0, 0.04, -0.26, 0.28)
c <- c(0, -0.04, 0.23, 0.26)
d <- c(0.16, 0.85, 0.22, 0.24)
e <- c(0, 0, 0, 0)
f <- c(0, 1.6, 1.6, 0.44)
numits <- 2000 # number of iterations
x <- 0
y <- 0
par(bg="black") # change the color for background
plot(seq(-2, 10, by = 0.1), seq(-2, 10, by = 0.1), type = "n",
main = "fractal fern")
for (n in seq(1,numits)){
k <- sample(1:4, size = 1, replace = TRUE, prob = c(0.01, 0.85, 0.07, 0.07))
newx <- a[k]*x + b[k]*y + e[k]
newy <- c[k]*x + d[k]*y + f[k]
x <- newx
y <- newy
if (n>10){
points(x+3,y, col = "green", cex=0.5, pch=20)
}
}
Here is another example, a fractal tree.
# Another Example (Fractal Tree)
a <- c(0, 0.42, 0.42, 0.1)
b <- c(0, -0.42, 0.42, 0)
c <- c(0, 0.42, -0.42, 0)
d <- c(0.5, 0.42, 0.42, 0.1)
e <- c(0, 0, 0, 0)
f <- c(0, 0.2, 0.2, 0.2)
numits <- 5000
x <- 0
y <- 0
par(bg="black")
plot(seq(-0.3, 0.3, by = 0.1), seq(0, 0.3, by = 0.05), type = "n",
main = "fractal tree")
for (n in seq(1,numits)){
k <- sample(1:4, size = 1, replace = TRUE, prob = c(0.05, 0.4, 0.4, 0.15))
newx <- a[k]*x + b[k]*y + e[k]
newy <- c[k]*x+d[k]*y+f[k]
x <- newx
y <- newy
if (n>10){
points(x,y, col = "green", cex=0.5, pch=20)
}
}
Here are some open data sources:
To see documentation on any function in R, execute ?function_name (for example, ?data.frame), etc.
Google it! (Better way to learn coding!)
Ask questions online, for example:
– R-Forum
– Nabble
R Documentation. 2020. “R: Quotes.” https://stat.ethz.ch/R-manual/R-devel/library/base/html/Quotes.html.
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.