In this lesson, we will introduce the following data structures in R: lists, data frames, and factors and tables.
Note: This lesson is based on the book The Art of R Programming (Matloff 2011).
Similar to vectors and matrices, one common operation with lists is indexing. Technically, a list is a vector. However, it is much more flexible than an atomic vector or a matrix: a list can contain elements of different types, such as strings, numbers, vectors, and even another list. It can also contain a matrix, a function, or a data frame as an element. The following code chunk shows a simple example.
x <- list(Name = "Tessa", Programming_Skills = c("R", "Python", "Matlab"), Num_Students_2022F = 80, office = "SC 329E", Tenured = FALSE, UD_Working_Yr = 5)
x
## $Name
## [1] "Tessa"
##
## $Programming_Skills
## [1] "R" "Python" "Matlab"
##
## $Num_Students_2022F
## [1] 80
##
## $office
## [1] "SC 329E"
##
## $Tenured
## [1] FALSE
##
## $UD_Working_Yr
## [1] 5
The component names are called tags in the R literature. In fact, including a name for each component is optional. We can see an example on the next slide.
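The call that produced the untagged list is not shown here. A minimal sketch, assuming the same six components as x and using the hypothetical name x_unnamed, could look like this:

```r
# Same components as x above, but without tags (names).
# x_unnamed is a hypothetical name used only for this sketch.
x_unnamed <- list("Tessa", c("R", "Python", "Matlab"), 80, "SC 329E", FALSE, 5)

names(x_unnamed)   # NULL, because no component has a name
x_unnamed[[2]]     # components can only be reached by position
x_unnamed          # prints with [[1]], [[2]], ... instead of $Name, ...
```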
## [[1]]
## [1] "Tessa"
##
## [[2]]
## [1] "R" "Python" "Matlab"
##
## [[3]]
## [1] 80
##
## [[4]]
## [1] "SC 329E"
##
## [[5]]
## [1] FALSE
##
## [[6]]
## [1] 5
It is clearer and less error-prone to use names instead of numeric indices.
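The calls behind the next three outputs are not shown, so the following is only a plausible sketch of name-based access, with the equivalent numeric index noted in each comment:

```r
x <- list(Name = "Tessa", Programming_Skills = c("R", "Python", "Matlab"),
          Num_Students_2022F = 80, office = "SC 329E",
          Tenured = FALSE, UD_Working_Yr = 5)

x$office                    # by name; the positional form would be x[[4]]
x[["Programming_Skills"]]   # by name; the positional form would be x[[2]]
x$Programming_Skills[3]     # third element of the named component
```

If a component is later inserted or removed, x[[4]] silently points at something else, while x$office keeps working.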
## [1] "SC 329E"
## [1] "R" "Python" "Matlab"
## [1] "Matlab"
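The output that follows shows a list named new_list whose components are the list x and a 3 x 3 matrix. The original call is not shown. In the sketch below, the matrix values are copied from the printed output and the name m is an assumption:

```r
x <- list(Name = "Tessa", Programming_Skills = c("R", "Python", "Matlab"),
          Num_Students_2022F = 80, office = "SC 329E",
          Tenured = FALSE, UD_Working_Yr = 5)

# Values copied column by column from the printed matrix;
# the original may well have been generated randomly.
m <- matrix(c(72, 56, 58, 33, 96, 47, 41, 15, 37), nrow = 3)

new_list <- list(x, m)   # a list holding a list and a matrix
new_list
```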
## [[1]]
## [[1]]$Name
## [1] "Tessa"
##
## [[1]]$Programming_Skills
## [1] "R" "Python" "Matlab"
##
## [[1]]$Num_Students_2022F
## [1] 80
##
## [[1]]$office
## [1] "SC 329E"
##
## [[1]]$Tenured
## [1] FALSE
##
## [[1]]$UD_Working_Yr
## [1] 5
##
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 72 33 41
## [2,] 56 96 15
## [3,] 58 47 37
Question: How can we extract the value in the 2nd row and 1st column of the second component of the list?
Since lists are vectors, they can be created via vector().
y <- vector(mode = "list") # create an empty list (length 0)
z <- vector("list", 10) # create a list of length 10
y[["Method"]] <- "Decision Tree"
y[["Sensitivity"]] <- 0.85
y[["Specificity"]] <- 0.89
y
## $Method
## [1] "Decision Tree"
##
## $Sensitivity
## [1] 0.85
##
## $Specificity
## [1] 0.89
Another way to create a list.
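The code for this slide is not shown. The two calls below are only possibilities that produce a two-component list of NA values like the one printed next, and the names na_list and na_list2 are hypothetical:

```r
na_list  <- as.list(c(NA, NA))   # convert a length-2 vector into a list
na_list2 <- rep(list(NA), 2)     # or repeat a one-component list twice

identical(na_list, na_list2)     # both approaches build the same object
na_list
```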
## [[1]]
## [1] NA
##
## [[2]]
## [1] NA
There are three ways to access an individual component of a list and return it in the data type of this component. Each of these is useful in different contexts.
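Using the list y from above, the three access styles can be sketched as follows. The calls are inferred from the printed values and may differ from the original slide:

```r
y <- list(Method = "Decision Tree", Sensitivity = 0.85, Specificity = 0.89)

y$Specificity        # 1. dollar sign followed by the component name
y[["Specificity"]]   # 2. double brackets with the name as a string
y[[2]]               # 3. double brackets with the numeric index (Sensitivity)
```

The $ form is convenient interactively, the string form is useful when the name is stored in a variable, and the numeric form is the only option for components without names.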
## [1] 0.89
## [1] 0.89
## [1] 0.85
An alternative to the 2nd and 3rd techniques listed is to use single brackets rather than double brackets. Note that single brackets return a sublist (still a list), not the component itself.
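A short sketch of single-bracket access on the list y. The calls are inferred from the output that follows:

```r
y <- list(Method = "Decision Tree", Sensitivity = 0.85, Specificity = 0.89)

is.list(y[2])                  # single brackets always return a list
identical(y[2][[1]], y[[2]])   # unwrap the sublist to recover the component

y["Specificity"]               # sublist selected by name
y[2]                           # sublist selected by position
```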
## $Specificity
## [1] 0.89
## $Sensitivity
## [1] 0.85
## [[1]]
## [,1] [,2] [,3]
## [1,] 72 33 41
## [2,] 56 96 15
## [3,] 58 47 37
## [,1] [,2] [,3]
## [1,] 72 33 41
## [2,] 56 96 15
## [3,] 58 47 37
## $Programming_Skills
## [1] "R" "Python" "Matlab"
Question: How can we extract the third element of the 2nd component of the list list_tessa?
We should pay attention to the use of indices when subsetting values in a list.
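The difference matters because matrix-style indexing only works on the component itself, not on the sublist wrapped around it. A minimal sketch, assuming a shortened stand-in for new_list with matrix values copied from the earlier output:

```r
m <- matrix(c(72, 56, 58, 33, 96, 47, 41, 15, 37), nrow = 3)
new_list <- list(list(Name = "Tessa"), m)   # shortened stand-in for the list above

new_list[[2]][1, 3]    # row/column indexing works after double brackets

class(new_list[2])     # single brackets keep the list wrapper
class(new_list[[2]])   # double brackets return the matrix itself
```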
## [1] "list"
## [1] "matrix" "array"
Question: Try the following commands and discuss the difference between them.
new_list[[1:2]] and new_list[1:2]
new_list[[1]][1:2] and new_list[[2]][1:2]
new_list[[2]][1:2,] and new_list[[2]][,1:2]
A data frame is like a matrix in that it has a two-dimensional structure. It differs from a matrix because each column can have a different data type. Technically, a data frame is a list, with the components of that list being equal-length vectors.
kids <- c("Vicky", "Patrick", "Jasmine")
ages <- c(14, 10, 8)
# create the data frame using kids and ages
df <- data.frame(kids, ages, stringsAsFactors = FALSE)
df
## kids ages
## 1 Vicky 14
## 2 Patrick 10
## 3 Jasmine 8
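The statement that a data frame is technically a list can be checked directly. This is a small sketch using the same df:

```r
kids <- c("Vicky", "Patrick", "Jasmine")
ages <- c(14, 10, 8)
df <- data.frame(kids, ages, stringsAsFactors = FALSE)

is.list(df)         # a data frame is stored as a list
length(df)          # number of components, i.e. columns
sapply(df, class)   # each column keeps its own type
dim(df)             # rows and columns, as for a matrix
```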
We can include columns directly in the data frame when creating it as well.
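A sketch of what this could look like, with the column vectors written inline instead of being created beforehand:

```r
df <- data.frame(kids = c("Vicky", "Patrick", "Jasmine"),
                 ages = c(14, 10, 8),
                 stringsAsFactors = FALSE)
df
```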
## kids ages
## 1 Vicky 14
## 2 Patrick 10
## 3 Jasmine 8
There are multiple ways to access a data frame.
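The four outputs below are identical, which suggests four different calls returning the first column. The exact calls are not shown, so the following is one plausible set:

```r
df <- data.frame(kids = c("Vicky", "Patrick", "Jasmine"),
                 ages = c(14, 10, 8),
                 stringsAsFactors = FALSE)

df$kids        # by column name with $
df[["kids"]]   # by column name with double brackets (list style)
df[[1]]        # by column position (list style)
df[, 1]        # by column position (matrix style)
```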
## [1] "Vicky" "Patrick" "Jasmine"
## [1] "Vicky" "Patrick" "Jasmine"
## [1] "Vicky" "Patrick" "Jasmine"
## [1] "Vicky" "Patrick" "Jasmine"
Question: Consider the four ways to access the first column of the data frame above. Which way(s) would generally be considered clearer and safer than the others?
We can use the str() function to display the structure of an arbitrary R object.
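For the data frame df built earlier, the call is simply:

```r
df <- data.frame(kids = c("Vicky", "Patrick", "Jasmine"),
                 ages = c(14, 10, 8),
                 stringsAsFactors = FALSE)

str(df)   # one header line, then one line per column with its type
```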
## 'data.frame': 3 obs. of 2 variables:
## $ kids: chr "Vicky" "Patrick" "Jasmine"
## $ ages: num 14 10 8
Many matrix operations apply to data frames as well. We will use R's built-in dataset ToothGrowth. First, we use the head() function to see the first six rows of the data.
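A few matrix-style operations on ToothGrowth, ending with the head() call whose output is shown next. This is a sketch, and only head() is confirmed by the text:

```r
dim(ToothGrowth)        # number of rows and columns, as for a matrix
colnames(ToothGrowth)   # column names
ToothGrowth[1:2, ]      # matrix-style indexing: rows 1 and 2, all columns

head(ToothGrowth)       # first six rows by default
```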
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
Use ToothGrowth data to answer the following questions.
Find a data frame that contains observations from the 3rd row to 7th row in ToothGrowth.
Find a data frame that contains values from 3rd row to 7th row for the 3rd column in ToothGrowth.
Find a data frame that contains only the observations whose tooth length (len) is smaller than the mean tooth length of the guinea pigs in ToothGrowth.
Since we introduced factors in Lesson 2, we will focus on tables here. Before we talk about how to work with tables, let’s see an example that demonstrates how we can rearrange the levels in a factor.
set.seed(2022)
seasons <- factor(sample(c("Spring", "Summer", "Fall"), 1000, replace = TRUE))
levels(seasons)
## [1] "Fall" "Spring" "Summer"
Here, we can see that the levels of seasons are sorted alphabetically rather than as Spring, Summer, Fall, so they do not reflect chronological order.
Continuing from the previous slide, we want to change the order of levels in seasons.
# use levels = to reorder; labels = would rename the levels and mislabel the data
seasons <- factor(seasons, levels = c("Spring", "Summer", "Fall"))
barplot(table(seasons), ylim = c(0, 350))
Now, if we want to add another category, Winter, to seasons, an intuitive way is to combine the previous vector with the new vector. But we quickly find that the data type of the new object has changed to “character”.
## [1] "character"
If we look into this further by checking the frequency distribution of the data, we find that the first three categories were changed to the characters "1", "2", and "3", which are the internal integer codes of the factor levels.
## all_seasons
## 1 2 3 Winter
## 320 338 342 10
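The behavior described above can be reproduced on a small scale. The sketch below uses a short, hand-written factor instead of the random sample. The name winters appears in the text, but its construction here is an assumption:

```r
seasons <- factor(c("Spring", "Summer", "Fall", "Summer"),
                  levels = c("Spring", "Summer", "Fall"))
winters <- rep("Winter", 2)          # assumed: a plain character vector

all_seasons <- c(seasons, winters)   # factor combined with character
class(all_seasons)                   # no longer a factor
all_seasons                          # levels replaced by their integer codes
```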
Question: How can we deal with this issue?
We should convert winters to a factor before combining it with seasons.
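A minimal sketch of the fix on a small example. Combining two factors with c() returns a factor only in R 4.1.0 or later (an assumption about the R version in use), so a version-independent variant is shown first:

```r
seasons <- factor(c("Spring", "Summer", "Fall", "Summer"),
                  levels = c("Spring", "Summer", "Fall"))
winters <- rep("Winter", 2)   # assumed: a plain character vector

# Works in any R version: rebuild the factor from character values
all_seasons_safe <- factor(c(as.character(seasons), winters),
                           levels = c(levels(seasons), "Winter"))
levels(all_seasons_safe)

# Assumes R >= 4.1.0, where c() on factors returns a factor
all_seasons <- c(seasons, factor(winters))
class(all_seasons)
```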
## [1] "factor"
We can add a new level to a factor object if needed.
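One way to do this is to append to levels(). This is a sketch on a small factor, and the original slide may have used a different approach:

```r
seasons <- factor(c("Spring", "Summer", "Fall", "Summer"),
                  levels = c("Spring", "Summer", "Fall"))

levels(seasons) <- c(levels(seasons), "Winter")   # append a new, still unused level
levels(seasons)
table(seasons)                                    # Winter appears with a count of 0
```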
On the previous slides, we used the table() function to find the frequency distribution of a categorical variable. In fact, this function can also be used to create a contingency table of two categorical variables.
set.seed(2022)
Grade <- sample(c(LETTERS[1:4], "F"), 100, replace = TRUE)
Sex <- sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.4, 0.6))
table(Grade, Sex)
## Sex
## Grade Female Male
## A 12 7
## B 10 11
## C 11 9
## D 11 9
## F 16 4
Note: Similar to data frames, most matrix/array operations can be used on tables.
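A few matrix-style operations on a contingency table, sketched with a small hand-written data set (not the random sample above) so the results are easy to verify by eye. The name tab is hypothetical:

```r
Grade <- c("A", "B", "A", "C", "B", "A")
Sex   <- c("Female", "Male", "Male", "Female", "Female", "Female")
tab <- table(Grade, Sex)

dim(tab)             # 3 grades by 2 sexes
tab["A", "Female"]   # index by dimnames, as for a matrix
tab[, "Male"]        # extract one column
apply(tab, 1, sum)   # row totals
addmargins(tab)      # append row and column totals
```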
You can use the following single-character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund 2018):
A: Toggles between showing the current slide and all slides (helpful for printing all pages)
B: Makes fonts larger
C: Shows the table of contents
S: Makes fonts smaller