MTH 209 Data Manipulation and Management

Lesson 4: Basic Data Structures (Lists, Data Frames, Factors and Tables)

Ying-Ju Tessa Chen
ychen4@udayton.edu
University of Dayton

Basic Data Structures

In this lesson, we will introduce the following data structures in R: lists, data frames, factors, and tables.

Note: This lesson is based on the book The Art of R Programming (Matloff 2011).

Lists - 1

Similar to vectors and matrices, one common operation with lists is indexing. Technically, a list is a vector. However, it is much more flexible than ordinary vectors and matrices. A list can contain elements of different types, such as strings, numbers, vectors, and even another list. It can also contain a matrix, a function, or a data frame as an element. The following code chunk shows a simple example.

x <- list(Name = "Tessa", Programming_Skills = c("R", "Python", "Matlab"), Num_Students_2022F = 80, office = "SC 329E", Tenured = FALSE, UD_Working_Yr = 5)

x
## $Name
## [1] "Tessa"
## 
## $Programming_Skills
## [1] "R"      "Python" "Matlab"
## 
## $Num_Students_2022F
## [1] 80
## 
## $office
## [1] "SC 329E"
## 
## $Tenured
## [1] FALSE
## 
## $UD_Working_Yr
## [1] 5

The component names are called tags in the R literature. In fact, including a name for each component is optional. We can see an example on the next slide.

Lists - 2

list("Tessa", c("R", "Python", "Matlab"), 80, "SC 329E", FALSE, 5) -> list_tessa
list_tessa
## [[1]]
## [1] "Tessa"
## 
## [[2]]
## [1] "R"      "Python" "Matlab"
## 
## [[3]]
## [1] 80
## 
## [[4]]
## [1] "SC 329E"
## 
## [[5]]
## [1] FALSE
## 
## [[6]]
## [1] 5
names(list_tessa) <- c("Name", "Programming_Skills", "Num_Students_2022F", "office", "Tenured", "UD_Working_Yr")

Lists - 3

It is clearer and less error-prone to use names instead of numeric indices.

list_tessa$office
## [1] "SC 329E"
list_tessa[[2]]
## [1] "R"      "Python" "Matlab"
list_tessa[[2]][3]
## [1] "Matlab"
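
To see why names are less error-prone, consider what happens when a component is added at the front of a list: numeric positions shift, while names keep pointing at the same component. The following is a minimal sketch; the list info and its contents are hypothetical and used only for illustration.

```r
# 'info' is a hypothetical list used only for illustration
info <- list(Name = "Tessa", office = "SC 329E")
info[[2]]      # the office, found by position
info$office    # the office, found by name

# Add a new component at the front of the list
info <- c(list(ID = 1), info)
info[[2]]      # position 2 now holds the name, not the office
info$office    # the name still returns the office
```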

Lists - 4

A <- matrix(sample(1:100, 9), ncol=3)
new_list <- list(list_tessa, A)
new_list
## [[1]]
## [[1]]$Name
## [1] "Tessa"
## 
## [[1]]$Programming_Skills
## [1] "R"      "Python" "Matlab"
## 
## [[1]]$Num_Students_2022F
## [1] 80
## 
## [[1]]$office
## [1] "SC 329E"
## 
## [[1]]$Tenured
## [1] FALSE
## 
## [[1]]$UD_Working_Yr
## [1] 5
## 
## 
## [[2]]
##      [,1] [,2] [,3]
## [1,]   72   33   41
## [2,]   56   96   15
## [3,]   58   47   37

Question: How to extract the value in the 2nd row and 1st column in the second component in the list?
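
As a hint, indexing operations can be chained: double brackets first pull out a component, and ordinary matrix indexing is then applied to the result. The sketch below uses a small hypothetical list, lst, with a fixed matrix m rather than the random matrix A above.

```r
# 'm' and 'lst' are hypothetical objects used only for illustration
m <- matrix(1:6, nrow = 2)    # 2 x 3 matrix, filled column by column
lst <- list("a string", m)

lst[[2]]          # the matrix itself
lst[[2]][2, 1]    # row 2, column 1 of that matrix
```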

Lists - 5

Since lists are vectors, they can be created via vector().

y <- vector(mode="list") # create an empty list (length 0)
z <- vector("list", 10) # create a list of length 10 (each component is NULL)
y[["Method"]] <- "Decision Tree" 
y[["Sensitivity"]] <- 0.85
y[["Specificity"]] <- 0.89
y
## $Method
## [1] "Decision Tree"
## 
## $Sensitivity
## [1] 0.85
## 
## $Specificity
## [1] 0.89

Another way to create a list.

w <- rep(list(NA), 2)
w
## [[1]]
## [1] NA
## 
## [[2]]
## [1] NA
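
Creating a list of a given length in advance is handy when the number of results is known before a loop starts. A minimal sketch, in which the name results and the squared values are made up for illustration:

```r
# Preallocate a list with 3 empty (NULL) slots, then fill it in a loop
results <- vector("list", 3)
for (i in seq_along(results)) {
  results[[i]] <- i^2    # store the i-th result in the i-th slot
}
length(results)
results[[3]]
```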

Lists - 6

There are three ways to access an individual component of a list and return it in the data type of this component. Each of these is useful in different contexts.

y$Specificity
## [1] 0.89
y[["Specificity"]]
## [1] 0.89
y[[2]]
## [1] 0.85

Lists - 7

An alternative to the 2nd and 3rd techniques listed is to use single brackets rather than double brackets. Note that single brackets return a sub-list rather than the component itself.

y["Specificity"]
## $Specificity
## [1] 0.89
y[2] # the 2 is the index of "Sensitivity" within the list y
## $Sensitivity
## [1] 0.85
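
The difference between the two kinds of brackets can be checked with class(). The sketch below rebuilds the list y from the earlier slide so that it runs on its own.

```r
y <- list(Method = "Decision Tree", Sensitivity = 0.85, Specificity = 0.89)

class(y[["Specificity"]])   # "numeric": the component itself
class(y["Specificity"])     # "list": a sub-list of length 1

# Single brackets can return several components at once
y[c("Sensitivity", "Specificity")]

# Arithmetic needs the components themselves, so use double brackets
y[["Sensitivity"]] + y[["Specificity"]]
```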

Lists - 8

new_list[2]
## [[1]]
##      [,1] [,2] [,3]
## [1,]   72   33   41
## [2,]   56   96   15
## [3,]   58   47   37
new_list[[2]]
##      [,1] [,2] [,3]
## [1,]   72   33   41
## [2,]   56   96   15
## [3,]   58   47   37
list_tessa[2]
## $Programming_Skills
## [1] "R"      "Python" "Matlab"

Question: How to extract the third element in the 2nd component of the list list_tessa?

Lists - 9

We should pay attention to the use of indices when subsetting values in a list.

class(new_list[[1]])
## [1] "list"
class(new_list[[2]])
## [1] "matrix" "array"

Question: Try the following commands and discuss the difference between them.

  1. new_list[[1:2]] and new_list[1:2]

  2. new_list[[1]][1:2] and new_list[[2]][1:2]

  3. new_list[[2]][1:2,] and new_list[[2]][,1:2]

Data Frames - 1

A data frame is like a matrix in that it has a two-dimensional structure. It differs from a matrix because each column can have a different data type. Technically, a data frame is a list whose components are equal-length vectors.

kids <- c("Vicky", "Patrick", "Jasmine")
ages <- c(14, 10, 8)
# create the data frame using kids and ages
df <- data.frame(kids, ages, stringsAsFactors = FALSE)
df
##      kids ages
## 1   Vicky   14
## 2 Patrick   10
## 3 Jasmine    8

We can include columns directly in the data frame when creating it as well.

data.frame(kids = c("Vicky", "Patrick", "Jasmine"), ages = c(14, 10, 8), stringsAsFactors = FALSE)
##      kids ages
## 1   Vicky   14
## 2 Patrick   10
## 3 Jasmine    8
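
The claim that a data frame is a list of equal-length vectors can be checked directly. A short sketch, rebuilding df so that it runs on its own:

```r
df <- data.frame(kids = c("Vicky", "Patrick", "Jasmine"),
                 ages = c(14, 10, 8), stringsAsFactors = FALSE)

is.list(df)          # TRUE: a data frame is a list
length(df)           # 2: one component per column
names(df)            # the component names are the column names
sapply(df, class)    # each column keeps its own type
sapply(df, length)   # every column has the same length
```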

Data Frames - 2

There are multiple ways to access a column of a data frame.

df[[1]] # list-like
## [1] "Vicky"   "Patrick" "Jasmine"
df$kids # component name 
## [1] "Vicky"   "Patrick" "Jasmine"
df[,1]  # matrix-like
## [1] "Vicky"   "Patrick" "Jasmine"
df[, "kids"] # matrix-like with the column name
## [1] "Vicky"   "Patrick" "Jasmine"
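
All four forms above return a plain vector. When a one-column data frame is wanted instead, list-style single brackets or the drop = FALSE argument can be used. A brief sketch, rebuilding df so that it runs on its own:

```r
df <- data.frame(kids = c("Vicky", "Patrick", "Jasmine"),
                 ages = c(14, 10, 8), stringsAsFactors = FALSE)

class(df[, 1])                 # "character": a plain vector
class(df[, 1, drop = FALSE])   # "data.frame": matrix-like, dimensions kept
class(df[1])                   # "data.frame": list-like single brackets
class(df["kids"])              # "data.frame": same, by column name
```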

Question: Consider the four ways to access the first column of the data frame above. Which way(s) would generally be considered clearer and safer than the others?

Data Frames - 3

We can use the str() function to display the structure of an arbitrary R object.

str(df)
## 'data.frame':    3 obs. of  2 variables:
##  $ kids: chr  "Vicky" "Patrick" "Jasmine"
##  $ ages: num  14 10 8

Many matrix operations apply to data frames as well. We will use an R built-in dataset, ToothGrowth. First, we use the head() function to see the first six rows of the data.

head(ToothGrowth) 
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
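
As a quick illustration of matrix-style operations on a data frame, the sketch below applies dim(), matrix-style indexing, and colMeans() to ToothGrowth. The choice of these particular functions is only an example; many others work in the same way.

```r
dim(ToothGrowth)       # number of rows and columns
nrow(ToothGrowth)      # number of rows only
ToothGrowth[2, 1]      # row 2, column 1 (the second value of len)

# Column means of the two numeric columns
colMeans(ToothGrowth[, c("len", "dose")])
```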

Data Frames - 4

Use ToothGrowth data to answer the following questions.

  1. Find a data frame that contains observations from the 3rd row to 7th row in ToothGrowth.

  2. Find a data frame that contains values from 3rd row to 7th row for the 3rd column in ToothGrowth.

  3. Find a data frame that contains only the observations whose tooth length (len) is smaller than the mean tooth length of the guinea pigs in ToothGrowth.
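
The kinds of subsetting needed here can be practiced first on the small kids data frame from the earlier slides. The sketch below is only an illustration on that toy data, not a solution for ToothGrowth; the name older is made up.

```r
df <- data.frame(kids = c("Vicky", "Patrick", "Jasmine"),
                 ages = c(14, 10, 8), stringsAsFactors = FALSE)

df[2:3, ]                       # rows 2 to 3, all columns
df[2:3, "ages", drop = FALSE]   # rows 2 to 3 of one column, kept as a data frame

# Rows that satisfy a logical condition on a column
older <- df[df$ages > mean(df$ages), ]
older
```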

Factors & Tables - 1

Since we introduced factors in Lesson 2, we will focus on tables here. Before we talk about how to work with tables, let’s see an example that demonstrates how we can rearrange the levels of a factor.

set.seed(2022)
seasons <- factor(sample(c("Spring", "Summer", "Fall"), 1000, replace = TRUE))
levels(seasons)
## [1] "Fall"   "Spring" "Summer"

Here, we can see that the levels of seasons are in alphabetical order rather than Spring, Summer, Fall, so they do not reflect the time order.

barplot(table(seasons), ylim=c(0, 350))
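
The default alphabetical ordering, and one way to override it, can be seen on a small vector. In the sketch below, x is a hypothetical vector used only for illustration. The levels argument of factor() sets the order of the levels without changing the data, whereas the labels argument renames the existing levels in their current order and therefore changes the values.

```r
# 'x' is a hypothetical vector used only for illustration
x <- c("Summer", "Fall", "Spring", "Fall")

levels(factor(x))    # alphabetical by default

# levels = sets the level order; the values stay the same
f <- factor(x, levels = c("Spring", "Summer", "Fall"))
levels(f)
as.character(f)      # identical to x

# labels = renames the levels in their current (alphabetical) order,
# so "Fall" becomes "Spring", "Spring" becomes "Summer", and so on
g <- factor(factor(x), labels = c("Spring", "Summer", "Fall"))
as.character(g)      # no longer matches x
```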

Factors & Tables - 2

Continuing from the previous slide, we want to change the order of levels in seasons.

seasons <- factor(seasons, levels = c("Spring", "Summer", "Fall"))
barplot(table(seasons), ylim=c(0, 350))

Now, if we want to add another category, Winter, to seasons, an intuitive way is to combine the previous vector with the new vector. However, we quickly find that the data type of the new object has changed to “character”.

winters <- rep("Winter", 10)
all_seasons <- c(seasons, winters)
class(all_seasons)
## [1] "character"

Factors & Tables - 3

If we look into this further by checking the frequency distribution of the data, we find that the first three categories were changed to the characters 1, 2, and 3, which are the integer codes of the levels Spring, Summer, and Fall.

table(all_seasons)
## all_seasons
##      1      2      3 Winter 
##    320    338    342     10

Question: How to deal with this issue?

We should change the data type of winters to factor before combining. In R 4.1.0 and later, c() combines factors into a single factor.

winters <- as.factor(winters)
all_seasons <- c(seasons, winters)
class(all_seasons)
## [1] "factor"

We can add a new level to a factor object if needed.

levels(seasons) <- c(levels(seasons), "Winter")
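
Once a level has been added, it appears in table() with a count of zero, and values with that level can then be assigned without a warning. A minimal sketch on a small hypothetical factor s:

```r
# 's' is a hypothetical factor used only for illustration
s <- factor(c("Spring", "Fall", "Spring"),
            levels = c("Spring", "Summer", "Fall"))

levels(s) <- c(levels(s), "Winter")   # add the new level
table(s)                              # Winter is listed with a count of 0

s[length(s) + 1] <- "Winter"          # allowed, because "Winter" is now a level
table(s)
```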

Factors & Tables - 4

On the previous slides, we used the table() function to find the frequency distribution of a categorical variable. In fact, this function can also be used to create a contingency table of two categorical variables.

set.seed(2022)
Grade <- sample(c(LETTERS[1:4], "F"), 100, replace = TRUE)
Sex <- sample(c("Male", "Female"), 100, replace = TRUE, prob = c(0.4, 0.6))
table(Grade, Sex)
##      Sex
## Grade Female Male
##     A     12    7
##     B     10   11
##     C     11    9
##     D     11    9
##     F     16    4

Note: Similar to data frames, most matrix/array operations can be used on tables.
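
A few such operations are sketched below on a small contingency table. The vectors grade and sex are hypothetical and kept short so that the counts are easy to verify by hand.

```r
# Small hypothetical vectors used only for illustration
grade <- c("A", "B", "A", "C", "B", "A")
sex   <- c("Female", "Male", "Male", "Female", "Female", "Female")
tab <- table(grade, sex)

dim(tab)               # 3 grades by 2 sexes
tab["A", "Female"]     # matrix-style indexing by dimension names
apply(tab, 1, sum)     # row totals, computed as for a matrix
addmargins(tab)        # the table with row and column totals appended
```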

Matloff, Norman. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.