class: center, middle, inverse, title-slide .title[ # MTH 209 Data Manipulation and Management ] .subtitle[ ## Lesson 4: Basic Data Structures
(Lists, Data Frames, Factors and Tables) ] .author[ ###
Ying-Ju Tessa Chen, PhD
Associate Professor
Department of Mathematics
University of Dayton
@ying-ju
ying-ju
ychen4@udayton.edu
] --- # Learning Objectives In this lesson, we will introduce the following data structures in R. - Lists - Data frames - Factors and Tables **Note:** This lesson is based on the book: The Art of R Programming. .left[.footnote[Matloff, Norman. The art of R programming: A tour of statistical software design. No Starch Press, 2011]] --- ## Lists - 1 .pull-left[.footnotesize[Similar to vectors and matrices, one common operation with lists is indexing. Technically, a list is a vector. However, it is much flexible compared to the use of vectors and matrices. A list could contains elements of different types such as stings, numbers, vectors, and another list inside it. It can contain a matrix, a function, and a data frame, etc. as its elements. The following code chunk shows a simple example. ] ```r x <- list(Name = "Tessa", Programming_Skills = c("R", "Python", "Matlab"), Num_Students_2023F = 84, office = "SC 329E", Tenured = TRUE, UD_Working_Yr = 7) ``` ] .pull-right[ ```r x ``` ``` ## $Name ## [1] "Tessa" ## ## $Programming_Skills ## [1] "R" "Python" "Matlab" ## ## $Num_Students_2023F ## [1] 84 ## ## $office ## [1] "SC 329E" ## ## $Tenured ## [1] TRUE ## ## $UD_Working_Yr ## [1] 7 ``` ] --- ## Lists - 2 .pull-left[.small[Assign a tag for each component in a list <br>The component names are called *tags* in the R literature. In fact, including a name for each component is optional. We can see an example here.] ```r list("Tessa", c("R", "Python", "Matlab"), 84, "SC 329E", TRUE, 7) -> list_tessa ``` ] .pull-right[ ```r names(list_tessa) <- c("Name", "Programming_Skills", "Num_Students_2023F", "office", "Tenured", "UD_Working_Yr") ``` ] --- ## Lists - 3 The advantages of using tags in a list <br>It is clearer and less error-prone to use names instead of numeric indices. ```r list_tessa$office ``` ``` ## [1] "SC 329E" ``` ```r list_tessa[[2]] ``` ``` ## [1] "R" "Python" "Matlab" ``` ```r list_tessa[[2]][3] ``` ``` ## [1] "Matlab" ``` --- ## Lists - 4 .pull-left[ ```r A <- matrix(sample(1:100, 9), ncol=3) new_list <- list(list_tessa, A) new_list[[1]] ``` ``` ## $Name ## [1] "Tessa" ## ## $Programming_Skills ## [1] "R" "Python" "Matlab" ## ## $Num_Students_2023F ## [1] 84 ## ## $office ## [1] "SC 329E" ## ## $Tenured ## [1] TRUE ## ## $UD_Working_Yr ## [1] 7 ``` ] .pull-right[ ```r new_list[[2]] ``` ``` ## [,1] [,2] [,3] ## [1,] 22 40 30 ## [2,] 32 48 42 ## [3,] 6 26 16 ``` .small[**Question:** How to extract the value in the 2nd row and 1st column in the second component in the list?]] --- ## Lists - 5 .pull-left[Since lists are vectors, they can be created via <span Style="color:blue">vector()</span>. ```r y <- vector(mode="list") # create a vector with size 0 z <- vector("list", 10) # create a vector with size 10 y[["Method"]] <- "Decision Tree" y[["Sensitivity"]] <- 0.85 y[["Specificity"]] <- 0.89 y ``` ``` ## $Method ## [1] "Decision Tree" ## ## $Sensitivity ## [1] 0.85 ## ## $Specificity ## [1] 0.89 ``` ] .pull-right[ Another way to create a list. ```r w <- rep(list(NA), 2) w ``` ``` ## [[1]] ## [1] NA ## ## [[2]] ## [1] NA ``` ] --- ## Lists - 6 List Indexing <br>There are three ways to access an individual component of a list and return it in the data type of this component. Each of these is useful in different contexts. ```r y$Specificity ``` ``` ## [1] 0.89 ``` ```r y[["Specificity"]] ``` ``` ## [1] 0.89 ``` ```r y[[2]] ``` ``` ## [1] 0.85 ``` --- ## Lists - 7 An alternative to the 2nd and 3rd techniques listed is to use single brackets rather than double brackets. ```r y["Specificity"] ``` ``` ## $Specificity ## [1] 0.89 ``` ```r y[2] # the 2 is the index of "Sensitivity" within the list y ``` ``` ## $Sensitivity ## [1] 0.85 ``` --- ## Lists - 8 .pull-left[ ```r new_list[2] ``` ``` ## [[1]] ## [,1] [,2] [,3] ## [1,] 22 40 30 ## [2,] 32 48 42 ## [3,] 6 26 16 ``` ] .pull-right[ ```r new_list[[2]] ``` ``` ## [,1] [,2] [,3] ## [1,] 22 40 30 ## [2,] 32 48 42 ## [3,] 6 26 16 ``` ```r list_tessa[2] ``` ``` ## $Programming_Skills ## [1] "R" "Python" "Matlab" ``` ] **Question:** How to extract the third element in the 2nd component of the list *list_tessa*? --- ## Lists - 9 We should pay attention to the use of indices when subsetting values in a list. ```r class(new_list[[1]]) ``` ``` ## [1] "list" ``` ```r class(new_list[[2]]) ``` ``` ## [1] "matrix" "array" ``` **Question:** Try the following commands and discuss the difference between them. 1. *new_list[[1:2]]* and *new_list[1:2]* 2. *new_list[[1]][1:2]* and *new_list[[2]][1:2]* 3. *new_list[[2]][1:2,]* and *new_list[[2]][,1:2]* --- ## Data Frames - 1 A data frame is like a matrix. It has a two-dimensional data structure. But it differs from a matrix since the data type of each column could be different. Technically, a data frame is a list, with the components that list being equal-length vectors. .pull-left[Creating Data Frames ```r kids <- c("Vicky", "Patrick", "Jasmine") ages <- c(14, 10, 8) # create the data frame using kids and ages df <- data.frame(kids, ages, stringsAsFactors = F) df ``` ``` ## kids ages ## 1 Vicky 14 ## 2 Patrick 10 ## 3 Jasmine 8 ``` ] .pull-right[We can include columns directly in the data frame when creating it as well. ```r data.frame(kids = c("Vicky", "Patrick", "Jasmine"), ages = c(14, 10, 8), stringsAsFactors = F) ``` ``` ## kids ages ## 1 Vicky 14 ## 2 Patrick 10 ## 3 Jasmine 8 ``` ] --- ## Data Frames - 2 Accessing Data Frames <br>There are multiple ways to access a data frame. .pull-left[ ```r df[[1]] # list-like ``` ``` ## [1] "Vicky" "Patrick" "Jasmine" ``` ```r df$kids # component name ``` ``` ## [1] "Vicky" "Patrick" "Jasmine" ``` ] .pull-right[ ```r df[,1] # matrix-like ``` ``` ## [1] "Vicky" "Patrick" "Jasmine" ``` ```r df[, "kids"] # matri-like with the column name ``` ``` ## [1] "Vicky" "Patrick" "Jasmine" ``` ] **Question:** consider four ways to access the first column of the data frame above, which way(s) would generally considered to be clear and safer than others? --- ## Data Frames - 3 .pull-left[ Checking Data Structures <br>We can use <span Style="color:blue">str()</span> function to display the structure of an arbitrary R object. ```r str(df) ``` ``` ## 'data.frame': 3 obs. of 2 variables: ## $ kids: chr "Vicky" "Patrick" "Jasmine" ## $ ages: num 14 10 8 ``` ] .pull-right[ Other Matrix-Like Operations <br>Many matrix operations apply to data frames as well. We will use a R built-in dataset <span Style="color:blue">ToothGrowth</span>. First, we use the <span Style="color:blue">head()</span> function to see the first six rows of the data. ```r head(ToothGrowth) ``` ``` ## len supp dose ## 1 4.2 VC 0.5 ## 2 11.5 VC 0.5 ## 3 7.3 VC 0.5 ## 4 5.8 VC 0.5 ## 5 6.4 VC 0.5 ## 6 10.0 VC 0.5 ``` ] --- ## Data Frames - 4 Use <span Style="color:blue">ToothGrowth</span> data to answer the following questions. 1. Find a data frame that contains observations from the 3rd row to 7th row in <span Style="color:blue">ToothGrowth</span>. 2. Find a data frame that contains values from 3rd row to 7th row for the 3rd column in <span Style="color:blue">ToothGrowth</span>. 3. Find a data frame that only contains observations such that their tooth length (*len*) are smaller than the mean tooth length of guinea pigs in <span Style="color:blue">ToothGrowth</span>. --- ## Factors & Tables - 1 .pull-left[.small[Since we have introduced factors in Lesson 2, we will focus the concepts about tables. Before we talk about how we could work with tables, let's see an example that demonstrates how we can rearrange the levels in a factor.] ```r set.seed(2022) seasons <- factor( sample(c("Spring", "Summer", "Fall"), 1000, replace = T) ) levels(seasons) ``` ``` ## [1] "Fall" "Spring" "Summer" ``` ] .pull-right[.small[Here, we can see that the order of levels in *seasons* is not *Spring*, *Summer*, *Fall*, which didn't reflect the time order. ] ```r barplot(table(seasons), ylim=c(0, 350)) ``` <img src="data:image/png;base64,#Lesson04_Data_Structures_2_files/figure-html/bar-1.png" style="display: block; margin: auto;" /> ] --- ## Factors & Tables - 2 .pull-left[.small[Continue from the previous slide, we want to change the order of levels in *seasons*.] ```r seasons <- factor(seasons, labels = c("Spring", "Summer", "Fall")) barplot(table(seasons), ylim=c(0, 350)) ``` <img src="data:image/png;base64,#Lesson04_Data_Structures_2_files/figure-html/change_order-1.png" style="display: block; margin: auto;" /> ] .pull-right[.small[Now, if we want to add another category *Winter* in *seasons*, an intuitive way is to combine the previous vector with the new vector. But quickly we find that the data type of the new object is changed to "character". ] ```r winters <- rep("Winter", 10) all_seasons <- c(seasons, winters) class(all_seasons) ``` ``` ## [1] "character" ``` ] --- ## Factors & Tables - 3 .pull-left[.small[If we look into this further by checking the frequency distribution of the data, we find that the first three categories were changed to the characters of 1, 2, and 3.] ```r table(all_seasons) ``` ``` ## all_seasons ## 1 2 3 Winter ## 342 320 338 10 ``` **Question:** How to deal with this issue? ] .pull-right[.small[We should change the data type of *winters* to factor.] ```r winters <- as.factor(winters) all_seasons <- c(seasons, winters) class(all_seasons) ``` ``` ## [1] "factor" ``` .small[We can add a new level to a factor object if needed.] ```r levels(seasons) <- c(levels(seasons), "Winter") ``` ] --- ## Factors & Tables - 4 .small[On the previous slides, we used the <span Style="color:blue">table()</span> function to find frequency distribution of a categorical variable. In fact, this function can be used to create a contingency table of two categorical variables. ] ```r set.seed(2022) Grade <- sample(c(LETTERS[1:4], "F"), 100, replace=T) Sex <- sample(c("Male", "Female"), 100, replace=T, prob=c(0.4, 0.6)) table(Grade, Sex) ``` ``` ## Sex ## Grade Female Male ## A 12 7 ## B 10 11 ## C 11 9 ## D 11 9 ## F 16 4 ``` .small[**Note:** Similar to data frames, most matrix/array operations can be used on tables.] --- # Summary of Main Points By now, you should know - Basic Data Structures: Lists, Data Frames, Factors and Tables - Recycle elements from a data object - How to extract elements from a data object --- # Supplementary Materials Here are some useful supplementary materials for self-learning. .pull-left[ .center[[<img src="https://d33wubrfki0l68.cloudfront.net/565916198b0be51bf88b36f94b80c7ea67cafe7c/7f70b/cover.png" height="250px">](https://adv-r.hadley.nz)] .small[ * [Names and values](https://adv-r.hadley.nz/names-values.html) * [Matrices and arrays](http://adv-r.had.co.nz/Data-structures.html#matrices-and-arrays) * [Data frames](http://adv-r.had.co.nz/Data-structures.html#data-frames) ] ] .pull-right[ .center[[<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" height="250px">](https://r4ds.hadley.nz)] .small[ * [Hierarchical data](https://r4ds.hadley.nz/rectangling.html) ] ]