class: center, middle, inverse, title-slide .title[ # MTH 209 Data Manipulation and Management ] .subtitle[ ## Lesson 8: Data Manipulation with R Package tidyverse ] .author[ ###
Ying-Ju Tessa Chen, PhD
Associate Professor
Department of Mathematics
University of Dayton
@ying-ju
ying-ju
ychen4@udayton.edu
] --- ## Learning Objectives In this session, we will talk about data manipulation using R package .red[tidyverse]. This lecture is based on [R for Data Science](https://r4ds.had.co.nz/). .red[tidyverse] contains a collection of R packages that help us doing data management & exploration. The key packages in tidyverse are: - dplyr: data manipulation - ggplot2: data visualization - purr: functional programming toolkit - readr: read data and write files - tibble: simple data frame - tidyr: data management --- ## Basic Functions in dplyr In this session, we will focus on the following key functions in .red[dplyr] using the dataset **flights** from the R package .red[nycflights13]. - filter(): pick observations by their values - arrange(): reorder the rows - select(): select variables by their names - mutate(): create new variables with functions of existing variables - group_by(): group data by existing variables - summarize(): collapse many values done to a single summary (with group_by) All functions above work similarly. 1. The first argument is a data frame. 2. The subsequent arguments describe what to do with the data frame using the variable names. 3. The result is a new data frame (but we can save it back to the original data frame if needed). --- ## Load packages and read the Flights Data First, we load the necessary packages, check conflict functions, and import the dataset **flights** from the R package .red[nycflights13]. ```r library(tidyverse) library(conflicted) conflict_prefer("select", "dplyr") conflict_prefer("filter", "dplyr") df <- nycflights13::flights ``` --- ## Understand Our Data Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables ([Flights Data](https://rdrr.io/cran/nycflights13/man/flights.html)). .small[ - year, month, day: Date of departure. - dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone. - sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local time zone. - dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals. - carrier: Two letter carrier abbreviation. See airlines to get name. - flight: Flight number. - tailnum: Plane tail number. See planes for additional metadata. - origin, dest: Origin and destination. See airports for additional metadata. - air_time: Amount of time spent in the air, in minutes. - distance: Distance between airports, in miles. - hour, minute: Time of scheduled departure broken into hour and minutes. - time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data. ] --- ## Get a Glimpse of the Data We get a glimpse of the data. ```r glimpse(df) ``` ``` ## Rows: 336,776 ## Columns: 19 ## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2… ## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1… ## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, … ## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, … ## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1… ## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,… ## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,… ## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1… ## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "… ## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4… ## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394… ## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",… ## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",… ## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1… ## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, … ## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6… ## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0… ## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0… ``` --- ## filter() - 1 .blue[filter()] is used when we want to subset observations based on a logical condition. For example, we can select all fights on December 25th using the following code. ```r Christmas <- filter(df, month == 12, day == 25) ``` .orange[Comparisons] - R provides the standard suite: <, <=, >, >=, != (not equal), and == (equal). ```r Jan1 <- filter(df, month == 1, day == 1) head(Jan1[,1:8]) ``` ``` ## # A tibble: 6 × 8 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ``` --- ## filter() - 2 .orange[Logical Operations] - R provides the following syntax: & is "and", | is "or", ! is "not". The following code finds all flights that departed in July or August. ```r flights0708 <- filter(df, month == 7 | month == 8) # only shows the first 5 rows and 10 columns in the data head(filter(df, month %in% c(7, 8)), c(5, 10)) ``` ``` ## # A tibble: 5 × 10 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 7 1 1 2029 212 236 2359 ## 2 2013 7 1 2 2359 3 344 344 ## 3 2013 7 1 29 2245 104 151 1 ## 4 2013 7 1 43 2130 193 322 14 ## 5 2013 7 1 44 2150 174 300 100 ## # ℹ 2 more variables: arr_delay <dbl>, carrier <chr> ``` **Note:** 1. If we use .blue[filter(df, month == 7 | 8)], it finds all months are equal 7 | 8, an expression that evaluates to **TRUE**. In a numeric context, TRUE becomes one, so this finds all fights in the data. 2. .blue[filter()] only includes rows where the condition is **TRUE** and it excludes both FALSE and NA values. --- ## filter() - 3 If we want to find flights that weren't delayed on both arrival and departure by more than 1 hour, we could use either of the following codes. ```r filter(df, !(arr_delay > 60 | dep_delay > 60)) ``` ``` ## # A tibble: 295,893 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 295,883 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` ```r filter(df, arr_delay <= 60, dep_delay <= 60) ``` --- ## arrange() - 1 .blue[arrange()] is used when we want to sort a dataset by a variable. If more variables are specified for sorting a dataset, the variables entered first taking priority over those come later. The following code chunk gives an example that sorts the flights by dates. ```r arrange(df, year, month, day) ``` ``` ## # A tibble: 336,776 × 19 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 1 517 515 2 830 819 ## 2 2013 1 1 533 529 4 850 830 ## 3 2013 1 1 542 540 2 923 850 ## 4 2013 1 1 544 545 -1 1004 1022 ## 5 2013 1 1 554 600 -6 812 837 ## 6 2013 1 1 554 558 -4 740 728 ## 7 2013 1 1 555 600 -5 913 854 ## 8 2013 1 1 557 600 -3 709 723 ## 9 2013 1 1 557 600 -3 838 846 ## 10 2013 1 1 558 600 -2 753 745 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## arrange() - 2 **Note:** 1. We can save the data frame back to the original data frame after sorting the data. 2. Use .blue[desc()] for sorting data via descending order. The following code chunk arranges the Flights Data by arrival delay in descending order. 3. Missing values are always sorted at the end. ```r df <- arrange(df, desc(arr_delay)) ``` We can use the .blue[tail()] function to read the last part of the data. Below, we display the last 4 rows of 8 columns in the data. ```r tail(df[, 1:8], n = 4) ``` ``` ## # A tibble: 4 × 8 ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 9 30 NA 2200 NA NA 2312 ## 2 2013 9 30 NA 1210 NA NA 1330 ## 3 2013 9 30 NA 1159 NA NA 1344 ## 4 2013 9 30 NA 840 NA NA 1020 ``` --- ## select() .blue[select()] is used when we would like to obtain several variables in the data. For example, we can use the following code chunk to select the Flights Data with only a few variables. ```r # select specific columns select(df, year, month, day) # select all columns between year and day select(df, year:day) # select all columns except those from year and day select(df, -(year:day)) ``` **Note:** 1. We can use a minus sign - to drop variables. 2. There are several helper functions we can use within .blue[select()]. See .blue[?select] for the information. 3. .blue[select()] can be used with the .blue[everything()] function when we have a handful of variables we would like to move to the start of the data frame. ```r # move carrier, origin, dest, and distance to the start of the data select(df, carrier, origin, dest, distance, everything()) ``` --- ## mutate() - 1 .blue[mutate()] is used when we would like to add a new variable / column using the other variables in the data. **Note:** .blue[mutate()] always adds new columns at the end of the data. First, we start by creating a smaller dataset with a few variables and create two variables using varaibles in the dataset. ```r # we start by creating a smaller dataset. df1 <- select(df, year:day, ends_with("delay"), distance, air_time) df1 <- mutate(df1, gain= arr_delay - dep_delay, speed = distance / air_time * 60, hours = air_time / 60, gain_per_hour = gain / hours) head(df1, c(6, 11)) ``` ``` ## # A tibble: 6 × 11 ## year month day dep_delay arr_delay distance air_time gain speed hours ## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2013 1 9 1301 1272 4983 640 -29 467. 10.7 ## 2 2013 6 15 1137 1127 483 74 -10 392. 1.23 ## 3 2013 1 10 1126 1109 719 111 -17 389. 1.85 ## 4 2013 9 20 1014 1007 2586 354 -7 438. 5.9 ## 5 2013 7 22 1005 989 589 96 -16 368. 1.6 ## 6 2013 4 10 960 931 1005 139 -29 434. 2.32 ## # ℹ 1 more variable: gain_per_hour <dbl> ``` --- ## mutate() - 2 If we only want to keep the new variables, use .blue[transmute()]. ```r transmute(df1, gain= arr_delay - dep_delay, speed = distance / air_time * 60, hours = air_time / 60, gain_per_hour = gain / hours) ``` ``` ## # A tibble: 336,776 × 4 ## gain speed hours gain_per_hour ## <dbl> <dbl> <dbl> <dbl> ## 1 -29 467. 10.7 -2.72 ## 2 -10 392. 1.23 -8.11 ## 3 -17 389. 1.85 -9.19 ## 4 -7 438. 5.9 -1.19 ## 5 -16 368. 1.6 -10 ## 6 -29 434. 2.32 -12.5 ## 7 4 366. 2.78 1.44 ## 8 -3 419. 1.82 -1.65 ## 9 -18 437. 2.48 -7.25 ## 10 -3 385. 1.87 -1.61 ## # ℹ 336,766 more rows ``` **Note:** There are many functions for creating new variables that we can use with .blue[mutate()]. The key property is that the function must be vectorized, which means it must take a vector of values as input and returns a vector with the same number of values as output. --- ## summarize() - 1 .blue[summarize()] collapses a data frame to a single row. For example, we can summarize the average departure delays using the following code chunk. ```r summarize(df, delay = mean(dep_delay, na.rm=T)) ``` ``` ## # A tibble: 1 × 1 ## delay ## <dbl> ## 1 12.6 ``` --- ## group_by() & summarize() - 1 In general, .blue[summarize()] function is used together with .blue[group_by()] as we group rows for some purposes. .blue[group_by()] is used to group rows by one or more variables, giving priority to the variable entered first. ```r group_by(df, year, month, day) ``` ``` ## # A tibble: 336,776 × 19 ## # Groups: year, month, day [365] ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 9 641 900 1301 1242 1530 ## 2 2013 6 15 1432 1935 1137 1607 2120 ## 3 2013 1 10 1121 1635 1126 1239 1810 ## 4 2013 9 20 1139 1845 1014 1457 2210 ## 5 2013 7 22 845 1600 1005 1044 1815 ## 6 2013 4 10 1100 1900 960 1342 2211 ## 7 2013 3 17 2321 810 911 135 1020 ## 8 2013 7 22 2257 759 898 121 1026 ## 9 2013 12 5 756 1700 896 1058 2020 ## 10 2013 5 3 1133 2055 878 1250 2215 ## # ℹ 336,766 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` The result shows the original data but indicates groups: year, month, day, in our example. --- ## group_by() & summarize() - 2 For example, we can study the average departure / arrival delays for each day. ```r by_day <- group_by(df, year, month, day) summarize(by_day, ave_dep_delay = mean(dep_delay, na.rm = T), ave_arr_delay = mean(arr_delay, na.rm = T) ) ``` ``` ## # A tibble: 365 × 5 ## # Groups: year, month [12] ## year month day ave_dep_delay ave_arr_delay ## <int> <int> <int> <dbl> <dbl> ## 1 2013 1 1 11.5 12.7 ## 2 2013 1 2 13.9 12.7 ## 3 2013 1 3 11.0 5.73 ## 4 2013 1 4 8.95 -1.93 ## 5 2013 1 5 5.73 -1.53 ## 6 2013 1 6 7.15 4.24 ## 7 2013 1 7 5.42 -4.95 ## 8 2013 1 8 2.55 -3.23 ## 9 2013 1 9 2.28 -0.264 ## 10 2013 1 10 2.84 -5.90 ## # ℹ 355 more rows ``` --- ## Combining Multiple Operations with the Pipe - 1 In other to handle the data processing well in data science, it is essential to know the use of pipes. Pipes are great tool for presenting a sequence of multiple operations and therefore, pipes increase readability of the code. The pipe, **%>%**, is from the package .red[magrittr] and it is loaded automatically when tidyverse is loaded. The logic when using pipe: .orange[object %>% function1 %>% function 2....] .pull-left[ If we want to group the Flights Data by the destination and then find the number of flights, the average distance, the average arrival delay at each destination, and filter to remove Honolulu airport (HNL), we may use the following code chunk to achieve this. ```r by_dest <- group_by(df, dest) delay <- summarize(by_dest, count = n(), ave_dist = mean(distance, na.rm=T), ave_arr_delay = mean(arr_delay, na.rm=T)) delay <- filter(delay, count > 20, dest != "HNL") ``` ] .pull-right[ The following code chunk does the same task with the pipe, %>% and it makes the code easier to read. ```r delay <- df %>% group_by(dest) %>% summarize( count = n(), ave_dist = mean(distance, na.rm=T), ave_arr_delay = mean(arr_delay, na.rm=T) ) %>% filter(count > 20, dest != "HNL") ``` ] --- ## Useful Summary Functions - 1 - Measures of location for a quantitative variable: .blue[ mean()], .blue[ median()] - Measure of spread for a quantitative variable: .blue[sd()], .blue[ IQR()], .blue[mad()] Here, `\(MAD = median(|x_i-\bar{x}|)\)` is called the median absolute deviation which may be more useful if we have outliers. ```r not_cancelled <- df %>% filter(!is.na(dep_delay), !is.na(arr_delay)) not_cancelled_summary <- not_cancelled %>% group_by(dest) %>% summarize( distance_mu = mean(distance), distance_sd = sd(distance)) %>% arrange(desc(distance_sd)) head(not_cancelled_summary, 5) ``` ``` ## # A tibble: 5 × 3 ## dest distance_mu distance_sd ## <chr> <dbl> <dbl> ## 1 EGE 1736. 10.5 ## 2 SAN 2437. 10.4 ## 3 SFO 2578. 10.2 ## 4 HNL 4973. 10.0 ## 5 SEA 2413. 9.98 ``` --- ## Useful Summary Functions - 2 - Measures of rank: .blue[min()], .blue[quantile()], .blue[max()] ```r not_cancelled %>% group_by(year, month, day) %>% summarize( first = min(dep_time), # the first flight departed each day last = max(dep_time) # the last flight departed each day ) ``` ``` ## # A tibble: 365 × 5 ## # Groups: year, month [12] ## year month day first last ## <int> <int> <int> <int> <int> ## 1 2013 1 1 517 2356 ## 2 2013 1 2 42 2354 ## 3 2013 1 3 32 2349 ## 4 2013 1 4 25 2358 ## 5 2013 1 5 14 2357 ## 6 2013 1 6 16 2355 ## 7 2013 1 7 49 2359 ## 8 2013 1 8 454 2351 ## 9 2013 1 9 2 2252 ## 10 2013 1 10 3 2320 ## # ℹ 355 more rows ``` --- ## Useful Summary Functions - 3 - Measures of position: .blue[first()], .blue[nth(x, 2)], .blue[last()] The following code chunk finds the first and last departure for each day ```r not_cancelled %>% group_by(year, month, day) %>% summarize( first_dep = first(dep_time), last_dep = last(dep_time) ) ``` ``` ## # A tibble: 365 × 5 ## # Groups: year, month [12] ## year month day first_dep last_dep ## <int> <int> <int> <int> <int> ## 1 2013 1 1 848 1904 ## 2 2013 1 2 1607 654 ## 3 2013 1 3 2056 941 ## 4 2013 1 4 2123 1026 ## 5 2013 1 5 1344 1329 ## 6 2013 1 6 943 812 ## 7 2013 1 7 2021 1623 ## 8 2013 1 8 1307 1847 ## 9 2013 1 9 641 1842 ## 10 2013 1 10 1121 1844 ## # ℹ 355 more rows ``` --- ## Useful Summary Functions - 4 - Counts: You have seen .blue[n()], which takes no arguments, and returns the size of the current group. To count the nubmer of non-missing values, we can use .blue[sum(is.na(x))]. To count the number of distinct values, use .blue[n_distinct()]. ```r not_cancelled %>% group_by(dest) %>% summarize(carriers = n_distinct(carrier)) %>% arrange(desc(carriers)) ``` ``` ## # A tibble: 104 × 2 ## dest carriers ## <chr> <int> ## 1 ATL 7 ## 2 BOS 7 ## 3 CLT 7 ## 4 ORD 7 ## 5 TPA 7 ## 6 AUS 6 ## 7 DCA 6 ## 8 DTW 6 ## 9 IAD 6 ## 10 MSP 6 ## # ℹ 94 more rows ``` --- ## Useful Summary Functions - 5 We can use .blue[count()] directly if all we want is a count. ```r not_cancelled_count <- not_cancelled %>% count(dest) head(not_cancelled_count, 4) ``` ``` ## # A tibble: 4 × 2 ## dest n ## <chr> <int> ## 1 ABQ 254 ## 2 ACK 264 ## 3 ALB 418 ## 4 ANC 8 ``` We can optionally provide a weight variable. For example we could use this to "count" the total number of miles a plane flew. ```r not_cancelled_weight <- not_cancelled %>% count(tailnum, wt = distance) head(not_cancelled_weight, 4) ``` ``` ## # A tibble: 4 × 2 ## tailnum n ## <chr> <dbl> ## 1 D942DN 3418 ## 2 N0EGMQ 239143 ## 3 N10156 109664 ## 4 N102UW 25722 ``` --- ## Useful Summary Functions - 6 - Counts and proportions of logical values When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, .blue[sum()] gives the number of TRUEs and .blue[mean()] gives the proportion in the variable. For example, we can check how many flights left before 5AM using the following code chunk. ```r not_cancelled %>% group_by(year, month, day) %>% summarize(n_early = sum(dep_time < 500)) ``` ``` ## # A tibble: 365 × 4 ## # Groups: year, month [12] ## year month day n_early ## <int> <int> <int> <int> ## 1 2013 1 1 0 ## 2 2013 1 2 3 ## 3 2013 1 3 4 ## 4 2013 1 4 3 ## 5 2013 1 5 3 ## 6 2013 1 6 2 ## 7 2013 1 7 2 ## 8 2013 1 8 1 ## 9 2013 1 9 3 ## 10 2013 1 10 3 ## # ℹ 355 more rows ``` --- ## Useful Summary Functions - 7 Or what proportion of flights are delayed by more than one hour? ```r not_cancelled %>% group_by(year, month, day) %>% summarize(hour_perc = mean(arr_delay > 60)) ``` ``` ## # A tibble: 365 × 4 ## # Groups: year, month [12] ## year month day hour_perc ## <int> <int> <int> <dbl> ## 1 2013 1 1 0.0722 ## 2 2013 1 2 0.0851 ## 3 2013 1 3 0.0567 ## 4 2013 1 4 0.0396 ## 5 2013 1 5 0.0349 ## 6 2013 1 6 0.0470 ## 7 2013 1 7 0.0333 ## 8 2013 1 8 0.0213 ## 9 2013 1 9 0.0202 ## 10 2013 1 10 0.0183 ## # ℹ 355 more rows ``` --- ## Grouping by Multiple Variables - 1 Here we show some examples to demonstrate how to group the data by multiple variables. ```r per_day <- df %>% group_by(year, month, day) %>% summarize(flights = n()) per_month <- summarize(per_day, flights = sum(flights)) per_year <- summarize(per_month, flights = sum(flights)) head(per_month, 3) ``` ``` ## # A tibble: 3 × 3 ## # Groups: year [1] ## year month flights ## <int> <int> <int> ## 1 2013 1 27004 ## 2 2013 2 24951 ## 3 2013 3 28834 ``` ```r per_year ``` ``` ## # A tibble: 1 × 2 ## year flights ## <int> <int> ## 1 2013 336776 ``` --- ## Ungrouping If we need to remove grouping, and return to operations on ungrouped data, use .blue[ungroup()]. ```r daily <- df %>% group_by(year, month, day) daily %>% ungroup() %>% # no longer grouped by date summarize(flights=n()) # all flights ``` ``` ## # A tibble: 1 × 1 ## flights ## <int> ## 1 336776 ``` --- ## Grouped Mutates and Filters - 1 We can also do convenient operations with .blue[mutate()] and .blue[filter()]. The following code chunk finds the worst members of each group. ```r df1 %>% group_by(year, month, day) %>% filter(rank(desc(arr_delay)) < 10) ``` ``` ## # A tibble: 3,306 × 11 ## # Groups: year, month, day [365] ## year month day dep_delay arr_delay distance air_time gain speed hours ## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 2013 1 9 1301 1272 4983 640 -29 467. 10.7 ## 2 2013 6 15 1137 1127 483 74 -10 392. 1.23 ## 3 2013 1 10 1126 1109 719 111 -17 389. 1.85 ## 4 2013 9 20 1014 1007 2586 354 -7 438. 5.9 ## 5 2013 7 22 1005 989 589 96 -16 368. 1.6 ## 6 2013 4 10 960 931 1005 139 -29 434. 2.32 ## 7 2013 3 17 911 915 1020 167 4 366. 2.78 ## 8 2013 7 22 898 895 762 109 -3 419. 1.82 ## 9 2013 12 5 896 878 1085 149 -18 437. 2.48 ## 10 2013 5 3 878 875 719 112 -3 385. 1.87 ## # ℹ 3,296 more rows ## # ℹ 1 more variable: gain_per_hour <dbl> ``` --- ## Grouped Mutates and Filters - 2 The following code chunk finds all groups bigger than a threshold. ```r popular_dests <- df %>% group_by(dest) %>% filter(n()>365) popular_dests ``` ``` ## # A tibble: 332,577 × 19 ## # Groups: dest [77] ## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time ## <int> <int> <int> <int> <int> <dbl> <int> <int> ## 1 2013 1 9 641 900 1301 1242 1530 ## 2 2013 6 15 1432 1935 1137 1607 2120 ## 3 2013 1 10 1121 1635 1126 1239 1810 ## 4 2013 9 20 1139 1845 1014 1457 2210 ## 5 2013 7 22 845 1600 1005 1044 1815 ## 6 2013 4 10 1100 1900 960 1342 2211 ## 7 2013 3 17 2321 810 911 135 1020 ## 8 2013 7 22 2257 759 898 121 1026 ## 9 2013 12 5 756 1700 896 1058 2020 ## 10 2013 5 3 1133 2055 878 1250 2215 ## # ℹ 332,567 more rows ## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>, ## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, ## # hour <dbl>, minute <dbl>, time_hour <dttm> ``` --- ## Grouped Mutates and Filters - 3 The following code chunk standardizes to compute per group metrics. ```r popular_dests %>% filter(arr_delay > 0) %>% mutate(prop_delay = arr_delay / sum(arr_delay)) %>% select(year:day, arr_delay, prop_delay) ``` ``` ## # A tibble: 131,106 × 6 ## # Groups: dest [77] ## dest year month day arr_delay prop_delay ## <chr> <int> <int> <int> <dbl> <dbl> ## 1 HNL 2013 1 9 1272 0.154 ## 2 CMH 2013 6 15 1127 0.0187 ## 3 ORD 2013 1 10 1109 0.00392 ## 4 SFO 2013 9 20 1007 0.00490 ## 5 CVG 2013 7 22 989 0.0112 ## 6 TPA 2013 4 10 931 0.00778 ## 7 MSP 2013 3 17 915 0.00784 ## 8 ATL 2013 7 22 895 0.00298 ## 9 MIA 2013 12 5 878 0.00625 ## 10 ORD 2013 5 3 875 0.00309 ## # ℹ 131,096 more rows ``` --- ## Some Useful Functions - .blue[drop_na()]: drops rows contain a missing value. - **Usage**: .blue[drop_na(data, ...)] - .blue[rename()]: renames columns - **Usage**: .blue[rename(.data, ...)] - .blue[recode()]: replaces numeric values based on their position, and replaces character values by their name. Usage: .orange[.default], if supplied, gives all cases not matched a new value. .orange[.missing] replaces the missing values by the specified new value. - **Usage**: .blue[recode(.x, ..., .default = NULL, .missing = NULL)] - .blue[case_when()]: a general vectorised if - **Usage**: .blue[case_when(...)] --- ## drop_na() function We can use the .blue[drop_na()] function to drop rows containing missing values. ```r nrow(df) ``` ``` ## [1] 336776 ``` ```r df <- df %>% drop_na() nrow(df) ``` ``` ## [1] 327346 ``` --- ## rename() function We can use the .blue[rename()] function to rename column names. This function makes the task much easier since we can rename a variable directly without knowing which column that variable is. ```r df <- df %>% rename( DepTime = dep_time, SchedDepTime = sched_dep_time, DepDelay = dep_delay, ArrTime = arr_time, SchedArrTime = sched_arr_time, ArrDelay = arr_delay ) colnames(df) ``` ``` ## [1] "year" "month" "day" "DepTime" "SchedDepTime" ## [6] "DepDelay" "ArrTime" "SchedArrTime" "ArrDelay" "carrier" ## [11] "flight" "tailnum" "origin" "dest" "air_time" ## [16] "distance" "hour" "minute" "time_hour" ``` --- ## recode() function We might want to recode specific values in an R data frame sometimes. The .blue[recode()(] function from the .red[dplyr package] makes this simple to accomplish. ```r df$month <- recode(df$month, `1` = "Jan", `2` = "Feb", `3` = "Mar", `4` = "Apr") unique(df$month) ``` ``` ## [1] "Jan" NA "Apr" "Mar" "Feb" ``` We should pay attention to the data type of a variable. In the example above, the original data type is integer and it becomes character after we recode some values. This is due to unreplaced values treated as NA as `.x` is not compatible. ```r x <- c(12, 31, 0, 34, 8, 5, 7, 16, 8) x <- x %>% recode(`0` = 1314) x ``` ``` ## [1] 12 31 1314 34 8 5 7 16 8 ``` --- ## case_when() function We can use .blue[case_when()(] function to create a new variable with the .blue[mutate()(] function together. ```r df <- df %>% mutate(arr_status = case_when( ArrDelay > 60 ~ "LongDelay", ArrDelay > 0 & ArrDelay <= 60 ~ "Delay", ArrDelay == 0 ~ "OnTime", ArrDelay < 0 ~ "NotDelay", is.na(ArrDelay) ~ "Unknown" )) count(df, arr_status) ``` ``` ## # A tibble: 4 × 2 ## arr_status n ## <chr> <int> ## 1 Delay 105215 ## 2 LongDelay 27789 ## 3 NotDelay 188933 ## 4 OnTime 5409 ``` --- # Summary of Main Points By now, you should know how to utilize basic functions in the r package .red[dplyr] for data manipulation and analysis. .small[ - filter - arrange - select - mutate - group_by - summarize ] --- # Supplementary Materials Here are some useful supplementary materials for self-learning. .pull-left[ .center[[<img src="https://r4ds.hadley.nz/cover.jpg" height="250px">](https://r4ds.hadley.nz)] .small[ * [Data Visualization](https://r4ds.hadley.nz/data-visualize) ] ] .pull-right[ .center[[<img src="../Figures/RPubs.png" height="250px">](https://RPubs.com)] .small[ * [Visualizations with ggplot2 1](https://rpubs.com/chrisbrunsdon/UG_ggplot) * [Visualization with ggplot2 2](https://rpubs.com/Cisco_Self_Serve/318832) ] ]