Overview - 1

In this session, we will talk about data manipulation using R package tidyverse. This lecture is based on R for Data Science.

tidyverse contains a collection of R packages that help us doing data management & exploration. The key packages in tidyverse are:

dplyr: data manipulation
ggplot2: data visualization
purr: functional programming toolkit
readr: read data and write files
tibble: simple data frame
tidyr: data management

Overview - 2

In this session, we will focus on the following key functions in dplyr using the dataset flights from the R package nycflights13.

filter(): pick observations by their values
arrange(): reorder the rows
select(): select variables by their names
mutate(): create new variables with functions of existing variables
group_by(): group data by existing variables
summarize(): collapse many values done to a single summary (with group_by)

All functions above work similarly.

The first argument is a data frame.
The subsequent arguments describe what to do with the data frame using the variable names.
The result is a new data frame (but we can save it back to the original data frame if needed).

Load packages and read the Flights Data

First, we load the necessary packages, check conflict functions, and import the dataset flights from the R package nycflights13.

library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights

Understand Our Data

Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables (Flights Data).

year, month, day: Date of departure.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone.
sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local time zone.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier: Two letter carrier abbreviation. See airlines to get name.
flight: Flight number.
tailnum: Plane tail number. See planes for additional metadata.
origin, dest: Origin and destination. See airports for additional metadata.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles.
hour, minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

Get a Glimpse of the Data

We get a glimpse of the data.

glimpse(df)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

filter() - 1

filter() is used when we want to subset observations based on a logical condition. For example, we can select all fights on December 25th using the following code.

Christmas <- filter(df, month == 12, day == 25)

Comparisons - R provides the standard suite: <, <=, >, >=, != (not equal), and == (equal). If we would like to save the results to a variable as well as print them, we can wrap the assignment in parentheses

(Jan1 <- filter(df, month == 1, day == 1))

## # A tibble: 842 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

filter() - 2

Logical Operations - R provides the following syntax: & is “and”, | is “or”, ! is “not”. The following code finds all flights that departed in July or August.

flights0708 <- filter(df, month == 7 | month == 8)
# only shows the first 5 rows and 10 columns in the data
head(filter(df, month %in% c(7, 8)), c(5, 10))

## # A tibble: 5 × 10
##    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
## 1  2013     7     1        1        2029     212     236    2359     157 B6     
## 2  2013     7     1        2        2359       3     344     344       0 B6     
## 3  2013     7     1       29        2245     104     151       1     110 B6     
## 4  2013     7     1       43        2130     193     322      14     188 B6     
## 5  2013     7     1       44        2150     174     300     100     120 AA     
## # … with abbreviated variable names ¹sched_dep_time, ²dep_delay, ³arr_time,
## #   ⁴sched_arr_time, ⁵arr_delay

Note:

If we use filter(df, month == 7 | 8), it finds all months are equal 7 | 8, an expression that evaluates to TRUE. In a numeric context, TRUE becomes one, so this finds all fights in the data.
filter() only includes rows where the condition is TRUE and it excludes both FALSE and NA values.

filter() - 3

If we want to find flights that weren’t delayed on both arrival and departure by more than 1 hour, we could use either of the following codes.

filter(df, !(arr_delay > 60 | dep_delay > 60))

## # A tibble: 295,893 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 295,883 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

filter(df, arr_delay <= 60, dep_delay <= 60)

arrange() - 1

arrange() is used when we want to sort a dataset by a variable. If more variables are specified for sorting a dataset, the variables entered first taking priority over those come later. The following code chunk gives an example that sorts the flights by dates.

arrange(df, year, month, day)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

arrange() - 2

Note: 1. We can save the data frame back to the original data frame after sorting the data. 2. Use desc() for sorting data via descending order. The following code chunk arranges the Flights Data by arrival delay in descending order. 3. Missing values are always sorted at the end.

df <- arrange(df, desc(arr_delay))

We can use the tail() function to read the last part of the data.

tail(df, 5)

## # A tibble: 5 × 19
##    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
## 1  2013     9    30       NA        1455      NA      NA    1634      NA 9E     
## 2  2013     9    30       NA        2200      NA      NA    2312      NA 9E     
## 3  2013     9    30       NA        1210      NA      NA    1330      NA MQ     
## 4  2013     9    30       NA        1159      NA      NA    1344      NA MQ     
## 5  2013     9    30       NA         840      NA      NA    1020      NA MQ     
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

select()

select() is used when we would like to obtain several variables in the data. For example, we can use the following code chunk to select the Flights Data with only a few variables.

# select specific columns
select(df, year, month, day)

# select all columns between year and day
select(df, year:day) 

# select all columns except those from year and day
select(df, -(year:day))

Note:

We can use a minus sign - to drop variables.
There are several helper functions we can use within select(). See ?select for the information.
select() can be used with the everything() function when we have a handful of variables we would like to move to the start of the data frame.

# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())

mutate() - 1

mutate() is used when we would like to add a new variable / column using the other variables in the data.

Note: mutate() always adds new columns at the end of the data.

First, we start by creating a smaller dataset with a few variables and create two variables using varaibles in the dataset.

# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)
df1 <- mutate(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
head(df1, c(5, 11))

## # A tibble: 5 × 11
##    year month   day dep_delay arr_de…¹ dista…² air_t…³  gain speed hours gain_…⁴
##   <int> <int> <int>     <dbl>    <dbl>   <dbl>   <dbl> <dbl> <dbl> <dbl>   <dbl>
## 1  2013     1     9      1301     1272    4983     640   -29  467. 10.7    -2.72
## 2  2013     6    15      1137     1127     483      74   -10  392.  1.23   -8.11
## 3  2013     1    10      1126     1109     719     111   -17  389.  1.85   -9.19
## 4  2013     9    20      1014     1007    2586     354    -7  438.  5.9    -1.19
## 5  2013     7    22      1005      989     589      96   -16  368.  1.6   -10   
## # … with abbreviated variable names ¹arr_delay, ²distance, ³air_time,
## #   ⁴gain_per_hour

mutate() - 2

If we only want to keep the new variables, use transmute().

transmute(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)

## # A tibble: 336,776 × 4
##     gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>         <dbl>
##  1     9  370. 3.78           2.38
##  2    16  374. 3.78           4.23
##  3    31  408. 2.67          11.6 
##  4   -17  517. 3.05          -5.57
##  5   -19  394. 1.93          -9.83
##  6    16  288. 2.5            6.4 
##  7    24  404. 2.63           9.11
##  8   -11  259. 0.883        -12.5 
##  9    -5  405. 2.33          -2.14
## 10    10  319. 2.3            4.35
## # … with 336,766 more rows

Note: There are many functions for creating new variables that we can use with mutate(). The key property is that the function must be vectorized, which means it must take a vector of values as input and returns a vector with the same number of values as output.

summarize() - 1

summarize() collapses a data frame to a single row. For example, we can summarize the average departure delays using the following code chunk.

summarize(df, delay = mean(dep_delay, na.rm=T))

## # A tibble: 1 × 1
##   delay
##   <dbl>
## 1  12.6

group_by() & summarize() - 1

In general, summarize() function is used together with group_by() as we group rows for some purposes. group_by() is used to group rows by one or more variables, giving priority to the variable entered first.

group_by(df, year, month, day)

## # A tibble: 336,776 × 19
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

The result shows the original data but indicates groups: year, month, day, in our example.

group_by() & summarize() - 2

For example, we can study the average departure / arrival delays for each day.

by_day <- group_by(df, year, month, day) 
summarize(by_day, 
          ave_dep_delay = mean(dep_delay, na.rm = T),
          ave_arr_delay = mean(arr_delay, na.rm = T)
          )

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day ave_dep_delay ave_arr_delay
##    <int> <int> <int>         <dbl>         <dbl>
##  1  2013     1     1         11.5         12.7  
##  2  2013     1     2         13.9         12.7  
##  3  2013     1     3         11.0          5.73 
##  4  2013     1     4          8.95        -1.93 
##  5  2013     1     5          5.73        -1.53 
##  6  2013     1     6          7.15         4.24 
##  7  2013     1     7          5.42        -4.95 
##  8  2013     1     8          2.55        -3.23 
##  9  2013     1     9          2.28        -0.264
## 10  2013     1    10          2.84        -5.90 
## # … with 355 more rows

Combining Multiple Operations with the Pipe - 1

In other to handle the data processing well in data science, it is essential to know the use of pipes. Pipes are great tool for presenting a sequence of multiple operations and therefore, pipes increase readability of the code. The pipe, %>%, is from the package magrittr and it is loaded automatically when tidyverse is loaded.

The logic when using pipe: object %>% function1 %>% function 2….

If we want to group the Flights Data by the destination and then find the number of flights, the average distance, the average arrival delay at each destination, and filter to remove Honolulu airport (HNL), we may use the following code chunk to achieve this.

by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
                   count = n(),
                   ave_dist = mean(distance, na.rm=T),
                   ave_arr_delay = mean(arr_delay, na.rm=T))
delay <- filter(delay, count > 20, dest != "HNL")

The following code chunk does the same task with the pipe, %>% and it makes the code easier to read.

delay <- df %>% 
  group_by(dest) %>%
  summarize(
    count = n(),
    ave_dist = mean(distance, na.rm=T),
    ave_arr_delay = mean(arr_delay, na.rm=T)
    ) %>%
  filter(count > 20, dest != "HNL")

Useful Summary Functions - 1

Measures of location for a quantitative variable: mean(), median()
Measure of spread for a quantitative variable: sd(), IQR(), mad()

Here, \(MAD = median(|x_i-\bar{x}|)\) is called the median absolute deviation which may be more useful if we have outliers.

not_cancelled <- df %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled_summary <- not_cancelled %>% 
  group_by(dest) %>%
  summarize(
    distance_mu = mean(distance),
    distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))

head(not_cancelled_summary, 5)

## # A tibble: 5 × 3
##   dest  distance_mu distance_sd
##   <chr>       <dbl>       <dbl>
## 1 EGE         1736.       10.5 
## 2 SAN         2437.       10.4 
## 3 SFO         2578.       10.2 
## 4 HNL         4973.       10.0 
## 5 SEA         2413.        9.98

Useful Summary Functions - 2

Measures of rank: min(), quantile(), max()

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time), # the first flight departed each day
    last = max(dep_time) # the last flight departed each day
  )

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # … with 355 more rows

Useful Summary Functions - 3

Measures of position: first(), nth(x, 2), last()

The following code chunk finds the first and last departure for each day

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )

## # A tibble: 365 × 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # … with 355 more rows

Useful Summary Functions - 4

Counts: You have seen n(), which takes no arguments, and returns the size of the current group. To count the nubmer of non-missing values, we can use sum(is.na(x)). To count the number of distinct values, use n_distinct().

not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))

## # A tibble: 104 × 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # … with 94 more rows

Useful Summary Functions - 5

We can use count() directly if all we want is a count.

not_cancelled_count <- not_cancelled %>% 
  count(dest)
head(not_cancelled_count, 4)

## # A tibble: 4 × 2
##   dest      n
##   <chr> <int>
## 1 ABQ     254
## 2 ACK     264
## 3 ALB     418
## 4 ANC       8

We can optionally provide a weight variable. For example we could use this to “count” the total number of miles a plane flew.

not_cancelled_weight <- not_cancelled %>%
  count(tailnum, wt = distance)
head(not_cancelled_weight, 4)

## # A tibble: 4 × 2
##   tailnum      n
##   <chr>    <dbl>
## 1 D942DN    3418
## 2 N0EGMQ  239143
## 3 N10156  109664
## 4 N102UW   25722

Useful Summary Functions - 6

Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, sum() gives the number of TRUEs and mean() gives the proportion in the variable. For example, we can check how many flights left before 5AM using the following code chunk.

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # … with 355 more rows

Useful Summary Functions - 7

Or what proportion of flights are delayed by more than one hour?

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60))

## # A tibble: 365 × 4
## # Groups:   year, month [12]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # … with 355 more rows

Grouping by Multiple Variables - 1

Here we show some examples to demonstrate how to group the data by multiple variables.

per_day <- df %>% 
   group_by(year, month, day) %>%
  summarize(flights = n())
per_month <- summarize(per_day, flights = sum(flights))
per_year <- summarize(per_month, flights = sum(flights))

head(per_month, 3)

## # A tibble: 3 × 3
## # Groups:   year [1]
##    year month flights
##   <int> <int>   <int>
## 1  2013     1   27004
## 2  2013     2   24951
## 3  2013     3   28834

per_year

## # A tibble: 1 × 2
##    year flights
##   <int>   <int>
## 1  2013  336776

Ungrouping

If we need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily <- df %>% group_by(year, month, day)
daily %>% 
  ungroup() %>% # no longer grouped by date
  summarize(flights=n()) # all flights

## # A tibble: 1 × 1
##   flights
##     <int>
## 1  336776

Grouped Mutates and Filters - 1

We can also do convenient operations with mutate() and filter().

The following code chunk finds the worst members of each group.

df1 %>% 
  group_by(year, month, day) %>% 
  filter(rank(desc(arr_delay)) < 10)

## # A tibble: 3,306 × 7
## # Groups:   year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184       41
##  2  2013     1     1       290       338     1134      213
##  3  2013     1     1       260       263      266       46
##  4  2013     1     1       157       174      213       60
##  5  2013     1     1       216       222      708      121
##  6  2013     1     1       255       250      589      115
##  7  2013     1     1       285       246     1085      146
##  8  2013     1     1       192       191      199       44
##  9  2013     1     1       379       456     1092      222
## 10  2013     1     2       224       207      550       94
## # … with 3,296 more rows

Grouped Mutates and Filters - 2

The following code chunk finds all groups bigger than a threshold.

popular_dests <- df %>%
  group_by(dest) %>% 
  filter(n()>365)
popular_dests

## # A tibble: 332,577 × 19
## # Groups:   dest [77]
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 332,567 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Grouped Mutates and Filters - 3

The following code chunk standardizes to compute per group metrics.

popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, arr_delay, prop_delay)

## # A tibble: 131,106 × 6
## # Groups:   dest [77]
##    dest   year month   day arr_delay prop_delay
##    <chr> <int> <int> <int>     <dbl>      <dbl>
##  1 IAH    2013     1     1        11  0.000111 
##  2 IAH    2013     1     1        20  0.000201 
##  3 MIA    2013     1     1        33  0.000235 
##  4 ORD    2013     1     1        12  0.0000424
##  5 FLL    2013     1     1        19  0.0000938
##  6 ORD    2013     1     1         8  0.0000283
##  7 LAX    2013     1     1         7  0.0000344
##  8 DFW    2013     1     1        31  0.000282 
##  9 ATL    2013     1     1        12  0.0000400
## 10 DTW    2013     1     1        16  0.000116 
## # … with 131,096 more rows

Some Useful Functions

drop_na(): drops rows contain a missing value.

Usage: drop_na(data, …)

rename(): renames columns

Usage: rename(.data, …)

recode(): replaces numeric values based on their position, and replaces character values by their name. Usage: .default, if supplied, gives all cases not matched a new value. .missing replaces the missing values by the specified new value.

Usage: recode(.x, …, .default = NULL, .missing = NULL)

case_when(): a general vectorised if

Usage: case_when(…)

drop_na() function

We can use the drop_na() function to drop rows containing missing values.

nrow(df)

## [1] 336776

df <- df %>% drop_na()
nrow(df)

## [1] 327346

rename() function

We can use the rename() function to rename column names. This function makes the task much easier since we can rename a variable directly without knowing which column that variable is.

df <- df %>% rename(
  DepTime = dep_time,
  SchedDepTime = sched_dep_time, 
  DepDelay = dep_delay,
  ArrTime = arr_time, 
  SchedArrTime = sched_arr_time,
  ArrDelay = arr_delay
)

colnames(df)

##  [1] "year"         "month"        "day"          "DepTime"      "SchedDepTime"
##  [6] "DepDelay"     "ArrTime"      "SchedArrTime" "ArrDelay"     "carrier"     
## [11] "flight"       "tailnum"      "origin"       "dest"         "air_time"    
## [16] "distance"     "hour"         "minute"       "time_hour"

recode() function

We might want to recode specific values in an R data frame sometimes. The recode()( function from the dplyr package makes this simple to accomplish.

df$month <- recode(df$month, 
                   `1` = "Jan",
                   `2` = "Feb",
                   `3` = "Mar",
                   `4` = "Apr")
unique(df$month)

## [1] "Jan" NA    "Apr" "Mar" "Feb"

We should pay attention to the data type of a variable. In the example above, the original data type is integer and it becomes character after we recode some values. This is due to unreplaced values treated as NA as .x is not compatible.

x  <- c(12, 31, 0, 34, 8, 5, 7, 16, 8)
x <- x %>% recode(`0` = 1314)
x

## [1]   12   31 1314   34    8    5    7   16    8

case_when() function

We can use case_when()( function to create a new variable with the mutate()( function together.

df <- df %>% mutate(arr_status = case_when(
  ArrDelay > 60 ~ "LongDelay",
  ArrDelay > 0 & ArrDelay <= 60 ~ "Delay",
  ArrDelay == 0  ~ "OnTime",
  ArrDelay < 0 ~ "NotDelay",
  is.na(ArrDelay) ~ "Unknown"
))

count(df, arr_status)

## # A tibble: 4 × 2
##   arr_status      n
##   <chr>       <int>
## 1 Delay      105215
## 2 LongDelay   27789
## 3 NotDelay   188933
## 4 OnTime       5409

README

You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):

A: Switches show of current versus all slides (helpful for printing all pages)
B: Make fonts large
c: Show table of contents
S: Make fonts smaller

Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.

MTH 209 Data Manipulation and Management

Lesson 8: Data Manipulation with R Package tidyverse