In this session, we will talk about data manipulation using R package tidyverse. This lecture is based on R for Data Science.
tidyverse contains a collection of R packages that help us doing data management & exploration. The key packages in tidyverse are:
In this session, we will focus on the following key functions in dplyr using the dataset flights from the R package nycflights13.
All functions above work similarly.
First, we load the necessary packages, check conflict functions, and import the dataset flights from the R package nycflights13.
Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables (Flights Data).
We get a glimpse of the data.
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
filter() is used when we want to subset observations based on a logical condition. For example, we can select all fights on December 25th using the following code.
Comparisons - R provides the standard suite: <, <=, >, >=, != (not equal), and == (equal). If we would like to save the results to a variable as well as print them, we can wrap the assignment in parentheses
## # A tibble: 842 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 832 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Logical Operations - R provides the following syntax: & is “and”, | is “or”, ! is “not”. The following code finds all flights that departed in July or August.
flights0708 <- filter(df, month == 7 | month == 8)
# only shows the first 5 rows and 10 columns in the data
head(filter(df, month %in% c(7, 8)), c(5, 10))## # A tibble: 5 × 10
## year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 7 1 1 2029 212 236 2359 157 B6
## 2 2013 7 1 2 2359 3 344 344 0 B6
## 3 2013 7 1 29 2245 104 151 1 110 B6
## 4 2013 7 1 43 2130 193 322 14 188 B6
## 5 2013 7 1 44 2150 174 300 100 120 AA
## # … with abbreviated variable names ¹sched_dep_time, ²dep_delay, ³arr_time,
## # ⁴sched_arr_time, ⁵arr_delay
Note:
If we want to find flights that weren’t delayed on both arrival and departure by more than 1 hour, we could use either of the following codes.
## # A tibble: 295,893 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 295,883 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
arrange() is used when we want to sort a dataset by a variable. If more variables are specified for sorting a dataset, the variables entered first taking priority over those come later. The following code chunk gives an example that sorts the flights by dates.
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
Note: 1. We can save the data frame back to the original data frame after sorting the data. 2. Use desc() for sorting data via descending order. The following code chunk arranges the Flights Data by arrival delay in descending order. 3. Missing values are always sorted at the end.
We can use the tail() function to read the last part of the data.
## # A tibble: 5 × 19
## year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 9 30 NA 1455 NA NA 1634 NA 9E
## 2 2013 9 30 NA 2200 NA NA 2312 NA 9E
## 3 2013 9 30 NA 1210 NA NA 1330 NA MQ
## 4 2013 9 30 NA 1159 NA NA 1344 NA MQ
## 5 2013 9 30 NA 840 NA NA 1020 NA MQ
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
select() is used when we would like to obtain several variables in the data. For example, we can use the following code chunk to select the Flights Data with only a few variables.
# select specific columns
select(df, year, month, day)
# select all columns between year and day
select(df, year:day)
# select all columns except those from year and day
select(df, -(year:day)) Note:
mutate() is used when we would like to add a new variable / column using the other variables in the data.
Note: mutate() always adds new columns at the end of the data.
First, we start by creating a smaller dataset with a few variables and create two variables using varaibles in the dataset.
# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)
df1 <- mutate(df1,
gain= arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
head(df1, c(5, 11))## # A tibble: 5 × 11
## year month day dep_delay arr_de…¹ dista…² air_t…³ gain speed hours gain_…⁴
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 9 1301 1272 4983 640 -29 467. 10.7 -2.72
## 2 2013 6 15 1137 1127 483 74 -10 392. 1.23 -8.11
## 3 2013 1 10 1126 1109 719 111 -17 389. 1.85 -9.19
## 4 2013 9 20 1014 1007 2586 354 -7 438. 5.9 -1.19
## 5 2013 7 22 1005 989 589 96 -16 368. 1.6 -10
## # … with abbreviated variable names ¹arr_delay, ²distance, ³air_time,
## # ⁴gain_per_hour
If we only want to keep the new variables, use transmute().
transmute(df1,
gain= arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)## # A tibble: 336,776 × 4
## gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl>
## 1 9 370. 3.78 2.38
## 2 16 374. 3.78 4.23
## 3 31 408. 2.67 11.6
## 4 -17 517. 3.05 -5.57
## 5 -19 394. 1.93 -9.83
## 6 16 288. 2.5 6.4
## 7 24 404. 2.63 9.11
## 8 -11 259. 0.883 -12.5
## 9 -5 405. 2.33 -2.14
## 10 10 319. 2.3 4.35
## # … with 336,766 more rows
Note: There are many functions for creating new variables that we can use with mutate(). The key property is that the function must be vectorized, which means it must take a vector of values as input and returns a vector with the same number of values as output.
summarize() collapses a data frame to a single row. For example, we can summarize the average departure delays using the following code chunk.
## # A tibble: 1 × 1
## delay
## <dbl>
## 1 12.6
In general, summarize() function is used together with group_by() as we group rows for some purposes. group_by() is used to group rows by one or more variables, giving priority to the variable entered first.
## # A tibble: 336,776 × 19
## # Groups: year, month, day [365]
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The result shows the original data but indicates groups: year, month, day, in our example.
For example, we can study the average departure / arrival delays for each day.
by_day <- group_by(df, year, month, day)
summarize(by_day,
ave_dep_delay = mean(dep_delay, na.rm = T),
ave_arr_delay = mean(arr_delay, na.rm = T)
)## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day ave_dep_delay ave_arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7
## 2 2013 1 2 13.9 12.7
## 3 2013 1 3 11.0 5.73
## 4 2013 1 4 8.95 -1.93
## 5 2013 1 5 5.73 -1.53
## 6 2013 1 6 7.15 4.24
## 7 2013 1 7 5.42 -4.95
## 8 2013 1 8 2.55 -3.23
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90
## # … with 355 more rows
In other to handle the data processing well in data science, it is essential to know the use of pipes. Pipes are great tool for presenting a sequence of multiple operations and therefore, pipes increase readability of the code. The pipe, %>%, is from the package magrittr and it is loaded automatically when tidyverse is loaded.
The logic when using pipe: object %>% function1 %>% function 2….
If we want to group the Flights Data by the destination and then find the number of flights, the average distance, the average arrival delay at each destination, and filter to remove Honolulu airport (HNL), we may use the following code chunk to achieve this.
by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
count = n(),
ave_dist = mean(distance, na.rm=T),
ave_arr_delay = mean(arr_delay, na.rm=T))
delay <- filter(delay, count > 20, dest != "HNL")The following code chunk does the same task with the pipe, %>% and it makes the code easier to read.
Measures of location for a quantitative variable: mean(), median()
Measure of spread for a quantitative variable: sd(), IQR(), mad()
Here, \(MAD = median(|x_i-\bar{x}|)\) is called the median absolute deviation which may be more useful if we have outliers.
not_cancelled <- df %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled_summary <- not_cancelled %>%
group_by(dest) %>%
summarize(
distance_mu = mean(distance),
distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
head(not_cancelled_summary, 5)## # A tibble: 5 × 3
## dest distance_mu distance_sd
## <chr> <dbl> <dbl>
## 1 EGE 1736. 10.5
## 2 SAN 2437. 10.4
## 3 SFO 2578. 10.2
## 4 HNL 4973. 10.0
## 5 SEA 2413. 9.98
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first = min(dep_time), # the first flight departed each day
last = max(dep_time) # the last flight departed each day
)## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
The following code chunk finds the first and last departure for each day
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first_dep = first(dep_time),
last_dep = last(dep_time)
)## # A tibble: 365 × 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # … with 355 more rows
not_cancelled %>%
group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))## # A tibble: 104 × 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # … with 94 more rows
We can use count() directly if all we want is a count.
## # A tibble: 4 × 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
We can optionally provide a weight variable. For example we could use this to “count” the total number of miles a plane flew.
not_cancelled_weight <- not_cancelled %>%
count(tailnum, wt = distance)
head(not_cancelled_weight, 4)## # A tibble: 4 × 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, sum() gives the number of TRUEs and mean() gives the proportion in the variable. For example, we can check how many flights left before 5AM using the following code chunk.
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
## 7 2013 1 7 2
## 8 2013 1 8 1
## 9 2013 1 9 3
## 10 2013 1 10 3
## # … with 355 more rows
Or what proportion of flights are delayed by more than one hour?
## # A tibble: 365 × 4
## # Groups: year, month [12]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
## 7 2013 1 7 0.0333
## 8 2013 1 8 0.0213
## 9 2013 1 9 0.0202
## 10 2013 1 10 0.0183
## # … with 355 more rows
Here we show some examples to demonstrate how to group the data by multiple variables.
per_day <- df %>%
group_by(year, month, day) %>%
summarize(flights = n())
per_month <- summarize(per_day, flights = sum(flights))
per_year <- summarize(per_month, flights = sum(flights))
head(per_month, 3)## # A tibble: 3 × 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## # A tibble: 1 × 2
## year flights
## <int> <int>
## 1 2013 336776
If we need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily <- df %>% group_by(year, month, day)
daily %>%
ungroup() %>% # no longer grouped by date
summarize(flights=n()) # all flights## # A tibble: 1 × 1
## flights
## <int>
## 1 336776
We can also do convenient operations with mutate() and filter().
The following code chunk finds the worst members of each group.
## # A tibble: 3,306 × 7
## # Groups: year, month, day [365]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46
## 4 2013 1 1 157 174 213 60
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
## 7 2013 1 1 285 246 1085 146
## 8 2013 1 1 192 191 199 44
## 9 2013 1 1 379 456 1092 222
## 10 2013 1 2 224 207 550 94
## # … with 3,296 more rows
The following code chunk finds all groups bigger than a threshold.
## # A tibble: 332,577 × 19
## # Groups: dest [77]
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 332,567 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The following code chunk standardizes to compute per group metrics.
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, arr_delay, prop_delay)## # A tibble: 131,106 × 6
## # Groups: dest [77]
## dest year month day arr_delay prop_delay
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 IAH 2013 1 1 11 0.000111
## 2 IAH 2013 1 1 20 0.000201
## 3 MIA 2013 1 1 33 0.000235
## 4 ORD 2013 1 1 12 0.0000424
## 5 FLL 2013 1 1 19 0.0000938
## 6 ORD 2013 1 1 8 0.0000283
## 7 LAX 2013 1 1 7 0.0000344
## 8 DFW 2013 1 1 31 0.000282
## 9 ATL 2013 1 1 12 0.0000400
## 10 DTW 2013 1 1 16 0.000116
## # … with 131,096 more rows
We can use the drop_na() function to drop rows containing missing values.
## [1] 336776
## [1] 327346
We can use the rename() function to rename column names. This function makes the task much easier since we can rename a variable directly without knowing which column that variable is.
df <- df %>% rename(
DepTime = dep_time,
SchedDepTime = sched_dep_time,
DepDelay = dep_delay,
ArrTime = arr_time,
SchedArrTime = sched_arr_time,
ArrDelay = arr_delay
)
colnames(df)## [1] "year" "month" "day" "DepTime" "SchedDepTime"
## [6] "DepDelay" "ArrTime" "SchedArrTime" "ArrDelay" "carrier"
## [11] "flight" "tailnum" "origin" "dest" "air_time"
## [16] "distance" "hour" "minute" "time_hour"
We might want to recode specific values in an R data frame sometimes. The recode()( function from the dplyr package makes this simple to accomplish.
## [1] "Jan" NA "Apr" "Mar" "Feb"
We should pay attention to the data type of a variable. In the
example above, the original data type is integer and it becomes
character after we recode some values. This is due to unreplaced values
treated as NA as .x is not compatible.
## [1] 12 31 1314 34 8 5 7 16 8
We can use case_when()( function to create a new variable with the mutate()( function together.
df <- df %>% mutate(arr_status = case_when(
ArrDelay > 60 ~ "LongDelay",
ArrDelay > 0 & ArrDelay <= 60 ~ "Delay",
ArrDelay == 0 ~ "OnTime",
ArrDelay < 0 ~ "NotDelay",
is.na(ArrDelay) ~ "Unknown"
))
count(df, arr_status)## # A tibble: 4 × 2
## arr_status n
## <chr> <int>
## 1 Delay 105215
## 2 LongDelay 27789
## 3 NotDelay 188933
## 4 OnTime 5409
You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):
A: Switches show of current versus all slides (helpful for printing all pages)
B: Make fonts large
c: Show table of contents
S: Make fonts smaller