We facilitated a short series of R data analytics lectures remotely at National Taipei University in May 2022. To the extent possible, the content of the lectures is recorded here. The lectures are based on R for Data Science (Wickham and Grolemund (2016)).
You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):
A: Toggles between showing the current slide and all slides (helpful for printing all pages)
B: Makes fonts larger
C: Shows the table of contents
S: Makes fonts smaller
Where to get help
?data.frame
etc.

In this session, we will talk about data manipulation using the R package tidyverse. This package contains a collection of R packages that help us with data management and exploration. The key packages in tidyverse are:
In this session, we will focus on the following key functions in dplyr using the dataset flights from the R package nycflights13.
All functions above work similarly.
First, we load the necessary packages, check conflict functions, and import the dataset flights from the R package nycflights13.
library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights
Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables (Flights Data).
We get a glimpse of the data.
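The output below comes from glimpse():

glimpse(df)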
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
filter() is used when we want to subset observations based on a logical condition. For example, we can select all flights on December 25th using the following code.
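A call along these lines reproduces the output below:

filter(df, month == 12, day == 25)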
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 25 456 500 -4 649 651
## 2 2013 12 25 524 515 9 805 814
## 3 2013 12 25 542 540 2 832 850
## 4 2013 12 25 546 550 -4 1022 1027
## 5 2013 12 25 556 600 -4 730 745
## 6 2013 12 25 557 600 -3 743 752
## 7 2013 12 25 557 600 -3 818 831
## 8 2013 12 25 559 600 -1 855 856
## 9 2013 12 25 559 600 -1 849 855
## 10 2013 12 25 600 600 0 850 846
## # ... with 709 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
If we would like to save the results to a variable as well as print them, we can wrap the assignment in parentheses:
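For example, assigning the January 1st flights to a variable while printing them (jan1 is just an illustrative name):

(jan1 <- filter(df, month == 1, day == 1))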
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The following code finds all flights that departed in July or August.
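One way to write this uses the logical OR operator:

filter(df, month == 7 | month == 8)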
## # A tibble: 58,752 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
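An equivalent form uses the %in% operator:

filter(df, month %in% c(7, 8))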
## # A tibble: 58,752 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note:
If we want to find flights that weren't delayed by more than an hour on either departure or arrival, we could use either of the following code chunks.
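The first form negates the condition that either delay exceeds 60 minutes:

filter(df, !(arr_delay > 60 | dep_delay > 60))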
## # A tibble: 295,893 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
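The second, equivalent form applies De Morgan's law directly:

filter(df, arr_delay <= 60, dep_delay <= 60)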
## # A tibble: 295,893 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Exercises: Find all flights that:
arrange() is used when we want to sort a dataset by one or more variables. If more than one variable is specified, the variables entered first take priority over those entered later. The following code chunk gives an example that sorts the flights by date.
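A call like this reproduces the output below:

arrange(df, year, month, day)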
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Note:
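We can use desc() to sort a column in descending order, and missing values are always sorted at the end. For example, a call like the following reproduces the first output below:

arrange(df, desc(dep_delay))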
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 9 641 900 1301 1242 1530
## 2 2013 6 15 1432 1935 1137 1607 2120
## 3 2013 1 10 1121 1635 1126 1239 1810
## 4 2013 9 20 1139 1845 1014 1457 2210
## 5 2013 7 22 845 1600 1005 1044 1815
## 6 2013 4 10 1100 1900 960 1342 2211
## 7 2013 3 17 2321 810 911 135 1020
## 8 2013 7 22 2257 759 898 121 1026
## 9 2013 12 5 756 1700 896 1058 2020
## 10 2013 5 3 1133 2055 878 1250 2215
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
## # A tibble: 6 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 9 30 NA 1842 NA NA 2019
## 2 2013 9 30 NA 1455 NA NA 1634
## 3 2013 9 30 NA 2200 NA NA 2312
## 4 2013 9 30 NA 1210 NA NA 1330
## 5 2013 9 30 NA 1159 NA NA 1344
## 6 2013 9 30 NA 840 NA NA 1020
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Exercises:
select() is used when we would like to keep only some of the variables in the data. For example, we can use the following code chunks to select just a few variables from the Flights Data.
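First, selecting columns by name:

select(df, year, month, day)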
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
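We can also select a range of columns using the : operator:

select(df, year:day)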
## # A tibble: 336,776 x 3
## year month day
## <int> <int> <int>
## 1 2013 1 1
## 2 2013 1 1
## 3 2013 1 1
## 4 2013 1 1
## 5 2013 1 1
## 6 2013 1 1
## 7 2013 1 1
## 8 2013 1 1
## 9 2013 1 1
## 10 2013 1 1
## # ... with 336,766 more rows
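Or drop a range of columns with a minus sign:

select(df, -(year:day))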
## # A tibble: 336,776 x 16
## dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
## <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 517 515 2 830 819 11 UA
## 2 533 529 4 850 830 20 UA
## 3 542 540 2 923 850 33 AA
## 4 544 545 -1 1004 1022 -18 B6
## 5 554 600 -6 812 837 -25 DL
## 6 554 558 -4 740 728 12 UA
## 7 555 600 -5 913 854 19 B6
## 8 557 600 -3 709 723 -14 EV
## 9 557 600 -3 838 846 -8 B6
## 10 558 600 -2 753 745 8 AA
## # ... with 336,766 more rows, and 9 more variables: flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Note:
# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())
## # A tibble: 336,776 x 19
## carrier origin dest distance year month day dep_time sched_dep_time
## <chr> <chr> <chr> <dbl> <int> <int> <int> <int> <int>
## 1 UA EWR IAH 1400 2013 1 1 517 515
## 2 UA LGA IAH 1416 2013 1 1 533 529
## 3 AA JFK MIA 1089 2013 1 1 542 540
## 4 B6 JFK BQN 1576 2013 1 1 544 545
## 5 DL LGA ATL 762 2013 1 1 554 600
## 6 UA EWR ORD 719 2013 1 1 554 558
## 7 B6 EWR FLL 1065 2013 1 1 555 600
## 8 EV LGA IAD 229 2013 1 1 557 600
## 9 B6 JFK MCO 944 2013 1 1 557 600
## 10 AA LGA ORD 733 2013 1 1 558 600
## # ... with 336,766 more rows, and 10 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## # tailnum <chr>, air_time <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Exercises:
mutate() is used when we would like to add a new variable / column using the other variables in the data.
Note: mutate() always adds new columns at the end of the data.
First, we start by creating a smaller dataset with a few variables and then create new variables using the variables in the dataset.
# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)
mutate(df1,
  gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
## year month day dep_delay arr_delay distance air_time gain speed hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 370. 3.78
## 2 2013 1 1 4 20 1416 227 16 374. 3.78
## 3 2013 1 1 2 33 1089 160 31 408. 2.67
## 4 2013 1 1 -1 -18 1576 183 -17 517. 3.05
## 5 2013 1 1 -6 -25 762 116 -19 394. 1.93
## 6 2013 1 1 -4 12 719 150 16 288. 2.5
## 7 2013 1 1 -5 19 1065 158 24 404. 2.63
## 8 2013 1 1 -3 -14 229 53 -11 259. 0.883
## 9 2013 1 1 -3 -8 944 140 -5 405. 2.33
## 10 2013 1 1 -2 8 733 138 10 319. 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
If we only want to keep the new variables, use transmute().
transmute(df1,
  gain = arr_delay - dep_delay,
speed = distance / air_time * 60,
hours = air_time / 60,
gain_per_hour = gain / hours)
## # A tibble: 336,776 x 4
## gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl>
## 1 9 370. 3.78 2.38
## 2 16 374. 3.78 4.23
## 3 31 408. 2.67 11.6
## 4 -17 517. 3.05 -5.57
## 5 -19 394. 1.93 -9.83
## 6 16 288. 2.5 6.4
## 7 24 404. 2.63 9.11
## 8 -11 259. 0.883 -12.5
## 9 -5 405. 2.33 -2.14
## 10 10 319. 2.3 4.35
## # ... with 336,766 more rows
Note: There are many functions for creating new variables that we can use with mutate(). The key property is that the function must be vectorized: it must take a vector of values as input and return a vector with the same number of values as output.
Exercises:
## # A tibble: 336,776 x 2
## dep_time sched_dep_time
## <int> <int>
## 1 517 515
## 2 533 529
## 3 542 540
## 4 544 545
## 5 554 600
## 6 554 558
## 7 555 600
## 8 557 600
## 9 557 600
## 10 558 600
## # ... with 336,766 more rows
For example, 517 represents 5:17 (5:17 AM) and 1517 represents 15:17 (or 3:17 PM). We will use 1517 to demonstrate how to convert the time to the number of minutes since midnight (\(15 \times 60+17=917\) minutes).
We need to be able to extract 15 and 17 separately. We can use the integer division operator, %/%, and the modulo operator, %%, to achieve this.
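For example:

1517 %/% 100
1517 %% 100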
## [1] 15
## [1] 17
There is still one issue: midnight is represented by 2400, which would correspond to \(24 \times 60 = 1440\) minutes since midnight, but it should correspond to 0. After converting all the times to minutes after midnight, whatever_time %% 1440 converts 1440 to zero while keeping all the other times the same.
transmute(df,
  dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
  sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)
## # A tibble: 336,776 x 2
## dep_time_mins sched_dep_time_mins
## <dbl> <dbl>
## 1 317 315
## 2 333 329
## 3 342 340
## 4 344 345
## 5 354 360
## 6 354 358
## 7 355 360
## 8 357 360
## 9 357 360
## 10 358 360
## # ... with 336,766 more rows
time_to_mins <- function(x) (x %/% 100 * 60 + x %% 100) %% 1440  # midnight (2400) maps to 0
transmute(df,
dep_time_mins = time_to_mins(dep_time),
sched_dep_time_mins = time_to_mins(sched_dep_time)
)
## # A tibble: 336,776 x 2
## dep_time_mins sched_dep_time_mins
## <dbl> <dbl>
## 1 317 315
## 2 333 329
## 3 342 340
## 4 344 345
## 5 354 360
## 6 354 358
## 7 355 360
## 8 357 360
## 9 357 360
## 10 358 360
## # ... with 336,766 more rows
summarize() collapses a data frame to a single row. For example, we can summarize the average departure delays using the following code chunk.
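A call along these lines reproduces the output below:

summarize(df, delay = mean(dep_delay, na.rm = TRUE))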
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
In general, summarize() is used together with group_by(), since we usually want summaries for groups of rows. group_by() is used to group rows by one or more variables, giving priority to the variables entered first.
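For example, grouping the flights by date:

group_by(df, year, month, day)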
## # A tibble: 336,776 x 19
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The result shows the original data but indicates the grouping variables (year, month, and day in our example). For example, we can now study the average departure and arrival delays for each day.
by_day <- group_by(df, year, month, day)
summarize(by_day,
ave_dep_delay = mean(dep_delay, na.rm = T),
ave_arr_delay = mean(arr_delay, na.rm = T)
)
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day ave_dep_delay ave_arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7
## 2 2013 1 2 13.9 12.7
## 3 2013 1 3 11.0 5.73
## 4 2013 1 4 8.95 -1.93
## 5 2013 1 5 5.73 -1.53
## 6 2013 1 6 7.15 4.24
## 7 2013 1 7 5.42 -4.95
## 8 2013 1 8 2.55 -3.23
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90
## # ... with 355 more rows
In order to handle data processing well in data science, it is essential to know how to use pipes. Pipes are a great tool for presenting a sequence of multiple operations and therefore increase the readability of the code. The pipe, %>%, is from the package magrittr and is loaded automatically when tidyverse is loaded.
The logic when using a pipe: object %>% function1() %>% function2() ….
If we want to group the Flights Data by the destination and then find the number of flights, the average distance, the average arrival delay at each destination, and filter to remove Honolulu airport (HNL), we may use the following code chunk to achieve this.
by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
count = n(),
ave_dist = mean(distance, na.rm=T),
ave_arr_delay = mean(arr_delay, na.rm=T)
)
delay <- filter(delay, count > 20, dest != "HNL")
The following code chunk does the same task with the pipe, %>% and it makes the code easier to read.
delay <- df %>%
group_by(dest) %>%
summarize(
count = n(),
ave_dist = mean(distance, na.rm=T),
ave_arr_delay = mean(arr_delay, na.rm=T)
) %>%
filter(count > 20, dest != "HNL")
Measures of location for a quantitative variable: mean(), median()
Measures of spread for a quantitative variable: sd(), IQR(), mad()
Here, \(MAD = median(|x_i-\tilde{x}|)\), where \(\tilde{x}\) is the sample median, is called the median absolute deviation (R's mad() also applies a scaling constant by default); it may be more useful than the standard deviation when we have outliers.
not_cancelled <- df %>%
filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
group_by(dest) %>%
summarize(
distance_mu = mean(distance),
distance_sd = sd(distance)) %>%
arrange(desc(distance_sd))
## # A tibble: 104 x 3
## dest distance_mu distance_sd
## <chr> <dbl> <dbl>
## 1 EGE 1736. 10.5
## 2 SAN 2437. 10.4
## 3 SFO 2578. 10.2
## 4 HNL 4973. 10.0
## 5 SEA 2413. 9.98
## 6 LAS 2241. 9.91
## 7 PDX 2446. 9.87
## 8 PHX 2141. 9.86
## 9 LAX 2469. 9.66
## 10 IND 652. 9.46
## # ... with 94 more rows
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first = min(dep_time), # the first flight departed each day
last = max(dep_time) # the last flight departed each day
)
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
The following code chunk finds the first and last departures for each day using first() and last().
not_cancelled %>%
group_by(year, month, day) %>%
summarize(
first_dep = first(dep_time),
last_dep = last(dep_time)
)
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
## 7 2013 1 7 49 2359
## 8 2013 1 8 454 2351
## 9 2013 1 9 2 2252
## 10 2013 1 10 3 2320
## # ... with 355 more rows
not_cancelled %>%
group_by(dest) %>%
summarize(carriers = n_distinct(carrier)) %>%
arrange(desc(carriers))
## # A tibble: 104 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
## 7 DCA 6
## 8 DTW 6
## 9 IAD 6
## 10 MSP 6
## # ... with 94 more rows
We can use count() directly if all we want is a count.
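For example:

not_cancelled %>%
  count(dest)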
## # A tibble: 104 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
## 6 AUS 2411
## 7 AVL 261
## 8 BDL 412
## 9 BGR 358
## 10 BHM 269
## # ... with 94 more rows
We can optionally provide a weight variable. For example, we could use this to "count" the total number of miles a plane flew.
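A sketch using count()'s wt argument:

not_cancelled %>%
  count(tailnum, wt = distance)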
## # A tibble: 4,037 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
## 6 N104UW 24616
## 7 N10575 139903
## 8 N105UW 23618
## 9 N107US 21677
## 10 N108UW 32070
## # ... with 4,027 more rows
When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, sum() gives the number of TRUEs and mean() gives the proportion of TRUEs. For example, we can check how many flights left before 5 AM each day using the following code chunk.
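A chunk along these lines, treating departures with dep_time < 500 as leaving before 5 AM, reproduces the output below:

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))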
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
## 7 2013 1 7 2
## 8 2013 1 8 1
## 9 2013 1 9 3
## 10 2013 1 10 3
## # ... with 355 more rows
Or what proportion of flights are delayed by more than one hour?
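Along the same lines, assuming a 60-minute threshold:

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60))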
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
## 7 2013 1 7 0.0333
## 8 2013 1 8 0.0213
## 9 2013 1 9 0.0202
## 10 2013 1 10 0.0183
## # ... with 355 more rows
Here we show some examples to demonstrate how to group the data by multiple variables.
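Each call to summarize() peels off one level of the grouping. A sketch along these lines (per_day, per_month, and per_year are illustrative names) reproduces the outputs below, starting with the number of flights per day:

daily <- df %>% group_by(year, month, day)
(per_day <- summarize(daily, flights = n()))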
## # A tibble: 365 x 4
## # Groups: year, month [12]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
## 7 2013 1 7 933
## 8 2013 1 8 899
## 9 2013 1 9 902
## 10 2013 1 10 932
## # ... with 355 more rows
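Summarizing the daily counts again rolls them up to monthly totals:

(per_month <- summarize(per_day, flights = sum(flights)))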
## # A tibble: 12 x 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
## 7 2013 7 29425
## 8 2013 8 29327
## 9 2013 9 27574
## 10 2013 10 28889
## 11 2013 11 27268
## 12 2013 12 28135
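Summarizing once more gives the yearly total:

(per_year <- summarize(per_month, flights = sum(flights)))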
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
If we need to remove grouping, and return to operations on ungrouped data, use ungroup().
daily <- df %>% group_by(year, month, day)
daily %>%
ungroup() %>% # no longer grouped by date
summarize(flights=n()) # all flights
## # A tibble: 1 x 1
## flights
## <int>
## 1 336776
Exercises:
df %>%
filter(!is.na(dep_delay)) %>%
arrange(tailnum, year, month, day) %>%
group_by(tailnum) %>%
# cumulative number of flights delayed over one hour
mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
  # count the flights before the first delay of more than an hour (cumulative count still 0)
summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
arrange(total_flights)
## # A tibble: 4,037 x 2
## tailnum total_flights
## <chr> <int>
## 1 D942DN 0
## 2 N10575 0
## 3 N11106 0
## 4 N11109 0
## 5 N11187 0
## 6 N11199 0
## 7 N12967 0
## 8 N13550 0
## 9 N136DL 0
## 10 N13903 0
## # ... with 4,027 more rows
The sort argument to count() sorts the results in descending order of n. We can use this anytime we would otherwise run count() followed by arrange().
For example, the following code chunk counts the number of flights to a destination and sorts the returned data from highest to lowest.
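Applied to the full dataset, a call like this reproduces the output below:

df %>%
  count(dest, sort = TRUE)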
## # A tibble: 105 x 2
## dest n
## <chr> <int>
## 1 ORD 17283
## 2 ATL 17215
## 3 LAX 16174
## 4 BOS 15508
## 5 MCO 14082
## 6 CLT 14064
## 7 SFO 13331
## 8 FLL 12055
## 9 MIA 11728
## 10 DCA 9705
## # ... with 95 more rows
We can also do convenient operations with mutate() and filter().
The following code chunk finds the worst members of each group.
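A sketch that matches the output below uses the smaller dataset df1 created earlier and keeps, within each day, the nine flights with the largest arrival delays:

df1 %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10)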
## # A tibble: 3,306 x 7
## # Groups: year, month, day [365]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46
## 4 2013 1 1 157 174 213 60
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
## 7 2013 1 1 285 246 1085 146
## 8 2013 1 1 192 191 199 44
## 9 2013 1 1 379 456 1092 222
## 10 2013 1 2 224 207 550 94
## # ... with 3,296 more rows
The following code chunk finds all groups bigger than a threshold.
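A sketch that matches the output below keeps only destinations with more than 365 flights and stores the result as popular_dests, the name used in the next chunk:

popular_dests <- df %>%
  group_by(dest) %>%
  filter(n() > 365)
popular_dests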
## # A tibble: 332,577 x 19
## # Groups: dest [77]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
The following code chunk standardizes values to compute per-group metrics.
popular_dests %>%
filter(arr_delay > 0) %>%
mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
select(year:day, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups: dest [77]
## dest year month day arr_delay prop_delay
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 IAH 2013 1 1 11 0.000111
## 2 IAH 2013 1 1 20 0.000201
## 3 MIA 2013 1 1 33 0.000235
## 4 ORD 2013 1 1 12 0.0000424
## 5 FLL 2013 1 1 19 0.0000938
## 6 ORD 2013 1 1 8 0.0000283
## 7 LAX 2013 1 1 7 0.0000344
## 8 DFW 2013 1 1 31 0.000282
## 9 ATL 2013 1 1 12 0.0000400
## 10 DTW 2013 1 1 16 0.000116
## # ... with 131,096 more rows
Exercises:
In this session, we will introduce how to visualize our data using ggplot2 and plotly. The lecture is based on UC Business Analytics R Programming Guide.
While we can use the built-in plotting functions in base R to obtain plots, the package ggplot2 creates advanced graphs with simple and flexible commands.
First, we load the necessary packages, check conflict functions, and get a glimpse of the dataset mpg from the R package ggplot2.
library(tidyverse)
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("filter", "dplyr")
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c~
Now we need to understand the data and each variable in the data. This dataset contains fuel economy data from 1999 and 2008 for 38 popular models of cars (Fuel Economy Data).
The basic idea of creating plots using ggplot2 is to specify each component of the following and combine them with +.
ggplot() function plays an important role in data visualization as it is very flexible for plotting many different types of graphic displays.
The logic when using the ggplot() function is: ggplot(data, mapping) + geom_function().
The following code chunk shows how we can obtain a scatter plot to study the relationship between engine displacement and highway mileage per gallon.
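A minimal version of that chunk:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()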
Exercises:
The aesthetic mappings allow us to select variables to be plotted and to use data properties to influence visual characteristics such as color, size, shape, and position. As a result, each visual characteristic can encode a different part of the data and be utilized to communicate information.
All aesthetics for a plot are specified in the aes() function call.
For example, we can add a mapping from the class of the cars to a color characteristic:
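For example:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()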
Note:
We should note that in the above code chunk, class is a variable in the data; therefore, the command specifies that a categorical variable is used as the third variable in the figure.
Using the aes() function will cause the visual channel to be based on the data specified in the argument. For example, using aes(color = "blue") won't cause the geometry's color to be "blue", but will instead cause the visual channel to be mapped from the vector c("blue"), as if we only had a single type of engine that happened to be called "blue". If we wish to apply an aesthetic property to an entire geometry, we can set that property as an argument to the geom function, outside of the aes() call:
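For example, setting a fixed color rather than mapping one:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")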
Exercises:
Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical versus continuous variables?
What happens if we map an aesthetic to something other than a variable name, like ggplot(mpg, aes(x = displ, y = hwy, color = displ < 5)) + geom_point()?
Building on these basics, we can use ggplot2 to create almost any kind of plot we may want. These plots are declared using functions that follow from the Grammar of Graphics. ggplot2 supports a number of different types of geometric objects, including:
Each of these geometries will make use of the aesthetic mappings provided, although the visual qualities to which the data are mapped will differ. For example, we can map data to the shape of a geom_point (e.g., whether they should be circles or squares), or we can map data to the line type of a geom_line (e.g., whether it is solid or dotted), but not vice versa.
Almost all geoms require an x and y mapping at the bare minimum.
# x and y mapping needed when creating a scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point()
What makes this really powerful is that you can add multiple geometries to a plot, thus allowing you to create complex graphics showing multiple aspects of your data.
# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
Note: 1. Since the aesthetics for each geom can be different, we could show multiple lines on the same plot (or with different colors, styles, etc).
For example, we can plot both points and a smoothed line for the same x and y variable but specify unique colors within each geom:
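For instance (the particular colors here are arbitrary choices):

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(color = "red")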
If we specify an aesthetic within ggplot(), it will be passed on to each geom that follows. Or we can specify certain aes within each geom, which allows us to only show certain characteristics for that specific layer (i.e. geom_point).
# color aesthetic passed to each geom layer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
geom_smooth(se = FALSE)
# color aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_smooth(se = FALSE)
Exercises:
What geom would you use to draw a line chart? A boxplot? A histogram?
Create a boxplot of the highway mileage (hwy).
The following bar chart shows the frequency distribution of vehicle class. Note that the y axis shows the count of elements of each particular type. This count is not part of the dataset, but is instead a statistical transformation that geom_bar() automatically applies to the data. In particular, it applies the stat_count transformation.
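A chunk like this produces that bar chart:

ggplot(mpg, aes(x = class)) +
  geom_bar()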
ggplot2 supports many different statistical transformations. For example, the "identity" transformation will leave the data "as is". We can specify which statistical transformation a geom uses by passing it as the stat argument. For example, suppose our data already had the count as a variable:
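A chunk along these lines produces the counts shown below (class_count is just an illustrative name):

class_count <- mpg %>% count(class)
class_count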
## # A tibble: 7 x 2
## class n
## <chr> <int>
## 1 2seater 5
## 2 compact 47
## 3 midsize 41
## 4 minivan 11
## 5 pickup 33
## 6 subcompact 35
## 7 suv 62
We can use stat = "identity" within geom_bar() to map the bar heights to this variable. Also, note that we now use n as our y variable:
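Continuing with the class_count table from above:

ggplot(class_count, aes(x = class, y = n)) +
  geom_bar(stat = "identity")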
We can also call stat_ functions directly to add additional layers. For example, here we create a scatter plot of highway miles for each displacement value and then use stat_summary() to plot the mean highway miles at each displacement value.
ggplot(mpg, aes(displ, hwy)) +
geom_point(color = "grey") +
  stat_summary(fun = "mean", geom = "line", size = 1, linetype = "dashed")
Exercises:
What is the default geom associated with stat_summary()?
What variables does stat_smooth() compute? What parameters control its behavior?
In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of “rules” as to how different components should be positioned relative to each other. This position is noticeable in geom_bar() if we map a different variable to the color visual characteristic.
# bar chart of class, colored by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar()
By default, geom_bar() uses a position adjustment of "stack", which makes each rectangle's height proportional to its value and stacks the rectangles on top of each other. We can use the position argument to specify which position adjustment rules to follow:
# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "dodge")
# position = "fill": percentage chart
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "fill")
Note: We may need to check the documentation for each particular geom to learn more about its positioning adjustments.
Whenever we specify an aesthetic mapping, ggplot() uses a particular scale to determine the range of values that the data should map to. It automatically adds a scale for each mapping to the plot.
However, the scale used in the figure can be changed if needed. Each scale can be represented by a function with the following name: scale_, followed by the name of the aesthetic property, followed by an _ and the name of the scale. A continuous scale will handle things like numeric data, whereas a discrete scale will handle things like colors.
# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_x_continuous() +
scale_y_continuous() +
scale_colour_discrete()
While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, we can use a scale to change the direction of an axis:
# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
geom_point() +
scale_x_reverse() +
scale_y_reverse()
Similarly, we can use scale_x_log10() and scale_x_sqrt() to transform the scale. We can use scales to format the axes as well.
ggplot(mpg, aes(x = class, fill = drv)) +
geom_bar(position = "fill") +
scale_y_continuous(breaks = seq(0, 1, by = .2),
labels = scales::percent) +
labs(y = "Percent")
A common parameter to change is which set of colors to use in a plot. While we can use the default coloring, a more common option is to leverage the pre-defined palettes from colorbrewer.org. These color sets have been carefully designed to look good and to be viewable by people with certain forms of color blindness. We can leverage Color Brewer palettes by specifying scale_color_brewer() and passing the palette as an argument.
# default color brewer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_brewer()
# specifying color palette
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
scale_color_brewer(palette = "Set3")
Similar to scales, coordinate systems are specified with functions that all start with coord_ and are added as a layer. There are a number of different possible coordinate systems to use, including:
# zoom in with coord_cartesian
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point() +
coord_cartesian(xlim = c(0, 5))
If we want to divide the information into multiple subplots, facets are the way to go. Faceting allows us to view a separate plot for each level of a categorical variable. We can construct a plot with multiple facets by using facet_wrap(). This will produce a "row" of subplots, one for each level of the categorical variable (the number of rows can be specified with an additional argument).
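For example, a sketch that facets the displacement-mileage scatterplot by class:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class)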
Note: 1. We can use facet_grid() to facet the data by more than one categorical variable. 2. We use a tilde (~) in our facet functions. With facet_grid(), the variable to the left of the tilde is represented in the rows and the variable to the right is represented across the columns.
Exercises:
Textual annotations and labels (on the plot, axes, geometry, and legend) are crucial for understanding and presenting information.
We can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!).
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
geom_point() +
labs(title = "Fuel Efficiency by Engine Power",
subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
x = "Engine Displacement (liters)",
y = "Fuel Efficiency (miles per gallon)",
color = "Car Type")
It is also possible to add labels to the plot itself (e.g., to label each point or line) by adding a geom_text or geom_label layer to the plot; effectively, we are plotting an extra set of data whose labels happen to be the model names.
# a data table of each car that has best efficiency of its type
best_in_class <- mpg %>%
group_by(class) %>%
filter(row_number(desc(hwy)) == 1)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_label(data = best_in_class, aes(label = model), alpha = 0.5)
However, we can see that two labels overlap one another in the top-left part of the plot. We can use geom_text_repel() from the ggrepel package to help position the labels.
library(ggrepel)
ggplot(mpg, aes(x = displ, y = hwy)) +
geom_point(aes(color = class)) +
geom_text_repel(data = best_in_class, aes(label = model))
Whenever we want to customize titles, labels, fonts, background, grid lines, and legends, we can use themes.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
labs(title = "Fuel Efficiency by Engine Power",
x = "Engine Displacement (Liters)",
y = "Fuel Efficiency (Miles per gallon)") +
theme(axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12))
Note:
We only list some key components here.
See Modify Components of A Theme and Complete Themes for more details about the use of theme.
The R package plotly makes it very easy to create interactive graphic displays once we already know how to use ggplot() to create graphs.
The following code chunk shows the interactive plots corresponding to the figures we have created in the previous section.
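A minimal sketch wraps one of the earlier ggplot objects with ggplotly():

library(plotly)
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
ggplotly(p)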
In data science, it is important to get to know your data before advanced modeling or further analysis. We should understand what the data are about, what variables we have, the size of the data, how many missing values there are, the data type of each variable, any possible relationships between variables, and anything unusual or interesting in the data.
We will use the Medical Cost Personal Dataset to go over the use of functions in the package DataExplorer. For demonstration purposes, we modified this dataset by introducing random missing values in some variables.
library(DataExplorer)
insurance <- read_csv("https://raw.githubusercontent.com/Ying-Ju/R_Data_Analytics_Series_NTPU/main/insurance.csv")
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age <dbl> 19, 18, 28, NA, 32, NA, 46, 37, 37, 60, 25, 62, NA, 56, 27, 1~
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
First, we check the basic description for the data using the function plot_intro() in the package DataExplorer.
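For example:

plot_intro(insurance)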
Then, we study the distribution of missing values in the data using the function plot_missing() in the package DataExplorer.
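For example:

plot_missing(insurance)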
Since there are 7 variables, we will study all variables in the data.
Now, we study the frequency distribution of all categorical variables in the data using the function plot_bar() in the package DataExplorer.
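For example:

plot_bar(insurance)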
The following code shows the distribution of the sum of charges by each categorical variable in the data, individually.
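A sketch, assuming the with argument is used to aggregate charges:

plot_bar(insurance, with = "charges")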
Next, we study the distribution of all quantitative variables in the data using the function plot_histogram() in the package DataExplorer.
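For example:

plot_histogram(insurance)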
We study the distributions of age, bmi, and charges with respect to region individually using the function plot_boxplot() in the package DataExplorer.
insurance_Q <- insurance %>%
select(age, bmi, charges, region) %>%
drop_na()
plot_boxplot(insurance_Q, by = "region")
We can study the association between any quantitative variable with a given response variable in the data using the function plot_scatterplot() in the package DataExplorer. Here, we study the association between charges and other quantitative variables in the data.
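A sketch using the quantitative subset created above, with charges as the variable to plot against:

plot_scatterplot(insurance_Q, by = "charges")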
We can get a scatterplot with sample observations as well.
The above figure only shows the association between charges and other quantitative variables in the northwest.
We can check the correlation of all quantitative variables in the data using the function plot_correlation() in the package DataExplorer.
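A sketch that drops rows with missing values and restricts attention to the continuous variables:

plot_correlation(na.omit(insurance), type = "continuous")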
If you are new to data exploration and have no idea where to start, the create_report() function in the package DataExplorer can help create a data exploration report for the data.
create_report(insurance, output_file = "report.html", output_dir = "G:/Shared drives/R data analytics series at NTPU")
Note: Use help("create_report") to find the usage of create_report().
In this session, we will introduce
R Markdown is a file format for creating dynamic documents with R and RStudio. R Markdown documents are written in Markdown, an easy-to-write plain-text format, with embedded R code.
In order to create an R Markdown presentation, we click File > New File > R Markdown… There are four options:
This format allows us to create a slide show whose slides are broken up into sections by using the heading tags # and ##. If a header is not needed, a new slide can be created using a horizontal rule (---).
Similar to ioslides, this format allows us to create a slide show broken up into sections by using the heading tag ##. If a header is not needed, a new slide can be created using a horizontal rule (---). A Slidy presentation provides a table of contents while an ioslides presentation doesn't.
This format allows us to create a beamer presentation (LaTeX). The slides can be broken up into sections by using the heading tags # and ##. If a header is not needed, a new slide can be created using a horizontal rule (---).
Creating an R Markdown Presentation
In the following, we show an example of the header of an R Markdown file.
We can use the output option to manipulate which presentation we would like to have.
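A minimal header might look like the following (the title, author, and date fields are placeholders); changing the output field to slidy_presentation or beamer_presentation switches the format:

---
title: "Presentation Title"
author: "Author Name"
date: "May 2022"
output: ioslides_presentation
---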
To render an R Markdown document into its final output format, we can click the Knit button to render the document in RStudio and RStudio will show a preview of it.
Further settings for presentations can be found in R Markdown: The Definitive Guide.
An easy way to start creating a xaringan presentation is to use the R Markdown template with Ninja Presentation or Ninja Themed Presentation.
Creating an R Markdown Presentation
A comprehensive tutorial on xaringan presentations can be found at xaringan Presentation.
An easy way to start creating a flexdashboard is to use the R Markdown template with Flex Dashboard.
Creating a Flex Dashboard
A comprehensive tutorial on flexdashboard can be found at flexdashboard.
In this session, we will provide a quick overview of GitHub. Due to the time limitation of the class, it will cover only some basic usage of GitHub Desktop.
Git is a version control system that allows us to track changes in any set of files. It is typically used by programmers who are working on source code together.
GitHub is a version control and collaboration tool for programming. It allows us to collaborate on projects with other people from any location.
We can register for a GitHub account at www.github.com.
Install GitHub Desktop from https://desktop.github.com/.
We will walk through the process of creating a repository, tracking changes, and exploring a file's history.
Here is an example that shows what GitHub Desktop looks like.
Fetch downloads the most recent updates from origin, but it does not update our local working copy with the changes. After we click Fetch origin, the button changes to Pull origin. Clicking Pull origin updates our local working copy with the fetched updates.
When we commit the changes, the list of uncommitted changes disappears from the left pane. We have, however, only committed the changes locally. The commits must be pushed to the remote (origin) repository.
The HTML file needs to be named index.html.