R Data Analytics Series at NTPU, 2022

Ying-Ju Tessa Chen
University of Dayton
ychen4@udayton.edu

README

We facilitated a short series of R data analytics lectures remotely at the National Taipei University in May 2022. To the extent possible, the content of the lectures is recorded here. The lectures are based on R for Data Science (Wickham and Grolemund (2016)).

You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):


Where to get help

  1. To see documentation on any function in R, execute ?data.frame etc.
  2. Google it! (Better way to learn coding!)
  3. Ask questions online, for example: stackoverflow.com.

Session 1: Data Manipulation

Brief Overview

In this session, we will talk about data manipulation using the R package tidyverse. This package contains a collection of R packages that help us with data management and exploration. The key packages in tidyverse are:

In this session, we will focus on the following key functions in dplyr, filter(), arrange(), select(), mutate(), and summarize() (used together with group_by()), using the dataset flights from the R package nycflights13.

All functions above work similarly.

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame using the variable names.
  3. The result is a new data frame (but we can save it back to the original data frame if needed).
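
As a minimal sketch of this shared pattern, the following uses a small made-up data frame (toy and late are hypothetical names, not part of the lecture data) to show that a dplyr verb returns a new data frame and leaves its input untouched unless we assign the result back.

```r
library(dplyr)

# A tiny hypothetical data frame used only for illustration.
toy <- tibble(id = 1:4, delay = c(5, -2, 30, 12))

# First argument: the data frame; later arguments: what to do, using column names.
late <- filter(toy, delay > 10)

nrow(toy)   # the original is unchanged
nrow(late)  # the result is a new, smaller data frame

# Overwrite the original only if that is really what we want.
toy <- filter(toy, delay > 10)
```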

Load packages and read the Flights Data

First, we load the necessary packages, resolve function-name conflicts between packages, and import the dataset flights from the R package nycflights13.

library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights

Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables (Flights Data).


We get a glimpse of the data.

glimpse(df)
## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~

filter()

filter() is used when we want to subset observations based on a logical condition. For example, we can select all flights on December 25th using the following code.

filter(df, month == 12, day == 25)
## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    25      456            500        -4      649            651
##  2  2013    12    25      524            515         9      805            814
##  3  2013    12    25      542            540         2      832            850
##  4  2013    12    25      546            550        -4     1022           1027
##  5  2013    12    25      556            600        -4      730            745
##  6  2013    12    25      557            600        -3      743            752
##  7  2013    12    25      557            600        -3      818            831
##  8  2013    12    25      559            600        -1      855            856
##  9  2013    12    25      559            600        -1      849            855
## 10  2013    12    25      600            600         0      850            846
## # ... with 709 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Christmas <- filter(df, month == 12, day == 25)

If we would like to save the results to a variable as well as print them, we can wrap the assignment in parentheses.

(Jan1 <- filter(df, month == 1, day == 1)) 
## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The following code finds all flights that departed in July or August.

filter(df, month == 7 | month == 8)
## # A tibble: 58,752 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(df, month %in% c(7, 8))
## # A tibble: 58,752 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note:

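  1. If we use filter(df, month == 7 | 8), R does not read the condition as "month is 7 or 8". Because == binds more tightly than |, the expression is evaluated as (month == 7) | 8. In a logical context the nonzero number 8 is treated as TRUE, so the condition is TRUE for every row and the call returns all flights in the data.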
  2. filter() only includes rows where the condition is TRUE and it excludes both FALSE and NA values.
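
The two notes above can be checked on a small example. This is only a sketch on a made-up tibble (toy and the result names are hypothetical), not on the Flights Data.

```r
library(dplyr)

# Hypothetical toy data: a few month values plus one missing value.
toy <- tibble(month = c(1L, 7L, 8L, NA))

# == binds tighter than |, so this is (month == 7) | 8, and 8 is coerced to TRUE.
all_rows <- filter(toy, month == 7 | 8)

# The intended condition: month is 7 or 8.
jul_aug <- filter(toy, month %in% c(7, 8))

# Rows where the condition is NA are dropped unless we ask for them explicitly.
with_na <- filter(toy, month == 7 | is.na(month))

nrow(all_rows)  # every row is kept
nrow(jul_aug)   # only the rows for months 7 and 8
nrow(with_na)   # month 7 plus the row with the missing month
```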

If we want to find flights that were not delayed by more than 1 hour on either arrival or departure, we could use either of the following two calls.

filter(df, !(arr_delay > 60 | dep_delay > 60))
## # A tibble: 295,893 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
filter(df, arr_delay <= 60, dep_delay <= 60)
## # A tibble: 295,893 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Exercises: Find all flights that:

  1. Had an arrival delay of one or more hours
  2. Were operated by American (AA), Delta (DL), or United (UA) airlines
  3. Flew to Houston (IAH or HOU)
  4. Departed in winter (December, January, February)
  5. Arrived more than one hour late, but didn’t leave late

arrange()

arrange() is used when we want to sort a dataset by a variable. If several variables are specified, those entered first take priority, and each later variable is used to break ties in the earlier ones. The following code chunk gives an example that sorts the flights by date.

arrange(df, year, month, day)
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note:

  1. We can save the data frame back to the original data frame after sorting the data.
  2. Use desc() to sort in descending order. The following code chunk arranges the Flights Data by arrival delay in descending order.
  3. Missing values are always sorted at the end.
arrange(df, desc(arr_delay))
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
tail(arrange(df, desc(arr_delay)))
## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1842        NA       NA           2019
## 2  2013     9    30       NA           1455        NA       NA           1634
## 3  2013     9    30       NA           2200        NA       NA           2312
## 4  2013     9    30       NA           1210        NA       NA           1330
## 5  2013     9    30       NA           1159        NA       NA           1344
## 6  2013     9    30       NA            840        NA       NA           1020
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Exercises:

  1. Sort flights to find the most delayed flights. Find the flights that left earliest.
  2. Sort flights to find the fastest flights.

select()

select() is used when we would like to keep only some of the variables in the data. For example, we can use the following code chunk to reduce the Flights Data to a few variables.

# select specific columns
select(df, year, month, day)
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# select all columns between year and day
select(df, year:day) 
## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows
# select all columns except those from year to day
select(df, -(year:day)) 
## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ... with 336,766 more rows, and 9 more variables: flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Note:

  1. We can use a minus sign - to drop variables.
  2. There are several helper functions we can use within select(). See ?select for more information.
  3. select() can be used with the everything() function when we have a handful of variables we would like to move to the start of the data frame.
# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())
## # A tibble: 336,776 x 19
##    carrier origin dest  distance  year month   day dep_time sched_dep_time
##    <chr>   <chr>  <chr>    <dbl> <int> <int> <int>    <int>          <int>
##  1 UA      EWR    IAH       1400  2013     1     1      517            515
##  2 UA      LGA    IAH       1416  2013     1     1      533            529
##  3 AA      JFK    MIA       1089  2013     1     1      542            540
##  4 B6      JFK    BQN       1576  2013     1     1      544            545
##  5 DL      LGA    ATL        762  2013     1     1      554            600
##  6 UA      EWR    ORD        719  2013     1     1      554            558
##  7 B6      EWR    FLL       1065  2013     1     1      555            600
##  8 EV      LGA    IAD        229  2013     1     1      557            600
##  9 B6      JFK    MCO        944  2013     1     1      557            600
## 10 AA      LGA    ORD        733  2013     1     1      558            600
## # ... with 336,766 more rows, and 10 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## #   tailnum <chr>, air_time <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
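
To illustrate the helper functions mentioned in the note above, here is a small sketch on a made-up one-row tibble whose column names mimic a few of the flights columns (toy and the result names are hypothetical). starts_with(), ends_with(), and contains() are tidyselect helpers available inside select().

```r
library(dplyr)

# Hypothetical data frame with column names borrowed from the Flights Data.
toy <- tibble(
  dep_time = 517L, dep_delay = 2,
  arr_time = 830L, arr_delay = 11,
  carrier = "UA"
)

dep_cols   <- names(select(toy, starts_with("dep")))  # names beginning with "dep"
delay_cols <- names(select(toy, ends_with("delay")))  # names ending in "delay"
time_cols  <- names(select(toy, contains("time")))    # names containing "time"

dep_cols
delay_cols
time_cols
```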

Exercises:

  1. Select dep_time, dep_delay, arr_time, and arr_delay from the Flights Data.
  2. What happens if we include the name of a variable multiple times in a select() call?
  3. What is the result of running the following code: select(df, contains("TIME"))?

mutate()

mutate() is used when we would like to add a new variable / column using the other variables in the data.

Note: mutate() always adds new columns at the end of the data.

First, we create a smaller dataset with a few variables and then create several new variables from the variables in that dataset.

# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)

mutate(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
## # A tibble: 336,776 x 11
##     year month   day dep_delay arr_delay distance air_time  gain speed hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9  370. 3.78 
##  2  2013     1     1         4        20     1416      227    16  374. 3.78 
##  3  2013     1     1         2        33     1089      160    31  408. 2.67 
##  4  2013     1     1        -1       -18     1576      183   -17  517. 3.05 
##  5  2013     1     1        -6       -25      762      116   -19  394. 1.93 
##  6  2013     1     1        -4        12      719      150    16  288. 2.5  
##  7  2013     1     1        -5        19     1065      158    24  404. 2.63 
##  8  2013     1     1        -3       -14      229       53   -11  259. 0.883
##  9  2013     1     1        -3        -8      944      140    -5  405. 2.33 
## 10  2013     1     1        -2         8      733      138    10  319. 2.3  
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If we only want to keep the new variables, use transmute().

transmute(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
## # A tibble: 336,776 x 4
##     gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>         <dbl>
##  1     9  370. 3.78           2.38
##  2    16  374. 3.78           4.23
##  3    31  408. 2.67          11.6 
##  4   -17  517. 3.05          -5.57
##  5   -19  394. 1.93          -9.83
##  6    16  288. 2.5            6.4 
##  7    24  404. 2.63           9.11
##  8   -11  259. 0.883        -12.5 
##  9    -5  405. 2.33          -2.14
## 10    10  319. 2.3            4.35
## # ... with 336,766 more rows

Note: There are many functions for creating new variables that we can use with mutate(). The key property is that the function must be vectorized, which means it must take a vector of values as input and return a vector with the same number of values as output.
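
As a brief sketch of what "vectorized" means here, the following applies a few such functions to a made-up column (toy and res are hypothetical names). Each function receives the whole column and returns one value per row.

```r
library(dplyr)

# Hypothetical column of five values.
toy <- tibble(x = c(3, 1, 4, 1, 5))

res <- mutate(toy,
  doubled  = x * 2,          # arithmetic works element by element
  running  = cumsum(x),      # cumulative sum up to each row
  previous = dplyr::lag(x),  # value from the previous row (NA in the first row)
  rank_asc = min_rank(x)     # rank of each value; ties share the smallest rank
)

res
```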

Exercises:

  1. Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they are not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.
select(df, dep_time, sched_dep_time)
## # A tibble: 336,776 x 2
##    dep_time sched_dep_time
##       <int>          <int>
##  1      517            515
##  2      533            529
##  3      542            540
##  4      544            545
##  5      554            600
##  6      554            558
##  7      555            600
##  8      557            600
##  9      557            600
## 10      558            600
## # ... with 336,766 more rows

For example, 517 represents 5:17 (or 5:17 AM) and 1517 represents 15:17 (or 3:17 PM). We will use 1517 to demonstrate how to convert the time to the number of minutes since midnight (\(15 \times 60+17=917\) minutes).

We need to be able to extract 15 and 17 separately. We can use the integer division operator, %/%, and the modulo operator, %%, to achieve this.

1517 %/% 100
## [1] 15
1517 %% 100
## [1] 17

One issue remains: midnight is recorded as 2400, which would be converted to \(24 \times 60 = 1440\) minutes since midnight, although it should correspond to 0. After converting all the times to minutes since midnight, whatever_time %% 1440 maps 1440 to zero while leaving all the other times unchanged.
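
The arithmetic can be checked on a few individual values before applying it to the whole dataset. The vector below is made up for illustration (hhmm and mins are hypothetical names).

```r
# Three clock times in HHMM form: 5:17, 15:17, and midnight written as 2400.
hhmm <- c(517, 1517, 2400)

# Hours times 60 plus the remaining minutes.
mins <- hhmm %/% 100 * 60 + hhmm %% 100
mins          # 317 917 1440

# Wrap around at 1440 so that midnight becomes 0.
mins %% 1440  # 317 917 0
```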

transmute(df, 
          dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
          sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)
## # A tibble: 336,776 x 2
##    dep_time_mins sched_dep_time_mins
##            <dbl>               <dbl>
##  1           317                 315
##  2           333                 329
##  3           342                 340
##  4           344                 345
##  5           354                 360
##  6           354                 358
##  7           355                 360
##  8           357                 360
##  9           357                 360
## 10           358                 360
## # ... with 336,766 more rows
  2. Since the same formula is used to create both variables, we should write a function to avoid copying and pasting the code from the previous exercise. Think about how to achieve this.
time_to_mins <- function(x) (x %/% 100 * 60 + x %% 100) %% 1440

transmute(df, 
          dep_time_mins = time_to_mins(dep_time),
          sched_dep_time_mins = time_to_mins(sched_dep_time)
)
## # A tibble: 336,776 x 2
##    dep_time_mins sched_dep_time_mins
##            <dbl>               <dbl>
##  1           317                 315
##  2           333                 329
##  3           342                 340
##  4           344                 345
##  5           354                 360
##  6           354                 358
##  7           355                 360
##  8           357                 360
##  9           357                 360
## 10           358                 360
## # ... with 336,766 more rows
  3. Find the 10 most delayed flights.
  4. What does 1:5 + 1:10 return? Why?

group_by() & summarize()

summarize() collapses a data frame to a single row. For example, we can compute the average departure delay using the following code chunk.

summarize(df, delay = mean(dep_delay, na.rm = TRUE))
## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

In general, summarize() is used together with group_by(), so that the summary is computed separately for each group of rows. group_by() groups rows by one or more variables, giving priority to the variable entered first.

group_by(df, year, month, day)
## # A tibble: 336,776 x 19
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The result shows the original data but also indicates the grouping variables: year, month, and day in our example. With the grouped data, we can study the average departure and arrival delays for each day.

by_day <- group_by(df, year, month, day) 
summarize(by_day, 
          ave_dep_delay = mean(dep_delay, na.rm = TRUE),
          ave_arr_delay = mean(arr_delay, na.rm = TRUE)
          )
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day ave_dep_delay ave_arr_delay
##    <int> <int> <int>         <dbl>         <dbl>
##  1  2013     1     1         11.5         12.7  
##  2  2013     1     2         13.9         12.7  
##  3  2013     1     3         11.0          5.73 
##  4  2013     1     4          8.95        -1.93 
##  5  2013     1     5          5.73        -1.53 
##  6  2013     1     6          7.15         4.24 
##  7  2013     1     7          5.42        -4.95 
##  8  2013     1     8          2.55        -3.23 
##  9  2013     1     9          2.28        -0.264
## 10  2013     1    10          2.84        -5.90 
## # ... with 355 more rows

Combining Multiple Operations with the Pipe

In order to handle data processing well in data science, it is essential to know how to use pipes. Pipes are a great tool for expressing a sequence of multiple operations, and they therefore increase the readability of the code. The pipe, %>%, comes from the package magrittr, and it is loaded automatically when tidyverse is loaded.

The logic when using the pipe: object %>% function1 %>% function2 %>% …
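
As a small sketch of this logic, x %>% f(y) is equivalent to f(x, y): the object on the left becomes the first argument of the function on the right. The tibble below is made up for illustration (toy, nested, and piped are hypothetical names).

```r
library(dplyr)

# Hypothetical data: two groups with a numeric value.
toy <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))

# Nested calls are read from the inside out.
nested <- summarize(group_by(toy, g), total = sum(v))

# The pipe expresses the same steps from left to right.
piped <- toy %>%
  group_by(g) %>%
  summarize(total = sum(v))

all.equal(nested, piped)  # both approaches give the same result
```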

If we want to group the Flights Data by destination, find the number of flights, the average distance, and the average arrival delay for each destination, and then keep only destinations with more than 20 flights while removing Honolulu airport (HNL), we may use the following code chunk.

by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
                   count = n(),
                   ave_dist = mean(distance, na.rm = TRUE),
                   ave_arr_delay = mean(arr_delay, na.rm = TRUE)
                   )
delay <- filter(delay, count > 20, dest != "HNL")

The following code chunk does the same task with the pipe, %>%, which makes the code easier to read.

delay <- df %>% 
  group_by(dest) %>%
  summarize(
    count = n(),
    ave_dist = mean(distance, na.rm = TRUE),
    ave_arr_delay = mean(arr_delay, na.rm = TRUE)
    ) %>%
  filter(count > 20, dest != "HNL")

Useful Summary Functions

Here, \(MAD = \mathrm{median}(|x_i-\tilde{x}|)\), where \(\tilde{x}\) is the median of the \(x_i\), is called the median absolute deviation; it may be more useful than the standard deviation if we have outliers.
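
A short sketch of why the median absolute deviation is less sensitive to outliers, using a made-up vector (x is hypothetical). Note that base R's mad() multiplies the raw value by 1.4826 by default; setting constant = 1 gives the raw median of absolute deviations from the median.

```r
# Hypothetical values with one large outlier.
x <- c(1, 2, 3, 4, 100)

sd(x)                 # strongly inflated by the outlier
IQR(x)                # interquartile range, based on the middle half of the data
mad(x, constant = 1)  # raw MAD: median(|x - median(x)|)
mad(x)                # default: raw MAD rescaled by 1.4826
```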

not_cancelled <- df %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(dest) %>%
  summarize(
    distance_mu = mean(distance),
    distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))
## # A tibble: 104 x 3
##    dest  distance_mu distance_sd
##    <chr>       <dbl>       <dbl>
##  1 EGE         1736.       10.5 
##  2 SAN         2437.       10.4 
##  3 SFO         2578.       10.2 
##  4 HNL         4973.       10.0 
##  5 SEA         2413.        9.98
##  6 LAS         2241.        9.91
##  7 PDX         2446.        9.87
##  8 PHX         2141.        9.86
##  9 LAX         2469.        9.66
## 10 IND          652.        9.46
## # ... with 94 more rows
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time), # earliest departure time each day
    last = max(dep_time)   # latest departure time each day
  )
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # ... with 355 more rows

The following code chunk finds the first and last departure for each day using first() and last(), which pick values by row position within each group.

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )
## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # ... with 355 more rows
not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))
## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows

We can use count() directly if all we want is a count.

not_cancelled %>% 
  count(dest)
## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

We can optionally provide a weight variable. For example we could use this to “count” the total number of miles a plane flew.

not_cancelled %>%
  count(tailnum, wt = distance)
## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, sum() gives the number of TRUEs and mean() gives the proportion of TRUEs in the variable. For example, we can check how many flights left before 5AM using the following code chunk.

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows

Or what proportion of flights are delayed by more than one hour?

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60))
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ... with 355 more rows
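
The same idea can be tried on plain vectors. The values below are made up for illustration; the sketch also shows why missing values were filtered out beforehand, since a single NA makes the result NA unless na.rm = TRUE is used.

```r
# Hypothetical departure times (HHMM) and arrival delays (minutes)
dep_time  <- c(455, 517, 530, 459, 1200)
arr_delay <- c(75, -5, NA, 61, 10)

# Number and proportion of departures before 5AM
n_early    <- sum(dep_time < 500)
prop_early <- mean(dep_time < 500)

# With a missing value the proportion is NA ...
prop_late_na <- mean(arr_delay > 60)

# ... unless missing values are dropped first
prop_late <- mean(arr_delay > 60, na.rm = TRUE)

c(n_early = n_early, prop_early = prop_early,
  prop_late_na = prop_late_na, prop_late = prop_late)
```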

Grouping by Multiple Variables

Here we show some examples to demonstrate how to group the data by multiple variables.

(per_day <- df %>% 
   group_by(year, month, day) %>%
  summarize(flights = n()))
## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows
(per_month <- summarize(per_day, flights = sum(flights)))
## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135
(per_year <- summarize(per_month, flights = sum(flights)))
## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776
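
The "Groups" lines in the outputs above suggest that each call to summarize() removes the last grouping variable. The sketch below checks this on a small made-up data frame (toy and its values are hypothetical); the .groups = "drop_last" argument spells out this default behaviour explicitly and assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data: five flights over three distinct days (made up)
toy <- tibble(
  year  = 2013,
  month = c(1, 1, 1, 2, 2),
  day   = c(1, 1, 2, 1, 1)
)

# Grouped by year, month, day; summarizing drops the last level (day)
per_day_toy <- toy %>%
  group_by(year, month, day) %>%
  summarize(flights = n(), .groups = "drop_last")
group_vars(per_day_toy)

# Summarizing again drops the next level (month)
per_month_toy <- summarize(per_day_toy,
                           flights = sum(flights),
                           .groups = "drop_last")
group_vars(per_month_toy)
```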

Ungrouping

If we need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily <- df %>% group_by(year, month, day)
daily %>% 
  ungroup() %>% # no longer grouped by date
  summarize(flights=n()) # all flights
## # A tibble: 1 x 1
##   flights
##     <int>
## 1  336776

Exercises:

  1. For each plane, count the number of flights before the first departure delay of greater than 1 hour.
df %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
  # count the flights whose cumulative count is still 0, i.e. flights before the first delay over one hour
  summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
  arrange(total_flights)
## # A tibble: 4,037 x 2
##    tailnum total_flights
##    <chr>           <int>
##  1 D942DN              0
##  2 N10575              0
##  3 N11106              0
##  4 N11109              0
##  5 N11187              0
##  6 N11199              0
##  7 N12967              0
##  8 N13550              0
##  9 N136DL              0
## 10 N13903              0
## # ... with 4,027 more rows
  2. What does the sort argument to count() do? When might we use it?

The sort argument to count() sorts the results in descending order of n. We could use this anytime we would run count() followed by arrange().

For example, the following code chunk counts the number of flights to a destination and sorts the returned data from highest to lowest.

df %>%
  count(dest, sort = TRUE)
## # A tibble: 105 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # ... with 95 more rows
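
A small sketch of this equivalence on made-up data (the toy destinations below are hypothetical and chosen without ties):

```r
library(dplyr)

# Toy destinations (made up for illustration)
toy <- tibble(dest = c("ORD", "ATL", "ORD", "LAX", "ORD", "ATL"))

# count() with sort = TRUE
sorted_count <- toy %>%
  count(dest, sort = TRUE)

# count() followed by arrange() in descending order of n
arranged <- toy %>%
  count(dest) %>%
  arrange(desc(n))

# The two results should match
all.equal(sorted_count, arranged)
```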

Grouped Mutates and Filters

We can also do convenient operations with mutate() and filter().

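The following code chunk finds the worst members of each group. Here, for each day, it keeps the flights whose arrival delay ranks among the nine largest (rank below 10).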

df1 %>% 
  group_by(year, month, day) %>% 
  filter(rank(desc(arr_delay)) < 10)
## # A tibble: 3,306 x 7
## # Groups:   year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184       41
##  2  2013     1     1       290       338     1134      213
##  3  2013     1     1       260       263      266       46
##  4  2013     1     1       157       174      213       60
##  5  2013     1     1       216       222      708      121
##  6  2013     1     1       255       250      589      115
##  7  2013     1     1       285       246     1085      146
##  8  2013     1     1       192       191      199       44
##  9  2013     1     1       379       456     1092      222
## 10  2013     1     2       224       207      550       94
## # ... with 3,296 more rows

The following code chunk finds all groups bigger than a threshold.

popular_dests <- df %>%
  group_by(dest) %>% 
  filter(n()>365)
popular_dests
## # A tibble: 332,577 x 19
## # Groups:   dest [77]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

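The following code chunk standardizes within groups to compute per-group metrics. Here, it computes each delayed flight's share of the total arrival delay at its destination.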

popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, arr_delay, prop_delay)
## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##    dest   year month   day arr_delay prop_delay
##    <chr> <int> <int> <int>     <dbl>      <dbl>
##  1 IAH    2013     1     1        11  0.000111 
##  2 IAH    2013     1     1        20  0.000201 
##  3 MIA    2013     1     1        33  0.000235 
##  4 ORD    2013     1     1        12  0.0000424
##  5 FLL    2013     1     1        19  0.0000938
##  6 ORD    2013     1     1         8  0.0000283
##  7 LAX    2013     1     1         7  0.0000344
##  8 DFW    2013     1     1        31  0.000282 
##  9 ATL    2013     1     1        12  0.0000400
## 10 DTW    2013     1     1        16  0.000116 
## # ... with 131,096 more rows
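
Because mutate() works within groups here, sum(arr_delay) is the total for each destination rather than for the whole data set. A minimal sketch on made-up values (toy is hypothetical) shows that the shares add up to 1 within each destination:

```r
library(dplyr)

# Toy arrival delays in minutes (made up for illustration)
toy <- tibble(
  dest      = c("AAA", "AAA", "BBB", "BBB", "BBB"),
  arr_delay = c(10, 30, 5, 15, -4)
)

# Share of each delayed flight in its destination's total delay
shares <- toy %>%
  group_by(dest) %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  ungroup()

# Within each destination the shares should sum to 1
share_totals <- shares %>%
  group_by(dest) %>%
  summarize(total = sum(prop_delay))
share_totals
```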

Exercises:

  1. What time of day should you fly if you want to avoid delays as much as possible?
  2. For each destination, compute the total minutes of the delay. For each flight, compute the proportion of the total delay for its destination.

Session 2: Data Visualization with ggplot2

Brief Overview

In this session, we will introduce how to visualize our data using ggplot2 and plotly. The lecture is based on UC Business Analytics R Programming Guide.

Data Visualization with R Package: ggplot2

While we can use the built-in functions in the base package in R to obtain plots, the package ggplot2 creates advanced graphs with simple and flexible commands.

Load packages and read the Fuel Economy Data

First, we load the necessary packages, resolve function name conflicts, and get a glimpse of the dataset mpg from the R package ggplot2.

library(tidyverse)
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("filter", "dplyr")
glimpse(mpg)
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c~

Now we need to understand the data and each variable in it. This dataset contains fuel economy data for 38 popular models of cars from 1999 to 2008 (Fuel Economy Data).


Grammar of Graphics

The basic idea of creating plots using ggplot2 is to specify each component of the following and combine them with +.

ggplot() function

The ggplot() function plays an important role in data visualization as it is very flexible for plotting many different types of graphic displays.

The logic when using ggplot() function is: ggplot(data, mapping) + geom_function().

The Basics

The following code chunk shows how we can obtain a scatter plot to study the relationship between engine displacement and highway mileage per gallon.

# create canvas
ggplot(mpg)

# variables of interest mapped
ggplot(mpg, mapping = aes(x = displ, y = hwy))

# data plotted
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Exercises:

  1. Make a scatterplot of hwy versus cty

Aesthetic Mappings

Aesthetic mappings allow us to select the variables to be plotted and to use data properties to influence visual characteristics such as color, size, shape, position, etc. As a result, each visual characteristic can encode a different part of the data and be utilized to communicate information.

All aesthetics for a plot are specified in the aes() function call.

For example, we can add a mapping from the class of the cars to a color characteristic:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Note:

  1. We should note that in the above code chunk, "class" is a variable in the data and, therefore, the command specifies that a categorical variable is used as the third variable in the figure.

  2. Using the aes() function will cause the visual channel to be based on the data specified in the argument. For example, using aes(color = "blue") won't cause the geometry's color to be blue, but will instead cause the visual channel to be mapped from the vector c("blue"), as if we only had a single type of engine that happened to be called "blue". If we wish to apply an aesthetic property to an entire geometry, we can set that property as an argument to the geom function, outside of the aes() call:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
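
One way to see the difference described in the note is to inspect the colors that ggplot2 actually computes for the points. The sketch below uses layer_data() from ggplot2 for this; the object names p_mapped and p_set are only for illustration.

```r
library(ggplot2)

# "blue" inside aes(): treated as data with a single category
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes(): a fixed color for the whole layer
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# Colors actually used for the points in each plot
unique(layer_data(p_mapped)$colour)  # a default palette color, not blue
unique(layer_data(p_set)$colour)     # the literal color "blue"
```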

Exercises:

  1. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical versus continuous variables?

  2. What happens if we map an aesthetic to something other than a variable name, like ggplot(mpg, aes(x = displ, y = hwy, color = displ < 5)) + geom_point()?

Specifying Geometric Shapes

Building on these basics, we can use ggplot2 to create almost any kind of plot we may want. These plots are declared using functions that follow from the Grammar of Graphics. ggplot2 supports a number of different types of geometric objects, including:

Each of these geometries will make use of the aesthetic mappings provided, albeit the visual qualities to which the data will be mapped will differ. For example, we can map data to the shape of a geom_point (e.g., if they should be circles or squares), or we can map data to the line-type of a geom_line (e.g., if it is solid or dotted), but not vice versa.

Almost all geoms require an x and y mapping at the bare minimum.

# x and y mapping needed when creating a scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

# no y mapping needed when creating a bar chart
ggplot(mpg, aes(x = class)) +
  geom_bar()  

ggplot(mpg, aes(x = hwy)) +
  geom_histogram() 

What makes this really powerful is that you can add multiple geometries to a plot, thus allowing you to create complex graphics showing multiple aspects of your data.

# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

Note:

  1. Since the aesthetics for each geom can be different, we could show multiple lines on the same plot (or with different colors, styles, etc.).

For example, we can plot both points and a smoothed line for the same x and y variable but specify unique colors within each geom:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(color = "red")

  2. It is also possible to give each geom a different data argument, so that we can show multiple data sets in the same plot.

If we specify an aesthetic within ggplot(), it will be passed on to each geom that follows. Alternatively, we can specify certain aesthetics within an individual geom, which allows us to show certain characteristics only for that specific layer (e.g. geom_point).

# color aesthetic passed to each geom layer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(se = FALSE)

# color aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE)

Exercises:

  1. What geom would you use to draw a line chart? A boxplot? A histogram?

  2. Create a boxplot of the highway mileage (hwy).

Statistical Transformations

The following bar chart shows the frequency distribution of vehicle class. The y axis is defined as the count of observations in each class. This count is not part of the data set, but is instead a statistical transformation that geom_bar() automatically applies to the data. In particular, it applies the stat_count transformation.

ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot2 supports many different statistical transformations. For example, the “identity” transformation will leave the data “as is”. We can specify which statistical transformation a geom uses by passing it as the stat argument. For example, consider our data already had the count as a variable:

(class_count <- count(mpg, class))
## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62

We can use stat = "identity" within geom_bar() so that the bar heights are taken directly from this variable. Also, note that we now map n to the y aesthetic:

ggplot(class_count, aes(x = class, y = n)) +
  geom_bar(stat = "identity")
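
ggplot2 also provides geom_col(), which uses the identity stat by default and can therefore stand in for geom_bar(stat = "identity"). The sketch below compares the bar heights of the counted and the pre-counted versions; sorting by x is only a precaution so that the two vectors line up.

```r
library(ggplot2)
library(dplyr)

# Pre-computed counts, as in the text
class_count <- count(mpg, class)

# Bars counted by geom_bar() (stat_count)
p_counted <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Bars drawn from the pre-computed counts (identity stat)
p_identity <- ggplot(class_count, aes(x = class, y = n)) +
  geom_col()

# Extract the bar heights from each plot, ordered by x position
d_counted  <- layer_data(p_counted)
d_identity <- layer_data(p_identity)
heights_counted  <- as.numeric(d_counted$y[order(d_counted$x)])
heights_identity <- as.numeric(d_identity$y[order(d_identity$x)])

all.equal(heights_counted, heights_identity)
```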

We can also call stat_ functions directly to add additional layers. For example, here we create a scatter plot of highway miles for each displacement value and then use stat_summary() to plot the mean highway miles at each displacement value.

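# fun replaces the deprecated fun.y (ggplot2 >= 3.3.0);
# linewidth replaces size for line geoms (ggplot2 >= 3.4.0)
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(color = "grey") + 
  stat_summary(fun = "mean", geom = "line", linewidth = 1, linetype = "dashed")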

Exercises:

  1. What is the default geom associated with stat_summary()?

  2. What variables does stat_smooth() compute? What parameters control its behavior?

Position Adjustments

In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of "rules" as to how different components should be positioned relative to each other. This position adjustment is noticeable in geom_bar() if we map a different variable to the fill aesthetic.

# bar chart of class, colored by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar()

By default, geom_bar() uses a position adjustment of "stack", which makes each rectangle's height proportional to its value and stacks the rectangles on top of each other. We can use the position argument to specify what position adjustment rules to follow:

# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "dodge")

# position = "fill": percentage chart
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill")

Note: We may need to check the documentation for each particular geom to learn more about its positioning adjustments.

Managing Scales

Whenever we specify an aesthetic mapping, ggplot() uses a particular scale to determine the range of values that the data should map to. It automatically adds a scale for each mapping to the plot.

# color the data by car class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

However, the scale used in the figure can be changed if needed. Each scale can be represented by a function with the following name: scale_, followed by the name of the aesthetic property, followed by an _ and the name of the scale. A continuous scale handles numeric data, whereas a discrete scale handles categorical data such as the car class mapped to color above.

# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, we can use a scale to change the direction of an axis:

# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()

Similarly, we can use scale_x_log10() and scale_x_sqrt() to transform the scale. We can use scales to format the axes as well.

ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, by = .2), 
                     labels = scales::percent) + 
  labs(y = "Percent")
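
As a small illustration of the transforming scales mentioned above, the sketch below adds scale_x_log10() to the displacement plot and checks, via layer_data(), that the x positions used for plotting are the base-10 logarithms of displ. The object name p_log is only for illustration.

```r
library(ggplot2)

# Scatter plot with a log10-transformed x axis
p_log <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10()

# Positions computed by ggplot2 versus a manual transformation
x_plotted <- as.numeric(layer_data(p_log)$x)
x_manual  <- log10(mpg$displ)

all.equal(x_plotted, x_manual)
```

The axis labels still show the original displacement values; only the spacing along the axis changes.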

Use Pre-Defined Palettes

A common parameter to change is which set of colors to use in a plot. While we can use the default coloring, a more common option is to leverage the pre-defined palettes from colorbrewer.org. These color sets have been carefully designed to look good and to be viewable to people with certain forms of color blindness. We can leverage ColorBrewer palettes by adding scale_color_brewer() and passing the palette name as the palette argument.

# default color brewer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer()

# specifying color palette
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Set3")

Coordinate Systems

Similar to scales, coordinate systems are specified with functions that all start with coord_ and are added as a layer. There are a number of different possible coordinate systems to use, including:

# zoom in with coord_cartesian
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 5))

# flip x and y axis with coord_flip
ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()

Facets

If we want to divide the information into multiple subplots, facets are the way to go. They allow us to view a separate plot for each level of a categorical variable. We can construct a plot with multiple facets by using facet_wrap() or, as in the following code chunk, facet_grid(). Here facet_grid(~ class) produces a single row of subplots, one for each car class; with facet_wrap(), the number of rows can be specified with the nrow argument.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(~ class)

Note:

  1. We can use facet_grid() to facet the data by more than one categorical variable.

  2. We use a tilde (~) in our facet functions. With facet_grid(), the variable to the left of the tilde will be represented in the rows and the variable to the right will be represented across the columns.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(year ~ cyl)
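
For comparison, here is a minimal sketch of facet_wrap(), mentioned above, with the number of rows set through its nrow argument. The choice of two rows and the object name p_wrap are arbitrary.

```r
library(ggplot2)

# One panel per car class, wrapped into two rows
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

# Number of panels ggplot2 assigns the points to
n_panels <- length(unique(layer_data(p_wrap)$PANEL))
n_panels

p_wrap
```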

Exercises:

  1. Create a figure of multiple boxplots of the highway mileage (hwy) by the drive type (drv).

Labels & Annotations

Textual annotations and labels (on the plot, axes, geometry, and legend) are crucial for understanding and presenting information.

We can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!).

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
       x = "Engine Displacement (liters)",
       y = "Fuel Efficiency (miles per gallon)",
       color = "Car Type")

It is also possible to add labels into the plot itself (e.g., to label each point or line) by adding a new geom_text or geom_label to the plot; effectively, we are plotting an extra set of data which happen to be the variable names.

# a data table of each car that has best efficiency of its type

best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) +
  geom_label(data = best_in_class, aes(label = model), alpha = 0.5)

However, we can see that two labels overlap one another in the top left part of the plot. We can use geom_text_repel() from the ggrepel package to help position labels.

library(ggrepel)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) +
  geom_text_repel(data = best_in_class, aes(label = model))

Themes

Whenever we want to customize titles, labels, fonts, background, grid lines, and legends, we can use themes.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +  # a geom is needed, otherwise only an empty canvas is drawn
  labs(title = "Fuel Efficiency by Engine Power",
       x = "Engine Displacement (Liters)",
       y = "Fuel Efficiency (Miles per gallon)") + 
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),  
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))

Note:

  1. We only list some key components here.

  2. See Modify Components of A Theme and Complete Themes for more details about the use of theme.

Data Visualization with R Package: plotly

The R package plotly makes it very easy to create interactive graphic displays when we already know how to use ggplot() to create graphs.

The following code chunk creates an interactive version of the faceted scatter plot from the previous section.

library(plotly)
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(~ class)
ggplotly(p1)

Session 3: Data Exploration with R Package: DataExplorer

In data science, it is important to get to know your data before advanced modeling or further analysis. We should understand what the data are about, what variables we have, the size of the data, how many missing values, what is the data type of each variable, any possible relationships between variables and anything unusual or interesting in the data.

We will use the Medical Cost Personal Dataset to go over the use of functions in the package DataExplorer. For demonstration purposes, we modified this dataset by introducing random missing values in some variables.

library(DataExplorer)
insurance <- read_csv("https://raw.githubusercontent.com/Ying-Ju/R_Data_Analytics_Series_NTPU/main/insurance.csv")
glimpse(insurance)
## Rows: 1,338
## Columns: 7
## $ age      <dbl> 19, 18, 28, NA, 32, NA, 46, 37, 37, 60, 25, 62, NA, 56, 27, 1~
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~

First, we check the basic description for the data using the function plot_intro() in the package DataExplorer.

plot_intro(insurance)

Then, we study the distribution of missing values in the data using the function plot_missing() in the package DataExplorer.

plot_missing(insurance)
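
If the numbers behind such a plot are needed, missing values per variable can also be counted with base R. The sketch below uses a small made-up data frame (toy and its values are hypothetical) rather than the insurance data.

```r
# Toy data with some missing values (made up for illustration)
toy <- data.frame(
  age    = c(19, NA, 28, NA),
  bmi    = c(27.9, 33.8, NA, 22.7),
  smoker = c("yes", "no", "no", "no")
)

# Number of missing values in each column
n_missing <- colSums(is.na(toy))

# Percentage of missing values in each column
pct_missing <- round(100 * colMeans(is.na(toy)), 1)

data.frame(variable = names(toy), n_missing, pct_missing, row.names = NULL)
```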

Since there are 7 variables, we will study all variables in the data.

Now, we study the frequency distribution of all categorical variables in the data using the function plot_bar() in the package DataExplorer.

plot_bar(insurance)

The following code shows the distribution of sum of charges by the categorical variables in the data, individually.

plot_bar(insurance, with="charges")

Next, we study the distribution of all quantitative variables in the data using the function plot_histogram() in the package DataExplorer.

plot_histogram(insurance, ncol=2) 

We study the distributions of age, bmi, and charges with respect to region individually using the function plot_boxplot() in the package DataExplorer.

insurance_Q <- insurance %>% 
               select(age, bmi, charges, region) %>% 
               drop_na()
plot_boxplot(insurance_Q, by = "region")

We can study the association between any quantitative variable with a given response variable in the data using the function plot_scatterplot() in the package DataExplorer. Here, we study the association between charges and other quantitative variables in the data.

plot_scatterplot(insurance_Q %>% select(-region), by = "charges")

We can also get a scatterplot based on a random sample of observations.

plot_scatterplot(insurance_Q %>% select(-region), by = "charges", sampled_rows=100)

plot_scatterplot(insurance_Q %>% 
                   filter(region=="northwest") %>% 
                   select(-region), 
                 by = "charges")

The above figure only shows the association between charges and other quantitative variables in the northwest.

We can check the correlation of all quantitative variables in the data using the function plot_correlation() in the package DataExplorer.

plot_correlation(insurance_Q %>% select(-region), cor_args = list( "use" = "complete.obs"))
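
The cor_args list is passed on to cor(), so use = "complete.obs" has the same meaning as in base R: rows with any missing value are dropped before the correlations are computed. A minimal sketch on made-up numbers (toy is hypothetical):

```r
# Toy data with missing values (made up for illustration)
toy <- data.frame(
  age     = c(20, 30, 40, 50, NA),
  bmi     = c(22, 25, NA, 31, 28),
  charges = c(2000, 3500, 5000, 7000, 4000)
)

# Correlations using only complete rows
cor_complete <- cor(toy, use = "complete.obs")

# The same result after dropping incomplete rows by hand
cor_manual <- cor(toy[complete.cases(toy), ])

all.equal(cor_complete, cor_manual)
```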

If you are new to data exploration and have no idea where to start, the create_report() function in the package DataExplorer can help by creating a data exploration report for the data.

create_report(insurance, output_file = "report.html", output_dir = "G:/Shared drives/R data analytics series at NTPU")

Note: Use help("create_report") to find the usage of create_report().

Session 4: Learn R Markdown Presentation

In this session, we will introduce

  1. Rmarkdown Presentation
  2. Flex Dashboard

What is R markdown?

R Markdown is a file format for creating dynamic documents with R and RStudio. R Markdown documents are written in Markdown, an easy-to-write plain text format, with embedded R code.

Rmarkdown Presentation

To create an R Markdown presentation in RStudio, we click File, then New File, and then R Markdown … There are four options:

The ioslides format allows us to create a slide show, and the slides can be broken up into sections by using the heading tags # and ##. If a header is not needed, a new slide can be created using a horizontal rule (---).

Similar to ioslides, the Slidy format allows us to create a slide show broken up into sections by using the heading tag ##. If a header is not needed, a new slide can be created using a horizontal rule (---). A Slidy presentation gives a table of contents while an ioslides presentation doesn't.

The Beamer format allows us to create a Beamer presentation (LaTeX). The slides can be broken up into sections by using the heading tags # and ##. If a header is not needed, a new slide can be created using a horizontal rule (---).

Creating a Rmarkdown Presentation

In the following, we show an example of the header of a Rmarkdown file.

We can use the output option to manipulate which presentation we would like to have.
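
As a minimal sketch of such a header, the code below writes a small R Markdown file from R so that the header can be inspected. The title, author, slide text and file name are placeholders; the output line selects the format, here ioslides_presentation (slidy_presentation, beamer_presentation and powerpoint_presentation are the other presentation formats provided by rmarkdown).

```r
# Lines of a minimal R Markdown presentation (placeholder content)
rmd_lines <- c(
  "---",
  'title: "My Presentation"',
  'author: "Author Name"',
  "output: ioslides_presentation",
  "---",
  "",
  "## First slide",
  "",
  "Some text on the first slide."
)

# Write the file to a temporary folder and print it back
rmd_file <- file.path(tempdir(), "slides_example.Rmd")
writeLines(rmd_lines, rmd_file)
cat(readLines(rmd_file), sep = "\n")

# Rendering needs the rmarkdown package and pandoc, so it is left commented out
# rmarkdown::render(rmd_file)
```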

To render an R Markdown document into its final output format, we can click the Knit button to render the document in RStudio and RStudio will show a preview of it.

Further settings for presentations can be found in R Markdown: The Definitive Guide.

xaringan Presentation

An easy way to start creating a xaringan presentation is to use the R Markdown template Ninja Presentation or Ninja Themed Presentation.

Creating a Rmarkdown Presentation

A comprehensive tutorial on xaringan presentations can be found at xaringan Presentation.

Flex Dashboard

An easy way to start creating a Flex Dashboard is to use the R Markdown template Flex Dashboard.

Creating a Flex Dashboard

A comprehensive tutorial on Flex Dashboard can be found at flexdashboard.

Session 5: A Quick Overview of GitHub

In this session, we will provide a quick overview of GitHub. Because class time is limited, we will cover only some basic uses of GitHub Desktop.

What is Git?

Git is a version control system that allows us to track changes in any set of files. It is typically used by programmers who are working on source code together.

What is GitHub?

GitHub is a version control and collaboration platform for programming, built on Git. It allows us to collaborate on projects with other people from any location.

Register for a GitHub account

We can register for a GitHub account at www.github.com.

Install and Set up GitHub Desktop

We can install GitHub Desktop from https://desktop.github.com/

Create a repository, track changes, and explore a file’s history

We will walk through how to create a repository, track changes, and explore a file’s history.

Share information on the web

Here is an example that shows what GitHub Desktop looks like.

Fetch downloads the most recent updates from origin, but it does not update our local working copy with the changes. If the fetch finds new commits after we click Fetch origin, the button changes to Pull origin. Clicking Pull origin updates our local working copy with the fetched updates.

When we commit the changes, the list of uncommitted changes disappears from the left pane. However, we have only committed the changes locally. The commit must still be pushed to the remote (origin) repository.
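For readers who prefer the console, the same commit, push, fetch, and pull cycle can be sketched from R by calling the git command-line client. This is only an illustration under stated assumptions: it assumes git is installed and on the PATH, it uses a local bare repository in a temporary folder as a stand-in for the remote on GitHub, and the folder names, file name, and commit identity are all placeholders.

```r
# Assumes the git command-line client is installed and on the PATH.
# Folder names, the file name, and the commit identity are placeholders.
git <- function(dir, ...) {
  system2("git", c("-C", shQuote(dir), ...), stdout = TRUE, stderr = TRUE)
}

base   <- tempfile("git-demo-")
origin <- file.path(base, "origin.git")  # stands in for the remote on GitHub
work   <- file.path(base, "work")        # our local working copy
dir.create(base)

# Create an empty "remote" repository and clone it to get a working copy.
system2("git", c("init", "--bare", shQuote(origin)), stdout = TRUE, stderr = TRUE)
system2("git", c("clone", shQuote(origin), shQuote(work)), stdout = TRUE, stderr = TRUE)

# Change a file, stage it, and commit it. The commit exists only locally so far.
writeLines("Hello, GitHub!", file.path(work, "README.md"))
git(work, "add", "README.md")
git(work, "-c", "user.name=Demo", "-c", "user.email=demo@example.com",
    "commit", "-m", shQuote("Add README"))

# Push the local commit to origin; -u records origin as the upstream branch.
git(work, "push", "-u", "origin", "HEAD")

# Fetch downloads updates from origin without touching the working copy;
# pull also brings them into the working copy.
git(work, "fetch", "origin")
git(work, "pull", "--ff-only")

# Explore the history of the repository.
history <- git(work, "log", "--oneline")
print(history)
```

In GitHub Desktop, the Commit, Push origin, Fetch origin, and Pull origin buttons correspond to these steps.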

Use GitHub Pages to Publish an HTML file

The HTML file needs to be named index.html.

  1. Sign in to our GitHub account at www.github.com
  2. Navigate to the repository that contains our HTML file
  3. Click Settings and find Pages from the left menu
  4. Under “GitHub Pages”, use the None or Branch drop-down menu and select a publishing source
  5. Click Save
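Since the published file has to be called index.html, one way to produce it directly from an R Markdown source is to set the output_file argument of rmarkdown::render(). The sketch below is only an illustration: the folder my-site, the file report.Rmd, and the report contents are hypothetical, and rendering is skipped when the rmarkdown package or pandoc is not available.

```r
# A hypothetical local repository folder and a placeholder report.
site_dir <- file.path(tempdir(), "my-site")
dir.create(site_dir, showWarnings = FALSE)

rmd_file <- file.path(site_dir, "report.Rmd")
writeLines(c(
  "---",
  'title: "My Report"',
  "output: html_document",
  "---",
  "",
  "Hello from GitHub Pages."
), rmd_file)

# output_file sets the name of the rendered file, so the result is index.html
# in the same folder as the source. Skip rendering if rmarkdown or pandoc is missing.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, output_file = "index.html", quiet = TRUE)
}
```

After rendering, index.html can be committed and pushed to the repository like any other file.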

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media, Inc.
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.