Session 1: Data Manipulation

Brief Overview

In this session, we will talk about data manipulation using the R package tidyverse. This package is a collection of R packages that help us with data management and exploration. The key packages in tidyverse are:

dplyr: data manipulation
ggplot2: data visualization
purrr: functional programming toolkit
readr: read data and write files
tibble: simple data frame
tidyr: data management

In this session, we will focus on the following key functions in dplyr using the dataset flights from the R package nycflights13.

filter(): pick observations by their values
arrange(): reorder the rows
select(): select variables by their names
mutate(): create new variables with functions of existing variables
group_by(): group data by existing variables
summarize(): collapse many values down to a single summary (with group_by)

All functions above work similarly.

The first argument is a data frame.
The subsequent arguments describe what to do with the data frame using the variable names.
The result is a new data frame (but we can save it back to the original data frame if needed).
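The shared pattern can be sketched on a tiny made-up tibble. The name toy and its values are hypothetical and are not part of nycflights13; this is only an illustration of the three points above.

```r
library(dplyr)

# A tiny made-up data frame standing in for flights (hypothetical values)
toy <- tibble(
  month     = c(1, 1, 2),
  dep_delay = c(5, -2, 30)
)

# First argument: the data frame; later arguments: what to do, using bare column names
late <- filter(toy, dep_delay > 0)

# The result is a new data frame; toy itself is unchanged at this point
nrow(toy)   # 3
nrow(late)  # 2

# To keep the result under the original name, assign it back
toy <- filter(toy, dep_delay > 0)
```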

Load packages and read the Flights Data

First, we load the necessary packages, resolve function name conflicts, and import the dataset flights from the R package nycflights13.

library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights

Now we need to understand the data and each variable before we move on. This dataset provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013 and there are 19 variables (Flights Data).

year, month, day: Date of departure.
dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone.
sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local time zone.
dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier: Two letter carrier abbreviation. See airlines to get name.
flight: Flight number.
tailnum: Plane tail number. See planes for additional metadata.
origin, dest: Origin and destination. See airports for additional metadata.
air_time: Amount of time spent in the air, in minutes.
distance: Distance between airports, in miles.
hour, minute: Time of scheduled departure broken into hour and minutes.
time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

We get a glimpse of the data.

glimpse(df)

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~

filter()

filter() is used when we want to subset observations based on a logical condition. For example, we can select all flights on December 25th using the following code.

filter(df, month == 12, day == 25)

## # A tibble: 719 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013    12    25      456            500        -4      649            651
##  2  2013    12    25      524            515         9      805            814
##  3  2013    12    25      542            540         2      832            850
##  4  2013    12    25      546            550        -4     1022           1027
##  5  2013    12    25      556            600        -4      730            745
##  6  2013    12    25      557            600        -3      743            752
##  7  2013    12    25      557            600        -3      818            831
##  8  2013    12    25      559            600        -1      855            856
##  9  2013    12    25      559            600        -1      849            855
## 10  2013    12    25      600            600         0      850            846
## # ... with 709 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Christmas <- filter(df, month == 12, day == 25)

Comparisons - R provides the standard suite: <, <=, >, >=, != (not equal), and == (equal).

If we would like to save the results to a variable and print them at the same time, we can wrap the assignment in parentheses.

(Jan1 <- filter(df, month == 1, day == 1))

## # A tibble: 842 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Logical Operations - R provides the following syntax: & is “and”, | is “or”, ! is “not”.

The following code finds all flights that departed in July or August.

filter(df, month == 7 | month == 8)

## # A tibble: 58,752 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

filter(df, month %in% c(7, 8))

## # A tibble: 58,752 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     7     1        1           2029       212      236           2359
##  2  2013     7     1        2           2359         3      344            344
##  3  2013     7     1       29           2245       104      151              1
##  4  2013     7     1       43           2130       193      322             14
##  5  2013     7     1       44           2150       174      300            100
##  6  2013     7     1       46           2051       235      304           2358
##  7  2013     7     1       48           2001       287      308           2305
##  8  2013     7     1       58           2155       183      335             43
##  9  2013     7     1      100           2146       194      327             30
## 10  2013     7     1      100           2245       135      337            135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note:

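If we use filter(df, month == 7 | 8), R does not compare month with 7 and then with 8. Because == binds more tightly than |, the condition is parsed as (month == 7) | 8. The nonzero number 8 is coerced to TRUE, so the condition is TRUE for every row and the call returns all flights in the data.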
filter() only includes rows where the condition is TRUE and it excludes both FALSE and NA values.
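Both points in the note can be checked on a small made-up tibble. The name toy and its values are hypothetical; the sketch only illustrates how the condition is evaluated and how NA rows are treated.

```r
library(dplyr)

# Hypothetical data: three months and one missing value
toy <- tibble(month = c(6, 7, 8, NA))

# Parsed as (month == 7) | 8; the nonzero number 8 is coerced to TRUE,
# so every row passes, including the NA row (NA | TRUE is TRUE)
all_rows <- filter(toy, month == 7 | 8)
nrow(all_rows)  # 4

# The intended condition keeps only July and August; the NA row is dropped
jul_aug <- filter(toy, month == 7 | month == 8)
nrow(jul_aug)   # 2

# To keep rows with missing values, ask for them explicitly
with_na <- filter(toy, month %in% c(7, 8) | is.na(month))
nrow(with_na)   # 3
```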

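If we want to find flights that were delayed by no more than 1 hour on both arrival and departure (that is, not delayed by more than 1 hour on either), we can use either of the following. The two are equivalent by De Morgan's law: !(x | y) is the same as !x & !y.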

filter(df, !(arr_delay > 60 | dep_delay > 60))

## # A tibble: 295,893 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

filter(df, arr_delay <= 60, dep_delay <= 60)

## # A tibble: 295,893 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
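The agreement between the two forms can be checked on a few made-up rows. The tibble toy is hypothetical, and this is a sketch rather than a proof; it also shows that a row with a missing delay is dropped by both forms.

```r
library(dplyr)

# Hypothetical delays, including one missing arrival delay
toy <- tibble(
  arr_delay = c(10, 90, 30, NA, 70),
  dep_delay = c(5, 20, 80, 10, 75)
)

# Form 1: negate "delayed by more than an hour on arrival or departure"
a <- filter(toy, !(arr_delay > 60 | dep_delay > 60))

# Form 2: require both delays to be at most an hour
b <- filter(toy, arr_delay <= 60, dep_delay <= 60)

# Both keep only the first row (10, 5); the NA row is excluded in each case
nrow(a)  # 1
nrow(b)  # 1
```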

Exercises: Find all flights that:

Had an arrival delay of one or more hours
Were operated by American (AA), Delta (DL), or United (UA) airlines
Flew to Houston (IAH or HOU)
Departed in winter (December, January, February)
Arrived more than one hour late, but didn’t leave late

arrange()

arrange() is used when we want to sort a dataset by a variable. If more than one variable is specified, the variables entered first take priority over those that come later; each additional variable is used to break ties in the preceding ones. The following code chunk gives an example that sorts the flights by date.

arrange(df, year, month, day)

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Note:

We can save the data frame back to the original data frame after sorting the data.
Use desc() to sort in descending order. The following code chunk arranges the Flights Data by arrival delay in descending order.
Missing values are always sorted at the end.

arrange(df, desc(arr_delay))

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     9      641            900      1301     1242           1530
##  2  2013     6    15     1432           1935      1137     1607           2120
##  3  2013     1    10     1121           1635      1126     1239           1810
##  4  2013     9    20     1139           1845      1014     1457           2210
##  5  2013     7    22      845           1600      1005     1044           1815
##  6  2013     4    10     1100           1900       960     1342           2211
##  7  2013     3    17     2321            810       911      135           1020
##  8  2013     7    22     2257            759       898      121           1026
##  9  2013    12     5      756           1700       896     1058           2020
## 10  2013     5     3     1133           2055       878     1250           2215
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

tail(arrange(df, desc(arr_delay)))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     9    30       NA           1842        NA       NA           2019
## 2  2013     9    30       NA           1455        NA       NA           1634
## 3  2013     9    30       NA           2200        NA       NA           2312
## 4  2013     9    30       NA           1210        NA       NA           1330
## 5  2013     9    30       NA           1159        NA       NA           1344
## 6  2013     9    30       NA            840        NA       NA           1020
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
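The placement of missing values can also be seen on a small made-up vector. The tibble toy is hypothetical; the sketch shows that NA ends up last in both ascending and descending order.

```r
library(dplyr)

# Hypothetical arrival delays with one missing value
toy <- tibble(arr_delay = c(12, NA, -5, 40))

# Ascending order: NA is placed last
arrange(toy, arr_delay)$arr_delay        # -5 12 40 NA

# Descending order: NA is still placed last
arrange(toy, desc(arr_delay))$arr_delay  # 40 12 -5 NA
```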

Exercises:

Sort flights to find the most delayed flights. Find the flights that left earliest.
Sort flights to find the fastest flights.

select()

select() is used when we want to keep only some of the variables in the data. For example, we can use the following code chunk to reduce the Flights Data to only a few variables.

# select specific columns
select(df, year, month, day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

# select all columns between year and day
select(df, year:day)

## # A tibble: 336,776 x 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows

# select all columns except those from year to day
select(df, -(year:day))

## # A tibble: 336,776 x 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
##       <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
##  1      517            515         2      830            819        11 UA     
##  2      533            529         4      850            830        20 UA     
##  3      542            540         2      923            850        33 AA     
##  4      544            545        -1     1004           1022       -18 B6     
##  5      554            600        -6      812            837       -25 DL     
##  6      554            558        -4      740            728        12 UA     
##  7      555            600        -5      913            854        19 B6     
##  8      557            600        -3      709            723       -14 EV     
##  9      557            600        -3      838            846        -8 B6     
## 10      558            600        -2      753            745         8 AA     
## # ... with 336,766 more rows, and 9 more variables: flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Note:

We can use a minus sign - to drop variables.
There are several helper functions we can use within select(). See ?select for details.
select() can be used with the everything() function when we have a handful of variables we would like to move to the start of the data frame.

# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())

## # A tibble: 336,776 x 19
##    carrier origin dest  distance  year month   day dep_time sched_dep_time
##    <chr>   <chr>  <chr>    <dbl> <int> <int> <int>    <int>          <int>
##  1 UA      EWR    IAH       1400  2013     1     1      517            515
##  2 UA      LGA    IAH       1416  2013     1     1      533            529
##  3 AA      JFK    MIA       1089  2013     1     1      542            540
##  4 B6      JFK    BQN       1576  2013     1     1      544            545
##  5 DL      LGA    ATL        762  2013     1     1      554            600
##  6 UA      EWR    ORD        719  2013     1     1      554            558
##  7 B6      EWR    FLL       1065  2013     1     1      555            600
##  8 EV      LGA    IAD        229  2013     1     1      557            600
##  9 B6      JFK    MCO        944  2013     1     1      557            600
## 10 AA      LGA    ORD        733  2013     1     1      558            600
## # ... with 336,766 more rows, and 10 more variables: dep_delay <dbl>,
## #   arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## #   tailnum <chr>, air_time <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
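A few of the helper functions mentioned in the note can be sketched as follows. The one-row tibble toy is made up for illustration; starts_with(), ends_with(), and contains() are selection helpers available inside select().

```r
library(dplyr)

# Hypothetical one-row data frame with flights-like column names
toy <- tibble(
  dep_time = 517L, dep_delay = 2,
  arr_time = 830L, arr_delay = 11,
  carrier  = "UA"
)

# Columns whose names start with "dep"
names(select(toy, starts_with("dep")))  # "dep_time"  "dep_delay"

# Columns whose names end with "delay"
names(select(toy, ends_with("delay")))  # "dep_delay" "arr_delay"

# Columns whose names contain "time"
names(select(toy, contains("time")))    # "dep_time"  "arr_time"
```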

Exercises:

Select dep_time, dep_delay, arr_time, and arr_delay from the Flights Data.
What happens if we include the name of a variable multiple times in a select() call?
What is the result of running the following code: select(df, contains("TIME"))?

mutate()

mutate() is used when we would like to add a new variable / column using the other variables in the data.

Note: mutate() always adds new columns at the end of the data.

First, we create a smaller dataset with a few variables, and then we create new variables from the variables in that dataset.

# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)

mutate(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)

## # A tibble: 336,776 x 11
##     year month   day dep_delay arr_delay distance air_time  gain speed hours
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl> <dbl>
##  1  2013     1     1         2        11     1400      227     9  370. 3.78 
##  2  2013     1     1         4        20     1416      227    16  374. 3.78 
##  3  2013     1     1         2        33     1089      160    31  408. 2.67 
##  4  2013     1     1        -1       -18     1576      183   -17  517. 3.05 
##  5  2013     1     1        -6       -25      762      116   -19  394. 1.93 
##  6  2013     1     1        -4        12      719      150    16  288. 2.5  
##  7  2013     1     1        -5        19     1065      158    24  404. 2.63 
##  8  2013     1     1        -3       -14      229       53   -11  259. 0.883
##  9  2013     1     1        -3        -8      944      140    -5  405. 2.33 
## 10  2013     1     1        -2         8      733      138    10  319. 2.3  
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>

If we only want to keep the new variables, use transmute().

transmute(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)

## # A tibble: 336,776 x 4
##     gain speed hours gain_per_hour
##    <dbl> <dbl> <dbl>         <dbl>
##  1     9  370. 3.78           2.38
##  2    16  374. 3.78           4.23
##  3    31  408. 2.67          11.6 
##  4   -17  517. 3.05          -5.57
##  5   -19  394. 1.93          -9.83
##  6    16  288. 2.5            6.4 
##  7    24  404. 2.63           9.11
##  8   -11  259. 0.883        -12.5 
##  9    -5  405. 2.33          -2.14
## 10    10  319. 2.3            4.35
## # ... with 336,766 more rows

Note: There are many functions for creating new variables that we can use with mutate(). The key property is that the function must be vectorized, which means it must take a vector of values as input and returns a vector with the same number of values as output.
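As an illustration of vectorized functions, the sketch below applies three of them inside mutate() to a made-up tibble. The name toy and the new column names are hypothetical; each function takes a vector of four values and returns a vector of four values.

```r
library(dplyr)

# Hypothetical departure delays
toy <- tibble(dep_delay = c(2, 4, -1, 10))

res <- mutate(toy,
  prev_delay  = dplyr::lag(dep_delay),      # value from the previous row (NA for the first row)
  total_delay = cumsum(dep_delay),          # running total of the delays
  delay_rank  = min_rank(desc(dep_delay))   # rank 1 = largest delay
)

res$prev_delay   # NA  2  4 -1
res$total_delay  #  2  6  5 15
res$delay_rank   #  3  2  4  1
```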

Exercises:

Currently dep_time and sched_dep_time are convenient to look at, but hard to compute with because they are not really continuous numbers. Convert them to a more convenient representation of number of minutes since midnight.

select(df, dep_time, sched_dep_time)

## # A tibble: 336,776 x 2
##    dep_time sched_dep_time
##       <int>          <int>
##  1      517            515
##  2      533            529
##  3      542            540
##  4      544            545
##  5      554            600
##  6      554            558
##  7      555            600
##  8      557            600
##  9      557            600
## 10      558            600
## # ... with 336,766 more rows

For example, 517 represents 5:17 AM and 1517 represents 15:17 (or 3:17 PM). We will use 1517 to demonstrate how to convert the time to the number of minutes since midnight (\(15 \times 60+17=917\) minutes).

We need to be able to extract 15 and 17 separately. We can use the integer division operator, %/%, and the modulo operator, %%, to achieve this.

1517 %/% 100

## [1] 15

1517 %% 100

## [1] 17

One issue remains: midnight is represented by 2400, which would correspond to \(24 \times 60 = 1440\) minutes since midnight, but it should correspond to 0. After converting all the times to minutes since midnight, whatever_time %% 1440 converts 1440 to zero while keeping all the other times the same.

transmute(df, 
          dep_time_mins = (dep_time %/% 100 * 60 + dep_time %% 100) %% 1440,
          sched_dep_time_mins = (sched_dep_time %/% 100 * 60 + sched_dep_time %% 100) %% 1440
)

## # A tibble: 336,776 x 2
##    dep_time_mins sched_dep_time_mins
##            <dbl>               <dbl>
##  1           317                 315
##  2           333                 329
##  3           342                 340
##  4           344                 345
##  5           354                 360
##  6           354                 358
##  7           355                 360
##  8           357                 360
##  9           357                 360
## 10           358                 360
## # ... with 336,766 more rows

Since the same formula is used to create both variables, we should write a function to avoid copying and pasting the code from the previous exercise. Think about how to achieve this.

time_to_mins <- function(x) (x %/% 100 * 60 + x %% 100) %% 1440

transmute(df, 
          dep_time_mins = time_to_mins(dep_time),
          sched_dep_time_mins = time_to_mins(sched_dep_time)
)

## # A tibble: 336,776 x 2
##    dep_time_mins sched_dep_time_mins
##            <dbl>               <dbl>
##  1           317                 315
##  2           333                 329
##  3           342                 340
##  4           344                 345
##  5           354                 360
##  6           354                 358
##  7           355                 360
##  8           357                 360
##  9           357                 360
## 10           358                 360
## # ... with 336,766 more rows
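The conversion can be checked on a few edge cases, including midnight. The helper name hhmm_to_mins is hypothetical and is used here only to keep the check self-contained; it applies the arithmetic described above, with 1440 minutes in a day.

```r
# Hypothetical helper: HHMM (or HMM) clock time to minutes since midnight
hhmm_to_mins <- function(x) (x %/% 100 * 60 + x %% 100) %% 1440

# 5:17, 15:17, one minute before midnight, and midnight written as 2400
hhmm_to_mins(c(517, 1517, 2359, 2400))
# 317  917 1439    0
```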

Find the 10 most delayed flights.
What does 1:5 + 1:10 return? Why?

group_by() & summarize()

summarize() collapses a data frame to a single row. For example, we can summarize the average departure delays using the following code chunk.

summarize(df, delay = mean(dep_delay, na.rm=T))

## # A tibble: 1 x 1
##   delay
##   <dbl>
## 1  12.6

In general, summarize() is most useful together with group_by(), which changes the unit of analysis from the whole dataset to individual groups. group_by() groups rows by one or more variables, giving priority to the variable entered first.

group_by(df, year, month, day)

## # A tibble: 336,776 x 19
## # Groups:   year, month, day [365]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The result shows the original data but indicates groups: year, month, day, in our example. For example, we can study the average departure / arrival delays for each day.

by_day <- group_by(df, year, month, day) 
summarize(by_day, 
          ave_dep_delay = mean(dep_delay, na.rm = T),
          ave_arr_delay = mean(arr_delay, na.rm = T)
          )

## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day ave_dep_delay ave_arr_delay
##    <int> <int> <int>         <dbl>         <dbl>
##  1  2013     1     1         11.5         12.7  
##  2  2013     1     2         13.9         12.7  
##  3  2013     1     3         11.0          5.73 
##  4  2013     1     4          8.95        -1.93 
##  5  2013     1     5          5.73        -1.53 
##  6  2013     1     6          7.15         4.24 
##  7  2013     1     7          5.42        -4.95 
##  8  2013     1     8          2.55        -3.23 
##  9  2013     1     9          2.28        -0.264
## 10  2013     1    10          2.84        -5.90 
## # ... with 355 more rows

Combining Multiple Operations with the Pipe

In order to handle data processing well in data science, it is essential to know how to use pipes. Pipes are a great tool for expressing a sequence of multiple operations, and they therefore increase the readability of the code. The pipe, %>%, comes from the package magrittr and is available automatically when tidyverse is loaded.

The logic when using the pipe: object %>% function1 %>% function2 …
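In other words, x %>% f(y) is equivalent to f(x, y): the object on the left becomes the first argument of the function on the right. A minimal sketch on a made-up tibble (toy, nested, and piped are hypothetical names) compares the nested and piped forms.

```r
library(dplyr)

# Hypothetical destinations and distances
toy <- tibble(
  dest     = c("ATL", "ORD", "ATL"),
  distance = c(762, 719, 762)
)

# Nested calls are read from the inside out
nested <- summarize(group_by(toy, dest), n = n())

# The piped version is read from top to bottom
piped <- toy %>%
  group_by(dest) %>%
  summarize(n = n())

# Both give one row per destination with the same counts
piped$n  # 2 1
```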

If we want to group the Flights Data by destination, find the number of flights, the average distance, and the average arrival delay at each destination, and then keep only destinations with more than 20 flights while excluding Honolulu airport (HNL), we may use the following code chunk to achieve this.

by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
                   count = n(),
                   ave_dist = mean(distance, na.rm=T),
                   ave_arr_delay = mean(arr_delay, na.rm=T)
                   )
delay <- filter(delay, count > 20, dest != "HNL")

The following code chunk does the same task with the pipe, %>% and it makes the code easier to read.

delay <- df %>% 
  group_by(dest) %>%
  summarize(
    count = n(),
    ave_dist = mean(distance, na.rm=T),
    ave_arr_delay = mean(arr_delay, na.rm=T)
    ) %>%
  filter(count > 20, dest != "HNL")

Useful Summary Functions

Measures of location for a quantitative variable: mean(), median()
Measure of spread for a quantitative variable: sd(), IQR(), mad()

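Here, \(MAD = \mathrm{median}(|x_i - \mathrm{median}(x)|)\) is called the median absolute deviation, which may be more useful than sd() if we have outliers. Note that R's mad() multiplies this quantity by the constant 1.4826 by default.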
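The definition of the median absolute deviation can be checked by hand on a small made-up vector with one outlier. The vector x is hypothetical; mad() and its constant argument are part of base R's stats package.

```r
# Hypothetical values with one clear outlier
x <- c(1, 2, 3, 4, 100)

# Median absolute deviation from the median, computed by hand
median(abs(x - median(x)))  # 1

# mad() with constant = 1 gives the same unscaled value
mad(x, constant = 1)        # 1

# By default mad() rescales by 1.4826
mad(x)                      # 1.4826

# The standard deviation is much more affected by the outlier
sd(x)
```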

not_cancelled <- df %>% 
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(dest) %>%
  summarize(
    distance_mu = mean(distance),
    distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd))

## # A tibble: 104 x 3
##    dest  distance_mu distance_sd
##    <chr>       <dbl>       <dbl>
##  1 EGE         1736.       10.5 
##  2 SAN         2437.       10.4 
##  3 SFO         2578.       10.2 
##  4 HNL         4973.       10.0 
##  5 SEA         2413.        9.98
##  6 LAS         2241.        9.91
##  7 PDX         2446.        9.87
##  8 PHX         2141.        9.86
##  9 LAX         2469.        9.66
## 10 IND          652.        9.46
## # ... with 94 more rows

Measures of rank: min(), quantile(), max()

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time), # the first flight departed each day
    last = max(dep_time) # the last flight departed each day
  )

## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first  last
##    <int> <int> <int> <int> <int>
##  1  2013     1     1   517  2356
##  2  2013     1     2    42  2354
##  3  2013     1     3    32  2349
##  4  2013     1     4    25  2358
##  5  2013     1     5    14  2357
##  6  2013     1     6    16  2355
##  7  2013     1     7    49  2359
##  8  2013     1     8   454  2351
##  9  2013     1     9     2  2252
## 10  2013     1    10     3  2320
## # ... with 355 more rows
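quantile() is listed above but not used in the chunk. A minimal sketch, assuming a small made-up tibble named toy (its values are hypothetical), shows how it can sit next to min() and max() inside summarize():

```r
library(dplyr)

# hypothetical departure times (HHMM) for two days
toy <- tibble(
  day      = c(1, 1, 1, 1, 2, 2, 2, 2),
  dep_time = c(517, 533, 900, 2356, 42, 600, 1200, 2354)
)

rank_summary <- toy %>%
  group_by(day) %>%
  summarize(
    first = min(dep_time),                            # smallest value
    q25   = quantile(dep_time, 0.25, names = FALSE),  # first quartile
    last  = max(dep_time)                             # largest value
  )

rank_summary
```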

Measures of position: first(), nth(x, 2), last()

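The following code chunk finds the first and last departure for each day. Because first() and last() depend on the order of the rows, the results match min() and max() above only because the rows are ordered by departure time within each day.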

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  )

## # A tibble: 365 x 5
## # Groups:   year, month [12]
##     year month   day first_dep last_dep
##    <int> <int> <int>     <int>    <int>
##  1  2013     1     1       517     2356
##  2  2013     1     2        42     2354
##  3  2013     1     3        32     2349
##  4  2013     1     4        25     2358
##  5  2013     1     5        14     2357
##  6  2013     1     6        16     2355
##  7  2013     1     7        49     2359
##  8  2013     1     8       454     2351
##  9  2013     1     9         2     2252
## 10  2013     1    10         3     2320
## # ... with 355 more rows

Counts: You have seen n(), which takes no arguments and returns the size of the current group. To count the number of non-missing values, we can use sum(!is.na(x)). To count the number of distinct values, use n_distinct().

not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers))

## # A tibble: 104 x 2
##    dest  carriers
##    <chr>    <int>
##  1 ATL          7
##  2 BOS          7
##  3 CLT          7
##  4 ORD          7
##  5 TPA          7
##  6 AUS          6
##  7 DCA          6
##  8 DTW          6
##  9 IAD          6
## 10 MSP          6
## # ... with 94 more rows
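The different kinds of counts can be compared side by side. This is a sketch on a hypothetical tibble toy with one missing value, not on the Flights Data:

```r
library(dplyr)

# hypothetical data: group "a" has one missing value and one repeated value
toy <- tibble(
  g = c("a", "a", "a", "b", "b"),
  x = c(1, NA, 1, 2, 3)
)

counts <- toy %>%
  group_by(g) %>%
  summarize(
    n_rows        = n(),                         # size of the group
    n_non_missing = sum(!is.na(x)),              # values that are present
    n_missing     = sum(is.na(x)),               # values that are NA
    n_unique      = n_distinct(x, na.rm = TRUE)  # distinct non-missing values
  )

counts
```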

We can use count() directly if all we want is a count.

not_cancelled %>% 
  count(dest)

## # A tibble: 104 x 2
##    dest      n
##    <chr> <int>
##  1 ABQ     254
##  2 ACK     264
##  3 ALB     418
##  4 ANC       8
##  5 ATL   16837
##  6 AUS    2411
##  7 AVL     261
##  8 BDL     412
##  9 BGR     358
## 10 BHM     269
## # ... with 94 more rows

We can optionally provide a weight variable. For example, we could use this to “count” (sum) the total number of miles a plane flew.

not_cancelled %>%
  count(tailnum, wt = distance)

## # A tibble: 4,037 x 2
##    tailnum      n
##    <chr>    <dbl>
##  1 D942DN    3418
##  2 N0EGMQ  239143
##  3 N10156  109664
##  4 N102UW   25722
##  5 N103US   24619
##  6 N104UW   24616
##  7 N10575  139903
##  8 N105UW   23618
##  9 N107US   21677
## 10 N108UW   32070
## # ... with 4,027 more rows
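A weighted count is a grouped sum. The sketch below, on a hypothetical three-row tibble, suggests that count(tailnum, wt = distance) gives the same totals as group_by() followed by summarize() with sum():

```r
library(dplyr)

# hypothetical flights for two planes
toy <- tibble(
  tailnum  = c("N1", "N1", "N2"),
  distance = c(100, 250, 400)
)

# weighted count: n is the sum of distance within each tailnum
via_count <- toy %>%
  count(tailnum, wt = distance)

# the same totals written out with group_by() and summarize()
via_summarize <- toy %>%
  group_by(tailnum) %>%
  summarize(n = sum(distance))

identical(via_count$n, via_summarize$n)
```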

Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, sum() gives the number of TRUEs and mean() gives the proportion of TRUEs in the variable. For example, we can check how many flights left before 5AM each day using the following code chunk.

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500))

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day n_early
##    <int> <int> <int>   <int>
##  1  2013     1     1       0
##  2  2013     1     2       3
##  3  2013     1     3       4
##  4  2013     1     4       3
##  5  2013     1     5       3
##  6  2013     1     6       2
##  7  2013     1     7       2
##  8  2013     1     8       1
##  9  2013     1     9       3
## 10  2013     1    10       3
## # ... with 355 more rows

Or, what proportion of flights each day arrive more than one hour late?

not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60))

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day hour_perc
##    <int> <int> <int>     <dbl>
##  1  2013     1     1    0.0722
##  2  2013     1     2    0.0851
##  3  2013     1     3    0.0567
##  4  2013     1     4    0.0396
##  5  2013     1     5    0.0349
##  6  2013     1     6    0.0470
##  7  2013     1     7    0.0333
##  8  2013     1     8    0.0213
##  9  2013     1     9    0.0202
## 10  2013     1    10    0.0183
## # ... with 355 more rows
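The conversion of logical values to 0 and 1 can be seen on a small made-up vector of delays (the values are hypothetical):

```r
# hypothetical arrival delays in minutes
delays <- c(5, 75, 120, -3, 61)

# logical vector: TRUE where the delay exceeds one hour
over_hour <- delays > 60

n_over    <- sum(over_hour)   # number of TRUEs
prop_over <- mean(over_hour)  # proportion of TRUEs

c(count = n_over, proportion = prop_over)
```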

Grouping by Multiple Variables

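Here we show some examples to demonstrate how to group the data by multiple variables. Each call to summarize() removes the last level of grouping, which makes it easy to roll up the data step by step.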

(per_day <- df %>% 
   group_by(year, month, day) %>%
  summarize(flights = n()))

## # A tibble: 365 x 4
## # Groups:   year, month [12]
##     year month   day flights
##    <int> <int> <int>   <int>
##  1  2013     1     1     842
##  2  2013     1     2     943
##  3  2013     1     3     914
##  4  2013     1     4     915
##  5  2013     1     5     720
##  6  2013     1     6     832
##  7  2013     1     7     933
##  8  2013     1     8     899
##  9  2013     1     9     902
## 10  2013     1    10     932
## # ... with 355 more rows

(per_month <- summarize(per_day, flights = sum(flights)))

## # A tibble: 12 x 3
## # Groups:   year [1]
##     year month flights
##    <int> <int>   <int>
##  1  2013     1   27004
##  2  2013     2   24951
##  3  2013     3   28834
##  4  2013     4   28330
##  5  2013     5   28796
##  6  2013     6   28243
##  7  2013     7   29425
##  8  2013     8   29327
##  9  2013     9   27574
## 10  2013    10   28889
## 11  2013    11   27268
## 12  2013    12   28135

(per_year <- summarize(per_month, flights = sum(flights)))

## # A tibble: 1 x 2
##    year flights
##   <int>   <int>
## 1  2013  336776

Ungrouping

If we need to remove grouping, and return to operations on ungrouped data, use ungroup().

daily <- df %>% group_by(year, month, day)
daily %>% 
  ungroup() %>% # no longer grouped by date
  summarize(flights=n()) # all flights

## # A tibble: 1 x 1
##   flights
##     <int>
## 1  336776

Exercises:

For each plane, count the number of flights before the first departure delay of greater than 1 hour.

df %>%
  filter(!is.na(dep_delay)) %>%
  arrange(tailnum, year, month, day) %>%
  group_by(tailnum) %>%
  # cumulative number of flights delayed over one hour
  mutate(cumulative_hr_delays = cumsum(dep_delay > 60)) %>%
  # count the flights whose cumulative count is still 0, i.e. flights before the first long delay
  summarise(total_flights = sum(cumulative_hr_delays < 1)) %>%
  arrange(total_flights)

## # A tibble: 4,037 x 2
##    tailnum total_flights
##    <chr>           <int>
##  1 D942DN              0
##  2 N10575              0
##  3 N11106              0
##  4 N11109              0
##  5 N11187              0
##  6 N11199              0
##  7 N12967              0
##  8 N13550              0
##  9 N136DL              0
## 10 N13903              0
## # ... with 4,027 more rows
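The cumsum() step in the solution can be traced on a hypothetical sequence of delays for a single plane:

```r
# hypothetical departure delays (minutes) for one plane, in time order
dep_delay <- c(10, 5, 70, 0, 90)

# running count of flights delayed by more than one hour
cumulative_hr_delays <- cumsum(dep_delay > 60)
cumulative_hr_delays

# flights flown while the running count is still zero
flights_before_first_long_delay <- sum(cumulative_hr_delays < 1)
flights_before_first_long_delay
```

The running count stays at zero until the third flight, so two flights are counted before the first delay of more than one hour.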

What does the sort argument to count() do? When might we use it?

The sort argument to count() sorts the results in order of n. We could use this anytime we would run count() followed by arrange().

For example, the following code chunk counts the number of flights to a destination and sorts the returned data from highest to lowest.

df %>%
  count(dest, sort = TRUE)

## # A tibble: 105 x 2
##    dest      n
##    <chr> <int>
##  1 ORD   17283
##  2 ATL   17215
##  3 LAX   16174
##  4 BOS   15508
##  5 MCO   14082
##  6 CLT   14064
##  7 SFO   13331
##  8 FLL   12055
##  9 MIA   11728
## 10 DCA    9705
## # ... with 95 more rows

Grouped Mutates and Filters

We can also do convenient operations with mutate() and filter().

The following code chunk finds the worst members of each group: here, the nine flights with the largest arrival delays on each day.

df1 %>% 
  group_by(year, month, day) %>% 
  filter(rank(desc(arr_delay)) < 10)

## # A tibble: 3,306 x 7
## # Groups:   year, month, day [365]
##     year month   day dep_delay arr_delay distance air_time
##    <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
##  1  2013     1     1       853       851      184       41
##  2  2013     1     1       290       338     1134      213
##  3  2013     1     1       260       263      266       46
##  4  2013     1     1       157       174      213       60
##  5  2013     1     1       216       222      708      121
##  6  2013     1     1       255       250      589      115
##  7  2013     1     1       285       246     1085      146
##  8  2013     1     1       192       191      199       44
##  9  2013     1     1       379       456     1092      222
## 10  2013     1     2       224       207      550       94
## # ... with 3,296 more rows
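The same pattern can be checked by hand on a small made-up tibble. This sketch keeps the two largest arrival delays per day (the data and the cutoff of 2 are hypothetical):

```r
library(dplyr)

# hypothetical arrival delays for two days
toy <- tibble(
  day       = c(1, 1, 1, 2, 2, 2),
  arr_delay = c(10, 50, 30, 5, 80, 20)
)

worst_two <- toy %>%
  group_by(day) %>%
  # rank 1 is the largest delay within each day
  filter(rank(desc(arr_delay)) <= 2) %>%
  ungroup()

worst_two
```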

The following code chunk finds all groups bigger than a threshold.

popular_dests <- df %>%
  group_by(dest) %>% 
  filter(n()>365)
popular_dests

## # A tibble: 332,577 x 19
## # Groups:   dest [77]
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 332,567 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The following code chunk standardizes within each group to compute per-group metrics: each delayed flight's share of the total arrival delay at its destination.

popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, arr_delay, prop_delay)

## # A tibble: 131,106 x 6
## # Groups:   dest [77]
##    dest   year month   day arr_delay prop_delay
##    <chr> <int> <int> <int>     <dbl>      <dbl>
##  1 IAH    2013     1     1        11  0.000111 
##  2 IAH    2013     1     1        20  0.000201 
##  3 MIA    2013     1     1        33  0.000235 
##  4 ORD    2013     1     1        12  0.0000424
##  5 FLL    2013     1     1        19  0.0000938
##  6 ORD    2013     1     1         8  0.0000283
##  7 LAX    2013     1     1         7  0.0000344
##  8 DFW    2013     1     1        31  0.000282 
##  9 ATL    2013     1     1        12  0.0000400
## 10 DTW    2013     1     1        16  0.000116 
## # ... with 131,096 more rows
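Because sum() inside a grouped mutate() is computed per group, the proportions add up to 1 within each destination. A sketch on a hypothetical tibble:

```r
library(dplyr)

# hypothetical positive arrival delays at two destinations
toy <- tibble(
  dest      = c("A", "A", "B", "B", "B"),
  arr_delay = c(10, 30, 5, 5, 10)
)

props <- toy %>%
  group_by(dest) %>%
  # sum(arr_delay) is the total delay of the current destination only
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  ungroup()

# check: the proportions sum to 1 within each destination
totals <- props %>%
  group_by(dest) %>%
  summarize(total = sum(prop_delay))

totals
```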

Exercises:

What time of day should you fly if you want to avoid delays as much as possible?
For each destination, compute the total minutes of the delay. For each flight, compute the proportion of the total delay for its destination.

Session 2: Data Visualization with ggplot2

Brief Overview

In this session, we will introduce how to visualize our data using ggplot2 and plotly. The lecture is based on UC Business Analytics R Programming Guide.

Data Visualization with R Package: ggplot2

While we can use the built-in functions in the base package in R to obtain plots, the package ggplot2 creates advanced graphs with simple and flexible commands.

Load packages and read the Fuel Economy Data

First, we load the necessary packages, check conflict functions, and get a glimpse of the dataset mpg from the R package ggplot2.

library(tidyverse)
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("filter", "dplyr")
glimpse(mpg)

## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "c~

Now we need to understand the data and each variable in the data. This dataset contains 38 popular models of cars from 1999 to 2008. (Fuel Economy Data).

manufacturer: car manufacturer
model: model name
displ: engine displacement, in liters
year: year of manufacturing (1999-2008)
cyl: number of cylinders
trans: type of transmission
drv: drive type (f, r, 4, f=front wheel, r=rear wheel, 4=4 wheel)
cty: city miles per gallon
hwy: highway miles per gallon
fl: fuel type (p = premium, r = regular, d = diesel, e = ethanol E85, c = compressed natural gas)
class: vehicle class, 7 types (compact, suv, minivan, etc.)

Grammar of Graphics

The basic idea of creating plots using ggplot2 is to specify each component of the following and combine them with +.

ggplot() function

The ggplot() function plays an important role in data visualization as it is very flexible for plotting many different types of graphic displays.

The logic when using ggplot() function is: ggplot(data, mapping) + geom_function().

The Basics

The following code chunk shows how we can obtain a scatter plot to study the relationship between engine displacement and highway mileage per gallon.

# create canvas
ggplot(mpg)

# variables of interest mapped
ggplot(mpg, mapping = aes(x = displ, y = hwy))

# data plotted
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

Exercises:

Make a scatterplot of hwy versus cty

Aesthetic Mappings

Aesthetic mappings let us select the variables to be plotted and use properties of the data to control visual characteristics such as color, size, shape, and position. As a result, each visual characteristic can encode a different part of the data and be used to communicate information.

All aesthetics for a plot are specified in the aes() function call.

For example, we can add a mapping from the class of the cars to a color characteristic:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

Note:

In the above code chunk, “class” is a variable in the data, so the command uses a categorical variable as the third variable in the figure.
Using the aes() function will cause the visual channel to be based on the data specified in the argument. For example, using aes(color = “blue”) won’t cause the geometry’s color to be “blue”, but will instead cause the visual channel to be mapped from the vector c(“blue”), as if we only had a single type of engine that happened to be called “blue”. If we wish to apply an aesthetic property to an entire geometry, we can set that property as an argument to the geom function, outside of the aes() call:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
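The difference between mapping and setting a color can be inspected with layer_data(), which returns the data ggplot2 computed for a layer. This is a sketch; the object names mapped_plot and fixed_plot are only illustrative.

```r
library(ggplot2)

# "blue" inside aes() is treated as data: a variable with a single value
mapped_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes() sets the color of every point
fixed_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# colors actually used by the point layer of each plot
unique(layer_data(mapped_plot)$colour)  # a default palette color
unique(layer_data(fixed_plot)$colour)   # the color that was set
```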

Exercises:

Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical versus continuous variables?
What happens if we map an aesthetic to something other than a variable name, like ggplot(mpg, aes(x = displ, y = hwy, color = displ < 5)) + geom_point()?

Specifying Geometric Shapes

Building on these basics, we can use ggplot2 to create almost any kind of plot we may want. These plots are declared using functions that follow from the Grammar of Graphics. ggplot2 supports a number of different types of geometric objects, including:

geom_bar(): bar charts
geom_boxplot(): boxplots
geom_histogram(): histograms
geom_line(): lines
geom_map(): polygons in the shape of a map.
geom_point(): individual points
geom_polygon(): arbitrary shapes
geom_smooth(): smoothed lines

Each of these geometries will make use of the aesthetic mappings provided, albeit the visual qualities to which the data will be mapped will differ. For example, we can map data to the shape of a geom_point (e.g., if they should be circles or squares), or we can map data to the line-type of a geom_line (e.g., if it is solid or dotted), but not vice versa.

Most geoms require at least an x and a y mapping, although some, such as geom_bar() and geom_histogram(), compute y themselves and need only x.

# x and y mapping needed when creating a scatterplot
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

# no y mapping needed when creating a bar chart
ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot(mpg, aes(x = hwy)) +
  geom_histogram()

What makes this really powerful is that you can add multiple geometries to a plot, thus allowing you to create complex graphics showing multiple aspects of your data.

# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

Note:

Since the aesthetics for each geom can be different, we could show multiple lines on the same plot (or with different colors, styles, etc.).

For example, we can plot both points and a smoothed line for the same x and y variable but specify unique colors within each geom:

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(color = "red")

It is also possible to give each geom a different data argument, so that we can show multiple data sets in the same plot.

If we specify an aesthetic within ggplot(), it will be passed on to each geom that follows. Alternatively, we can specify certain aesthetics within an individual geom, which allows us to show those characteristics only for that specific layer (e.g. geom_point).

# color aesthetic passed to each geom layer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(se = FALSE)

# color aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE)
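The note above that each geom can take its own data argument can be sketched as follows. The subset suv and the choice of colors are assumptions made for illustration.

```r
library(ggplot2)

# hypothetical subset: only the SUVs
suv <- dplyr::filter(mpg, class == "suv")

p_highlight <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "grey") +            # layer 1: all cars
  geom_point(data = suv, color = "red")   # layer 2: SUVs only

p_highlight
```

The second layer inherits the x and y mappings from ggplot() but draws only the rows of suv, so the SUVs are highlighted on top of the full data.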

Exercises:

What geom would you use to draw a line chart? A boxplot? A histogram?
Create a boxplot of the highway mileage (hwy).

Statistical Transformations

The following bar chart shows the frequency distribution of vehicle class. Notice that the y axis is the count of vehicles in each class. This count is not part of the data set; it is a statistical transformation that geom_bar() automatically applies to the data. In particular, it applies the stat_count() transformation.

ggplot(mpg, aes(x = class)) +
  geom_bar()

ggplot2 supports many different statistical transformations. For example, the “identity” transformation will leave the data “as is”. We can specify which statistical transformation a geom uses by passing it as the stat argument. For example, consider our data already had the count as a variable:

(class_count <- count(mpg, class))

## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62

We can use stat = “identity” within geom_bar() so that the bar heights are taken directly from this variable. Note that we now map n to y:

ggplot(class_count, aes(x = class, y = n)) +
  geom_bar(stat = "identity")
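ggplot2 also provides geom_col(), which draws bars whose heights are taken directly from the y variable. A sketch of the same chart with it, rebuilding class_count so that the block runs on its own:

```r
library(ggplot2)

# counts per vehicle class, as computed above
class_count <- dplyr::count(mpg, class)

# geom_col() uses the y values as bar heights
p_col <- ggplot(class_count, aes(x = class, y = n)) +
  geom_col()

p_col
```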

We can also call stat_ functions directly to add additional layers. For example, here we create a scatter plot of highway miles for each displacement value and then use stat_summary() to plot the mean highway miles at each displacement value.

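ggplot(mpg, aes(displ, hwy)) + 
  geom_point(color = "grey") + 
  # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0);
  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  stat_summary(fun = "mean", geom = "line", linewidth = 1, linetype = "dashed")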

Exercises:

What is the default geom associated with stat_summary()?
What variables does stat_smooth() compute? What parameters control its behavior?

Position Adjustments

In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of “rules” as to how different components should be positioned relative to each other. This position adjustment is noticeable in geom_bar() if we map a different variable to the fill aesthetic.

# bar chart of class, colored by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar()

By default, geom_bar() uses a position adjustment of “stack”, which makes each rectangle’s height proportional to its value and stacks the rectangles on top of each other. We can use the position argument to specify what position adjustment rules to follow:

# position = "dodge": values next to each other
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "dodge")

# position = "fill": percentage chart
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill")

Note: We may need to check the documentation for each particular geom to learn more about its positioning adjustments.

Managing Scales

Whenever we specify an aesthetic mapping, ggplot() uses a particular scale to determine the range of values that the data should map to. It automatically adds a scale for each mapping to the plot.

# color the data by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

However, the scale used in the figure can be changed if needed. Each scale is represented by a function whose name has the following form: scale_, followed by the name of the aesthetic property, followed by an _ and the name of the scale. A continuous scale handles numeric data, whereas a discrete scale handles categorical data such as vehicle class.

# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()

While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, we can use a scale to change the direction of an axis:

# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()

Similarly, we can use scale_x_log10() and scale_x_sqrt() to transform the scale. We can use scales to format the axes as well.

ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, by = .2), 
                     labels = scales::percent) + 
  labs(y = "Percent")
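The scale_x_log10() transformation mentioned above can be sketched in the same way. The choice of displ and hwy is only for illustration.

```r
library(ggplot2)

# plot engine displacement on a log10 x axis
p_log <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10()

p_log

# the x positions computed for the layer are on the log10 scale
range(layer_data(p_log)$x)
range(log10(mpg$displ))
```

The axis labels still show the original units, while the spacing between them follows the logarithm.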

Use Pre-Defined Palettes

A common parameter to change is which set of colors to use in a plot. While we can use the default coloring, a more common option is to use the pre-defined palettes from colorbrewer.org. These color sets have been carefully designed to look good and to be viewable to people with certain forms of color blindness. We can use ColorBrewer palettes by adding scale_color_brewer() and passing the palette name as an argument.

# default color brewer
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer()

# specifying color palette
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Set3")

Coordinate Systems

Similar to scales, coordinate systems are specified with functions that all start with coord_ and are added as a layer. There are a number of different possible coordinate systems to use, including:

coord_cartesian: the default Cartesian coordinate system, where you specify x and y values
coord_flip: a cartesian system with the x and y flipped
coord_fixed: a cartesian system with a “fixed” aspect ratio
coord_polar: a plot using polar coordinates
coord_quickmap: a coordinate system that approximates a good aspect ratio for maps.

# zoom in with coord_cartesian
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 5))

# flip x and y axis with coord_flip
ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()
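Zooming with coord_cartesian() differs from setting limits on the scale: the coordinate system keeps every observation and only changes the visible window, while scale limits turn values outside the range into missing values before anything is drawn. A sketch comparing the two, where the object names zoomed and clipped are illustrative:

```r
library(ggplot2)

# zoom: all 234 observations are kept, only the view changes
zoomed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 5))

# scale limits: cars with displ > 5 become NA and are not drawn
clipped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 5))

sum(is.na(layer_data(zoomed)$x))   # no values lost
sum(is.na(layer_data(clipped)$x))  # one NA per car with displ > 5
```

The distinction matters most for layers that summarize the data, such as geom_smooth(), because scale limits change what is fitted while coord_cartesian() does not.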

Facets

If we want to divide the information into multiple subplots, facets are the way to go. They allow us to view a separate plot for each level of a categorical variable. We can construct a plot with multiple facets by using facet_wrap() or facet_grid(). The code below uses facet_grid(~ class), which produces a single row of subplots, one for each vehicle class; facet_wrap(~ class) would instead wrap the subplots over several rows (the number of rows can be set with the nrow argument).

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(~ class)

Note:

We can use facet_grid() to facet the data by more than one categorical variable.
We use a tilde (~) in our facet functions. With facet_grid(), the variable to the left of the tilde will be represented in the rows and the variable to the right will be represented across the columns.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(year ~ cyl)
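For comparison, a sketch of facet_wrap() with the nrow argument, which wraps the panels for one categorical variable over a chosen number of rows (nrow = 2 is an arbitrary choice for illustration):

```r
library(ggplot2)

# one panel per vehicle class, wrapped over two rows
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

p_wrap

# number of panels that were created
nlevels(layer_data(p_wrap)$PANEL)
```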

Exercises:

Create a figure of multiple boxplots of the highway mileage (hwy) by the drive type (drv).

Labels & Annotations

Textual annotations and labels (on the plot, axes, geometry, and legend) are crucial for understanding and presenting information.

labs: assign title, subtitle, caption, x & y labels

We can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!).

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
       x = "Engine Displacement (liters)",
       y = "Fuel Efficiency (miles per gallon)",
       color = "Car Type")

It is also possible to add labels into the plot itself (e.g., to label each point or line) by adding a new geom_text or geom_label to the plot; effectively, we are plotting an extra set of data which happen to be the variable names.

# a data table of each car that has best efficiency of its type

best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) +
  geom_label(data = best_in_class, aes(label = model), alpha = 0.5)

However, two labels overlap one another in the top left part of the plot. We can use geom_text_repel() from the ggrepel package to help position the labels.

library(ggrepel)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) +
  geom_text_repel(data = best_in_class, aes(label = model))

Themes

Whenever we want to customize titles, labels, fonts, background, grid lines, and legends, we can use themes.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       x = "Engine Displacement (Liters)",
       y = "Fuel Efficiency (Miles per gallon)") + 
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),  
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))

Note:

We only list some key components here.
See Modify Components of A Theme and Complete Themes for more details about the use of theme.
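As a sketch of how a complete theme combines with individual theme() settings, the following applies theme_bw() and then moves the legend. The particular theme and legend position are arbitrary choices for illustration.

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_bw() +                         # complete theme: sets all components
  theme(legend.position = "bottom")    # then adjust a single component

p_theme
```

The call to theme() comes after the complete theme because a complete theme added later would reset the individual setting.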

Data Visualization with R Package: plotly

The R package plotly makes it easy to create interactive graphic displays when we already know how to use ggplot() to create graphs.

The following code chunk shows the interactive plots corresponding to the figures we have created in the previous section.

library(plotly)
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(~ class)
ggplotly(p1)

R Data Analytics Series at NTPU, 2022
