Overview

In this lesson, we will talk about how to work with relational data. There are three families verbs designed to work with relational data:

Mutating joins: adding new variables to one data frame from matching observations (identifier) in another.
Filtering joins: filtering observations from one data frame based on whether or not they match an observation in the other table.
Set operations: treating observations as if they were set elements.

Note: This lesson is based on the book: R for Data Science (Wickham and Grolemund (2016)).

Reading Datasets in nycflights13

We will explore relational data from the R package nycflights13 using the two-table verbs from dplyr.

library(pacman)
p_load(tidyverse, nycflights13)

This nycflights13 package includes 5 datasets: airlines, airports, flights, planes, and weather.

airlines provides the full carrier name from its abbreviated code.
airports gives information about each airport, identified by the faa airport code.
planes provides information about each plane, identified by its tailnum.
weather gives the weather at each NYC airport for each hour.

Note: We introduce a R function p_load() in the R package pacman here. The p_load() function checks to see if a package is installed, if not it attempts to install the package from CRAN and/or any other repository in the pacman repository list.

Getting to Know Variables

Let’s take a look at what variables are in these datasets.

Show the Relationships between the Tables

One way to understand the relationships between the tables is with a drawing:

Keys

The variables used to connect each pair of tables are called keys. A key is a variable or a set of variables that uniquely identifies an observation. There are two types of keys:

A primary key uniquely identifies an observation in its own table.
A foreign key uniquely identifies an observation in another table.

If a table/dataset lacks a primary key, it may be useful to add one with the mutate() and rownumber() functions. This makes it easier to match observations and check our work. Such a variable is called a surrogate key.

A primary key and the corresponding foreign key in another table form a relation, which is typically one-to-many.

Note:

A variable can be both a primary key and a foreign key.
It is important to verify that the keys do indeed uniquely identify each observation.

Verifying Keys

Here we show two examples:

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)

## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>

weather %>%
  select(year, month, day, hour, origin) %>% 
  duplicated() %>% sum()

## [1] 3

Mutating Joins

A mutating join could be used to combine variables from two tables. It matches observations by their keys first, then copies across variables from one table to the other. For the convenience, we start from a small data frame.

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

Suppose we want to add the full airline name to the flight2 data. We can combine the airlines and flight2 data frames by left_join() function.

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier") %>% 
  head(n=3)

## # A tibble: 3 × 7
##    year month   day  hour tailnum carrier name                  
##   <int> <int> <int> <dbl> <chr>   <chr>   <chr>                 
## 1  2013     1     1     5 N14228  UA      United Air Lines Inc. 
## 2  2013     1     1     5 N24211  UA      United Air Lines Inc. 
## 3  2013     1     1     5 N619AA  AA      American Airlines Inc.

Understanding Joins

Consider two data frames.

A join is a way of connecting each row in x to some or zero rows in y.

In an actual join, matches will be indicated with dots.

Inner Join

The simplest type of join is the inner join. It matches pairs of observations whenever their keys are equal. The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by to indicate which variable is the key.

x %>% 
  inner_join(y, by = "key")

## # A tibble: 2 × 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

Outer Joins

We should note that an inner join keeps only observations that appear in both data frames. An outer join keeps observation that apper in at least one of the data frames.

There are three types of outer joins.

The most commonly used join is the left join. We use a left join whenever we loop up additional data from another data frame since it keeps the original observations even when there is not a match.

Duplicate Keys

When keys are not unique. There are two possibilities:

One data frame has duplicate keys.

This means that there is a one-to-many relationship between two data frames.

Both data frames have duplicate keys.

This is usually an error. When we join duplicated keys, we will get all possible combinations.

Defining the Key Columns - 1

The default, by = NULL, uses all variables that appear in both tables.

flights2 %>% 
  left_join(weather) %>% 
  head()

## # A tibble: 6 × 18
##    year month   day  hour origin dest  tailnum carrier  temp  dewp humid wind_…¹
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>   <dbl>
## 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4     260
## 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8     250
## 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6     260
## 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6     260
## 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8     260
## 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4     260
## # … with 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>, and abbreviated variable
## #   name ¹wind_dir

Defining the Key Columns - 2

A character vector, by = “A”, uses the variable A that appears in both tables.

flights2 %>% 
  left_join(planes, by = "tailnum") %>% 
  head()

## # A tibble: 6 × 16
##   year.x month   day  hour origin dest  tailnum carrier year.y type      manuf…¹
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>     <chr>  
## 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wi… BOEING 
## 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wi… BOEING 
## 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wi… BOEING 
## 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wi… AIRBUS 
## 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wi… BOEING 
## 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wi… BOEING 
## # … with 5 more variables: model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>, and abbreviated variable name ¹manufacturer

Defining the Key Columns - 3

A named character vector, by = c(“A” = “B”), matches variable A in the first data frame to variable B in the second data frame.

flights2 %>% 
  left_join(airports, by = c("dest" = "faa")) %>% 
  head(n=3)

## # A tibble: 3 × 15
##    year month   day  hour origin dest  tailnum carrier name      lat   lon   alt
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl>
## 1  2013     1     1     5 EWR    IAH   N14228  UA      George…  30.0 -95.3    97
## 2  2013     1     1     5 LGA    IAH   N24211  UA      George…  30.0 -95.3    97
## 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami …  25.8 -80.3     8
## # … with 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

flights2 %>% 
  left_join(airports, by = c("origin" = "faa")) %>% 
  head(n=3)

## # A tibble: 3 × 15
##    year month   day  hour origin dest  tailnum carrier name      lat   lon   alt
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl>
## 1  2013     1     1     5 EWR    IAH   N14228  UA      Newark…  40.7 -74.2    18
## 2  2013     1     1     5 LGA    IAH   N24211  UA      La Gua…  40.8 -73.9    22
## 3  2013     1     1     5 JFK    MIA   N619AA  AA      John F…  40.6 -73.8    13
## # … with 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Filtering Joins - 1

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types.

semi_join(x, y) keeps all observations in x that have a match in y.
anti_join(x, y) drops all observations in x that have a match in y.

Suppose that we want to find the 10 days with the highest average delays. How would we construct the filter statement that used year, month, and day to match it back to flights?

topdest <- flights %>% 
  count(dest, sort=TRUE) %>% 
  head(10)

flights %>% 
  filter(dest %in% topdest$dest) %>% 
  head(5) # only show the first 5 rows due to the limited space on the page

## # A tibble: 5 × 19
##    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
## 1  2013     1     1      542         540       2     923     850      33 AA     
## 2  2013     1     1      554         600      -6     812     837     -25 DL     
## 3  2013     1     1      554         558      -4     740     728      12 UA     
## 4  2013     1     1      555         600      -5     913     854      19 B6     
## 5  2013     1     1      557         600      -3     838     846      -8 B6     
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Filtering Joins - 2

The previous approach is difficult to extend to the situation when including multiple variables.

flights %>% 
  semi_join(topdest) %>% 
  head(10)

## # A tibble: 10 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      542        540       2     923     850      33 AA     
##  2  2013     1     1      554        600      -6     812     837     -25 DL     
##  3  2013     1     1      554        558      -4     740     728      12 UA     
##  4  2013     1     1      555        600      -5     913     854      19 B6     
##  5  2013     1     1      557        600      -3     838     846      -8 B6     
##  6  2013     1     1      558        600      -2     753     745       8 AA     
##  7  2013     1     1      558        600      -2     924     917       7 UA     
##  8  2013     1     1      558        600      -2     923     937     -14 UA     
##  9  2013     1     1      559        559       0     702     706      -4 B6     
## 10  2013     1     1      600        600       0     851     858      -7 B6     
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## #   ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay

Filtering Joins - 3

Semi-Join - 1

The following figure shows how a semi-join works. Only the existence of a match is important. It doesn’t matter which observation is matched.

Semi-Join - 2

Anti-Join

The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match.

Set Operations - 1

The final types of two-table verb are the set operations.

intersect(x, y) returns only observations in x and y.
union(x, y) returns unique observations in x and y.
setdiff(x, y) returns observations in x, but not in y.

These expects that x and y inputs to have the same variables, and treat the observations like sets.

Set Operations - 2

We see a simple example here.

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)

intersect(df1, df2)

## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     1

union(df1, df2)

## # A tibble: 3 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
## 2     2     1
## 3     1     2

setdiff(df1, df2)

## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     2     1

setdiff(df2, df1)

## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     2

README

You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):

A: Switches show of current versus all slides (helpful for printing all pages)
B: Make fonts large
c: Show table of contents
S: Make fonts smaller

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".

Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.

MTH 209 Data Manipulation and Management

Lesson 13: Relational Data with dplyr