MTH 209 Data Manipulation and Management

Lesson 13: Relational Data with dplyr

Ying-Ju Tessa Chen
ychen4@udayton.edu
University of Dayton

Overview

In this lesson, we will talk about how to work with relational data. There are three families verbs designed to work with relational data:

Note: This lesson is based on the book: R for Data Science (Wickham and Grolemund (2016)).

Reading Datasets in nycflights13

We will explore relational data from the R package nycflights13 using the two-table verbs from dplyr.

library(pacman)
p_load(tidyverse, nycflights13)

This nycflights13 package includes 5 datasets: airlines, airports, flights, planes, and weather.

Note: We introduce a R function p_load() in the R package pacman here. The p_load() function checks to see if a package is installed, if not it attempts to install the package from CRAN and/or any other repository in the pacman repository list.

Getting to Know Variables

Let’s take a look at what variables are in these datasets.

Show the Relationships between the Tables

One way to understand the relationships between the tables is with a drawing:

Keys

The variables used to connect each pair of tables are called keys. A key is a variable or a set of variables that uniquely identifies an observation. There are two types of keys:

If a table/dataset lacks a primary key, it may be useful to add one with the mutate() and rownumber() functions. This makes it easier to match observations and check our work. Such a variable is called a surrogate key.

A primary key and the corresponding foreign key in another table form a relation, which is typically one-to-many.

Note:

  1. A variable can be both a primary key and a foreign key.
  2. It is important to verify that the keys do indeed uniquely identify each observation.

Verifying Keys

Here we show two examples:

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
weather %>%
  select(year, month, day, hour, origin) %>% 
  duplicated() %>% sum()
## [1] 3

Mutating Joins

A mutating join could be used to combine variables from two tables. It matches observations by their keys first, then copies across variables from one table to the other. For the convenience, we start from a small data frame.

flights2 <- flights %>% 
  select(year:day, hour, origin, dest, tailnum, carrier)

Suppose we want to add the full airline name to the flight2 data. We can combine the airlines and flight2 data frames by left_join() function.

flights2 %>%
  select(-origin, -dest) %>% 
  left_join(airlines, by = "carrier") %>% 
  head(n=3)
## # A tibble: 3 × 7
##    year month   day  hour tailnum carrier name                  
##   <int> <int> <int> <dbl> <chr>   <chr>   <chr>                 
## 1  2013     1     1     5 N14228  UA      United Air Lines Inc. 
## 2  2013     1     1     5 N24211  UA      United Air Lines Inc. 
## 3  2013     1     1     5 N619AA  AA      American Airlines Inc.

Understanding Joins

Consider two data frames.

A join is a way of connecting each row in x to some or zero rows in y.

In an actual join, matches will be indicated with dots.

Inner Join

The simplest type of join is the inner join. It matches pairs of observations whenever their keys are equal. The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by to indicate which variable is the key.

x %>% 
  inner_join(y, by = "key")
## # A tibble: 2 × 3
##     key val_x val_y
##   <dbl> <chr> <chr>
## 1     1 x1    y1   
## 2     2 x2    y2

Outer Joins

We should note that an inner join keeps only observations that appear in both data frames. An outer join keeps observation that apper in at least one of the data frames.

There are three types of outer joins.

The most commonly used join is the left join. We use a left join whenever we loop up additional data from another data frame since it keeps the original observations even when there is not a match.

Duplicate Keys

When keys are not unique. There are two possibilities:

  • One data frame has duplicate keys.
  This means that there is a one-to-many relationship between two data frames.

  • Both data frames have duplicate keys.
  This is usually an error. When we join duplicated keys, we will get all possible combinations.

Defining the Key Columns - 1

flights2 %>% 
  left_join(weather) %>% 
  head()
## # A tibble: 6 × 18
##    year month   day  hour origin dest  tailnum carrier  temp  dewp humid wind_…¹
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <dbl> <dbl> <dbl>   <dbl>
## 1  2013     1     1     5 EWR    IAH   N14228  UA       39.0  28.0  64.4     260
## 2  2013     1     1     5 LGA    IAH   N24211  UA       39.9  25.0  54.8     250
## 3  2013     1     1     5 JFK    MIA   N619AA  AA       39.0  27.0  61.6     260
## 4  2013     1     1     5 JFK    BQN   N804JB  B6       39.0  27.0  61.6     260
## 5  2013     1     1     6 LGA    ATL   N668DN  DL       39.9  25.0  54.8     260
## 6  2013     1     1     5 EWR    ORD   N39463  UA       39.0  28.0  64.4     260
## # … with 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>, and abbreviated variable
## #   name ¹​wind_dir

Defining the Key Columns - 2

flights2 %>% 
  left_join(planes, by = "tailnum") %>% 
  head()
## # A tibble: 6 × 16
##   year.x month   day  hour origin dest  tailnum carrier year.y type      manuf…¹
##    <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>    <int> <chr>     <chr>  
## 1   2013     1     1     5 EWR    IAH   N14228  UA        1999 Fixed wi… BOEING 
## 2   2013     1     1     5 LGA    IAH   N24211  UA        1998 Fixed wi… BOEING 
## 3   2013     1     1     5 JFK    MIA   N619AA  AA        1990 Fixed wi… BOEING 
## 4   2013     1     1     5 JFK    BQN   N804JB  B6        2012 Fixed wi… AIRBUS 
## 5   2013     1     1     6 LGA    ATL   N668DN  DL        1991 Fixed wi… BOEING 
## 6   2013     1     1     5 EWR    ORD   N39463  UA        2012 Fixed wi… BOEING 
## # … with 5 more variables: model <chr>, engines <int>, seats <int>,
## #   speed <int>, engine <chr>, and abbreviated variable name ¹​manufacturer

Defining the Key Columns - 3

flights2 %>% 
  left_join(airports, by = c("dest" = "faa")) %>% 
  head(n=3)
## # A tibble: 3 × 15
##    year month   day  hour origin dest  tailnum carrier name      lat   lon   alt
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl>
## 1  2013     1     1     5 EWR    IAH   N14228  UA      George…  30.0 -95.3    97
## 2  2013     1     1     5 LGA    IAH   N24211  UA      George…  30.0 -95.3    97
## 3  2013     1     1     5 JFK    MIA   N619AA  AA      Miami …  25.8 -80.3     8
## # … with 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
flights2 %>% 
  left_join(airports, by = c("origin" = "faa")) %>% 
  head(n=3)
## # A tibble: 3 × 15
##    year month   day  hour origin dest  tailnum carrier name      lat   lon   alt
##   <int> <int> <int> <dbl> <chr>  <chr> <chr>   <chr>   <chr>   <dbl> <dbl> <dbl>
## 1  2013     1     1     5 EWR    IAH   N14228  UA      Newark…  40.7 -74.2    18
## 2  2013     1     1     5 LGA    IAH   N24211  UA      La Gua…  40.8 -73.9    22
## 3  2013     1     1     5 JFK    MIA   N619AA  AA      John F…  40.6 -73.8    13
## # … with 3 more variables: tz <dbl>, dst <chr>, tzone <chr>

Filtering Joins - 1

Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types.

Suppose that we want to find the 10 days with the highest average delays. How would we construct the filter statement that used year, month, and day to match it back to flights?

topdest <- flights %>% 
  count(dest, sort=TRUE) %>% 
  head(10)

flights %>% 
  filter(dest %in% topdest$dest) %>% 
  head(5) # only show the first 5 rows due to the limited space on the page
## # A tibble: 5 × 19
##    year month   day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##   <int> <int> <int>    <int>       <int>   <dbl>   <int>   <int>   <dbl> <chr>  
## 1  2013     1     1      542         540       2     923     850      33 AA     
## 2  2013     1     1      554         600      -6     812     837     -25 DL     
## 3  2013     1     1      554         558      -4     740     728      12 UA     
## 4  2013     1     1      555         600      -5     913     854      19 B6     
## 5  2013     1     1      557         600      -3     838     846      -8 B6     
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Filtering Joins - 2

The previous approach is difficult to extend to the situation when including multiple variables.

flights %>% 
  semi_join(topdest) %>% 
  head(10)
## # A tibble: 10 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      542        540       2     923     850      33 AA     
##  2  2013     1     1      554        600      -6     812     837     -25 DL     
##  3  2013     1     1      554        558      -4     740     728      12 UA     
##  4  2013     1     1      555        600      -5     913     854      19 B6     
##  5  2013     1     1      557        600      -3     838     846      -8 B6     
##  6  2013     1     1      558        600      -2     753     745       8 AA     
##  7  2013     1     1      558        600      -2     924     917       7 UA     
##  8  2013     1     1      558        600      -2     923     937     -14 UA     
##  9  2013     1     1      559        559       0     702     706      -4 B6     
## 10  2013     1     1      600        600       0     851     858      -7 B6     
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
## #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

Filtering Joins - 3

  • Semi-Join - 1

The following figure shows how a semi-join works. Only the existence of a match is important. It doesn’t matter which observation is matched.

  • Semi-Join - 2



  • Anti-Join

The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match.


Set Operations - 1

The final types of two-table verb are the set operations.

These expects that x and y inputs to have the same variables, and treat the observations like sets.

Set Operations - 2

We see a simple example here.

df1 <- tribble(
  ~x, ~y,
   1,  1,
   2,  1
)
df2 <- tribble(
  ~x, ~y,
   1,  1,
   1,  2
)
intersect(df1, df2)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
union(df1, df2)
## # A tibble: 3 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     1
## 2     2     1
## 3     1     2
setdiff(df1, df2)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     2     1
setdiff(df2, df1)
## # A tibble: 1 × 2
##       x     y
##   <dbl> <dbl>
## 1     1     2

README

You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. " O’Reilly Media, Inc.".
Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.