In this lesson, we will talk about how to work with relational data. There are three families verbs designed to work with relational data:
Mutating joins: adding new variables to one data frame from matching observations (identifier) in another.
Filtering joins: filtering observations from one data frame based on whether or not they match an observation in the other table.
Set operations: treating observations as if they were set elements.
Note: This lesson is based on the book: R for Data Science (Wickham and Grolemund (2016)).
We will explore relational data from the R package nycflights13 using the two-table verbs from dplyr.
This nycflights13 package includes 5 datasets: airlines, airports, flights, planes, and weather.
airlines provides the full carrier name from its abbreviated code.
airports gives information about each airport, identified by the faa airport code.
planes provides information about each plane, identified by its tailnum.
weather gives the weather at each NYC airport for each hour.
Note: We introduce a R function p_load() in the R package pacman here. The p_load() function checks to see if a package is installed, if not it attempts to install the package from CRAN and/or any other repository in the pacman repository list.
Let’s take a look at what variables are in these datasets.
One way to understand the relationships between the tables is with a drawing:
The variables used to connect each pair of tables are called keys. A key is a variable or a set of variables that uniquely identifies an observation. There are two types of keys:
A primary key uniquely identifies an observation in its own table.
A foreign key uniquely identifies an observation in another table.
If a table/dataset lacks a primary key, it may be useful to add one with the mutate() and rownumber() functions. This makes it easier to match observations and check our work. Such a variable is called a surrogate key.
A primary key and the corresponding foreign key in another table form a relation, which is typically one-to-many.
Note:
Here we show two examples:
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
## [1] 3
A mutating join could be used to combine variables from two tables. It matches observations by their keys first, then copies across variables from one table to the other. For the convenience, we start from a small data frame.
Suppose we want to add the full airline name to the flight2 data. We can combine the airlines and flight2 data frames by left_join() function.
## # A tibble: 3 × 7
## year month day hour tailnum carrier name
## <int> <int> <int> <dbl> <chr> <chr> <chr>
## 1 2013 1 1 5 N14228 UA United Air Lines Inc.
## 2 2013 1 1 5 N24211 UA United Air Lines Inc.
## 3 2013 1 1 5 N619AA AA American Airlines Inc.
Consider two data frames.
A join is a way of connecting each row in x to some or zero rows in y.
In an actual join, matches will be indicated with dots.
The simplest type of join is the inner join. It matches pairs of observations whenever their keys are equal. The output of an inner join is a new data frame that contains the key, the x values, and the y values. We use by to indicate which variable is the key.
## # A tibble: 2 × 3
## key val_x val_y
## <dbl> <chr> <chr>
## 1 1 x1 y1
## 2 2 x2 y2
We should note that an inner join keeps only observations that appear in both data frames. An outer join keeps observation that apper in at least one of the data frames.
There are three types of outer joins.
The most commonly used join is the left join. We use a left join whenever we loop up additional data from another data frame since it keeps the original observations even when there is not a match.
When keys are not unique. There are two possibilities:
## # A tibble: 6 × 18
## year month day hour origin dest tailnum carrier temp dewp humid wind_…¹
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA 39.0 28.0 64.4 260
## 2 2013 1 1 5 LGA IAH N24211 UA 39.9 25.0 54.8 250
## 3 2013 1 1 5 JFK MIA N619AA AA 39.0 27.0 61.6 260
## 4 2013 1 1 5 JFK BQN N804JB B6 39.0 27.0 61.6 260
## 5 2013 1 1 6 LGA ATL N668DN DL 39.9 25.0 54.8 260
## 6 2013 1 1 5 EWR ORD N39463 UA 39.0 28.0 64.4 260
## # … with 6 more variables: wind_speed <dbl>, wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, time_hour <dttm>, and abbreviated variable
## # name ¹wind_dir
## # A tibble: 6 × 16
## year.x month day hour origin dest tailnum carrier year.y type manuf…¹
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <int> <chr> <chr>
## 1 2013 1 1 5 EWR IAH N14228 UA 1999 Fixed wi… BOEING
## 2 2013 1 1 5 LGA IAH N24211 UA 1998 Fixed wi… BOEING
## 3 2013 1 1 5 JFK MIA N619AA AA 1990 Fixed wi… BOEING
## 4 2013 1 1 5 JFK BQN N804JB B6 2012 Fixed wi… AIRBUS
## 5 2013 1 1 6 LGA ATL N668DN DL 1991 Fixed wi… BOEING
## 6 2013 1 1 5 EWR ORD N39463 UA 2012 Fixed wi… BOEING
## # … with 5 more variables: model <chr>, engines <int>, seats <int>,
## # speed <int>, engine <chr>, and abbreviated variable name ¹manufacturer
## # A tibble: 3 × 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA George… 30.0 -95.3 97
## 2 2013 1 1 5 LGA IAH N24211 UA George… 30.0 -95.3 97
## 3 2013 1 1 5 JFK MIA N619AA AA Miami … 25.8 -80.3 8
## # … with 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
## # A tibble: 3 × 15
## year month day hour origin dest tailnum carrier name lat lon alt
## <int> <int> <int> <dbl> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 2013 1 1 5 EWR IAH N14228 UA Newark… 40.7 -74.2 18
## 2 2013 1 1 5 LGA IAH N24211 UA La Gua… 40.8 -73.9 22
## 3 2013 1 1 5 JFK MIA N619AA AA John F… 40.6 -73.8 13
## # … with 3 more variables: tz <dbl>, dst <chr>, tzone <chr>
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types.
semi_join(x, y) keeps all observations in x that have a match in y.
anti_join(x, y) drops all observations in x that have a match in y.
Suppose that we want to find the 10 days with the highest average delays. How would we construct the filter statement that used year, month, and day to match it back to flights?
topdest <- flights %>%
count(dest, sort=TRUE) %>%
head(10)
flights %>%
filter(dest %in% topdest$dest) %>%
head(5) # only show the first 5 rows due to the limited space on the page
## # A tibble: 5 × 19
## year month day dep_time sched_dep…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 542 540 2 923 850 33 AA
## 2 2013 1 1 554 600 -6 812 837 -25 DL
## 3 2013 1 1 554 558 -4 740 728 12 UA
## 4 2013 1 1 555 600 -5 913 854 19 B6
## 5 2013 1 1 557 600 -3 838 846 -8 B6
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The previous approach is difficult to extend to the situation when including multiple variables.
## # A tibble: 10 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 542 540 2 923 850 33 AA
## 2 2013 1 1 554 600 -6 812 837 -25 DL
## 3 2013 1 1 554 558 -4 740 728 12 UA
## 4 2013 1 1 555 600 -5 913 854 19 B6
## 5 2013 1 1 557 600 -3 838 846 -8 B6
## 6 2013 1 1 558 600 -2 753 745 8 AA
## 7 2013 1 1 558 600 -2 924 917 7 UA
## 8 2013 1 1 558 600 -2 923 937 -14 UA
## 9 2013 1 1 559 559 0 702 706 -4 B6
## 10 2013 1 1 600 600 0 851 858 -7 B6
## # … with 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
## # dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## # time_hour <dttm>, and abbreviated variable names ¹sched_dep_time,
## # ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
The following figure shows how a semi-join works. Only the existence of a match is important. It doesn’t matter which observation is matched.
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match.
The final types of two-table verb are the set operations.
These expects that x and y inputs to have the same variables, and treat the observations like sets.
We see a simple example here.
You can utilize the following single character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund (2018)):
A: Switches show of current versus all slides (helpful for printing all pages)
B: Make fonts large
c: Show table of contents
S: Make fonts smaller