class: center, middle, inverse, title-slide # R Data Analytics Workshop at NTPU, 2022 ###
Ying-Ju Tessa Chen, PhD
Ying-Ju
ychen4@udayton.edu
###
June 20 & 21, 2022

---

## README

We facilitated a two-day workshop on R data analytics at the National Taipei University in June 2022. To the extent possible, the content of the lectures is recorded here. The lectures are based on [R for Data Science](https://r4ds.had.co.nz/).

### Table of Contents

- [Data Manipulation](#manipulation)
- [Data Visualization](#visualization)
- [Data Exploration](#exploration)
- [R Markdown Presentations](#rmarkdown)
- [Introduction to GitHub](#github)

---
name: manipulation

# Session 1: Data Manipulation

In this session, we will talk about data manipulation using the R package <span Style="Color:red">tidyverse</span>. This package contains a collection of R packages that help us do data management & exploration. The key packages in tidyverse are:

- dplyr: data manipulation
- ggplot2: data visualization
- purrr: functional programming toolkit
- readr: read data and write files
- tibble: simple data frame
- tidyr: data management

---

## Key packages included in tidyverse

We will focus on the following key functions in <span Style="Color:red">dplyr</span> using the dataset **flights** from the R package <span Style="color:red">nycflights13</span>.

- filter(): pick observations by their values
- arrange(): reorder the rows
- select(): select variables by their names
- mutate(): create new variables with functions of existing variables
- group_by(): group data by existing variables
- summarize(): collapse many values down to a single summary (with group_by)

---

## How the functions in dplyr work

All functions above work similarly.

1. The first argument is a data frame.
2. The subsequent arguments describe what to do with the data frame using the variable names.
3. The result is a new data frame (but we can save it back to the original data frame if needed).
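The three rules above can be seen on a toy example. This is only a minimal sketch: the tibble `toy` and its columns are made up for illustration and are not part of the workshop data, and it assumes the dplyr package is installed.

```r
library(dplyr)

# A small made-up data frame used only for illustration
toy <- tibble(id = 1:4, score = c(90, 72, 85, 60))

# Rule 1: the first argument is the data frame
# Rule 2: later arguments refer to columns by their bare names
passed <- filter(toy, score >= 75)

# Rule 3: the result is a new data frame; `toy` itself is unchanged
nrow(passed)
nrow(toy)

# Saving the result under the original name overwrites the original
toy <- arrange(toy, desc(score))
toy
```

Because every verb returns a new data frame, nothing is modified unless we explicitly assign the result.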
---

## Load packages and read the Flights Data

First, we load the necessary packages, resolve function name conflicts, and import the dataset **flights** from the R package <span Style="color:red">nycflights13</span>.

```r
library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights
```

---

## Understanding the Flights data

[Flights Data](https://rdrr.io/cran/nycflights13/man/flights.html) provides on-time data for all flights that departed NYC (i.e. JFK, LGA or EWR) in 2013. It contains 19 variables.

- year, month, day: Date of departure.
- dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone.
- sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local time zone.
- dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- carrier: Two letter carrier abbreviation. See airlines to get name.
- flight: Flight number.
- tailnum: Plane tail number. See planes for additional metadata.
- origin, dest: Origin and destination. See airports for additional metadata.
- air_time: Amount of time spent in the air, in minutes.
- distance: Distance between airports, in miles.
- hour, minute: Time of scheduled departure broken into hour and minutes.
- time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.
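Because dep_time and the other clock times are stored as HHMM integers, subtracting them directly is misleading (600 - 554 is 46, although the two times are only 6 minutes apart). The sketch below, on made-up values in a hypothetical tibble `times`, shows one way to convert such values to minutes since midnight with integer division and remainder. The new column names are assumptions chosen for illustration.

```r
library(dplyr)

# Made-up clock times in HHMM / HMM format, mimicking dep_time
times <- tibble(dep_time = c(517L, 554L, 1004L, 2356L))

times <- mutate(
  times,
  dep_hour = dep_time %/% 100,   # integer division keeps the hours
  dep_minute = dep_time %% 100,  # the remainder keeps the minutes
  dep_min_since_midnight = dep_hour * 60 + dep_minute
)
times
```

Once times are expressed in minutes, differences between them are meaningful durations.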
---

## Get a glimpse of the data

```r
glimpse(df)
```

```
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
```

---

## filter() function

<span Style="color:blue">filter()</span> is used when we want to subset observations based on a logical condition. For example, we can select all flights on December 25th using the following code.
```r
filter(df, month == 12, day == 25)
```

```
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 25 456 500 -4 649 651
## 2 2013 12 25 524 515 9 805 814
## 3 2013 12 25 542 540 2 832 850
## 4 2013 12 25 546 550 -4 1022 1027
## 5 2013 12 25 556 600 -4 730 745
## 6 2013 12 25 557 600 -3 743 752
## 7 2013 12 25 557 600 -3 818 831
## 8 2013 12 25 559 600 -1 855 856
## 9 2013 12 25 559 600 -1 849 855
## 10 2013 12 25 600 600 0 850 846
## # ... with 709 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
Christmas <- filter(df, month == 12, day == 25)
```

---

## Comparisons

R provides the standard suite: <, <=, >, >=, != (not equal), and == (equal).

If we would like to save the results to a variable as well as print them, we can wrap the assignment in parentheses:

```r
(Jan1 <- filter(df, month == 1, day == 1))
```

```
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

## Logical Operations

R provides the following syntax: & is "and", | is "or", ! is "not". The following code finds all flights that departed in July or August.
```r
filter(df, month == 7 | month == 8)
```

```
## # A tibble: 58,752 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
filter(df, month %in% c(7, 8))
```

---

**Note:**

1. If we use <span Style="color:blue">filter(df, month == 7 | 8)</span>, R evaluates the condition as (month == 7) | 8, because == binds more tightly than |. In a logical context, the nonzero number 8 is treated as **TRUE**, so the condition is TRUE for every row and this returns all flights in the data.
2. <span Style="color:blue">filter()</span> only includes rows where the condition is **TRUE**; it excludes both FALSE and NA values.

If we want to find flights that weren't delayed (on arrival or departure) by more than 1 hour, we could use either of the following code chunks.

```r
filter(df, !(arr_delay > 60 | dep_delay > 60))
```

```
## # A tibble: 295,893 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ...
with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
filter(df, arr_delay <= 60, dep_delay <= 60)
```

---

## arrange() function

<span Style="color:blue">arrange()</span> is used when we want to sort a dataset by a variable. If more than one variable is specified, the variables entered first take priority over those that come later. The following code chunk gives an example that sorts the flights by date.

```r
arrange(df, year, month, day)
```

```
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

**Note:**

1. We can save the data frame back to the original data frame after sorting the data.
2. Use <span Style="color:blue">desc()</span> to sort in descending order. The code chunk on the next slide arranges the Flights Data by arrival delay in descending order.
3. Missing values are always sorted at the end.
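Before the flights example on the next slide, the point about missing values can be checked on a tiny made-up tibble (`toy` is hypothetical and used only for illustration): the NA stays at the bottom whether we sort in ascending or descending order.

```r
library(dplyr)

# A made-up column containing one missing value
toy <- tibble(x = c(5, NA, 2, 9))

# Ascending order: NA is placed last
asc <- arrange(toy, x)$x
asc

# Descending order: NA is still placed last
dsc <- arrange(toy, desc(x))$x
dsc
```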
---

```r
result_arrange <- arrange(df, desc(arr_delay))
head(select(result_arrange, arr_delay, everything()))
```

```
## # A tibble: 6 x 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1272 2013 1 9 641 900 1301 1242
## 2 1127 2013 6 15 1432 1935 1137 1607
## 3 1109 2013 1 10 1121 1635 1126 1239
## 4 1007 2013 9 20 1139 1845 1014 1457
## 5 989 2013 7 22 845 1600 1005 1044
## 6 931 2013 4 10 1100 1900 960 1342
## # ... with 11 more variables: sched_arr_time <int>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
tail(select(result_arrange, arr_delay, everything()))
```

```
## # A tibble: 6 x 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 NA 2013 9 30 NA 1842 NA NA
## 2 NA 2013 9 30 NA 1455 NA NA
## 3 NA 2013 9 30 NA 2200 NA NA
## 4 NA 2013 9 30 NA 1210 NA NA
## 5 NA 2013 9 30 NA 1159 NA NA
## 6 NA 2013 9 30 NA 840 NA NA
## # ... with 11 more variables: sched_arr_time <int>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

## select() function

<span Style="color:blue">select()</span> is used when we would like to keep only some of the variables in the data. For example, we can use the following code chunk to select only a few variables from the Flights Data.

```r
# select specific columns
select(df, year, month, day)

# select all columns between year and day
select(df, year:day)

# select all columns except those from year to day
select(df, -(year:day))
```

**Note:**

1. We can use a minus sign - to drop variables.
2. There are several helper functions we can use within <span Style="color:blue">select()</span>. See <span Style="color:blue">?select</span> for the information.
3. 
<span Style="color:blue">select()</span> can be used with the <span Style="color:blue">everything()</span> function when we have a handful of variables we would like to move to the start of the data frame.

---

```r
# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())
```

```
## # A tibble: 336,776 x 19
## carrier origin dest distance year month day dep_time sched_dep_time
## <chr> <chr> <chr> <dbl> <int> <int> <int> <int> <int>
## 1 UA EWR IAH 1400 2013 1 1 517 515
## 2 UA LGA IAH 1416 2013 1 1 533 529
## 3 AA JFK MIA 1089 2013 1 1 542 540
## 4 B6 JFK BQN 1576 2013 1 1 544 545
## 5 DL LGA ATL 762 2013 1 1 554 600
## 6 UA EWR ORD 719 2013 1 1 554 558
## 7 B6 EWR FLL 1065 2013 1 1 555 600
## 8 EV LGA IAD 229 2013 1 1 557 600
## 9 B6 JFK MCO 944 2013 1 1 557 600
## 10 AA LGA ORD 733 2013 1 1 558 600
## # ... with 336,766 more rows, and 10 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## # tailnum <chr>, air_time <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

## mutate() function

<span Style="color:blue">mutate()</span> is used when we would like to add a new variable / column using the other variables in the data.

**Note:** <span Style="color:blue">mutate()</span> always adds new columns at the end of the data.

First, we start by creating a smaller dataset with a few variables.

```r
# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)
```

---

We create four variables using variables in the data.

```r
mutate(df1,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60,
  hours = air_time / 60,
  gain_per_hour = gain / hours)
```

```
## # A tibble: 336,776 x 11
## year month day dep_delay arr_delay distance air_time gain speed hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 370. 3.78
## 2 2013 1 1 4 20 1416 227 16 374. 
3.78
## 3 2013 1 1 2 33 1089 160 31 408. 2.67
## 4 2013 1 1 -1 -18 1576 183 -17 517. 3.05
## 5 2013 1 1 -6 -25 762 116 -19 394. 1.93
## 6 2013 1 1 -4 12 719 150 16 288. 2.5
## 7 2013 1 1 -5 19 1065 158 24 404. 2.63
## 8 2013 1 1 -3 -14 229 53 -11 259. 0.883
## 9 2013 1 1 -3 -8 944 140 -5 405. 2.33
## 10 2013 1 1 -2 8 733 138 10 319. 2.3
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
```

---

If we want to keep only the new variables, use <span Style="color:blue">transmute()</span>.

```r
transmute(df1,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60,
  hours = air_time / 60,
  gain_per_hour = gain / hours)
```

```
## # A tibble: 336,776 x 4
## gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl>
## 1 9 370. 3.78 2.38
## 2 16 374. 3.78 4.23
## 3 31 408. 2.67 11.6
## 4 -17 517. 3.05 -5.57
## 5 -19 394. 1.93 -9.83
## 6 16 288. 2.5 6.4
## 7 24 404. 2.63 9.11
## 8 -11 259. 0.883 -12.5
## 9 -5 405. 2.33 -2.14
## 10 10 319. 2.3 4.35
## # ... with 336,766 more rows
```

**Note:** There are many functions for creating new variables that we can use with <span Style="color:blue">mutate()</span>. The key property is that the function must be vectorized, which means it must take a vector of values as input and return a vector with the same number of values as output.

---

## group_by() & summarize() functions

<span Style="color:blue">summarize()</span> collapses a data frame to a single row. For example, we can summarize the average departure delay using the following code chunk.

```r
summarize(df, delay = mean(dep_delay, na.rm = TRUE))
```

```
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
```

---

In general, <span Style="color:blue">summarize()</span> is used together with <span Style="color:blue">group_by()</span>, so that the summary is computed separately for each group of rows. <span Style="color:blue">group_by()</span> is used to group rows by one or more variables, giving priority to the variable entered first.
```r
group_by(df, year, month, day)
```

```
## # A tibble: 336,776 x 19
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

The result shows the original data but indicates the groups: year, month, and day in our example.

---

For example, we can study the average departure / arrival delays for each day.

```r
by_day <- group_by(df, year, month, day)
summarize(by_day,
  ave_dep_delay = mean(dep_delay, na.rm = TRUE),
  ave_arr_delay = mean(arr_delay, na.rm = TRUE)
)
```

```
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day ave_dep_delay ave_arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7
## 2 2013 1 2 13.9 12.7
## 3 2013 1 3 11.0 5.73
## 4 2013 1 4 8.95 -1.93
## 5 2013 1 5 5.73 -1.53
## 6 2013 1 6 7.15 4.24
## 7 2013 1 7 5.42 -4.95
## 8 2013 1 8 2.55 -3.23
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90
## # ... with 355 more rows
```

---

## Combining Multiple Operations with the Pipe

In order to handle data processing well in data science, it is essential to know how to use pipes. Pipes are a great tool for expressing a sequence of multiple operations, and they therefore increase the readability of the code. The pipe, %>%, comes from the package <span Style="color:red">magrittr</span> and is loaded automatically when tidyverse is loaded.
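As a minimal sketch of what the pipe does: `x %>% f(y)` is evaluated as `f(x, y)`, so a piped chain and the equivalent nested call give the same result. The tibble `toy` and its columns are made up for illustration.

```r
library(dplyr)

# A small made-up data frame with a grouping column and a value column
toy <- tibble(g = c("a", "a", "b"), v = c(1, 3, 10))

# Nested calls: read from the inside out
nested <- summarize(group_by(toy, g), total = sum(v))

# Piped calls: read from top to bottom
piped <- toy %>%
  group_by(g) %>%
  summarize(total = sum(v))

# Both versions compute the same group totals
nested$total
piped$total
```

The piped version lists the steps in the order in which they are carried out, which is why it tends to be easier to read as the chain grows.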
The logic when using the pipe: object %>% function1 %>% function2 ...

Suppose we want to group the Flights Data by destination, find the number of flights, the average distance, and the average arrival delay at each destination, and then keep only destinations with more than 20 flights, excluding Honolulu airport (HNL). We may use the following code chunk to achieve this.

```r
by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
  count = n(),
  ave_dist = mean(distance, na.rm = TRUE),
  ave_arr_delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")
```

---

The following code chunk performs the same task as on the previous slide using the pipe, %>%, which makes the code easier to read.

```r
delay <- df %>%
  group_by(dest) %>%
  summarize(
    count = n(),
    ave_dist = mean(distance, na.rm = TRUE),
    ave_arr_delay = mean(arr_delay, na.rm = TRUE)
  ) %>%
  filter(count > 20, dest != "HNL")
delay
```

```
## # A tibble: 96 x 4
## dest count ave_dist ave_arr_delay
## <chr> <int> <dbl> <dbl>
## 1 ABQ 254 1826 4.38
## 2 ACK 265 199 4.85
## 3 ALB 439 143 14.4
## 4 ATL 17215 757. 11.3
## 5 AUS 2439 1514. 6.02
## 6 AVL 275 584. 8.00
## 7 BDL 443 116 7.05
## 8 BGR 375 378 8.03
## 9 BHM 297 866. 16.9
## 10 BNA 6333 758. 11.8
## # ... with 86 more rows
```

---

## Useful Summary Functions

- Measures of location for a quantitative variable: <span Style="color:blue">mean()</span>, <span Style="color:blue">median()</span>
- Measures of spread for a quantitative variable: <span Style="color:blue">sd()</span>, <span Style="color:blue">IQR()</span>, <span Style="color:blue">mad()</span>

Here, `\(MAD = median(|x_i - median(x)|)\)` is called the median absolute deviation, which may be more useful if we have outliers. (By default, R's <span Style="color:blue">mad()</span> additionally multiplies this value by the constant 1.4826.)

```r
not_cancelled <- df %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>%
  group_by(dest) %>%
  summarize(
    distance_mu = mean(distance),
    distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd)) %>%
  head()
```

```
## # A tibble: 6 x 3
## dest distance_mu distance_sd
## <chr> <dbl> <dbl>
## 1 EGE 1736. 
10.5
## 2 SAN 2437. 10.4
## 3 SFO 2578. 10.2
## 4 HNL 4973. 10.0
## 5 SEA 2413. 9.98
## 6 LAS 2241. 9.91
```

---

- Measures of rank: <span Style="color:blue">min()</span>, <span Style="color:blue">quantile()</span>, <span Style="color:blue">max()</span>

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time), # the earliest departure time each day
    last = max(dep_time)   # the latest departure time each day
  ) %>%
  head()
```

```
## # A tibble: 6 x 5
## # Groups: year, month [1]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
```

---

- Measures of position: <span Style="color:blue">first()</span>, <span Style="color:blue">nth(x, 2)</span>, <span Style="color:blue">last()</span>

The following code chunk finds the first and last departure for each day.

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  ) %>%
  head()
```

```
## # A tibble: 6 x 5
## # Groups: year, month [1]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
```

---

- Counts: We have seen <span Style="color:blue">n()</span>, which takes no arguments and returns the size of the current group. To count the number of non-missing values, we can use <span Style="color:blue">sum(!is.na(x))</span>. To count the number of distinct values, use <span Style="color:blue">n_distinct()</span>.

```r
not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers)) %>%
  head()
```

```
## # A tibble: 6 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
```

---

We can use <span Style="color:blue">count()</span> directly if all we want is a count.
```r
not_cancelled %>%
  count(dest) %>%
  head(5)
```

```
## # A tibble: 5 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
```

We can optionally provide a weight variable. For example, we could use this to "count" the total number of miles a plane flew.

```r
not_cancelled %>%
  count(tailnum, wt = distance) %>%
  head(5)
```

```
## # A tibble: 5 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
```

---

- Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, <span Style="color:blue">sum()</span> gives the number of TRUEs and <span Style="color:blue">mean()</span> gives the proportion of TRUEs in the variable. For example, we can check how many flights left before 5AM using the following code chunk.

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(n_early = sum(dep_time < 500)) %>%
  head()
```

```
## # A tibble: 6 x 4
## # Groups: year, month [1]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
```

---

Or what proportion of flights arrive more than one hour late?

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60)) %>%
  head()
```

```
## # A tibble: 6 x 4
## # Groups: year, month [1]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
```

---

## Grouping by Multiple Variables

Here we show some examples to demonstrate how to group the data by multiple variables.
```r
per_day <- df %>%
  group_by(year, month, day) %>%
  summarize(flights = n())
per_day %>% head()
```

```
## # A tibble: 6 x 4
## # Groups: year, month [1]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
```

---

```r
per_month <- summarize(per_day, flights = sum(flights))
per_month %>% head()
```

```
## # A tibble: 6 x 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
```

```r
per_year <- summarize(per_month, flights = sum(flights))
per_year
```

```
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
```

---

## Ungrouping

If we need to remove grouping, and return to operations on ungrouped data, use <span Style="color:blue">ungroup()</span>.

```r
daily <- df %>% group_by(year, month, day)
daily %>%
  ungroup() %>%              # no longer grouped by date
  summarize(flights = n())   # all flights
```

```
## # A tibble: 1 x 1
## flights
## <int>
## 1 336776
```

---

## Grouped Mutates and Filters

We can also do convenient operations with <span Style="color:blue">mutate()</span> and <span Style="color:blue">filter()</span>. The following code chunk finds the worst members of each group.

```r
df1 %>%
  group_by(year, month, day) %>%
  filter(rank(desc(arr_delay)) < 10) %>%
  head()
```

```
## # A tibble: 6 x 7
## # Groups: year, month, day [1]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46
## 4 2013 1 1 157 174 213 60
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
```

---

The following code chunk finds all groups bigger than a threshold.
```r
popular_dests <- df %>%
  group_by(dest) %>%
  filter(n() > 365)
popular_dests %>% head()
```

```
## # A tibble: 6 x 19
## # Groups: dest [5]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

The following code chunk standardizes within each group to compute per-group metrics.

```r
popular_dests %>%
  filter(arr_delay > 0) %>%
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  select(year:day, arr_delay, prop_delay) %>%
  head()
```

```
## # A tibble: 6 x 6
## # Groups: dest [4]
## dest year month day arr_delay prop_delay
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 IAH 2013 1 1 11 0.000111
## 2 IAH 2013 1 1 20 0.000201
## 3 MIA 2013 1 1 33 0.000235
## 4 ORD 2013 1 1 12 0.0000424
## 5 FLL 2013 1 1 19 0.0000938
## 6 ORD 2013 1 1 8 0.0000283
```

---
name: visualization

# Session 2: Data Visualization with ggplot2

In this session, we will introduce how to visualize our data using <span Style="color:red">ggplot2</span> and <span Style="color:red">plotly</span>. The lecture is based on the [UC Business Analytics R Programming Guide](https://uc-r.github.io/ggplot_intro).

While we can use the built-in functions in the base package in R to obtain plots, the package <span Style="color:red">ggplot2</span> creates advanced graphs with simple and flexible commands.

---

## Load packages and read the Fuel Economy Data

First, we load the necessary packages, resolve function name conflicts, and get a glimpse of the dataset **mpg** from the R package <span Style="color:red">ggplot2</span>.
```r
library(tidyverse)
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("filter", "dplyr")
glimpse(mpg)
```

```
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c~
```

---

Now we need to understand the data and each variable in the data. This dataset contains 38 popular models of cars from 1999 to 2008 (see [Fuel Economy Data](https://ggplot2.tidyverse.org/reference/mpg.html)).

- manufacturer: car manufacturer
- model: model name
- displ: engine displacement, in liters
- year: year of manufacturing (1999-2008)
- cyl: number of cylinders
- trans: type of transmission
- drv: drive type (f = front-wheel, r = rear-wheel, 4 = 4-wheel)
- cty: city mileage, in miles per gallon
- hwy: highway mileage, in miles per gallon
- fl: fuel type (diesel, petrol, electric, etc.)
- class: vehicle class, 7 types (compact, SUV, minivan, etc.)

---

## Grammar of Graphics

The basic idea of creating plots using <span Style="color:red">ggplot2</span> is to specify each of the following components and combine them with <span Style="color:blue">+</span>.
### ggplot() function

The <span Style="color:blue">ggplot()</span> function plays an important role in data visualization as it is very flexible for plotting many different types of graphic displays.

The logic when using the <span Style="color:blue">ggplot()</span> function is: `ggplot(data, mapping) + geom_function()`.

---

## The Basics

First, we see how the <span Style="color:blue">ggplot()</span> function works by creating a canvas and including variables.

```r
# need this package to create side-by-side plots
library(patchwork)

# create canvas
p1 <- ggplot(mpg)

# variables of interest mapped
p2 <- ggplot(mpg, mapping = aes(x = displ, y = hwy))

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/ggplot_basics-1.png" width="65%" style="display: block; margin: auto;" />

---

The following code chunk shows how we can obtain a scatter plot to study the relationship between engine displacement and highway mileage per gallon.

```r
# data plotted
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```

<img src="data_analytics_workshop_2022_files/figure-html/ggplot_basics_1-1.png" width="65%" style="display: block; margin: auto;" />

---

## Aesthetic Mappings

The aesthetic mappings allow us to select the variables to be plotted and to use data properties to influence visual characteristics such as color, size, shape, position, etc. As a result, each visual characteristic can encode a different part of the data and be utilized to communicate information.

All aesthetics for a plot are specified in the <span Style="color:blue">aes()</span> function call.

For example, we can add a mapping from the class of the cars to a color characteristic:

```r
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_color1-1.png" width="65%" style="display: block; margin: auto;" />

---

**Note:**

1. 
We should note that in the above code chunk, "class" is a variable in the data and therefore, the command specifies that a categorical variable is used as the third variable in the figure.
2. Using the <span Style="color:blue">aes()</span> function will cause the visual channel to be based on the data specified in the argument. For example, using `aes(color = "blue")` won't cause the geometry's color to be "blue", but will instead cause the visual channel to be mapped from the vector c("blue"), as if we only had a single type of engine that happened to be called "blue".

If we wish to apply an aesthetic property to an entire geometry, we can set that property as an argument to the geom method, outside of the <span Style="color:blue">aes()</span> call.

---

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_color2-1.png" width="65%" style="display: block; margin: auto;" />

---

## Specifying Geometric Shapes

Building on these basics, we can use <span Style="color:red">ggplot2</span> to create almost any kind of plot we may want. These plots are declared using functions that follow from the Grammar of Graphics.

<span Style="color:red">ggplot2</span> supports a number of different types of geometric objects, including:

- geom_bar(): bar charts
- geom_boxplot(): boxplots
- geom_histogram(): histograms
- geom_line(): lines
- geom_map(): polygons in the shape of a map.
- geom_point(): individual points
- geom_polygon(): arbitrary shapes
- geom_smooth(): smoothed lines

Each of these geometries will make use of the aesthetic mappings provided, although the visual qualities to which the data will be mapped will differ. For example, we can map data to the shape of a geom_point (e.g., if they should be circles or squares), or we can map data to the line-type of a geom_line (e.g., if it is solid or dotted), but not vice versa.

---

Almost all geoms require an x and y mapping at the bare minimum.
```r
# x and y mapping needed when creating a scatterplot
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth()

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom1a-1.png" width="65%" style="display: block; margin: auto;" />

---

There is no y mapping needed when creating a bar chart or a histogram.

```r
p1 <- ggplot(mpg, aes(x = class)) +
  geom_bar()

p2 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram()

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom1b-1.png" width="65%" style="display: block; margin: auto;" />

---

We improve the quality of the figures on the previous slide.

```r
# daytonred is assumed to be a color defined earlier in the slides (e.g., in a setup chunk)
# mapping class to y gives a horizontal bar chart; geom_bar() still computes the counts
p1 <- ggplot(mpg, aes(y = class)) +
  geom_bar(fill = daytonred, alpha = 0.2)

# histogram on the density scale with a density curve overlaid
# (after_stat(density) replaces the older ..density.. notation)
p2 <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = density(mpg$hwy)$bw) +
  geom_density(fill = daytonred, alpha = 0.2)

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom1c-1.png" width="65%" style="display: block; margin: auto;" />

---

What makes this really powerful is that we can add multiple geometries to a plot, which allows us to create complex graphics showing multiple aspects of our data.

```r
# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom2-1.png" width="65%" style="display: block; margin: auto;" />

---

**Note:**

1. Since the aesthetics for each geom can be different, we could show multiple lines on the same plot (or with different colors, styles, etc.).
2. It is also possible to give each geom a different data argument, so that we can show multiple data sets in the same plot.

If we specify an aesthetic within <span Style="color:blue">ggplot()</span>, it will be passed on to each geom that follows.
Alternatively, we can specify aesthetics within an individual geom, so that they apply only to that specific layer (e.g., geom_point).

---

For example, we can plot both points and a smoothed line for the same x and y variables but specify unique colors within each geom:

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(color = "red")
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom3-1.png" width="65%" style="display: block; margin: auto;" />

---

```r
# color aesthetic passed to each geom layer
p1 <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  geom_smooth(se = FALSE)

# color aesthetic specified for only the geom_point layer
p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE)

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom4a-1.png" width="65%" style="display: block; margin: auto;" />

---

```r
# color aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE)
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_geom4b-1.png" width="65%" style="display: block; margin: auto;" />

---

## Statistical Transformations

The following bar chart shows the frequency distribution of vehicle class. Notice that the y axis is defined as the count of observations in each class. This count is not part of the data set, but is instead a statistical transformation that geom_bar automatically applies to the data. In particular, it applies the stat_count transformation.

```r
ggplot(mpg, aes(x = class)) +
  geom_bar()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_bar-1.png" width="65%" style="display: block; margin: auto;" />

---

<span Style="color:red">ggplot2</span> supports many different statistical transformations.
For example, the “identity” transformation will leave the data “as is”. We can specify which statistical transformation a geom uses by passing it as the stat argument. For example, suppose our data already had the count as a variable:

```r
(class_count <- count(mpg, class))
```

```
## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62
```

---

We can use `stat = "identity"` within geom_bar to take the bar heights directly from this variable. Also, note that we now include n as our y variable:

```r
ggplot(class_count, aes(x = class, y = n)) +
  geom_bar(stat = "identity")
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_count1-1.png" width="65%" style="display: block; margin: auto;" />

---

We can also call <span Style="color:blue">stat_</span> functions directly to add additional layers. For example, here we create a scatter plot of highway miles for each displacement value and then use <span Style="color:blue">stat_summary()</span> to plot the mean highway miles at each displacement value.

```r
# fun replaces the deprecated fun.y argument (ggplot2 >= 3.3.0)
ggplot(mpg, aes(displ, hwy)) +
  geom_point(color = "grey") +
  stat_summary(fun = "mean", geom = "line", size = 1, linetype = "dashed")
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_summary-1.png" width="65%" style="display: block; margin: auto;" />

---

## Position Adjustments

In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of “rules” as to how different components should be positioned relative to each other. This position is noticeable in <span Style="color:blue">geom_bar()</span> if we map a different variable to the fill color.
```r
# bar chart of class, colored by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_position1-1.png" width="65%" style="display: block; margin: auto;" />

The <span Style="color:blue">geom_bar()</span> by default uses a position adjustment of `stack`, which makes each rectangle's height proportional to its value and stacks them on top of each other.

---

We can use the position argument to specify what position adjustment rules to follow:

```r
# position = "dodge": values next to each other
p1 <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "dodge")

# position = "fill": percentage chart
p2 <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill")

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_position2-1.png" width="65%" style="display: block; margin: auto;" />

**Note:** We may need to check the documentation for each particular geom to learn more about its positioning adjustments.

---

## Managing Scales

Whenever we specify an aesthetic mapping, <span Style="color:blue">ggplot()</span> uses a particular **scale** to determine the range of values that the data should map to. It automatically adds a scale for each mapping to the plot.

```r
# color the data by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_scale1-1.png" width="65%" style="display: block; margin: auto;" />

---

However, the scale used in the figure can be changed if needed. Each scale can be represented by a function with the following name: **scale_**, followed by the name of the aesthetic property, followed by an `_` and the name of the scale. A continuous scale will handle things like numeric data, whereas a discrete scale will handle things like colors.
```r
# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_scale2-1.png" width="65%" style="display: block; margin: auto;" />

---

While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, we can use a scale to change the direction of an axis:

```r
# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_reverse-1.png" width="65%" style="display: block; margin: auto;" />

Similarly, we can use <span Style="color:blue">scale_x_log10()</span> and <span Style="color:blue">scale_x_sqrt()</span> to transform the scale.

---

We can use scales to format the axes as well.

```r
ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, by = .2), labels = scales::percent) +
  labs(y = "Percent")
```

<img src="data_analytics_workshop_2022_files/figure-html/visual_scale3-1.png" width="65%" style="display: block; margin: auto;" />

---

## Use Pre-Defined Palettes

A common parameter to change is which set of colors to use in a plot. While we can use the default coloring, a more common option is to leverage the pre-defined palettes from [ColorBrewer](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3). These color sets have been carefully designed to look good and to be viewable to people with certain forms of color blindness. We can leverage ColorBrewer palettes by adding <span Style="color:blue">scale_color_brewer()</span> to the plot, passing the palette as an argument.
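Before choosing a palette, it can help to look up which palette names exist and how many colors each one provides. The following is a minimal sketch, assuming the <span Style="color:red">RColorBrewer</span> package is installed (it normally comes along with ggplot2 as a dependency); the choice of the "Dark2" palette is only an illustration.

```r
library(ggplot2)

# brewer.pal.info is a data frame shipped with RColorBrewer;
# its row names are the palette names
pal_info <- RColorBrewer::brewer.pal.info

# qualitative palettes that are flagged as colorblind-friendly
cb_qual <- subset(pal_info, category == "qual" & colorblind)
rownames(cb_qual)

# class has 7 levels, so we need a palette with at least 7 colors;
# "Dark2" is an illustrative choice
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

p
```
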
```r
# default color brewer
p1 <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer()

# specifying color palette
p2 <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Set3")

p1 + p2
```

---

The figures on the previous slide.

<img src="data_analytics_workshop_2022_files/figure-html/color_brewer1-1.png" width="65%" style="display: block; margin: auto;" />

---

## Coordinate Systems

Similar to scales, coordinate systems are specified with functions that all start with **coord_** and are added as a layer. There are a number of different possible coordinate systems to use, including:

- coord_cartesian: the default Cartesian coordinate system, where you specify x and y values
- coord_flip: a Cartesian system with the x and y flipped
- coord_fixed: a Cartesian system with a “fixed” aspect ratio
- coord_polar: a plot using polar coordinates
- coord_quickmap: a coordinate system that approximates a good aspect ratio for maps

---

```r
# zoom in with coord_cartesian
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 5))

# flip x and y axis with coord_flip
p2 <- ggplot(mpg, aes(x = class)) +
  geom_bar() +
  coord_flip()

p1 + p2
```

<img src="data_analytics_workshop_2022_files/figure-html/coord-1.png" width="65%" style="display: block; margin: auto;" />

---

## Facets

If we want to divide the information into multiple subplots, facets are the way to go. They allow us to view a separate plot for each level of a categorical variable. We can construct a plot with multiple facets by using <span Style="color:blue">facet_wrap()</span>. This will produce a “row” of subplots, one for each level of the categorical variable (the number of rows can be specified with an additional argument).
```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)
```

<img src="data_analytics_workshop_2022_files/figure-html/facets1-1.png" width="35%" style="display: block; margin: auto;" />

---

**Note:**

1. We can use <span Style="color:blue">facet_grid()</span> to facet the data by more than one categorical variable.
2. We use a tilde (~) in our facet functions. With <span Style="color:blue">facet_grid()</span> the variable to the left of the tilde will be represented in the rows and the variable to the right will be represented across the columns.

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(year ~ cyl)
```

<img src="data_analytics_workshop_2022_files/figure-html/facets2-1.png" width="65%" style="display: block; margin: auto;" />

---

## Labels & Annotations

Textual annotations and labels (on the plot, axes, geometry, and legend) are crucial for understanding and presenting information.

- labs: assign title, subtitle, caption, x & y labels

We can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!).

```r
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
       x = "Engine Displacement (liters)",
       y = "Fuel Efficiency (miles per gallon)",
       color = "Car Type")
```

---

The figure on the previous slide.

<img src="data_analytics_workshop_2022_files/figure-html/labels1-1.png" width="65%" style="display: block; margin: auto;" />

---

It is also possible to add labels into the plot itself (e.g., to label each point or line) by adding a new geom_text or geom_label to the plot; effectively, we are plotting an extra set of data which happens to consist of text labels (here, the model names).
```r
# the car with the best highway efficiency in each class
best_in_class <- mpg %>%
  group_by(class) %>%
  filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_label(data = best_in_class, aes(label = model), alpha = 0.5)
```

<img src="data_analytics_workshop_2022_files/figure-html/label_points-1.png" width="65%" style="display: block; margin: auto;" />

---

However, two labels overlap one another in the top left part of the plot on the previous slide. We can use <span Style="color:blue">geom_text_repel()</span> from the <span Style="color:red">ggrepel</span> package to help position labels.

```r
library(ggrepel)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_text_repel(data = best_in_class, aes(label = model))
```

<img src="data_analytics_workshop_2022_files/figure-html/ggrepel-1.png" width="65%" style="display: block; margin: auto;" />

---

## Themes

Whenever we want to customize titles, labels, fonts, background, grid lines, and legends, we can use themes.

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       x = "Engine Displacement (Liters)",
       y = "Fuel Efficiency (Miles per gallon)") +
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))
```

<img src="data_analytics_workshop_2022_files/figure-html/theme1-1.png" width="65%" style="display: block; margin: auto;" />

---

**Note:**

1. We only list some key components here.
2. See [Modify Components of A Theme](https://ggplot2.tidyverse.org/reference/theme.html) and [Complete Themes](https://ggplot2.tidyverse.org/reference/ggtheme.html) for more details about the use of themes.
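
A complete theme and individual <span Style="color:blue">theme()</span> settings can also be combined. The following is a minimal sketch of that idea; the choice of <span Style="color:blue">theme_minimal()</span>, the legend position, and the bold title are illustrative assumptions rather than settings used elsewhere in these slides.

```r
library(ggplot2)

# start from a complete theme, then override individual components;
# theme_minimal() and the settings below are illustrative choices
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom",
        plot.title = element_text(face = "bold"))

p
```

Because later layers override earlier ones, the <span Style="color:blue">theme()</span> call should come after the complete theme.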
---

## Data Visualization with R Package: plotly

The R package <span Style="color:red">plotly</span> makes it easy to create interactive graphic displays once we already know how to use <span Style="color:blue">ggplot()</span> to create graphs. The following code chunk produces an interactive version of a scatter plot we created in the previous section.

```r
library(plotly)

p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

ggplotly(p1, width = 400, height = 300)
```
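
The hover text of an interactive plot can be customized as well. The following is a minimal sketch, assuming <span Style="color:red">plotly</span> and <span Style="color:red">ggplot2</span> are installed; it relies on the unofficial `text` aesthetic that <span Style="color:blue">ggplotly()</span> recognizes together with its `tooltip` argument, and the label format is only an illustration.

```r
library(ggplot2)
library(plotly)

# the text aesthetic is not used by ggplot2 itself (it may warn about an
# unknown aesthetic), but ggplotly() can show it in the hover tooltip
p_hover <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point(aes(text = paste(manufacturer, model)))

# show only the custom text when hovering over a point
p_int <- ggplotly(p_hover, tooltip = "text")

p_int
```
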
---
name: exploration

# Session 3: Data Exploration with R Package: DataExplorer

In data science, it is important to get to know the data before advanced modeling or further analysis. We should understand what the data are about, what variables we have, the size of the data, how many values are missing, the data type of each variable, any possible relationships between variables, and anything unusual or interesting in the data.

We will use the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance) to go over the use of functions in the package <span Style="color:red">DataExplorer</span>. For demonstration purposes, we modified this data by introducing random missing values in some variables.

---

## Import the package, load the data and get a glimpse of the data

```r
library(DataExplorer)

insurance <- read_csv("https://raw.githubusercontent.com/Ying-Ju/R_Data_Analytics_Series_NTPU/main/insurance.csv")

glimpse(insurance)
```

```
## Rows: 1,338
## Columns: 7
## $ age      <dbl> 19, 18, 28, NA, 32, NA, 46, 37, 37, 60, 25, 62, NA, 56, 27, 1~
## $ sex      <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi      <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker   <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region   <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges  <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
```

---

First, we check the basic description of the data using the function <span Style="color:blue">plot_intro()</span> in the package <span Style="color:red">DataExplorer</span>.
```r
plot_intro(insurance)
```

<img src="data_analytics_workshop_2022_files/figure-html/intro-1.png" width="65%" style="display: block; margin: auto;" />

---

Then, we study the distribution of missing values in the data using the function <span Style="color:blue">plot_missing()</span> in the package <span Style="color:red">DataExplorer</span>.

```r
plot_missing(insurance)
```

<img src="data_analytics_workshop_2022_files/figure-html/missingdist-1.png" width="65%" style="display: block; margin: auto;" />

Since there are only 7 variables, we will study all variables in the data.

---

Now, we study the frequency distribution of all categorical variables in the data using the function <span Style="color:blue">plot_bar()</span> in the package <span Style="color:red">DataExplorer</span>.

```r
plot_bar(insurance)
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_cat-1.png" width="65%" style="display: block; margin: auto;" />

---

The following code shows the sum of charges for each level of each categorical variable in the data.

```r
plot_bar(insurance, with = "charges")
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_bar1-1.png" width="65%" style="display: block; margin: auto;" />

---

Next, we study the distribution of all quantitative variables in the data using the function <span Style="color:blue">plot_histogram()</span> in the package <span Style="color:red">DataExplorer</span>.

```r
plot_histogram(insurance, ncol = 2)
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_num-1.png" width="65%" style="display: block; margin: auto;" />

---

We study the distributions of age, bmi, and charges with respect to region individually using the function <span Style="color:blue">plot_boxplot()</span> in the package <span Style="color:red">DataExplorer</span>.
```r
# keep the quantitative variables plus region, and drop rows with missing values
insurance_Q <- insurance %>%
  select(age, bmi, charges, region) %>%
  drop_na()

plot_boxplot(insurance_Q, by = "region")
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_boxplot-1.png" width="65%" style="display: block; margin: auto;" />

---

We can study the association between each quantitative variable and a given response variable in the data using the function <span Style="color:blue">plot_scatterplot()</span> in the package <span Style="color:red">DataExplorer</span>. Here, we study the association between charges and the other quantitative variables in the data.

```r
plot_scatterplot(insurance_Q %>% select(-region), by = "charges")
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_scatterplot0-1.png" width="65%" style="display: block; margin: auto;" />

---

We can also draw the scatterplot from a sample of the observations (here, 100 rows).

```r
plot_scatterplot(insurance_Q %>% select(-region), by = "charges", sampled_rows = 100)
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_scatterplot1-1.png" width="65%" style="display: block; margin: auto;" />

---

```r
plot_scatterplot(insurance_Q %>% filter(region == "northwest") %>% select(-region), by = "charges")
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_scatterplot2-1.png" width="65%" style="display: block; margin: auto;" />

The above figure only shows the association between charges and the other quantitative variables in the northwest region.

---

We can check the correlation of all quantitative variables in the data using the function <span Style="color:blue">plot_correlation()</span> in the package <span Style="color:red">DataExplorer</span>.

```r
plot_correlation(insurance_Q %>% select(-region), cor_args = list("use" = "complete.obs"))
```

<img src="data_analytics_workshop_2022_files/figure-html/EDA_corr-1.png" width="65%" style="display: block; margin: auto;" />

---

If you are new to data exploration and have no idea where to start, the <span Style="color:blue">create_report()</span> function in the package <span Style="color:red">DataExplorer</span> can help by creating a data exploration report for the data.

```r
# use forward slashes (or doubled backslashes) in Windows paths;
# a single backslash is an invalid escape in an R string
create_report(insurance, output_file = "report.html", output_dir = "C:/Users/Tessa/Document")
```

**Note:** Use <span Style="color:blue">help("create_report")</span> to find the usage of <span Style="color:blue">create_report()</span>.

---
name: rmarkdown

# Session 4: Learn R Markdown Presentations

In this session, we will introduce

1. R Markdown Presentation
2. Flex Dashboard

### What is R Markdown?

R Markdown is a file format for creating dynamic documents with R and RStudio. R Markdown documents are written in Markdown, an easy-to-write plain-text format, with embedded R code.

---

## R Markdown Presentation

In order to create an R Markdown presentation, we click <span Style="color:#3384FF">File</span>, then <span Style="color:#3384FF">New File</span>, and then <span Style="color:#3384FF">R markdown ...</span>

There are four options:

- HTML (ioslides)

  This format allows us to create a slide show, and the slides can be broken up into sections by using the heading tags `#` and `##`. If a header is not needed, a new slide can be created using a horizontal rule (`---`).

- HTML (Slidy)

  Similar to ioslides, this format allows us to create a slide show broken up into sections by using the heading tag `##`. If a header is not needed, a new slide can be created using a horizontal rule (`---`). A Slidy presentation gives a table of contents while an ioslides presentation doesn't.

- PDF (beamer)

  This format allows us to create a beamer presentation (LaTeX). The slides can be broken up into sections by using the heading tags `#` and `##`. If a header is not needed, a new slide can be created using a horizontal rule (`---`).
- PowerPoint

---

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/presentation.jpg" height="600px" /></center>

---

In the following, we show an example of the header of an R Markdown file.

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/Rmarkdown_heading.jpg" height="200px" /></center>

We can use the output option to choose which presentation format we would like to have.

- output: ioslides_presentation
- output: slidy_presentation
- output: beamer_presentation
- output: powerpoint_presentation

To render an R Markdown document into its final output format, we can click the <span Style="color:#3384FF">Knit</span> button in RStudio, and RStudio will show a preview of the result. Further settings for presentations can be found at [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/presentations.html).

---

### xaringan Presentation

An easy way to start creating a xaringan presentation is to use the R Markdown template with <span Style="color:#3384FF">Ninja Presentation</span> or <span Style="color:#3384FF">Ninja Themed Presentation</span>.

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/xaringan.jpg" height="400px" /></center>

A comprehensive tutorial on xaringan presentations can be found at [xaringan Presentation](https://bookdown.org/yihui/rmarkdown/xaringan.html).

---

## Flex Dashboard

An easy way to start creating a Flex dashboard is to use the R Markdown template with <span Style="color:#3384FF">Flex Dashboard</span>.

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/dashboard.jpg" height="300px" /></center>

- We can use `#` to create multiple pages.
- We can use <span Style="color:#3384FF">orientation</span> in the <span Style="color:#3384FF">output</span> options to specify the layout to be <span Style="color:#3384FF">columns</span> or <span Style="color:#3384FF">rows</span>.

A comprehensive tutorial on Flex Dashboard can be found at [flexdashboard](https://pkgs.rstudio.com/flexdashboard/).

---
name: github

# Session 5: A Quick Overview of GitHub

In this session, we will provide a quick overview of GitHub. Because of the time limitation of the class, it covers only some basic usage of GitHub Desktop.

### What is Git?

Git is a version control system that allows us to track changes in any set of files. It is typically used by programmers who are working on source code together.

### What is GitHub?

GitHub is a version control and collaboration platform for programming. It allows us to collaborate on projects with other people from any location.

---

## Register for a GitHub account

We can register for a GitHub account at [www.github.com](https://github.com/).

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/GitHub.jpg" height="400px" /></center>

---

## Install and Set up GitHub Desktop

<div class="figure" style="text-align: center">
<iframe src="https://desktop.github.com/" width="80%" height="450px" data-external="1"></iframe>
<p class="caption">Installing GitHub Desktop from https://desktop.github.com/</p>
</div>

---

## Create a repository, track changes, and explore a file's history

We will walk through how to create a repository, track changes, and explore a file's history. Here is an example that shows what GitHub Desktop looks like.

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/GitHub_desktop.jpg" height="350px" /></center>

---

- Current repository

- Current branch

- Fetch origin

  Fetch downloads the most recent updates from origin, but it does not update our local working copy with the changes.
  After we click Fetch origin, the button changes to Pull Origin. Clicking Pull Origin will update our local working copy with the fetched updates.

- Summary (required) & Description (optional)

- Commit to master

  When we commit the changes, the list of uncommitted changes disappears from the left pane. However, we have only committed the changes locally. The commit must still be pushed to the remote (origin) repository.

---

## Use GitHub Pages to Publish an HTML file

The HTML file needs to be named <span Style="Color:#ff8000">index.html</span>.

1. Sign in to our GitHub account at [www.github.com](https://github.com/)
2. Navigate to the repository where our HTML file is
3. Click <span Style="color:#3384FF">Settings</span> and find <span Style="color:#3384FF">Pages</span> in the left menu
4. Under "GitHub Pages", use the None or Branch drop-down menu and select a publishing source
5. Click <span Style="color:#3384FF">Save</span>

<center><img src="https://github.com/Ying-Ju/R_Data_Analytics_Workshop_at_NTPU/raw/main/Figures/GitHub_Pages.jpg" height="300px" /></center>

---

## Thanks

.pull-left[
- Please do not hesitate to contact Dr. Chen if you have questions pertaining to learning R or other languages. Dr. Chen can be reached by email at <a href="mailto:ychen4@udayton.edu"><i class="fa fa-paper-plane fa-fw"></i> ychen4@udayton.edu</a>.

- The R code used in this presentation can be found [here](https://raw.githubusercontent.com/Ying-Ju/MathClub.github.io/main/job_analysis.R).

- Slides were created via the R package **xaringan**, with styling based on:

  * [xaringanthemer](https://cran.r-project.org/web/packages/xaringanthemer/vignettes/xaringanthemer.html) package, and
  * Alison Hill's [@apreshill](https://github.com/apreshill/) CSS resources for customizing themes and fonts

- The formatting of slides is provided by Dr. Fadel M. Megahed [@fmegahed](https://github.com/fmegahed).
] .pull-right[ <img src="https://c.tenor.com/XgehSxepiagAAAAC/are-there-any-questions-eric-cartman.gif" width="350" height="350" /> ]