R Data Analytics Workshop at NTPU, 2022

class: center, middle, inverse, title-slide

# R Data Analytics Workshop at NTPU, 2022
### Ying-Ju Tessa Chen, PhD 
<div align="left">
<a href="https://github.com/Ying-Ju"> 
Ying-Ju</a>
</div>
 
<div align="left">
<a href="mailto:ychen4@udayton.edu"> 
ychen4@udayton.edu</a>
</div>
 
### June 20 & 21, 2022

---

## README

We facilitated a two-day workshop on R data analytics at the National Taipei University in June 2022. To the extent possible, the content of the lectures is recorded here. The lectures are based on [R for Data Science](https://r4ds.had.co.nz/).

### Table of Contents

- [Data Manipulation](#manipulation)

- [Data Visualization](#visualization)

- [Data Exploration](#exploration)

- [R Markdown Presentations](#rmarkdown)

- [Introduction to GitHub](#github)

---
name: manipulation

# Session 1: Data Manipulation

In this session, we will talk about data manipulation using the R package tidyverse. This package is a collection of R packages that help us with data management and exploration. The key packages in tidyverse are:

- dplyr: data manipulation

- ggplot2: data visualization

- purrr: functional programming toolkit

- readr: read data and write files

- tibble: simple data frame

- tidyr: data management

---
## Key functions in dplyr

We will focus on the following key functions in dplyr using the dataset **flights** from the R package nycflights13.

- filter(): pick observations by their values

- arrange(): reorder the rows

- select(): select variables by their names

- mutate(): create new variables with functions of existing variables

- group_by(): group data by existing variables

- summarize(): collapse many values down to a single summary (with group_by)

---
## How the functions in dplyr work

All functions above work similarly.

1. The first argument is a data frame.

2. The subsequent arguments describe what to do with the data frame using the variable names.

3. The result is a new data frame (but we can save it back to the original data frame if needed).
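
As a minimal sketch of these three points, the chunk below applies two verbs to a small made-up tibble. The data and the names `toy`, `delayed`, and `is_late` are hypothetical and are not part of the flights data.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(
  dest      = c("IAH", "MIA", "ORD"),
  dep_delay = c(2, -1, 45)
)

# 1. the data frame comes first, 2. then what to do, using bare variable names
delayed <- dplyr::filter(toy, dep_delay > 0)

# 3. the result is a new data frame; toy itself still has all three rows
nrow(toy)
nrow(delayed)

# saving the result back to the original name overwrites it
toy <- mutate(toy, is_late = dep_delay > 0)
names(toy)
```

Writing `dplyr::filter()` explicitly avoids any clash with `stats::filter()`, which is the same concern the conflicted package addresses on the next slide.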

---
## Load packages and read the Flights Data

First, we load the necessary packages, resolve conflicting function names, and import the dataset **flights** from the R package nycflights13.

```r
library(tidyverse)
library(conflicted)
conflict_prefer("select", "dplyr")
conflict_prefer("filter", "dplyr")
df <- nycflights13::flights
```

---
## Understanding the Flights data

The [Flights Data](https://rdrr.io/cran/nycflights13/man/flights.html) provides on-time data for all flights that departed NYC (i.e., JFK, LGA, or EWR) in 2013. It has 19 variables:

- year, month, day: Date of departure.
- dep_time, arr_time: Actual departure and arrival times (format HHMM or HMM), local time zone.
- sched_dep_time, sched_arr_time: Scheduled departure and arrival times (format HHMM or HMM), local time zone.
- dep_delay, arr_delay: Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
- carrier: Two letter carrier abbreviation. See airlines to get name.
- flight: Flight number.
- tailnum: Plane tail number. See planes for additional metadata.
- origin, dest: Origin and destination. See airports for additional metadata.
- air_time: Amount of time spent in the air, in minutes.
- distance: Distance between airports, in miles.
- hour, minute: Time of scheduled departure broken into hour and minutes.
- time_hour: Scheduled date and hour of the flight as a POSIXct date. Along with origin, can be used to join flights data to weather data.

---
## Get a glimpse of the data

```r
glimpse(df)
```

```
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2~
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, ~
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, ~
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1~
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,~
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,~
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1~
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "~
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4~
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394~
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",~
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",~
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1~
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, ~
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6~
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0~
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0~
```

---
## filter() function

filter() is used when we want to subset observations based on a logical condition. For example, we can select all flights on December 25th using the following code.

```r
filter(df, month == 12, day == 25)
```

```
## # A tibble: 719 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 12 25 456 500 -4 649 651
## 2 2013 12 25 524 515 9 805 814
## 3 2013 12 25 542 540 2 832 850
## 4 2013 12 25 546 550 -4 1022 1027
## 5 2013 12 25 556 600 -4 730 745
## 6 2013 12 25 557 600 -3 743 752
## 7 2013 12 25 557 600 -3 818 831
## 8 2013 12 25 559 600 -1 855 856
## 9 2013 12 25 559 600 -1 849 855
## 10 2013 12 25 600 600 0 850 846
## # ... with 709 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
Christmas <- filter(df, month == 12, day == 25)
```

---
## Comparisons

R provides the standard suite: <, <=, >, >=, != (not equal), and == (equal).

If we would like to save the results to a variable and also print them, we can wrap the assignment in parentheses.

```r
(Jan1 <- filter(df, month == 1, day == 1)) 
```

```
## # A tibble: 842 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 832 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---
## Logical Operations

R provides the following syntax: & is "and", | is "or", ! is "not". The following code finds all flights that departed in July or August.

```r
filter(df, month == 7 | month == 8)
```

```
## # A tibble: 58,752 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 7 1 1 2029 212 236 2359
## 2 2013 7 1 2 2359 3 344 344
## 3 2013 7 1 29 2245 104 151 1
## 4 2013 7 1 43 2130 193 322 14
## 5 2013 7 1 44 2150 174 300 100
## 6 2013 7 1 46 2051 235 304 2358
## 7 2013 7 1 48 2001 287 308 2305
## 8 2013 7 1 58 2155 183 335 43
## 9 2013 7 1 100 2146 194 327 30
## 10 2013 7 1 100 2245 135 337 135
## # ... with 58,742 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
filter(df, month %in% c(7, 8))
```

---

**Note:** 
 
1. If we use filter(df, month == 7 | 8), R evaluates the condition as (month == 7) | 8 because == binds more tightly than |. In a logical context, the number 8 is treated as **TRUE**, so the condition is TRUE for every row and this finds all flights in the data.

2. filter() only includes rows where the condition is **TRUE** and it excludes both FALSE and NA values.
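
Both notes can be checked on a tiny example. This is a sketch on hypothetical data (the tibble `toy` and the result names are made up for illustration), not on the flights data.

```r
library(dplyr)

# Hypothetical toy data with one missing month
toy <- tibble(month = c(7, 8, 9, NA))

# Parsed as (month == 7) | 8; the 8 is coerced to TRUE, so every row is kept
all_rows <- dplyr::filter(toy, month == 7 | 8)
nrow(all_rows)

# The intended condition keeps only July and August
jul_aug <- dplyr::filter(toy, month %in% c(7, 8))
nrow(jul_aug)

# NA > 8 evaluates to NA, so the row with a missing month is dropped
after_aug <- dplyr::filter(toy, month > 8)
nrow(after_aug)

# To keep missing values, ask for them explicitly
after_aug_or_na <- dplyr::filter(toy, is.na(month) | month > 8)
nrow(after_aug_or_na)
```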

If we want to find flights that weren't delayed (on arrival or departure) by more than one hour, we could use either of the following code chunks.

```r
filter(df, !(arr_delay > 60 | dep_delay > 60))
```

```
## # A tibble: 295,893 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 295,883 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
filter(df, arr_delay <= 60, dep_delay <= 60)
```

---
## arrange() function

arrange() is used when we want to sort a dataset by a variable. If more than one variable is specified, the variables entered first take priority, and each later variable is used to break ties in the earlier ones. The following code chunk gives an example that sorts the flights by date.

```r
arrange(df, year, month, day)
```

```
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

**Note:** 
  
1. We can save the data frame back to the original data frame after sorting the data.

2. Use desc() to sort data in descending order. The code chunk on the next slide arranges the Flights Data by arrival delay in descending order.

3. Missing values are always sorted at the end.
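
A small sketch of tie-breaking, desc(), and the placement of missing values, using a hypothetical tibble `toy` rather than the flights data:

```r
library(dplyr)

# Hypothetical toy data with one missing delay
toy <- tibble(
  carrier = c("UA", "AA", "UA", "AA"),
  delay   = c(5, NA, 20, -3)
)

# carrier is sorted first; delay only breaks ties within a carrier
by_carrier <- arrange(toy, carrier, delay)
by_carrier

# descending sort: the missing value still ends up in the last row
by_delay_desc <- arrange(toy, desc(delay))
by_delay_desc
```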

---

```r
result_arrange <- arrange(df, desc(arr_delay))
head(select(result_arrange, arr_delay, everything()))
```

```
## # A tibble: 6 x 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 1272 2013 1 9 641 900 1301 1242
## 2 1127 2013 6 15 1432 1935 1137 1607
## 3 1109 2013 1 10 1121 1635 1126 1239
## 4 1007 2013 9 20 1139 1845 1014 1457
## 5 989 2013 7 22 845 1600 1005 1044
## 6 931 2013 4 10 1100 1900 960 1342
## # ... with 11 more variables: sched_arr_time <int>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

```r
tail(select(result_arrange, arr_delay, everything()))
```

```
## # A tibble: 6 x 19
## arr_delay year month day dep_time sched_dep_time dep_delay arr_time
## <dbl> <int> <int> <int> <int> <int> <dbl> <int>
## 1 NA 2013 9 30 NA 1842 NA NA
## 2 NA 2013 9 30 NA 1455 NA NA
## 3 NA 2013 9 30 NA 2200 NA NA
## 4 NA 2013 9 30 NA 1210 NA NA
## 5 NA 2013 9 30 NA 1159 NA NA
## 6 NA 2013 9 30 NA 840 NA NA
## # ... with 11 more variables: sched_arr_time <int>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---
## select() function

select() is used when we would like to keep only some of the variables in the data. For example, we can use the following code chunk to reduce the Flights Data to a few variables.

```r
# select specific columns
select(df, year, month, day)

# select all columns between year and day
select(df, year:day)

# select all columns except those from year to day
select(df, -(year:day)) 
```

**Note:**
1. We can use a minus sign - to drop variables. 
2. There are several helper functions we can use within select(). See ?select for the information. 
3. select() can be used with the everything() function when we have a handful of variables we would like to move to the start of the data frame.
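
A few of the helper functions mentioned in the notes, sketched on a one-row hypothetical tibble (`toy` is made up; the column names are borrowed from the flights data):

```r
library(dplyr)

# Hypothetical one-row data with flights-like column names
toy <- tibble(
  year = 2013, month = 1, day = 1,
  dep_time = 517, dep_delay = 2, arr_delay = 11
)

# columns whose names end with "delay"
delay_cols <- names(dplyr::select(toy, ends_with("delay")))

# columns whose names start with "dep"
dep_cols <- names(dplyr::select(toy, starts_with("dep")))

# columns whose names contain "time"
time_cols <- names(dplyr::select(toy, contains("time")))

# drop a range of columns with a minus sign
no_date_cols <- names(dplyr::select(toy, -(year:day)))
```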

---

```r
# move carrier, origin, dest, and distance to the start of the data
select(df, carrier, origin, dest, distance, everything())
```

```
## # A tibble: 336,776 x 19
## carrier origin dest distance year month day dep_time sched_dep_time
## <chr> <chr> <chr> <dbl> <int> <int> <int> <int> <int>
## 1 UA EWR IAH 1400 2013 1 1 517 515
## 2 UA LGA IAH 1416 2013 1 1 533 529
## 3 AA JFK MIA 1089 2013 1 1 542 540
## 4 B6 JFK BQN 1576 2013 1 1 544 545
## 5 DL LGA ATL 762 2013 1 1 554 600
## 6 UA EWR ORD 719 2013 1 1 554 558
## 7 B6 EWR FLL 1065 2013 1 1 555 600
## 8 EV LGA IAD 229 2013 1 1 557 600
## 9 B6 JFK MCO 944 2013 1 1 557 600
## 10 AA LGA ORD 733 2013 1 1 558 600
## # ... with 336,766 more rows, and 10 more variables: dep_delay <dbl>,
## # arr_time <int>, sched_arr_time <int>, arr_delay <dbl>, flight <int>,
## # tailnum <chr>, air_time <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

---
## mutate() function
 
mutate() is used when we would like to add a new variable / column using the other variables in the data.

**Note:** mutate() always adds new columns at the end of the data.

First, we start by creating a smaller dataset with a few variables.

```r
# we start by creating a smaller dataset.
df1 <- select(df, year:day, ends_with("delay"), distance, air_time)
```

---

We create four variables using variables in the data.

```r
mutate(df1, 
       gain= arr_delay - dep_delay, 
       speed = distance / air_time * 60,
       hours = air_time / 60,
       gain_per_hour = gain / hours)
```

```
## # A tibble: 336,776 x 11
## year month day dep_delay arr_delay distance air_time gain speed hours
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 2 11 1400 227 9 370. 3.78 
## 2 2013 1 1 4 20 1416 227 16 374. 3.78 
## 3 2013 1 1 2 33 1089 160 31 408. 2.67 
## 4 2013 1 1 -1 -18 1576 183 -17 517. 3.05 
## 5 2013 1 1 -6 -25 762 116 -19 394. 1.93 
## 6 2013 1 1 -4 12 719 150 16 288. 2.5 
## 7 2013 1 1 -5 19 1065 158 24 404. 2.63 
## 8 2013 1 1 -3 -14 229 53 -11 259. 0.883
## 9 2013 1 1 -3 -8 944 140 -5 405. 2.33 
## 10 2013 1 1 -2 8 733 138 10 319. 2.3 
## # ... with 336,766 more rows, and 1 more variable: gain_per_hour <dbl>
```

---

If we want to keep only the new variables, use transmute().

```r
transmute(df1, 
          gain= arr_delay - dep_delay, 
          speed = distance / air_time * 60,
          hours = air_time / 60,
          gain_per_hour = gain / hours)
```

```
## # A tibble: 336,776 x 4
## gain speed hours gain_per_hour
## <dbl> <dbl> <dbl> <dbl>
## 1 9 370. 3.78 2.38
## 2 16 374. 3.78 4.23
## 3 31 408. 2.67 11.6 
## 4 -17 517. 3.05 -5.57
## 5 -19 394. 1.93 -9.83
## 6 16 288. 2.5 6.4 
## 7 24 404. 2.63 9.11
## 8 -11 259. 0.883 -12.5 
## 9 -5 405. 2.33 -2.14
## 10 10 319. 2.3 4.35
## # ... with 336,766 more rows
```

**Note:** There are many functions for creating new variables that we can use with mutate(). The key property is that the function must be vectorized, which means it must take a vector of values as input and return a vector with the same number of values as output.
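
To illustrate what "vectorized" means here, the sketch below uses a few functions that each return one value per input row. The tibble `toy` and the new column names are hypothetical.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(delay = c(10, -5, 30))

result <- mutate(toy,
  prev_delay    = dplyr::lag(delay),       # value from the previous row
  running_total = cumsum(delay),           # cumulative sum
  delay_rank    = min_rank(desc(delay)),   # 1 = largest delay
  is_late       = delay > 0                # logical comparison
)
result
```

A function such as mean() returns a single value instead; inside mutate() that value is recycled to every row.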

---
## group_by() & summarize() functions
 
summarize() collapses a data frame to a single row. For example, we can summarize the average departure delays using the following code chunk.

```r
summarize(df, delay = mean(dep_delay, na.rm=T))
```

```
## # A tibble: 1 x 1
## delay
## <dbl>
## 1 12.6
```

---

In general, summarize() is used together with group_by(), so that the summary is computed for each group rather than for the whole dataset. group_by() groups rows by one or more variables, giving priority to the variable entered first.

```r
group_by(df, year, month, day)
```

```
## # A tibble: 336,776 x 19
## # Groups: year, month, day [365]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```

The result shows the original data but indicates groups: year, month, day, in our example.

---

For example, we can study the average departure / arrival delays for each day.

```r
by_day <- group_by(df, year, month, day) 
summarize(by_day, 
 ave_dep_delay = mean(dep_delay, na.rm = T),
 ave_arr_delay = mean(arr_delay, na.rm = T)
)
```

```
## # A tibble: 365 x 5
## # Groups: year, month [12]
## year month day ave_dep_delay ave_arr_delay
## <int> <int> <int> <dbl> <dbl>
## 1 2013 1 1 11.5 12.7 
## 2 2013 1 2 13.9 12.7 
## 3 2013 1 3 11.0 5.73 
## 4 2013 1 4 8.95 -1.93 
## 5 2013 1 5 5.73 -1.53 
## 6 2013 1 6 7.15 4.24 
## 7 2013 1 7 5.42 -4.95 
## 8 2013 1 8 2.55 -3.23 
## 9 2013 1 9 2.28 -0.264
## 10 2013 1 10 2.84 -5.90 
## # ... with 355 more rows
```

---
## Combining Multiple Operations with the Pipe

In order to handle data processing well in data science, it is essential to know how to use pipes. Pipes are a great tool for expressing a sequence of multiple operations, and they therefore make the code more readable. The pipe, %>%, comes from the package magrittr and is loaded automatically when tidyverse is loaded.

The logic when using the pipe: object %>% function1 %>% function2 ...
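
As a minimal sketch, `x %>% f(y)` is equivalent to `f(x, y)`, so a piped chain and the corresponding nested call give the same result. The vector `values` is made up for illustration.

```r
library(dplyr)  # dplyr re-exports the %>% pipe from magrittr

values <- c(4, 9, 16)

# nested calls are read from the inside out
nested <- round(mean(sqrt(values)), 1)

# the piped version is read from left to right
piped <- values %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```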

If we want to group the Flights Data by destination, find the number of flights, the average distance, and the average arrival delay for each destination, and then keep only destinations with more than 20 flights while removing Honolulu airport (HNL), we may use the following code chunk.

```r
by_dest <- group_by(df, dest)
delay <- summarize(by_dest,
 count = n(),
 ave_dist = mean(distance, na.rm=T),
 ave_arr_delay = mean(arr_delay, na.rm=T)
)
delay <- filter(delay, count > 20, dest != "HNL")
```

---

The following code chunk does the same task as on the previous slide with the pipe, %>%, which makes the code easier to read.

```r
delay <- df %>% 
 group_by(dest) %>%
 summarize(
 count = n(),
 ave_dist = mean(distance, na.rm=T),
 ave_arr_delay = mean(arr_delay, na.rm=T)
 ) %>%
 filter(count > 20, dest != "HNL")

delay
```

```
## # A tibble: 96 x 4
## dest count ave_dist ave_arr_delay
## <chr> <int> <dbl> <dbl>
## 1 ABQ 254 1826 4.38
## 2 ACK 265 199 4.85
## 3 ALB 439 143 14.4 
## 4 ATL 17215 757. 11.3 
## 5 AUS 2439 1514. 6.02
## 6 AVL 275 584. 8.00
## 7 BDL 443 116 7.05
## 8 BGR 375 378 8.03
## 9 BHM 297 866. 16.9 
## 10 BNA 6333 758. 11.8 
## # ... with 86 more rows
```

---
## Useful Summary Functions

- Measures of location for a quantitative variable: mean(), median()
 
- Measures of spread for a quantitative variable: sd(), IQR(), mad()
 
Here, `$MAD = median(|x_i - median(x)|)$` is called the median absolute deviation, which may be more useful if we have outliers. Note that mad() in R multiplies this value by the constant 1.4826 by default.
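
A quick sketch of why the MAD is more robust than the standard deviation, using a made-up vector `x` with one extreme value:

```r
# Hypothetical data with a single outlier (100)
x <- c(2, 4, 4, 5, 6, 7, 100)

# the standard deviation is inflated by the outlier
sd_x <- sd(x)

# raw MAD: median(|x - median(x)|), obtained by setting constant = 1
mad_raw <- mad(x, constant = 1)

# default MAD: the raw value multiplied by 1.4826
mad_default <- mad(x)

c(sd = sd_x, mad_raw = mad_raw, mad_default = mad_default)
```

The median of `x` is 5, the absolute deviations are 3, 1, 1, 0, 1, 2, and 95, and their median is 1, so the raw MAD is 1 regardless of how extreme the outlier is.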

```r
not_cancelled <- df %>% 
 filter(!is.na(dep_delay), !is.na(arr_delay))

not_cancelled %>% 
  group_by(dest) %>%
  summarize(
    distance_mu = mean(distance),
    distance_sd = sd(distance)) %>%
  arrange(desc(distance_sd)) %>% 
  head()
```

```
## # A tibble: 6 x 3
## dest distance_mu distance_sd
## <chr> <dbl> <dbl>
## 1 EGE 1736. 10.5 
## 2 SAN 2437. 10.4 
## 3 SFO 2578. 10.2 
## 4 HNL 4973. 10.0 
## 5 SEA 2413. 9.98
## 6 LAS 2241. 9.91
```

---

- Measures of rank: min(), quantile(), max()

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first = min(dep_time), # earliest departure time each day
    last = max(dep_time) # latest departure time each day
  ) %>% 
  head()
```

```
## # A tibble: 6 x 5
## # Groups: year, month [1]
## year month day first last
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
```

---

- Measures of position: first(), nth(x, 2), last()
 
The following code chunk finds the first and last departure for each day.

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(
    first_dep = first(dep_time),
    last_dep = last(dep_time)
  ) %>% 
  head()
```

```
## # A tibble: 6 x 5
## # Groups: year, month [1]
## year month day first_dep last_dep
## <int> <int> <int> <int> <int>
## 1 2013 1 1 517 2356
## 2 2013 1 2 42 2354
## 3 2013 1 3 32 2349
## 4 2013 1 4 25 2358
## 5 2013 1 5 14 2357
## 6 2013 1 6 16 2355
```

---

- Counts: We have seen n(), which takes no arguments and returns the size of the current group. To count the number of non-missing values, we can use sum(!is.na(x)). To count the number of distinct values, use n_distinct().
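
The three kinds of counts can be compared side by side on a small hypothetical tibble (`toy` and the summary column names are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data with one missing carrier
toy <- tibble(
  dest    = c("ATL", "ATL", "BOS", "BOS", "BOS"),
  carrier = c("DL", "DL", "AA", "B6", NA)
)

counts <- toy %>%
  group_by(dest) %>%
  summarize(
    n_rows     = n(),                               # size of the group
    n_known    = sum(!is.na(carrier)),              # non-missing values
    n_carriers = n_distinct(carrier, na.rm = TRUE)  # distinct non-missing values
  )
counts
```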

```r
not_cancelled %>%
  group_by(dest) %>%
  summarize(carriers = n_distinct(carrier)) %>%
  arrange(desc(carriers)) %>% 
  head()
```

```
## # A tibble: 6 x 2
## dest carriers
## <chr> <int>
## 1 ATL 7
## 2 BOS 7
## 3 CLT 7
## 4 ORD 7
## 5 TPA 7
## 6 AUS 6
```

---

We can use count() directly if all we want is a count.

```r
not_cancelled %>% 
  count(dest) %>% 
  head(5)
```

```
## # A tibble: 5 x 2
## dest n
## <chr> <int>
## 1 ABQ 254
## 2 ACK 264
## 3 ALB 418
## 4 ANC 8
## 5 ATL 16837
```

We can optionally provide a weight variable. For example, we could use this to "count" the total number of miles a plane flew.

```r
not_cancelled %>%
  count(tailnum, wt = distance) %>% 
  head(5)
```

```
## # A tibble: 5 x 2
## tailnum n
## <chr> <dbl>
## 1 D942DN 3418
## 2 N0EGMQ 239143
## 3 N10156 109664
## 4 N102UW 25722
## 5 N103US 24619
```

---

- Counts and proportions of logical values

When used with numeric functions, TRUE is converted to 1 and FALSE to 0. Thus, sum() gives the number of TRUEs and mean() gives the proportion of TRUEs in the variable. For example, we can check how many flights left before 5AM using the following code chunk.

```r
not_cancelled %>%
 group_by(year, month, day) %>%
 summarize(n_early = sum(dep_time < 500)) %>% 
 head()
```

```
## # A tibble: 6 x 4
## # Groups: year, month [1]
## year month day n_early
## <int> <int> <int> <int>
## 1 2013 1 1 0
## 2 2013 1 2 3
## 3 2013 1 3 4
## 4 2013 1 4 3
## 5 2013 1 5 3
## 6 2013 1 6 2
```

---

Or what proportion of flights are delayed by more than one hour?

```r
not_cancelled %>%
  group_by(year, month, day) %>%
  summarize(hour_perc = mean(arr_delay > 60)) %>% 
  head()
```

```
## # A tibble: 6 x 4
## # Groups: year, month [1]
## year month day hour_perc
## <int> <int> <int> <dbl>
## 1 2013 1 1 0.0722
## 2 2013 1 2 0.0851
## 3 2013 1 3 0.0567
## 4 2013 1 4 0.0396
## 5 2013 1 5 0.0349
## 6 2013 1 6 0.0470
```

---
## Grouping by Multiple Variables

Here we show some examples to demonstrate how to group the data by multiple variables.

```r
per_day <- df %>% 
 group_by(year, month, day) %>%
 summarize(flights = n())

per_day %>% head()
```

```
## # A tibble: 6 x 4
## # Groups: year, month [1]
## year month day flights
## <int> <int> <int> <int>
## 1 2013 1 1 842
## 2 2013 1 2 943
## 3 2013 1 3 914
## 4 2013 1 4 915
## 5 2013 1 5 720
## 6 2013 1 6 832
```

---

```r
per_month <- summarize(per_day, flights = sum(flights))

per_month %>% head()
```

```
## # A tibble: 6 x 3
## # Groups: year [1]
## year month flights
## <int> <int> <int>
## 1 2013 1 27004
## 2 2013 2 24951
## 3 2013 3 28834
## 4 2013 4 28330
## 5 2013 5 28796
## 6 2013 6 28243
```

```r
per_year <- summarize(per_month, flights = sum(flights))

per_year
```

```
## # A tibble: 1 x 2
## year flights
## <int> <int>
## 1 2013 336776
```
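
The progressive roll-up above works because each call to summarize() peels off the last grouping variable. A minimal sketch on hypothetical data (`toy` is made up; `.groups = "drop_last"` spells out the default behaviour and assumes dplyr 1.0.0 or later):

```r
library(dplyr)

# Hypothetical toy data: four flights over two months
toy <- tibble(
  month = c(1, 1, 1, 2),
  day   = c(1, 1, 2, 1)
)

per_day <- toy %>%
  group_by(month, day) %>%
  summarize(flights = n(), .groups = "drop_last")

# the day level has been peeled off; month is still a grouping variable
group_vars(per_day)

# summarizing again therefore gives one row per month
per_month <- summarize(per_day, flights = sum(flights))
per_month
```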

---
## Ungrouping

If we need to remove grouping, and return to operations on ungrouped data, use ungroup().

```r
daily <- df %>% group_by(year, month, day)
daily %>% 
 ungroup() %>% # no longer grouped by date
 summarize(flights=n()) # all flights
```

```
## # A tibble: 1 x 1
## flights
## <int>
## 1 336776
```

---
## Grouped Mutates and Filters

Grouping is also convenient in combination with mutate() and filter().

The following code chunk finds the worst members of each group, here the flights with the largest arrival delays on each day.

```r
df1 %>% 
 group_by(year, month, day) %>% 
 filter(rank(desc(arr_delay)) < 10) %>% 
 head()
```

```
## # A tibble: 6 x 7
## # Groups: year, month, day [1]
## year month day dep_delay arr_delay distance air_time
## <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
## 1 2013 1 1 853 851 184 41
## 2 2013 1 1 290 338 1134 213
## 3 2013 1 1 260 263 266 46
## 4 2013 1 1 157 174 213 60
## 5 2013 1 1 216 222 708 121
## 6 2013 1 1 255 250 589 115
```

---

The following code chunk finds all groups bigger than a threshold.

```r
popular_dests <- df %>%
 group_by(dest) %>% 
 filter(n()>365)

popular_dests %>% head()
```

```
## # A tibble: 6 x 19
## # Groups: dest [5]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
```

---

The following code chunk standardizes values within each group to compute per-group metrics, here each flight's share of the total arrival delay at its destination.

```r
popular_dests %>% 
  filter(arr_delay > 0) %>% 
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>% 
  select(year:day, arr_delay, prop_delay) %>%
  head()
```

```
## # A tibble: 6 x 6
## # Groups: dest [4]
## dest year month day arr_delay prop_delay
## <chr> <int> <int> <int> <dbl> <dbl>
## 1 IAH 2013 1 1 11 0.000111 
## 2 IAH 2013 1 1 20 0.000201 
## 3 MIA 2013 1 1 33 0.000235 
## 4 ORD 2013 1 1 12 0.0000424
## 5 FLL 2013 1 1 19 0.0000938
## 6 ORD 2013 1 1 8 0.0000283
```
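
The same per-group standardization is easier to verify by hand on a small hypothetical tibble (`toy` and `shares` are made-up names):

```r
library(dplyr)

# Hypothetical toy data: two delayed flights to A, one to B
toy <- tibble(
  dest      = c("A", "A", "B"),
  arr_delay = c(10, 30, 5)
)

shares <- toy %>%
  group_by(dest) %>%
  # sum(arr_delay) is computed within each destination, not over all rows
  mutate(prop_delay = arr_delay / sum(arr_delay)) %>%
  ungroup()
shares
```

Within each destination the proportions add up to 1.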

---
name: visualization

# Session 2: Data Visualization with ggplot2

In this session, we will introduce how to visualize our data using ggplot2 and plotly. The lecture is based on the [UC Business Analytics R Programming Guide](https://uc-r.github.io/ggplot_intro).

While we can use the built-in functions in the base package in R to obtain plots, the package ggplot2 creates advanced graphs with simple and flexible commands.

---
## Load packages and read the Fuel Economy Data

First, we load the necessary packages, resolve conflicting function names, and get a glimpse of the dataset **mpg** from the R package ggplot2.

```r
library(tidyverse)
library(conflicted)
conflict_prefer("lag", "dplyr")
conflict_prefer("filter", "dplyr")
glimpse(mpg)
```

```
## Rows: 234
## Columns: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", "~
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", "~
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2.~
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 200~
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8, ~
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "auto~
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "4~
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, 1~
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, 2~
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p~
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "c~
```

---

Now we need to understand the data and each variable in it. The [Fuel Economy Data](https://ggplot2.tidyverse.org/reference/mpg.html) covers 38 popular models of cars from 1999 to 2008 and has the following variables.

- manufacturer: car manufacturer

- model: model name

- displ: engine displacement, in liters

- year: year of manufacturing (1999-2008)

- cyl: number of cylinders

- trans: type of transmission

- drv: drive type (f = front wheel, r = rear wheel, 4 = 4 wheel)

- cty: city mileage, in miles per gallon

- hwy: highway mileage, in miles per gallon

- fl: fuel type (diesel, petrol, electric, etc.)

- class: vehicle class, 7 types (compact, SUV, minivan, etc.)

---
## Grammar of Graphics

The basic idea of creating plots using ggplot2 is to specify each component of a plot (the data, the aesthetic mappings, and the geometric objects) and combine them with +.

### ggplot() function

The ggplot() function plays an important role in data visualization, as it is very flexible for plotting many different types of graphic displays.

The logic when using the ggplot() function is: `ggplot(data, mapping) + geom_function()`.
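
As a minimal sketch of this template, the chunk below fills in each slot with the mpg data that ships with ggplot2. The object name `p` is arbitrary.

```r
library(ggplot2)

# template: ggplot(data, mapping) + geom_function()
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

# the plot is an ordinary R object; printing it draws the figure
p
```

Because the plot is stored in an object, further components can be added later with +, for example `p + geom_smooth()`.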

---
## The Basics

First, we see how the ggplot() function works by creating a canvas and then mapping variables to it.

```r
# need this package to create side-by-side plots
library(patchwork)
# create canvas
p1 <- ggplot(mpg)
# variables of interest mapped
p2 <- ggplot(mpg, mapping = aes(x = displ, y = hwy))

p1+p2
```

---

The following code chunk shows how we can obtain a scatter plot to study the relationship between engine displacement and highway mileage per gallon.

```r
# data plotted
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
```

---
## Aesthetic Mappings

The aesthetic mappings allow us to select the variables to be plotted and to use data properties to influence visual characteristics such as color, size, shape, position, etc. As a result, each visual characteristic can encode a different part of the data and be utilized to communicate information.

All aesthetics for a plot are specified in the aes() function call. For example, we can add a mapping from the class of the cars to a color characteristic:

```r
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

---

**Note:** 
  
1. Note that in the above code chunk, "class" is a variable in the data; the command therefore uses a categorical variable as the third variable in the figure.

2. Using the aes() function will cause the visual channel to be based on the data specified in the argument. For example, using `aes(color = "blue")` won’t cause the geometry’s color to be “blue”, but will instead cause the visual channel to be mapped from the vector c("blue"), as if we only had a single type of engine that happened to be called “blue”. If we wish to apply an aesthetic property to an entire geometry, we can set that property as an argument to the geom function, outside of the aes() call.
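
The difference between mapping and setting a color can be sketched by building both versions. The object names `p_mapped` and `p_set` are made up for illustration.

```r
library(ggplot2)

# mapped: "blue" is treated as a one-level data value, so ggplot2 picks
# a color from its default scale and adds a legend entry labelled "blue"
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = "blue"))

# set: the color is a fixed property of the layer, so the points are blue
p_set <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")

# ggplot2 stores the mapped aesthetic under its British spelling
names(p_mapped$layers[[1]]$mapping)
```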

---

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue")
```

---
## Specifying Geometric Shapes

Building on these basics, we can use ggplot2 to create almost any kind of plot we may want. These plots are declared using functions that follow from the Grammar of Graphics. ggplot2 supports a number of different types of geometric objects, including:
 
- geom_bar(): bar charts
- geom_boxplot(): boxplots
- geom_histogram(): histograms
- geom_line(): lines
- geom_map(): polygons in the shape of a map
- geom_point(): individual points
- geom_polygon(): arbitrary shapes
- geom_smooth(): smoothed lines

Each of these geometries will make use of the aesthetic mappings provided, although the visual qualities to which the data are mapped will differ. For example, we can map data to the shape of a geom_point (e.g., if they should be circles or squares), or we can map data to the line-type of a geom_line (e.g., if it is solid or dotted), but not vice versa.
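
As a sketch of this point, the same variable (drv) can be mapped to shape for points and to linetype for smoothed lines. The object names are arbitrary, and `se = FALSE` is only used to hide the confidence bands.

```r
library(ggplot2)

# drive type mapped to the shape of the points
p_points <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(shape = drv))

# drive type mapped to the line type of the smoothed lines
p_lines <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(aes(linetype = drv), se = FALSE)
```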

---

Almost all geoms require an x and y mapping at the bare minimum.

```r
# x and y mapping needed when creating a scatterplot
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
 geom_point()

p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
 geom_smooth()

p1 + p2
```

---

There is no y mapping needed when creating a bar chart or a histogram.

```r
p1 <- ggplot(mpg, aes(x = class)) +
 geom_bar()

p2 <- ggplot(mpg, aes(x = hwy)) +
 geom_histogram()

p1 + p2
```

---

We improve the quality of the figures on the previous slide.

```r
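# mapping class to y (instead of x) draws horizontal bars; geom_bar still computes the counts
# daytonred is assumed to be a color value defined earlier in the workshop code
p1 <- ggplot(mpg, aes(y = class)) +
 geom_bar(fill = daytonred, alpha = 0.2)

# after_stat(density) rescales the histogram so that it is comparable to the density curve
p2 <- ggplot(mpg, aes(x = hwy)) +
 geom_histogram(aes(y = after_stat(density)), binwidth = density(mpg$hwy)$bw) +
 geom_density(fill = daytonred, alpha = 0.2)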

p1 + p2
```

---

What makes this really powerful is that we can add multiple geometries to a plot, which allows us to create complex graphics showing multiple aspects of our data.

```r
# plot with both points and smoothed line
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()
```

---

**Note:**
1. Since the aesthetics for each geom can be different, we could show multiple lines on the same plot, each with its own color, style, etc.
2. It is also possible to give each geom a different data argument, so that we can show multiple data sets in the same plot.

If we specify an aesthetic within ggplot(), it will be passed on to each geom that follows. Alternatively, we can specify aesthetics within an individual geom, which applies them to that specific layer only (e.g., geom_point).
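To illustrate the second note, here is a small sketch in which one layer receives its own data argument. The subset two_seaters and the object name p_highlight are hypothetical names introduced for this example.

```r
library(ggplot2)

# subset of the data used only by the highlighting layer
two_seaters <- subset(mpg, class == "2seater")

p_highlight <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "grey60") +                           # all cars, from the data given to ggplot()
  geom_point(data = two_seaters, color = "red", size = 3)  # only the two-seaters
p_highlight
```

Both layers inherit the x and y mappings from ggplot(); only the data differ.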

---

For example, we can plot both points and a smoothed line for the same x and y variable but specify unique colors within each geom:

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(color = "blue") +
  geom_smooth(color = "red")
```

---

```r
# color aesthetic passed to each geom layer
p1 <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
 geom_point() +
 geom_smooth(se = FALSE)

# color aesthetic specified for only the geom_point layer
p2 <- ggplot(mpg, aes(x = displ, y = hwy)) +
 geom_point(aes(color = class)) +
 geom_smooth(se = FALSE)

p1 + p2
```

---

```r
# color aesthetic specified for only the geom_point layer
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = class)) +
  geom_smooth(se = FALSE)
```

---
## Statistical Transformations

The following bar chart shows the frequency distribution of vehicle class. The y axis is the count of vehicles in each class. This count is not part of the data set, but is instead a statistical transformation that geom_bar automatically applies to the data. In particular, it applies the stat_count transformation.

```r
ggplot(mpg, aes(x = class)) +
  geom_bar()
```

---

ggplot2 supports many different statistical transformations. For example, the “identity” transformation will leave the data “as is”. We can specify which statistical transformation a geom uses by passing it as the stat argument. For example, suppose our data already had the count as a variable:

```r
(class_count <- count(mpg, class))
```

```
## # A tibble: 7 x 2
##   class          n
##   <chr>      <int>
## 1 2seater        5
## 2 compact       47
## 3 midsize       41
## 4 minivan       11
## 5 pickup        33
## 6 subcompact    35
## 7 suv           62
```

---

We can use `stat = "identity"` within geom_bar so that the bar heights are taken directly from this variable. Note that we now map n to the y aesthetic:

```r
ggplot(class_count, aes(x = class, y = n)) +
  geom_bar(stat = "identity")
```
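
As a brief aside, ggplot2 also provides geom_col(), which uses the identity transformation by default. The sketch below rebuilds the count table so that it runs on its own; p_col is a hypothetical object name.

```r
library(ggplot2)
library(dplyr)

# one row per class, with the number of cars in column n
class_count <- count(mpg, class)

# geom_col() uses stat = "identity" by default, so y is taken as the bar height
p_col <- ggplot(class_count, aes(x = class, y = n)) +
  geom_col()
p_col
```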

---

We can also call stat_ functions directly to add additional layers. For example, here we create a scatter plot of highway miles for each displacement value and then use stat_summary() to plot the mean highway miles at each displacement value.

```r
ggplot(mpg, aes(displ, hwy)) + 
  geom_point(color = "grey") + 
  stat_summary(fun = "mean", geom = "line", size = 1, linetype = "dashed")
```

---
## Position Adjustments
 
In addition to a default statistical transformation, each geom also has a default position adjustment which specifies a set of “rules” as to how different components should be positioned relative to each other. This position adjustment is noticeable in geom_bar() if we map a different variable to the fill aesthetic.

```r
# bar chart of class, colored by drive (front, rear, 4-wheel)
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar()
```

The geom_bar() by default uses a position adjustment of `stack`, which makes each rectangle's height proportional to its value and stacks them on top of each other.

---

We can use the position argument to specify what position adjustment rules to follow:

```r
# position = "dodge": values next to each other
p1 <- ggplot(mpg, aes(x = class, fill = drv)) + 
 geom_bar(position = "dodge")

# position = "fill": percentage chart
p2 <- ggplot(mpg, aes(x = class, fill = drv)) + 
 geom_bar(position = "fill")

p1 + p2
```

**Note:** We may need to check the documentation for each particular geom to learn more about its positioning adjustments.
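
Position adjustments are not limited to bars. As a small sketch (the choice of variables and the name p_jitter are assumptions for illustration), geom_point() accepts `position = "jitter"` to separate points that would otherwise be drawn on top of each other:

```r
library(ggplot2)

# many cars share the same (cty, hwy) pair, so their points overlap exactly;
# position = "jitter" adds a little random noise to separate them
p_jitter <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point(position = "jitter", alpha = 0.5)
p_jitter
```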

---
## Managing Scales

Whenever we specify an aesthetic mapping, ggplot() uses a particular **scale** to determine the range of values that the data should map to. It automatically adds a scale for each mapping to the plot.

```r
# color the data by vehicle class
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
```

---

However, the scale used in the figure can be changed if needed. Each scale can be represented by a function with the following name: **scale_**, followed by the name of the aesthetic property, followed by an underscore and the name of the scale. A continuous scale handles numeric data, whereas a discrete scale handles categorical data such as vehicle class.

```r
# same as above, with explicit scales
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_colour_discrete()
```

---

While the default scales will work fine, it is possible to explicitly add different scales to replace the defaults. For example, we can use a scale to change the direction of an axis:

```r
# mileage relationship, ordered in reverse
ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  scale_x_reverse() +
  scale_y_reverse()
```

Similarly, we can use scale_x_log10() and scale_x_sqrt() to transform the scale.
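
A minimal sketch of such a transformation is shown below; the choice of displ for the x axis and the name p_log are assumptions made for illustration only.

```r
library(ggplot2)

# the x axis is drawn on a log10 scale; the data themselves are not modified
p_log <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_log10()
p_log
```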

---

We can use scales to format the axes as well.

```r
ggplot(mpg, aes(x = class, fill = drv)) + 
  geom_bar(position = "fill") +
  scale_y_continuous(breaks = seq(0, 1, by = .2), 
                     labels = scales::percent) + 
  labs(y = "Percent")
```

---
## Use Pre-Defined Palettes

A common parameter to change is which set of colors to use in a plot. While we can use the default coloring, a more common option is to leverage the pre-defined palettes from [colorbrewer2.org](https://colorbrewer2.org/#type=sequential&scheme=BuGn&n=3). These color sets have been carefully designed to look good and to be viewable to people with certain forms of color blindness. We can use ColorBrewer palettes by adding scale_color_brewer() to the plot and passing the palette name as an argument.

```r
# default color brewer
p1 <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
 geom_point() +
 scale_color_brewer()

# specifying color palette
p2 <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
 geom_point() +
 scale_color_brewer(palette = "Set3")

p1 + p2
```

---

The figures on the previous slide.

---
## Coordinate Systems

Similar to scales, coordinate systems are specified with functions that all start with **coord_** and are added as a layer. There are a number of different possible coordinate systems to use, including:

- coord_cartesian: the default Cartesian coordinate system, where you specify x and y values

- coord_flip: a Cartesian system with the x and y flipped

- coord_fixed: a Cartesian system with a “fixed” aspect ratio

- coord_polar: a plot using polar coordinates

- coord_quickmap: a coordinate system that approximates a good aspect ratio for maps

---

```r
# zoom in with coord_cartesian
p1 <- ggplot(mpg, aes(x = displ, y = hwy)) +
 geom_point() +
 coord_cartesian(xlim = c(0, 5))

# flip x and y axis with coord_flip
p2 <- ggplot(mpg, aes(x = class)) +
 geom_bar() +
 coord_flip()

p1 + p2
```
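
For completeness, here is a small sketch of a non-Cartesian system: the same bar chart of class drawn in polar coordinates. The name p_polar and the styling choices (width = 1, hidden legend) are assumptions for illustration.

```r
library(ggplot2)

# bars become wedges whose radius reflects the count in each class
p_polar <- ggplot(mpg, aes(x = class, fill = class)) +
  geom_bar(width = 1, show.legend = FALSE) +
  coord_polar()
p_polar
```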

---
## Facets

If we want to divide the information into multiple subplots, facets are the way to go. They allow us to view a separate plot for each level of a categorical variable. We can construct a plot with multiple facets by using facet_wrap(). This will produce a “row” of subplots, one for each level of the categorical variable (the number of rows can be specified with an additional argument).

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow=2)
```

---

**Note:**
1. We can use facet_grid() to facet the data by more than one categorical variable. 
2. We use a tilde (~) in our facet functions. With facet_grid() the variable to the left of the tilde will be represented in the rows and the variable to the right will be represented across the columns.

```r
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(year ~ cyl)
```

---
## Labels & Annotations

Textual annotations and labels (on the plot, axes, geometry, and legend) are crucial for understanding and presenting information.

- labs: assign title, subtitle, caption, x & y labels

We can add titles and axis labels to a chart using the labs() function (not labels, which is a different R function!).

```r
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       subtitle = "Fuel economy data from 1999 and 2008 for 38 popular models of cars",
       x = "Engine Displacement (liters)",
       y = "Fuel Efficiency (miles per gallon)",
       color = "Car Type")
```

---
The figure on the previous slide.

---
It is also possible to add labels into the plot itself (e.g., to label each point or line) by adding a new geom_text or geom_label to the plot; effectively, we are plotting an extra set of data whose values happen to be text labels (here, the model names).

```r
# a data table of each car that has best efficiency of its type

best_in_class <- mpg %>%
 group_by(class) %>%
 filter(row_number(desc(hwy)) == 1)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) +
  geom_label(data = best_in_class, aes(label = model), alpha = 0.5)
```

---

However, two labels overlap one another in the top left part of the plot on the previous slide. We can use geom_text_repel() from the ggrepel package to position the labels so that they do not overlap.

```r
library(ggrepel)

ggplot(mpg, aes(x = displ, y = hwy)) + 
  geom_point(aes(color = class)) +
  geom_text_repel(data = best_in_class, aes(label = model))
```

---

## Themes
Whenever we want to customize titles, labels, fonts, background, grid lines, and legends, we can use themes.

```r
ggplot(mpg, aes(x=displ, y=hwy)) +
  geom_point() +
  labs(title = "Fuel Efficiency by Engine Power",
       x = "Engine Displacement (Liters)",
       y = "Fuel Efficiency (Miles per gallon)") + 
  theme(axis.text.x = element_text(size = 12),
        axis.text.y = element_text(size = 12),  
        axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12))
```

---

**Note:** 
  
1. We only list some key components here.
2. See [Modify Components of a Theme](https://ggplot2.tidyverse.org/reference/theme.html) and [Complete Themes](https://ggplot2.tidyverse.org/reference/ggtheme.html) for more details about the use of themes.
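
As a short sketch of how a complete theme and an individual theme component can be combined (the particular theme, legend position, and the name p_theme are choices made only for illustration):

```r
library(ggplot2)

p_theme <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  theme_bw(base_size = 12) +          # complete theme: sets every component at once
  theme(legend.position = "bottom")   # then adjust a single component
p_theme
```

The order matters here: a complete theme added after theme() would replace the individual adjustment.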

---
## Data Visualization with R Package: plotly

The R package plotly makes it very easy to create interactive graphic displays when we already know how to use ggplot() to create graphs. The following code chunk creates an interactive version of a scatterplot from the previous section.

```r
library(plotly)
p1 <- ggplot(mpg, aes(x=displ, y=hwy)) +
 geom_point() 
ggplotly(p1, width = 400, height = 300)
```

The interactive scatterplot of hwy against displ produced by ggplotly() appears here in the rendered slides.
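
As a further sketch, ggplotly() also works with plots that map additional aesthetics, and its tooltip argument selects which aesthetics are shown in the hover text. The names p_class and widget are hypothetical.

```r
library(ggplot2)
library(plotly)

p_class <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()

# tooltip selects which aesthetics appear in the hover text
widget <- ggplotly(p_class, tooltip = c("x", "y", "colour"),
                   width = 400, height = 300)
widget
```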

---
name: exploration

# Session 3: Data Exploration with R Package: DataExplorer

In data science, it is important to get to know the data before advanced modeling or further analysis. We should understand what the data are about, which variables we have, the size of the data, how many values are missing, the data type of each variable, any possible relationships between variables, and anything unusual or interesting in the data.

We will use the [Medical Cost Personal Dataset](https://www.kaggle.com/datasets/mirichoi0218/insurance)
to go over the use of functions in the package DataExplorer. For demonstration purposes, we modified this data by introducing random missing values in some variables.

---
## Import the package, load the data and get a glimpse of the data

```r
library(DataExplorer)
insurance <- read_csv("https://raw.githubusercontent.com/Ying-Ju/R_Data_Analytics_Series_NTPU/main/insurance.csv")
glimpse(insurance)
```

```
## Rows: 1,338
## Columns: 7
## $ age <dbl> 19, 18, 28, NA, 32, NA, 46, 37, 37, 60, 25, 62, NA, 56, 27, 1~
## $ sex <chr> "female", "male", "male", "male", "male", "female", "female",~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <dbl> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <chr> "yes", "no", "no", "no", "no", "no", "no", "no", "no", "no", ~
## $ region <chr> "southwest", "southeast", "southeast", "northwest", "northwes~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
```

---

First, we check the basic description for the data using the function plot_intro() in the package DataExplorer.

```r
plot_intro(insurance)
```
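
The numbers behind this plot are also available as a table through introduce(). The sketch below uses the built-in airquality data as a stand-in (an assumption made so that the example runs without downloading the insurance file); the name intro is hypothetical.

```r
library(DataExplorer)

# airquality ships with R and contains missing values; it stands in for insurance here
intro <- introduce(airquality)

# a few of the summary columns returned by introduce()
intro[, c("rows", "columns", "total_missing_values", "complete_rows")]
```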

---

Then, we study the distribution of missing values in the data using the function plot_missing() in the package DataExplorer.

```r
plot_missing(insurance)
```

Since there are only 7 variables, we will study all of them.
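
If a table is preferred over a plot, profile_missing() returns the missing-value counts per variable. As before, this is only a sketch on the built-in airquality data, used here as a stand-in for the insurance data; missing_tbl is a hypothetical name.

```r
library(DataExplorer)

# one row per variable, with the number and proportion of missing values
missing_tbl <- profile_missing(airquality)

# variables with the most missing values first
missing_tbl[order(-missing_tbl$num_missing), ]
```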

---

Now, we study the frequency distribution of all categorical variables in the data using the function plot_bar() in the package DataExplorer.

```r
plot_bar(insurance)
```

---

The following code shows the sum of charges within each level of the categorical variables in the data, one variable at a time.

```r
plot_bar(insurance, with="charges")
```

---

Next, we study the distribution of all quantitative variables in the data using the function plot_histogram() in the package DataExplorer.

```r
plot_histogram(insurance, ncol=2) 
```

---

We study the distributions of age, bmi, and charges with respect to region individually using the function plot_boxplot() in the package DataExplorer.

```r
insurance_Q <- insurance %>% 
 select(age, bmi, charges, region) %>% 
 drop_na()
plot_boxplot(insurance_Q, by = "region")
```

---

We can study the association between each quantitative variable and a given response variable in the data using the function plot_scatterplot() in the package DataExplorer. Here, we study the association between charges and other quantitative variables in the data.

```r
plot_scatterplot(insurance_Q %>% select(-region), by = "charges")
```

---

We can also draw the scatterplots from a random sample of the observations (here, 100 rows).

```r
plot_scatterplot(insurance_Q %>% select(-region), by = "charges", sampled_rows=100)
```

---

```r
plot_scatterplot(insurance_Q %>% 
                   filter(region=="northwest") %>% 
                   select(-region), 
                 by = "charges")
```

The above figure only shows the association between charges and other quantitative variables in the northwest.

---

We can check the correlation of all quantitative variables in the data using the function plot_correlation() in the package DataExplorer.

```r
plot_correlation(insurance_Q %>% 
                   select(-region), 
                 cor_args = list( "use" = "complete.obs"))
```

---

If you are new to data exploration and have no idea where to start, the create_report() function in the package DataExplorer can generate a complete data exploration report for the data.

```r
# use forward slashes in Windows paths: "\U" is an invalid escape in an R string
create_report(insurance, output_file = "report.html", 
              output_dir = "C:/Users/Tessa/Document")
```

**Note:** Use help("create_report") to find the usage of create_report().

---
name: rmarkdown

# Session 4: R Markdown Presentations

In this session, we will introduce

1. R Markdown Presentation
2. Flex Dashboard

### What is R Markdown?

R Markdown is a file format for creating dynamic documents with R and RStudio. R Markdown documents are written in Markdown, an easy-to-write plain text format, with embedded R code.

---
## R Markdown Presentation

To create an R Markdown presentation in RStudio, we click File, then New File, and then R Markdown... There are four options:

- HTML (ioslides)

This format allows us to create a slide show, and the slides can be broken up into sections by using the heading tags # and ##. If a header is not needed, a new slide can be created using a horizontal rule (---).

- HTML (Slidy)

Similar to ioslides, this format allows us to create a slide show broken up into sections by using the heading tag ##. If a header is not needed, a new slide can be created using a horizontal rule (---). A Slidy presentation provides a table of contents while an ioslides presentation doesn't.

- PDF (beamer)

This format allows us to create a beamer presentation (LaTeX). The slides can be broken up into sections by using the heading tags # and ##. If a header is not needed, a new slide can be created using a horizontal rule (---).

- PowerPoint

---
In the following, we show an example of the header of an R Markdown file.

We can use the output option to choose which kind of presentation we would like to have.

- output: ioslides_presentation
- output: slidy_presentation
- output: beamer_presentation
- output: powerpoint_presentation

To render an R Markdown document into its final output format, we can click the Knit button in RStudio; RStudio will then show a preview of the rendered document.
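
Rendering can also be done from the R console with rmarkdown::render(), which is what the Knit button calls behind the scenes. The sketch below writes a tiny R Markdown file to a temporary folder and renders it; the file name, title, and slide content are made up for illustration, and rendering is skipped when pandoc is not available.

```r
library(rmarkdown)

# write a tiny R Markdown file to a temporary folder (name and content are made up)
rmd_file <- file.path(tempdir(), "slides.Rmd")
writeLines(c(
  "---",
  "title: \"Demo slides\"",
  "output: ioslides_presentation",
  "---",
  "",
  "## First slide",
  "",
  "- R Markdown mixes text and R code"
), rmd_file)

# output_format overrides the output option given in the header
if (pandoc_available()) {
  out_file <- render(rmd_file, output_format = "slidy_presentation", quiet = TRUE)
  print(basename(out_file))
}
```

The first four lines written to the file form the header; changing the output line there has the same effect as passing a different output_format.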

Further settings for presentations can be found in [R Markdown: The Definitive Guide](https://bookdown.org/yihui/rmarkdown/presentations.html).

---
### xaringan Presentation

An easy way to start creating a xaringan presentation is to use the R Markdown template Ninja Presentation or Ninja Themed Presentation.

A comprehensive tutorial on xaringan presentations can be found at [xaringan Presentation](https://bookdown.org/yihui/rmarkdown/xaringan.html).

---
## Flex Dashboard

An easy way to start creating a Flex dashboard is to use the R Markdown template Flex Dashboard.

- We can use # to create multiple pages. 
- We can use orientation in the output options to specify the layout to be columns or rows.

A comprehensive tutorial on Flex Dashboard can be found at [flexdashboard](https://pkgs.rstudio.com/flexdashboard/).

---
name: github

# Session 5: A Quick Overview of GitHub

In this session, we will provide a quick overview of GitHub. Because class time is limited, it will cover only some basic uses of GitHub Desktop.

### What is Git?

Git is a version control system that allows us to track changes in any set of files. It is typically used by programmers who are working on source code together.

### What is GitHub?

GitHub is a hosting platform for Git repositories that provides version control and collaboration tools for programming. It allows us to collaborate on projects with other people from any location.

---
## Register for a GitHub account

We can register for a GitHub account at [www.github.com](https://github.com/).

---
## Install and Set up GitHub Desktop

<div class="figure" style="text-align: center">
<iframe src="https://desktop.github.com/" width="80%" height="450px" data-external="1"></iframe>
Installing GitHub Desktop from https://desktop.github.com/
</div>

---
## Create a repository, track changes, and explore a file's history

We will walk through a process that shows how to create a repository, track changes, and explore a file's history.

Here is an example that shows what GitHub Desktop looks like.

---

- Current repository
- Current branch
- Fetch origin

Fetch downloads the most recent updates from origin but it does not update our local working copy with the changes. After we click Fetch origin, the button changes to Pull Origin. Clicking Pull Origin will update our local working copy with the fetched updates.

- Summary (required) & Description (optional)
- Commit to master

When we commit the changes, the list of uncommitted changes disappears from the left pane. However, we have only committed the changes locally. The commit must still be pushed to the remote (origin) repository.

---
## Use GitHub Pages to Publish an HTML file

The HTML file needs to be named index.html.

1. Sign in to our GitHub account at [www.github.com](https://github.com/)
2. Navigate to the repository where our html file is
3. Click Settings and find Pages from the left menu 
4. Under "GitHub Pages", use the None or Branch drop-down menu and select a publishing source
5. Click Save

---

## Thanks

.pull-left[
- Please do not hesitate to contact Dr. Chen at <a href="mailto:ychen4@udayton.edu">&nbsp; ychen4@udayton.edu</a> if you have questions pertaining to learning R or other languages.

- The R code used in this presentation can be found [here](https://raw.githubusercontent.com/Ying-Ju/MathClub.github.io/main/job_analysis.R).

- Slides were created via the R package **xaringan**, with styling based on:  
  * [xaringanthemer](https://cran.r-project.org/web/packages/xaringanthemer/vignettes/xaringanthemer.html) package, and  
  * Alison Hill's [@apreshill](https://github.com/apreshill/) CSS resources for customizing themes and fonts

- The formatting of slides is provided by Dr. Fadel M. Megahed [@fmegahed](https://github.com/fmegahed). 
]

.pull-right[
<img src="https://c.tenor.com/XgehSxepiagAAAAC/are-there-any-questions-eric-cartman.gif" width="350" height="350" />
]