MTH 209 Data Manipulation and Management

class: center, middle, inverse, title-slide

.title[
# MTH 209 Data Manipulation and Management
]
.subtitle[
## Lesson 5: Installing R Packages & Reading Data Files
]
.author[
### <br>Ying-Ju Tessa Chen, PhD <br><br> Associate Professor <br> Department of Mathematics<br> University of Dayton <br><br> <a href="https://twitter.com/ju_tessa"><svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M459.37 151.716c.325 4.548.325 9.097.325 13.645 0 138.72-105.583 298.558-298.558 298.558-59.452 0-114.68-17.219-161.137-47.106 8.447.974 16.568 1.299 25.34 1.299 49.055 0 94.213-16.568 130.274-44.832-46.132-.975-84.792-31.188-98.112-72.772 6.498.974 12.995 1.624 19.818 1.624 9.421 0 18.843-1.3 27.614-3.573-48.081-9.747-84.143-51.98-84.143-102.985v-1.299c13.969 7.797 30.214 12.67 47.431 13.319-28.264-18.843-46.781-51.005-46.781-87.391 0-19.492 5.197-37.36 14.294-52.954 51.655 63.675 129.3 105.258 216.365 109.807-1.624-7.797-2.599-15.918-2.599-24.04 0-57.828 46.782-104.934 104.934-104.934 30.213 0 57.502 12.67 76.67 33.137 23.715-4.548 46.456-13.32 66.599-25.34-7.798 24.366-24.366 44.833-46.132 57.827 21.117-2.273 41.584-8.122 60.426-16.243-14.292 20.791-32.161 39.308-52.628 54.253z"></path></svg> <span class="citation">@ying-ju</span></a> <br> <a href="https://github.com/ying-ju/"><svg viewBox="0 0 496 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 
27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"></path></svg> ying-ju</a> <br> <a href="mailto:ychen4@udayton.edu"><svg viewBox="0 0 512 512" style="height:1em;position:relative;display:inline-block;top:.1em;fill:white;" xmlns="http://www.w3.org/2000/svg"> <path d="M476 3.2L12.5 270.6c-18.1 10.4-15.8 35.6 2.2 43.2L121 358.4l287.3-253.2c5.5-4.9 13.3 2.6 8.6 8.3L176 407v80.5c0 23.6 28.5 32.9 42.5 15.8L282 426l124.6 52.2c14.2 6 30.4-2.9 33-18.2l72-432C515 7.8 493.3-6.8 476 3.2z"></path></svg> ychen4@udayton.edu</a><br>
]

---

# Learning Objectives

In this session, we will learn

- How to install and load R packages

- How to read and write CSV files

---

## R Packages

.small[
Packages are the standard units for sharing code in R. In general, a package contains code, data, documentation, tests, etc. Most people upload their packages to [CRAN](https://cran.r-project.org/), the Comprehensive R Archive Network, while a few share their code on [GitHub](https://github.com/) or other websites.  It is recommended that you install packages ONLY from CRAN, since CRAN packages are checked before publication and are generally well maintained.

To use a package in RStudio, you need to

1. know the name of the package.

2. download (install) the package. Here, we introduce two basic methods, starting with the Console:
    - In the Console window, run **install.packages("package_name")**, replacing package_name with the name of the package. 
  
**Note:** The quotation marks around the package's name are required.]
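
The Console method can be wrapped so that a package is downloaded only when it is missing. The following is a minimal sketch, not part of the original lesson: the helper name `install_if_missing()` is hypothetical, and the call uses `utils` (which ships with R) so that nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package from CRAN only if it is not installed yet.
install_if_missing <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be found, without attaching it
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # the package name is passed as a character string
  }
  # Report (invisibly) whether the package is available now
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# "utils" is always available, so this call does not download anything.
# Replace it with, e.g., "tidyverse" to install a CRAN package.
available <- install_if_missing("utils")
print(available)
```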

---
## Installing Packages

.small[Alternatively, click the **Packages** tab in RStudio (bottom right window) and then click .green[Install]. Under .green[Install From:], select .green[Repository (CRAN)], type the name of the package in the box under **Packages (separate multiple with space or comma)**, and click .green[Install].]
  
<img src="data:image/png;base64,#../Figures/CRAN.jpg" width="43%" style="display: block; margin: auto;" />

.small[**Note:** Leave **Install dependencies** checked so that R also downloads any additional packages required by the package you are installing. ]

---
## Loading Packages

.small[Use the .green[library()] or .green[require()] function to load the package you would like to use. Here, we show how to install and load .flyerblue[tidyverse], a collection of packages designed for data science. ]

```r
install.packages("tidyverse") # install the package (only needed once)
library(tidyverse)            # load the package (needed in every new R session)
```

.small[
**Note:**

1. Sometimes, a warning appears in the Console saying that a package was built under a different version of R than the one you are running.  In general, these warnings can be ignored since the package usually still works.

2. You only need to install a package once. After that, load it with .green[library()] in each new R session in which you need it. 

3. The main difference between .green[library()] and .green[require()] is that .green[library()] returns an error if the package is not installed, while .green[require()] returns FALSE and gives a warning.]
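
The difference described in the third note can be checked directly. This is a small sketch under one assumption: the package name `notARealPackage123` is made up and is assumed not to be installed.

```r
# A made-up package name that is assumed NOT to be installed
fake_pkg <- "notARealPackage123"

# require() returns FALSE (and gives a warning) when the package is missing;
# character.only = TRUE tells R that fake_pkg holds the name as a string
loaded <- suppressWarnings(require(fake_pkg, character.only = TRUE))
print(loaded)

# library() stops with an error instead; tryCatch() captures that error here
result <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(result)
```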

---
## Importing Data - 1

.small[In this section, we introduce two methods of importing data from some commonly used file formats, and then show how to write files.

1. Using the .green[Import Dataset] tab in RStudio (in the top right window).]

---
## Importing Data - 2

.left-column[
.center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/readr.png" width="60%">]
]
.right-column[
### Reading Plain-Text Rectangular <svg aria-hidden="true" role="img" viewBox="0 0 384 512" style="height:1em;width:0.75em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M320 464c8.8 0 16-7.2 16-16V160H256c-17.7 0-32-14.3-32-32V48H64c-8.8 0-16 7.2-16 16V448c0 8.8 7.2 16 16 16H320zM0 64C0 28.7 28.7 0 64 0H229.5c17 0 33.3 6.7 45.3 18.7l90.5 90.5c12 12 18.7 28.3 18.7 45.3V448c0 35.3-28.7 64-64 64H64c-35.3 0-64-28.7-64-64V64z"/></svg>
#### .small[(a.k.a. flat or spreadsheet-like files)]
* delimited text files with .green[read_delim()]
  + `.csv`: comma (",") separated values with .green[read_csv()]
  + `.csv`: semicolon (";") separated values with .green[read_csv2()]
  + `.tsv`: tab ("\t") separated values with .green[read_tsv()]
* `.fwf`: fixed width files with .green[read_fwf()]

.small[**Note:** A CSV (comma-separated values) file is a .flyerblue[text] file in which information is separated by commas.]]
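
To see how the delimiter determines which function to use, here is a minimal sketch that reads small made-up data sets written as literal strings instead of files. It assumes readr 2.0 or later, where literal data is wrapped in `I()` and `show_col_types = FALSE` silences the column summary; all data values and object names are invented for illustration.

```r
library(readr)

# Made-up literal data in three delimited formats (assumption: readr >= 2.0)
csv_text  <- I("name,score\nAda,90\nBen,85\n")      # comma separated
csv2_text <- I("name;score\nAda;90,5\nBen;85,0\n")  # semicolon separated, decimal comma
tsv_text  <- I("name\tscore\nAda\t90\nBen\t85\n")   # tab separated
pipe_text <- I("name|score\nAda|90\nBen|85\n")      # any other single-character delimiter

comma_data     <- read_csv(csv_text, show_col_types = FALSE)
semicolon_data <- read_csv2(csv2_text, show_col_types = FALSE)  # reads "90,5" as 90.5
tab_data       <- read_tsv(tsv_text, show_col_types = FALSE)
pipe_data      <- read_delim(pipe_text, delim = "|", show_col_types = FALSE)

print(semicolon_data)
```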

---
## Importing Data - 3

.small[
Another useful function:
<p style = "margin-left: 25px;">
  .green[read_table()]: tabular files where columns are separated by white-space.
</p>

Some common arguments in these functions: 
  - .purple[file]: can be either a path to a file, a connection, or literal data
  - .purple[col_names]: can be either TRUE, FALSE, or a character vector of column names

In general, these functions work well with their default settings.  We supply the path to a file and obtain a tibble, which is a modern reimagining of the data frame.  A tibble makes the data easy to navigate, view, and manipulate: every row corresponds to an observation and every column corresponds to a variable.
]
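
The three options for .purple[col_names] can be illustrated with a file that has no header row. This is a sketch only: the data is made up, and it assumes readr 2.0 or later (for `I()` and `show_col_types`).

```r
library(readr)

# Made-up literal data WITHOUT a header row (assumption: readr >= 2.0)
raw_text <- I("Ada,90\nBen,85\n")

# col_names = TRUE (the default) would treat "Ada,90" as the header row.
# col_names = FALSE keeps every row as data and names the columns X1, X2, ...
no_header <- read_csv(raw_text, col_names = FALSE, show_col_types = FALSE)

# A character vector supplies our own column names
named <- read_csv(raw_text, col_names = c("name", "score"), show_col_types = FALSE)

print(names(no_header))
print(named)
```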

---
## Importing Data - 4

.small[The following code chunk gives an example of reading a data file. ]

```r
library(tidyverse)
ds_salaries <- read_csv("C:/Users/Tessa Chen/Documents/GitHub/ying-ju-web.github.io/teaching/MTH209/Lectures/Datasets/ds_salaries.csv")
head(ds_salaries) # use head() to show the first six rows of the data
```

```
## # A tibble: 6 × 11
##   work_year experience_level employment_type job_title    salary salary_currency
##       <dbl> <chr>            <chr>           <chr>         <dbl> <chr>          
## 1      2020 MI               FT              Data Scient…  70000 EUR            
## 2      2020 SE               FT              Machine Lea… 260000 USD            
## 3      2020 SE               FT              Big Data En…  85000 GBP            
## 4      2020 MI               FT              Product Dat…  20000 USD            
## 5      2020 SE               FT              Machine Lea… 150000 USD            
## 6      2020 EN               FT              Data Analyst  72000 USD            
## # ℹ 5 more variables: salary_in_usd <dbl>, employee_residence <chr>,
## #   remote_ratio <dbl>, company_location <chr>, company_size <chr>
```

---
## Importing Data - 5

.small[Use the .green[glimpse()] function to get a compact overview of the data: the number of rows and columns, and the type and first few values of each column.

```r
glimpse(ds_salaries) # use glimpse() to get a glimpse of the data
```

```
## Rows: 607
## Columns: 11
## $ work_year          <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
## $ experience_level   <chr> "MI", "SE", "SE", "MI", "SE", "EN", "SE", "MI", "MI…
## $ employment_type    <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
## $ job_title          <chr> "Data Scientist", "Machine Learning Scientist", "Bi…
## $ salary             <dbl> 70000, 260000, 85000, 20000, 150000, 72000, 190000,…
## $ salary_currency    <chr> "EUR", "USD", "GBP", "USD", "USD", "USD", "USD", "H…
## $ salary_in_usd      <dbl> 79833, 260000, 109024, 20000, 150000, 72000, 190000…
## $ employee_residence <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
## $ remote_ratio       <dbl> 0, 0, 50, 0, 50, 100, 100, 50, 100, 50, 0, 0, 0, 10…
## $ company_location   <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
## $ company_size       <chr> "L", "S", "M", "S", "L", "L", "S", "L", "L", "S", "…
```

**Note:** .green[glimpse()] is provided by .flyerblue[dplyr], which is loaded as part of .flyerblue[tidyverse].]

---
## Importing Data - 6

.pull-left-2[.small[
**GitHub** <svg aria-hidden="true" role="img" viewBox="0 0 496 512" style="height:1em;width:0.97em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:currentColor;overflow:visible;position:relative;"><path d="M165.9 397.4c0 2-2.3 3.6-5.2 3.6-3.3.3-5.6-1.3-5.6-3.6 0-2 2.3-3.6 5.2-3.6 3-.3 5.6 1.3 5.6 3.6zm-31.1-4.5c-.7 2 1.3 4.3 4.3 4.9 2.6 1 5.6 0 6.2-2s-1.3-4.3-4.3-5.2c-2.6-.7-5.5.3-6.2 2.3zm44.2-1.7c-2.9.7-4.9 2.6-4.6 4.9.3 2 2.9 3.3 5.9 2.6 2.9-.7 4.9-2.6 4.6-4.6-.3-1.9-3-3.2-5.9-2.9zM244.8 8C106.1 8 0 113.3 0 252c0 110.9 69.8 205.8 169.5 239.2 12.8 2.3 17.3-5.6 17.3-12.1 0-6.2-.3-40.4-.3-61.4 0 0-70 15-84.7-29.8 0 0-11.4-29.1-27.8-36.6 0 0-22.9-15.7 1.6-15.4 0 0 24.9 2 38.6 25.8 21.9 38.6 58.6 27.5 72.9 20.9 2.3-16 8.8-27.1 16-33.7-55.9-6.2-112.3-14.3-112.3-110.5 0-27.5 7.6-41.3 23.6-58.9-2.6-6.5-11.1-33.3 2.6-67.9 20.9-6.5 69 27 69 27 20-5.6 41.5-8.5 62.8-8.5s42.8 2.9 62.8 8.5c0 0 48.1-33.6 69-27 13.7 34.7 5.2 61.4 2.6 67.9 16 17.7 25.8 31.5 25.8 58.9 0 96.5-58.9 104.2-114.8 110.5 9.2 7.9 17 22.9 17 46.4 0 33.7-.3 75.4-.3 83.6 0 6.5 4.6 14.4 17.3 12.1C428.2 457.8 496 362.9 496 252 496 113.3 383.5 8 244.8 8zM97.2 352.9c-1.3 1-1 3.3.7 5.2 1.6 1.6 3.9 2.3 5.2 1 1.3-1 1-3.3-.7-5.2-1.6-1.6-3.9-2.3-5.2-1zm-10.8-8.1c-.7 1.3.3 2.9 2.3 3.9 1.6 1 3.6.7 4.3-.7.7-1.3-.3-2.9-2.3-3.9-2-.6-3.6-.3-4.3.7zm32.4 35.6c-1.6 1.3-1 4.3 1.3 6.2 2.3 2.3 5.2 2.6 6.5 1 1.3-1.3.7-4.3-1.3-6.2-2.2-2.3-5.2-2.6-6.5-1zm-11.4-14.7c-1.6 1-1.6 3.6 0 5.9 1.6 2.3 4.3 3.3 5.6 2.3 1.6-1.3 1.6-3.9 0-6.2-1.4-2.3-4-3.3-5.6-2z"/></svg> Repositories, e.g.,

- [Bank Marketing](https://github.com/selva86/datasets/blob/master/bank-full.csv) - focusing on `bank-full.csv`
  - Original Source: [UCI Machine Learning](https://archive.ics.uci.edu/dataset/222/bank+marketing)]

<img src="data:image/png;base64,#../Figures/bankdata.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right-2[

#### R Code

.small[
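Since the .flyerblue[readr] package was already loaded with .flyerblue[tidyverse], we can use the function .green[read_csv2()] directly. We use .green[read_csv2()] here because the values in this file are separated by semicolons. ]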

```r
# Import the Bank Marketing Data
bank <- read_csv2("https://raw.githubusercontent.com/selva86/datasets/master/bank-full.csv")
```

```r
# Show the first 6 rows of the data
head(bank)
# glimpse() comes from dplyr (already attached if tidyverse is loaded)
library(dplyr)
glimpse(bank)
```

]

---
## Importing Data - 7

.small[The second data file can be obtained by querying [CDC WONDER](https://wonder.cdc.gov/ucd-icd10.html).]

.footnotesize[

```r
CDC_Death <- read_tsv("../Datasets/Underlying Cause of Death.txt")
glimpse(CDC_Death)
```

```
## Rows: 4,220
## Columns: 14
## $ Notes                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Year                        <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, …
## $ `Year Code`                 <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, …
## $ `Five-Year Age Groups`      <chr> "< 1 year", "< 1 year", "< 1 year", "< 1 y…
## $ `Five-Year Age Groups Code` <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ Gender                      <chr> "Female", "Female", "Female", "Female", "F…
## $ `Gender Code`               <chr> "F", "F", "F", "F", "F", "F", "F", "F", "M…
## $ Race                        <chr> "American Indian or Alaska Native", "Asian…
## $ `Race Code`                 <chr> "1002-5", "A-PI", "2054-5", "2054-5", "205…
## $ `Hispanic Origin`           <chr> "Not Hispanic or Latino", "Not Hispanic or…
## $ `Hispanic Origin Code`      <chr> "2186-2", "2186-2", "2135-2", "2186-2", "N…
## $ Deaths                      <dbl> 56, 94, 26, 895, 10, 382, 1373, 15, 83, 10…
## $ Population                  <chr> "8158", "20150", "7024", "74736", "Not App…
## $ `Crude Rate`                <chr> "686.4", "466.5", "370.2", "1197.5", "Not …
```

**Note:** In many programming languages, such as C, C++, Java, Python, Perl, and R, a backslash, \\, works as an escape character in strings. So, when writing a file path in these languages, we need to use either a forward slash, /, or a double backslash, \\\\, wherever the path would contain a single backslash. ]
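
The note on backslashes can be verified in R itself. The sketch below uses a made-up Windows-style path (the folder and file names are hypothetical) and shows that the forward-slash and double-backslash spellings describe the same location.

```r
# Two ways to write the same (hypothetical) Windows path in an R string
path_forward <- "C:/Users/example/data.csv"
path_escaped <- "C:\\Users\\example\\data.csv"

# "\\" in the source code is ONE backslash character in the string
print(nchar("\\"))

# cat() shows the string as R stores it, with single backslashes
cat(path_escaped, "\n")

# Replacing each backslash with a forward slash gives the first spelling
path_converted <- gsub("\\", "/", path_escaped, fixed = TRUE)
print(identical(path_converted, path_forward))

# file.path() builds paths with forward slashes, so no escaping is needed
path_built <- file.path("C:", "Users", "example", "data.csv")
print(path_built)
```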

---
## Reading Proprietary Binary Files

.left-column[
.center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/haven.png" width="60%">]
]
.right-column[

Several functions from the [haven](https://haven.tidyverse.org/) <svg aria-hidden="true" role="img" viewBox="0 0 448 512" style="height:1em;width:0.88em;vertical-align:-0.125em;margin-left:auto;margin-right:auto;font-size:inherit;fill:gold;overflow:visible;position:relative;"><path d="M50.7 58.5L0 160H208V32H93.7C75.5 32 58.9 42.3 50.7 58.5zM240 160H448L397.3 58.5C389.1 42.3 372.5 32 354.3 32H240V160zm208 32H0V416c0 35.3 28.7 64 64 64H384c35.3 0 64-28.7 64-64V192z"/></svg> can be used to read and write formats used by other statistical packages. Example functions include:

- SAS
  + `.sas7bdat` with .green[read_sas()]
  
- Stata
  + `.dta` with .green[read_dta()]
  
- SPSS
  + `.sav` with .green[read_sav()]

**Please refer to the help file for each of these functions for more details.**
]
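
As a rough illustration of reading and writing one of these formats, the sketch below writes a small made-up data frame to a temporary Stata file with `write_dta()` and reads it back with `read_dta()`. It assumes the haven package is installed; the data and object names are invented.

```r
library(haven)

# A small made-up data frame
scores <- data.frame(name = c("Ada", "Ben"), score = c(90, 85))

# Write it to a temporary Stata (.dta) file, then read it back
dta_file <- tempfile(fileext = ".dta")
write_dta(scores, dta_file)
scores_back <- read_dta(dta_file)

print(scores_back)
```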
---
## Writing Data - 1

.small[
Similarly, [readr](https://readr.tidyverse.org/reference/write_delim.html) provides the following functions to write files:

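- .green[write_csv()]: comma separated (CSV) files
- .green[write_csv2()]: semicolon separated files
- .green[write_delim()]: general delimited files
- .green[write_excel_csv()]: comma separated files intended for Excel (a UTF-8 byte order mark is added)
- .green[write_excel_csv2()]: semicolon separated files intended for Excel
- .green[write_tsv()]: tab separated files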
]

---
## Writing Data - 2

.small[
Some common arguments in the functions on the previous page:

- x: a data frame to write.
- file: file or connection to write to, including the file name (this argument was named path before readr 1.4.0).
- delim: delimiter used to separate values (.green[write_delim()] only).
- na: string used for missing values. Defaults to "NA".
- append: if FALSE, the function overwrites an existing file. If TRUE, it appends to the existing file. In both cases, a new file is created if the file does not exist.
- col_names: if TRUE, write column names at the top of the file.

We can save the CDC WONDER data to a CSV file.

```r
write_csv(CDC_Death, "C:/Users/ychen4/Dropbox/MTH 209/Class Handouts/Data/data/name_of_file.csv")
```
]
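
The .purple[na] and .purple[append] arguments are easiest to understand by trying them. The following sketch writes made-up data to a temporary file instead of a real folder; it assumes readr 2.0 or later (for `show_col_types`), and all names and values are invented.

```r
library(readr)

# A small made-up data frame with one missing value
scores <- data.frame(name = c("Ada", "Ben"), score = c(90, NA))

# Write to a temporary file; na = "" writes missing values as empty fields
out_file <- tempfile(fileext = ".csv")
write_csv(scores, out_file, na = "")

# append = TRUE adds rows to the existing file without repeating the header
write_csv(data.frame(name = "Cy", score = 70), out_file, append = TRUE)

# Read the file back to check what was written
scores_back <- read_csv(out_file, show_col_types = FALSE)
print(scores_back)
```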

---
# Summary of Main Points

By now, you should know

- How to install and load R packages

- How to read and write CSV files

---
# Supplementary Materials

Here are some useful supplementary materials for self-learning.

.pull-left[
.center[[<img src="https://d33wubrfki0l68.cloudfront.net/b88ef926a004b0fce72b2526b0b5c4413666a4cb/24a30/cover.png" height="250px">](https://r4ds.had.co.nz)]
.small[
* [Data Import](https://r4ds.had.co.nz/data-import.html)
]
]
.pull-right[
.center[[<img src="../Figures/RPubs.png" height="250px">](https://RPubs.com)]
.small[
* [Import SAS data with haven](https://rpubs.com/potentialwjy/ImportDataIntoR04)
]
]