MTH 209 Data Manipulation and Management

Lesson 5: Installing R Packages & Reading Data Files

Ying-Ju Tessa Chen
ychen4@udayton.edu
University of Dayton

Installing and Loading Packages - 1

Packages are the standard unit for sharing code in R. In general, a package contains code, data, documentation, and tests. Most people publish their packages on CRAN, the Comprehensive R Archive Network, while some share their code on GitHub or other websites. It is recommended that you ONLY download packages from CRAN, since packages there must pass automated checks and are generally well maintained.

In order to import packages in RStudio, you need to

  1. know the name of the package.

  2. download the package. Here, we introduce two basic methods:

    • In the Console window, run install.packages("package_name").

      Note: It is essential to put quotation marks around the package's name.

    • Click the Packages tab in RStudio (bottom right window) and then click Install. Under Install from:, select Repository (CRAN), type the name of the package in the box under Packages (separate multiple names with a space or comma), and click Install.

      Note: Leave Install dependencies checked so that R also downloads any additional packages that the package you are installing needs.

Installing and Loading Packages - 2

  3. Use the library() or require() function to load the package you would like to use. Here, we show how to install the package tidyverse, which is designed for data science, and how to load it.
install.packages("tidyverse")
library(tidyverse)

Note:

  1. Sometimes, warning messages appear in the Console when installing or loading certain packages, indicating that the package was built under a different version of R than the one you are running (typically a newer one). In general, these warnings can be ignored, since such packages usually still work; updating R makes the warnings go away.

  2. You only need to install a package once, the first time you need it. After that, you only need to load it with library() in each new R session.

  3. The main difference between library() and require() is that library() stops with an error if the package is not installed, while require() returns FALSE and gives a warning.
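
The difference between library() and require() can be checked directly. The following is a minimal sketch: the package name notARealPackage123 is made up and assumed not to be installed, and install_if_missing() is a hypothetical helper written for illustration, not a function from base R or tidyverse.

```r
# "notARealPackage123" is a made-up name, assumed not to be installed.
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package is missing.
loaded <- suppressWarnings(require(pkg, character.only = TRUE))
print(loaded)

# library() stops with an error instead; tryCatch() captures it here.
result <- tryCatch(
  {
    library(pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "error"
)
print(result)

# Hypothetical helper: install a package only if it is not installed yet.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}
# Example call (commented out so that nothing is downloaded by accident):
# install_if_missing("tidyverse")
```

Because require() returns a value instead of stopping, it is the more convenient choice inside an if statement, while library() is the safer choice at the top of a script, where a missing package should stop the run immediately.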

Importing Data and Writing Files - 1

In this section, we introduce two methods of importing data from some commonly used file formats, and we show how to write data to files.

  1. Using the Import Dataset button in the Environment tab in RStudio (top right window).

Importing Data and Writing Files - 2

  2. Using code. Since there are many file types, we will focus on two commonly used ones: text files and comma-separated values (CSV) files. We will use the package readr, which is included in tidyverse, as it provides a fast and convenient way to read rectangular data (e.g. csv, tsv, and fwf). readr provides the following functions for reading different file types:
            - read_csv(): comma-separated (CSV) files
            - read_csv2(): semicolon-separated files
            - read_delim(): general delimited files
            - read_fwf(): fixed-width files
            - read_log(): web log files
            - read_table(): tabular files where columns are separated by whitespace
            - read_tsv(): tab-separated files
          Some common arguments in these functions:
              - file: can be either a path to a file, a connection, or literal data
              - col_names: can be either TRUE, FALSE, or a character vector of column names

Note: A CSV (comma-separated values) file is a text file in which information is separated by commas.

In general, these functions work well with their default settings. We supply the path to a file and obtain a tibble, which is a modern reimagining of the data frame. It is much easier to navigate, view, and manipulate the data in a tibble when every row corresponds to an observation and every column corresponds to a variable.
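
The file and col_names arguments can be tried out without any file on disk. The following is a minimal sketch, assuming readr 2.0.0 or later (needed for I() and show_col_types); the data and the column names student and points are made up for illustration.

```r
library(readr)

# Literal data wrapped in I() is treated as file contents, not as a path.
csv_text <- I("name,score\nAnn,90\nBob,85")

# Default col_names = TRUE: the first line supplies the column names.
with_header <- read_csv(csv_text, show_col_types = FALSE)

# col_names = FALSE: the first line is read as data; columns are named X1, X2.
no_header <- read_csv(csv_text, col_names = FALSE, show_col_types = FALSE)

# A character vector supplies the names; skip = 1 drops the original header.
renamed <- read_csv(csv_text, col_names = c("student", "points"), skip = 1,
                    show_col_types = FALSE)

print(with_header)
print(names(no_header))
print(names(renamed))
```

Note that with col_names = FALSE the header line becomes the first row of data, so both columns are read as character.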

Importing Data and Writing Files - 3

The following code chunk gives an example of reading a data file.

library(tidyverse)
ds_salaries <- read_csv("C:/Users/ychen4/Dropbox/MTH 209/Data for Brainstorm Activities/ds_salaries.csv")
head(ds_salaries) # use head() to show the first six rows of the data
## # A tibble: 6 × 12
##    ...1 work_year exper…¹ emplo…² job_t…³ salary salar…⁴ salar…⁵ emplo…⁶ remot…⁷
##   <dbl>     <dbl> <chr>   <chr>   <chr>    <dbl> <chr>     <dbl> <chr>     <dbl>
## 1     0      2020 MI      FT      Data S…  70000 EUR       79833 DE            0
## 2     1      2020 SE      FT      Machin… 260000 USD      260000 JP            0
## 3     2      2020 SE      FT      Big Da…  85000 GBP      109024 GB           50
## 4     3      2020 MI      FT      Produc…  20000 USD       20000 HN            0
## 5     4      2020 SE      FT      Machin… 150000 USD      150000 US           50
## 6     5      2020 EN      FT      Data A…  72000 USD       72000 US          100
## # … with 2 more variables: company_location <chr>, company_size <chr>, and
## #   abbreviated variable names ¹​experience_level, ²​employment_type, ³​job_title,
## #   ⁴​salary_currency, ⁵​salary_in_usd, ⁶​employee_residence, ⁷​remote_ratio

Importing Data and Writing Files - 3

glimpse(ds_salaries) # use glimpse() to get a glimpse of the data
## Rows: 607
## Columns: 12
## $ ...1               <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ work_year          <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 202…
## $ experience_level   <chr> "MI", "SE", "SE", "MI", "SE", "EN", "SE", "MI", "MI…
## $ employment_type    <chr> "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT", "FT…
## $ job_title          <chr> "Data Scientist", "Machine Learning Scientist", "Bi…
## $ salary             <dbl> 70000, 260000, 85000, 20000, 150000, 72000, 190000,…
## $ salary_currency    <chr> "EUR", "USD", "GBP", "USD", "USD", "USD", "USD", "H…
## $ salary_in_usd      <dbl> 79833, 260000, 109024, 20000, 150000, 72000, 190000…
## $ employee_residence <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
## $ remote_ratio       <dbl> 0, 0, 50, 0, 50, 100, 100, 50, 100, 50, 0, 0, 0, 10…
## $ company_location   <chr> "DE", "JP", "GB", "HN", "US", "US", "US", "HU", "US…
## $ company_size       <chr> "L", "S", "M", "S", "L", "L", "S", "L", "L", "S", "…

Note: glimpse() is provided by the package dplyr, which is loaded as part of tidyverse.

Importing Data and Writing Files - 4

The file argument can also be a URL, so we can read data that is available online as well.

OH_COVID <- read_csv("https://coronavirus.ohio.gov/static/dashboards/COVIDDeathData_CountyOfDeath.csv")

glimpse(OH_COVID)
## Rows: 765,639
## Columns: 11
## $ County                                         <chr> "Adams", "Adams", "Adam…
## $ Sex                                            <chr> NA, NA, NA, NA, NA, NA,…
## $ `Age Range`                                    <chr> "0-19", "0-19", "0-19",…
## $ `Onset Date`                                   <date> 2020-12-10, 2020-12-11…
## $ `Admission Date`                               <chr> NA, NA, NA, NA, NA, NA,…
## $ `Date Of Death`                                <date> NA, NA, NA, NA, NA, NA…
## $ `Case Count`                                   <dbl> 2, 1, 1, 1, 1, 2, 1, 1,…
## $ `Hospitalized Count`                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `Death Due To Illness Count - County Of Death` <dbl> 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `State of Death`                               <chr> NA, NA, NA, NA, NA, NA,…
## $ `State of Residence`                           <chr> NA, NA, NA, NA, NA, NA,…

Note: You may see that some variable names are enclosed in backticks (`). This is because the name contains at least one space, so it is not a syntactically valid R name and must be quoted with backticks when used in code.

Question: If we want to remove the backticks from the variable names, what could be a possible solution?
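
One possible answer to the question above, as a minimal sketch rather than the only solution: replace the spaces in the names with underscores so that the names no longer need to be quoted. The data frame toy below is made up for illustration and uses only base R.

```r
# A small made-up data frame whose column names contain spaces.
# check.names = FALSE keeps the names exactly as written.
toy <- data.frame(`Age Range` = c("0-19", "20-29"),
                  `Case Count` = c(2, 1),
                  check.names = FALSE)

print(names(toy))

# Replace every space with an underscore so the names become syntactic.
names(toy) <- gsub(" ", "_", names(toy), fixed = TRUE)

print(names(toy))

# The columns can now be referenced without any quoting.
print(toy$Case_Count)
```

The same replacement can be applied to a tibble; in tidyverse code, dplyr's rename_with() can apply a function such as this one to all column names.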

Importing Data and Writing Files - 5

The second data file can be obtained by querying CDC WONDER. It is a tab-separated text file, so we read it with read_tsv().

CDC_Death <- read_tsv("C:/Users/ychen4/Dropbox/MTH 209/Class Handouts/Data/Underlying Cause of Death.txt")
glimpse(CDC_Death)
## Rows: 4,220
## Columns: 14
## $ Notes                       <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Year                        <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, …
## $ `Year Code`                 <dbl> 2005, 2005, 2005, 2005, 2005, 2005, 2005, …
## $ `Five-Year Age Groups`      <chr> "< 1 year", "< 1 year", "< 1 year", "< 1 y…
## $ `Five-Year Age Groups Code` <chr> "1", "1", "1", "1", "1", "1", "1", "1", "1…
## $ Gender                      <chr> "Female", "Female", "Female", "Female", "F…
## $ `Gender Code`               <chr> "F", "F", "F", "F", "F", "F", "F", "F", "M…
## $ Race                        <chr> "American Indian or Alaska Native", "Asian…
## $ `Race Code`                 <chr> "1002-5", "A-PI", "2054-5", "2054-5", "205…
## $ `Hispanic Origin`           <chr> "Not Hispanic or Latino", "Not Hispanic or…
## $ `Hispanic Origin Code`      <chr> "2186-2", "2186-2", "2135-2", "2186-2", "N…
## $ Deaths                      <dbl> 56, 94, 26, 895, 10, 382, 1373, 15, 83, 10…
## $ Population                  <chr> "8158", "20150", "7024", "74736", "Not App…
## $ `Crude Rate`                <chr> "686.4", "466.5", "370.2", "1197.5", "Not …

Note: In many programming languages, such as C, C++, Java, Python, Perl, and R, a backslash, \, works as an escape character in strings. So when we write a Windows file path as a string in these languages, we need to use either a forward slash, /, or a double backslash, \\, wherever the path contains a single backslash.
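
The two spellings can be compared in R. The following is a minimal sketch; the path is hypothetical and no file is actually opened.

```r
# Two equivalent ways to write the same (hypothetical) Windows path in R.
path_slash     <- "C:/Users/me/Documents/data.csv"
path_backslash <- "C:\\Users\\me\\Documents\\data.csv"

# "\\" in the source code is one backslash character in the string.
print(nchar("\\"))

# cat() shows the characters that are actually stored in the string.
cat(path_backslash, "\n")

# Converting backslashes to forward slashes yields the first form.
converted <- gsub("\\", "/", path_backslash, fixed = TRUE)
print(identical(converted, path_slash))
```

R 4.0.0 and later also accept raw strings such as r"(C:\Users\me)", in which backslashes are not treated as escape characters.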

Importing Data and Writing Files - 6

Similarly, readr provides the following functions to write files:

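      - write_csv(): comma-separated (CSV) files
      - write_csv2(): semicolon-separated files
      - write_delim(): general delimited files
      - write_excel_csv(): comma-separated files for Excel (adds a UTF-8 byte order mark so that Excel detects the encoding)
      - write_excel_csv2(): semicolon-separated files for Excel
      - write_tsv(): tab-separated files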

Some Common arguments in these functions:

      - x: a data frame
      - file: file or connection to write to, including the file name (this argument was called path in readr versions before 1.4.0).
      - delim: delimiter used to separate values.
      - na: string used for missing values. Defaults to NA.
      - append: if FALSE, the function overwrites existing file. If TRUE, it appends to existing file. A new file will be created if the file does not exist.
      - col_names: If TRUE, write columns names at the top of the file.

We can save the CDC WONDER data to a CSV file.

write_csv(CDC_Death, "C:/Users/ychen4/Dropbox/MTH 209/Class Handouts/Data/data/name_of_file.csv")
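
The na and append arguments can be tried safely with a temporary file. The following is a minimal sketch, assuming readr 2.0.0 or later (for show_col_types); the data are made up and tempfile() is used so that nothing in your own folders is overwritten.

```r
library(readr)

# A small made-up data frame with one missing value.
toy <- data.frame(county = c("Adams", "Allen"), deaths = c(3, NA))

# Write to a temporary file; missing values are written as empty fields.
out_file <- tempfile(fileext = ".csv")
write_csv(toy, out_file, na = "")

# append = TRUE adds rows to the existing file without repeating the header.
write_csv(data.frame(county = "Ashland", deaths = 5), out_file, append = TRUE)

# Read the file back to check the result.
check <- read_csv(out_file, show_col_types = FALSE)
print(check)
```

Reading the file back shows three rows, with the empty field read as NA again.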
