class: center, middle, inverse, title-slide .title[ # ACR Conference ] .subtitle[ ## Social-Data Scraping in R
] .author[ ###
Ying-Ju Tessa Chen, PhD
Associate Professor
Department of Mathematics
University of Dayton
@ying-ju
ying-ju
ychen4@udayton.edu
] .date[ ### September 15, 2023 ] --- name: ninja class: middle, inverse # We assume: -- ### <img src="data:image/png;base64,#figures/r_icon.png" width="5%" style="float:left"/> you know R -- ### <img src="data:image/png;base64,#https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/RStudio.png" width="5%" style="float:left"/> you know RStudio -- ### <img src="data:image/png;base64,#figures/data_icon.png" width="5%" style="float:left"/> you know some basic data file formats -- ### <img src="data:image/png;base64,#figures/web_crawler.png" width="5%" style="float:left"/> you want to scrape data from a real website --- name: novice class: middle, inverse # What you may not know: -- ### <img src="data:image/png;base64,#figures/packages.png" width="5%" style="float:left"/> some R packages we plan to use today -- ###
web technology --- # Learning Objectives - Read text files, binary files (e.g., Excel, SAS, SPSS, Stata), JSON files, etc. online using `readr`, `haven`, and `jsonlite`
- Scrape a webpage using `rvest`
- Understand when we can scrape data (i.e., `robots.txt`) --- In this workshop, we will need the following packages: .small[ - jsonlite: A package designed to read, write, and manipulate JSON data seamlessly in R. - httr: A tool for working with HTTP connections, simplifying the process of sending and receiving HTTP requests and responses in R. - pacman: A package management tool that streamlines the installation, loading, and maintenance of R packages. - R.utils: A collection of utility functions that facilitate various programming tasks in R, enhancing efficiency and productivity. - rvest: A web scraping package that enables the easy extraction of information from websites directly into R for data analysis. - tidyverse: A cohesive collection of R packages that adhere to a common design philosophy, grammar, and data structures, providing a comprehensive toolkit for data science in R.] ```r install.packages(c("jsonlite", "httr", "pacman", "R.utils", "rvest", "tidyverse")) pacman::p_load(jsonlite, httr, R.utils, rvest, tidyverse) ``` .footnote[.red[The code included can be copied by hovering over the top-right corner of each code chunk.]] --- class: inverse middle # Importing Data ⬇️ --- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/readr.png" width="60%">] ] .right-column[ # Reading Plain-Text Rectangular Files
## .small[(a.k.a. flat or spreadsheet-like files)] * delimited text files with `read_delim()` + `.csv`: comma (",") separated values with `read_csv()` + `.csv`: semicolon (“;”) separated values with `read_csv2()` + `.tsv`: tab ("\t") separated values with `read_tsv()` * `.fwf`: fixed width files with `read_fwf()` <hr> ]
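--- # A Quick Look at Delimiters

The toy strings below are not part of the demos that follow; they are a minimal sketch showing that these `readr` helpers differ mainly in the delimiter (and decimal mark) they expect. With real data you would pass a file path or URL instead of a literal string wrapped in `I()`.

```r
library(readr)

read_csv(I("x,y\n1,2\n3,4"))              # comma-separated values
read_csv2(I("x;y\n1,5;2,5"))              # semicolons, with "," as the decimal mark
read_tsv(I("x\ty\n1\t2"))                 # tab-separated values
read_delim(I("x|y\n1|2"), delim = "|")    # any other single-character delimiter
```

Each call returns a tibble, just as it would when reading from a file or a URL.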
--- # Demos 1 & 2: Reading CSV Data on the Web In these hands-on demos, you will learn how to import: * files that are hosted on the web. + **Data in Webpages
:** - **FRED Data:** e.g., [Unemployment Rate (UNRATE)](https://fred.stlouisfed.org/series/UNRATE) + **GitHub**
Repositories, e.g., - [Bank Marketing](https://github.com/selva86/datasets/blob/master/bank-full.csv) - focusing on `bank-full.csv` - Original Source: [UCI Machine Learning](https://archive.ics.uci.edu/dataset/222/bank+marketing) --- # Demo 1: FRED Data .pull-left-2[ <img src="data:image/png;base64,#./figures/unrate.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-2[ ### R Code ```r # Import data using the read_csv() function unrate <- read_csv("https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2023-07-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2023-08-09&revision_date=2023-08-09&nd=1948-01-01") # Check out the first 6 rows of the data head(unrate) ``` ] --- # Demo 2: Bank Marketing Data .pull-left-2[ <img src="data:image/png;base64,#./figures/bankdata.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-2[ ### R Code Since we have already loaded the `readr` package (attached as part of the `tidyverse`), we can use the function `read_csv2()` directly. ```r # Import the Bank Marketing Data bank <- read_csv2("https://raw.githubusercontent.com/selva86/datasets/master/bank-full.csv") # Check the first 6 rows and the structure of the data head(bank) glimpse(bank) ``` ]
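--- # Reading Excel Files

The learning objectives also mention Excel workbooks. These are handled by the `readxl` package, which is installed with the tidyverse but not attached by `library(tidyverse)`. Below is a minimal sketch using a spreadsheet that ships with `readxl` itself; for a workbook hosted on the web, you would first download it to a temporary file with `download.file()`, because `read_excel()` does not read URLs directly.

```r
library(readxl)

# A small example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# List the sheets, then read the first one
excel_sheets(path)
read_excel(path, sheet = excel_sheets(path)[1])
```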
--- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/haven.png" width="60%">] ] .right-column[ # Reading Proprietary Binary Files Several functions from the [haven](https://haven.tidyverse.org/) package can be used to read and write formats used by other statistical packages. Example functions include: - SAS + `.sas7bdat` with `read_sas()` - Stata + `.dta` with `read_dta()` - SPSS + `.sav` with `read_sav()` **Please refer to the help files for each of these functions for more details.** ] --- # JSON Files > _JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses **human-readable** text to store and transmit data **objects** consisting of **attribute–value pairs** and **arrays**... It is a common data format with diverse uses ... including that of web applications with servers._ --- [Wikipedia's Definition of JSON](https://en.wikipedia.org/wiki/JSON) * **object:** `{}` * **array:** `[]` * **value:** string/character, number, object, array, logical, `null` --- # JSON Files .pull-left[ ### JSON ```json { "firstName": "Mickey", "lastName": "Mouse", "address": { "city": "Mousetown", "postalCode": 10000 }, "logical": [true, false] } ``` ] .pull-right[ ### R list ```r list( firstName = "Mickey", lastName = "Mouse", address = list( city = "Mousetown", postalCode = 10000 ), logical = c(TRUE, FALSE) ) ``` ]
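--- # JSON Files: Parsing with jsonlite

The `jsonlite` package loaded at the start of the workshop converts between these two representations. A minimal sketch, using a small JSON string modeled on the previous example:

```r
library(jsonlite)

json_txt <- '{
  "firstName": "Mickey",
  "lastName": "Mouse",
  "address": {"city": "Mousetown", "postalCode": 10000},
  "logical": [true, false]
}'

# fromJSON() turns JSON text (or a file path / URL) into R objects
mickey <- fromJSON(json_txt)
str(mickey)

# toJSON() goes the other way, from R objects back to JSON text
toJSON(mickey, pretty = TRUE, auto_unbox = TRUE)
```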
--- class: inverse, center, middle # Useful Tools 🔆 --- # Packages & Functions .left[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/rvest.png" width="5%">] - `read_html()`: used to read HTML content from a URL or file and returns an object of class *xml_document*. - `html_elements()`: takes an *xml_document* object and a CSS selector or XPath expression. It returns a collection of nodes that match the selector or expression. It's useful for selecting specific elements within the HTML document for further processing. - `html_attr()`: used to extract a specific attribute from an HTML element or a collection of elements. You can use it in combination with *html_elements()* to extract a particular attribute (like href for links) from the selected elements. - `html_text()`: used to extract the text content from an HTML element or a collection of elements. It will return the text as a character vector, excluding any HTML tags. This is useful for scraping visible text content from a webpage. --- # CSS Selector & XPath .footnotesize[Both CSS selectors and XPath are used to navigate through elements in an HTML or XML document, but they have different syntax and characteristics.] .pull-left[ .footnotesize[ CSS selectors are patterns used to select elements in an HTML document. They are the same selectors used in CSS to style elements on a webpage. Here are some examples: - **Element Selector**: `"p"` selects all `<p>` elements. - **ID Selector**: `"#myID"` selects the element with `id="myID"`. - **Class Selector**: `".myClass"` selects all elements with `class="myClass"`. - **Child Selector**: `"div > p"` selects all `<p>` elements that are direct children of a `<div>` element. - **Descendant Selector**: `"div p"` selects all `<p>` elements inside a `<div>`, regardless of how deeply nested they are. ] ] .pull-right[ .footnotesize[XPath (XML Path Language) is a querying language for selecting nodes in XML or HTML documents, allowing navigation through elements and attributes. Here are some examples: - **Element Selection**: `"/html/body/div"` selects the `<div>` element inside the `<body>` element at the root of the document. - **Wildcard**: `"*"` selects all child elements of the current node. - **Attribute Selection**: `"@id"` selects the `id` attribute of the current element. - **Conditional Selection**: `"//div[@class='myClass']"` selects all `<div>` elements with an attribute `class="myClass"`. ] ]
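--- # CSS Selector & XPath: A Tiny Example

A minimal sketch (not part of the demos) showing both selector styles on a small piece of in-memory HTML; `minimal_html()` is an `rvest` helper for building little test documents:

```r
library(rvest)

html <- minimal_html('
  <div id="main">
    <p class="intro">Hello</p>
    <p>Goodbye</p>
  </div>')

# CSS selector: <p> elements with class "intro"
html %>% html_elements("p.intro") %>% html_text()

# XPath: all <p> elements inside the <div> whose id is "main"
html %>% html_elements(xpath = "//div[@id='main']/p") %>% html_text()
```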
--- # Strings - "^.*/" matches everything from the start of the string up to and including the last forward slash. + `sub("^.*/", "", "C:/Users/Tessa Chen/Documents/hello.txt")` will return "hello.txt". - "\\\\.gz$" searches for the occurrence of .gz at the very end of a given string. + `sub("\\.gz$", "", "homework.json.gz")` will return "homework.json". + `str_remove("homework.csv.zip", "\\.zip")` will return "homework.csv". - The XPath expression "//a" is used to select all anchor (`<a>`) elements in the HTML document represented by `page`. So "//p" is for selecting `<p>` elements. - `//a[starts-with(@class,'profile-listing__copy-header')]` is a conditional expression that filters the `<a>` elements to only those where the class attribute starts with the specific string <code>profile-listing__copy-header</code>. --- # Strings - Explanations .pull-left[.small[Here's a breakdown of what each symbol means in this pattern "^.*/": - `^` : This symbol matches the start of a line. - `.` : This symbol matches any single character except a newline. - `*` : This symbol matches zero or more of the preceding element (in this case, the preceding element is . which means any character). - `/` : This symbol is a literal forward slash character that we are trying to match in the input string. ] ] .pull-right[.small[ Here's a breakdown of each symbol in this pattern "\\\\.gz$": - `\\` : This is an escape character in R, which means the next character should be treated as a literal character rather than a special character. - `.` : In regular expressions, a dot normally matches any character except a newline. However, since it is preceded by an escape character here, it will literally match a dot (.) character. - `gz` : These are literal characters, so the pattern will try to match the string "gz". - `$` : This symbol matches the end of a line. ] ] --- class: inverse, center, middle # Web Scraping 🕸 --- # Demo 3: Indiegogo Datasets This example shows how we can download [Indiegogo datasets](https://webrobots.io/indiegogo-dataset/). Indiegogo is a crowdfunding platform dedicated to realizing creative projects and products. .pull-left[ <iframe src="https://webrobots.io/indiegogo-dataset/" width="100%" height="400px" data-external="1"></iframe> ] .pull-right[ <img src="data:image/png;base64,#./figures/flowchart.PNG" width="65%" style="display: block; margin: auto;" /> ] --- # Demo 3: Preparing Necessary Functions .pull-left[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_CSV.html" width="100%" height="400px" data-external="1"></iframe> ] ] .pull-right[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_JSON.html" width="100%" height="400px" data-external="1"></iframe> ] ] --- # Demo 3: Scraping Indiegogo Datasets .pull-left[ ```r # Import custom functions source("https://raw.githubusercontent.com/Ying-Ju/ying-ju.github.io/main/talks/ACR/all_functions.R") # Define the URL of the web page to be scraped url <- "https://webrobots.io/indiegogo-dataset/" # Read the HTML content of the web page using the read_html function page <- read_html(url) # Extract all hyperlinks (<a> tags) from the HTML content # The html_elements and html_attr functions are used to get the 'href' attributes, which contain the URLs links <- page %>% html_elements(xpath = "//a") %>% html_attr("href") # Find and extract all links that end with ".gz" from the list of links # These are likely links to gzipped files JSON_links <- grep("\\.gz$", links, value = TRUE) # Find and extract all links that end with ".zip" from the list of links # These are likely links to zipped files containing CSV data CSV_links <- grep("\\.zip$", links, value = TRUE) ``` ] .pull-right[ ```r # Import data using the first URL in CSV_links df1_csv <- get_CSV(CSV_links[1]) # Import data using the second URL in JSON_links df2_json <- get_JSON(JSON_links[2]) # Get all datasets using CSV_links all_csv <- lapply(1:length(CSV_links), function(x) get_CSV(CSV_links[x])) # Get all datasets using JSON_links all_json <- lapply(1:length(JSON_links), function(x) get_JSON(JSON_links[x])) ``` .footnotesize[`Note:` Since there are 88 links corresponding to 88 datasets, loading all of the datasets at once will require a large amount of memory; the lists *all_csv* and *all_json* will be large. ] ]
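--- # Demo 3: Keeping Memory Use Down

Instead of holding all 88 datasets in memory at once, one option (a sketch, not part of the original demo) is to process the files one at a time and keep only a small summary of each:

```r
# Record just the number of rows in the first three CSV archives;
# each full dataset can be discarded as soon as its summary is taken.
row_counts <- sapply(CSV_links[1:3], function(link) nrow(get_CSV(link)))
row_counts
```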
--- # Demo 4: UD Faculty Information .pull-left[ .footnotesize[In this demo, we will scrape the faculty data for the [Department of Mathematics](https://udayton.edu/directory/artssciences/mathematics/index.php) at the University of Dayton.] .center[ <iframe src="https://udayton.edu/directory/artssciences/mathematics/index.php" width="100%" height="400px" data-external="1"></iframe> ] ] .pull-right[ <img src="data:image/png;base64,#./figures/flowchart2.PNG" width="65%" style="display: block; margin: auto;" /> ] --- # Demo 4: Preparing Necessary Functions .pull-left[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_faculty.html" width="100%" height="400px" data-external="1"></iframe> ] ] .pull-right[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_individual.html" width="100%" height="400px" data-external="1"></iframe> ] ] --- # Demo 4: Scraping UD Faculty Information ```r # There are 5 pages, but the page numbering starts from 0. # Initialize a list called faculty by applying the get_faculty function to each page number (from 0 to 4). # The get_faculty function is expected to scrape faculty information from the given URL. faculty <- lapply(0:4, function(x) get_faculty(sprintf("https://udayton.edu/directory/artssciences/mathematics/index.php?page=%d", x))) # Combine the data frames contained in the faculty list into a single data frame called all_faculty. # The bind_rows function takes all the data frames in the list and binds them by rows. all_faculty <- bind_rows(faculty) # Apply the `get_individual` function to each URL. # The result will be a list of data frames containing extra information about each faculty member. df_extra <- lapply(all_faculty$all_links, function(x) get_individual(x)) # Combine the data frames in 'df_extra' into a single data frame called 'df_extra_all'. # This will bring together all the extra information for further processing. df_extra_all <- bind_rows(df_extra) # Join the original 'all_faculty' data frame with the extra information in 'df_extra_all'. final_info <- all_faculty %>% inner_join(df_extra_all, by = c("faculty" = "name")) %>% select(-all_links) ``` --- # Demo 4: Scraping Faculty Information - Result We show the first six rows of the faculty data below. ``` ## # A tibble: 6 × 5 ## faculty position degree profile research ## <chr> <chr> <chr> <chr> <chr> ## 1 Atif Abueida Professor B.Sc., UAE Un… Atif A… Graph T… ## 2 Bob Bennington Lecturer Ph.D., Electr… <NA> MTH116 ## 3 Reza Bidar Visiting Assistant Professor Ph.D., Mathem… <NA> Riemann… ## 4 Samuel Brensinger Lecturer <NA> <NA> <NA> ## 5 Jonathan Brown Associate Professor Ph.D., Dartmo… Jonath… Functio… ## 6 Richard Buckalew Visiting Assistant Professor Ph.D., Mathem… <NA> Applied… ``` --- class: inverse, center, middle # Legal and Ethical Issues with Web Scraping .footnote[ <html> <hr> </html> .left[ .large[Source: Slides 25-30 are from [Dr. Fadel Megahed's ISA 401 Scraping Webpage Slides](https://fmegahed.github.io/isa401/fall2022/class04/04_scraping_webpages.html?panelset2=q12&panelset3=activity2#38). ] ] ] --- # `Robots.txt` When scraping/crawling the web, you need to be aware of `robots.txt`. > _The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned_. --- [Wikipedia](https://en.wikipedia.org/wiki/Robots_exclusion_standard) We can use the excellent [robotstxt](https://cran.r-project.org/package=robotstxt/vignettes/using_robotstxt.html) package to check if scraping/crawling a specific directory is allowed. ```r if(require(robotstxt)==FALSE) install.packages("robotstxt") robotstxt::paths_allowed(paths = "airlines.htm", domain = "planecrashinfo.com", bot = "*") ``` ``` ## [1] TRUE ```
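--- # Checking `robots.txt` for Our Demo Sites

As an illustration (the results depend on each site's current `robots.txt`), the same check can be applied to the pages scraped earlier in this workshop:

```r
# The University of Dayton directory pages used in Demo 4
robotstxt::paths_allowed(paths = "/directory/artssciences/mathematics/index.php",
                         domain = "udayton.edu", bot = "*")

# The Web Robots page hosting the Indiegogo datasets used in Demo 3
robotstxt::paths_allowed(paths = "/indiegogo-dataset/",
                         domain = "webrobots.io", bot = "*")
```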
--- # Terms of Service Most large companies have **terms of service** that supplement what is permitted and/or disallowed on their `robots.txt` file. Examples include: - [Yelp's US Terms of Service](https://terms.yelp.com/tos/en_us/20200101_en_us/) - [LinkedIn Terms of Service](https://www.linkedin.com/legal/l/service-terms) --- count: false # Ethical/Legal Considerations - **Use of publicly available reviews as a part of your service:** Would you classify the [Yelp vs Google Feud as such an example](https://www.nytimes.com/2017/07/01/technology/yelp-google-european-union-antitrust.html)? <center> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Wow Google, congrats on a new low. Consumer searches for Yelp gets "reviews" which are Google Ads. <a href="https://t.co/gKSeOOhzWG">pic.twitter.com/gKSeOOhzWG</a></p>— Jeremy Stoppelman (@jeremys) <a href="https://twitter.com/jeremys/status/876978936177082368?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </center> --- count: false # Ethical/Legal Considerations - **Use of publicly available profiles as a part of your service:** + [LinkedIn vs hiQ Labs: Ninth Circuit Decision in 2019](https://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf) + [Revival of Case in 2021 by Supreme Court](https://techcrunch.com/2021/06/14/supreme-court-revives-linkedin-bid-to-protect-user-data-from-web-scrapers/) --- count: false # Ethical/Legal Considerations - **What about scraping entire websites/webpages for the purpose of archiving the internet?** <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#./figures/wayback_google.PNG" alt="The evolution of the home page for Google per the Wayback Machine" width="80%" /> <p class="caption">The evolution of the home page for Google per the Wayback Machine</p> </div> --- class: inverse, center, middle # Wrap-Up --- # Summary ✅ Read text files, binary files (e.g., Excel, SAS, SPSS, Stata), JSON files, etc. online using `readr`, `haven`, and `jsonlite`
✅ Scrape a webpage using `rvest`
✅ Understand when we can scrape data (i.e., `robots.txt`) --- # Thanks .pull-left[ - Please do not hesitate to contact me (Tessa Chen) if you have questions about learning R or other languages. You can email me at <a href="mailto:ychen4@udayton.edu"><i class="fa fa-paper-plane fa-fw"></i> ychen4@udayton.edu</a>. - Slides were created via the R package **xaringan**, with styling based on: * the [xaringanthemer](https://cran.r-project.org/web/packages/xaringanthemer/vignettes/xaringanthemer.html) package, and * Alison Hill's [@apreshill](https://github.com/apreshill/) CSS resources for customizing themes and fonts - The formatting of the slides is provided by Dr. Fadel M. Megahed [@fmegahed](https://github.com/fmegahed). ] .pull-right[ <img src="data:image/png;base64,#./figures/Tessa_grey_G.gif" width="60%" style="display: block; margin: auto;" /> ]