class: center, middle, inverse, title-slide .title[ # ACR Conference ] .subtitle[ ## Social-Data Scraping in R
] .author[ ###
Ying-Ju Tessa Chen, PhD
Associate Professor
Department of Mathematics
University of Dayton
@ying-ju
ying-ju
ychen4@udayton.edu
] .date[ ### September 15, 2023 ] --- name: ninja class: middle, inverse # We assume: -- ### <img src="data:image/png;base64,#figures/r_icon.png" width="5%" style="float:left"/> you know R -- ### <img src="data:image/png;base64,#https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/RStudio.png" width="5%" style="float:left"/> you know RStudio -- ### <img src="data:image/png;base64,#figures/data_icon.png" width="5%" style="float:left"/> you know some basic data file formats -- ### <img src="data:image/png;base64,#figures/web_crawler.png" width="5%" style="float:left"/> you want to scrape data from a real website --- name: novice class: middle, inverse # What you may not know: -- ### <img src="data:image/png;base64,#figures/packages.png" width="5%" style="float:left"/> some R packages we plan to use today -- ###
web technology --- # Learning Objectives - Read text files, binary files (e.g., Excel, SAS, SPSS, Stata), JSON files, etc. online using `readr`, `haven`, and `jsonlite`
- Scrape a webpage using `rvest`
- Understand when we can scrape data (i.e., `robots.txt`) --- In this workshop, we will need the following packages: .small[ - jsonlite: A package designed to read, write, and manipulate JSON data seamlessly in R. - httr: A tool for working with HTTP connections, simplifying the process of sending and receiving HTTP requests and responses in R. - pacman: A package management tool that streamlines the installation, loading, and maintenance of R packages. - R.utils: A collection of utility functions that facilitate various programming tasks in R, enhancing efficiency and productivity. - rvest: A web scraping package that enables the easy extraction of information from websites directly into R for data analysis. - tidyverse: A cohesive collection of R packages that adhere to a common design philosophy, grammar, and data structures, providing a comprehensive toolkit for data science in R.] ```r install.packages(c("jsonlite", "httr", "pacman", "R.utils", "rvest", "tidyverse")) pacman::p_load(jsonlite, httr, R.utils, rvest, tidyverse) ``` .footnote[.red[The code included can be copied by hovering over the top-right corner of each code chunk.]] --- class: inverse middle # Importing Data ⬇️ --- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/readr.png" width="60%">] ] .right-column[ # Reading Plain-Text Rectangular Files
## .small[(a.k.a. flat or spreadsheet-like files)] * delimited text files with `read_delim()` + `.csv`: comma (",") separated values with `read_csv()` + `.csv`: semicolon (“;”) separated values with `read_csv2()` + `.tsv`: tab ("\t") separated values with `read_tsv()` * `.fwf`: fixed width files with `read_fwf()` <hr> ]
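--- # A Quick Look at Delimiters

The toy strings below are not part of the demos that follow; they are a minimal sketch showing that these `readr` helpers differ mainly in the delimiter (and decimal mark) they expect. With real data you would pass a file path or URL instead of a literal string wrapped in `I()`.

```r
library(readr)

read_csv(I("x,y\n1,2\n3,4"))              # comma-separated values
read_csv2(I("x;y\n1,5;2,5"))              # semicolons, with "," as the decimal mark
read_tsv(I("x\ty\n1\t2"))                 # tab-separated values
read_delim(I("x|y\n1|2"), delim = "|")    # any other single-character delimiter
```

Each call returns a tibble, just as it would when reading from a file or a URL.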
--- # Demos 1 & 2: Reading CSV Data on the Web In these hands-on demos, you will learn how to import: * files that are hosted on the web. + **Data in Webpages
:** - **FRED Data:** e.g., [Unemployment Rate (UNRATE)](https://fred.stlouisfed.org/series/UNRATE) + **GitHub**
Repositories, e.g., - [Bank Marketing](https://github.com/selva86/datasets/blob/master/bank-full.csv) - focusing on `bank-full.csv` - Original Source: [UCI Machine Learning](https://archive.ics.uci.edu/dataset/222/bank+marketing) --- # Demo 1: FRED Data .pull-left-2[ <img src="data:image/png;base64,#./figures/unrate.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-2[ ### R Code ```r # Import data using the read_csv() function unrate <- read_csv("https://fred.stlouisfed.org/graph/fredgraph.csv?bgcolor=%23e1e9f0&chart_type=line&drp=0&fo=open%20sans&graph_bgcolor=%23ffffff&height=450&mode=fred&recession_bars=on&txtcolor=%23444444&ts=12&tts=12&width=1318&nt=0&thu=0&trc=0&show_legend=yes&show_axis_titles=yes&show_tooltip=yes&id=UNRATE&scale=left&cosd=1948-01-01&coed=2023-07-01&line_color=%234572a7&link_values=false&line_style=solid&mark_type=none&mw=3&lw=2&ost=-99999&oet=99999&mma=0&fml=a&fq=Monthly&fam=avg&fgst=lin&fgsnd=2020-02-01&line_index=1&transformation=lin&vintage_date=2023-08-09&revision_date=2023-08-09&nd=1948-01-01") # Check out the first 6 rows of the data head(unrate) ``` ] --- # Demo 2: Bank Marketing Data .pull-left-2[ <img src="data:image/png;base64,#./figures/bankdata.png" width="100%" style="display: block; margin: auto;" /> ] .pull-right-2[ ### R Code Since we have already loaded the `readr` package (attached as part of the `tidyverse`), we can use the function `read_csv2()` directly. ```r # Import the Bank Marketing Data bank <- read_csv2("https://raw.githubusercontent.com/selva86/datasets/master/bank-full.csv") # Check the first 6 rows and the structure of the data head(bank) glimpse(bank) ``` ]
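--- # Reading Excel Files

The learning objectives also mention Excel workbooks. These are handled by the `readxl` package, which is installed with the tidyverse but not attached by `library(tidyverse)`. Below is a minimal sketch using a spreadsheet that ships with `readxl` itself; for a workbook hosted on the web, you would first download it to a temporary file with `download.file()`, because `read_excel()` does not read URLs directly.

```r
library(readxl)

# A small example workbook bundled with the readxl package
path <- readxl_example("datasets.xlsx")

# List the sheets, then read the first one
excel_sheets(path)
read_excel(path, sheet = excel_sheets(path)[1])
```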
--- .left-column[ .center[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/haven.png" width="60%">] ] .right-column[ # Reading Proprietary Binary Files Several functions from the [haven](https://haven.tidyverse.org/) package can be used to read and write formats used by other statistical packages. Example functions include: - SAS + `.sas7bdat` with `read_sas()` - Stata + `.dta` with `read_dta()` - SPSS + `.sav` with `read_sav()` **Please refer to the help files for each of these functions for more details.** ] --- # JSON Files > _JSON (JavaScript Object Notation) is an open standard file format and data interchange format that uses **human-readable** text to store and transmit data **objects** consisting of **attribute–value pairs** and **arrays**... It is a common data format with diverse uses ... including that of web applications with servers._ --- [Wikipedia's Definition of JSON](https://en.wikipedia.org/wiki/JSON) * **object:** `{}` * **array:** `[]` * **value:** string/character, number, object, array, logical, `null` --- # JSON Files .pull-left[ ### JSON ```json { "firstName": "Mickey", "lastName": "Mouse", "address": { "city": "Mousetown", "postalCode": 10000 }, "logical": [true, false] } ``` ] .pull-right[ ### R list ```r list( firstName = "Mickey", lastName = "Mouse", address = list( city = "Mousetown", postalCode = 10000 ), logical = c(TRUE, FALSE) ) ``` ]
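--- # JSON Files: Parsing with jsonlite

The `jsonlite` package loaded at the start of the workshop converts between these two representations. A minimal sketch, using a small JSON string modeled on the previous example:

```r
library(jsonlite)

json_txt <- '{
  "firstName": "Mickey",
  "lastName": "Mouse",
  "address": {"city": "Mousetown", "postalCode": 10000},
  "logical": [true, false]
}'

# fromJSON() turns JSON text (or a file path / URL) into R objects
mickey <- fromJSON(json_txt)
str(mickey)

# toJSON() goes the other way, from R objects back to JSON text
toJSON(mickey, pretty = TRUE, auto_unbox = TRUE)
```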
--- class: inverse, center, middle # Useful Tools 🔆 --- # Packages & Functions .left[<img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/rvest.png" width="5%">] - `read_html()`: used to read HTML content from a URL or file and returns an object of class *xml_document*. - `html_elements()`: takes an *xml_document* object and a CSS selector or XPath expression. It returns a collection of nodes that match the selector or expression. It's useful for selecting specific elements within the HTML document for further processing. - `html_attr()`: used to extract a specific attribute from an HTML element or a collection of elements. You can use it in combination with *html_elements()* to extract a particular attribute (like href for links) from the selected elements. - `html_text()`: used to extract the text content from an HTML element or a collection of elements. It will return the text as a character vector, excluding any HTML tags. This is useful for scraping visible text content from a webpage. --- # CSS Selector & XPath .footnotesize[Both CSS selectors and XPath are used to navigate through elements in an HTML or XML document, but they have different syntax and characteristics.] .pull-left[ .footnotesize[ CSS selectors are patterns used to select elements in an HTML document. They are the same selectors used in CSS to style elements on a webpage. Here are some examples: - **Element Selector**: `"p"` selects all `<p>` elements. - **ID Selector**: `"#myID"` selects the element with `id="myID"`. - **Class Selector**: `".myClass"` selects all elements with `class="myClass"`. - **Child Selector**: `"div > p"` selects all `<p>` elements that are direct children of a `<div>` element. - **Descendant Selector**: `"div p"` selects all `<p>` elements inside a `<div>`, regardless of how deeply nested they are. ] ] .pull-right[ .footnotesize[XPath (XML Path Language) is a querying language for selecting nodes in XML or HTML documents, allowing navigation through elements and attributes. Here are some examples: - **Element Selection**: `"/html/body/div"` selects the `<div>` element inside the `<body>` element at the root of the document. - **Wildcard**: `"*"` selects all child elements of the current node. - **Attribute Selection**: `"@id"` selects the `id` attribute of the current element. - **Conditional Selection**: `"//div[@class='myClass']"` selects all `<div>` elements with an attribute `class="myClass"`. ] ]
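--- # CSS Selector & XPath: A Tiny Example

A minimal sketch (not part of the demos) showing both selector styles on a small piece of in-memory HTML; `minimal_html()` is an `rvest` helper for building little test documents:

```r
library(rvest)

html <- minimal_html('
  <div id="main">
    <p class="intro">Hello</p>
    <p>Goodbye</p>
  </div>')

# CSS selector: <p> elements with class "intro"
html %>% html_elements("p.intro") %>% html_text()

# XPath: all <p> elements inside the <div> whose id is "main"
html %>% html_elements(xpath = "//div[@id='main']/p") %>% html_text()
```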
--- # Strings - "^.*/" matches everything from the start of the string up to and including the last forward slash. + `sub("^.*/", "", "C:/Users/Tessa Chen/Documents/hello.txt")` will return "hello.txt". - "\\\\.gz$" searches for the occurrence of .gz at the very end of a given string. + `sub("\\.gz$", "", "homework.json.gz")` will return "homework.json". + `str_remove("homework.csv.zip", "\\.zip")` will return "homework.csv". - The XPath expression "//a" is used to select all anchor (`<a>`) elements in the HTML document represented by `page`. So "//p" is for selecting `<p>` elements. - `//a[starts-with(@class,'profile-listing__copy-header')]` is a conditional expression that filters the `<a>` elements to only those where the class attribute starts with the specific string <code>profile-listing__copy-header</code>. --- # Strings - Explanations .pull-left[.small[Here's a breakdown of what each symbol means in this pattern "^.*/": - `^` : This symbol matches the start of a line. - `.` : This symbol matches any single character except a newline. - `*` : This symbol matches zero or more of the preceding element (in this case, the preceding element is . which means any character). - `/` : This symbol is a literal forward slash character that we are trying to match in the input string. ] ] .pull-right[.small[ Here's a breakdown of each symbol in this pattern "\\\\.gz$": - `\\` : This is an escape character in R, which means the next character should be treated as a literal character rather than a special character. - `.` : In regular expressions, a dot normally matches any character except a newline. However, since it is preceded by an escape character here, it will literally match a dot (.) character. - `gz` : These are literal characters, so the pattern will try to match the string "gz". - `$` : This symbol matches the end of a line. ] ] --- class: inverse, center, middle # Web Scraping 🕸 --- # Demo 3: Indiegogo Datasets This example shows how we can download [Indiegogo datasets](https://webrobots.io/indiegogo-dataset/). Indiegogo is a crowdfunding platform dedicated to realizing creative projects and products. .pull-left[ <iframe src="https://webrobots.io/indiegogo-dataset/" width="100%" height="400px" data-external="1"></iframe> ] .pull-right[ <img src="data:image/png;base64,#./figures/flowchart.PNG" width="65%" style="display: block; margin: auto;" /> ] --- # Demo 3: Preparing Necessary Functions .pull-left[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_CSV.html" width="100%" height="400px" data-external="1"></iframe> ] ] .pull-right[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_JSON.html" width="100%" height="400px" data-external="1"></iframe> ] ] --- # Demo 3: Scraping Indiegogo Datasets .pull-left[ ```r # Import custom functions source("https://raw.githubusercontent.com/Ying-Ju/ying-ju.github.io/main/talks/ACR/all_functions.R") # Define the URL of the web page to be scraped url <- "https://webrobots.io/indiegogo-dataset/" # Read the HTML content of the web page using the read_html function page <- read_html(url) # Extract all hyperlinks (<a> tags) from the HTML content # The html_elements and html_attr functions are used to get the 'href' attributes, which contain the URLs links <- page %>% html_elements(xpath = "//a") %>% html_attr("href") # Find and extract all links that end with ".gz" from the list of links # These are likely links to gzipped files JSON_links <- grep("\\.gz$", links, value = TRUE) # Find and extract all links that end with ".zip" from the list of links # These are likely links to zipped files containing CSV data CSV_links <- grep("\\.zip$", links, value = TRUE) ``` ] .pull-right[ ```r # Import data using the first URL in CSV_links df1_csv <- get_CSV(CSV_links[1]) # Import data using the second URL in JSON_links df2_json <- get_JSON(JSON_links[2]) # Get all datasets using CSV_links all_csv <- lapply(1:length(CSV_links), function(x) get_CSV(CSV_links[x])) # Get all datasets using JSON_links all_json <- lapply(1:length(JSON_links), function(x) get_JSON(JSON_links[x])) ``` .footnotesize[`Note:` Since there are 88 links corresponding to 88 datasets, loading all of the datasets at once will require a large amount of memory; the lists *all_csv* and *all_json* will be large. ] ]
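--- # Demo 3: Keeping Memory Use Down

Instead of holding all 88 datasets in memory at once, one option (a sketch, not part of the original demo) is to process the files one at a time and keep only a small summary of each:

```r
# Record just the number of rows in the first three CSV archives;
# each full dataset can be discarded as soon as its summary is taken.
row_counts <- sapply(CSV_links[1:3], function(link) nrow(get_CSV(link)))
row_counts
```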
--- # Demo 4: UD Faculty Information .pull-left[ .footnotesize[In this demo, we will scrape the faculty data for the [Department of Mathematics](https://udayton.edu/directory/artssciences/mathematics/index.php) at the University of Dayton.] .center[ <iframe src="https://udayton.edu/directory/artssciences/mathematics/index.php" width="100%" height="400px" data-external="1"></iframe> ] ] .pull-right[ <img src="data:image/png;base64,#./figures/flowchart2.PNG" width="65%" style="display: block; margin: auto;" /> ] --- # Demo 4: Preparing Necessary Functions .pull-left[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_faculty.html" width="100%" height="400px" data-external="1"></iframe> ] ] .pull-right[ .center[ <iframe src="https://ying-ju.github.io/talks/ACR/get_individual.html" width="100%" height="400px" data-external="1"></iframe> ] ] --- # Demo 4: Scraping UD Faculty Information ```r # There are 5 pages, but the page numbering starts from 0. # Initialize a list called faculty by applying the get_faculty function to each page number (from 0 to 4). # The get_faculty function is expected to scrape faculty information from the given URL. faculty <- lapply(0:4, function(x) get_faculty(sprintf("https://udayton.edu/directory/artssciences/mathematics/index.php?page=%d", x))) # Combine the data frames contained in the faculty list into a single data frame called all_faculty. # The bind_rows function takes all the data frames in the list and binds them by rows. all_faculty <- bind_rows(faculty) # Apply the `get_individual` function to each URL. # The result will be a list of data frames containing extra information about each faculty member. df_extra <- lapply(all_faculty$all_links, function(x) get_individual(x)) # Combine the data frames in 'df_extra' into a single data frame called 'df_extra_all'. # This will bring together all the extra information for further processing. df_extra_all <- bind_rows(df_extra) # Join the original 'all_faculty' data frame with the extra information in 'df_extra_all'. final_info <- all_faculty %>% inner_join(df_extra_all, by = c("faculty" = "name")) %>% select(-all_links) ``` --- # Demo 4: Scraping Faculty Information - Result We show the first six rows of the faculty data below. ``` ## # A tibble: 6 × 5 ## faculty position degree profile research ## <chr> <chr> <chr> <chr> <chr> ## 1 Atif Abueida Professor B.Sc., UAE Un… Atif A… Graph T… ## 2 Bob Bennington Lecturer Ph.D., Electr… <NA> MTH116 ## 3 Reza Bidar Visiting Assistant Professor Ph.D., Mathem… <NA> Riemann… ## 4 Samuel Brensinger Lecturer <NA> <NA> <NA> ## 5 Jonathan Brown Associate Professor Ph.D., Dartmo… Jonath… Functio… ## 6 Richard Buckalew Visiting Assistant Professor Ph.D., Mathem… <NA> Applied… ``` --- class: inverse, center, middle # Legal and Ethical Issues with Web Scraping .footnote[ <html> <hr> </html> .left[ .large[Source: Slides 25-30 are from [Dr. Fadel Megahed's ISA 401 Scraping Webpage Slides](https://fmegahed.github.io/isa401/fall2022/class04/04_scraping_webpages.html?panelset2=q12&panelset3=activity2#38). ] ] ] --- # `Robots.txt` When scraping/crawling the web, you need to be aware of `robots.txt`. > _The robots exclusion standard, also known as the robots exclusion protocol or simply robots.txt, is a standard used by websites to communicate with web crawlers and other web robots. The standard specifies how to inform the web robot about which areas of the website should not be processed or scanned_. --- [Wikipedia](https://en.wikipedia.org/wiki/Robots_exclusion_standard) We can use the excellent [robotstxt](https://cran.r-project.org/package=robotstxt/vignettes/using_robotstxt.html) package to check if scraping/crawling a specific directory is allowed. ```r if(require(robotstxt)==FALSE) install.packages("robotstxt") robotstxt::paths_allowed(paths = "airlines.htm", domain = "planecrashinfo.com", bot = "*") ``` ``` ## [1] TRUE ```
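--- # Checking `robots.txt` for Our Demo Sites

As an illustration (the results depend on each site's current `robots.txt`), the same check can be applied to the pages scraped earlier in this workshop:

```r
# The University of Dayton directory pages used in Demo 4
robotstxt::paths_allowed(paths = "/directory/artssciences/mathematics/index.php",
                         domain = "udayton.edu", bot = "*")

# The Web Robots page hosting the Indiegogo datasets used in Demo 3
robotstxt::paths_allowed(paths = "/indiegogo-dataset/",
                         domain = "webrobots.io", bot = "*")
```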
--- # Terms of Service Most large companies have **terms of service** that supplement what is permitted and/or disallowed on their `robots.txt` file. Examples include: - [Yelp's US Terms of Service](https://terms.yelp.com/tos/en_us/20200101_en_us/) - [LinkedIn Terms of Service](https://www.linkedin.com/legal/l/service-terms) --- count: false # Ethical/Legal Considerations - **Use of publicly available reviews as a part of your service:** Would you classify the [Yelp vs Google Feud as such an example](https://www.nytimes.com/2017/07/01/technology/yelp-google-european-union-antitrust.html)? <center> <blockquote class="twitter-tweet"><p lang="en" dir="ltr">Wow Google, congrats on a new low. Consumer searches for Yelp gets "reviews" which are Google Ads. <a href="https://t.co/gKSeOOhzWG">pic.twitter.com/gKSeOOhzWG</a></p>— Jeremy Stoppelman (@jeremys) <a href="https://twitter.com/jeremys/status/876978936177082368?ref_src=twsrc%5Etfw">June 20, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> </center> --- count: false # Ethical/Legal Considerations - **Use of publicly available profiles as a part of your service:** + [LinkedIn vs hiQ Labs: Ninth Circuit Decision in 2019](https://cdn.ca9.uscourts.gov/datastore/opinions/2019/09/09/17-16783.pdf) + [Revival of Case in 2021 by Supreme Court](https://techcrunch.com/2021/06/14/supreme-court-revives-linkedin-bid-to-protect-user-data-from-web-scrapers/) --- count: false # Ethical/Legal Considerations - **What about scraping entire websites/webpages for the purpose of archiving the internet?** <div class="figure" style="text-align: center"> <img src="data:image/png;base64,#./figures/wayback_google.PNG" alt="The evolution of the home page for Google per the Wayback Machine" width="80%" /> <p class="caption">The evolution of the home page for Google per the Wayback Machine</p> </div> --- class: inverse, center, middle # Wrap-Up --- # Summary ✅ Read text files, binary files (e.g., Excel, SAS, SPSS, Stata), JSON files, etc. online using `readr`, `haven`, and `jsonlite`
✅ Scrape a webpage using `rvest`
✅ Understand when we can scrape data (i.e., `robots.txt`) --- # Thanks .pull-left[ - Please do not hesitate to contact me (Tessa Chen) if you have questions about learning R or other languages. You can email me at <a href="mailto:ychen4@udayton.edu"><i class="fa fa-paper-plane fa-fw"></i> ychen4@udayton.edu</a>. - Slides were created via the R package **xaringan**, with styling based on: * the [xaringanthemer](https://cran.r-project.org/web/packages/xaringanthemer/vignettes/xaringanthemer.html) package, and * Alison Hill's [@apreshill](https://github.com/apreshill/) CSS resources for customizing themes and fonts - The formatting of the slides is provided by Dr. Fadel M. Megahed [@fmegahed](https://github.com/fmegahed). ] .pull-right[ <img src="data:image/png;base64,#./figures/Tessa_grey_G.gif" width="60%" style="display: block; margin: auto;" /> ]