Brief Overview

In this session, we will use the Black Friday data available on Kaggle to learn how to make the graphical displays listed below.

library(tidyverse)
Friday <- read_csv("C:/Users/tessa/Dropbox/MTH 209/Class Handouts/Data/Black_Friday.csv")

Categorical Data
- Bar Chart
- Pie Chart
Quantitative Data
- Histogram
- Boxplot
- Scatterplot
- Line Plot

Here is a list of common arguments:

col: a vector of colors
main: title for the plot
xlim or ylim: limits for the x or y axis
xlab or ylab: a label for the x or y axis
font: font used for text; 1 = plain, 2 = bold, 3 = italic, 4 = bold italic
font.axis: font used for axis tick labels
cex.axis: font size for axis tick labels
font.lab: font for x and y labels
cex.lab: font size for x and y labels
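
As a minimal sketch of how these arguments combine, the call below applies several of them to a histogram. The vector x and its seed are made up for illustration and are not part of the Black Friday data.

```r
# Hypothetical data: 200 simulated values, used only to show the arguments
set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)

hist(x,
     col = "lightblue",          # fill color of the bars
     main = "Simulated Values",  # plot title
     xlim = c(0, 100),           # limits of the x axis
     xlab = "Value",             # x-axis label
     ylab = "Frequency",         # y-axis label
     font.lab = 2,               # bold axis labels
     cex.lab = 1.25,             # enlarge axis labels by 25%
     font.axis = 3,              # italic tick labels
     cex.axis = 0.9)             # shrink tick labels slightly
```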

Understand Your Data

To understand customer purchase behavior across products in different categories, the retail company “ABC Private Limited”, in the United Kingdom, shared a purchase summary of various customers for selected high-volume products from last month. The data contain the following variables.

User_ID: User ID
Product_ID: Product ID
Gender: Sex of User
Age: Age in bins
Occupation: Occupation (Masked)
City_Category: Category of the City (A,B,C)
Stay_In_Current_City_Years: Number of years the user has stayed in the current city
Marital_Status: Marital Status
Product_Category_1: Product Category (Masked)
Product_Category_2: Product may also belong to another category (Masked)
Product_Category_3: Product may also belong to another category (Masked)
Purchase: Purchase Amount

Get a Glimpse of the Data

We can use the glimpse() function to get a glimpse of the data.

glimpse(Friday)

## Rows: 550,068
## Columns: 12
## $ User_ID                    <dbl> 1000001, 1000001, 1000001, 1000001, 1000002…
## $ Product_ID                 <chr> "P00069042", "P00248942", "P00087842", "P00…
## $ Gender                     <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M"…
## $ Age                        <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-…
## $ Occupation                 <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20…
## $ City_Category              <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B"…
## $ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2…
## $ Marital_Status             <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0…
## $ Product_Category_1         <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1,…
## $ Product_Category_2         <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA…
## $ Product_Category_3         <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA,…
## $ Purchase                   <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215…
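
The output above shows NA values in Product_Category_2 and Product_Category_3. A quick way to see how many values are missing per column is sketched below. The data frame toy is hypothetical; it copies only the first five values of three columns from the output above so that the example runs without the CSV file.

```r
# Hypothetical toy data frame built from the first five rows shown above
toy <- data.frame(
  Product_Category_1 = c(3, 1, 12, 12, 8),
  Product_Category_2 = c(NA, 6, NA, 14, NA),
  Purchase           = c(8370, 15200, 1422, 1057, 7969)
)

str(toy)                                # base R counterpart of glimpse()
missing_counts <- colSums(is.na(toy))   # number of NA values per column
print(missing_counts)
```

With the full data, the same idea would be colSums(is.na(Friday)).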

Bar Chart - 1

A bar chart is a graphical display that works well for a general audience. Here, we study the distribution of age groups among the company’s customers who purchased products on Black Friday.

Usage: barplot(height, …)

barplot(table(Friday$Age))

Bar Chart - 2

We can draw horizontal bars by setting horiz = TRUE. The argument col assigns a color to the bars, and the argument main sets the title of the figure.

par(mgp=c(4,1,0)) # change the margin line for the axis title, axis labels and axis line
par(mar=c(5,7,4,2)) # set margin of the figure 
barplot(table(Friday$Age), col = "lightblue", main = "Distribution of Customers' Age Group", horiz = TRUE, xlab = "Number of Customers",
        ylab = "Age Group", las=1)

Note: The margins of a figure can be set using the mar argument of the par() function. The order of the values is c(bottom, left, top, right), measured in lines of text.
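
Settings made with par() stay in effect for later plots on the same device. One common pattern, sketched here with made-up counts rather than the Black Friday data, is to save the old settings and restore them afterwards.

```r
# par() returns the previous values of the settings it changes
old_par <- par(mar = c(5, 7, 4, 2), mgp = c(4, 1, 0))

# Hypothetical counts, used only so the example runs on its own
counts <- c("0-17" = 15, "18-25" = 100, "26-35" = 220)
barplot(counts, horiz = TRUE, las = 1,
        xlab = "Number of Customers", ylab = "Age Group")

par(old_par)  # restore the earlier margins for later plots
```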

Bar Chart - 3

We can also specify colors with hexadecimal RGB color codes.

barplot(table(Friday$Age), col = "#69b3a2", main = "Distribution of Customers' Age Group", xlab = "Age Group", ylab ="Number of Customers")
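
A hexadecimal code packs the red, green, and blue components into pairs of hex digits. The short sketch below uses the base R functions rgb() and col2rgb() to convert between the two forms for the color used above.

```r
# Build the same color from decimal red, green, and blue components
teal <- rgb(105, 179, 162, maxColorValue = 255)
print(teal)

# col2rgb() goes the other way, from a color string to its components
components <- col2rgb("#69b3a2")
print(components)
```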

Pie Chart - 1

Similarly, we can use a pie chart to study the distribution of the city category.

Usage: pie(x, …)

pie(table(Friday$City_Category), main = "Distribution of City Category", col = c("#264e70", "#679186", "#bbd4ce"))

Pie Chart - 2

The following code chunk shows a more advanced setting: it labels each slice with its percentage and adds a legend.

H <- table(Friday$City_Category)
percent <- round(100*H/sum(H), 1) # calculate percentages
pie_labels <- paste(percent, "%", sep="") # include %
pie(H, main = "Distribution of City Category", labels = pie_labels, col = c("#54d2d2", "#ffcb00", "#f8aa4b"))
legend("topright", c("A","B","C"), cex = 0.8, fill = c("#54d2d2", "#ffcb00", "#f8aa4b"))

Tip: Use color palette to choose colors (Google search: color scheme generator).
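
R also ships with built-in palettes, which avoids typing hex codes by hand. The sketch below assumes R 3.6.0 or later (where hcl.colors() is available) and uses made-up counts in place of table(Friday$City_Category).

```r
# Three colors from the built-in "Viridis" palette (needs R 3.6.0 or later)
city_colors <- hcl.colors(3, palette = "Viridis")
print(city_colors)

# Hypothetical counts standing in for table(Friday$City_Category)
H <- c(A = 30, B = 45, C = 25)
pie(H, main = "Distribution of City Category", col = city_colors)
legend("topright", legend = names(H), cex = 0.8, fill = city_colors)
```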

Histogram

A histogram is used when we want to study the distribution of a quantitative variable. Here we study the distribution of the customer purchase amount.

Usage: hist(x, …)

hist(Friday$Purchase, main = "Distribution of Customer Purchase Amount", xlab = "Purchase Amount (British Pounds)")

The histogram shows that the distribution of the purchase amount is multimodal.
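
Whether several modes are visible depends on the number of bins. The sketch below uses simulated bimodal data (not the Black Friday data) to show how the breaks argument of hist() changes the picture; a single number passed to breaks is treated as a suggestion, so the actual number of bins may differ.

```r
set.seed(2)
# Hypothetical bimodal data: a mixture of two normal distributions
x <- c(rnorm(500, mean = 6000, sd = 800),
       rnorm(500, mean = 15000, sd = 1500))

op <- par(mfrow = c(1, 2))   # two plots side by side
h_coarse <- hist(x, breaks = 5,  main = "About 5 bins",  xlab = "Amount")
h_fine   <- hist(x, breaks = 30, main = "About 30 bins", xlab = "Amount")
par(op)                      # back to one plot per figure
```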

Boxplot - 1

Here, we talk about another graphical display that can be used to study the distribution of a quantitative variable: box and whisker plot (boxplot).

Usage: boxplot(x, …) or boxplot(formula, …)

boxplot(Friday$Purchase, xlab="Purchase Amount", ylab="British Pounds")
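
The numbers behind a boxplot can be inspected directly. The sketch below uses a small made-up vector: the first seven values are the purchase amounts shown in the glimpse() output, and the last value is an invented extreme added to produce an outlier.

```r
# Hypothetical purchase amounts; the last value is deliberately extreme
purchase <- c(1057, 1422, 7969, 8370, 15200, 15227, 19215, 60000)

fivenum(purchase)                  # min, lower hinge, median, upper hinge, max
stats <- boxplot.stats(purchase)
print(stats$stats)                 # whisker ends, hinges, and median as drawn
print(stats$out)                   # points plotted individually as outliers
```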

Boxplot - 2

In general, side-by-side boxplots are used when we want to compare the distribution of a quantitative variable across several groups. In the following, we study the distribution of the customer purchase amount among different age groups.

In order to know how this can be done, we need to know how to define a formula in R.

Usage: A ~ B

A: response variable
B: explanatory variables

boxplot(Friday$Purchase ~ Friday$Age)
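
A formula is an ordinary R object, so it can be stored and inspected before it is passed to a plotting function. The sketch below uses a hypothetical data frame toy, and it passes the data frame through the data argument of boxplot() so that the column names can be written without the Friday$ prefix.

```r
# A formula can be stored in a variable like any other object
f <- Purchase ~ Age
print(class(f))      # the class of a formula object
print(all.vars(f))   # variable names, response first

# Hypothetical toy data; the formula looks up its variables in `data`
toy <- data.frame(
  Purchase = c(8370, 15200, 1422, 7969, 15227, 19215),
  Age      = c("0-17", "0-17", "0-17", "55+", "26-35", "26-35")
)
boxplot(f, data = toy)
```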

Boxplot - 3

We can improve the quality of the boxplot by changing a few of the default settings.

boxplot(Friday$Purchase ~ Friday$Age, main = "Distribution of Purchase by Age Group", 
        xlab = "Customer Age Group", ylab = "Purchase Amount", cex.lab = 1.25, cex.axis = 1.25, col = "#54d2d2")

Boxplot - 4

We can use the argument data to indicate that the variables in the formula come from a given data frame.

boxplot(Purchase ~ Gender + Marital_Status, data = Friday, main="Distribution of Purchase by Sex and Marital_Status", 
        xlab="Sex and Marital Status", ylab="Purchase",  cex.lab=1.25, cex.axis=1.25, 
        names = c("Female & Single", "Male & Single", "Female & Married", "Male & Married"))
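
The names vector must follow the order in which R combines the two grouping variables. One way to check that order is sketched below with a hypothetical four-row data frame; it assumes, as the labels above imply, that Marital_Status is coded 0 for single and 1 for married.

```r
# Hypothetical toy data covering all four sex and marital status combinations
toy <- data.frame(
  Gender         = c("F", "M", "F", "M"),
  Marital_Status = c(0, 0, 1, 1)
)

# interaction() shows the order of the combined groups:
# the first variable (Gender) changes fastest
group_order <- levels(interaction(toy$Gender, toy$Marital_Status))
print(group_order)
```

The first variable in the formula varies fastest, which is why the labels alternate between Female and Male.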

Scatterplot - 1

When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn’t have a second quantitative variable, for demonstration purposes we create a variable Estimated_age by generating a random age for each customer based on their age group.

# create a function to generate a random integer age within an age group
random_age <- function(age_group){
  if (!str_detect(age_group, "\\+")){
    # bounded group such as "26-35": sample between the two endpoints
    age_range <- as.numeric(unlist(strsplit(age_group, "-")))
    age <- sample(age_range[1]:age_range[2], 1)
  }else{
    # open-ended group such as "55+": sample from the lower bound up to 100
    age_left <- as.numeric(gsub("\\+", "", age_group))
    age <- sample(age_left:100, 1)
  }
  return(age)
}

# create a new column in the data 
Friday$Estimated_age <- rep(NA, nrow(Friday))

# apply the function to each customer's age group
for (i in 1:nrow(Friday)){
  Friday$Estimated_age[i] <- random_age(Friday$Age[i])
}

Note: Since we haven’t yet talked about how to create functions or how to use loops and conditional statements, it is okay to skip this part and focus on the creation of figures at this point.

Scatterplot - 2

Next, we study the relationship between the customer purchase amount and the customer’s estimated age, using a random sample of 50 customers.

index <- sample(1:nrow(Friday), 50)
# create the scatterplot
plot(Purchase ~ Estimated_age, data=Friday[index,], xlab="Customer's Estimated Age", ylab="Customer Purchase Amount")
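
Because sample() draws at random, the scatterplot changes every time the code is run. If a repeatable figure is wanted, a seed can be set first. The sketch below shows the idea on the integers 1 to 1000 rather than on the rows of Friday; the seed value 209 is an arbitrary choice.

```r
# Without a seed, each run draws a different sample
set.seed(209)                       # 209 is an arbitrary choice
first_draw <- sample(1:1000, 5)

set.seed(209)                       # resetting the seed repeats the draw
second_draw <- sample(1:1000, 5)

print(identical(first_draw, second_draw))
```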

Line Plot - 1

When we want to show how a quantitative variable changes over a period of time, a line plot can be used. Line plots can also be used to compare changes over the same period of time for several groups.
In order to draw a line plot, we need to know two additional arguments:

- type: "p" to draw only points; "l" to draw only lines; "o" to draw both points and lines (overplotted)
- lty: line type. 0 = blank, 1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash
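
The effect of these two arguments is easiest to see side by side. The sketch below plots the same made-up values three times with different type and lty settings.

```r
x <- 1:10
y <- c(3, 5, 4, 6, 8, 7, 9, 8, 10, 12)   # hypothetical values

op <- par(mfrow = c(1, 3))               # three plots side by side
plot(x, y, type = "p", main = 'type = "p"')
plot(x, y, type = "l", lty = 2, main = 'type = "l", lty = 2')
plot(x, y, type = "o", lty = 4, main = 'type = "o", lty = 4')
par(op)                                  # back to one plot per figure
```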

Line Plot - 2

Since the Black Friday data are not time series data, a line plot is not appropriate for them. In the following code chunk, we instead create a data frame from the forecasted daily high temperatures for July 13 to July 22 (The Weather Channel).

Date <- 13:22
Dayton_OH <- c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91)
Houston_TX <- c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91)
Denver_CO <- c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96)
Fargo_ND <- c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
df <- data.frame(Date, Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)

plot(Date, Dayton_OH, type="o", col="blue", xlab="Date in July", ylab="Highest Temperature", ylim=c(80, 100))
lines(Date, Houston_TX, type="o", col="red")
lines(Date, Denver_CO, type="o", col="purple")
lines(Date, Fargo_ND, type="o", col="darkgreen")
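
With four lines on one plot, a legend is needed to tell the cities apart. The sketch below repeats the temperature data so that it runs on its own, then adds a legend with legend(). The wider ylim and the legend position are choices made here to keep the legend clear of the lines, not part of the original figure.

```r
Date <- 13:22
Dayton_OH  <- c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91)
Houston_TX <- c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91)
Denver_CO  <- c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96)
Fargo_ND   <- c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)

cities <- c("Dayton_OH", "Houston_TX", "Denver_CO", "Fargo_ND")
line_colors <- c("blue", "red", "purple", "darkgreen")

# Extra room above 100 leaves space for the legend
plot(Date, Dayton_OH, type = "o", col = line_colors[1],
     xlab = "Date in July", ylab = "Highest Temperature", ylim = c(80, 105))
lines(Date, Houston_TX, type = "o", col = line_colors[2])
lines(Date, Denver_CO, type = "o", col = line_colors[3])
lines(Date, Fargo_ND, type = "o", col = line_colors[4])

# One legend entry per city, matching the line colors
legend("top", legend = cities, col = line_colors, lty = 1, pch = 1,
       horiz = TRUE, bty = "n", cex = 0.8)
```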

README

You can use the following single-character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund 2018):

A: Toggle between showing the current slide and all slides (helpful for printing all pages)
B: Make fonts larger
C: Show table of contents
S: Make fonts smaller

Xie, Yihui, Joseph J Allaire, and Garrett Grolemund. 2018. R Markdown: The Definitive Guide. CRC Press.

MTH 209 Data Manipulation and Management

Lesson 6: Basic Graphical Displays