In this session, we will use the Black Friday data available on Kaggle to study how to make the following graphical displays.
library(tidyverse)
Friday <- read_csv("C:/Users/tessa/Dropbox/MTH 209/Class Handouts/Data/Black_Friday.csv")
Here is a list of common arguments:
To understand customer purchase behavior across products in different categories, the retail company “ABC Private Limited”, in the United Kingdom, shared a purchase summary of various customers for selected high-volume products from last month. The data contains the following variables.
We can use the glimpse() function to get a glimpse of the data.
glimpse(Friday)
## Rows: 550,068
## Columns: 12
## $ User_ID <dbl> 1000001, 1000001, 1000001, 1000001, 1000002…
## $ Product_ID <chr> "P00069042", "P00248942", "P00087842", "P00…
## $ Gender <chr> "F", "F", "F", "F", "M", "M", "M", "M", "M"…
## $ Age <chr> "0-17", "0-17", "0-17", "0-17", "55+", "26-…
## $ Occupation <dbl> 10, 10, 10, 10, 16, 15, 7, 7, 7, 20, 20, 20…
## $ City_Category <chr> "A", "A", "A", "A", "C", "A", "B", "B", "B"…
## $ Stay_In_Current_City_Years <chr> "2", "2", "2", "2", "4+", "3", "2", "2", "2…
## $ Marital_Status <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0…
## $ Product_Category_1 <dbl> 3, 1, 12, 12, 8, 1, 1, 1, 1, 8, 5, 8, 8, 1,…
## $ Product_Category_2 <dbl> NA, 6, NA, 14, NA, 2, 8, 15, 16, NA, 11, NA…
## $ Product_Category_3 <dbl> NA, 14, NA, NA, NA, NA, 17, NA, NA, NA, NA,…
## $ Purchase <dbl> 8370, 15200, 1422, 1057, 7969, 15227, 19215…
A bar chart is a graphical display that works well for a general audience. Here, we study the distribution of the age groups of the company’s customers who purchased its products on Black Friday.
Usage: barplot(height, …)
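As a minimal sketch of what the height argument expects: barplot() takes a vector or table of bar heights, which for a categorical variable is typically the output of table(). The age_group vector below is made-up toy data, not the Black Friday data.

```r
# Toy data (hypothetical): age groups for eight customers
age_group <- c("0-17", "26-35", "26-35", "55+", "26-35", "0-17", "36-45", "55+")

# table() counts how many times each category occurs
counts <- table(age_group)
print(counts)

# barplot() draws one bar per category and invisibly returns the bar midpoints
mids <- barplot(counts, xlab = "Age Group", ylab = "Number of Customers")
```

The returned midpoints are handy when text or extra marks need to be placed on top of the bars.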
We can also draw horizontal bars by setting horiz = TRUE. The argument col assigns a color to the bars, and the argument main sets the title of the figure.
par(mgp=c(4,1,0)) # change the margin line for the axis title, axis labels and axis line
par(mar=c(5,7,4,2)) # set margin of the figure
barplot(table(Friday$Age), col = "lightblue", main = "Distribution of Customers' Age Group", horiz = TRUE, xlab = "Number of Customers",
        ylab = "Age Group", las=1)
Note: The margins of a figure can be set using the par() function. The order of the settings is c(bottom, left, top, right).
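Settings changed with par() stay in effect for later figures. One common pattern, sketched here with made-up counts, is to keep the old values that par() returns and restore them afterwards. The name old_par is only illustrative.

```r
# par() returns the previous values of the settings it changes
old_par <- par(mar = c(5, 7, 4, 2), mgp = c(4, 1, 0))

# Toy counts (hypothetical), drawn with the wider left margin
barplot(c(A = 3, B = 5, C = 2), horiz = TRUE, las = 1,
        xlab = "Number of Customers", ylab = "City Category")

# Restore the previous settings so later figures are not affected
par(old_par)
```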
We can use hexadecimal RGB color codes to assign colors.
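A short sketch of how such a code is read, using the color from the call that follows: the string has the form "#RRGGBB", with two hexadecimal digits for each of red, green, and blue. The variable names are illustrative only.

```r
# A hex color is "#RRGGBB": two hexadecimal digits each for red, green, blue
hex <- "#69b3a2"

# col2rgb() decodes the string into channel values between 0 and 255
channels <- col2rgb(hex)
print(channels)

# rgb() goes the other way and builds the hex string from the channel values
rebuilt <- rgb(105, 179, 162, maxColorValue = 255)
print(rebuilt)
```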
barplot(table(Friday$Age), col = "#69b3a2", main = "Distribution of Customers' Age Group", xlab = "Age Group", ylab ="Number of Customers")
Similarly, we can use a pie chart to study the distribution of the city category.
Usage: pie(x, …)
library(tidyverse)
pie(table(Friday$City_Category), main = "Distribution of City Category", col = c("#264e70", "#679186", "#bbd4ce"))
The following code chunk shows a more advanced setting.
H <- table(Friday$City_Category)
percent <- round(100*H/sum(H), 1) # calculate percentages
pie_labels <- paste(percent, "%", sep="") # include %
pie(H, main = "Distribution of City Category", labels = pie_labels, col = c("#54d2d2", "#ffcb00", "#f8aa4b"))
legend("topright", c("A","B","C"), cex = 0.8, fill = c("#54d2d2", "#ffcb00", "#f8aa4b"))
Tip: Use a color palette to choose colors (Google search: color scheme generator).
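As an alternative to picking hex codes by hand, base R ships with palettes of its own. The sketch below assumes R 3.6.0 or later, where hcl.colors() is available, and uses made-up counts. The palette name "Dark 3" is one of the names listed by hcl.pals().

```r
# Ask for three colors from a built-in qualitative palette (R >= 3.6.0)
city_cols <- hcl.colors(3, palette = "Dark 3")
print(city_cols)

# Toy counts (hypothetical) for three city categories
H <- c(A = 30, B = 45, C = 25)
pie(H, main = "Distribution of City Category", col = city_cols)
```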
A histogram is used when we want to study the distribution of a quantitative variable. Here we study the distribution of the customer purchase amount.
Usage: hist(x, …)
hist(Friday$Purchase, main = "Distribution of Customer Purchase Amount", xlab = "Purchase Amount (British Pounds)")
The histogram shows that the distribution of the purchase amount is multimodal.
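To see where the peaks of a histogram come from, it can help to look at the bin counts directly. The sketch below uses made-up purchase amounts and the breaks argument to fix the bin edges. With plot = FALSE, hist() returns the counts without drawing.

```r
# Toy purchase amounts (hypothetical) clustered around two values
purchase <- c(30, 35, 40, 40, 45, 110, 160, 165, 170, 170, 172)

# plot = FALSE returns the bin counts without drawing anything
h <- hist(purchase, breaks = seq(0, 200, by = 25), plot = FALSE)
print(h$counts)

# The same call without plot = FALSE draws the histogram
hist(purchase, breaks = seq(0, 200, by = 25),
     main = "Toy Purchase Amounts", xlab = "Purchase Amount")
```

Two separated bins with high counts are what a multimodal shape looks like in the numbers.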
Here, we talk about another graphical display that can be used to study the distribution of a quantitative variable: box and whisker plot (boxplot).
Usage: boxplot(x, …) or boxplot(formula, …)
In general, a boxplot is used when we want to compare the distribution of a quantitative variable across several groups. In the following, we study the distribution of the customer purchase amount among different age groups.
We can improve the quality of the boxplot by changing a few of the default settings.
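A minimal sketch of the formula interface with default settings, using a made-up data frame called toy rather than the Black Friday data. The formula Purchase ~ Age asks for one box of Purchase per level of Age.

```r
# Toy data (hypothetical): purchase amounts for two age groups
toy <- data.frame(
  Purchase = c(10, 20, 30, 40, 50, 15, 25, 35, 45, 95),
  Age      = rep(c("18-25", "26-35"), each = 5)
)

# Purchase ~ Age reads "Purchase split by Age": one box per age group
b <- boxplot(Purchase ~ Age, data = toy)

# The returned list holds the numbers behind each box
# (rows: lower whisker, lower hinge, median, upper hinge, upper whisker)
print(b$names)
print(b$stats)
```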
boxplot(Friday$Purchase ~ Friday$Age, main = "Distribution of Purchase by Age Group",
        xlab = "Customer Age Group", ylab = "Purchase Amount", cex.lab = 1.25, cex.axis = 1.25, col = "#54d2d2")
We can use the argument data to name the data frame that contains the variables, so that they can be referred to by name in the formula.
boxplot(Purchase ~ Gender + Marital_Status, data = Friday, main="Distribution of Purchase by Sex and Marital_Status",
        xlab="Sex and Marital Status", ylab="Purchase", cex.lab=1.25, cex.axis=1.25,
        names = c("Female & Single", "Male & Single", "Female & Married", "Male & Married"))
When we want to study the relationship between two quantitative variables, a scatterplot can be used. Since this data set doesn’t have another quantitative variable, for demonstration purposes we create a variable Estimated_age by generating a random age for each customer based on their age group.
# create a function to generate a random integer in the age group
random_age <- function(age_group){
  if (!str_detect(age_group, "\\+")){
    # a range such as "26-35": split it into its lower and upper bounds
    age_range <- as.numeric(unlist(strsplit(age_group, "-")))
    age <- sample(age_range[1]:age_range[2], 1)
  }else{
    # an open-ended group such as "55+": use 100 as the upper bound
    age_left <- as.numeric(gsub("\\+", "", age_group))
    age <- sample(age_left:100, 1)
  }
  return(age)
}
# create a new column in the data
Friday$Estimated_age <- rep(NA, nrow(Friday))
# apply the function to each customer's age group
for (i in 1:nrow(Friday)){
  Friday$Estimated_age[i] <- random_age(Friday$Age[i])
}
Note: Since we haven’t talked about how to create functions or how to use loops and conditional statements, it is okay to skip this part and focus on the creation of figures at this point.
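For readers who do want to look closer, the loop can also be written as a single vapply() call. This is a sketch of an alternative, not the handout's method: random_age2 is a hypothetical base-R rewrite of random_age(), the age groups are toy values, and the seed is arbitrary.

```r
set.seed(209)  # arbitrary seed so the random ages are reproducible

# Same idea as random_age(), rewritten with base R only
random_age2 <- function(age_group) {
  if (grepl("+", age_group, fixed = TRUE)) {
    lower <- as.numeric(sub("+", "", age_group, fixed = TRUE))
    upper <- 100                      # same upper limit as in random_age()
  } else {
    bounds <- as.numeric(strsplit(age_group, "-", fixed = TRUE)[[1]])
    lower <- bounds[1]
    upper <- bounds[2]
  }
  sample(lower:upper, 1)
}

# Toy age groups (hypothetical), in the same format as Friday$Age
age_groups <- c("0-17", "26-35", "55+", "18-25")

# vapply() applies the function to every element and returns a numeric vector
estimated_age <- vapply(age_groups, random_age2, numeric(1), USE.NAMES = FALSE)
print(estimated_age)
```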
Then we study the relationship between the customer purchase amount and the customer’s estimated age using the data of 50 randomly selected customers.
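Because sample() draws at random, the selected customers (and so the scatterplot) change from run to run. A minimal sketch of making such a draw repeatable with set.seed() follows. The seed value is arbitrary, and n_rows is a made-up stand-in for nrow(Friday).

```r
n_rows <- 1000   # stand-in for nrow(Friday); hypothetical value

set.seed(1)                          # arbitrary seed
first_draw <- sample(1:n_rows, 50)

set.seed(1)                          # resetting the seed repeats the draw
second_draw <- sample(1:n_rows, 50)

# Compare the two draws element by element
print(identical(first_draw, second_draw))
```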
index <- sample(1:nrow(Friday), 50)
# create the scatterplot
plot(Purchase ~ Estimated_age, data=Friday[index,], xlab="Customer's Estimated Age", ylab="Customer Purchase Amount")
When we want to show how a quantitative variable changes over a period of time, a line plot can be used. Line plots can also be used to compare changes over the same period of time for several groups.
In order to graph a line plot, we need to know two additional arguments.
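The two arguments are not listed at this point in the handout. As an assumption, the sketch below shows two that often matter for line plots in base R: type, which chooses between points, lines, or both, and lty, which chooses the line style. It also adds a legend so the series can be told apart. The data are made up.

```r
# Toy data (hypothetical): two series observed on the same five days
day      <- 1:5
series_a <- c(70, 72, 75, 74, 78)
series_b <- c(65, 69, 68, 73, 71)

# type = "o" overplots points and lines; lty picks the line style
plot(day, series_a, type = "o", lty = 1, col = "blue",
     xlab = "Day", ylab = "Value", ylim = range(series_a, series_b))

# lines() adds a second series to the existing plot
lines(day, series_b, type = "o", lty = 2, col = "red")

# legend() ties each color and line style to its series
legend("topleft", legend = c("Series A", "Series B"),
       col = c("blue", "red"), lty = c(1, 2), pch = 1)
```

Setting ylim from the range of all series keeps every line inside the plotting region.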
Since the Black Friday Data are not time series data, it is not appropriate to use a line plot. In the following code chunk, we create a data frame using the forecasted highest temperatures from July 13 to July 22 (The Weather Channel).
Date <- 13:22
Dayton_OH <- c(84, 86, 91, 89, 89, 91, 92, 91, 91, 91)
Houston_TX <- c(100, 97, 96, 94, 94, 94, 93, 93, 92, 91)
Denver_CO <- c(95, 85, 89, 96, 97, 96, 92, 91, 95, 96)
Fargo_ND <- c(86, 80, 84, 87, 90, 87, 83, 84, 87, 89)
df <- data.frame(Date, Dayton_OH, Houston_TX, Denver_CO, Fargo_ND)
plot(Date, Dayton_OH, type="o", col="blue", xlab="Date in July", ylab="Highest Temperature", ylim=c(80, 100))
lines(Date, Houston_TX, type="o", col="red")
lines(Date, Denver_CO, type="o", col="purple")
lines(Date, Fargo_ND, type="o", col="darkgreen")
You can use the following single-character keyboard shortcuts to enable alternate display modes (Xie, Allaire, and Grolemund 2018):
A: Toggle between showing the current slide and all slides (helpful for printing all pages)
B: Make fonts larger
C: Show table of contents
S: Make fonts smaller