Introduction

Column

Motivation

An Analysis of Offensive Baseball Statistics in Team Scoring

The purpose of this study was to analyze the relationship between statistics used to measure offensive performance in baseball and the run production of a team.

Understanding these relationships is an important task for front offices in baseball as it allows them to put the best team out on the field given the resources that they have.

A variety of statistics and factors were considered in this study:

  • Batting Average (BA)

  • On Base Percentage (OBP)

  • Slugging Percentage (SLG)

  • On Base Plus Slugging (OPS)

  • Offensive Wins Above Replacement (OWAR)

  • Market Size

Sources:

Baseball Reference

U.S. Census Bureau, Vintage 2021 Population Estimates

Make Interactive Maps with R

Column

Get to know the data

Below is a glimpse of the data used in this study. This dataset contains 300 observations.

Average Runs per Game by Each Team From 2012-2021

Data Exploration

Column

Explanations of Plots

The scatter plots on the left help paint a picture as far as correlation between our data and the runs per game scored by MLB teams. As expected, there is a fairly strong, positive correlation between each of the offensive statistics and runs per game. Notably, OBP and SLG seem to have the strongest correlation. This is where OPS comes in. Representing the sum of OBP and SLG, it is evident that the sum of these two statistical measures gives an even tighter fit with runs per game.

Notably, however, is the scatter plot for market size. Given that teams in larger markets have more access to resources, we expected that this access to resources would allow, on average, larger market teams to field a better hitting team than a smaller market teams. However, evident from the scatter plot, this is not the case and smaller market teams still seem to be competitive.

Below the scatter plots is a correlogram plot. This plot displays the correlation between explanatory variables. We excluded market size as well as OPS in this plot because market size showed no correlation with runs per game and OPS is a function of OBP and SLG. As expected, all of our explanatory variables have a positive correlation with each other. OBP seems to have very strong correlations with BA and SLG.

Regression Including OWAR

Regression Excluding OWAR

Regression Excluding OWAR and BA

Column

Scatterplots of Average runs against the Predictors

Correlation of the offensive statistics

Diagnostics

Column

Explanation

First, we load the necessary packages and data. Fit a linear model to the data and obtain necessary objects for diagnostics plots.

1. Linearity and Homoscedasticity Assumptions

We check the linearity and homoscedasticity assumption: Observe the plot of Residuals vs. Fitted Values for both regression outputs. The residuals are not too far away from 0. In addition, there is no pattern in the residuals from either plot and overall they are equally spread around the y = 0 line. So this does not imply a relationship between the residuals and our dependent variable, and the equal variance assumption is not violated.

2. Normality Assumption

We check the normality assumption: The QQ-Plot shows the distribution of residuals. The better the residuals fit on the line, the better they follow a normal distribution. Overall, the residuals on these plots fit well with the line of fit so we can assume the residuals are approximately normal.

Column

Linearity

Normality

Cluster Analysis

Column

Dissimilarity Matrix for all 30 teams

Cluster Dendogram for all 30 teams

Column

Cluster Analysis

The correlogram plot is a representation of the degree each team is similar to one another, with the darkest red indicating complete similarity and the darkest blue indicating little similarity.

The Cluster Dendogram shows a hierarchical relationship between all of the teams. By focusing on the height at which any two teams are joined together, you can see how similar one team is to another. One quick example, you can see that the San Diego Padres and Seattle Mariners are very similar.

The cluster map is meant to visualize the clusters of teams. Our goal with the clustering was to see if we could recognize any sort of pattern among the teams that clustered together. In cluster 3, it is notable that the teams with larger markets are clustered together. In cluster 2, we can see teams that have had overall more success on offense than teams in cluster 1. On the next page, there is a map you can explore to see locations of all these teams and their groups.

Cluster map with 3 clusters for all 30 teams

Group Comparison

Column

Fig. 1: Average Runs

Fig. 2: Average Batting Average

Column

Fig. 3: Average On Base Percentage

Fig. 4: Average Slugging Percentage

Column

Fig. 5: Average On Offensive Wins Above Replacement

Fig. 6: Market Size by Group

Interactive Map

An interactive map of groups

About the Author

Column

My Background

My name is Sebastian Meinking and I am an undergraduate student at the University of Dayton. Currently, my projected graduation is May 2023.

Right now, I am working towards completing a B.S. in Applied Mathematical Economics with minors in both Actuarial Science and Computer Science.

I am interested in pursuing full time employment in the actuarial field after graduation.

As far as my experience goes, I have performed programming in R, Python, and Java. During my most recent internship, I worked as a Retirement Actuarial Consulting Intern for WTW from May 2022 - August 2022. During this internship, I learned the basics of pension plan valuations, conducted benefit calculations, and attended client meetings with actuaries in the company.

Feel free to connect with me on LinkedIn here.

Column {data-width=300 fontsize: 20}

Photo

Sebastian Meinking

Sebastian Meinking

---
title: "Baseball Dashboard"
author: "Sebastian Meinking"
output: 
  flexdashboard::flex_dashboard:
    theme: default
    orientation: columns
    vertical_layout: fill
    source_code: embed
---




  
```{r setup, include=FALSE}
# load necessary packages
pacman::p_load(corrplot, DataExplorer, DT, factoextra, flexdashboard, 
               leaflet, pandoc, plotly, tidyverse, plyr)

# read the data set 
all_files <- list.files(pattern = '*.csv')
for (i in 1:length(all_files)) assign(all_files[i], read.csv(all_files[i]))
names(final_data.csv)[9] <- "Market Size in Millions"
```



Introduction
=======================================================================
Column {data-width=450}
-----------------------------------------------------------------------
### Motivation

**An Analysis of Offensive Baseball Statistics in Team Scoring**

The purpose of this study was to analyze the relationship between statistics used to measure offensive performance
in baseball and the run production of a team. 

Understanding these relationships is an important task for front offices in baseball as it allows them to put the 
best team out on the field given the resources that they have.

A variety of statistics and factors were considered in this study:

- Batting Average (BA)

- On Base Percentage (OBP)

- Slugging Percentage (SLG)

- On Base Plus Slugging (OPS)

- Offensive Wins Above Replacement (OWAR)

- Market Size


Sources: 

[Baseball Reference](https://www.baseball-reference.com/)

[U.S. Census Bureau, Vintage 2021 Population Estimates](https://www.census.gov/
library/stories/2021/03/major-league-baseball-is-back.html)

[Make Interactive Maps with R](https://towardsdatascience.com/making-interactive
-maps-in-r-with-less-than-15-lines-of-code-bfd81f587e12)


Column {data-width=550}
-----------------------------------------------------------------------


    


### Get to know the data

Below is a glimpse of the data used in this study. This dataset contains `r nrow(final_data.csv)` observations. 

```{r}
df <- format.data.frame(final_data.csv[,-2], digits = 3)

datatable(df, rownames = FALSE,
              colnames = c("Teams", "BA", "OBP", "OPS", "SLG", "OWAR", "Runs per Game", "Market Size (Millions)"),
              class = "hover",
              options = list(
                columnDefs = list(list(className = 'dt-center', 
                                       targets = 1:5)), pageLength = 5,
                initComplete = JS("function(settings, json) {",
      "$(this.api().table().header()).css({'background-color': '#4582ec', 'color': '#fff'});","}")
))

```



### Average Runs per Game by Each Team From 2012-2021

```{r, fig.align='center'}
#create histogram of run data
p1 <- ggplot(Cluster_Data.csv, aes(x = Avg_Runs)) +
  geom_histogram(fill = "#30ADE7", color = "white", bins = 6) +
  labs(title = "Histogram of Runs per Game",
       x = "Runs per Game") +
  theme(text = element_text(size = 14)) +
  theme_minimal()

ggplotly(p1)

```


Data Exploration
=======================================================================
Column {.tabset data-width=500}
-----------------------------------------------------------------------

### Explanations of Plots

The scatter plots on the left help paint a picture as far as correlation between
our data and the runs per game scored by MLB teams. As expected, there is a fairly
strong, positive correlation between each of the offensive statistics and runs 
per game. Notably, OBP and SLG seem to have the strongest correlation. This is
where OPS comes in. Representing the sum of OBP and SLG, it is evident that the
sum of these two statistical measures gives an even tighter fit with runs per 
game. 

Notably, however, is the scatter plot for market size. Given that teams in 
larger markets have more access to resources, we expected that this access to 
resources would allow, on average, larger market teams to field a better hitting
team than a smaller market teams. However, evident from the scatter plot, this
is not the case and smaller market teams still seem to be competitive. 

Below the scatter plots is a correlogram plot. This plot displays the correlation
between explanatory variables. We excluded market size as well as OPS in this 
plot because market size showed no correlation with runs per game and OPS is 
a function of OBP and SLG. As expected, all of our explanatory variables have a 
positive correlation with each other. OBP seems to have very strong correlations
with BA and SLG. 

### Regression Including OWAR

```{r, fig.align = 'center', out.width='50%'}
   knitr::include_graphics('Regression_OWAR.png')
```

### Regression Excluding OWAR

```{r, fig.align = 'center', out.width='50%'}
   knitr::include_graphics('Regression_no_OWAR.png')
```

### Regression Excluding OWAR and BA

```{r, fig.align = 'center', out.width='50%'}
   knitr::include_graphics('Regression_OPS.png')
```



Column {data-width=500}
-----------------------------------------------------------------------

### Scatterplots of Average runs against the Predictors 

```{r, fig.align='center', out.width="80%"}
data_list <- final_data.csv[,c(3,4,5,6,7,8,9)]
plot_scatterplot(data_list, 
                 by = "Runs_per_Game", 
                 sampled_rows = 300, 
                 title = "Runs per Game Compared to Sample Data",
                 geom_point_args = list(color ="#30ADE7"), 
                 theme_config = list(title = element_text(size = 13)), ncol = 2L)
```

### Correlation of the offensive statistics

```{r fig.align='center', out.width="80%"}
res <- cor(data_list[c(2:4,6)])
corrplot(res, type = "upper", order = "hclust",
         tl.col = "black", tl.srt = 45)
```



Diagnostics
=======================================================================
Column {data-width=350}
-----------------------------------------------------------------------

### Explanation

 
First, we load the necessary packages and data. Fit a linear model to the data and obtain necessary objects for diagnostics plots. 

**1. Linearity and Homoscedasticity Assumptions**

We check the linearity and homoscedasticity assumption: Observe the plot of Residuals vs. Fitted Values for both regression outputs.
The residuals are not too far away from 0. In addition, there is no pattern in the residuals from either plot and overall they are equally spread around the y = 0 line. So this does not imply a relationship between the residuals and our dependent variable, and the equal variance assumption is not violated. 


**2. Normality Assumption**

We check the normality assumption: The QQ-Plot shows the distribution of residuals. The better the residuals fit on
the line, the better they follow a normal distribution. Overall, the residuals on these plots fit well with the line 
of fit so we can assume the residuals are approximately normal. 


Column {.tabset data-width=650}
-----------------------------------------------------------------------

```{r}
fit <- lm(Runs_per_Game ~ BA + OBP + SLG + OWAR, data=final_data.csv)
fit_v2 <- lm(Runs_per_Game ~ BA + OBP + SLG, data=final_data.csv)
fit_v3 <- lm(Runs_per_Game ~ OBP + SLG, data = final_data.csv)

#obtain values needed in order to get diagnostics plots
# Extract fitted values
Fitted.Values <- fit$fitted.values
Fitted.Values_v2 <- fit_v2$fitted.values
fitted.values_v3 <- fit_v3$fitted.values

# Extract residuals
Residuals <- fit$residuals
Residuals_v2 <- fit_v2$residuals
Residuals_v3 <- fit_v2$residuals

# Calculate standardized residuals 
Standardized.Residuals <- scale(fit$residuals)


# Extract fitted values for lm() object
Theoretical.Quantiles <- qqnorm(Residuals, plot.it = F)$x

# find Square root of abs(residuals)
Root.Residuals <- sqrt(abs(Standardized.Residuals))

# Calculate Leverage
Leverage <- lm.influence(fit)$hat

# Create data frame 
# Will be used as input to plot_ly

diagnostics <- data.frame(Fitted.Values, 
                     Residuals, 
                     Standardized.Residuals, 
                     Theoretical.Quantiles,
                     Root.Residuals,
                     Leverage)


```

### Linearity

```{r, fig.align='center', out.width="50%"}
# Fitted vs Residuals
plot(fitted(fit), Residuals, xlab = "Fitted Values", ylab = "Residuals",
     main = "OWAR Included", col = "#30ADE7")
abline(0,0)
plot(fitted(fit_v2), Residuals_v2, xlab = "Fitted Values", ylab = "Residuals",
     main = "OWAR not Included", col = "#30ADE7")
abline(0,0)
plot(fitted(fit_v3), Residuals_v2, xlab = "Fitted Values", ylab = "Residuals",
     main = "OWAR and BA not Included", col = "#30ADE7")
abline(0,0)
```



### Normality

```{r, fig.align='center', out.width="50%"}
# QQ Plot
qqnorm(Residuals, main = "OWAR Included", col = "#30ADE7")
qqline(Residuals)
qqnorm(Residuals_v2, main = "OWAR not Included", col = "#30ADE7")
qqline(Residuals_v2)
qqnorm(Residuals_v3, main = "OWAR not Included", col = "#30ADE7")
qqline(Residuals_v3)

```


Cluster Analysis 
=======================================================================

Column
-----------------------------------------------------------------------

### Dissimilarity Matrix for all 30 teams

```{r}
Cluster_no_runs_df.csv <- Cluster_no_runs.csv[,-1]
Cluster_no_runs_df.csv <- Cluster_no_runs_df.csv[,-1]
row.names(Cluster_no_runs_df.csv) <- Cluster_Data.csv[,1] 
Cluster_no_runs_df.csv <- scale(Cluster_no_runs_df.csv[,-6])

# Compute dissimilarity matrix 
res.dist <- dist(Cluster_no_runs_df.csv, method = "euclidean")
# Compute hierarchical clustering
res.hc <- hclust(res.dist, method = "ward.D2")
# Visualize the dissimilarity matrix
fviz_dist(res.dist, lab_size = 8.5)

```

### Cluster Dendogram for all 30 teams

```{r}
#Cluster dendogram
plot(res.hc, cex = 0.8)
```

Column
-----------------------------------------------------------------------

### Cluster Analysis

The correlogram plot is a representation of the degree each team is similar to 
one another, with the darkest red indicating complete similarity and the darkest
blue indicating little similarity. 


The Cluster Dendogram shows a hierarchical relationship between all of the teams.
By focusing on the height at which any two teams are joined together, you can 
see how similar one team is to another. One quick example, you can see that the 
San Diego Padres and Seattle Mariners are very similar. 


The cluster map is meant to visualize the clusters of teams. Our goal with the
clustering was to see if we could recognize any sort of pattern among the teams
that clustered together. In cluster 3, it is notable that the teams with larger 
markets are clustered together. In cluster 2, we can see teams that have had 
overall more success on offense than teams in cluster 1. On the next page, there
is a map you can explore to see locations of all these teams and their groups. 

### Cluster map with 3 clusters for all 30 teams

```{r}
# Enhanced k-means clustering for dataset with no runs
res.km <- eclust(Cluster_no_runs_df.csv, "kmeans", k = 3, nstart = 25)
```

Group Comparison
===

Column {data-width=300}
---

### Fig. 1: Average Runs

```{r}
p1 <- ggplot(Cluster_no_runs.csv, aes(x=Group, y=Avg_Runs)) +
  geom_boxplot(fill = "#30ADE7")  +
  labs(y = "Average Runs") +
  theme(text = element_text(size = 16))

ggplotly(p1)
```

### Fig. 2: Average Batting Average

```{r}
p2 <- ggplot(Cluster_no_runs.csv, aes(x=Group, y=Avg_BA)) +
  geom_boxplot(fill = "#30ADE7") + 
  labs(y = "Average BA") + 
  theme(text = element_text(size = 16))

ggplotly(p2)
```

Column {data-width=300}
---

### Fig. 3: Average On Base Percentage

```{r}
p3 <- ggplot(Cluster_no_runs.csv, aes(x=Group, y=Avg_OBP)) +   
  geom_boxplot(fill= "#30ADE7")  +
  labs(y = "Average OBP") +
  theme(text = element_text(size = 16))
ggplotly(p3)
```

### Fig. 4: Average Slugging Percentage

```{r}
p4 <- ggplot(Cluster_no_runs.csv, aes(x=Group, y=Avg_SLG)) +   
  geom_boxplot(fill= "#30ADE7")  +
  labs(y = "Average SLG") +
  theme(text = element_text(size = 16))
ggplotly(p4)
```

Column {data-width=300}
---

### Fig. 5: Average On Offensive Wins Above Replacement

```{r}
p5 <- ggplot(Cluster_no_runs.csv, aes(x=Group, y=Avg_OWAR)) +   
  geom_boxplot(fill= "#30ADE7")  +
  labs(y = "Avererage OWAR") +
  theme(text = element_text(size = 16))
ggplotly(p5)
```

### Fig. 6: Market Size by Group

```{r}
p6 <- ggplot(Cluster_no_runs.csv, aes(x=Group, y=Market_Size)) +   
  geom_boxplot(fill= "#30ADE7")  +
  labs(y = "Market Size (Millions)") +
  theme(text = element_text(size = 16))
ggplotly(p6)
```

Interactive Map
=======================================================================

### An interactive map of groups 

```{r}
Cluster_Data.csv <- Cluster_Data.csv %>% 
  mutate(teamLabs = paste0('Team: ',Teams, '
Average Runs: ', round(Avg_Runs,2), '
Market Size: ', Market_Size, " (millions)")) icons <- makeAwesomeIcon( icon = 'circle', library = 'ion', markerColor = ifelse(Cluster_no_runs.csv$Group == "Group1", '#CC0000', ifelse(Cluster_no_runs.csv$Group == "Group2", "blue", "purple")) ) makeColorsandNames <- data.frame(groups = c('Group 1','Group 2','Group 3'), groups.cent = c('#CC0000',"#30ADE7",'#D251B9')) # adjust the longitude and latitude slightly of some teams Cluster_Data.csv$longitude[c(17, 19, 21, 27)] <- c(-117.9143, -73.8648, -87.8412, -73.7949) Cluster_Data.csv$latitude[c(17, 19, 21, 27)] <- c(33.8366, 40.8448, 42.3256, 40.7282) Cluster_no_runs.csv$latitude <- Cluster_Data.csv$latitude Cluster_no_runs.csv$longitude <- Cluster_Data.csv$longitude Cluster_no_runs.csv$teamLabs <- Cluster_Data.csv$teamLabs leaflet(data = Cluster_no_runs.csv) %>% setView(lng = -98.5795, lat = 39.8283, zoom = 4.5) %>% addTiles() %>% addAwesomeMarkers(~longitude, ~latitude, icon = icons, popup = ~teamLabs) %>% addLegend(position = 'bottomleft', colors = makeColorsandNames[,2],labels = makeColorsandNames[,1],opacity = 1,title = 'Groups') ``` About the Author ============================================================================= Column {data-width=700} ----------------------------------------------------------------------------- ### My Background My name is Sebastian Meinking and I am an undergraduate student at the University of Dayton. Currently, my projected graduation is May 2023. Right now, I am working towards completing a B.S. in Applied Mathematical Economics with minors in both Actuarial Science and Computer Science. I am interested in pursuing full time employment in the actuarial field after graduation. As far as my experience goes, I have performed programming in R, Python, and Java. During my most recent internship, I worked as a Retirement Actuarial Consulting Intern for WTW from May 2022 - August 2022. During this internship, I learned the basics of pension plan valuations, conducted benefit calculations, and attended client meetings with actuaries in the company. Feel free to connect with me on LinkedIn [here](https://www.linkedin.com/in/sebastianmeinking/). Column {data-width=300 fontsize: 20} ----------------------------------------------------------------------- ### Photo ```{r , fig.width=6, echo=FALSE, fig.cap="Sebastian Meinking", fig.align='right'} knitr::include_graphics("Headshot-resize.jpg") ```