class: center, middle, inverse, title-slide

.title[
# MTH 208 Exploratory Data Analysis
]
.subtitle[
## Lesson 03: Descriptive Statistics & Data Summarization
]
.author[
###
Ying-Ju Tessa Chen, PhD
Associate Professor
Department of Mathematics
University of Dayton
@ying-ju
ying-ju
ychen4@udayton.edu
]

---

## Learning Objectives

- Measures of Central Tendency: mean, median, mode
- Measures of Spread: range, variance, standard deviation, IQR
- Non-Parametric Statistics and Their Significance
- Skewness and Kurtosis
- Measures of Relationship: correlation and covariance
- Interpreting These Statistics in EDA

---

## Measures of Central Tendency

Central tendency measures are used to identify the center of a dataset or its typical value. These measures include the mean, median, and mode, each providing a different perspective on the central value of the data.

**Mean (Arithmetic Average)**

- `Definition`: The mean is the sum of all values in a dataset divided by the number of values.
- `Calculation`: Mean = (Sum of all values) / (Number of values)
- `Usage`: Appropriate for interval and ratio data, and when the data do not have extreme outliers.
- `Example`: The average height of a group of people.

---

### Measures of Central Tendency (Continued)

**Median (Middle Value)**

- `Definition`: The median is the middle value in a dataset when it is ordered from smallest to largest. For an even number of observations, it is the average of the two middle numbers.
- `Calculation`: Arrange the data in ascending order and identify the middle value.
- `Usage`: Useful for ordinal data or when the dataset contains outliers or is skewed, as it is not affected by extreme values.
- `Example`: The middle income in a list of incomes for a region.

**Mode (Most Frequent Value)**

- `Definition`: The mode is the value that appears most frequently in a dataset.
- `Usage`: It can be used for any level of measurement (nominal, ordinal, interval, ratio) and is particularly useful for categorical data.
- `Example`: The most common eye color in a sample of people.

---

### Measures of Central Tendency (Continued)

**Interpreting Central Tendency in EDA**

- `Insights`: These measures help in understanding the general trend or typical value of the data.
- `Contextual Use`: Depending on the nature of the data and its distribution, one measure may be more appropriate than the others.
- `Combination with Other Measures`: Often used alongside measures of spread (like standard deviation) to provide a more complete picture of the data.

---

## Measures of Spread

Measures of spread describe the variability or dispersion within a dataset. They help us understand how much individual data points differ from the central tendency. Key measures include the range, interquartile range, variance, and standard deviation.

**Range**

- `Definition`: The range is the difference between the highest and lowest values in a dataset.
- `Calculation`: Range = Maximum value - Minimum value
- `Usage`: The simplest measure of spread; however, it is sensitive to outliers.
- `Example`: In a dataset of temperatures over a week, the range is the difference between the highest and lowest recorded temperatures.

**Interquartile Range (IQR)**

- `Definition`: The IQR is the difference between the 75th percentile (upper quartile, Q3) and the 25th percentile (lower quartile, Q1) in a dataset: IQR = Q3 - Q1.
- `Usage`: Unlike the range, the IQR is not affected by outliers. It is often used in conjunction with box plots to identify outliers: data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are typically considered outliers. The IQR is useful for comparing the spread of different datasets.

---

### Measures of Spread (Continued)

**Variance**

- `Definition`: Variance measures the average squared deviation of each value from the mean of the dataset. It gives an idea of how widely the data are spread.
- `Calculation`: The "average" of the squared differences from the mean; for a sample, `$$\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2$$` where `\(X_1, X_2, \ldots, X_n\)` are the individual observations and `\(\bar{X}\)` is the sample mean.
- `Usage`: More comprehensive than the range; used for interval and ratio data. Higher variance indicates greater spread in the data.
- `Example`: Variance in the test scores of a class.

---

### Measures of Spread (Continued)

**Standard Deviation**

- `Definition`: Standard deviation is the square root of the variance. It is a measure of the amount of variation or dispersion in a set of values.
- `Calculation`: Square root of the variance.
- `Usage`: Widely used because it is in the same units as the data, making it more interpretable.
- `Example`: Standard deviation of heights within a population.

**Interpreting Spread in EDA**

- `Contextual Importance`: Helps in assessing the reliability of the mean. A small spread indicates that the data points tend to be close to the mean, while a large spread indicates more variability.
- `Skewness and Outliers`: These measures can indicate whether the data are skewed or whether outliers are affecting the data's spread.
- `Comparative Analysis`: Often used in conjunction with central tendency measures for a comprehensive data analysis.
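
---

### Computing These Measures in R (Sketch)

A minimal sketch of the measures above using base R, assuming a small made-up numeric vector `x` (hypothetical data, for illustration only). Base R's `mode()` returns a storage type rather than the statistical mode, so a small helper is used for the latter.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9, 60)   # hypothetical data; 60 acts as an outlier

# Central tendency
mean(x)        # arithmetic mean (pulled upward by the outlier)
median(x)      # middle value, robust to the outlier

# Statistical mode: most frequent value (no built-in in base R)
stat_mode <- function(v) {
  tab <- table(v)
  as.numeric(names(tab)[tab == max(tab)])
}
stat_mode(x)

# Spread
max(x) - min(x)               # range as a single number
IQR(x)                        # interquartile range, Q3 - Q1
quantile(x, c(0.25, 0.75))    # the quartiles themselves
var(x)                        # sample variance (divides by n - 1)
sd(x)                         # standard deviation, in the same units as x
```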

---

### Case Study I

The histogram below shows the City Median Rental Price for a one-bedroom home on [Zillow](https://www.kaggle.com/datasets/paultimothymooney/zillow-house-price-data?select=City_MedianRentalPrice_1Bedroom.csv) in December 2019. Comment on the distribution of the rental price and discuss the appropriate measures of central tendency and spread.

<img src="./figures/city_median_rental.jpeg" width="50%" style="display: block; margin: auto;" />

---

### Case Study II

The following histograms show the distributions of IQ score and cumulative grade point average for 100 college students, respectively. (Source: [College Placement Dataset](https://www.kaggle.com/datasets/sameerprogrammer/college-placement)) Comment on the distribution in each histogram and discuss the appropriate measures of central tendency and spread.

<br>

.pull-left[
<img src="./figures/IQ.jpeg" width="85%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="./figures/cgpa.jpeg" width="85%" style="display: block; margin: auto;" />
]

---

## Non-Parametric Statistics and Their Significance

Non-parametric statistics are a key area of statistics for analyzing data without assuming a specific distribution (such as the normal distribution). These methods are especially useful when dealing with non-normal datasets or when the data violate the assumptions required for parametric tests.

**Key Concepts of Non-Parametric Statistics**

- `Distribution-Free`: Non-parametric methods do not require the data to follow any specific distribution.
- `Types of Data`: Particularly useful for ordinal data or data on a nominal scale. Also applicable to interval or ratio data, especially when they are not normally distributed.
- `Applications`: Commonly used in situations with small sample sizes, heavily skewed data, or data with outliers.

---

### Non-Parametric Statistics and Their Significance (Continued)

**Examples of Non-Parametric Methods**

- `Mann-Whitney U Test`: Used to compare differences between two independent groups when the dependent variable is ordinal or continuous but not normally distributed.
- `Kruskal-Wallis Test`: An extension of the Mann-Whitney U Test for comparing more than two groups.
- `Spearman's Rank Correlation Coefficient`: Used to measure the strength and direction of association between two ranked variables.

---

### Non-Parametric Statistics and Their Significance (Continued)

**Significance in EDA**

- `Flexibility`: Offers a robust alternative to parametric methods, especially useful in exploratory data analysis, where the data may not meet parametric assumptions.
- `Handling Skewed Data`: Ideal for analyzing skewed datasets or datasets with outliers, where the mean and standard deviation might not be appropriate.
- `Insights into Data Structure`: Helps in understanding the underlying structure of the data, which might not be apparent with parametric methods.
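
---

### Non-Parametric Methods in R (Sketch)

A minimal sketch of the three methods above using base R. The groups `g1`, `g2`, `g3` and the paired vectors `u`, `v` are simulated here purely for illustration. In R, the Mann-Whitney U test is run with `wilcox.test()` (the Wilcoxon rank-sum test).

```r
set.seed(208)                       # reproducible simulated data
g1 <- rexp(20, rate = 1)            # skewed, non-normal samples
g2 <- rexp(20, rate = 1.5)
g3 <- rexp(20, rate = 2)

# Mann-Whitney U test: compares two independent groups without assuming normality
wilcox.test(g1, g2)

# Kruskal-Wallis test: extends the comparison to three or more groups
kruskal.test(list(g1, g2, g3))

# Spearman's rank correlation: association between two ranked variables
u <- rexp(30)
v <- u^2 + rnorm(30, sd = 0.5)      # monotone but nonlinear relationship
cor(u, v, method = "spearman")
```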

---

## Skewness and Kurtosis

**Introduction to Skewness**

- `Definition`: Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. It indicates whether the observations in a dataset are concentrated on one side.
- `Types`:
  - `Positive Skew`: The tail on the right side of the distribution is longer or fatter than the left side.
  - `Negative Skew`: The tail on the left side is longer or fatter than the right side.
- `Interpretation`:
  - Skewness close to 0 indicates a symmetric distribution.
  - A markedly positive or negative value indicates skewness and potential outliers.

---

## Skewness and Kurtosis (Continued)

**Introduction to Kurtosis**

- `Definition`: Kurtosis is a measure of the "tailedness" of the probability distribution of a real-valued random variable. It describes the heaviness of the tails and, loosely, the peakedness or flatness of the distribution compared to a normal distribution.
- `Types`:
  - `High Kurtosis (> 3)`: Indicates a distribution with heavy tails and a sharper peak ("leptokurtic").
  - `Low Kurtosis (< 3)`: Indicates a distribution with light tails and a flatter peak ("platykurtic").
- `Interpretation`:
  - Kurtosis close to 3 (the value for a normal distribution) is considered mesokurtic.
  - Extreme values suggest potential outliers and deviations from the normal distribution.

---

## Skewness and Kurtosis (Continued)

**Skewness and Kurtosis in EDA**

- `Purpose`: Understanding skewness and kurtosis is crucial in EDA for identifying the shape of the data's distribution, which can influence the choice of statistical methods and interpretations.
- `Data Transformation`: Data with high skewness or extreme kurtosis might require transformation to meet the assumptions of various statistical modeling techniques.
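
---

### Skewness and Kurtosis in R (Sketch)

A minimal sketch using simulated data and the common moment-based definitions, scaled so that a normal distribution has skewness near 0 and kurtosis near 3 (the convention above). The samples are hypothetical; contributed packages such as `moments` or `e1071` also provide ready-made skewness and kurtosis functions.

```r
# Moment-based sample skewness: third central moment / sd^3
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^(3/2)
}

# Moment-based sample kurtosis: fourth central moment / sd^4 (normal ~ 3)
kurt <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / (mean((x - m)^2))^2
}

set.seed(208)
sym  <- rnorm(1000)    # symmetric, mesokurtic
skwd <- rexp(1000)     # right-skewed, heavy right tail

c(skewness = skew(sym),  kurtosis = kurt(sym))    # roughly 0 and 3
c(skewness = skew(skwd), kurtosis = kurt(skwd))   # positive skew, kurtosis well above 3
```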

---

## Measures of Relationship

Understanding the relationships between variables is a critical aspect of EDA. Two key statistical measures used to assess these relationships are correlation and covariance.

**Covariance**

- `Definition`: Covariance is a measure that indicates the extent to which two variables change together. It assesses whether increases in one variable correspond to increases (positive covariance) or decreases (negative covariance) in the other.
- `Calculation`: The average of the products of deviations of pairs of observations from their individual means. The covariance between two variables `\(X\)` and `\(Y\)` is given by `$$Cov(X,Y)=\frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})$$` where:
  - `\(X_i\)` and `\(Y_i\)` are the individual values of the variables,
  - `\(\bar{X}\)` and `\(\bar{Y}\)` are the means of the variables,
  - `\(n\)` is the number of data points.

---

### Measures of Relationship (Continued)

**Covariance**

- `Interpretation`:
  - `Positive Covariance`: Indicates that as one variable increases, the other tends to increase.
  - `Negative Covariance`: Suggests that as one variable increases, the other tends to decrease.
  - `Zero or Near-Zero Covariance`: Implies no linear relationship between the variables.

---

### Measures of Relationship (Continued)

**Correlation**

- `Definition`: Correlation is a standardized measure of covariance that describes both the strength and direction of the linear relationship between two variables.
- `Types`:
  - `Pearson's Correlation Coefficient`: Measures the linear relationship between two interval or ratio variables. The Pearson correlation coefficient between two variables `\(X\)` and `\(Y\)` is given by `$$r = \frac{\sum_{i=1}^n (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^n (X_i-\bar{X})^2}\sqrt{\sum_{i=1}^n (Y_i-\bar{Y})^2}}.$$`
  - `Spearman’s Rank Correlation`: Used for ordinal variables or when the relationship is not linear.
- `Interpretation`:
  - Values range from -1 to +1.
  - +1: Perfect positive linear relationship.
  - -1: Perfect negative linear relationship.
  - 0: No linear relationship.

---

### Measures of Relationship (Continued)

**Correlation vs. Covariance**

- Covariance indicates the direction of the relationship but not its strength.
- Correlation is a standardized, more interpretable measure that provides both the direction and strength of the relationship.

**Significance in EDA**

- Understanding these measures helps in identifying potential relationships between variables, which can guide further analysis and modeling.
- They are used to explore data, test hypotheses, and select features for machine learning models.
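
---

### Covariance and Correlation in R (Sketch)

A minimal sketch using base R on simulated paired data (`x` and `y` below are made up for illustration). It mirrors the formulas above: `cov()` uses the `\(n-1\)` denominator and `cor()` returns the Pearson coefficient by default.

```r
set.seed(208)
x <- rnorm(50, mean = 10, sd = 2)
y <- 3 * x + rnorm(50, sd = 4)   # linearly related to x, plus noise

cov(x, y)                        # sample covariance (depends on the units of x and y)
cor(x, y)                        # Pearson correlation: unit-free, between -1 and +1
cor(x, y, method = "spearman")   # rank-based alternative for monotone, nonlinear trends

# Correlation is just covariance rescaled by the standard deviations:
cov(x, y) / (sd(x) * sd(y))      # equals cor(x, y)
```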

---

## Interpreting These Statistics in EDA

**Overview**

The interpretation of statistical measures is a critical component of EDA. This process involves understanding what various statistics reveal about a dataset and how this information can inform decision-making, hypothesis testing, and further analysis.

**Key Aspects of Interpretation**

- `Contextual Understanding`: Understanding data within the context of the subject area is crucial. Interpretations should align with the domain knowledge and objectives of the study.
- `Integrative Analysis`:
  - Combine various statistical measures (central tendency, spread, correlation, etc.) to gain a comprehensive understanding of the data.
  - Look for patterns, trends, and anomalies across different measures.

---

### Interpreting These Statistics in EDA (Continued)

**Key Aspects of Interpretation**

- `Correlation vs. Causation`:
  - Distinguish between correlation (two variables moving together) and causation (one variable influencing another). Be cautious about drawing conclusions of causality solely from correlational data.
- `Influence of Skewness and Outliers`:
  - Understand how skewness and outliers impact measures like the mean and variance, and adjust interpretations accordingly.
  - Use appropriate statistical methods to handle skewed or outlier-heavy data.
- `Role of Non-Parametric Statistics`:
  - Recognize situations where non-parametric methods provide more reliable insights, especially when the data do not meet parametric assumptions.

---

### Interpreting These Statistics in EDA (Continued)

**Practical Application in EDA**

- `Exploratory vs. Confirmatory`: EDA is exploratory, aimed at uncovering insights and forming hypotheses, not confirming them.
- `Visual Representation`: Use graphs and plots alongside numerical measures for a more intuitive understanding of the data.
- `Data-Driven Insights`: Use statistical interpretations to guide decisions on further data processing, feature selection, and potential areas for in-depth analysis.

---

## References

The lectures of this course are based on ideas from the following references.

- Exploratory Data Analysis by John W. Tukey
- A Course in Exploratory Data Analysis by Jim Albert
- The Visual Display of Quantitative Information by Edward R. Tufte
- Data Science for Business: What You Need to Know About Data Mining and Data-Analytic Thinking by Foster Provost and Tom Fawcett
- Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knaflic