class: center, middle, inverse, title-slide

# Math Club Meeting: Dating with Data
## Exploratory Data Analysis - Job Analysis
### Ying-Ju Tessa Chen, PhD
Ying-Ju
ychen4@udayton.edu
### October 5, 2021

---

# Ad: Undergraduate Mathematics Day

- Date: **Saturday, November 6, 2021**
- An undergraduate mathematics conference
- Contributed **15-minute talks**, primarily by undergraduate students, on mathematics research, the learning and teaching of mathematics, the history of mathematics, and applications to disciplines related to mathematics
- Talks are delivered **face to face or virtually**
- Two invited addresses
  * Suzanne Lenhart, University of Tennessee (One Health: Connecting Humans, Animals and the Environment)
  * Jennifer White, Saint Vincent College (The Crossings of Art, History, and Mathematics)
- Submit articles (based on talks presented) for publication in refereed online Conference Proceedings
- No registration fee, complimentary lunch
  * Registration and information at [UndergradMathDayRegistration](https://udayton.edu/artssciences/academics/mathematics/events/undergrad-math-day/index.php)
  * Deadline: **Sunday, October 31, 2021**

---

.pull-left[
## Why dating with data?

- Understand what you are working on
  * Size of the data
  * Type of variables
  * Any missing values?
  * Any outliers? What to do with them?
- Summarize the data
  * Characteristics of variables
- Find interesting patterns
- **What** can the data tell us?
- **Why** are things good to know?
- **How** to present the story?
]

.pull-right[
<img src="wordcloud.jpg" width="480" height="480" />
]

---

# Google Job Skills

- Google publishes all of its job openings at [https://careers.google.com/](https://careers.google.com/).
- The data were collected with Selenium by scraping the text of every job posting on the Google Careers site, and are provided on [kaggle.com](https://www.kaggle.com/niyamatalmass/google-job-skills?) by [Niyamat Ullah](https://www.kaggle.com/niyamatalmass).
- Available variables
  * **Title**: The title of the job
  * **Category**: Category of the job
  * **Location**: Location of the job
  * **Responsibilities**: Responsibilities for the job
  * **Minimum Qualifications**: Minimum Qualifications for the job
  * **Preferred Qualifications**: Preferred Qualifications for the job

---

### Let's check out the data set first.

```r
library(tidyverse)     # read_csv(), %>%, filter(), glimpse(), ggplot()
library(DataExplorer)  # plot_intro(), plot_missing()
df <- read_csv("job_skills.csv")
plot_intro(df, ggtheme = theme_minimal(base_size = 18))
```

<img src="analysis_files/figure-html/read_data-1.png" width="65%" style="display: block; margin: auto;" />

---

### Which variables have missing values?

```r
plot_missing(df, ggtheme = theme_minimal(base_size = 18))
```

<img src="analysis_files/figure-html/missing_values-1.png" width="65%" style="display: block; margin: auto;" />

---

### Get a glimpse of the data

```r
glimpse(df)
```

```
## Rows: 1,250
## Columns: 7
## $ Company                  <chr> "Google", "Google", "Google", "Google", "Goog~
## $ Title                    <chr> "Google Cloud Program Manager", "Supplier Dev~
## $ Category                 <chr> "Program Management", "Manufacturing & Supply~
## $ Location                 <chr> "Singapore", "Shanghai, China", "New York, NY~
## $ Responsibilities         <chr> "Shape, shepherd, ship, and show technical pr~
## $ Minimum_Qualifications   <chr> "BA/BS degree or equivalent practical experie~
## $ Preferred_Qualifications <chr> "Experience in the business technology market~
```

```r
table(df$Company)
```

```
## 
##  Google YouTube 
##    1227      23 
```

---

### Find what you need to get a job at Google

Since there are only 23 jobs from YouTube, we will focus on the job skills needed at Google.
```r
df <- df %>%
  filter(Company=="Google") %>%
  select(-Company)
```

#### Here is a list of things we would like to know:

- Popular job categories
- Popular job titles
- Where their offices are located
- Which locations need more employees
- Academic degree
- Years of experience
- Programming languages
- Popular subjects

---

**Popular Job Categories**

```r
ggplot(data=df) +
  geom_bar(aes(x=reorder(Category, Category, function(x) length(x))),
           fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Job Category", y="Number of Jobs") +
  theme_minimal(base_size = 18)
```

<img src="analysis_files/figure-html/job_categories-1.png" width="65%" style="display: block; margin: auto;" />

---

**Popular Job Titles**

```r
# Keep the 20 most frequent job titles
titles <- sort(table(df$Title), decreasing = TRUE)[1:20]
df1 <- data.frame(titles)
ggplot(data=df1, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat="identity", fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Top 20 Popular Job Titles", y="Number of Jobs") +
  theme_minimal(base_size = 26)
```

<img src="analysis_files/figure-html/job_titles-1.png" width="65%" style="display: block; margin: auto auto auto 0;" />

---

**Locations where Google needs more employees**

```r
# Keep the 20 locations with the most job postings
locations <- sort(table(df$Location), decreasing = TRUE)[1:20]
df2 <- data.frame(locations)
ggplot(data=df2, aes(x=reorder(Var1, Freq), y=Freq)) +
  geom_bar(stat="identity", fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Top 20 Popular Locations", y="Number of Jobs") +
  theme_minimal(base_size = 18)
```

<img src="analysis_files/figure-html/job_locations-1.png" width="65%" style="display: block; margin: auto;" />

---

**Academic Degree**

```r
ggplot(data=df3, aes(x=reorder(Degree, Count), y=Count)) +
  geom_bar(stat="identity", fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Degree", y="Number of Jobs") +
  theme_minimal(base_size = 18)
```

<img src="analysis_files/figure-html/job_degrees-1.png" width="65%" style="display: block; margin: auto;" />

---

**Years of Experience**

```r
ggplot(data=df4, aes(x=reorder(Year, Count), y=Count)) +
  geom_bar(stat="identity", fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Years of Experience", y="Number of Jobs") +
  theme_minimal(base_size = 18)
```

<img src="analysis_files/figure-html/exp_years-1.png" width="65%" style="display: block; margin: auto;" />

---

**Programming Languages**

```r
ggplot(data=df5, aes(x=reorder(Language, Count), y=Count)) +
  geom_bar(stat="identity", fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Programming Language", y="Number of Jobs") +
  theme_minimal(base_size = 18)
```

<img src="analysis_files/figure-html/job_languages-1.png" width="65%" style="display: block; margin: auto;" />

---

**Popular Subjects**

Finally, I am curious how often certain subjects come up when looking for a job at Google. Here, I only consider the following subjects: **Computer Science**, **Engineering**, **Mathematics**, and **Statistics**.
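
The plot below reads from `df6`, a table with `Subject` and `Count` columns whose construction is not shown on these slides. Here is a minimal sketch of one way such counts could be computed, assuming a job counts toward a subject when either qualification field mentions it (ignoring case). The names `jobs` and `subject_counts` and the toy rows are hypothetical and are not taken from the original script.

```r
# Hypothetical toy rows standing in for the two qualification columns;
# the real values come from job_skills.csv.
jobs <- data.frame(
  Minimum_Qualifications = c(
    "BA/BS degree in Computer Science or equivalent practical experience.",
    "Bachelor's degree in Engineering, Mathematics, or Statistics.",
    "BA/BS degree or equivalent practical experience."
  ),
  Preferred_Qualifications = c(
    "Master's degree in Computer Science or Engineering.",
    "Experience with statistics software.",
    "MBA."
  ),
  stringsAsFactors = FALSE
)

# Assumed rule: a job counts toward a subject if either qualification
# field mentions the subject, ignoring case.
subjects <- c("Computer Science", "Engineering", "Mathematics", "Statistics")
qualification_text <- paste(jobs$Minimum_Qualifications,
                            jobs$Preferred_Qualifications)

# One row per subject, with the number of jobs that mention it
subject_counts <- data.frame(
  Subject = subjects,
  Count = unname(vapply(
    subjects,
    function(s) sum(grepl(s, qualification_text, ignore.case = TRUE)),
    integer(1)
  )),
  stringsAsFactors = FALSE
)

print(subject_counts)
```

A table shaped like `subject_counts` can be passed to the same kind of `ggplot()` call used for `df6`. Simple keyword matching like this counts each job at most once per subject, but it can also match unrelated uses of a word (for example "engineering team"), so the counts are only a rough indicator.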
```r
ggplot(data=df6, aes(x=reorder(Subject, Count), y=Count)) +
  geom_bar(stat="identity", fill=daytonred, position=position_dodge(width=1)) +
  coord_flip() +
  theme(legend.position = "none") +
  labs(x="Subject", y="Number of Jobs") +
  theme_minimal(base_size = 18)
```

<img src="analysis_files/figure-html/job_subjects-1.png" width="65%" style="display: block; margin: auto;" />

---

# Ad: MTH 490 Intro to Programming with R

- **Spring**, 2022
- 1 credit hour
- **5:05PM - 6:20PM on Mondays**
- Things you will learn in this course
  * Basic data types
  * Basic data structures
  * Managing data
  * Creating basic graphical displays
  * Data manipulation
  * Data visualization with two R packages: ggplot2, plotly
  * R markdown presentation
  * Dynamic programming (if time permits)

---

# Thanks

.pull-left[
- Please do not hesitate to contact me if you have questions about learning R or other languages. You can email me at <a href="mailto:ychen4@udayton.edu"><i class="fa fa-paper-plane fa-fw"></i> ychen4@udayton.edu</a>.
- The R code used in this presentation can be found [here](https://raw.githubusercontent.com/Ying-Ju/MathClub.github.io/main/job_analysis.R).
- Slides were created via the R package **xaringan**, with styling based on:
  * [xaringanthemer](https://cran.r-project.org/web/packages/xaringanthemer/vignettes/xaringanthemer.html) package, and
  * Alison Hill's [@apreshill](https://github.com/apreshill/) CSS resources for customizing themes and fonts
- The formatting of slides is provided by Dr. Fadel M. Megahed [@fmegahed](https://github.com/fmegahed).
- The example data are provided on [kaggle.com](https://www.kaggle.com/niyamatalmass) by [Niyamat Ullah](https://www.kaggle.com/niyamatalmass).
]

.pull-right[
<img src="https://www.verouden.net/slides/presentation-xaringan/img/questions.gif" width="350" height="350" />
]