CS178A-B-Template

Overview

Riverside County has a low retention rate in their Department of Public Social Services whether an employee is leaving for another job or transferring between different divisions within the same department. This causes employee instability in the workplace and diminishes the available resources and time since the positions take time to fill and new employees require training. The goal of this project is to find what causes the low retention rate by looking at provided data and looking and demographics such as salary, age, education, etc. We will also be using ML model to analyze the features and find any correlations between the features.

Team Members

Usage

Demo: Run the notebooks. kfold.ipynb and distance_calculation.ipynb are still work in progress.

Dataset to look for when running specific ML model:

Use the dataset named source_2021-12-07-09-51-43.csv ( inside Machine Learning Models ) to run the following ML models inside Machine Learning Folders
With Feature Importance)Logistic Regression Updated .ipynb
With Feature Importance)Random Forest F1 scores - Updated with graph.ipynb
Without Feature Importance)Logistic + Random + Support Vector (Multiple Features).ipynb
Without Feature Importance)Logistic + Random Forest + Support Vector (3 Features).ipynb
Without Feature Importance)Logistic + Random Forest + Support Vector (4 Features).ipynb
Random Forest F1 scores.ipynb

How To Run

You can use the attached PowerBI file and use the PowerBI application to view the dashboard. Data analysis and ML analysis are located in the .ipynb notebook files which you can look at our analysis or run the code yourself.

In order to run the Notebooks, install Jupyter Notebook and upload the notebooks. For any dependencies used in the notebook, view Dependencies.

Diagrams

Due to the unique nature of our project, we do not have diagrams. Instead we outline our project with features and descriptions on how we plan to use the ML model; list of tools, methods, and algorithms we will use; and mockups of our end-product.

Features and Descriptions

Age - Visualize the age group of people in each department, and calculate the retention rate for this
Duration - Calculate the average number of days employees stayed in each department or left the job
Distance - Calculate how distance plays a role in an employee staying or quitting the job using Google Maps API
Hourly Pay/Department name - Find the average of how hourly pay has changed over the years in each department and then compare that with different counties. See if hourly pay has an effect on duration. For example comparing Riverside County vs Los Angeles County etc
Change Division - Calculate how many employees changed the division in recent years
Between Divisions - compare retention rate and salary between divisions

List of Tools, Methods, and Algorithms

Jupyter Notebook and Google Colab for data analysis in Python
PowerBI for the dashboard as this is what Riverside County currently uses to visualize the data
Web scraping of websites to find salary and cost of living of similar job positions
ML tools such as linear regression and KNN to analyze the dataset

EDA Analysis

Yearly Retention Rate:

Shows it is declining since 2010

Retention Rate By Divisions:

Children Services has the lowest retention rate followed by Self-Sufficency

Retention Rate By Sex:

There’s no correlation between gender and retention rate as they trend together

Retention Rate By Classification:

Staffs have lower retention rate compared to other groups

Retention Rate By Work Distance From Home:

There is no correlation between distance from work and home and retention rate.

Duration (Current vs. Past):

Shows the duration of employment in the department
Majority of employees (both past and current) has been employed for more than 5 years
About 36% of the employees in our data were past employees who worked less than 5 years

Duration Between Divisions:

The number of past employees who worked less than 5 years in Children Services is higher than average

Duration Between Ending Age Groups:

This might be misleading as we are looking at ending age group
Those who stay longer will moves up into higher age group so there’ll be less employees who had worked less than 5 years in those age groups

Duration Between Staring Age Groups:

There is less correlation when looking at the age group of when the employees were hired

Duration Between Classification:

Staffs has more employees who worked less than 5 years compared to others

Comprate Distribution:

The distribution of comprate for past employees who worked less than 5 years is more to the left

Machine Learning Models:

XGBoost

F1 scores: 71.39%, 70.97%, 76.50%, 78.09%, 80.82%, 50.58%
Accuracy for each label:
Retirement: 76.96%
Termination: 88.82%
Working: 51.85%

Random Forest

Accuracy for each label:
Retirement: 70.96%
Termination: 81.19%
Working: 25.67%

Neural Network

Features: Duration, Age_group, comprate, Division, Last_pay_raise
Labels and prediction accuracy
Retirement (43.97%), Termination (77.32%), Working (55.18%)
F1 Scores:19.36%, 67.82%, 76.5%, 78.33%, 79.11% and 64.22% average

-----------------------------------------------------------------------------------------------------------------------------

Everything below includes the MockUps and Draft versions we worked on early in the quarter to understand and clean the dataset.

-----------------------------------------------------------------------------------------------------------------------------

Regression Analysis

PowerBI dashboard

Early Drafts EDA

This was done during last quarter when we were trying to understand and clean the data.

Attrition Distribution ( Ex-employees vs Current employees )

Leavers by division

Leavers by Jobrole

Dependencies

Install pip Helpful Documentation
Libraries to install using pip
- pandas Documentation
- numpy Documentation
- plotly Documentation
- scikit-learn Documentation
- xgboost Documentation
Install googlemaps. Helpful Documentation

Trello

https://trello.com/b/TrKsqXMa/project

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
.github		.github
EDA		EDA
Machine Learning Models		Machine Learning Models
Project Screenshots		Project Screenshots
data		data
LICENSE		LICENSE
README.md		README.md

License

CS-UCR/senior-design-project-abraca-data

Folders and files

Latest commit

History

Repository files navigation

CS178A-B-Template

Table of Contents

Overview

Team Members

Usage

Dataset to look for when running specific ML model:

How To Run

Diagrams

Features and Descriptions

List of Tools, Methods, and Algorithms

EDA Analysis

Yearly Retention Rate:

Retention Rate By Divisions:

Retention Rate By Sex:

Retention Rate By Classification:

Retention Rate By Work Distance From Home:

Duration (Current vs. Past):

Duration Between Divisions:

Duration Between Ending Age Groups:

Duration Between Staring Age Groups:

Duration Between Classification:

Comprate Distribution:

Machine Learning Models:

XGBoost

Random Forest

Neural Network

-----------------------------------------------------------------------------------------------------------------------------

Everything below includes the MockUps and Draft versions we worked on early in the quarter to understand and clean the dataset.

-----------------------------------------------------------------------------------------------------------------------------

Early Drafts EDA

Dependencies

Trello

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 4

Languages

Packages