Covid-19_Vaccine_Hesitancy

10 SEP 2023 - This is a re-upload of a repository that I collaborated on. The project was created as part of a 6-month Data Analytics Bootcamp administered by George Washington University. The original repo with all collaborators can be found at https://github.com/danig89/Covid-19_Vaccine_Hesitancy. The purpose of re-uploading it is to preserve the work and provide a way to pin it on my profile, which for some reason I cannot do as a collaborator.

Project Overview

Draft presentation link:

Link_to_Project_Presentation

Background

This topic was chosen due to a shared interest in healthcare and public health as it relates to Covid-19. The group believes that AI/ML techniques can help in determining which demographic factors contribute to vaccine hesitancy.

Purpose

The purpose of this project is to explore which factors are more likely to contribute to an individual’s hesitancy about getting the Covid-19 vaccine. By analyzing Covid-19, US Census, and demographic data, we hope to determine:

Which demographic factors, such as income and poverty level, employment status, race/ethnicity, and access to transportation are more likely to contribute to vaccine hesitancy?
Can we assume that counties that voted for Donald Trump are more likely to have higher populations of individuals who are vaccine hesitant?

Resources

Data sources: US Census Demographic Data; Vaccine Hesitancy for COVID-19; Election, COVID, and Demographic Data by County; Urban-Rural Classification for Counties; County FIPS Codes
Software: Jupyter Notebook; QuickDBD; pgAdmin 4; Tableau Public; Amazon Web Services

Technologies

Data Cleaning and Analysis

Pandas and numpy were used to clean the data and perform preliminary exploratory analysis. Further analysis was completed using Python. Seaborn and matplotlib were used for data exploration and visualization.

Database Storage

Postgres and pgAdmin were used to create and store the database. AWS was used for cloud storage of the database. SQLAlchemy was used to connect to the database and load the data.

Machine Learning

Sklearn was used to split the data into training and testing sets, and to build and test our machine learning model.

Dashboard

Tableau Public was used to present the data and visualize our findings. Link_to_Tableau_Dashboard

Machine Learning Model

Description of preliminary data preprocessing

During preprocessing, four databases were joined to create a single file for the machine learning model. The file was then loaded into a dataframe, and null rows and columns, along with duplicate rows, were removed. Using numpy, the estimated hesitancy values were converted from integers to string labels in a new “hesitancy” column, splitting the data into “low hesitancy,” “moderate hesitancy,” and “high hesitancy.” The final data was saved as a CSV file and used for the machine learning model.
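The labeling step can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the column name `estimated_hesitant_pct` and the toy values are hypothetical, the 15% and 25% cut-offs are the ones quoted in the Summary, and the treatment of values that fall exactly on a cut-off is an assumption.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the joined county file; the column name
# "estimated_hesitant_pct" is hypothetical.
county_data_df = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "estimated_hesitant_pct": [8, 15, 22, 31],
})

# Remove null rows and duplicate rows, as described above.
county_data_df = county_data_df.dropna().drop_duplicates()

# Cut-offs of 15% and 25% come from the Summary section; how boundary
# values are assigned is an assumption made for this sketch.
pct = county_data_df["estimated_hesitant_pct"]
conditions = [pct < 15, pct <= 25]
labels = ["low hesitancy", "moderate hesitancy"]
county_data_df["hesitancy"] = np.select(
    conditions, labels, default="high hesitancy"
)

print(county_data_df)
```

`np.select` checks the conditions in order, so a value is labeled by the first condition it satisfies and anything above 25 falls through to the default.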

Description of preliminary feature engineering and preliminary feature selection, including their decision-making process

Variables were chosen as follows:

Independent variables
X = county_data_df[["percent_white","percent_hispanic", "percent_american_indian_alaska_native", "percent_asian", "percent_black", "percent_hawaiian_pacific", "Poverty", "ChildPoverty", "Drive","Carpool", "Transit", "Walk", "OtherTransp", "WorkAtHome", "PrivateWork", "PublicWork", "SelfEmployed", "FamilyWork", "Unemployment", "percentage20_Donald_Trump", "percentage20_Joe_Biden", "population_scaled"]]

Dependent Variable
y = county_data_df['hesitancy']

Description of how data was split into training and testing sets

The data was split into training and testing sets. The random state parameter was set so that the same sequence of random numbers, and therefore the same split, is produced each time the code runs.
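A minimal sketch of a reproducible split with scikit-learn is shown below. The data is synthetic, and `random_state=1` and the default 75/25 split are assumptions rather than the project's confirmed settings.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two of the feature columns and the target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Poverty": rng.uniform(5, 40, 100),
    "Unemployment": rng.uniform(2, 15, 100),
})
y = pd.Series(
    rng.choice(["low hesitancy", "moderate hesitancy", "high hesitancy"], 100),
    name="hesitancy",
)

# random_state=1 is an assumed value; any fixed integer makes the
# shuffle, and therefore the split, repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Running the split again with the same seed selects the same rows.
X_train_again, X_test_again, _, _ = train_test_split(X, y, random_state=1)
print(X_train.index.equals(X_train_again.index))
```

With no `test_size` given, `train_test_split` holds out 25% of the rows for testing.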

Explanation of model choice, including limitations and benefits

After exploring several approaches, including multiple logistic regression, logistic regression with resampling (naïve random sampling, SMOTE oversampling, and undersampling), and a random forest classifier, the group chose the multiple logistic regression model, as it yielded 77% accuracy, precision, and recall.

Advantages

Well suited to a categorical target variable
Easier to train and interpret
Provides good accuracy

Disadvantages

Can be prone to overfitting if the number of observations is smaller than the number of features
Assumes a linear decision boundary, so it cannot capture non-linear patterns without feature transformations
Not well suited to complex relationships between features
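As a rough illustration of fitting and scoring a multi-class logistic regression with scikit-learn, the sketch below uses synthetic data from `make_classification` in place of the county dataset. The hyperparameters (`max_iter=1000`, `random_state=1`) and the weighted averaging of precision and recall are assumptions, and the scores it prints say nothing about the project's 77% figure.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic three-class problem standing in for low/moderate/high hesitancy.
X, y = make_classification(
    n_samples=600,
    n_features=8,
    n_informative=5,
    n_classes=3,
    random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# max_iter is raised so the solver has room to converge (assumed setting).
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Weighted averaging is an assumption; it weights each class by its support.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average="weighted", zero_division=0)
recall = recall_score(y_test, y_pred, average="weighted")

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

With weighted averaging, recall is mathematically equal to accuracy, so matching accuracy and recall figures are expected under that choice.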

Model Performance

The model performed well when predicting moderate hesitancy, as expected.
Only 1 data point that was truly low hesitancy was predicted as high hesitancy.
Only 2 data points that were truly high hesitancy were predicted as low hesitancy.
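Counts like these are read from a confusion matrix. The sketch below shows how one could be built and queried; the true and predicted labels are toy values invented for illustration, not the project's results.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

labels = ["low hesitancy", "moderate hesitancy", "high hesitancy"]

# Toy labels for illustration only.
y_true = [
    "low hesitancy", "low hesitancy",
    "moderate hesitancy", "moderate hesitancy", "moderate hesitancy",
    "high hesitancy", "high hesitancy",
]
y_pred = [
    "low hesitancy", "high hesitancy",
    "moderate hesitancy", "moderate hesitancy", "moderate hesitancy",
    "low hesitancy", "high hesitancy",
]

# Rows are actual classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
cm_df = pd.DataFrame(
    cm,
    index=[f"actual {name}" for name in labels],
    columns=[f"predicted {name}" for name in labels],
)
print(cm_df)

# Off-diagonal cells hold the misclassifications between the extremes.
low_as_high = cm_df.loc["actual low hesitancy", "predicted high hesitancy"]
high_as_low = cm_df.loc["actual high hesitancy", "predicted low hesitancy"]
```

Passing `labels` explicitly fixes the row and column order, which otherwise follows the sorted label values.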

Results

Poverty is the most important feature, followed by the percentage of votes for Joe Biden in the 2020 election.
The third most important feature is the percentage of African American population in the county.
There is a moderate negative correlation between the percentage of votes for Joe Biden (2020) and the percentage of white population in a county.
There is a weak negative correlation between the percentage of votes for Donald Trump (2020) and the percentages of Asian and African American population in a county.
There is a significant difference at the 95% confidence level for low and moderate hesitancy.
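Pairwise correlations like those listed above can be computed with pandas. The sketch below assumes Pearson correlation, which is the pandas default but is not confirmed by the source, and it uses invented values under two of the real column names.

```python
import pandas as pd

# Invented values; only the column names come from the feature list.
df = pd.DataFrame({
    "percent_white": [90, 80, 70, 60, 50],
    "percentage20_Joe_Biden": [30, 40, 45, 55, 70],
})

# DataFrame.corr() returns the Pearson correlation matrix by default.
corr_matrix = df.corr()
r = corr_matrix.loc["percent_white", "percentage20_Joe_Biden"]

print(corr_matrix)
print(f"r = {r:.2f}")
```

The sign of `r` gives the direction of the relationship, and its magnitude (between 0 and 1) gives the strength.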

Summary

Which demographic factors, such as income and poverty level, employment status, race/ethnicity, and access to transportation are more likely to contribute to vaccine hesitancy?

Poverty (economy)
Percentage of votes for Joe Biden in 2020 election (political views)
Percent of African American population in the county (race)

Can we assume that counties that voted for Donald Trump are more likely to have higher populations of individuals who are vaccine hesitant?

Yes. Our analysis showed that counties that Trump carried in the 2020 Presidential election were more likely to have moderate hesitancy, between 15% and 25% (76% of counties vs. 46% of counties), and less likely to have low hesitancy, below 15% (11% of counties vs. 42% of counties).
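A comparison of this kind can be sketched with a normalized cross-tabulation. The rows below are toy values, and defining a county as "carried" by the candidate with the larger vote share is an assumption made for this example.

```python
import pandas as pd

# Toy county rows; the vote-share column names come from the feature list.
df = pd.DataFrame({
    "percentage20_Donald_Trump": [0.62, 0.55, 0.71, 0.40, 0.35, 0.48],
    "percentage20_Joe_Biden": [0.36, 0.43, 0.27, 0.58, 0.63, 0.50],
    "hesitancy": [
        "moderate hesitancy", "moderate hesitancy", "high hesitancy",
        "low hesitancy", "low hesitancy", "moderate hesitancy",
    ],
})

# Assumption: the county winner is whichever candidate has the larger share.
trump_won = df["percentage20_Donald_Trump"] > df["percentage20_Joe_Biden"]
df["county_winner"] = trump_won.map({True: "Trump", False: "Biden"})

# normalize="index" turns counts into the share of counties per winner.
shares = pd.crosstab(df["county_winner"], df["hesitancy"], normalize="index")
print(shares.round(2))
```

Each row of `shares` sums to 1, so the cells can be read directly as "share of counties carried by this candidate that fall in this hesitancy group".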
