Covid Infection Analysis

A collaborative project examining the likelihood of Covid-19 infection in the United States.

Segment 1 Deliverable Presentation

Topic: What is the likelihood of being infected by Covid-19? How is infection affected by factors such as vaccination rates, gender, and ethnicity?

Purpose: It is important to analyze future trends following the Covid-19 pandemic to understand the prevalence of infection within the American population.

Data Source: We gathered data from reliable organizations such as Johns Hopkins University and the Centers for Disease Control and Prevention (CDC), which publish their findings as CSV files.

Questions to be answered: Are certain populations more likely to be infected than others? How do these factors affect one another? What other factors should be considered in identifying risks of infection?

Communication Protocols: Our group name is Endless Knot. The members exchange information on Slack and document notes in Google Docs. Group meetings are held virtually on Zoom. We collaborate on our code through GitHub, using our repository (CovidInfectionAnalysis), branches, commits, and pull requests.

Dashboard / Presentation

Our dashboard is available in our Google Slides presentation: https://docs.google.com/presentation/d/18PVo5fo_eoCjJ3DtpC3WBItMMb-tCJtpTyTDTP6P53A/edit#slide=id.g12328c81713_2_9

Machine Learning (ML) Flowchart

Data Wrangling: We checked data quality by sorting and filtering in Python. We cleaned missing data, removed outliers, and dropped several unnecessary columns. We identified null values, removed them with dropna(), and converted strings to numbers.
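The wrangling steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names (`state`, `cases`, `notes`) and the sample rows are hypothetical, and outlier removal is left out for brevity.

```python
import pandas as pd

# Hypothetical sample standing in for one of the project's CSV files.
raw = pd.DataFrame({
    "state": ["CA", "TX", "NY", "FL", None],
    "cases": ["1200", "950", "n/a", "700", "300"],
    "notes": ["", "", "", "", ""],  # stands in for an unnecessary column
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Drop unused columns, convert strings to numbers, and remove nulls."""
    out = df.drop(columns=["notes"])
    # Non-numeric strings such as "n/a" become NaN so dropna() can remove them.
    out["cases"] = pd.to_numeric(out["cases"], errors="coerce")
    out = out.dropna()
    # Sorting makes it easier to inspect the largest values by eye.
    return out.sort_values("cases", ascending=False).reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Using `errors="coerce"` is one way to fold the string-to-number conversion and the null handling into a single pass; rows that cannot be parsed are dropped together with rows that were already missing.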

Data Preprocessing

We worked with three datasets from the following sources:
• Covid Data Tracker (Centers for Disease Control and Prevention, CDC)
• Covid19 cases by State (Johns Hopkins University)
• Genderscilab

Preprocessing

We examined each CSV file using Excel.

Data Processing:

Using Excel's VLOOKUP, we converted each state abbreviation to its full name, identified a common field (primary key) across the datasets, and mapped the relationships in an Entity Relationship Diagram (ERD). We then merged and analyzed the datasets using SQL and pandas in Jupyter Notebook.
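For readers who prefer to do the lookup and merge entirely in pandas, the same idea can be sketched as below. The table and column names (`abbr`, `state`, `cases`, `pct_fully_vaccinated`) and all values are made up for illustration; the repository's CSV files may be laid out differently.

```python
import pandas as pd

# Hypothetical lookup table mapping abbreviations to full state names.
states = pd.DataFrame({
    "abbr": ["CA", "TX", "NY"],
    "state": ["California", "Texas", "New York"],
})

# Hypothetical datasets: one keyed by abbreviation, one by full name.
cases = pd.DataFrame({"abbr": ["CA", "TX", "NY"], "cases": [1200, 950, 800]})
vax = pd.DataFrame({
    "state": ["California", "Texas", "New York"],
    "pct_fully_vaccinated": [70.0, 60.0, 75.0],  # illustrative values only
})

# Step 1: the pandas equivalent of VLOOKUP, adding the full state name.
with_names = cases.merge(states, on="abbr", how="left")

# Step 2: join on the shared key (the full state name).
merged = with_names.merge(vax, on="state", how="inner")
print(merged)
```

A left join in the first step keeps every case row even when an abbreviation is missing from the lookup table, which makes unmatched keys easy to spot as nulls before the second join.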

• Preparing data for machine learning
• Importing libraries and functions: pandas, NumPy, seaborn, matplotlib, sklearn (including train_test_split, r2_score, mean_squared_error, and sklearn.datasets), and statsmodels.tsa.arima.model

Steps to run ML algorithm

• Read dataset
• Activate the ML environment (mlev) in Jupyter Notebook

Data Transformation

• Convert strings to numbers using pd.get_dummies
• Split the data into training and testing sets
• Scale the data using StandardScaler(): X_train_scaled = X_scaler.transform(X_train) and X_test_scaled = X_scaler.transform(X_test)

• We tried different machine learning algorithms. Since our data is labeled, we used supervised learning. We focused on regression models because we are predicting a continuous value.
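The transformation steps above might look like the following in practice. This is a sketch on synthetic data: the feature names (`gender`, `pct_vaccinated`) and the target (`cases`) are assumptions chosen for illustration, not the project's actual columns.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged dataset (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "gender": rng.choice(["Female", "Male"], size=40),
    "pct_vaccinated": rng.uniform(40, 80, size=40),
    "cases": rng.integers(100, 1000, size=40),
})

# Convert string categories to numeric indicator columns.
X = pd.get_dummies(df.drop(columns=["cases"]), dtype=float)
y = df["cases"]

# Split into training and testing sets (default: 75% train, 25% test).
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit the scaler on the training data only, then apply it to both sets.
X_scaler = StandardScaler().fit(X_train)
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
```

Fitting the scaler on the training set alone keeps information from the test set out of the preprocessing step, so the test score remains an honest estimate.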

Supervised Learning and Models

We used several models:

• Ordinary Least Squares (OLS)
• Linear regression
• Support Vector Machine (SVM)
• Autoregressive Integrated Moving Average (ARIMA) for time series

OLS: The OLS model can predict an output value with an acceptable error margin, based on a set of known input parameters.

Linear regression: coefficient of determination (R²): 0.57037
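As a rough illustration of how a coefficient of determination like the one above is computed, the sketch below fits a linear regression with scikit-learn and scores it with r2_score and mean_squared_error. The data is synthetic, so the score it prints is unrelated to the project's reported value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: a noisy linear relationship (not the project's dataset).
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(0, 2.0, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Fit ordinary least squares on the training set.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R² is the share of variance in y explained by the model on held-out data.
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print(f"R²: {r2:.5f}, MSE: {mse:.3f}")
```

An R² near 1 means the model explains most of the variance in the held-out target, while a value near 0 means it does little better than predicting the mean.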

SVM: A Support Vector Machine (SVM) is a supervised model used for classification and regression problems. With a suitable kernel it can solve both linear and non-linear problems, and it works well for many practical problems.
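To make the linear versus non-linear distinction concrete, the sketch below fits scikit-learn's SVR (the regression form of SVM) with two kernels on a synthetic curved target. The data and the kernel choices are assumptions for illustration; the kernel used in the project is not stated here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic non-linear target: one stretch of a sine wave.
X = np.linspace(0, 6, 120).reshape(-1, 1)
y = np.sin(X).ravel()

# Same model family, two kernels. Scaling first is common practice for SVMs.
linear_svr = make_pipeline(StandardScaler(), SVR(kernel="linear")).fit(X, y)
rbf_svr = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X, y)

# score() returns R² on the data passed in (here, the training data).
linear_r2 = linear_svr.score(X, y)
rbf_r2 = rbf_svr.score(X, y)
print(f"linear kernel R²: {linear_r2:.3f}, RBF kernel R²: {rbf_r2:.3f}")
```

On a curved target like this one, the RBF kernel should fit noticeably better than the linear kernel, which is the practical meaning of handling non-linear problems.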

Time Series for Machine Learning Model

ARIMA (Autoregressive Integrated Moving Average) is a class of statistical models for analyzing and forecasting time series data. It generalizes the simpler Autoregressive Moving Average (ARMA) model by adding the notion of integration, that is, differencing the series to make it stationary.

The model summary lists the coefficient values used as well as the skill of the fit on the in-sample observations. The ARIMA model used is ARIMA(5, 1, 0).

Next, we plot the density of the residual errors. The distribution suggests that the errors are Gaussian but may not be centered on zero. A non-zero mean in the residuals indicates a bias in the prediction.

A line plot shows the expected values (blue) compared to the rolling forecast predictions (red). The predictions follow the trend of the expected values and are on the correct scale.

Plotly - Interactive Visualization

Plotly is an interactive graphing library that we used to visualize the Covid-19 factors in this project. We showcased two factors on maps: infections by gender and the total percentage of vaccinations. Two maps show the states with the highest total infections among men and among women. California had the highest rate of infections for both men and women. The maps show that in Texas, men were more likely than women to be infected. Otherwise, in the states with the highest number of cases, both genders were about equally likely to be infected.

Another map visualizes the share of fully vaccinated people in each state, which lets us see which states have the most vaccinations and which have the least. California, Oregon, and Washington have high vaccination percentages, while North Carolina has the lowest.

Tableau - Data Visualization

We created several graphs in Tableau to make the data easier to visualize. The All Cases Bubble chart indicates that California and Texas had the highest rates of Covid cases in January and February 2021. The Deaths by Ethnicity graph shows that New York had the highest mortality rates for the Asian, White, Latinx, and Black populations. Lastly, the All Covid Cases graph breaks down fatalities by ethnic background and includes the total number of deaths.

