A data-based look at how Covid-19 affects specific groups in the USA.
The Jupyter notebook is based on Python 3.7 and relies mainly on pandas, numpy and datetime for data preparation. For plotting and visualization I mainly used Seaborn, and plotly for choropleth maps. To export the pandas style tables, I installed imgkit. As I have also built a simple multiple linear regression model, the notebook needs scikit-learn as well.
There are a few points to mention here:
- As part of my Data Scientist program at Udacity, I had to write a blog post on Medium about a topic of my choice. At the same time, a colleague showed me the "Uncover Covid-19 Challenge" on Kaggle. That is how I found all those datasets about Covid-19 on Kaggle.
- Having already read a lot about the implications of Covid-19, especially for poorer or, generally speaking, socially more vulnerable groups, I was immediately intrigued by the CDC Social Vulnerability Index.
Therefore, I decided to focus my investigation on the impact of Covid-19 in the USA, and especially in the US counties. My main questions were:
- Which US counties are most affected by Covid-19 regarding infections and deaths?
- Is there a correlation between specific social vulnerability indicators and Covid-19 cases as well as deaths?
- Is it possible to build a simple Linear Regression Model that predicts Covid-19 cases and deaths based on specific social vulnerability indicators?
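The second question above can be illustrated with a minimal sketch of a correlation check in pandas. It does not reproduce the notebook: the data is synthetic, and the column names (`EP_POV`, `EP_AGE65`, `EP_NOVEH`, `cases_per_100k`) are assumptions modeled on the SVI naming style.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a county-level table (NOT the real data).
# Column names are assumptions for illustration only.
rng = np.random.default_rng(0)
n = 200
counties = pd.DataFrame({
    "EP_POV": rng.uniform(5, 40, n),    # percentage below poverty line
    "EP_AGE65": rng.uniform(8, 30, n),  # percentage aged 65 or older
    "EP_NOVEH": rng.uniform(1, 25, n),  # percentage of households without a vehicle
})
# Hypothetical target that depends on one indicator plus noise.
counties["cases_per_100k"] = 3.0 * counties["EP_NOVEH"] + rng.normal(0, 10, n)

# Pearson correlation of every indicator with the target, strongest first.
correlations = (
    counties.corr(method="pearson")["cases_per_100k"]
    .drop("cases_per_100k")
    .sort_values(ascending=False)
)
print(correlations)
```

With real data, a correlation like this only indicates a linear association between an indicator and case counts; it says nothing about causation.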
The CDC writes in the SVI documentation:
The degree to which a community exhibits certain social conditions, including high poverty, low percentage of vehicle access, or crowded households, may affect that community’s ability to prevent human suffering and financial loss in the event of disaster. These factors describe a community’s social vulnerability.
ATSDR’s Geospatial Research, Analysis & Services Program (GRASP) created Centers for Disease Control and Prevention Social Vulnerability Index (CDC SVI or simply SVI, hereafter) to help public health officials and emergency response planners identify and map the communities that will most likely need support before, during, and after a hazardous event.
Examples of social vulnerability indicators are Poverty, Age Over 65, Minority, Speaks English "less than well", No High School Diploma, Single-Parent Households, Mobile Homes, No Vehicle etc. Furthermore, there is an overall ranking indicator that takes all indicators into account.
The documentation can be found here: https://svi.cdc.gov/data-and-tools-download.html
I have written a blog post about this project on Medium.
- Covid-19 SVI: Jupyter notebook that contains all code and results.
- confirmed-covid-19-cases-in-us-by-state-and-county: A csv file with all confirmed Covid-19 cases in US states and counties as of April 8th. This file is part of Kaggle's "Uncover Covid-19 Challenge".
- confirmed-covid-19-deaths-in-us-by-state-and-county: A csv file with all confirmed deaths caused by Covid-19 in US states and counties as of April 8th. This file is part of Kaggle's "Uncover Covid-19 Challenge".
- SVI2018_US: A csv file with the newest social vulnerability data for US states and counties as of 2018. I downloaded this file from the CDC's website: https://svi.cdc.gov/
- SVI2018Documentation: A pdf that explains the Social Vulnerability Index in general and the different sub-indicators in the csv. The pdf can be found here: https://svi.cdc.gov/data-and-tools-download.html
- Media for Medium post: I wrote a small blog post on Medium and had to generate a few pictures from the plots as well. These can be found in this folder.
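To relate the case and death files to the SVI file, the tables have to be joined per county. The following is only a sketch of one way to do that, not necessarily how the notebook does it. It assumes that each file carries a county FIPS code; the column names `county_fips`, `confirmed`, `FIPS` and `EP_POV` and all values are illustrative assumptions.

```python
import pandas as pd

# Tiny in-memory stand-ins for the csv files (made-up values).
# When reading the real files, passing dtype=str for the FIPS column
# keeps leading zeros such as "01001" intact.
cases = pd.DataFrame({
    "county_fips": ["01001", "01003", "06037"],
    "confirmed": [12, 29, 7530],
})
svi = pd.DataFrame({
    "FIPS": ["01001", "01003", "36061"],
    "EP_POV": [15.4, 10.6, 16.2],
})

# An inner join keeps only counties present in both tables.
# validate= raises an error if a FIPS code appears more than once on either side.
merged = cases.merge(
    svi,
    left_on="county_fips",
    right_on="FIPS",
    how="inner",
    validate="one_to_one",
)
print(merged)
```

Counties that appear in only one of the files drop out of an inner join, so it is worth comparing the row counts before and after merging.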
Every contribution is welcome.
There is always the possibility to look deeper into the provided data. In my research, I included only 14 indicators, all of them percentage values, and already found some interesting correlations. I also plotted some maps of the US counties (using plotly); however, I didn't include them in the notebook as they were not necessary for my results.
Further investigation could, in my opinion, focus on more social vulnerability aspects as well as on the absolute values. Also, a very important indicator is missing here: gender. As far as I know, Covid-19 affects all genders; however, in socially vulnerable contexts gender could make a difference.
Finally, my multiple linear regression models did not show very high scores. Maybe there is a way to improve the prediction, e.g. by using other or more features or another model.
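As a starting point for such experiments, here is a minimal sketch of fitting and scoring a multiple linear regression with scikit-learn. The data is synthetic and the feature names are assumptions, so the resulting score says nothing about the real datasets; the point is only the pattern of a held-out test set and an R² score that alternative features or models could be compared against.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic county-level features (NOT the real SVI data).
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "EP_POV": rng.uniform(5, 40, n),
    "EP_NOVEH": rng.uniform(1, 25, n),
    "EP_CROWD": rng.uniform(0, 15, n),
})
# Hypothetical target: depends on two of the three features plus noise.
y = 2.0 * X["EP_NOVEH"] + 1.5 * X["EP_CROWD"] + rng.normal(0, 5, n)

# Hold out 30% of the rows so the score reflects unseen counties.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
test_r2 = r2_score(y_test, model.predict(X_test))

# Coefficients labeled by feature name for easier inspection.
coefficients = pd.Series(model.coef_, index=X.columns)
print(f"R^2 on test set: {test_r2:.2f}")
print(coefficients)
```

Swapping `LinearRegression` for another scikit-learn regressor, or changing the columns in `X`, leaves the rest of the evaluation unchanged, which makes the scores directly comparable.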
- Thanks to Kaggle and the Roche Data Science Coalition for providing the datasets and thus supporting the fight against Covid-19. The challenge can be found here: https://www.kaggle.com/roche-data-science-coalition/uncover
- Thanks to the Agency for Toxic Substances and Disease Registry (ATSDR) and the Centers for Disease Control and Prevention (CDC) for providing data and documentation about social vulnerability in the USA. All information can be found here: https://svi.cdc.gov/index.html
- Thanks to Udacity, Codecademy and Stack Overflow for always providing answers to my questions.
Maximilian Müller, Business Development Manager in the Renewable Energy sector. Now diving into the field of data analysis.
Link to GitHub repository: https://github.com/muellermax/Covid-19-social-vulnerability