Download data | CSV | JSON | Microsoft Excel | SQLite |
---|---|---|---|---|
Availability of COVID-19 data is crucial for researchers and policy makers to understand the progression of the pandemic and react to it in real time. Here is a recent plea from researchers in India for urgent access to COVID data collected by government agencies. Individual states and cities in India provide detailed information in their daily media bulletins about the current situation of COVID-19 in their respective locations. However, such data (usually in the form of PDF documents) is not readily accessible in structured form.
While there are fantastic crowd-sourced efforts underway to curate such data, manual approaches cannot scale to the volume of data produced over the long term. This project began in anticipation of that outcome and, unfortunately, it has already come to pass.
In this project, we use AI-assisted document and image extraction techniques to automate the extraction of such data in structured (SQL) form from the state-level daily health bulletins, and aim to make this data readily (and freely) available for further research and analysis. The target is to automate the data extraction and curation for each Indian state, so that once the extraction process for a state is complete, we can be on "autopilot" for that state, requiring little to no continued manual curation (other than to respond to changes in schema).
If you are using this data in your research, please remember to cite us. Note that the list of authors will continue to grow over time with our OSS contributors. Please make sure to update the citation text in your future papers accordingly.
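As a rough illustration of what "structured (SQL) form" can mean in practice, the sketch below stores a few parsed bulletin rows in SQLite. The table name `district_cases`, its columns, and the sample rows are hypothetical and are not taken from this repository; the actual schema differs per state.

```python
import sqlite3

# Hypothetical rows, as a bulletin parser might emit them:
# (date, district, confirmed cases). Not real data.
rows = [
    ("2021-05-01", "District A", 120),
    ("2021-05-01", "District B", 45),
    ("2021-05-02", "District A", 131),
]


def store_rows(conn, rows):
    """Insert parsed rows, skipping any (date, district) pair already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS district_cases ("
        "date TEXT, district TEXT, confirmed INTEGER, "
        "PRIMARY KEY (date, district))"
    )
    conn.executemany("INSERT OR IGNORE INTO district_cases VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # an on-disk path would be used in practice
store_rows(conn, rows)
store_rows(conn, rows)  # re-processing the same bulletin adds no duplicates

total = conn.execute("SELECT COUNT(*) FROM district_cases").fetchone()[0]
per_day = conn.execute(
    "SELECT date, SUM(confirmed) FROM district_cases GROUP BY date ORDER BY date"
).fetchall()
```

Keying the table on date and district makes the load step idempotent, which matters if a bulletin is ever re-downloaded and re-parsed.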
@inproceedings{agarwal2021covid,
title={COVID-19 India Dataset: Parsing Detailed COVID-19 Data in Daily Health Bulletins from States in India},
author={Mayank Agarwal and Tathagata Chakraborti and Sachin Grover and Arunima Chaudhary},
booktitle={NeurIPS 2021 Workshop on Machine Learning in Public Health},
year={2021}
}
There are two ways to get started:
The most important part of this codebase is the data extraction pipeline, as described above.
- To set up your environment, follow the instructions here.
- To run the extraction pipeline, refer to instructions here.
- For a detailed walkthrough of using the pipeline end to end on a state, refer to our Wiki.
Secondary, but almost as important, is the landing page that allows users to access the data quickly and in different forms such as time series visualization, data tables, CSVs, APIs, etc. For instructions on how to contribute to the landing page, see here.
The following are a few ways to get going. In general, you can pick up any unassigned issue, or issues tagged with help wanted, from the issue board.
priority
This is the biggest way you can contribute in the beginning stages of the project. "Owning a state" involves:
- Writing the data extraction code for the bulletins of the state. This repository provides the starting code and helper packages to make this as simple as possible. See here for instructions.
- Eventually reacting (or helping others react) to additions or changes in schema for the bulletins being put out by that state. The schemas have remained quite stable so far, but this issue may show up in a few states as the pandemic evolves.
For the project to succeed, this is the most crucial part. Once the data extraction
code for a state is done, the logging of data for that state is automatic and we can
~~sit back and relax~~ scale up to the rest of the country over time.
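One way to notice a schema change early is to compare the columns extracted from a new bulletin against the columns seen so far. The following is a minimal sketch under that assumption; `EXPECTED_COLUMNS` and `schema_diff` are hypothetical names and are not part of this repository's helper packages.

```python
# Hypothetical set of columns a state's bulletin table has contained so far.
EXPECTED_COLUMNS = {"date", "district", "confirmed", "recovered", "deceased"}


def schema_diff(expected, observed):
    """Report columns that appeared or disappeared relative to the expected set."""
    observed = set(observed)
    return {
        "added": sorted(observed - expected),
        "missing": sorted(expected - observed),
    }


# Columns parsed from a (made-up) new bulletin.
diff = schema_diff(
    EXPECTED_COLUMNS,
    ["date", "district", "confirmed", "recovered", "vaccinated"],
)
if diff["added"] or diff["missing"]:
    print("Schema changed:", diff)
```

A non-empty diff would be a signal for the state's owner to review the extraction code before the new data is logged.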
Data at this volume and timeline is bound to suffer from inconsistencies. We will be documenting these as and when we find them on the dedicated Anomalies Page. Help us:
- Remove or otherwise deal with missing data for the plots.
- Identify possible outliers and errors.
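Both tasks can be approached with simple checks on a daily time series. The sketch below finds gaps in the dates and flags outliers using a modified z-score based on the median absolute deviation; this is a generic technique chosen for illustration, not the method used by this project, and the sample series is made up.

```python
from datetime import date, timedelta
from statistics import median


def missing_dates(series):
    """Return the dates between the first and last entry that have no value."""
    days = sorted(series)
    span = (days[-1] - days[0]).days
    expected = (days[0] + timedelta(days=i) for i in range(span + 1))
    return [d for d in expected if d not in series]


def outliers(series, threshold=3.5):
    """Flag dates whose modified z-score (based on the MAD) exceeds the threshold."""
    values = list(series.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:  # all values (nearly) identical: nothing can be flagged
        return []
    return [
        d for d, v in sorted(series.items())
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Made-up daily counts: 4 May is absent and 5 May looks like a data-entry error.
daily = {
    date(2021, 5, 1): 100,
    date(2021, 5, 2): 110,
    date(2021, 5, 3): 105,
    date(2021, 5, 5): 2000,
    date(2021, 5, 6): 108,
    date(2021, 5, 7): 112,
}
gaps = missing_dates(daily)
suspects = outliers(daily)
```

A median-based score is used here because a single extreme value would distort a mean and standard deviation computed over so few points.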
Analyze the data for insights, irregularities, etc. You can publish the results of your analysis in your papers, blogs, etc. (and point to them from our landing page) or add them directly to our landing page, either as a standalone new page or in the existing Analysis Page. For example, you can use the data to validate or extend to India models developed for other countries [1] [2] [3], develop epidemiological models that integrate additional variables [4] [5] [6] [7], or understand various aspects of the pandemic in detail [8] [1] [9].
If you are looking for some concrete tasks to get started, find out more about Challenge Tasks here.
State | Link to Bulletin | Owner | Status |
---|---|---|---|
Andaman and Nicobar AN | Link | Own it! #113 | |
Arunachal Pradesh AR | Link | Own it! #129 | |
Assam AS | Link | Own it! #130 | |
Bihar BR | Link | Own it! #126 | |
Chhattisgarh CT | Link | Own it! #131 | |
Dadra and Nagar Haveli and Daman and Diu DH | Link | Own it! #125 | |
Delhi DL | Link | Mayank | COMPLETE Wiki |
Goa GA | Link | Tathagata, Mayank | COMPLETE Wiki |
Gujarat GJ | Link | Own it! #121 | |
Haryana HR | Link | Mayank | COMPLETE Wiki |
Himachal Pradesh HP | Link | Own it! #132 | |
Jammu and Kashmir JK | Link | Own it! #133 | |
Karnataka KA | Link | Sushovan De, Mayank | IN PROGRESS Wiki |
Kerala KL | Link | Tathagata | IN PROGRESS Wiki |
Ladakh LA | Link | Own it! #114 | |
Madhya Pradesh MP | Link | Tathagata | IN PROGRESS Wiki |
Maharashtra MH | Link | Mayank | COMPLETE Wiki |
Manipur MN | Link, Link | Own it! #116 | |
Meghalaya ML | Link | Own it! #111 | |
Mizoram MZ | Link | Own it! #135 | |
Nagaland NL | Link | Own it! #124 | |
Puducherry PY | Link | Own it! #128 | |
Punjab PB | Link | Sachin | COMPLETE Wiki |
Odisha OR | Link | Own it! #115 | |
Rajasthan RJ | Link | | |
Tamil Nadu TN | Link | Sachin, Tathagata | COMPLETE Wiki |
Telengana TG | Link | Mayank | COMPLETE Wiki |
Uttarakhand UK | Link, Link | Arunima | COMPLETE Wiki |
Uttar Pradesh UP | Link | Own it! #127 | |
West Bengal WB | Link | Mayank | COMPLETE Wiki |
Add new state | | | |
As you might have noticed, this is an incomplete list of Indian states.
Not all states produce this form of data and not all bulletins are accessible.