# Adding a new state to the data extraction pipeline

*Tathagata Chakraborti edited this page Nov 30, 2021*
The data extraction pipeline executes the following steps sequentially:

1. Downloads the health bulletins for all states,
2. Sets up the database and the tables for all states, and
3. Extracts the information from each health bulletin and inserts it into the corresponding state's tables.

These three steps are skipped for any health bulletin that has already been downloaded and processed. To add a new state to the pipeline, it is best to follow the steps in the same order. Each step is described in detail below.
- Create a new file in the `data_extractor/bulletin_download/states/` folder. Use the ISO 3166-2:IN standard to name your file.
- Inherit from the `Bulletin` class in the `bulletins.py` file. This gives you access to functions used throughout this procedure and lets the pipeline track and save the metadata associated with the state.
- Implement a `run` function in the newly created file, along with any other utility functions you find useful. The main procedure calls the `run` function to execute the script.
- Create a dictionary with dates as keys and the URLs of the corresponding days' bulletins as values, and use the `download_bulletin` function of the `Bulletin` class to automatically download and save the PDFs from these links.
- Call the `_save_state_` function to save the metadata associated with the script.
- References:
  - Delhi (DL): parses the HTML on the Health Department website to create the `{date: url}` dictionary.
  - Telangana (TG): uses a fixed URL format to create the dictionary.
- Finally, add the newly created state file to the `data_extractor/bulletin_download/main.py` file.
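Putting the bullet points above together, a new state module might look like the sketch below. The real `Bulletin` base class lives in `bulletins.py`; the minimal stand-in defined here, the example state (Sikkim, SK), and the URL format are all hypothetical, included only so the sketch is self-contained.

```python
import os
import urllib.request


class Bulletin:
    """Minimal stand-in for the real base class in bulletins.py (assumed API)."""

    def __init__(self, save_dir, state_code):
        self.save_dir = save_dir
        self.state_code = state_code

    def download_bulletin(self, date, url):
        # The real implementation also skips bulletins that were already downloaded.
        path = os.path.join(self.save_dir, f"{self.state_code}-{date}.pdf")
        urllib.request.urlretrieve(url, path)

    def _save_state_(self):
        pass  # the real method persists the per-state metadata


class SikkimBulletin(Bulletin):  # hypothetical state; ISO 3166-2:IN code SK
    # Made-up URL pattern; states like Telangana use a fixed format like this.
    BASE_URL = "https://example.gov.in/bulletins/{date}.pdf"

    def build_url_dict(self, dates):
        # The {date: url} dictionary expected by the pipeline.
        return {d: self.BASE_URL.format(date=d) for d in dates}

    def run(self, dates):
        for date, url in self.build_url_dict(dates).items():
            self.download_bulletin(date, url)
        self._save_state_()  # save the metadata once downloads finish
```

The `main.py` file in `bulletin_download` would then call this class's `run` function along with those of the other states.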
Once you have completed the bulletin download routine, define the table structures that will hold the data for the particular state:

- Create a folder named `<state>_tables` in the `data_extractor/db` folder.
- Create a file for each table in the newly created folder. Each table class should implement a `create_table` and an `insert_row` function. See the structure for Telangana for reference.
- Create a new file in the `data_extractor/db` folder, inheriting from the `Database` class in the `db.py` file. As before, use the ISO 3166-2:IN standard to name your file.
- This new file should initialize an instance variable `self.tables`, a dictionary with table identifiers as keys and table class instances as values. Thereafter, call the `create_tables` function to create these tables in the database. See the Telangana file for reference.
- Finally, add the entry for the newly created state in the `main.py` file.
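A table class satisfying the `create_table`/`insert_row` contract might look like the sketch below, using an SQLite connection for illustration. The table name and column names are hypothetical; the Telangana files show the actual pattern and how the `Database` subclass wires tables together.

```python
import sqlite3


class CaseInfoTable:
    """Sketch of one table class; table and column names are hypothetical."""

    TABLE_NAME = "sk_case_info"

    def __init__(self, conn):
        self.conn = conn

    def create_table(self):
        # Create the table if it does not already exist.
        self.conn.execute(
            f"""CREATE TABLE IF NOT EXISTS {self.TABLE_NAME} (
                date TEXT PRIMARY KEY,
                confirmed INTEGER,
                recovered INTEGER,
                deceased INTEGER
            )"""
        )

    def insert_row(self, row):
        # Upsert one day's record; `row` maps column names to values.
        self.conn.execute(
            f"INSERT OR REPLACE INTO {self.TABLE_NAME} "
            "(date, confirmed, recovered, deceased) VALUES (?, ?, ?, ?)",
            (row["date"], row["confirmed"], row["recovered"], row["deceased"]),
        )
        self.conn.commit()
```

The state's `Database` subclass would then register an instance of each such class under a table identifier, e.g. `self.tables = {"case-info": CaseInfoTable(conn)}`, before calling `create_tables`.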
After the bulletin download and the table structure definition, it is time to define the logic that extracts information from a particular bulletin:

- Create a file for your state in the `data_extractor/local_extractor/states` folder. Use the ISO 3166-2:IN standard to name your file.
- This file should implement a class that is instantiated with the date and the report filepath, and that implements a function called `extract`.
- The `extract` function should return a dictionary whose keys are the table identifiers defined in the previous step and whose values are dictionaries mapping table column names to values. Refer to the file for West Bengal to see how this is implemented.
- The `data_extractor/local_extractor/utils/common_utils.py` file implements utility functions you might find useful, including routines to read tables from PDFs, find tables given keywords, standardize dates, and extract PDF metadata.
- Once you have completed the extraction procedure, add the new state entry in the `data_extractor/local_extractor/main.py` file.
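The extractor contract can be sketched as follows. The class name, table identifier, and column names are hypothetical, and the hard-coded numbers stand in for values that would really be parsed from the PDF; the West Bengal file shows a full implementation.

```python
class SikkimExtractor:  # hypothetical state; real files follow ISO 3166-2:IN naming
    def __init__(self, date, report_fpath):
        self.date = date
        self.report_fpath = report_fpath

    def extract(self):
        # In the real pipeline these values would be parsed from the PDF at
        # self.report_fpath, e.g. with the table-reading and date-standardizing
        # helpers in common_utils.py. Hard-coded values stand in for parsed results.
        case_info = {"date": self.date, "confirmed": 5, "recovered": 3, "deceased": 0}
        # Keys are the table identifiers from the database step; each value maps
        # that table's column names to the extracted values.
        return {"case-info": case_info}
```

The pipeline can then feed each inner dictionary straight to the matching table's `insert_row` function.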
After completing the three steps above, tie all the procedures together by adding the state in the `data_extractor/run.py` file. This tells the run script to execute the procedures for the new state as well. You can check that the entire pipeline works by executing the run script as described here.
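Conceptually, the run script dispatches the three stages per registered state, which can be imagined as a simple registry like the sketch below. The names are illustrative, not the repository's actual code; here logging stands in for the real download, database setup, and extraction calls.

```python
LOG = []  # records the stages as they execute, in place of real side effects

# Hypothetical registry mapping ISO 3166-2:IN codes to the three per-state stages.
PIPELINES = {
    "SK": {
        "download": lambda: LOG.append("download"),
        "setup_db": lambda: LOG.append("setup_db"),
        "extract": lambda: LOG.append("extract"),
    },
}


def run(state_code):
    """Execute the three pipeline stages for one state, in order."""
    stages = PIPELINES[state_code]
    stages["download"]()   # 1. download the health bulletins
    stages["setup_db"]()   # 2. set up the database and tables
    stages["extract"]()    # 3. extract the data and insert it into the tables
```

Adding a state to `run.py` amounts to registering its three entry points so the run script picks them up.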