
Adding a new state to the data extraction pipeline


The data extraction pipeline executes the following steps sequentially:

  1. Downloads all the health bulletins for all states
  2. Sets up the database and the tables for all the states, and
  3. Extracts the information from all the health bulletins for each state and inserts it into the tables

These three steps are executed only for health bulletins that have not already been downloaded and processed.

To add a new state to the data extraction pipeline, it is best to follow the steps in the same order. Each step is described in detail below.

Add the bulletin download routine for the state

  • Create a new file in the data_extractor/bulletin_download/states/ folder. Use the ISO 3166-2:IN standard to name your file.
  • Inherit from the Bulletin class in the bulletins.py file. This will allow you to use functions commonly used across this procedure and will also allow the pipeline to track and save the metadata associated with the state.
  • Implement a run function in the newly created file along with any other utility function you might find useful. The main procedure will call the run function to execute the script.
  • Create a dictionary with each date as the key and the URL of the corresponding day's bulletin as the value, then use the download_bulletin function in the Bulletin class to automatically download and save the PDFs from these links.
  • Call the save_state function to save the metadata associated with the script. A minimal sketch of such a download routine is given after this list.
  • References:
    • Delhi (DL) : Parses the HTML on the Health Department website to create the {date: url} dictionary.
    • Telangana (TG) : Uses a set URL format to create the dictionary.
  • Finally, add the newly created state file to the data_extractor/bulletin_download/main.py file.
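
The following is a minimal sketch of such a download routine for a hypothetical state file (Kerala, kl.py). The import path, the exact Bulletin method signatures, and the bulletin URL format are assumptions for illustration only; check bulletins.py and the Delhi/Telangana reference files for the actual interface.

```python
# data_extractor/bulletin_download/states/kl.py -- hypothetical example (Kerala, KL)
import datetime

from bulletin_download.bulletins import Bulletin  # assumed import path


class KeralaBulletin(Bulletin):

    def run(self):
        # Build the {date: url} dictionary. This sketch assumes a fixed URL
        # pattern (the Telangana-style approach); the URL below is made up.
        bulletin_links = {}
        date = datetime.date(2021, 1, 1)
        while date <= datetime.date.today():
            datestr = date.strftime("%Y-%m-%d")
            bulletin_links[datestr] = f"https://example.gov.in/bulletins/kl-{datestr}.pdf"
            date += datetime.timedelta(days=1)

        # Download and save the PDFs for the {date: url} entries, then persist
        # the metadata for this script (call signatures are assumed).
        self.download_bulletin(bulletin_links)
        self.save_state()
```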

Define the table structure for the state

  • Once you have completed the bulletin download routine, start by defining the table structures that will hold the data for the particular state.
  • Create a folder with the name <state>_tables in the data_extractor/db folder.
  • Create files for each table in the newly created folder. The particular table class should implement a create_table and insert_row function. See the structure for Telangana for reference.
  • Create a new file in the data_extractor/db folder, inheriting from the Database class in the db.py file. As before, use the ISO 3166-2:IN standard to name your file.
  • This new file should initialize an instance variable self.tables, a dictionary with table identifiers as keys and table class instances as values. Thereafter, call the create_tables function to create these tables in the database. A minimal sketch of the table and database files is given after this list.
  • See the Telangana file for reference.
  • Finally, add the entry for the newly created state in the main.py file.
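
To show how the pieces fit together, the hypothetical Kerala example might define one table file and one database file roughly as follows. The column names, the SQLite-style connection handling, and the constructor signatures are assumptions; mirror the Telangana files for the real structure.

```python
# data_extractor/db/kl_tables/case_info.py -- hypothetical table for the Kerala example
class CaseInfoTable:

    def __init__(self, conn):
        # conn is assumed to be a sqlite3 connection passed in by the database class.
        self.conn = conn

    def create_table(self):
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS KL_case_info (
                   date TEXT PRIMARY KEY,
                   confirmed INTEGER,
                   recovered INTEGER,
                   deceased INTEGER
               )"""
        )

    def insert_row(self, row):
        # row is a {column name: value} dictionary produced by the extractor.
        self.conn.execute(
            "INSERT OR IGNORE INTO KL_case_info VALUES (:date, :confirmed, :recovered, :deceased)",
            row,
        )


# data_extractor/db/kl.py -- hypothetical database file for the Kerala example
from db.db import Database                      # assumed import path
from db.kl_tables.case_info import CaseInfoTable


class KeralaDB(Database):

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Table identifier -> table class instance; these identifiers must match
        # the keys returned by the extract function in Step 3.
        # self.conn is assumed to be set up by the Database base class.
        self.tables = {
            "case_info": CaseInfoTable(self.conn),
        }
        self.create_tables()
```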

Write the data extraction logic for the state

  • With the bulletin download and the table structure definition in place, it's time to define the logic to extract information from a particular bulletin.
  • Create a file for your state within the data_extractor/local_extractor/states folder. Use the ISO 3166-2:IN standard to name your file.
  • This file should implement a class that is instantiated with the date and the report filepath, and should implement a function called extract.
  • The extract function should return a dictionary whose keys are the table identifiers defined in Step 2 above and whose values are dictionaries mapping table column names to the extracted values (see the sketch after this list).
  • Refer to the file for West Bengal to see how it's implemented.
  • There are certain utility functions implemented in the data_extractor/local_extractor/utils/common_utils.py file that you might find useful. These include routines to read tables from PDFs, find tables given keywords, standardize dates, and extract PDF metadata.
  • Once you've completed the extraction procedure, add this new state entry in the data_extractor/local_extractor/main.py file.
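
For the extraction step, a minimal sketch of the hypothetical Kerala extractor could look like the following. The constructor arguments and the return format follow the description above, while the column names and placeholder values are assumptions; a real extractor would read the numbers from the PDF, for example with the common_utils helpers, rather than hard-coding them.

```python
# data_extractor/local_extractor/states/kl.py -- hypothetical extractor for the Kerala example
class KeralaExtractor:

    def __init__(self, date, report_fpath):
        self.date = date
        self.report_fpath = report_fpath

    def extract(self):
        # A real extractor would parse self.report_fpath here, e.g. with the
        # table-reading helpers in common_utils; constants are used below only
        # to illustrate the expected return structure.
        case_info = {
            "date": self.date,
            "confirmed": 0,
            "recovered": 0,
            "deceased": 0,
        }

        # Keys are the table identifiers defined in Step 2; values are
        # {column name: value} dictionaries consumed by insert_row.
        return {"case_info": case_info}
```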

Tie it all together

  • After completing steps 1, 2, and 3 above, tie all the procedures together by adding the state in the data_extractor/run.py file. This will inform the run script to execute the procedures for the particular state as well.
  • You can check that the entire pipeline works correctly by executing the run script as described here.