We scraped the 2020 Bihar Electoral Rolls from http://ele.bihar.gov.in/pdfsearch/ (Publication Date: 07-02-2020). In all, there were 72,723 primary rolls from 243 constituencies.
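Each file name has the following format: FinalRoll_ACNo_<AC NO 1~243>PartNo_<PART NO>.pdf, where AC NO is the assembly constituency number (1 to 243) and PART NO is the part number within that constituency.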
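As an illustration of the naming scheme, here is a minimal Python sketch that builds and parses such file names. The helper names `parse_roll_name` and `build_roll_name` are hypothetical and not part of the original scripts. The sketch also assumes that the numbers are not zero-padded, which the format above does not state; the parser accepts any run of digits, so padded names would still parse.

```python
import re

# Pattern derived from the stated format:
#   FinalRoll_ACNo_<AC NO 1~243>PartNo_<PART NO>.pdf
# Assumption: both numbers are plain runs of digits.
ROLL_NAME = re.compile(r"^FinalRoll_ACNo_(\d+)PartNo_(\d+)\.pdf$")


def parse_roll_name(name):
    """Return (ac_no, part_no) as ints, or None if the name does not match."""
    match = ROLL_NAME.match(name)
    if match is None:
        return None
    ac_no, part_no = int(match.group(1)), int(match.group(2))
    # Constituency numbers run from 1 to 243.
    if not 1 <= ac_no <= 243:
        return None
    return ac_no, part_no


def build_roll_name(ac_no, part_no):
    """Build a file name for a constituency and part (no zero-padding assumed)."""
    return f"FinalRoll_ACNo_{ac_no}PartNo_{part_no}.pdf"
```

Parsing names this way makes it easy to group files by constituency or to spot names that do not fit the expected pattern.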
- We used the script to download the files and upload them to Google Cloud Storage (gs://in-electoral-rolls-2020/bihar).
- A few files failed to download on the first try. The script for downloading those is here.
- Notebook to check if we downloaded all the files
- Notebook to check file size and produce metadata CSV for files
- Notebook that gets the metadata from the webpage (including names etc.) and appends it to the CSV obtained in step 3
- list.txt --- files that downloaded on the first attempt.
- list2.txt --- all files downloaded after the second attempt.
- list3.txt --- all files with their file sizes.
- Metadata CSV for files, along with file sizes
- Metadata CSV with data from the webpage
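The completeness check mentioned above can be sketched in Python as follows. This is not the notebook's actual code. The function name `missing_files` is hypothetical, and the sketch assumes that each line of a listing such as list2.txt ends with a file path or gs:// URL (with any size or date columns coming before it); the real layout of those files may differ.

```python
def missing_files(expected_names, listing_lines):
    """Return the expected file names that do not appear in a listing, sorted.

    Assumption: the last whitespace-separated token on each non-empty line
    is a path or gs:// URL whose final component is the file name.
    """
    seen = set()
    for line in listing_lines:
        line = line.strip()
        if not line:
            continue
        last_token = line.split()[-1]
        # Keep only the final path component, e.g. the PDF file name.
        seen.add(last_token.rsplit("/", 1)[-1])
    return sorted(set(expected_names) - seen)
```

With a list of expected names taken from the website, calling `missing_files(expected, open("list2.txt"))` would return the files that still need to be fetched.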
We have instituted the same process as here.
Given privacy concerns, we are releasing the data only for research purposes. To access the PDFs, you must agree to take all precautions to maintain the privacy of Indian electors. (There is a difference between data being available in PDFs, split across different sites, sometimes behind CAPTCHA, and a common data dump.) You will get read access to the Google Coldline storage bucket for a month. The buckets are set up as requester pays, so you need to create a project that will be used for billing. You can access them as follows:
gsutil -u projectname_for_billing ls gs://in-electoral-rolls-2020/bihar
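For scripted access, the same command can be assembled in Python and passed to `subprocess`. This is a minimal sketch, not part of the original tooling: the helper names are hypothetical, and the copy helper assumes that files sit directly under the bihar prefix with the file names described earlier. It relies only on the `gsutil -u <project>` option shown above and the standard `ls` and `cp` subcommands.

```python
import subprocess

BUCKET_PREFIX = "gs://in-electoral-rolls-2020/bihar"


def gsutil_command(billing_project, *args):
    """Build a gsutil argument list that bills the given project (requester pays)."""
    return ["gsutil", "-u", billing_project, *args]


def list_command(billing_project):
    """Argument list equivalent to the ls command shown above."""
    return gsutil_command(billing_project, "ls", BUCKET_PREFIX)


def copy_command(billing_project, file_name, dest_dir="."):
    """Argument list for copying one file; assumes it sits directly under the prefix."""
    return gsutil_command(
        billing_project, "cp", f"{BUCKET_PREFIX}/{file_name}", dest_dir
    )


def run(command):
    """Run a gsutil command; requires gsutil to be installed and authenticated."""
    subprocess.run(command, check=True)
```

For example, `run(list_command("projectname_for_billing"))` lists the bucket contents, with requests billed to your own project.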