Working with public.nazk.gov.ua data
iMacros-scripts - iMacros browser automation scripts for crawling the declaration list from the public.nazk.gov.ua website
R-scripts - R scripts for parsing the crawled web content (parse.R) and for formatting and storing the data in html, csv, and url-list formats (format.R)
data - formatted data in html, csv, and url-list formats
- Firefox (or another browser with iMacros support)
- iMacros browser plugin: [Firefox extension](https://addons.mozilla.org/uk/firefox/addon/imacros-for-firefox/)
- R language environment
- In the iMacros browser extension settings, on the Paths tab, set the working dirs: Folder Macros to the iMacros-scripts path, and Folder Downloads to the root dir of the repo
- In the browser, open the URL https://public.nazk.gov.ua/search
- At the bottom of the web page, in the paging control, click ">" (go to the last page), note the total page count (e.g. 3500), and go back to the first page (click "<")
- Open the iMacros browser extension and select the script iMacros-scripts\SavePageNext.iim
- On the iMacros Play tab, set Repeat Macro Max to a value well above the total page count (e.g. 4000) and click "Play (Loop)"
- Enjoy your coffee or tea until iMacros reaches the last page (this may take a while: e.g. 3500 pages take 2-3 hours to save)
- Go to the iMacros working dir and rename all crawled *.csv files to *.txt
- In the repo root, create a new empty dir "scraped-data" and copy the renamed files into it
- Open the R environment
- Set the R session working dir to the repo root dir
- Run the R script R-scripts\parse.R
- After the script finishes, the full declaration list is in the allpersons.csv file in the repo root dir
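
The manual rename-and-copy step and the parsing run could also be scripted from R. The following is only a minimal sketch, not part of the repository: the helper name `stage_crawled_files` and its two arguments are hypothetical, and it copies each crawled file under the new extension instead of renaming it in place. If allpersons.csv from an earlier run sits in the same dir as the crawled files, it would be picked up too, so point `download_dir` at a dir that holds only crawled pages.

```r
# Hypothetical helper (an assumption, not part of the repository).
# Copies every crawled *.csv file from download_dir into
# <repo_root>/scraped-data, giving each copy a .txt extension.
stage_crawled_files <- function(download_dir, repo_root) {
  target_dir <- file.path(repo_root, "scraped-data")
  dir.create(target_dir, showWarnings = FALSE, recursive = TRUE)

  # Only the top level of download_dir is scanned (no recursion).
  csv_files <- list.files(download_dir, pattern = "\\.csv$", full.names = TRUE)
  txt_paths <- file.path(target_dir, sub("\\.csv$", ".txt", basename(csv_files)))

  # file.copy() does not overwrite existing files by default;
  # it returns TRUE for each file that was actually copied.
  copied <- file.copy(csv_files, txt_paths)
  invisible(txt_paths[copied])
}

# Example usage (paths are assumptions; adjust to your setup):
# setwd("path/to/repo")                      # repo root dir
# stage_crawled_files(download_dir = ".", repo_root = ".")
# source(file.path("R-scripts", "parse.R"))  # writes allpersons.csv
```

Because the originals are left untouched, the sketch can be rerun safely; files already present in "scraped-data" are skipped rather than overwritten.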