Skip to content

Latest commit

 

History

History
159 lines (135 loc) · 6.55 KB

readme.md

File metadata and controls

159 lines (135 loc) · 6.55 KB

declaratii.integritate.eu scraper

  • declaratii-integritate.py - fetches html rows and csvs (Exporta resultate)
  • consolidate_dl_csvs.py - consolidates dowloaded (Exporta resultate) csvs
  • consolidate-all.py - consolidates Exported csv w parsed html rows

Roadmap

big pages wo pagination? count page turns max 10k go for bucuresti, bucuresti si un fel de declaratie then difference per judet, log errors

  • get data from result table

  • follow pagination

  • loop dates

  • log errors

  • log results & meta/ count results

  • prevent overwrite (update untill found)

  • scrape from today to the latest logged day

  • split into functions

  • update missing values

    • break query in parts - date, tip declaratie, judet
  • download csvs

    • rename to target date
  • consolidate declaratii din tabel

  • consolidate downloaded csvs

    • remove duplicates
  • compare results - between downloaded csvs and scraped table pages / rows

  • build stats

  • download pdfs

  • parse pdfs


Initial approach

open page click advanced

loop (while start_date < earliest_date)
    add dates 
    see if max 10000 - if so previou day log err
    see if none - if so previous range log err

    if resultsTable

    read table (function)
    check if next
        append table
        advance next

    write to csv
    advance days
  1. open declaratii.integritate.eu
  2. click 'Căutare avansată' button (<a id="form:showAdvancedSearch">) found here: <div class="srch_switch"><a href="#" id="form:showAdvancedSearch">Căutare avansată</a><span></span></div>
  3. wait for <div id="form:advanced-search-panel"> ... <div id="form:advanced-search-panel_content"> to load
  4. input current date (ex: 03.10.2023) in this field, also change 'value' parameter accordingly <input id="form:endDate_input" name="form:endDate_input" role="textbox" type="text" value="03.10.2023"> input current date - 3 working days (ex: 29.09.2023) in <input id="form:startDate_input" name="form:startDate_input" role="textbox" size="10" type="text" value="29.09.2023"> - also change 'value' attribute inside tag
  5. click on <input id="form:submitButtonAS"> inside <div class="advanced_srch_subm_right"><input class="button" id="form:submitButtonAS" name="form:submitButtonAS" type="submit" value="caută>"><span></span></div>
Nume Prenume 
Institutie 
Functie 
Localitate 
Judet 
Data completare declaratie 
Tip declaratie 
GRECU P PETRUȚA AURORALiceul Tehnologic Energetic Dragomir Hurmuzescu DevaDirector adjunctDevaHunedoara30.09.2023Declaraţie de avere
DAMIAN D IULIAAdministratia Nationala Apele Romane - Cod Caen 3600, 3900, 4291, 6203Șef birouSectorul 1Bucuresti02.10.2023Declaraţie de avere
TEODORESCU Gh GEORGETAInspectoratul Scolar Al Judetului ArgesInspector scolarPitestiArges30.09.2023Declaraţie de interese
Gorcitz V LucicaDirectia Generala Regionala A Finantelor Publice BrasovInspectorTargu MuresMures02.10.2023Declaraţie de avere

SQL

Consolidate CSVs (downloaded and scraped from table)

SELECT
    COALESCE(d.NumePrenume, t.NumePrenume) AS NumePrenume,
    COALESCE(d.Institutie, t.Institutie) AS Institutie,
    COALESCE(d.Datacompletaredeclaratie, t.Datacompletaredeclaratie) AS Datacompletaredeclaratie,
    COALESCE(d.Tipdeclaratie, t.Tipdeclaratie) AS Tipdeclaratie,
    d.Functie_x,
    d.Localitate_x,
    d.Judet_x,
    d.Functie_y,
    d.Localitate_y,
    d.Judet_y,
    d.Vezideclaratie AS Vezideclaratie_d,
    d."Vezideclaratie.1" AS Vezideclaratie1_d,
    t.Functie AS Functie_t,
    t.Localitate AS Localitate_t,
    t.Judet AS Judet_t,
    t.Vezideclaratie AS Vezideclaratie_t,
    t."Vezideclaratie2" AS Vezideclaratie2_t,
    COALESCE(d.page, t.page) AS page,
    COALESCE(d.rezultate, t.rezultate) AS rezultate,
    COALESCE(d.start_date, t.start_date) AS start_date,
    COALESCE(d.end_date, t.end_date) AS end_date,
    CASE 
        WHEN d.NumePrenume IS NOT NULL AND t.NumePrenume IS NOT NULL THEN 'both'
        WHEN d.NumePrenume IS NOT NULL THEN 'dlcsv'
        ELSE 'tblz'
    END AS merged_status
FROM dlcsv d
LEFT JOIN tblz t ON d.NumePrenume = t.NumePrenume
                AND d.Institutie = t.Institutie
                AND d.Datacompletaredeclaratie = t.Datacompletaredeclaratie
UNION ALL
SELECT
    COALESCE(t.NumePrenume, d.NumePrenume),
    COALESCE(t.Institutie, d.Institutie),
    COALESCE(t.Datacompletaredeclaratie, d.Datacompletaredeclaratie),
    COALESCE(t.Tipdeclaratie, d.Tipdeclaratie),
    d.Functie_x,
    d.Localitate_x,
    d.Judet_x,
    d.Functie_y,
    d.Localitate_y,
    d.Judet_y,
    d.Vezideclaratie,
    d."Vezideclaratie.1",
    t.Functie,
    t.Localitate,
    t.Judet,
    t.Vezideclaratie,
    t."Vezideclaratie2",
    COALESCE(t.page, d.page),
    COALESCE(t.rezultate, d.rezultate),
    COALESCE(t.start_date, d.start_date),
    COALESCE(t.end_date, d.end_date),
    CASE 
        WHEN t.NumePrenume IS NOT NULL AND d.NumePrenume IS NULL THEN 'tblz'
        ELSE 'dlcsv'
    END AS merged_status
FROM tblz t
LEFT JOIN dlcsv d ON t.NumePrenume = d.NumePrenume
                AND t.Institutie = d.Institutie
                AND t.Datacompletaredeclaratie = d.Datacompletaredeclaratie
WHERE d.NumePrenume IS NULL;

stats

CREATE VIEW stats1 AS
    SELECT 
    SUBSTR(Datacompletaredeclaratie, 7, 4) AS year,
    SUBSTR(Datacompletaredeclaratie, 4, 2) AS month,
    Institutie,
    COALESCE(Localitate_x, Localitate_t) AS localitate,
    COALESCE(Judet_x, Judet_y, Judet_t) AS judet,
    COUNT(*) AS count
FROM match_all
WHERE Institutie IS NOT NULL
AND Datacompletaredeclaratie IS NOT NULL
GROUP BY year,
    month,
    Institutie,
    localitate,
    judet
HAVING count > 0;