Investigate more efficient Friday-night scraping approach #38

Closed
hancush opened this issue Jul 20, 2019 · 4 comments

Comments

@hancush
Member

hancush commented Jul 20, 2019

We run full and windowed scrapes on Friday. However, we prevent multiple scrapes from running at once, so in theory a full scrape could block a windowed scrape and keep recent changes from appearing for quite a while. Let's look into a Friday-night approach that balances efficiency with completeness.
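
For anyone picturing the blocking behavior, here is a minimal sketch of a "one scrape at a time" guard. It assumes an exclusive, non-blocking file lock; the lock path, `scrape_lock`, and `run_scrape` are hypothetical names for illustration, and the actual mechanism this project uses may differ.

```python
import fcntl
from contextlib import contextmanager

LOCK_PATH = "/tmp/lametro-scrape.lock"  # hypothetical path, not from the project


@contextmanager
def scrape_lock(path=LOCK_PATH):
    """Yield True if this caller got the exclusive lock, else False."""
    handle = open(path, "w")
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting.
            fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
            acquired = True
        except BlockingIOError:
            acquired = False
        yield acquired
    finally:
        handle.close()  # closing the file releases any lock held on it


def run_scrape(name, path=LOCK_PATH):
    """Run a (stand-in) scrape only if no other scrape holds the lock."""
    with scrape_lock(path) as acquired:
        if not acquired:
            return f"{name}: skipped, another scrape holds the lock"
        return f"{name}: ran"
```

With a guard like this, a windowed scrape that starts while a long full scrape still holds the lock is skipped rather than queued, which is how recent changes could go missing for hours.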

@hancush
Member Author

hancush commented Apr 11, 2020

On Friday, we run full event and bill scrapes at the top of every hour. That means most of the regular full scrape is redundant. I propose we nix the regular full scrape on Friday and run only a person scrape instead. This should remove the blocker!
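
A rough sketch of the proposal, assuming the regular full scrape covers people, events, and bills (the three scraper keys that show up in the import report). `regular_scrape_targets` is a hypothetical helper, not something in the codebase.

```python
from datetime import date

FRIDAY = 4  # date.weekday() counts Monday as 0, so Friday is 4

# Assumption: these are the targets of the regular full scrape.
FULL_SCRAPE_TARGETS = ["people", "events", "bills"]


def regular_scrape_targets(day):
    """Return what the regular scrape should cover on the given day.

    On Fridays the hourly full event and bill scrapes already cover those
    two targets, so only the person scrape is left to run.
    """
    if day.weekday() == FRIDAY:
        return ["people"]
    return list(FULL_SCRAPE_TARGETS)


print(regular_scrape_targets(date(2020, 4, 10)))  # a Friday
print(regular_scrape_targets(date(2020, 4, 13)))  # a Monday
```

The point of the design is that the long-running bill scrape no longer competes with the hourly scrapes on the one day they matter most.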

@hancush
Member Author

hancush commented Apr 11, 2020

The full bill scrape took almost seven hours last night!!!

lametro (scrape)
  bills: {'window': '0'}
bills scrape:
  duration:  6:45:34.103592
  objects:
    bill: 3083
    vote_event: 1489
jurisdiction scrape:
  duration:  0:00:00.158219
  objects:
    jurisdiction: 1
    organization: 3
    post: 18
04/11/2020 03:16:21 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Planning and Development (Department)"}
04/11/2020 03:16:26 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Operations (Department)"}
04/11/2020 03:16:28 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Program Management (Department)"}
04/11/2020 03:16:28 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Maria Luk"}
04/11/2020 03:16:36 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "-"}
04/11/2020 03:16:41 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Fe Dalida"}
04/11/2020 03:17:43 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Chris Reyes"}
04/11/2020 03:18:59 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "James Butts"}
04/11/2020 03:18:59 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Jacquelyn Dupont-Walker"}
04/11/2020 03:18:59 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Ara Najarian"}
04/11/2020 03:19:19 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "Martha Welborne"}
04/11/2020 03:19:40 ERROR pupa: cannot resolve pseudo id to Organization: ~{"name": "OCEO (Department)"}
lametro (import)
  people: {}
  events: {}
  bills: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  people: {}
  events: {}
  bills: {}
import:
  bill: 0 new 0 updated 3083 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop
  vote_event: 0 new 0 updated 1489 noop

@hancush
Member Author

hancush commented Apr 11, 2020

In other words, the slow full scrape blocked other scrapes for almost the entire support window. 😓
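
To put a number on it, here is a quick sketch that parses the reported duration and counts how many top-of-the-hour starts land while the scrape is still running. It assumes the scrape began exactly on the hour and that a blocked scrape is skipped rather than queued; `parse_duration` and `blocked_hourly_starts` are hypothetical helpers.

```python
from datetime import timedelta


def parse_duration(text):
    """Parse an 'H:MM:SS.ffffff' duration as printed in the scrape report.

    Durations of a day or more ('1 day, H:MM:SS') are not handled here.
    """
    hours, minutes, seconds = text.strip().split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


def blocked_hourly_starts(duration):
    """Count hourly starts that fall while a scrape begun on the hour still runs."""
    one_hour = timedelta(hours=1)
    whole_hours = duration // one_hour
    # A start that coincides exactly with the finish time is not blocked.
    if whole_hours and duration % one_hour == timedelta(0):
        whole_hours -= 1
    return whole_hours


bill_scrape = parse_duration("6:45:34.103592")
print(blocked_hourly_starts(bill_scrape))  # -> 6
```

Under those assumptions, the 6:45:34 bill scrape would have blocked six consecutive hourly scrapes.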

Manually ran a full event scrape to post agendas this morning.

lametro (scrape)
  events: {}
events scrape:
  duration:  0:06:34.156762
  objects:
    event: 391
jurisdiction scrape:
  duration:  0:00:01.411950
  objects:
    jurisdiction: 1
    organization: 3
    post: 18
lametro (import)
  people: {}
  events: {}
  bills: {}
import jurisdictions...
import organizations...
import people...
import posts...
import memberships...
import bills...
import events...
import vote events...
lametro (import)
  people: {}
  events: {}
  bills: {}
import:
  event: 1 new 6 updated 384 noop
  jurisdiction: 0 new 0 updated 1 noop
  organization: 0 new 0 updated 3 noop
  post: 0 new 0 updated 18 noop

@hancush
Member Author

hancush commented Jul 20, 2020

We addressed this.

@hancush hancush closed this as completed Jul 20, 2020