Handle missing histories #300

Closed
fgregg opened this issue Nov 18, 2019 · 6 comments

@fgregg
Contributor

fgregg commented Nov 18, 2019

Some LA Metro Board Reports have missing histories, but we still need to show their last action in the councilmatic application.

Currently, this is handled by some complicated view-like code in the councilmatic app, but it should be handled, if practicable, in the data layer.

@hancush
Collaborator

hancush commented Dec 17, 2019

Ideal to handle this at the data level. Similar to workaround for minutes in the scraper.

@hancush
Collaborator

hancush commented Dec 19, 2019

To replicate this behavior in the scraper, we need to be able to query the Legistar API for events where a particular matter appears on the agenda (i.e., in an associated event item).

Some reading, specifically this, has led me to believe something like this should work: http://webapi.legistar.com/v1/metro/events/?$filter=EventItems/any(item:%20item/EventItemMatterId%20eq%206276)

Matter 6276 appears on this agenda, so I know there should be at least one result: http://webapi.legistar.com/v1/metro/events/1603/eventitems

The requests are going through ok, but the responses are coming back empty. I emailed Metro to see if they have any insight!
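For reference, here is a minimal sketch of how the filter above can be assembled for any matter ID. The helper name `events_for_matter_url` is hypothetical, and the function only builds the URL; it makes no request. As noted, this query came back empty against the Metro API, so treat it as a record of what was tried rather than a working lookup.

```python
from urllib.parse import quote

# Base URL taken from the links in this comment.
LEGISTAR_BASE = "http://webapi.legistar.com/v1/metro"


def events_for_matter_url(matter_id: int) -> str:
    """Build the OData query for events whose items reference a matter.

    Hypothetical helper: it only assembles the URL string.
    """
    odata_filter = (
        f"EventItems/any(item: item/EventItemMatterId eq {int(matter_id)})"
    )
    # Percent-encode the spaces but leave the OData punctuation readable.
    encoded = quote(odata_filter, safe="/():")
    return f"{LEGISTAR_BASE}/events/?$filter={encoded}"


print(events_for_matter_url(6276))
```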

If we can't do this querying, I don't know if it's practical to do this in the scraper. Will revisit when it's not 5 p.m.

@hancush
Collaborator

hancush commented Dec 19, 2019

Another thing is: How often are bills added to Legistar without a history? If we add an artificial history, do we want to clear it when an actual history is added? Or should it be added to the extras dict? (That's probably a better idea.)

If this isn't practical at the scraper level (i.e., we can't query the Legistar API the way we need to), it might be something we can create during the post save hook for bills, when we'll have full access to the database via the ORM.

@hancush
Collaborator

hancush commented Jan 29, 2020

We're going to proceed on this assuming that changes won't be made to the Legistar API that allow us to query it the way we'd need to, to calculate this value in the scraper.

Apart from perhaps occurring at the wrong level of the code base, the big issue with our current approach is that it runs a heavy query every time a bill's last action date is needed, either in the UI or when updating or rebuilding the search index. Caching this value would lead to faster page load time and indexing operations.

I propose replacing the last_action_date property on the Councilmatic Bill model with a last_action_date attribute and populating the attribute during the post-save signal for OCD bills. This calculation would add some overhead to the first scrape into a bare database. However, since we scrape bills at such a high frequency, there are generally fewer than 10 new or updated bills per scrape, which would keep the ongoing overhead small.

ubuntu@ip-10-0-0-80:~$ grep "bill:" /tmp/lametro.log | grep 'noop$'
  bill: 0 new 0 updated 2920 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 2 noop
  bill: 1 new 0 updated 5 noop
  bill: 0 new 0 updated 6 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 8 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 2 noop
ubuntu@ip-10-0-0-80:~$ grep "bill:" /tmp/lametro.log.1 | grep 'noop$'
  bill: 0 new 5 updated 2908 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 6 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 6 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 6 new 0 updated 6 noop
  bill: 0 new 0 updated 11 noop
  bill: 1 new 0 updated 12 noop
  bill: 0 new 0 updated 14 noop
  bill: 0 new 0 updated 12 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 7 noop
  bill: 0 new 0 updated 5 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 4 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 2 noop
  bill: 0 new 0 updated 1 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 0 updated 3 noop
  bill: 0 new 5 updated 2915 noop
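
To check the "fewer than 10 per scrape" estimate against logs like the ones above, something like the following sketch could tally new plus updated bills per scrape. The function name `changed_counts` is hypothetical, and the sample text is a handful of lines copied from the output above.

```python
import re

# Matches summary lines such as "  bill: 1 new 0 updated 5 noop".
LINE_RE = re.compile(r"bill: (\d+) new (\d+) updated (\d+) noop$")


def changed_counts(log_text):
    """Return new + updated bill counts, one entry per matching log line."""
    counts = []
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match:
            new, updated, _noop = map(int, match.groups())
            counts.append(new + updated)
    return counts


sample = """\
  bill: 0 new 0 updated 2920 noop
  bill: 1 new 0 updated 5 noop
  bill: 6 new 0 updated 6 noop
  bill: 0 new 5 updated 2908 noop
"""

print(changed_counts(sample))  # [0, 1, 6, 5]
```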

On the plus side, re-calculating this attribute every time a bill is saved ensures that bills for which we'd previously spoofed an action date via agendas would be updated appropriately when a history item is added.
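
A rough sketch of the recalculation I have in mind, in plain Python rather than ORM code. The function name and arguments are illustrative only, and using the most recent agenda date as the spoofed value is an assumption for the example.

```python
from datetime import date
from typing import Iterable, Optional


def compute_last_action_date(
    action_dates: Iterable[date],
    agenda_dates: Iterable[date],
) -> Optional[date]:
    """Prefer real history; fall back to agenda appearances.

    Illustrative only: choosing the latest agenda date as the
    fallback is an assumption made for this sketch.
    """
    real = list(action_dates)
    if real:
        return max(real)
    # No history yet, so spoof the date from agendas (or give up).
    return max(agenda_dates, default=None)


# Bill with no history: the agenda date stands in.
print(compute_last_action_date([], [date(2020, 1, 16)]))  # 2020-01-16

# Once a history item exists, recalculating replaces the spoofed value.
print(compute_last_action_date([date(2020, 1, 23)], [date(2020, 1, 16)]))  # 2020-01-23
```

Because the value is recomputed from scratch on every save, there is nothing to clear when a real history item shows up.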

Thoughts, @fgregg?

@hancush
Collaborator

hancush commented Mar 3, 2020

Update: Hm, looks like the signal approach won't work after all. get_last_action_date depends on bill actions being present, but related objects aren't inserted until after the bill is created in pupa's import process. This makes sense, but it's bad news for us: we aren't getting the last action date for new bills, and we might be setting the wrong one for existing bills, because we don't yet know about new actions.

With this in mind, it seems like we need a few things:

  • Access to the ORM
  • All the data in the database
  • Periodic updates

A signals-based approach gets us the first and third things, but not the second. Setting the attribute as bills are accessed, like we do with packets, might also seem attractive, but it doesn't get us periodic updates.

So it's starting to seem like we need to set this outside of the import cycle, e.g., in a management command, or skip caching and calculate it on the fly. A management command could be ok, but since we aren't running scrapes and downstream ETL in concert, there's still the potential for incomplete data.
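
To make the ordering problem concrete, here is a toy in-memory simulation, not pupa or Django code. The `Bill` class, `import_bill`, `refresh`, and the identifier and date are all made up for illustration. The save hook fires before the related actions exist, so the cached value is wrong until a separate pass runs after the import.

```python
from datetime import date


class Bill:
    """Toy stand-in for a bill with a cached last action date."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.action_dates = []
        self.last_action_date = None  # the cached attribute


def refresh(bill):
    """Recompute the cached value from whatever actions exist right now."""
    bill.last_action_date = max(bill.action_dates, default=None)


def import_bill(bill, scraped_action_dates, on_save):
    """Mimic the import order described above: bill first, actions after."""
    on_save(bill)  # hook fires when the bill itself is saved
    bill.action_dates.extend(scraped_action_dates)  # related objects come later


bill = Bill("2019-0001")  # hypothetical identifier
import_bill(bill, [date(2020, 2, 27)], on_save=refresh)
print(bill.last_action_date)  # None

# A pass that runs after the import sees the full data.
refresh(bill)
print(bill.last_action_date)  # 2020-02-27
```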

Any thoughts, @fgregg?

Related to Metro-Records/la-metro-councilmatic#553, Metro-Records/la-metro-councilmatic#555.

@hancush
Collaborator

hancush commented Mar 11, 2020

Addressed this in opencivicdata/pupa#329 and datamade/django-councilmatic#265.

@hancush hancush closed this as completed Sep 1, 2020