Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adjust models and views for LA Metro Councilmatic upgrade #260

Merged
merged 38 commits into from
Mar 31, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
38 commits
Select commit Hold shift + click to select a range
b1799be
adjust models and views for la metro councilmatic upgrade
fgregg Nov 6, 2019
9ebd122
fixes for la_metro upgrade
fgregg Nov 14, 2019
85e518a
complicated document model we didn't use
fgregg Dec 5, 2019
3c6b164
Revert "complicated document model we didn't use"
hancush Dec 26, 2019
6ddf2a0
Add back BillDocument – need to preserve full_text for search indexes
hancush Dec 26, 2019
4426220
Revert "Add back BillDocument – need to preserve full_text for search…
fgregg Jan 9, 2020
f82b8aa
update minimum req for ocd
fgregg Jan 9, 2020
b859eae
adjustments for tests
fgregg Jan 9, 2020
1e0c9e2
convert attachment to extras fulltext on billdoc
fgregg Jan 9, 2020
169476c
perfect updated bill docs
fgregg Jan 9, 2020
2c704a2
Handle exception due to missing encoding
hancush Jan 15, 2020
82fbebb
Merge branch 'la_metro_accomodations' of https://github.com/datamade/…
hancush Jan 15, 2020
e6e4d59
Handle yet another encoding issue
hancush Jan 15, 2020
ea5cfb1
Remove BillDocument subclass, including migration
hancush Jan 17, 2020
905bc10
Fix reference to primary sponsor
hancush Jan 17, 2020
ad40906
Continue removal of BillDocument
hancush Jan 17, 2020
d6fd581
Install opencivicdata from commit for tests
hancush Jan 17, 2020
8986373
flake it up, baby
hancush Jan 17, 2020
313c193
Default to trying to convert all bill documents without full text
hancush Jan 18, 2020
6d56777
Add last_action_date
hancush Jan 29, 2020
289d629
Set Bill.last_action_date on OCD Bill and Event save
hancush Jan 30, 2020
e7f1cbd
Merge pull request #264 from datamade/feature/hec/cache-last-action-date
hancush Jan 30, 2020
834b7d8
Use model alias
hancush Jan 30, 2020
d5fc001
Explicitly install library to extract text from Powerpoints
hancush Feb 11, 2020
6ec2e61
Take a whack at fixing convert_attachment_text
hancush Feb 12, 2020
08170a9
Bump PostGIS
hancush Feb 12, 2020
97178af
Add back detailed exception handling to convert_attachment_text
hancush Feb 12, 2020
40e1e28
Bump both Postgres and Postgis
hancush Feb 12, 2020
2809dd7
Log non-fatal exceptions as warnings
hancush Feb 12, 2020
3e473ab
Try this version thing again
hancush Feb 12, 2020
315e4e1
Bump Postgres one more minor version
hancush Feb 12, 2020
6ac158c
Don't reference a variable that doesn't exist
hancush Feb 21, 2020
fdce528
Only save the councilmatic bill when updating last_action_date
hancush Mar 3, 2020
9ec80db
Merge pull request #265 from datamade/patch/hec/fix-signal
hancush Mar 11, 2020
0a03a62
Omit event related entities that are not associated with a bill in po…
hancush Mar 24, 2020
73b2960
Remove manual installation of OCD
hancush Mar 31, 2020
6a8972c
Pin opencivicdata to >= 3.1 (release that adds extras to BillDocument)
hancush Mar 31, 2020
709be4a
Remove outdated information on team
hancush Mar 31, 2020
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 4 additions & 4 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -12,10 +12,10 @@ env:
- db_user=postgres

addons:
postgresql: '9.4'
postgresql: '9.6'
apt:
packages:
- postgresql-9.4-postgis-2.3
- postgresql-9.6-postgis-2.3

before_script:
- psql -U postgres -c "create extension postgis"
Expand All @@ -25,8 +25,8 @@ install:
- pip install -r tests/requirements.txt --upgrade
- pip install -e .

sudo: required
dist: trusty
sudo: required
dist: trusty
group: deprecated-2017Q4

script:
Expand Down
9 changes: 0 additions & 9 deletions README.rst
Original file line number Diff line number Diff line change
Expand Up @@ -88,15 +88,6 @@ Run the management command to update the test fixture.
Run the tests and commit your updated fixture with your PR!


Team
----

- Forest Gregg, DataMade - Open Civic Data (OCD) and Legistar scraping
- Cathy Deng, DataMade - data models and loading
- Derek Eder, DataMade - front end
- Eric van Zanten, DataMade - search and dev ops


Patches and Contributions
-------------------------
We continue to improve django-councilmatic, and we welcome your ideas! You can make suggestions in the form of `github issues <https://github.com/datamade/django-councilmatic/issues>`_ (bug reports, feature requests, general questions), or you can submit a code contribution via a pull request.
Expand Down
6 changes: 6 additions & 0 deletions councilmatic_core/haystack_indexes.py
Original file line number Diff line number Diff line change
Expand Up @@ -65,3 +65,9 @@ def prepare_ocr_full_text(self, obj):

def get_updated_field(self):
return 'updated_at'

def prepare_last_action_date(self, obj):
# Solr seems to be fussy about the time format, and we do not need the time, just the date stamp.
# https://lucene.apache.org/solr/guide/7_5/working-with-dates.html#date-formatting
if obj.last_action_date:
return obj.last_action_date.date()
92 changes: 55 additions & 37 deletions councilmatic_core/management/commands/convert_attachment_text.py
Original file line number Diff line number Diff line change
@@ -1,27 +1,24 @@
import os
import itertools
import logging
import logging.config
import sqlalchemy as sa
import os
import requests
import tempfile
import itertools
import tqdm

from django.core.management.base import BaseCommand
from django.conf import settings
from django.core.management.base import BaseCommand
from django.db import connection
from django.db.models import Max, Q

from opencivicdata.legislative.models import BillDocumentLink
from councilmatic_core.models import BillDocument
from opencivicdata.legislative.models import BillDocumentLink, BillDocument


# Configure logging
logging.config.dictConfig(settings.LOGGING)
logging.getLogger("requests").setLevel(logging.WARNING)
logger = logging.getLogger(__name__)

DB_CONN = 'postgresql://{USER}:{PASSWORD}@{HOST}:{PORT}/{NAME}'

engine = sa.create_engine(DB_CONN.format(**settings.DATABASES['default']),
convert_unicode=True,
server_side_cursors=True)

class Command(BaseCommand):
help = 'Converts bill attachments into plain text'
Expand All @@ -38,17 +35,23 @@ def handle(self, *args, **options):
self.add_plain_text()

def get_document_url(self):
# Only apply this query to most recently updated (or created) bill documents.
'''
By default, convert text for recently updated files, or files that
do not have attachment text. Otherwise, convert text for all files.
'''
max_updated = BillDocument.objects.all().aggregate(max_updated_at=Max('bill__updated_at'))['max_updated_at']
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For next week: Running this command without the all documents flag results in nothing being converted, because we query for documents that were updated after the max, which isn't possible. Also, since we're looking at the update timestamp on the bill, I'm not sure it's true that all bills from the same import have the same update timestamp. We may be better off doing something like converting everything updated in the last hour or two by default.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way the scraper importer will work is if a bill document is updated it, will delete and recreate the bill document.

So we don't need to check timestamps, we can just check to see if full_text is not set on extras and it should be.


is_null = Q(document__councilmatic_document__full_text__isnull=True)
is_file = Q(url__iendswith='pdf') | Q(url__iendswith='docx') | Q(url__iendswith='docx')
after_max_update = Q(document__bill__updated_at__gt=max_updated)
is_null = Q(document__extras__full_text__isnull=True)
after_max_update = Q(document__bill__updated_at__gte=max_updated)

if max_updated is None or self.update_all:
qs = BillDocumentLink.objects.filter(is_null & is_file)
qs = BillDocumentLink.objects.filter(is_file)
else:
qs = BillDocumentLink.objects.filter(is_null & is_file & after_max_update)
# Always try to convert null files, because files may have failed
# in a reparable manner, e.g., Legistar server errors, during a
# previous conversion.
qs = BillDocumentLink.objects.filter(is_file & (after_max_update | is_null))

for item in qs:
yield item.url, item.document.id
Expand All @@ -58,15 +61,20 @@ def convert_document_to_plaintext(self):
# installing it, import the library here.
import textract

for document_data in self.get_document_url():
document_data = dict(document_data)
url = document_data['url']
document_id = document_data['id']
response = requests.get(url)
# Sometimes, Metro Legistar has a URL that retuns a bad status code (e.g., 404 from http://metro.legistar1.com/metro/attachments/95d5007e-720b-4cdd-9494-c800392b9265.pdf).
for url, document_id in tqdm.tqdm(self.get_document_url()):
try:
response = requests.get(url)
except (requests.exceptions.Timeout, requests.exceptions.ConnectionError):
# Don't fail due to server errors, as these tend to resolve themselves.
# https://requests.readthedocs.io/en/master/user/quickstart/#errors-and-exceptions
logger.warning('Document URL {} raised a server error - Could not get attachment text!'.format(url))
continue

# Sometimes, Metro Legistar has a URL that retuns a bad status code,
# e.g., 404 from http://metro.legistar1.com/metro/attachments/95d5007e-720b-4cdd-9494-c800392b9265.pdf.
# Skip these documents.
if response.status_code != 200:
logger.error('Document URL {} returns {} - Could not get attachment text!'.format(url, response.status_code))
logger.warning('Document URL {} returns {} - Could not get attachment text!'.format(url, response.status_code))
continue

extension = os.path.splitext(url)[1]
Expand All @@ -77,27 +85,37 @@ def convert_document_to_plaintext(self):
try:
plain_text = textract.process(tfp.name)
except textract.exceptions.ShellError as e:
logger.error('{} - Could not convert Councilmatic Document ID {}!'.format(e, document_id))
logger.warning('{} - Could not convert Councilmatic Document ID {}!'.format(e, document_id))
continue
except TypeError as e:
if 'decode() argument 1 must be str, not None' in str(e):
logger.warning('{} - Could not convert Councilmatic Document ID {}!'.format(e, document_id))
continue
else:
raise
except UnicodeDecodeError as e:
logger.warning('{} - Could not convert Councilmatic Document ID {}!'.format(e, document_id))
continue

logger.info('Councilmatic Document ID {} - conversion complete'.format(document_id))

yield {'plain_text': plain_text.decode('utf-8'), 'id': document_id}
yield (plain_text.decode('utf-8'), document_id)

def add_plain_text(self):
'''
Metro has over 2,000 attachments that should be converted into plain text.
When updating all documents with `--update_all`, this function insures that the database updates only 20 documents per connection (mainly, to avoid unexpected memory consumption).
It fetches up to 20 elements from a generator object, runs the UPDATE query, and then fetches up to 20 more.

Inspired by: https://stackoverflow.com/questions/30510593/how-can-i-use-server-side-cursors-with-django-and-psycopg2/41088159#41088159

More often, this script updates just a handful of documents: so, the incremental, fetch-just-20 approach may prove unnecessary. Possible refactor?
Metro has over 2,000 attachments that should be converted into plain
text. When updating all documents with `--update_all`, this function
ensures that the database updates only 20 documents per connection
(mainly, to avoid unexpected memory consumption). It fetches up to 20
elements from a generator object, runs the UPDATE query, and then
fetches up to 20 more.

Inspired by https://stackoverflow.com/questions/30510593/how-can-i-use-server-side-cursors-with-django-and-psycopg2/41088159#41088159
'''
update_statement = '''
UPDATE councilmatic_core_billdocument AS bill_docs
SET full_text = :plain_text
WHERE bill_docs.document_id = :id
UPDATE opencivicdata_billdocument AS bill_docs
SET extras = jsonb_set(extras, '{full_text}', to_jsonb(cast(%s as text)))
WHERE bill_docs.id = %s
'''

plaintexts = self.convert_document_to_plaintext()
Expand All @@ -109,7 +127,7 @@ def add_plain_text(self):
if not plaintexts_fetched_from_generator:
break
else:
with engine.begin() as connection:
connection.execute(sa.text(update_statement), plaintexts_fetched_from_generator)
with connection.cursor() as cursor:
cursor.executemany(update_statement, plaintexts_fetched_from_generator)

logger.info('SUCCESS')
37 changes: 37 additions & 0 deletions councilmatic_core/migrations/0049_auto_20191114_1142.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# Generated by Django 2.1.14 on 2019-11-14 19:42

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('councilmatic_core', '0048_post_shape'),
]

operations = [
migrations.AlterModelOptions(
name='membership',
options={'base_manager_name': 'objects'},
),
migrations.AlterField(
model_name='bill',
name='slug',
field=models.SlugField(unique=True),
),
migrations.AlterField(
model_name='event',
name='slug',
field=models.SlugField(max_length=200, unique=True),
),
migrations.AlterField(
model_name='organization',
name='slug',
field=models.SlugField(max_length=200, unique=True),
),
migrations.AlterField(
model_name='person',
name='slug',
field=models.SlugField(unique=True),
),
]
16 changes: 16 additions & 0 deletions councilmatic_core/migrations/0050_remove_billdocument.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,16 @@
# Generated by Django 2.2.9 on 2020-01-17 21:30

from django.db import migrations


class Migration(migrations.Migration):

dependencies = [
('councilmatic_core', '0049_auto_20191114_1142'),
]

operations = [
migrations.DeleteModel(
name='BillDocument',
),
]
18 changes: 18 additions & 0 deletions councilmatic_core/migrations/0051_bill_last_action_date.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
# Generated by Django 2.2.9 on 2020-01-30 19:03

from django.db import migrations, models


class Migration(migrations.Migration):

dependencies = [
('councilmatic_core', '0050_remove_billdocument'),
]

operations = [
migrations.AddField(
model_name='bill',
name='last_action_date',
field=models.DateTimeField(blank=True, null=True),
),
]
Loading