Commands to know

The LA Metro galaxy comes with several CLI commands and their various options. This section identifies some of the most significant commands, how to use them, and where to execute them.

Scraping data from Legistar

Running the scrapers can be simple or fairly involved. You can run full scrapes or "windowed" scrapes, limited to items updated within the past N days; you can run scrapes at faster or slower rates; and you can run scrapes for all data or just bills, events, or people (oh, my).

Note! The Metro scrapers on the server run at different intervals, depending on the day. Consult the schedule for each scraping DAG in the dashboard. (Learn how to decipher cron scheduling syntax with our beloved crontab guru.)
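
For reference, a cron expression reads left to right as minute, hour, day of month, month, and day of week. The schedule below is a made-up example for illustration, not Metro's actual configuration:

# hypothetical schedule: run at minute 0 of every sixth hour, every day
0 */6 * * *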

Sometimes, you may need to run a scrape to debug, or simply to capture data more quickly. You can trigger windowed scrapes, fast full scrapes, and hourly processing from the dashboard by clicking the "Trigger DAG" button.

(Screenshot: the "Trigger DAG" button in the dashboard)

If you need to run a more specific command, you may need to shell into the server. If so, consider the commands below.

# shell into the server
ssh ubuntu@ocd.datamade.us

# visit the scrapers directory, and launch the correct virtual environment
sudo su - datamade
cd scrapers-us-municipal
workon opencivicdata
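
# optional, not part of the documented workflow: confirm that pupa resolves
# to the virtualenv's copy before kicking off a scrape
which pupa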

Then, run the appropriate command.

# scrape all recently updated data
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro

# scrape all recently updated data, and move as quickly as possible
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro --rpm=0

# scrape bills updated in the last 28 days
# https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/bills.py#L97
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro bills

# scrape all bills
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro bills window=0

# scrape bills updated in the last 7 days, as quickly as possible
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro bills window=7 --rpm=0

# scrape all events
# https://github.com/opencivicdata/scrapers-us-municipal/blob/master/lametro/events.py#L139
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro events

# scrape events updated in the last 7 days
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro events window=7

# scrape all people
# the people scraper does not have a "window" argument
# but instead determines which people to update by looking at those visible on the web interface
SHARED_DB=TRUE DATABASE_URL=postgis://datamade@3.93.9.229/lametro pupa update lametro people

Other commands

Metro Councilmatic runs additional processes on the data after it is imported into the database. As mentioned above, you can trigger hourly_processing in the Metro dashboard to run all of these commands in one go. (Repeat runs are not a big deal, so this is generally the best option, unless you need to append specific options.)

If you do need to run a particular management command, read on for more information about the commands that comprise hourly_processing (and don't forget to shell into the Councilmatic server and get situated first).

ssh ubuntu@boardagendas.metro.net
sudo su - datamade
cd lametro
source ~/.virtualenvs/lametro/bin/activate

Refresh the Property Image Cache. Metro caches PDFs of board reports and event agendas, and these cached copies can fall out of date. The refresh_pic management command refreshes the document cache (an S3 bucket connected to Metro Councilmatic via property-image-cache) by deleting potentially out-of-date versions of board reports and agendas.

# run the command and log the results (if on the server)
python manage.py refresh_pic >> /var/log/councilmatic/lametro-refreshpic.log 2>&1

Create PDF packets. Metro Councilmatic has composite versions of event agendas (the event and all related board reports) and board reports (the report and its attachments). A separate app, metro-pdf-merger, assists in creating these PDF packets; the compile_pdfs command communicates with this app by telling it which packets to create.

# run the command and log the results (if on the server)
# documented in the `metro-pdf-merger` README: https://github.com/datamade/metro-pdf-merger#get-started
python manage.py compile_pdfs >> /var/log/councilmatic/lametro-compilepdfs.log 2>&1

# compile packets for all documents
python manage.py compile_pdfs --all_documents

Convert report attachments into plain text. Metro Councilmatic allows users to query board reports via attachment text. The attachments must appear as plain text in the database: convert_attachment_text helps accomplish this.

# run the command and log the results (if on the server)
python manage.py convert_attachment_text >> /var/log/councilmatic/lametro-convertattachments.log 2>&1

# update all documents
python manage.py convert_attachment_text --update_all

Rebuild or update the Solr search index. Haystack comes with a utility command for rebuilding and updating the search index. Learn more in the Haystack docs.

# ideally, rebuild should be run with a small batch-size to avoid memory consumption issues
# https://github.com/datamade/devops/issues/42
# run the command and log the results (if on the server)
python manage.py rebuild_index --batch-size=200 >> /var/log/councilmatic/lametro-updateindex.log 2>&1

# update can be run with an age argument, which tells Haystack to only reindex records updated within the given number of hours
python manage.py update_index --age=2

# update should be run in non-interactive mode when logging the results
# `noinput` tells Haystack to skip the prompts
python manage.py update_index --noinput
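
# for example, a logged, non-interactive update of records changed in the last two hours
# (this exact invocation is illustrative; it assumes the same log file used by rebuild_index above)
python manage.py update_index --age=2 --noinput >> /var/log/councilmatic/lametro-updateindex.log 2>&1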

Are the Solr index and Councilmatic database in sync? Sometimes, the Solr index falls behind the Councilmatic database. This issue arises for a number of reasons and, when it does, it causes headaches. The data_integrity script checks that the Councilmatic database has the same number of records as the Solr index.

# run the command and log the results (if on the server)
python manage.py data_integrity >> /var/log/councilmatic/lametro-integrity.log 2>&1

Appropriately, data_integrity executes at the very end of the hourly_processing DAG, a (usually) happy conclusion to the marching progression of Metro commands.
