Template repo for creating a City Scrapers project in your area to scrape, standardize and share public meetings from local government websites. You can find more information on the project homepage or in the original City Scrapers repo for the Chicago area: City-Bureau/city-scrapers.
In order to set up a City Scrapers project for your area you'll need a GitHub account as well as git, Python 3.9 or above and Pipenv installed. If you want to make it easy to share access and onboard new contributors, GitHub organizations are a free and easy way of doing that.
-
Create a new repo in your GitHub account or organization by using this repo as a template or forking it.
- You should change the name to something specific to your area (i.e.
city-scrapers-il
for scrapers in Illinois) - If you forked the repo, enable issues for your fork by going to Settings, and checking the box next to Issues in the Features section.
- You should change the name to something specific to your area (i.e.
-
Clone the repo you created (substituting your account and repo name) with:
git clone https://github.com/{ACCOUNT}/city-scrapers-{AREA}.git
-
Update
LICENSE
,CODE_OF_CONDUCT.md
,CONTRIBUTING.md
, andREADME.md
with info on your group or organization so that people know what your project is and how they can contribute. -
Create a Python 3.9 virtual environment and install development dependencies with:
pipenv install --dev --python 3.9
-
Decide whether you want to output static files to AWS S3, Microsoft Azure Blob Storage, or Google Cloud Storage, and update the
city-scrapers-core
package with the necessary extras:# To use AWS S3 pipenv install 'city-scrapers-core[aws]' # To use Microsoft Azure pipenv install 'city-scrapers-core[azure]' # To use Google Cloud Storage pipenv install 'city-scrapers-core[gcs]'
Once you've updated
city-scrapers-core
, you'll need to update./city_scrapers/settings/prod.py
by uncommenting the extension and storages related to your platform.Note: You can reach out to us at documenters@citybureau.org or on our Slack if you want free hosting on either S3 or Azure and we'll create a bucket/container and share credentials with you. Otherwise you can use your own credentials.
-
Create a free account on Sentry, and make sure to apply for a sponsored open source account to take advantage of additional features.
-
The project template uses GitHub Actions for testing and running scrapers. All of the workflows are stored in the
./.github/workflows
directory. You'll need to make sure Actions are enabled for your repository../.github/workflows/ci.yml
runs automated tests and style checks on every commit and PR../.github/workflows/cron.yml
runs all scrapers daily and writes the output to S3, Azure, or GCS. You can set thecron
expression to when you want your scrapers to run (in UTC, not your local timezone)../.github/workflows/archive.yml
runs all scrapers daily and submits all scraped URLs to the Internet Archive's Wayback Machine. This is run separately to avoid slowing down general scraper runs, but adds to a valuable public archive of website information.- Once you've made sure your workflows are configured, you can change the URLs for the status badges at the top of your
README.md
file so that they display and link to the status of the most recent workflow runs. If you don't change the workflow names, all you should need to change is the account and repo names in the URLs.
-
In order for the scraped results to access S3, Azure, or GCS as well as report errors to Sentry, you'll need to set encrypted secrets for your actions. Set all of the secrets for your storage backend as well as
SENTRY_DSN
for both of them, and then uncomment the values you've set in theenv
section ofcron.yml
. If thecron.yml
workflow is enabled, it will now be able to access these values as environment variables. -
Once you've set the storage backend and configured GitHub Actions you're ready to write some scrapers! Check out our development docs to get started.
-
We're encouraging people to contribute to issues on repos marked with the
city-scrapers
topic, so be sure to set that on your repo and add labels like "good first issue" and "help wanted" so people know where they can get started. -
If you want an easy way of sharing your scraper results, check out our
city-scrapers-events
template repo for a site that will display the meetings you've scraped for free on GitHub Pages.
There's a lot involved in starting a City Scrapers project beyond the code itself, so you can check out our Introduction to City Scrapers in our documentation for some notes on how to grow your project.
If you want to ask questions or just talk to others working on City Scrapers projects you can fill out this form to join our Slack channel or reach out directly at documenters@citybureau.org.