edgi-govdata-archiving/web-monitoring-db

⚠️ This project is no longer maintained. ⚠️ It may receive security updates, but we are no longer making major changes or improvements. EDGI no longer makes active use of this toolset and it is hard to re-deploy in other contexts.

web-monitoring-db

This repository is the database and API underlying the EDGI Web Monitoring Project. It’s a Rails app that:

  • Acts as a database of monitored pages and captured versions of those pages over time.

    (The application does not record new versions itself, but relies on importing data from external services, like the Internet Archive or Versionista. See “How Data Gets Loaded” below for more.)

  • Provides an API to get that page and version data, and to allow analysts or other automated tools to annotate those versions with metadata about what has changed from version to version.

For more about how data is modeled in this project, see “Data Model” below.

API documentation is available from the homepage of the application, e.g. by pointing your browser to http://localhost:3000/ or https://api.monitoring.envirodatagov.org. It’s generated from our OpenAPI docs in swagger.yaml.

We maintain a publicly available staging server at https://api-staging.monitoring.envirodatagov.org that you can test against. It runs the latest code and has non-production data — it’s safe to modify or post new versions or annotations to, but you should not rely on that data sticking around; it may get reset at any time. For access, ask for an account on Slack or use the public user credentials:

  • Username: public.access@envirodatagov.org
  • Password: PUBLIC_ACCESS
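
As a rough illustration of talking to the staging server from Ruby, the sketch below builds an authenticated request with the public credentials above. It assumes the API accepts HTTP Basic authentication and that a `/api/v0/pages` endpoint exists; both are assumptions for illustration, so check the generated API documentation for the real paths and auth scheme. `build_request` and `send_request` are hypothetical helper names.

```ruby
require 'net/http'
require 'uri'

# Staging server and public credentials, as listed above.
STAGING_URL = 'https://api-staging.monitoring.envirodatagov.org'
USERNAME = 'public.access@envirodatagov.org'
PASSWORD = 'PUBLIC_ACCESS'

# Build (but do not send) an authenticated GET request.
# ASSUMPTION: HTTP Basic auth and the path passed in are illustrative only.
def build_request(path, username, password)
  uri = URI.join(STAGING_URL, path)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  request['Accept'] = 'application/json'
  [uri, request]
end

# Send a request built by build_request and return the response object.
def send_request(uri, request)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(request)
  end
end

# ASSUMPTION: '/api/v0/pages' is a guess modeled on the /api/v0/imports path.
uri, request = build_request('/api/v0/pages', USERNAME, PASSWORD)

# Uncomment to actually contact the staging server:
# response = send_request(uri, request)
# puts response.code
```

Building the request separately from sending it makes it easy to inspect the URL and headers before anything goes over the network.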

Installation

  1. Ensure you have Ruby 3.3+.

    You can use rbenv to manage multiple Ruby versions.

  2. Ensure you have PostgreSQL 9.5+. If you are on MacOS, we recommend Postgres.app. It makes running multiple versions of PostgreSQL much simpler and gives you easy access to start and stop your databases.

  3. Ensure you have Redis (used for caching).

    On MacOS:

    $ brew install redis

    On Debian Linux:

    $ apt-get install redis
  4. Ensure you have a JavaScript Runtime

    On MacOS:

    You do not need to do anything. Apple JavaScriptCore fulfills this dependency.

    On Debian Linux:

    $ apt-get install nodejs

    If you wish to use another runtime you can use one listed here.

  5. Clone this repo

  6. If you don’t have the bundler Ruby gem, install it:

    $ gem install bundler
  7. Wherever you cloned the repo, go to that directory and install dependencies:

    $ bundle install --without production
  8. Copy the .env.example file to .env - this allows for easy configuration locally.

    $ cp .env.example .env

    Take a moment to look through the variables here and change any that make sense for your local environment. If you need to set variables differently when running tests, make a .env.test file that has your test-specific variables.

  9. Set up your database.

    • If your Postgres install trusts local users and you have a superuser (this is the normal situation with Postgres.app), run:

      $ bundle exec rake db:setup

      That will create a database, set up all the tables, create an admin user, and add some sample data. Make note of the admin user e-mail and password that are shown; you’ll need them to log in and create more users, import more data, or make annotations.

      If you’d like to do the setup manually or don’t want sample data, see manual postgres setup below.

    • If your Postgres install has a superuser, but doesn't trust local connections, you'll need to configure database credentials in .env. Find the line for DATABASE_URL in your .env file, uncomment it, and fill it in with your username and password. Make another file named .env.test and copy that line into it, but change the database name at the end of the URL so it points to your test database. Then run the same command as above:

      $ bundle exec rake db:setup

      If you’d like to do the setup manually or don’t want sample data, see manual postgres setup below.

    • If you’d like to configure your Postgres DB to use a specific user, you’ll need to do a little more work:

      1. Log into psql and create a new user for your databases. Change the username and password to whatever you’d like:

        CREATE USER wm_dev_user WITH SUPERUSER PASSWORD 'wm_dev_password';

        Unfortunately, Rails' test fixtures require nothing less than superuser privileges in PostgreSQL.

      2. (Still in psql) Create a development and a test database:

        -- Development database
        CREATE DATABASE web_monitoring_dev ENCODING 'utf-8' OWNER wm_dev_user;
        \c web_monitoring_dev
        CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
        CREATE EXTENSION IF NOT EXISTS "pgcrypto";
        CREATE EXTENSION IF NOT EXISTS "plpgsql";
        CREATE EXTENSION IF NOT EXISTS "citext";
        -- Repeat for test database
        CREATE DATABASE web_monitoring_test ENCODING 'utf-8' OWNER wm_dev_user;
        \c web_monitoring_test
        CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
        CREATE EXTENSION IF NOT EXISTS "pgcrypto";
        CREATE EXTENSION IF NOT EXISTS "plpgsql";
        CREATE EXTENSION IF NOT EXISTS "citext";
      3. Exit the psql console and open your .env file. Find the line for DATABASE_URL in your .env file, uncomment it, and fill it in with your credentials and database name from above:

        DATABASE_URL=postgres://wm_dev_user:wm_dev_password@localhost:5432/web_monitoring_dev

        Make a .env.test file and set the same value there, but with the name of your test database:

        DATABASE_URL=postgres://wm_dev_user:wm_dev_password@localhost:5432/web_monitoring_test
      4. Set up all the tables and test data in your DB by running:

        # Set up tables, indexes, and general database schema:
        $ bundle exec rake db:schema:load
        # Add sample data and an admin user:
        $ bundle exec rake db:seed

        For more on this last step, see manual postgres setup below.

  10. Start the server!

    $ bundle exec rails server

    You should now have a server running and can visit it at http://localhost:3000/. Open that up in a browser and go to town!

  11. Bulk importing, automated analysis, and e-mail invitations all run as asynchronous jobs (using the fantastic good_job gem). If you plan to use any of these features, you must also start a worker:

    $ bundle exec good_job start

    If you only want to run a particular type of job, you can set a list of queue names with the --queues option:

    $ bundle exec good_job start --queues=mailers,import,analysis

    Each job type runs on a different queue:

    • mailers: Sending e-mails. (There's no job associated with this queue because it is automatically processed by ActionMailer, a built-in component of Rails.)
    • import: Bulk version imports (processing data sent to the /api/v0/imports endpoint).
    • analysis: Auto-analyze changes between versions and create annotations with the results.
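
For the database configuration in step 9, it can help to see how a DATABASE_URL breaks down into its parts. The sketch below uses only Ruby's standard `uri` library and the example values from that step; `database_settings` is a hypothetical helper, not something the application provides.

```ruby
require 'uri'

# Example values from step 9; substitute your own credentials.
DEV_URL  = 'postgres://wm_dev_user:wm_dev_password@localhost:5432/web_monitoring_dev'
TEST_URL = 'postgres://wm_dev_user:wm_dev_password@localhost:5432/web_monitoring_test'

# Split a DATABASE_URL into user, password, host, port, and database name.
def database_settings(url)
  uri = URI.parse(url)
  {
    user: uri.user,
    password: uri.password,
    host: uri.host,
    port: uri.port,
    database: uri.path.delete_prefix('/')
  }
end

dev_settings = database_settings(DEV_URL)
test_settings = database_settings(TEST_URL)

# Only the database name should differ between .env and .env.test.
differing = dev_settings.keys.reject { |key| dev_settings[key] == test_settings[key] }
puts "Differs between dev and test: #{differing.inspect}"
```

If anything other than the database name shows up as different, the two files are probably pointing at different servers or users.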

Manual Postgres Setup

If you don’t want to populate your DB with seed data, want to manage creation of the database yourself, or otherwise manually do database setup, run any of the following commands as desired instead of rake db:setup:

$ bundle exec rake db:create       # Connects to Postgres and creates a new database
$ bundle exec rake db:schema:load  # Populates the database with the current schema
$ bundle exec rake db:seed         # Adds an admin user and sample data

If you skip rake db:seed, you’ll still need to create an Admin user. You should not do this through the database since the password will need to be properly encrypted. Instead, open the rails console with rails console and run the following:

User.create(
  email: '[your email address]',
  password: '[the password you want]',
  admin: true,
  confirmed_at: Time.now
)

Docker

The Dockerfile runs the rails server on port 3000 in the container. To build and run:

docker build --target rails-server -t envirodgi/db-rails-server .
docker build --target import-worker -t envirodgi/db-import-worker .
docker run -p 3000:3000 -e <ENVIRONMENT VARIABLES> envirodgi/db-rails-server
docker run -p 6379:6379 -e <ENVIRONMENT VARIABLES> envirodgi/db-import-worker

Point your browser or curl at http://localhost:3000.

Data Model

The database models three main types of data:

  • Pages, which represent a page on the internet. Pages are identified by a unique ID rather than their URL because pages can move or be available from multiple URLs. (Note: we don't actually model that yet, though! See #492 for more.)

  • Versions, which represent a particular page at a particular point in time. We use the term “version” instead of others more common in the archival space because we attempt to only represent different versions. That is, if a page changed on Wednesday and we captured copies of it on Monday, Tuesday, and Wednesday, we only make version records for Monday and Wednesday (because Tuesday was the same as Monday).

    (Note: because of technical issues around imported data, we often store more versions than we should according to the above definition [e.g. we might still have a record for Tuesday]. Versions have a field named different that indicates whether a version differs from the previous one, and the API only returns versions that are different unless you explicitly request otherwise.)

  • Annotations, which represent an analysis of what has changed between any two versions of a page. Each annotation has a priority and a significance (both numbers between 0 and 1), an author indicating who made the analysis (possibly a bot account), and an annotation field, a JSON object with no specified structure that can hold any data desired.
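
To make the Monday/Tuesday/Wednesday example concrete, here is a minimal sketch of flagging which captures count as distinct versions. It assumes captures are compared by a hash of their body, which is an assumption for illustration; the `Capture` struct and `mark_different` method are hypothetical and are not the application's models.

```ruby
require 'digest'

# Hypothetical, simplified stand-in for a version record.
Capture = Struct.new(:day, :body, :different, keyword_init: true)

# Mark each capture as different when its body hash does not match the
# previous capture's hash. The first capture always counts as different.
def mark_different(captures)
  previous_hash = nil
  captures.map do |capture|
    current_hash = Digest::SHA256.hexdigest(capture.body)
    marked = Capture.new(day: capture.day, body: capture.body,
                         different: current_hash != previous_hash)
    previous_hash = current_hash
    marked
  end
end

captures = [
  Capture.new(day: 'Monday', body: '<p>Hello</p>'),
  Capture.new(day: 'Tuesday', body: '<p>Hello</p>'),
  Capture.new(day: 'Wednesday', body: '<p>Hello, world</p>')
]

versions = mark_different(captures)
distinct_days = versions.select(&:different).map(&:day)
puts distinct_days.inspect
```

All three records are kept, but only Monday and Wednesday are flagged, which mirrors how the API hides unchanged versions by default.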

There are several other kinds of objects, but they are subservient to the ones above:

  • Changes, which serve to connect any two versions of a page. Annotations are actually connected to changes, rather than directly to two versions. You can also generate diffs for a given change.

  • Tags, which can be applied to pages. They help sort and categorize things. Most tags are manually applied, but the application auto-generates a few:

    • domain:<domain name>, e.g. domain:www.epa.gov for a page at https://www.epa.gov/citizen-science
    • 2l-domain:<second-level domain name> e.g. 2l-domain:epa.gov for a page at https://www.epa.gov/citizen-science
  • Maintainers, which can be applied to pages. They represent organizations that maintain a given page. For example, the page at https://www.epa.gov/citizen-science is maintained by EPA.

  • Imports model requests to import new data and the results of the import operation.

  • Users model people (both human and bots) who can view, import, and annotate data. You currently have to have a user account to do anything in the application, though we hope accounts will not be needed to view public data in the future.

Actual database schemas for each of these tables are listed in db/schema.rb.
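
The auto-generated domain tags described above can be sketched in a few lines of Ruby. This is an illustration, not the application's code: `domain_tags` is a hypothetical helper, and it simply takes the last two labels of the host as the second-level domain, which gives the wrong answer for hosts under multi-part suffixes such as "co.uk".

```ruby
require 'uri'

# Derive the two auto-generated tags for a page URL.
# SIMPLIFICATION: the second-level domain is the last two host labels.
def domain_tags(page_url)
  host = URI.parse(page_url).host.downcase
  second_level = host.split('.').last(2).join('.')
  ["domain:#{host}", "2l-domain:#{second_level}"]
end

puts domain_tags('https://www.epa.gov/citizen-science').inspect
```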

How Data Gets Loaded

The web-monitoring-db project does not actually monitor or scrape pages on the web. Instead, we rely on importing data from other services, like the Internet Archive. Each day, a script queries other services for historical snapshots and sends the results to the /api/v0/imports endpoint.

Most of the data sent to /api/v0/imports matches up directly with the structure of the Version model. However, the body_url field in an import is treated specially.

When new page or version data is imported, the body_url field points to a location where the raw HTTP response body can be retrieved. If the body_url host matches one of the values in the ALLOWED_ARCHIVE_HOSTS environment variable, the version record that gets added to the database will simply point to that external location as a source of raw response data. Otherwise, the application downloads the data from body_url and stores it in its FileStorage.

The intent is to make sure data winds up at a reliably available location, ensuring that anyone who can access the API can also access the raw response body for any version. Hosts should be listed in ALLOWED_ARCHIVE_HOSTS if they meet these criteria better than the application’s own file storage. The application’s storage area can be the local disk or it can be S3, depending on configuration. The component can take pluggable configurations, so we can support other storage types or locations in the future.
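
The body_url decision described above might look roughly like the following. This sketch assumes ALLOWED_ARCHIVE_HOSTS is a whitespace-separated list of hosts or base URLs; that format is an assumption, so check .env.example for the real one. The method names and example hosts are hypothetical.

```ruby
require 'uri'

# ASSUMPTION: the variable holds whitespace-separated hosts or base URLs.
def allowed_hosts(raw)
  raw.to_s.split.map do |entry|
    entry.include?('://') ? URI.parse(entry).host : entry
  end.map(&:downcase)
end

# Returns :reference when the version can point at body_url directly,
# or :copy when the body should be downloaded into the app's own storage.
def storage_action(body_url, raw_allowed)
  host = URI.parse(body_url).host.to_s.downcase
  allowed_hosts(raw_allowed).include?(host) ? :reference : :copy
end

# Hypothetical hosts, for illustration only.
allowed = 'https://archive.example.org/ bucket.example.com'
puts storage_action('https://archive.example.org/raw/abc123', allowed)
puts storage_action('https://elsewhere.example.net/page.html', allowed)
```

In a real deployment the second argument would come from `ENV['ALLOWED_ARCHIVE_HOSTS']`; passing it in as a string keeps the sketch easy to try out.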

You can see more about this process in:

File Storage

The application needs to store files for several different purposes (storing raw import data, archiving HTTP response bodies as described in the previous section, specialized logs, etc). To do this, it uses the FileStorage module, which has different implementations for different types of storage, such as the local disk or Amazon S3.

Currently, the application creates two FileStorage instances:

  1. “Archival storage” is used to store raw HTTP response bodies for each version of a page. See the “how data gets loaded” section for more details. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the AWS_ARCHIVE_BUCKET environment variable. Everything in this storage area is publicly available.

  2. “Working storage” is used to store internal data, such as raw import data and import logs. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the AWS_WORKING_BUCKET environment variable. Everything in this storage area should be considered private and you should not expose it to the public web.

  3. For historical reasons, EDGI’s deployment includes a third S3 bucket that is not directly accessed by the application. It’s where we store HTTP response bodies collected from Versionista, a service we previously used for scraping government web pages. You can see it listed in the example settings for ALLOWED_ARCHIVE_HOSTS.
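
To illustrate the idea of interchangeable storage backends, here is a minimal sketch with two classes that share the same small interface. The class names and the `save_file`/`get_file` methods are hypothetical and chosen for illustration; see the FileStorage module in the source for the actual interface. An in-memory class stands in for a remote backend such as S3 so the example runs without credentials.

```ruby
require 'fileutils'
require 'tmpdir'

# Hypothetical backend that stores files under a root directory on disk.
class LocalDiskStorage
  def initialize(root)
    @root = root
  end

  def save_file(path, content)
    full = full_path(path)
    FileUtils.mkdir_p(File.dirname(full))
    File.binwrite(full, content)
  end

  def get_file(path)
    File.binread(full_path(path))
  end

  private

  def full_path(path)
    File.join(@root, path)
  end
end

# Hypothetical backend with the same interface, kept in memory.
class MemoryStorage
  def initialize
    @files = {}
  end

  def save_file(path, content)
    @files[path] = content.dup
  end

  def get_file(path)
    @files.fetch(path)
  end
end

# Two separate instances mirror the archival/working split described above.
Dir.mktmpdir do |dir|
  archive = LocalDiskStorage.new(File.join(dir, 'archive'))
  working = MemoryStorage.new
  archive.save_file('versions/abc', '<html></html>')
  working.save_file('imports/1.log', 'import finished')
  puts archive.get_file('versions/abc')
  puts working.get_file('imports/1.log')
end
```

Because callers only rely on the two shared methods, swapping the backend for either storage area is a configuration change rather than a code change.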

Releases

New releases of the app are published automatically as Docker images by CircleCI when someone pushes to the release branch. They are available at https://hub.docker.com/r/envirodgi. See web-monitoring-ops for how we deploy releases to actual web servers.

Images are tagged with the SHA-1 of the git commit they were built from. For example, the image envirodgi/db-rails-server:ddc246819a039465e7711a1abd61f67c14b7a320 was built from commit ddc246819a039465e7711a1abd61f67c14b7a320.

We usually create merge commits on the release branch that note the PRs included in the release or any other relevant notes (e.g. Release #503, #504).

Code of Conduct

This repository falls under EDGI's Code of Conduct.

Contributors

This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for all their contributions! See our contributing guidelines to find out how you can help.

Contributions Name
📖 👀 Dan Allan
📋 🔍 Andrew Bergman
💻 🚇 📖 💬 👀 Rob Brackett
💻 Alessandro Caporrini
📖 Patrick Connolly
💻 Robert Dalin
💻 Kate Donaldson
📖 Michael Hardy
💻 Kasper Holbek Jensen
💻 Shishir Joshi
💻 📖 Krzysztof Madejski
📖 Ansar Memon (Amoury)
📖 📋 📢 Matt Price
📋 🔍 Toly Rinberg
💻 Ben Sheldon
💻 Ewelina Sobora
🚇 Frederik Spang
💻 Max Tedford
💻 Eddie Tejeda
📖 📋 Dawn Walker

(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)

License & Copyright

Copyright (C) 2017 Environmental Data and Governance Initiative (EDGI)

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3.0.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the LICENSE file for details.