From 6500685f10dce6a4e5e4edb76e4fcfdcc5d90cf9 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:12:52 +0000 Subject: [PATCH 01/12] Bump capybara from 3.27.0 to 3.28.0 Bumps [capybara](https://github.com/teamcapybara/capybara) from 3.27.0 to 3.28.0. - [Release notes](https://github.com/teamcapybara/capybara/releases) - [Changelog](https://github.com/teamcapybara/capybara/blob/master/History.md) - [Commits](https://github.com/teamcapybara/capybara/compare/3.27.0...3.28.0) Signed-off-by: dependabot-preview[bot] --- Gemfile.lock | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Gemfile.lock b/Gemfile.lock index cdfa6235..9e3b690d 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -68,7 +68,7 @@ GEM msgpack (~> 1.0) builder (3.2.3) byebug (11.0.1) - capybara (3.27.0) + capybara (3.28.0) addressable mini_mime (>= 0.1.3) nokogiri (~> 1.8) From b6205ca761e8349f97f87dd7d153e20cf013c3fa Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:13:18 +0000 Subject: [PATCH 02/12] Bump dotenv-rails from 2.7.4 to 2.7.5 Bumps [dotenv-rails](https://github.com/bkeepers/dotenv) from 2.7.4 to 2.7.5. 
- [Release notes](https://github.com/bkeepers/dotenv/releases) - [Changelog](https://github.com/bkeepers/dotenv/blob/master/Changelog.md) - [Commits](https://github.com/bkeepers/dotenv/compare/v2.7.4...v2.7.5) Signed-off-by: dependabot-preview[bot] --- Gemfile.lock | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index 9e3b690d..0ed14ba7 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -93,9 +93,9 @@ GEM railties (>= 4.1.0, < 6.0) responders warden (~> 1.2.3) - dotenv (2.7.4) - dotenv-rails (2.7.4) - dotenv (= 2.7.4) + dotenv (2.7.5) + dotenv-rails (2.7.5) + dotenv (= 2.7.5) railties (>= 3.2, < 6.1) erubi (1.8.0) execjs (2.7.0) @@ -204,7 +204,7 @@ GEM rails-dom-testing (2.0.3) activesupport (>= 4.2.0) nokogiri (>= 1.6) - rails-html-sanitizer (1.0.4) + rails-html-sanitizer (1.2.0) loofah (~> 2.2, >= 2.2.2) railties (5.2.3) actionpack (= 5.2.3) @@ -213,7 +213,7 @@ GEM rake (>= 0.8.7) thor (>= 0.19.0, < 2.0) rainbow (3.0.0) - rake (12.3.2) + rake (12.3.3) rb-fsevent (0.10.3) rb-inotify (0.10.0) ffi (~> 1.0) From 8844416d6dcc9fde5812cc219cbd9b7448a9f550 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:13:44 +0000 Subject: [PATCH 03/12] Bump oj from 3.8.1 to 3.9.1 Bumps [oj](https://github.com/ohler55/oj) from 3.8.1 to 3.9.1. 
- [Release notes](https://github.com/ohler55/oj/releases) - [Changelog](https://github.com/ohler55/oj/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ohler55/oj/compare/v3.8.1...v3.9.1) Signed-off-by: dependabot-preview[bot] --- Gemfile | 2 +- Gemfile.lock | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Gemfile b/Gemfile index 2b2cb385..81ba34db 100644 --- a/Gemfile +++ b/Gemfile @@ -19,7 +19,7 @@ gem 'resque' gem 'resque-heroku-signals' gem 'sassc-rails', '~> 2.1.2' gem 'uglifier', '>= 1.3.0' -gem 'oj', '~> 3.8' +gem 'oj', '~> 3.9' gem 'pundit' gem 'sentry-raven' gem 'readthis' diff --git a/Gemfile.lock b/Gemfile.lock index 0ed14ba7..cf64ff55 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -160,7 +160,7 @@ GEM nio4r (2.4.0) nokogiri (1.10.4) mini_portile2 (~> 2.4.0) - oj (3.8.1) + oj (3.9.1) orm_adapter (0.5.0) os (1.0.1) parallel (1.17.0) @@ -332,7 +332,7 @@ DEPENDENCIES httparty jwt (~> 2.2) listen (~> 3.1) - oj (~> 3.8) + oj (~> 3.9) pg (~> 1.1) postmark-rails pry-rails From 315a65196395185d08640d49ccae7b27e87ea328 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:14:46 +0000 Subject: [PATCH 04/12] Bump bootsnap from 1.4.4 to 1.4.5 Bumps [bootsnap](https://github.com/Shopify/bootsnap) from 1.4.4 to 1.4.5. 
- [Release notes](https://github.com/Shopify/bootsnap/releases) - [Changelog](https://github.com/Shopify/bootsnap/blob/master/CHANGELOG.md) - [Commits](https://github.com/Shopify/bootsnap/compare/v1.4.4...v1.4.5) Signed-off-by: dependabot-preview[bot] --- Gemfile.lock | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index cf64ff55..4f294fad 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -64,7 +64,7 @@ GEM aws-eventstream (~> 1.0, >= 1.0.2) bcrypt (3.1.12) bindex (0.5.0) - bootsnap (1.4.4) + bootsnap (1.4.5) msgpack (~> 1.0) builder (3.2.3) byebug (11.0.1) @@ -152,7 +152,7 @@ GEM mini_portile2 (2.4.0) minitest (5.11.3) mono_logger (1.1.0) - msgpack (1.2.10) + msgpack (1.3.1) multi_json (1.13.1) multi_xml (0.6.0) multipart-post (2.1.1) From 51a4bf3e0cc00e3b375e3ff53798a8d64c41708a Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:15:52 +0000 Subject: [PATCH 05/12] Bump puma from 4.0.1 to 4.1.0 Bumps [puma](https://github.com/puma/puma) from 4.0.1 to 4.1.0. 
- [Release notes](https://github.com/puma/puma/releases) - [Changelog](https://github.com/puma/puma/blob/master/History.md) - [Commits](https://github.com/puma/puma/compare/v4.0.1...v4.1.0) Signed-off-by: dependabot-preview[bot] --- Gemfile | 2 +- Gemfile.lock | 6 +++--- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/Gemfile b/Gemfile index 81ba34db..2fc7dba6 100644 --- a/Gemfile +++ b/Gemfile @@ -13,7 +13,7 @@ gem 'httparty' gem 'jwt', '~> 2.2' gem 'rails', '~> 5.2.3' gem 'pg', '~> 1.1' -gem 'puma', '~> 4.0' +gem 'puma', '~> 4.1' gem 'rack-cors', :require => 'rack/cors' gem 'resque' gem 'resque-heroku-signals' diff --git a/Gemfile.lock b/Gemfile.lock index 4f294fad..8b492251 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -157,7 +157,7 @@ GEM multi_xml (0.6.0) multipart-post (2.1.1) mustermann (1.0.3) - nio4r (2.4.0) + nio4r (2.5.1) nokogiri (1.10.4) mini_portile2 (~> 2.4.0) oj (3.9.1) @@ -178,7 +178,7 @@ GEM pry-rails (0.3.9) pry (>= 0.10.4) public_suffix (3.1.1) - puma (4.0.1) + puma (4.1.0) nio4r (~> 2.0) pundit (2.0.1) activesupport (>= 3.0.0) @@ -336,7 +336,7 @@ DEPENDENCIES pg (~> 1.1) postmark-rails pry-rails - puma (~> 4.0) + puma (~> 4.1) pundit rack-cors rails (~> 5.2.3) From 3817f6dc60eaf6733b7ba80367b0a5ee3f2eb8e5 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:16:09 +0000 Subject: [PATCH 06/12] Bump aws-sdk-s3 from 1.46.0 to 1.48.0 Bumps [aws-sdk-s3](https://github.com/aws/aws-sdk-ruby) from 1.46.0 to 1.48.0. 
- [Release notes](https://github.com/aws/aws-sdk-ruby/releases) - [Changelog](https://github.com/aws/aws-sdk-ruby/blob/master/gems/aws-sdk-s3/CHANGELOG.md) - [Commits](https://github.com/aws/aws-sdk-ruby/compare/v1.46.0...v1.48.0) Signed-off-by: dependabot-preview[bot] --- Gemfile | 2 +- Gemfile.lock | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/Gemfile b/Gemfile index 2fc7dba6..de147352 100644 --- a/Gemfile +++ b/Gemfile @@ -7,7 +7,7 @@ end ruby '2.6.3' -gem 'aws-sdk-s3', '~> 1.46' +gem 'aws-sdk-s3', '~> 1.48' gem 'devise' gem 'httparty' gem 'jwt', '~> 2.2' diff --git a/Gemfile.lock b/Gemfile.lock index 8b492251..65afe908 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -47,8 +47,8 @@ GEM arel (9.0.0) ast (2.4.0) aws-eventstream (1.0.3) - aws-partitions (1.195.0) - aws-sdk-core (3.61.2) + aws-partitions (1.207.0) + aws-sdk-core (3.65.1) aws-eventstream (~> 1.0, >= 1.0.2) aws-partitions (~> 1.0) aws-sigv4 (~> 1.1) @@ -56,7 +56,7 @@ GEM aws-sdk-kms (1.24.0) aws-sdk-core (~> 3, >= 3.61.1) aws-sigv4 (~> 1.1) - aws-sdk-s3 (1.46.0) + aws-sdk-s3 (1.48.0) aws-sdk-core (~> 3, >= 3.61.1) aws-sdk-kms (~> 1) aws-sigv4 (~> 1.1) @@ -320,7 +320,7 @@ PLATFORMS DEPENDENCIES addressable (~> 2.6) - aws-sdk-s3 (~> 1.46) + aws-sdk-s3 (~> 1.48) bootsnap (>= 1.3.1) byebug capybara From 107114f4c0f6a442139e66324beac29d18eb1b78 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 10:16:32 +0000 Subject: [PATCH 07/12] Bump webmock from 3.6.2 to 3.7.0 Bumps [webmock](https://github.com/bblimke/webmock) from 3.6.2 to 3.7.0. 
- [Release notes](https://github.com/bblimke/webmock/releases) - [Changelog](https://github.com/bblimke/webmock/blob/master/CHANGELOG.md) - [Commits](https://github.com/bblimke/webmock/compare/v3.6.2...v3.7.0) Signed-off-by: dependabot-preview[bot] --- Gemfile | 2 +- Gemfile.lock | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/Gemfile b/Gemfile index de147352..cf4bf7d8 100644 --- a/Gemfile +++ b/Gemfile @@ -64,7 +64,7 @@ end group :test do gem 'capybara' gem 'capybara-email' - gem 'webmock', '~> 3.6' + gem 'webmock', '~> 3.7' end group :production do diff --git a/Gemfile.lock b/Gemfile.lock index 65afe908..5fe81efb 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -305,7 +305,7 @@ GEM activemodel (>= 5.0) bindex (>= 0.4.0) railties (>= 5.0) - webmock (3.6.2) + webmock (3.7.0) addressable (>= 2.3.6) crack (>= 0.3.2) hashdiff (>= 0.4.0, < 2.0.0) @@ -353,7 +353,7 @@ DEPENDENCIES tzinfo-data uglifier (>= 1.3.0) web-console (>= 3.3.0) - webmock (~> 3.6) + webmock (~> 3.7) RUBY VERSION ruby 2.6.3p62 From ffd9f3b6b0444ef2c4c381605532c793f53f6b17 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 21:35:26 +0000 Subject: [PATCH 08/12] Bump addressable from 2.6.0 to 2.7.0 Bumps [addressable](https://github.com/sporkmonger/addressable) from 2.6.0 to 2.7.0. 
- [Release notes](https://github.com/sporkmonger/addressable/releases) - [Changelog](https://github.com/sporkmonger/addressable/blob/master/CHANGELOG.md) - [Commits](https://github.com/sporkmonger/addressable/compare/addressable-2.6.0...addressable-2.7.0) Signed-off-by: dependabot-preview[bot] --- Gemfile | 2 +- Gemfile.lock | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/Gemfile b/Gemfile index cf4bf7d8..ddc4bfd5 100644 --- a/Gemfile +++ b/Gemfile @@ -25,7 +25,7 @@ gem 'sentry-raven' gem 'readthis' gem 'hiredis' gem 'google-api-client' -gem 'addressable', '~> 2.6' +gem 'addressable', '~> 2.7' # See https://github.com/rails/execjs#readme for more supported runtimes # gem 'therubyracer', platforms: :ruby diff --git a/Gemfile.lock b/Gemfile.lock index 5fe81efb..84a725b4 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -42,8 +42,8 @@ GEM i18n (>= 0.7, < 2) minitest (~> 5.1) tzinfo (~> 1.1) - addressable (2.6.0) - public_suffix (>= 2.0.2, < 4.0) + addressable (2.7.0) + public_suffix (>= 2.0.2, < 5.0) arel (9.0.0) ast (2.4.0) aws-eventstream (1.0.3) @@ -177,7 +177,7 @@ GEM method_source (~> 0.9.0) pry-rails (0.3.9) pry (>= 0.10.4) - public_suffix (3.1.1) + public_suffix (4.0.1) puma (4.1.0) nio4r (~> 2.0) pundit (2.0.1) @@ -319,7 +319,7 @@ PLATFORMS ruby DEPENDENCIES - addressable (~> 2.6) + addressable (~> 2.7) aws-sdk-s3 (~> 1.48) bootsnap (>= 1.3.1) byebug From 578f617bf1c96f460d3a42c4076d8573cb1f1a1a Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 21:34:39 +0000 Subject: [PATCH 09/12] Bump devise from 4.6.2 to 4.7.0 Bumps [devise](https://github.com/plataformatec/devise) from 4.6.2 to 4.7.0. 
- [Release notes](https://github.com/plataformatec/devise/releases) - [Changelog](https://github.com/plataformatec/devise/blob/master/CHANGELOG.md) - [Commits](https://github.com/plataformatec/devise/compare/v4.6.2...v4.7.0) Signed-off-by: dependabot-preview[bot] --- Gemfile.lock | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/Gemfile.lock b/Gemfile.lock index 84a725b4..5fdac542 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -62,7 +62,7 @@ GEM aws-sigv4 (~> 1.1) aws-sigv4 (1.1.0) aws-eventstream (~> 1.0, >= 1.0.2) - bcrypt (3.1.12) + bcrypt (3.1.13) bindex (0.5.0) bootsnap (1.4.5) msgpack (~> 1.0) @@ -87,10 +87,10 @@ GEM crass (1.0.4) declarative (0.0.10) declarative-option (0.1.0) - devise (4.6.2) + devise (4.7.0) bcrypt (~> 3.0) orm_adapter (~> 0.1) - railties (>= 4.1.0, < 6.0) + railties (>= 4.1.0) responders warden (~> 1.2.3) dotenv (2.7.5) @@ -228,9 +228,9 @@ GEM declarative (< 0.1.0) declarative-option (< 0.2.0) uber (< 0.2.0) - responders (2.4.1) - actionpack (>= 4.2.0, < 6.0) - railties (>= 4.2.0, < 6.0) + responders (3.0.0) + actionpack (>= 5.0) + railties (>= 5.0) resque (2.0.0) mono_logger (~> 1.0) multi_json (~> 1.0) From 28d95966057ec500304ec91c0403bd31e2ed585d Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Sun, 1 Sep 2019 21:43:46 +0000 Subject: [PATCH 10/12] Bump pundit from 2.0.1 to 2.1.0 Bumps [pundit](https://github.com/varvet/pundit) from 2.0.1 to 2.1.0. 
- [Release notes](https://github.com/varvet/pundit/releases) - [Changelog](https://github.com/varvet/pundit/blob/master/CHANGELOG.md) - [Commits](https://github.com/varvet/pundit/compare/v2.0.1...v2.1.0) Signed-off-by: dependabot-preview[bot] --- Gemfile.lock | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Gemfile.lock b/Gemfile.lock index 5fdac542..e5aa0810 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -180,7 +180,7 @@ GEM public_suffix (4.0.1) puma (4.1.0) nio4r (~> 2.0) - pundit (2.0.1) + pundit (2.1.0) activesupport (>= 3.0.0) rack (2.0.7) rack-cors (1.0.3) From c79bde1556dd7e16de3b29e0849149ed01b30a9c Mon Sep 17 00:00:00 2001 From: Rob Brackett Date: Thu, 5 Sep 2019 18:19:01 -0700 Subject: [PATCH 11/12] Document `ALLOWED_ARCHIVE_HOSTS` and S3 Buckets (#595) Add more explanation to `.env.example` and to the README about how data is managed and stored in the application. Also explain the overall data model and document the public access user on staging (see edgi-govdata-archiving/web-monitoring-ui#220). Fixes #249. Fixes #469. Co-Authored-By: Kevin Nguyen --- .env.example | 8 +++++- README.md | 81 +++++++++++++++++++++++++++++++++++++++++++++++----- 2 files changed, 81 insertions(+), 8 deletions(-) diff --git a/.env.example b/.env.example index 66a0f61c..2745726a 100644 --- a/.env.example +++ b/.env.example @@ -23,7 +23,13 @@ MAIL_SENDER='some-email-account@example.com' # MAIL_SMTP_PASSWORD='XXX' # MAIL_SMTP_TLS='true' -# Controls URLs that won't be downloaded and re-hosted when importing versions +# URLs that won't be downloaded and re-hosted when importing versions. +# When new page or version data is imported (e.g. via `POST /api/v0/imports`), +# the `uri` field points to a location where the raw HTTP response body is +# stored. If the `uri` host does *not* match one of the values in +# `ALLOWED_ARCHIVE_HOSTS`, the application downloads the data from `uri` and +# stores it (see `lib/archiver` for more). 
That way, we can ensure data is +# always available to API users from a reliable public location. ALLOWED_ARCHIVE_HOSTS='https://edgi-web-monitoring-db.s3.amazonaws.com/ https://edgi-wm-versionista.s3.amazonaws.com/ https://edgi-wm-versionista.s3-us-west-2.amazonaws.com/ https://s3-us-west-2.amazonaws.com/edgi-wm-versionista/ https://edgi-versionista-archive.s3.amazonaws.com/ https://edgi-versionista-archive.s3-us-west-2.amazonaws.com/ https://s3-us-west-2.amazonaws.com/edgi-versionista-archive/' # OPTIONAL: Uncomment & fill in to use S3 for storage instead of your local diff --git a/README.md b/README.md index 0f4eb748..96064f77 100644 --- a/README.md +++ b/README.md @@ -2,13 +2,22 @@ # web-monitoring-db -This repository is the database and API underlying the EDGI [Web Monitoring Project](https://github.com/edgi-govdata-archiving/web-monitoring). +This repository is the database and API underlying the EDGI [Web Monitoring Project](https://github.com/edgi-govdata-archiving/web-monitoring). It’s a Rails app that: -It’s a Rails app that: +- Acts as a database of monitored pages and captured versions of those pages over time. -- Acts as a database of monitored pages and revisions that have been made to them -- Allows other services to add new tracked pages/versions (we are currently focused on Versionista, but this database will soon host data from other sources, such as the Internet Archive) -- Provides an API to get that version data and allow analysts or other automated tools to annotate those versions with metadata + *(The application does not record new versions itself, but relies on importing data from external services, like [the Internet Archive](https://archive.org) or [Versionista](https://versionista.com). 
See [“How Data Gets Loaded”](#how-data-gets-loaded) below for more.)* + +- Provides an API to get that page and version data, and to allow analysts or other automated tools to annotate those versions with metadata about what has changed from version to version. + +For more about how data is modeled in this project, see [“Data Model”](#data-model) below. + +API documentation is available from the homepage of the application, e.g. by pointing your browser to http://localhost:3000/ or https://api.monitoring.envirodatagov.org. It’s generated from our OpenAPI docs in [`swagger.yml`](./swagger.yml). + +We maintain a publicly available *staging server* at https://api-staging.monitoring.envirodatagov.org that you can test against. It runs the latest code and has non-production data — it’s safe to modify or post new versions or annotations to, but you should not rely on that data sticking around; it may get reset at any time. **For access, ask for an account on Slack or use the public user credentials:** + +- Username: `public.access@envirodatagov.org` +- Password: `PUBLIC_ACCESS` ## Installation @@ -175,7 +184,7 @@ It’s a Rails app that: - `analysis`: Auto-analyze changes between versions and create annotations with the results. -## Manual Postgres Setup +### Manual Postgres Setup If you don’t want to populate your DB with seed data, want to manage creation of the database yourself, or otherwise manually do database setup, run any of the following commands as desired instead of `rake db:setup`: @@ -197,7 +206,7 @@ User.create( ``` -## Docker +### Docker The Dockerfile runs the rails server on port 3000 in the container. To build and run: @@ -212,6 +221,64 @@ docker run -p 6379:6379 envirodgi/db-import-worker -e . Point your browser or ``curl`` at ``http://localhost:3000``. +## Data Model + +The database models three main types of data: + +- **Pages**, which represent a page on the internet. 
Pages are identified by a unique ID rather than their URL because pages can move or be available from multiple URLs. *(Note: we don't actually model that yet, though! See [#492](https://github.com/edgi-govdata-archiving/web-monitoring-db/issues/492) for more.)* + +- **Versions**, which represent a particular page at a particular point in time. We use the term “version” instead of others more common in the archival space because we attempt to only represent *different* versions. That is, if a page changed on Wednesday and we captured copies of it on Monday, Tuesday, and Wednesday, we only make version records for Monday and Wednesday (because Tuesday was the same as Monday). + + *(Note: because of technical issues around imported data, we often store more versions than we should according to the above definition [e.g. we might still have a record for Tuesday]. Versions have a `different` field that indicates whether a version is different from the previous one, and the API only returns versions that are `different` unless you explicitly request otherwise.)* + +- **Annotations**, which represent an analysis about what’s changed between any two *versions* of a *page*. Annotations have a specialized `priority` and `significance`, which are numbers between 0 and 1, an `author`, indicating who made the analysis (it could be a bot account), and an `annotation` field, which is a JSON object with no specified structure (inside this field, annotations can include any data desired). + +There are several other kinds of objects, but they are subservient to the ones above: + +- **Changes**, which serve to connect any two *versions* of a *page*. *Annotations* are actually connected to *changes*, rather than directly to two *versions*. You can also generate diffs for a given *change*. + +- **Tags**, which can be applied to pages. They help sort and categorize things. Most tags are manually applied, but the application auto-generates a few: + - `domain:`, e.g. 
`domain:www.epa.gov` for a page at `https://www.epa.gov/citizen-science` + - `2l-domain:`, e.g. `2l-domain:epa.gov` for a page at `https://www.epa.gov/citizen-science` + +- **Maintainers**, which can be applied to pages. They represent organizations that maintain a given page. For example, the page at `https://www.epa.gov/citizen-science` is maintained by `EPA`. + +- **Imports** model requests to import new data and the results of the import operation. + +- **Users** model people (both humans and bots) who can view, import, and annotate data. You currently have to have a user account to do anything in the application, though we hope accounts will not be needed to view public data in the future. + +The actual database schemas for each of these tables are listed in [`db/schema.rb`](./db/schema.rb). + + +### How Data Gets Loaded + +The web-monitoring-db project does not actually monitor or scrape pages on the web. Instead, we rely on importing data from other services, like [the Internet Archive](https://archive.org). Each day, a script queries other services for historical snapshots and sends the results to the `/api/v0/imports` endpoint. + +Most of the data sent to `/api/v0/imports` matches up directly with the structure of the [`Version` model](./db/schema.rb). However, the `uri` field in an import is treated specially. + +When new page or version data is imported, the `uri` field points to a location where the raw HTTP response body can be retrieved. If the `uri` host matches one of the values in the [`ALLOWED_ARCHIVE_HOSTS` environment variable](./.env.example), the version record that gets added to the database will simply point to that external location as a source of raw response data. Otherwise, the application downloads the data from `uri` and stores it in its `FileStorage`. + +The intent is to make sure data winds up at a reliably available location, ensuring that anyone who can access the API can also access the raw response body for any version. 
Hosts should be listed in `ALLOWED_ARCHIVE_HOSTS` if they meet these criteria better than the application’s own file storage. The application’s storage area can be the local disk or it can be S3, depending on configuration. The component can take pluggable configurations, so we can support other storage types or locations in the future. + +You can see more about this process in: +- The overview repo’s [“architecture” document](https://github.com/edgi-govdata-archiving/web-monitoring/blob/master/ARCHITECTURE.md#web-page-snapshottingcapturing-workflow) +- The [import job code](./app/jobs/import_versions_job.rb), where imports are processed. +- The [`Archiver` module code](./lib/archiver/archiver.rb), where raw HTTP response data is saved. + + +### File Storage + +The application needs to store files for several different purposes (storing raw import data, archiving HTTP response bodies as described in the previous section, specialized logs, etc.). To do this, it uses the [`FileStorage`](https://github.com/edgi-govdata-archiving/web-monitoring-db/tree/master/lib/file_storage) module, which has different implementations for different types of storage, such as [the local disk](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/master/lib/file_storage/local_file.rb) or [Amazon S3](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/master/lib/file_storage/s3.rb). + +Currently, the application creates two `FileStorage` instances: + +1. “Archival storage” is used to store raw HTTP response bodies for each version of a page. See the [“how data gets loaded” section](#how-data-gets-loaded) for more details. Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the `AWS_ARCHIVE_BUCKET` environment variable. **Everything in this storage area is publicly available.** + +2. “Working storage” is used to store internal data, such as raw import data and import logs. 
Under a default configuration, this is your local disk in development and S3 in production. You can configure the S3 bucket used for it with the `AWS_WORKING_BUCKET` environment variable. **Everything in this storage area should be considered private and you should not expose it to the public web.** + +3. For historical reasons, EDGI’s deployment includes a third S3 bucket that is not directly accessed by the application. It’s where we store HTTP response bodies collected from [Versionista](https://versionista.com), a service we previously used for scraping government web pages. You can see it listed in [the example settings for `ALLOWED_ARCHIVE_HOSTS`](https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/master/.env.example). + + ## Code of Conduct This repository falls under EDGI's [Code of Conduct](https://github.com/edgi-govdata-archiving/overview/blob/master/CONDUCT.md). From afacb3ab584a84036d5c74445e3e0a2b3bba3250 Mon Sep 17 00:00:00 2001 From: "dependabot-preview[bot]" <27856297+dependabot-preview[bot]@users.noreply.github.com> Date: Tue, 10 Sep 2019 08:48:14 -0700 Subject: [PATCH 12/12] [Security] Bump devise from 4.7.0 to 4.7.1 (#597) Bumps [devise](https://github.com/plataformatec/devise) from 4.7.0 to 4.7.1. **This update includes a security fix.** - [Release notes](https://github.com/plataformatec/devise/releases) - [Changelog](https://github.com/plataformatec/devise/blob/master/CHANGELOG.md) - [Commits](https://github.com/plataformatec/devise/compare/v4.7.0...v4.7.1) Signed-off-by: dependabot-preview[bot] --- Gemfile.lock | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Gemfile.lock b/Gemfile.lock index e5aa0810..ee89f0bc 100644 --- a/Gemfile.lock +++ b/Gemfile.lock @@ -87,7 +87,7 @@ GEM crass (1.0.4) declarative (0.0.10) declarative-option (0.1.0) - devise (4.7.0) + devise (4.7.1) bcrypt (~> 3.0) orm_adapter (~> 0.1) railties (>= 4.1.0)