- Page: a web resource (HTML pages, PDFs, Excel spreadsheets, CSVs, images, etc.) crawled over time by one or more services like the Internet Archive or Versionista.
- Version: a snapshot of a Page at a specific time that is different from a previous Version.
- Change: a pairing of two different Versions of the same Page.
- Diff: a representation of a Change: this could be a plain-text `diff` (as in the UNIX command-line utility) or a richer representation, such as differences in the rendered HTML.
- Annotation: a set of key-value pairs characterizing a given Change, submitted by a human analyst or generated by an automated process. A given Change might be annotated by multiple analysts, thus creating multiple Annotations per Change.
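The relationships between these models can be sketched roughly as follows. This is a minimal illustration in Python; the class and field names are placeholders chosen for readability, not the actual web-monitoring-db schema.

```python
# Illustrative sketch of the data model described above. Field names are
# assumptions for clarity, not the authoritative web-monitoring-db schema.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Page:
    """A web resource tracked over time (HTML page, PDF, CSV, image, ...)."""
    url: str


@dataclass
class Version:
    """A snapshot of a Page at a specific capture time."""
    page: Page
    capture_time: datetime
    body_hash: str  # e.g. a SHA-256 hash of the raw response body


@dataclass
class Change:
    """A pairing of two different Versions of the same Page."""
    from_version: Version
    to_version: Version


@dataclass
class Annotation:
    """Key-value pairs characterizing a Change; a Change may have many."""
    change: Change
    author: str
    data: dict = field(default_factory=dict)
```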
The project is currently divided into several repositories handling complementary aspects of web monitoring. They can be developed and upgraded semi-independently, communicating via agreed-upon interfaces. For additional information, you can contact the active maintainers listed alongside each repo:
- web-monitoring-db (@Mr0grog) A Ruby on Rails app that serves database data via a REST API, serves diffs, and collects human-entered annotations.
- web-monitoring-ui (@lightandluck) A React front-end that provides useful views of the diffs. It communicates with the Rails app via JSON.
- web-monitoring-processing (@danielballan) A Python backend that pulls data from a source like the Internet Archive and computes diffs.
- web-monitoring-versionista-scraper (@Mr0grog) A set of Node.js scripts used to extract data from Versionista and load it into the database. It also generates the CSV files that analysts currently use in Google Spreadsheets to review changes. This project runs on its own, but in the future may be managed by or merged into web-monitoring-processing.
For more details about the models we use in Scanner, see web-monitoring-db's API documentation.
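As a rough illustration of how a client reads these models over the Rails app's REST API, here is a hedged sketch using Python and `requests`. The base URL, endpoint paths, and response field names below are assumptions for illustration only; the API documentation is authoritative.

```python
# Sketch of listing Pages and their Versions from web-monitoring-db's REST
# API. Endpoint paths, field names, and the base URL are assumptions for
# illustration; consult the API documentation for the real interface.
import requests

DB_URL = "https://api.monitoring.example.org"  # hypothetical base URL

session = requests.Session()
session.auth = ("analyst@example.com", "password")  # placeholder credentials

# List pages (assumed endpoint: GET /api/v0/pages)
pages = session.get(f"{DB_URL}/api/v0/pages").json()["data"]

# Fetch the versions of the first page
# (assumed endpoint: GET /api/v0/pages/<page_id>/versions)
page_id = pages[0]["uuid"]
versions = session.get(f"{DB_URL}/api/v0/pages/{page_id}/versions").json()["data"]

for version in versions:
    print(version["capture_time"], version["uri"])
```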
Diagram key | What happens | Which component does this | How | Criteria |
---|---|---|---|---|
A | Versionista is scraped | web-monitoring-versionista-scraper | Raw version bodies are scraped from Versionista, uploaded to S3, and the metadata (capture times, URLs, headers, etc.) are formatted and POSTed to -db | Runs on a cron job, formats per GH issue comment |
B | Internet Archive's Wayback Machine is queried for imports | web-monitoring-processing - this PR | Data is pulled from the IA API, formatted, and POSTed to -db | Runs on a cron job, formats per GH issue comment |
C | New metadata arrives at -db and is stored in a database | Scripts running on cron jobs that do ETL via scrapes or APIs (these scripts: Versionista scraper, Wayback Machine importer) | POST to /api/v0/imports as a JSON array or newline-delimited JSON stream (stream preferred) | Contains: |
D | Determination is made whether or not to download the data | -db | Checks whether the URL points to an acceptable, publicly readable location | Acceptable locations are configured per .env |
↳ E1 | URL for raw data is stored | -db (URI in metadata is untouched) | -db stores URL | Happens if -db has determined not to download the data |
↳ E2 | Raw data is verified and stored at a URL we control and maintain | -db (URI in metadata is changed to point to our new location) | Raw response data is downloaded from the URL, verified against the SHA-256 hash from the initial POST, and stored (in production) in a public S3 bucket | Happens if -db has determined to download the data |
F | (Success state: data has been stored) | | | |
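Step C above accepts imports as either a JSON array or a newline-delimited JSON stream (the stream is preferred). The sketch below shows what such an import POST might look like; the metadata field names, content type, and base URL are illustrative assumptions, and -db's import documentation defines the actual required fields.

```python
# Sketch of POSTing version metadata to -db's import endpoint as
# newline-delimited JSON. The record fields, content type, and base URL are
# assumptions for illustration, not the authoritative import schema.
import hashlib
import json
import requests

DB_URL = "https://api.monitoring.example.org"  # hypothetical base URL

raw_body = b"<html>...captured response body...</html>"

record = {
    "page_url": "https://www.example.gov/some-page",
    "capture_time": "2017-06-01T12:00:00Z",
    # Location where -db can fetch the raw body (step D decides whether to).
    "uri": "https://example-bucket.s3.amazonaws.com/some-capture",
    # SHA-256 of the raw body, which lets -db verify the download in step E2.
    "version_hash": hashlib.sha256(raw_body).hexdigest(),
    "source_type": "wayback_machine",
    "source_metadata": {"snapshot_id": "20170601120000"},
}

# One JSON object per line; a real import would stream many records.
ndjson = "\n".join(json.dumps(r) for r in [record])

response = requests.post(
    f"{DB_URL}/api/v0/imports",
    data=ndjson,
    # Content type for a newline-delimited JSON stream; an assumption here.
    headers={"Content-Type": "application/x-json-stream"},
    auth=("importer@example.com", "password"),  # placeholder credentials
)
response.raise_for_status()
print(response.json())
```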