The Internet Archive import script(s) (`wm import ia` and `wm import ia-known-pages`) should have an option that causes them to upload Mementos to S3:
- S3 credentials should be read from the standard AWS environment variables (`AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`).
- Setting this should cause the importer to save memento bodies in S3 before sending import metadata to web-monitoring-db. The metadata's `uri` property should be rewritten to point to the uploaded location in S3. The objects in S3 should be named by their SHA-256 base-16-encoded hash and their `Content-Type` header should be set appropriately, as web-monitoring-db currently does: https://github.com/edgi-govdata-archiving/web-monitoring-db/blob/46561ae6eb52b0d923f7832100d161fc98667d0c/lib/archiver/archiver.rb#L36-L66
- The value of the `--s3` option should be one of:
  - A bucket name, e.g. `edgi-wm-archive`
  - A bucket name and path prefix, e.g. `edgi-wm-archive/wayback/`
  - An S3 URI, e.g. `s3://edgi-wm-archive`
  - An S3 URI with a path prefix, e.g. `s3://edgi-wm-archive/wayback/`
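As a rough illustration of how the four accepted `--s3` forms could be normalized, here is a minimal Python sketch. The names `S3Location` and `parse_s3_location` are hypothetical and not part of the existing scripts. Representing the prefix as either an empty string or a path ending in `/` is also just an assumption, made so the prefix can be prepended directly to an object name.

```python
from typing import NamedTuple


class S3Location(NamedTuple):
    """Hypothetical container for a parsed --s3 value."""
    bucket: str
    prefix: str  # Either "" or a path that ends with "/".


def parse_s3_location(value: str) -> S3Location:
    """Parse a bucket name or S3 URI, each with an optional path prefix."""
    text = value.strip()
    # Drop the URI scheme if one was given.
    if text.lower().startswith("s3://"):
        text = text[len("s3://"):]
    # Everything before the first slash is the bucket; the rest is the prefix.
    bucket, _, prefix = text.partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {value!r}")
    # Normalize the prefix so it can be joined directly with an object name.
    prefix = prefix.strip("/")
    return S3Location(bucket, f"{prefix}/" if prefix else "")
```

With this approach, `edgi-wm-archive/wayback/` and `s3://edgi-wm-archive/wayback/` both resolve to the same bucket and prefix, so the rest of the importer would not need to care which form was used.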
Why?
When we use the scripts here to import data from the Wayback Machine, we process the mementos from Wayback and then send the metadata to the DB’s import API. The API doesn’t actually accept the raw data of the response bodies (it’s complicated to do in a safe and effective way, and, although there is a placeholder for it in the code, we never made it happen).
The flow is something like this: the import script loads a memento from Wayback and sends its metadata to web-monitoring-db, which then loads the same memento from Wayback again on its side.
Both the import script and web-monitoring-db have to get data from Wayback. The problem is that when the Wayback Machine loads slowly (either because it’s under heavy load or we’re getting a memento that is rarely accessed), the memento might fail to load on the web-monitoring-db side, and that record then fails to save. This typically happens in about 1-2 of every 2000 mementos. Besides the occasional failure, this double-loading is also a waste of bandwidth and resources for both us and the archive!
(Note: The failures aren’t a serious problem because we typically grab overlapping sets of data from the Wayback Machine each time we run the script. The likelihood of a memento failing this way across multiple imports is pretty low. We do the overlap to work around the fact that Wayback has frequent indexing issues that sometimes cause mementos to be unfindable until several days after they were archived. The overlap period is longer than such outages typically last.)
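To put a rough number on "pretty low": if each import run failed independently at the 1-2 in 2000 rate quoted above, a memento covered by two overlapping runs would fail in both somewhere between 0.25 and 1 times in a million. Independence is an assumption here, not something measured. A rarely accessed memento may load slowly every time, so the real figure could be higher. The function name `repeat_failure_rate` is made up for this sketch.

```python
def repeat_failure_rate(single_rate: float, imports: int) -> float:
    """Chance that a memento fails in every one of `imports` runs.

    Assumes each run fails independently with probability `single_rate`.
    """
    return single_rate ** imports


# The per-run failure rates quoted in the issue: 1 to 2 of every 2000 mementos.
low, high = 1 / 2000, 2 / 2000

for runs in (1, 2, 3):
    print(runs, repeat_failure_rate(low, runs), repeat_failure_rate(high, runs))
```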
We can work around this by doing what the old Versionista import script used to do: upload to S3 ourselves, before sending the metadata to web-monitoring-db. Web-monitoring-db will automatically skip loading mementos if the location it’s given is in an S3 bucket it already knows is OK.
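Basically, we want the workflow to be more like: the import script loads a memento from Wayback once, uploads the body to S3, and sends web-monitoring-db metadata whose location points at the S3 copy, so web-monitoring-db never has to load that memento from Wayback itself.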
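A minimal Python sketch of that upload-then-import step might look like the following. The function name `upload_and_rewrite` is hypothetical, and the uploader is passed in as a callable so the sketch runs without AWS access. The virtual-hosted-style URL written into `uri` is also an assumption, since the issue does not say which URL form web-monitoring-db expects.

```python
import hashlib
from typing import Callable


def upload_and_rewrite(
    record: dict,
    body: bytes,
    content_type: str,
    bucket: str,
    prefix: str,
    put_object: Callable[..., object],
) -> dict:
    """Store `body` in S3, then return a copy of `record` pointing at it."""
    # Name the object by the base-16 (hex) SHA-256 hash of its body.
    key = prefix + hashlib.sha256(body).hexdigest()
    # `put_object` is called with the same keyword arguments that boto3's
    # S3 client method of that name accepts.
    put_object(Bucket=bucket, Key=key, Body=body, ContentType=content_type)
    # Assumption: a virtual-hosted-style S3 URL is an acceptable `uri` value.
    return {**record, "uri": f"https://{bucket}.s3.amazonaws.com/{key}"}
```

In real use, `put_object` could be `boto3.client("s3").put_object`. By default boto3 picks up credentials from `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which matches the requirement above. Because the object name is a content hash, uploading the same body twice writes to the same key, so overlapping imports should not create duplicate objects.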