Stable all.requests
#5
Conversation
Would really appreciate review here, as I'm almost ready with backfill PR #10.
Hey @max-ostapenko, sorry for the radio silence here. I mistakenly presumed this was discussed on the last HTTP Archive maintainers call (which I missed), but I just found out that @rviscomi missed it too, so I think we both presumed the other was dealing with this. I think it would be good to discuss this on a call (whether the next planned one on 15th October, or before). Questions/concerns I have include:
I would definitely like to discuss it before the next crawl. And notify users of the upcoming changes beforehand.
Please drop here any issues or comments that I've missed.
The reprocessing will write into the intermediate table all.requests_stable. Once the reprocessing is done, we can eventually replace the live table with it.
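A minimal sketch of what that final swap could look like, assuming the cut-over is a BigQuery copy job that overwrites the live table (project, dataset and table names are taken from context and may differ; the real cut-over might instead be a rename or a scheduled action):

```python
from google.cloud import bigquery

# Assumption: once validated, the reprocessed all.requests_stable table
# replaces the live all.requests table.
client = bigquery.Client(project="httparchive")  # project name assumed

copy_config = bigquery.CopyJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE  # overwrite the live table
)
job = client.copy_table(
    "httparchive.all.requests_stable",  # intermediate (source)
    "httparchive.all.requests",         # live table (destination)
    job_config=copy_config,
)
job.result()  # block until the copy finishes
```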
We can only estimate the runtime by running the first month. In September 2024 it took 6 hours for the requests tables without any transformations.
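Runtime does have to be measured by actually running a month, but a dry run at least bounds the bytes scanned up front. A minimal sketch, assuming the backfill reads from the existing all.requests table (the query is illustrative, not the real backfill query):

```python
from google.cloud import bigquery

client = bigquery.Client(project="httparchive")  # project name assumed

# Illustrative one-month slice; the real backfill query will differ.
sql = """
SELECT *
FROM `httparchive.all.requests`
WHERE date = '2024-09-01'
"""

dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=dry_run)  # returns immediately, nothing is billed

print(f"One month would scan ~{job.total_bytes_processed / 1024 ** 4:.1f} TiB")
```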
Ideally we start this before the new crawl starts, so that we can already have it synced with the new schema and cleaned objects, ready to just run. If anything is not yet synced in the WPT agent before the crawl, we can always clean up the last crawl's data within the pipeline.
Queries reading from struct columns need to be adjusted to the new schema to be valid.
We could pull
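For context, a hedged sketch of the kind of adjustment meant above, assuming a query that previously addressed summary with struct dot notation now has to extract the same value from a JSON column (the column and field names are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client(project="httparchive")  # project name assumed

# Before (illustrative): a struct field addressed with dot notation.
old_sql = """
SELECT url, summary.respBodySize AS respBodySize
FROM `httparchive.all.requests`
WHERE date = '2024-09-01'
"""

# After (illustrative): the same value extracted from a JSON column and cast
# back to the type the downstream query expects.
new_sql = """
SELECT url,
       SAFE_CAST(JSON_VALUE(summary, '$.respBodySize') AS INT64) AS respBodySize
FROM `httparchive.all.requests_stable`
WHERE date = '2024-09-01'
"""

rows = client.query(new_sql).result()
```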
We need to look at native JSON columns, as there were some that couldn't be processed that way and so had to use the JavaScript JSON columns. See: HTTPArchive/httparchive.org#923 (comment). Is this going to be a problem if we move to native JSON columns?
Ignore.
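One hedged way to size up that concern before the migration, assuming payload is currently stored as a JSON-encoded STRING and the earlier failures were values that BigQuery's native JSON parser rejects (table, column and date values are illustrative):

```python
from google.cloud import bigquery

client = bigquery.Client(project="httparchive")  # project name assumed

# SAFE.PARSE_JSON returns NULL instead of raising an error when a string
# cannot be parsed into BigQuery's native JSON type.
sql = """
SELECT
  COUNTIF(payload IS NOT NULL AND SAFE.PARSE_JSON(payload) IS NULL) AS unparseable,
  COUNT(*) AS total
FROM `httparchive.all.requests`
WHERE date = '2024-09-01'
"""

row = next(iter(client.query(sql).result()))
print(f"{row.unparseable} of {row.total} payloads fail native JSON parsing")
```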
LGTM with one small request to leave the crawlid and pageid in summary.
Agreed. It could work, but I'm happy to leave it as is.
Co-authored-by: Barry Pollard <barrypollard@google.com>
Schema changes
- summary column trimmed of a number of metrics (Closes "The new schema and cost concerns for users" data-pipeline#149)
- payload column trimmed of headers
- rank data added and used for clustering (Resolves "Add rank field to all.requests" data-pipeline#189, Closes "Consider clustering all.requests table by page or rank" data-pipeline#263)
- payload and summary columns are of JSON type

Intermediate table all.requests_stable
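A hedged sketch of what the intermediate table's definition might look like under the schema changes above; the real table is much wider and its partitioning/clustering spec may differ, this only illustrates the JSON columns and rank clustering:

```python
from google.cloud import bigquery

client = bigquery.Client(project="httparchive")  # project name assumed

# Illustrative subset of columns only.
ddl = """
CREATE TABLE IF NOT EXISTS `httparchive.all.requests_stable` (
  date DATE NOT NULL,
  client STRING NOT NULL,
  page STRING NOT NULL,
  rank INT64,
  url STRING NOT NULL,
  summary JSON,  -- trimmed summary, now native JSON
  payload JSON   -- trimmed payload (headers removed), now native JSON
)
PARTITION BY date
CLUSTER BY client, rank
"""

client.query(ddl).result()
```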
Run reprocessing jobs with the all_requests_stable tag.

After reprocessing