First traffic collection on the web #3
The first decision we need to make is which websites we want to visit. After looking at the list of lists included in Tranco, I think the Chrome User Experience Report (CrUX) is going to be the best for our use case. The other lists (and thus Tranco itself) are too biased towards DNS lookups rather than the actual websites we are interested in. For example, at least 12 of the top 25 rows in the current Tranco list are CDNs, DNS servers, etc., but not websites:
As far as accessing the CrUX data goes, there is https://github.com/crissyfield/crux-dumps, which has the very laudable goal of relieving you from having to deal with BigQuery. :D I looked at the top 10k as of 2024/08. Unfortunately, I don't think that is going to be helpful for our use case, either. For example, look at the included entries for
So, I had to use BigQuery after all to grab the top 10k for Germany using the following query:
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.country_de.202408`
WHERE experimental.popularity.rank <= 10000
Dump: bquxjob_5ded4b14_19209284743.json, bquxjob_5ded4b14_19209284743.csv
The analysis has been running for a few days now. Code is in: https://github.com/tweaselORG/experiments/tree/main/web-monkey-september-2024
Final tally: The HAR files total ~300 GB, and the exported SQLite database of just the requests is ~20 GB. In total, we have ~2.3 million requests across 9819 distinct initiator origins.
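For anyone who wants to reproduce a tally like this from a SQLite export, here is a minimal sketch. The schema is an assumption, not taken from the actual database: it supposes a hypothetical table `requests` with a hypothetical `initiator` column that holds the URL of the page that triggered each request. Adjust both names to whatever the real export uses.

```python
import sqlite3
from urllib.parse import urlsplit


def origin_of(url: str) -> str:
    """Return scheme://netloc for a URL, or '' if it has no scheme or host."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return ""
    return f"{parts.scheme}://{parts.netloc}"


def tally(conn: sqlite3.Connection) -> tuple[int, int]:
    """Count all requests and the distinct initiator origins.

    Assumes a hypothetical `requests` table with an `initiator` URL column.
    """
    total = conn.execute("SELECT COUNT(*) FROM requests").fetchone()[0]
    rows = conn.execute(
        "SELECT DISTINCT initiator FROM requests WHERE initiator IS NOT NULL"
    )
    # Several initiator URLs can share one origin, so reduce to origins first.
    origins = {origin_of(row[0]) for row in rows}
    origins.discard("")  # drop values that could not be parsed as URLs
    return total, len(origins)
```

Calling `tally(sqlite3.connect("requests.db"))` (the file name is a placeholder) returns a `(request_count, origin_count)` pair. The origin is derived in Python rather than in SQL because SQLite has no built-in URL parsing.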
Following the investigation and discussion in tweaselORG/data.tweasel.org#3, we have unfortunately decided that, for the time being, we can only publish the data from the top 1k sites.
Downloaded the data for the top 1k from BigQuery:
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.country_de.202408`
WHERE experimental.popularity.rank <= 1000
Dump: bquxjob_3370bb45_192c2744035.csv, bquxjob_3370bb45_192c2744035.json
Oh wait, the 10k dataset also has the rank information, so I wouldn't have needed to do that. Well, anyway…
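Since the 10k dump already carries the rank, the top 1k could also be derived locally. A minimal sketch, assuming the CSV export has a header row with the columns `origin` and `rank` (the column names are an assumption based on the query, and the helper names are made up for illustration):

```python
import csv
import io


def read_dump(fileobj) -> list[dict]:
    """Read a CSV dump with a header row into a list of dicts."""
    return list(csv.DictReader(fileobj))


def filter_by_rank(rows: list[dict], max_rank: int) -> list[dict]:
    """Keep only the rows whose `rank` is at most `max_rank`."""
    return [row for row in rows if int(row["rank"]) <= max_rank]


# Tiny inline sample standing in for the real dump file.
sample = io.StringIO(
    "origin,rank\n"
    "https://a.example,1000\n"
    "https://b.example,10000\n"
)
top_1k = filter_by_rank(read_dump(sample), 1000)
```

For the real file, pass `open("bquxjob_5ded4b14_19209284743.csv", newline="")` instead of the inline sample.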
To trim the dataset, I:
For anyone interested, I uploaded a dump of the full dataset for the top 10k to Zenodo: https://zenodo.org/records/13990110
So far, all of our traffic collections have been about mobile apps on Android and iOS. We are now working on extending Tweasel for the web, so we also need data on tracking requests on the web.