
First traffic collection on the web #3

Closed
baltpeter opened this issue Sep 9, 2024 · 8 comments
@baltpeter (Member)

So far, all of our traffic collections have covered mobile apps on Android and iOS. We are now working on extending Tweasel to the web, so we also need data on tracking requests on the web.

@baltpeter baltpeter self-assigned this Sep 9, 2024
@baltpeter (Member Author)

The first decision we need to make is which websites we want to visit.

After looking at the list of lists included in Tranco, I think the Chrome User Experience Report (CrUX) is going to be the best for our use case.

The other lists (and thus, Tranco itself) are too biased towards domains that see DNS lookups rather than actual websites, which are what we are interested in. For example, at least 12 of the current top 25 entries in the Tranco list are CDNs, DNS servers, and similar infrastructure domains, not websites:

amazonaws.com
akamai.net
a-msedge.net
root-servers.net
akamaiedge.net
gstatic.com
tiktokcdn.com
googletagmanager.com
googlevideo.com
gtld-servers.net
akadns.net
windowsupdate.com

As far as accessing the CrUX data goes, there is https://github.com/crissyfield/crux-dumps, which has the very laudable goal of sparing you from having to deal with BigQuery. :D

I looked at the top 10k as of 2024/08. Unfortunately, I don't think that is going to be helpful for our use case, either. For example, look at the included entries for www.google.*:

https://www.google.bg
https://www.google.ch
https://www.google.co.il
https://www.google.co.nz
https://www.google.co.za
https://www.google.com.eg
https://www.google.com.my
https://www.google.com.pk
https://www.google.com.sa
https://www.google.com.sg
https://www.google.com.ua
https://www.google.dk
https://www.google.fi
https://www.google.hr
https://www.google.ie
https://www.google.sk

So, I had to use BigQuery after all to grab the top 10k for Germany using the following query:

```sql
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.country_de.202408`
WHERE experimental.popularity.rank <= 10000
```

Dump: bquxjob_5ded4b14_19209284743.json, bquxjob_5ded4b14_19209284743.csv
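
A note for anyone reproducing this: in the CrUX BigQuery project, each country has its own dataset (`country_de` above), tables are named by month (`202408`), and `experimental.popularity.rank` is a rank magnitude bucket (1,000, 10,000, 100,000, …) rather than an exact position, so `rank <= 10000` selects the top 10k origins. That detail comes from the CrUX documentation, not this thread. As a hedged sketch, the global ranking that produced the google.* duplicates above should correspond to the `all` dataset:

```sql
-- Sketch: global top 10k origins (the ranking with all the
-- country-specific google.* variants), from the `all` dataset.
-- Dataset name per the CrUX docs; not verified in this thread.
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.all.202408`
WHERE experimental.popularity.rank <= 10000
```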

@baltpeter (Member Author) commented Sep 23, 2024

The analysis has been running for a few days now. Code is in: https://github.com/tweaselORG/experiments/tree/main/web-monkey-september-2024

@baltpeter (Member Author)

Final tally: the HAR files total ~300 GB; the exported SQLite database of just the requests is ~20 GB.

In total, we have ~2.3 million requests across 9819 distinct initiator origins.
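
A minimal sketch of how those figures fall out of the database, assuming the exported SQLite file has a `requests` table with an `initiator` column (the trimming steps further down use exactly those names):

```sql
-- Total number of collected requests.
SELECT count(*) FROM requests;
-- Number of distinct initiator origins, i.e. sites that produced traffic.
SELECT count(DISTINCT initiator) FROM requests;
```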

@baltpeter (Member Author)

Following the investigation and discussion in tweaselORG/data.tweasel.org#3, we have decided that for the time being we can only publish the data from the top 1k sites, unfortunately.

@baltpeter (Member Author)

Downloaded the data for the top 1k from BigQuery:

```sql
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.country_de.202408`
WHERE experimental.popularity.rank <= 1000
```

Dump: bquxjob_3370bb45_192c2744035.csv, bquxjob_3370bb45_192c2744035.json

@baltpeter (Member Author)

Oh wait, the 10k dataset also has the rank information, so I wouldn't have needed to do that. Well, anyway…
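
In other words, the 1k list could have been derived locally from the 10k dump; a hypothetical sketch, assuming it were imported as a table `ranks_10k` with `origin` and `rank` columns:

```sql
-- Hypothetical: derive the top 1k from the already-downloaded 10k dump.
-- Since CrUX ranks are buckets, rank <= 1000 is exactly the top 1k.
SELECT origin FROM ranks_10k WHERE rank <= 1000;
```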

@baltpeter (Member Author)

To trim the dataset, I did the following (see the consolidated SQL sketch after the list):

  1. Opened the web-monkey-september-2024.db database in DB Browser for SQLite.
  2. Used the File -> Import -> Table from CSV file feature to import the 1k dump.
  3. Ran delete from requests where initiator not in (select origin from ranks), which took a cool 338,321 ms (almost six minutes) with 5,342,925 rows affected. Afterwards, we have 592,377 rows in the raw dataset (instead of 5,935,302).
  4. To verify that the query did what I expected, I ran select count(distinct initiator) from requests. That returned 988, which seems reasonable (some sites may just not have worked).
  5. Ran vacuum to free up disk space. Now, the DB is 2.1 GB instead of 19.9 GB.
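
For reference, steps 3 to 5 as a single SQLite script; a sketch that assumes the CSV from step 2 was imported as a table named `ranks` with an `origin` column:

```sql
-- Keep only requests initiated by one of the top 1k origins.
DELETE FROM requests
WHERE initiator NOT IN (SELECT origin FROM ranks);

-- Sanity check: distinct initiator origins remaining
-- (988 here; a bit under 1000 since some sites did not work).
SELECT count(DISTINCT initiator) FROM requests;

-- Reclaim the disk space freed by the DELETE.
VACUUM;
```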

@baltpeter (Member Author)

For anyone interested, I uploaded a dump of the full dataset for the top 10k to Zenodo: https://zenodo.org/records/13990110
