
First traffic collection on the web #3

Closed
baltpeter opened this issue Sep 9, 2024 · 8 comments
@baltpeter (Member)

So far, all of our traffic collections have covered mobile apps on Android and iOS. We are now working on extending Tweasel to the web, so we also need data on tracking requests on the web.

@baltpeter baltpeter self-assigned this Sep 9, 2024
@baltpeter (Member Author)

The first decision we need to make is which websites we want to visit.

After looking at the list of lists included in Tranco, I think the Chrome User Experience Report (CrUX) is going to be the best for our use case.

The other lists (and thus, Tranco itself) are too biased towards domains that see DNS lookups rather than actual websites, which are what we are interested in. For example, at least 12 of the current top 25 entries in the Tranco list are CDNs, DNS servers, and similar infrastructure domains, not websites:

amazonaws.com
akamai.net
a-msedge.net
root-servers.net
akamaiedge.net
gstatic.com
tiktokcdn.com
googletagmanager.com
googlevideo.com
gtld-servers.net
akadns.net
windowsupdate.com

As far as accessing the CrUX data goes, there is https://github.com/crissyfield/crux-dumps, which has the very laudable goal of sparing you from having to deal with BigQuery. :D

I looked at the top 10k as of 2024/08. Unfortunately, I don't think that is going to be helpful for our use case, either. For example, look at the included entries for www.google.*:

https://www.google.bg
https://www.google.ch
https://www.google.co.il
https://www.google.co.nz
https://www.google.co.za
https://www.google.com.eg
https://www.google.com.my
https://www.google.com.pk
https://www.google.com.sa
https://www.google.com.sg
https://www.google.com.ua
https://www.google.dk
https://www.google.fi
https://www.google.hr
https://www.google.ie
https://www.google.sk

So, I had to use BigQuery after all to grab the top 10k for Germany using the following query:

```sql
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.country_de.202408`
WHERE experimental.popularity.rank <= 10000
```

Dump: bquxjob_5ded4b14_19209284743.json, bquxjob_5ded4b14_19209284743.csv
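
A note for anyone reproducing this: in the CrUX BigQuery project, each country has its own dataset (`country_de` above), tables are named by month (`202408`), and `experimental.popularity.rank` is a rank magnitude bucket (1,000, 10,000, 100,000, …) rather than an exact position, so `rank <= 10000` selects the top 10k origins. That detail comes from the CrUX documentation, not this thread. As a hedged sketch, the global ranking that produced the google.* duplicates above should correspond to the `all` dataset:

```sql
-- Sketch: global top 10k origins (the ranking with all the
-- country-specific google.* variants), from the `all` dataset.
-- Dataset name per the CrUX docs; not verified in this thread.
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.all.202408`
WHERE experimental.popularity.rank <= 10000
```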

@baltpeter (Member Author) commented Sep 23, 2024

The analysis has been running for a few days now. Code is in: https://github.com/tweaselORG/experiments/tree/main/web-monkey-september-2024

@baltpeter (Member Author)

Final tally: the HAR files total ~300 GB; the exported SQLite database of just the requests is ~20 GB.

In total, we have ~2.3 million requests across 9819 distinct initiator origins.
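
A minimal sketch of how those figures fall out of the database, assuming the exported SQLite file has a `requests` table with an `initiator` column (the trimming steps further down use exactly those names):

```sql
-- Total number of collected requests.
SELECT count(*) FROM requests;
-- Number of distinct initiator origins, i.e. sites that produced traffic.
SELECT count(DISTINCT initiator) FROM requests;
```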

@baltpeter (Member Author)

Following the investigation and discussion in tweaselORG/data.tweasel.org#3, we have decided that for the time being we can only publish the data from the top 1k sites, unfortunately.

@baltpeter (Member Author)

Downloaded the data for the top 1k from BigQuery:

```sql
SELECT DISTINCT origin, experimental.popularity.rank
FROM `chrome-ux-report.country_de.202408`
WHERE experimental.popularity.rank <= 1000
```

Dump: bquxjob_3370bb45_192c2744035.csv, bquxjob_3370bb45_192c2744035.json

@baltpeter (Member Author)

Oh wait, the 10k dataset also has the rank information, so I wouldn't have needed to do that. Well, anyway…
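
In other words, the 1k list could have been derived locally from the 10k dump; a hypothetical sketch, assuming it were imported as a table `ranks_10k` with `origin` and `rank` columns:

```sql
-- Hypothetical: derive the top 1k from the already-downloaded 10k dump.
-- Since CrUX ranks are buckets, rank <= 1000 is exactly the top 1k.
SELECT origin FROM ranks_10k WHERE rank <= 1000;
```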

@baltpeter (Member Author)

To trim the dataset, I did the following (see the consolidated SQL sketch after the list):

  1. Opened the web-monkey-september-2024.db database in DB Browser for SQLite.
  2. Used the File -> Import -> Table from CSV file feature to import the 1k dump.
  3. Ran delete from requests where initiator not in (select origin from ranks), which took a cool 338,321 ms (almost six minutes) with 5,342,925 rows affected. Afterwards, we have 592,377 rows in the raw dataset (instead of 5,935,302).
  4. To verify that the query did what I expected, I ran select count(distinct initiator) from requests. That returned 988, which seems reasonable (some sites may just not have worked).
  5. Ran vacuum to free up disk space. Now, the DB is 2.1 GB instead of 19.9 GB.
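
For reference, steps 3 to 5 as a single SQLite script; a sketch that assumes the CSV from step 2 was imported as a table named `ranks` with an `origin` column:

```sql
-- Keep only requests initiated by one of the top 1k origins.
DELETE FROM requests
WHERE initiator NOT IN (SELECT origin FROM ranks);

-- Sanity check: distinct initiator origins remaining
-- (988 here; a bit under 1000 since some sites did not work).
SELECT count(DISTINCT initiator) FROM requests;

-- Reclaim the disk space freed by the DELETE.
VACUUM;
```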

@baltpeter (Member Author)

For anyone interested, I uploaded a dump of the full dataset for the top 10k to Zenodo: https://zenodo.org/records/13990110
