From 08bbf1907eb6f0ad1229b3bf67987181baa25791 Mon Sep 17 00:00:00 2001
From: Benjamin Altpeter
Date: Mon, 30 Sep 2024 14:32:32 +0200
Subject: [PATCH] New dataset: web-monkey-september-2024

---
 datasets.json                  |  9 ++++++++-
 datasette/settings.json        |  2 +-
 datasette/templates/index.html |  7 ++++---
 package.json                   |  1 +
 scripts/make-database.ts       |  9 ++++++---
 yarn.lock                      | 18 ++++++++++++++++++
 6 files changed, 38 insertions(+), 8 deletions(-)

diff --git a/datasets.json b/datasets.json
index fc6b4a2..33c8680 100644
--- a/datasets.json
+++ b/datasets.json
@@ -1,8 +1,15 @@
 [
+    {
+        "slug": "web-monkey-september-2024",
+        "title": "Traffic collection for TrackHAR adapter work on websites (September 2024)",
+        "description": "To start creating TrackHAR adapters for trackers active on the web, Benni ran a traffic collection on the top 10,000 websites in Germany as per the Chrome User Experience Report (CrUX) for August 2024.\n\nThe websites were accessed in a headless Chromium browser using Playwright from a French IP address. All websites were accessed twice for 60 seconds each: the first time without any user interaction, the second time with random user input, meaning that it is possible/likely that consent was given when requested.",
+        "url": "https://github.com/tweaselORG/experiments/issues/3",
+        "sourceCodeUrl": "https://github.com/tweaselORG/experiments/tree/main/web-monkey-september-2024"
+    },
     {
         "slug": "monkey-april-2024",
         "title": "Traffic collection for TrackHAR adapter work (April 2024)",
-        "description": "For the TrackHAR adapter work, Benni ran another monkey traffic collection on 2,358 Android apps from the top charts in April 2024.\n\nThe apps were run in an Android 11 emulator for 120 seconds, receving random input from `adb monkey`, as such it is possible/likely that consent was given when requested.",
+        "description": "For the TrackHAR adapter work, Benni ran another monkey traffic collection on 2,358 Android apps from the top charts in April 2024.\n\nThe apps were run in an Android 11 emulator for 120 seconds, receiving random input from `adb monkey`, as such it is possible/likely that consent was given when requested.",
         "url": "https://github.com/tweaselORG/experiments/issues/2",
         "sourceCodeUrl": "https://github.com/tweaselORG/experiments/tree/main/monkey-april-2024"
     },
diff --git a/datasette/settings.json b/datasette/settings.json
index 859ca50..e654530 100644
--- a/datasette/settings.json
+++ b/datasette/settings.json
@@ -1,5 +1,5 @@
 {
     "facet_time_limit_ms": 1000,
-    "sql_time_limit_ms": 25000,
+    "sql_time_limit_ms": 50000,
     "suggest_facets": false
 }
diff --git a/datasette/templates/index.html b/datasette/templates/index.html
index d068e89..c29db5e 100644
--- a/datasette/templates/index.html
+++ b/datasette/templates/index.html
@@ -18,9 +18,9 @@

Tweasel open data Datasette instance

-

Tweasel is a project building infrastructure for detecting and complaining about tracking and privacy violations in mobile apps on Android and iOS. Among other things, we are developing a suite of tools and libraries for automated app analysis and tracking detection, and maintaining a wiki of HTTP endpoints used by tracking companies (for a full overview of what we’re doing, have a look at our documentation).

+

Tweasel is a project building infrastructure for detecting and complaining about tracking and privacy violations in mobile apps on Android and iOS as well as websites. Among other things, we are developing a suite of tools and libraries for automated app/website analysis and tracking detection, and maintaining a wiki of HTTP endpoints used by tracking companies (for a full overview of what we’re doing, have a look at our documentation).

-

For our work, we regularly run large-scale traffic analyses on mobile apps. We are using this data for example to maintain the tracking endpoint adapters of our TrackHAR library. Our goal is to shine a light on how trackers work and what they collect, and as such we of course want as many people as possible researching them. In addition, we want to provide documentation on why/how we have concluded what certain values transmitted to a tracking endpoint mean, and do so in a way that is replicable by others.

+

For our work, we regularly run large-scale traffic analyses on mobile apps and websites. We are using this data for example to maintain the tracking endpoint adapters of our TrackHAR library. Our goal is to shine a light on how trackers work and what they collect, and as such we of course want as many people as possible researching them. In addition, we want to provide documentation on why/how we have concluded what certain values transmitted to a tracking endpoint mean, and do so in a way that is replicable by others.

As such, we are publishing our datasets as open data for other researchers, activists, and anyone else who is interested in understanding the inner workings of trackers. We hope to thereby lower the barrier of entry for people to start investigating trackers themselves.

@@ -50,6 +50,7 @@

Datasets

  • Worrying confessions: A look at data safety labels on Android (data from September 2022, view requests)
  • Traffic collection for TrackHAR adapter work (July 2023) (data from July 2023, view requests)
  • Traffic collection for TrackHAR adapter work (April 2024) (data from April 2024, view requests)
+
  • Traffic collection for TrackHAR adapter work on websites (September 2024) (data from September 2024, view requests)
  • Note: We have decided to only publish requests to endpoints that are contacted by apps from at least two different vendors, using Apple’s definition for determining the vendor from the app ID. As such, our data is not suited for reverse-engineering internal app APIs.

    @@ -59,7 +60,7 @@

    Web interface

    We are publishing the data as a Datasette instance, which allows you to interactively explore the full data online, including running arbitrary SQL queries against it. Here are just a few examples of interesting things you can look at: