Commit

New dataset: web-monkey-september-2024
baltpeter committed Sep 30, 2024
1 parent e6748a7 commit 08bbf19
Showing 6 changed files with 38 additions and 8 deletions.
9 changes: 8 additions & 1 deletion datasets.json
@@ -1,8 +1,15 @@
[
{
"slug": "web-monkey-september-2024",
"title": "Traffic collection for TrackHAR adapter work on websites (September 2024)",
"description": "To start creating TrackHAR adapters for trackers active on the web, Benni ran a traffic collection on the top 10,000 websites in Germany as per the Chrome User Experience Report (CrUX) for August 2024.\n\nThe websites were accessed in a headless Chromium browser using Playwright from a French IP address. All websites were accessed twice for 60 seconds each: The first time without any user interaction, the second time random user input was provided, meaning that it is possible/likely that consent was given when requested.",
"url": "https://github.com/tweaselORG/experiments/issues/3",
"sourceCodeUrl": "https://github.com/tweaselORG/experiments/tree/main/web-monkey-september-2024"
},
{
"slug": "monkey-april-2024",
"title": "Traffic collection for TrackHAR adapter work (April 2024)",
"description": "For the TrackHAR adapter work, Benni ran another monkey traffic collection on 2,358 Android apps from the top charts in April 2024.\n\nThe apps were run in an Android 11 emulator for 120 seconds, receving random input from `adb monkey`, as such it is possible/likely that consent was given when requested.",
"description": "For the TrackHAR adapter work, Benni ran another monkey traffic collection on 2,358 Android apps from the top charts in April 2024.\n\nThe apps were run in an Android 11 emulator for 120 seconds, receiving random input from `adb monkey`, as such it is possible/likely that consent was given when requested.",
"url": "https://github.com/tweaselORG/experiments/issues/2",
"sourceCodeUrl": "https://github.com/tweaselORG/experiments/tree/main/monkey-april-2024"
},
2 changes: 1 addition & 1 deletion datasette/settings.json
@@ -1,5 +1,5 @@
{
"facet_time_limit_ms": 1000,
"sql_time_limit_ms": 25000,
"sql_time_limit_ms": 50000,
"suggest_facets": false
}
7 changes: 4 additions & 3 deletions datasette/templates/index.html
@@ -18,9 +18,9 @@
<div id="tweasel-home">
<h1>Tweasel open data Datasette instance</h1>

<p>Tweasel is a project building infrastructure for detecting and complaining about tracking and privacy violations in mobile apps on Android and iOS. Among other things, we are developing a suite of tools and libraries for automated app analysis and tracking detection, and maintaining a <a href="https://trackers.tweasel.org/">wiki of HTTP endpoints used by tracking companies</a> (for a full overview of what we’re doing, have a look at our <a href="https://docs.tweasel.org/">documentation</a>).</p>
<p>Tweasel is a project building infrastructure for detecting and complaining about tracking and privacy violations in mobile apps on Android and iOS as well as websites. Among other things, we are developing a suite of tools and libraries for automated app/website analysis and tracking detection, and maintaining a <a href="https://trackers.tweasel.org/">wiki of HTTP endpoints used by tracking companies</a> (for a full overview of what we’re doing, have a look at our <a href="https://docs.tweasel.org/">documentation</a>).</p>

<p>For our work, we regularly run large-scale traffic analyses on mobile apps. We are using this data for example to maintain the tracking endpoint adapters of our <a href="https://github.com/tweaselORG/TrackHAR">TrackHAR library</a>. Our goal is to shine a light on how trackers work and what they collect, and as such we of course want as many people as possible researching them. In addition, we want to provide documentation on why/how we have concluded what certain values transmitted to a tracking endpoint mean, and do so in a way that is replicable by others.</p>
<p>For our work, we regularly run large-scale traffic analyses on mobile apps and websites. We are using this data for example to maintain the tracking endpoint adapters of our <a href="https://github.com/tweaselORG/TrackHAR">TrackHAR library</a>. Our goal is to shine a light on how trackers work and what they collect, and as such we of course want as many people as possible researching them. In addition, we want to provide documentation on why/how we have concluded what certain values transmitted to a tracking endpoint mean, and do so in a way that is replicable by others.</p>

<p>As such, we are publishing our datasets as open data for other researchers, activists, and anyone else who is interested in understanding the inner workings of trackers. We hope to thereby lower the barrier of entry for people to start investigating trackers themselves.</p>

@@ -50,6 +50,7 @@ <h2 id="datasets">Datasets</h2>
<li><a href="https://www.datarequests.org/blog/android-data-safety-labels-analysis/">Worrying confessions: A look at data safety labels on Android</a> (data from September 2022, <a href="/data/requests?dataset=worrying-confessions">view requests</a>)</li>
<li><a href="https://github.com/tweaselORG/experiments/issues/1">Traffic collection for TrackHAR adapter work (July 2023)</a> (data from July 2023, <a href="/data/requests?dataset=monkey-july-2023">view requests</a>)</li>
<li><a href="https://github.com/tweaselORG/experiments/issues/2">Traffic collection for TrackHAR adapter work (April 2024)</a> (data from April 2024, <a href="/data/requests?dataset=monkey-april-2024">view requests</a>)</li>
<li><a href="https://github.com/tweaselORG/experiments/issues/3">Traffic collection for TrackHAR adapter work on websites (September 2024)</a> (data from September 2024, <a href="/data/requests?dataset=web-monkey-september-2024">view requests</a>)</li>
</ul>

<p><strong>Note</strong>: We have decided to only publish requests to endpoints that are contacted by apps from at least two different vendors, using <a href="https://developer.apple.com/documentation/uikit/uidevice/1620059-identifierforvendor">Apple’s definition for determining the vendor from the app ID</a>. As such, our data is not suited for reverse-engineering internal app APIs.</p>
@@ -59,7 +60,7 @@ <h2 id="web-interface">Web interface</h2>
<p>We are publishing the data as a <a href="https://datasette.io/">Datasette</a> instance, which allows you to interactively explore the full data online, including running arbitrary SQL queries against it. Here are just a few examples of interesting things you can look at:</p>

<ul>
<li><a href="/data?sql=select+count(1)+count%2C+endpointUrl+from+requests+where+endpointUrl+is+not+null%0D%0Agroup+by+endpointUrl++order+by+count+desc+limit+101%3B">the endpoints that were contacted most often</a> or <a href="/data?sql=select+count(distinct+regex_replace('%40.%2B%3F%24'%2C+coalesce(initiator%2C+'%3Cno+app+ID%3E')%2C+''))+appCount%2C+count(1)+requestCount%2C+endpointUrl+from+requests%0D%0Awhere+endpointUrl+is+not+null%0D%0Agroup+by+endpointUrl++order+by+appCount+desc+limit+101%3B">by the most apps</a></li>
<li><a href="/data?sql=select+count(1)+count%2C+endpointUrl+from+requests+where+endpointUrl+is+not+null%0D%0Agroup+by+endpointUrl++order+by+count+desc+limit+101%3B">the endpoints that were contacted most often</a> or by the most <a href="/data?sql=select+count%28distinct+regex_replace%28%27%40.%2B%3F%24%27%2C+coalesce%28initiator%2C+%27%3Cno+app+ID%3E%27%29%2C+%27%27%29%29+appCount%2C+count%281%29+requestCount%2C+endpointUrl+from+requests%0D%0Awhere+endpointUrl+is+not+null+and+%28platform%3D%27android%27+or+platform%3D%27ios%27%29%0D%0Agroup+by+endpointUrl++order+by+appCount+desc+limit+101%3B">apps</a>/<a href="/data?sql=select+count(distinct+initiator)+websiteCount%2C+count(1)+requestCount%2C+endpointUrl+from+requests%0D%0Awhere+endpointUrl+is+not+null+and+platform%3D'web'%0D%0Agroup+by+endpointUrl++order+by+websiteCount+desc+limit+101%3B">websites</a></li>
<li><a href="/data?sql=select+link(dataset%2C+id)%2C+initiator%2C+platform%2C+runType%2C+startTime%2C+method%2C+httpVersion%2C+endpointUrl%2C+scheme%2C+host%2C+port%2C+path%2C+content%2C+headers%2C+cookies+%0D%0Afrom+requests%0D%0Awhere+host+like+'%25'+||+%3Ahost+||+'%25'%0D%0Aorder+by+length(content)+%2B+length(path)+%2B+length(headers)+%2B+length(cookies)+desc+limit+101%3B&host=doubleclick.net">requests to a particular host, e.g. <code>doubleclick.net</code>, ordered by length</a></li>
<li><a href="/data?sql=select+link(dataset%2C+id)%2C+initiator%2C+platform%2C+runType%2C+startTime%2C+method%2C+httpVersion%2C+endpointUrl%2C+scheme%2C+host%2C+port%2C+path%2C+content%2C+headers%2C+cookies+%0D%0Afrom+requests%0D%0Awhere+initiator+like+%3AappId+||+'%40%25'%0D%0Alimit+101%3B&appId=com.airbnb.android">requests by a particular app, e.g. Airbnb on Android</a></li>
<li><a href="/data?sql=select+link(dataset%2C+id)%2C+initiator%2C+platform%2C+runType%2C+startTime%2C+method%2C+httpVersion%2C+endpointUrl%2C+scheme%2C+host%2C+port%2C+path%2C+content%2C+headers%2C+cookies+%0D%0Afrom+requests%0D%0Aorder+by+json_array_length(cookies)+desc%0D%0Alimit+101%3B">the requests setting the most cookies</a></li>
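The example links in this hunk carry URL-encoded SQL in the `sql` query parameter of `/data`, and named SQL parameters such as `:host` are passed as additional query parameters. A minimal sketch of how such a link could be assembled is shown here. `buildQueryLink` is a hypothetical helper that is not part of the repository, and the sketch assumes that the standard `URLSearchParams` form encoding is acceptable for these links:

```typescript
// Hypothetical helper (not part of the repository): builds a link to the
// Datasette SQL view from a query and optional named parameters.
const buildQueryLink = (sql: string, params: Record<string, string> = {}): string => {
    // URLSearchParams form-encodes the values: spaces become '+',
    // characters such as '(', ')', ',' and quotes are percent-encoded.
    const search = new URLSearchParams({ sql, ...params });
    return `/data?${search.toString()}`;
};

// The "websites" query from the diff, written out in readable form.
const websitesSql = `select count(distinct initiator) websiteCount, count(1) requestCount, endpointUrl from requests
where endpointUrl is not null and platform='web'
group by endpointUrl order by websiteCount desc limit 101;`;

const websitesLink = buildQueryLink(websitesSql);

// A query with a named parameter, as in the `doubleclick.net` example.
const hostLink = buildQueryLink("select * from requests where host like '%' || :host || '%' limit 101;", {
    host: 'doubleclick.net',
});
```

Decoding the link again with `new URLSearchParams(link.split('?')[1]).get('sql')` yields the original query text, which is a quick way to read the long `href` values in the template.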
1 change: 1 addition & 0 deletions package.json
@@ -9,6 +9,7 @@
"better-sqlite3": "^8.5.0",
"fs-extra": "^11.1.1",
"sqlite-regex": "^0.2.3",
"sqlite-url": "^0.1.0",
"tsx": "^3.12.7",
"yesno": "^0.4.0"
},
9 changes: 6 additions & 3 deletions scripts/make-database.ts
@@ -1,6 +1,7 @@
import Database from 'better-sqlite3';
import yesno from 'yesno';
import * as sqlite_regex from 'sqlite-regex';
import * as sqliteRegex from 'sqlite-regex';
import * as sqliteUrl from 'sqlite-url';
import fse from 'fs-extra';
import datasets from '../datasets.json';

@@ -14,7 +15,8 @@ import datasets from '../datasets.json';
}

const db = new Database('datasette/data.db');
db.loadExtension(sqlite_regex.getLoadablePath());
db.loadExtension(sqliteRegex.getLoadablePath());
db.loadExtension(sqliteUrl.getLoadablePath());
db.pragma('journal_mode = WAL');

// Create and fill `datasets` table.
@@ -69,6 +71,7 @@ insert into requests
case
when initiator is null then null
when instr(initiator,'.') = 0 then initiator
when instr(initiator,'@') = 0 then url_host(initiator)
else regex_replace('\\.[^.]+@.+?$', initiator, '')
end as vendor,
-- For the 'do-they-track' requests, we don't know the scheme, so we guess 'https'. That is reasonable considering that less than 0.5 % of requests in the rest of the dataset use 'http'.
@@ -81,7 +84,7 @@ insert into requests
)
select * from vendors where
-- Only include requests that are made to the same endpointUrl by apps from at least two different vendors (https://github.com/tweaselORG/meta/issues/33#issuecomment-1658348929).
-- Only include requests that are made to the same endpointUrl by apps/websites from at least two different vendors/hosts (https://github.com/tweaselORG/meta/issues/33#issuecomment-1658348929).
_endpointForCounting in (
select _endpointForCounting from vendors group by _endpointForCounting having count(distinct vendor) >= 2
union
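The `vendor` case expression and the `count(distinct vendor) >= 2` filter in the hunks above can be approximated in plain TypeScript as follows. This is only a sketch under stated assumptions: `deriveVendor`, `endpointsSharedByVendors` and `RequestRow` are hypothetical names that do not exist in the repository, website initiators are assumed to be full URLs, the built-in `URL` class stands in for `url_host` from `sqlite-url` (whose handling of unparsable input may differ), and only the first branch of the `union` is mirrored because the rest is not shown in the diff.

```typescript
// Hypothetical helper (not part of the repository): approximates the SQL
// `case` expression that derives `vendor` from `initiator`.
const deriveVendor = (initiator: string | null): string | null => {
    if (initiator === null) return null;
    // No dot: there is no segment to strip, so the initiator is used as is.
    if (!initiator.includes('.')) return initiator;
    // No '@': assumed to be a website URL, so the host acts as the vendor.
    if (!initiator.includes('@')) {
        try {
            return new URL(initiator).hostname;
        } catch {
            // Assumption: unparsable input yields no vendor.
            return null;
        }
    }
    // App ID with version ("<appId>@<version>"): drop the last ID segment
    // and the version, e.g. "com.airbnb.android@1.2.3" becomes "com.airbnb".
    return initiator.replace(/\.[^.]+@.+?$/, '');
};

// Hypothetical row shape; `endpoint` plays the role of `_endpointForCounting`.
type RequestRow = { initiator: string | null; endpoint: string };

// Returns the endpoints contacted by at least `minVendors` distinct vendors.
const endpointsSharedByVendors = (rows: RequestRow[], minVendors = 2): Set<string> => {
    const vendorsByEndpoint = new Map<string, Set<string>>();
    for (const row of rows) {
        const vendor = deriveVendor(row.initiator);
        // Like `count(distinct ...)` in SQL, null vendors are not counted.
        if (vendor === null) continue;
        const vendors = vendorsByEndpoint.get(row.endpoint) ?? new Set<string>();
        vendors.add(vendor);
        vendorsByEndpoint.set(row.endpoint, vendors);
    }
    const shared = new Set<string>();
    vendorsByEndpoint.forEach((vendors, endpoint) => {
        if (vendors.size >= minVendors) shared.add(endpoint);
    });
    return shared;
};
```

With this approximation, two apps from the same vendor (for example `com.example.one@1.0` and `com.example.two@2.0`) count as a single vendor, whereas two websites on different hosts count as two.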
18 changes: 18 additions & 0 deletions yarn.lock
@@ -492,6 +492,24 @@ sqlite-regex@^0.2.3:
sqlite-regex-linux-x64 "0.2.3"
sqlite-regex-windows-x64 "0.2.3"

sqlite-url-darwin-x64@0.1.0:
version "0.1.0"
resolved "https://registry.yarnpkg.com/sqlite-url-darwin-x64/-/sqlite-url-darwin-x64-0.1.0.tgz#367bb63203b5d555366de1bdca9da3d3a0bcdd44"
integrity sha512-E7AMMTFMik4eSisE09+/fQIjHf2z4sKT8J7BcL80nTQuoz9+DiNLi0M9ZqygnK5Gp9gmrxD3Ds2Ti6zabGF8mQ==

sqlite-url-linux-x64@0.1.0:
version "0.1.0"
resolved "https://registry.yarnpkg.com/sqlite-url-linux-x64/-/sqlite-url-linux-x64-0.1.0.tgz#e398494afd47fd5a70324b38ecd1c310b948f6c9"
integrity sha512-lviHGy7/UjbncL5ORklBhixQLAfhX1913M4rJfqGslN3eNckN6y6RlRo/N1PmJIz0Lb8uJu596wmI67OxOudFw==

sqlite-url@^0.1.0:
version "0.1.0"
resolved "https://registry.yarnpkg.com/sqlite-url/-/sqlite-url-0.1.0.tgz#3e1712b2b9dc3741bcf3b11c8c36607aa4ada9f6"
integrity sha512-/nuI5ouS5HHP7a22KNaFkYFsTRE9VaJvLiUMeBliv9K09s2DeWTCiF025RrF7Jj7s2TSiCnhMB5N+ODD2TqjBg==
optionalDependencies:
sqlite-url-darwin-x64 "0.1.0"
sqlite-url-linux-x64 "0.1.0"

string_decoder@^1.1.1:
version "1.3.0"
resolved "https://registry.yarnpkg.com/string_decoder/-/string_decoder-1.3.0.tgz#42f114594a46cf1a8e30b0a84f56c78c3edac21e"