
Data from web traffic collection makes Datasette instance a lot slower #3

Closed
baltpeter opened this issue Oct 7, 2024 · 34 comments

Comments

@baltpeter
Member

Now that I have imported the data from our first traffic collection on the web (tweaselORG/experiments#3), the Datasette instance has gotten a lot more sluggish.

That shouldn't come as too much of a surprise. We have jumped from 323,116 rows in the requests table to 2,614,200 and the database has grown to more than 10 GB (from a little over 1 GB iirc).

Probably most critically, I have had the debug-adapter tool in TrackHAR fail due to timeouts quite regularly since I have imported the data, and that is after I had already increased the timeout to 50s (08bbf19#diff-92a81eca627ad137f98388e8aeae273e0a4069fde8995897ab138b5fee849cdcR3).

@baltpeter baltpeter self-assigned this Oct 7, 2024
baltpeter added a commit to tweaselORG/TrackHAR that referenced this issue Oct 7, 2024
With the new data from our first traffic collection on the web
(tweaselORG/data.tweasel.org#3), the request
database has gotten quite huge and I have seen debug-adapter runs take
a long time and even fail due to timeouts.

Luckily, the timeouts at least can be solved quite easily. Turns out,
we were already trying to load all matching rows in a single request.
Now, we instead load it in chunks of 100 rows. Setting the _size
parameter was sufficient for that—we had already implemented handling
of the next_url in responses and row limits.

With this change, I haven't seen any timeouts anymore so far and while
still slower, it isn't too bad anymore.

I still want to work on making data.tweasel.org as a whole faster again,
but this is a good change in any case.
@baltpeter
Member Author

Solving the timeouts in TrackHAR, at least, was easy enough by making use of chunking: tweaselORG/TrackHAR#90
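
For reference, the fix essentially boils down to paging through Datasette's JSON API instead of requesting everything at once. A simplified sketch (not the actual debug-adapter code; the example URL and filter are hypothetical):

// Simplified sketch of chunked fetching via Datasette's JSON API – not the actual TrackHAR code.
type Row = Record<string, unknown>;

const fetchAllRows = async (queryUrl: string): Promise<Row[]> => {
    const rows: Row[] = [];
    // Request 100 rows at a time; Datasette includes a next_url in the response
    // for as long as there are more matching rows.
    let url: string | null = `${queryUrl}&_size=100`;
    while (url) {
        const res = await fetch(url);
        const json = (await res.json()) as { rows: Row[]; next_url: string | null };
        rows.push(...json.rows);
        url = json.next_url;
    }
    return rows;
};

// Hypothetical usage against the table endpoint of the public instance:
// const rows = await fetchAllRows('https://data.tweasel.org/data/requests.json?_shape=objects&host__contains=google-analytics');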

@baltpeter
Member Author

I was able to make one more "optimization": We previously kept the "Only include requests that are made to the same endpointUrl by apps from at least two different vendors" rule as-is for websites, but applied it to the hostname instead of the vendor.

That was perhaps a bit generous: that way, it was enough for an endpointUrl to be accessed by www.example.org and login.example.org to be included. I have now changed it to require requests from ten different hosts for websites (9832133). Apps still only need two different vendors.
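
For illustration, such a filter could look roughly like this in the make-database script. This is only a sketch under my assumptions – the actual implementation differs, and the initiator_host() helper is hypothetical:

// Sketch only – not the actual make-database code.
import Database from 'better-sqlite3';

const db = new Database('datasette/data.db');

// Hypothetical helper: derive the hostname from a web request's initiator URL.
db.function('initiator_host', (url: string | null) => {
    if (!url) return null;
    try {
        return new URL(url).hostname;
    } catch {
        return null;
    }
});

// Keep a web endpointUrl only if it was contacted from at least ten distinct
// initiator hostnames; apps keep the existing two-vendor rule (not shown here).
db.exec(`
    delete from requests
    where platform = 'web' and endpointUrl not in (
        select endpointUrl from requests
        where platform = 'web'
        group by endpointUrl
        having count(distinct initiator_host(initiator)) >= 10
    );
`);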

With this, the database has shrunk to 7.9 GB and we now have 1,912,743 requests with 1,592,854 of those being from websites.

@baltpeter
Member Author

This has helped a little but the performance is still not great. And quite a few more complex queries, including ones on the homepage, are still timing out.

I fear that there isn't much more that we can do in terms of optimizations. I really see only two options:

  1. Pay for a more powerful server.
  2. Decide that using the top 10k on the web was a bit ambitious. Since the CrUX data is only published in buckets, this would mean that we would have to scale down to the top 1k instead.

What do you think, @zner0L?

@baltpeter
Member Author

Another random thought I just had, inspired by accidentally having two different databases in my local instance: Datasette can handle multiple databases.

[screenshot]

We could have separate databases for web and apps.

But upon thinking about this a little more, it probably isn't a good idea:

  1. The web database would still be sluggish.
  2. For many things, we will explicitly want to work with the data from both realms.

@zner0L
Contributor

zner0L commented Oct 7, 2024

I tried running the server in immutable mode as suggested by the documentation and it did improve the performance a bit, even though queries still take quite long. I think part of this is also that some queries are just inefficient.

We can try to squeeze more performance out of this instance with better queries and optimized data storage. However, SQLite is just not built for this amount of data. Right now we are using a pretty inefficient software stack to handle too big of a dataset. I really don’t like solving these kinds of problems with better hardware…

@zner0L
Contributor

zner0L commented Oct 7, 2024

To run in immutable mode while using configuration directory mode, Datasette needs the database to be mentioned in an inspect-data.json in the config directory. However, there is an inconsistency in Datasette when comparing database paths (simonw/datasette#2433), which is why I couldn't get this to work at first: the file property needs to contain just the filename, not the path.

This improves the performance a bit and allows for the database to be downloaded. However, it introduces the question of where to generate and store the inspect-data.json. Do we write a script to generate it on each upload? Do we generate it locally and check it into version control? I am interested in your thoughts, @baltpeter.

@baltpeter baltpeter removed their assignment Oct 7, 2024
@baltpeter
Member Author

Oh, interesting. I also ran into the issue of not being able to combine configuration directory mode and immutable mode when I initially set up Datasette but didn't find the trick of using inspect-data.json. I also didn't investigate further because from reading Simon's blog, I know that there will be a pretty major overhaul of the whole config system in the upcoming Datasette 1.0.

That being said, from my understanding of the docs, inspect-data.json should provide some performance benefits even beyond the ones from immutable mode, so I guess it might be worth implementing that already.

Since our deployment method for the database is currently "copy and run this scp command", I guess adding a second command for generating inspect-data.json wouldn't be too bad.
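
Something along these lines could work as that second command (a rough sketch; the paths and the scp target are assumptions, and `datasette inspect` simply writes its JSON to stdout):

// Sketch of a small deploy helper – file locations and the scp target are assumptions.
import { execSync } from 'node:child_process';
import { writeFileSync } from 'node:fs';

// Run `datasette inspect` from inside the configuration directory so that the
// file property ends up containing just the filename (see simonw/datasette#2433).
const inspectJson = execSync('datasette inspect data.db', { cwd: 'datasette' }).toString();
writeFileSync('datasette/inspect-data.json', inspectJson);

// Upload the database and the inspect data together (placeholder target).
execSync('scp datasette/data.db datasette/inspect-data.json dsette@tweasel-data:data.tweasel.org/datasette/');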

zner0L pushed a commit to tweaselORG/TrackHAR that referenced this issue Oct 7, 2024
@baltpeter
Member Author

Documenting an investigation that happened a while ago: @mal-tee suggested that adding indexes for more than just the facet columns might help.

Starting from the assumption that the only potential downside of just indexing everything would be worse write performance, I learned that the subject is actually quite complex and that there are whole books about it. For example, I'm not sure I knew before that there are multi-column indexes.

Anyway, I also learned that SQLite has an .expert mode that recommends which indexes to create for a particular query.

I tried that for one of our most used queries:

select count(distinct initiator) initiatorCount, count(*) requestCount, endpointUrl from requests where endpointUrl is not null and platform='web' group by endpointUrl  order by initiatorCount desc limit 101;

It suggested the following index:

CREATE INDEX requests_idx_b0c35c3d ON requests(platform, endpointUrl);

[screenshot]

However, unless I'm severely misunderstanding something here, this actually massively decreased the query performance. Whereas previously, the query would take 7.766s on my machine, with the index, it takes 19.515s.

[screenshot]

Just for the heck of it, I also tried creating single-column indexes for all columns appearing in the query but that had no significant effect (7.746s).

[screenshot]

This all seems very counter-intuitive to me and I feel like I'm missing something. But given that adding indexes can apparently hurt read performance after all, and that our request database is specifically meant for exploration where we cannot know in advance which queries users will run, I fear that indexes may not be the best way forward.

@baltpeter
Member Author

@mal-tee and especially @zner0L also kept saying that this is just an inherent limitation of SQLite and that we would need to use a "proper" DBMS like Postgres. I find that hard to believe and feel like 2.6M rows isn't thaaat much. In any case, it seems sensible to verify it (especially since Postgres support for Datasette isn't completely out of the picture).

The first hurdle is finding a way to import the data from SQLite into a Postgres database. I struggled with various tools. pgloader worked the best but only imported 1GB, never the whole database, even after fiddling with various options.

In the end, I managed to do this using DataGrip by loading both databases into the same workspace and using the "Import/Export" -> "Copy Tables to" feature.

Screencast.from.2024-10-22.10-52-46.mp4

That took 2min 15s, which seems very reasonable.

To make the comparison fair, I manually created the same indexes as we have in SQLite:

for (const facetColumn of ['method', 'dataset', 'runType', 'scheme', 'platform'])
    db.exec(`create index "requests_${facetColumn}" on "requests"("${facetColumn}");`);

@baltpeter
Member Author

For reference, here are the versions of both systems I'm using for the test.

SQLite:

DBMS: SQLite (ver. 3.45.1)
Case sensitivity: plain=mixed, delimited=mixed
Driver: SQLite JDBC (ver. 3.45.1.0, JDBC4.2)

Postgres:

DBMS: PostgreSQL (ver. 17.0 (Debian 17.0-1.pgdg120+1))
Case sensitivity: plain=lower, delimited=exact
Driver: PostgreSQL JDBC Driver (ver. 42.6.0, JDBC4.2)

To create the Postgres database, I used Docker.

docker-compose.yml:

version: '3.7'
services:
    db:
        image: postgres:17
        volumes:
            - data:/var/lib/postgresql/data
        env_file: ./.env
        ports:
            - '${HOST_PORT}:5432'
volumes:
    data:

.env:

POSTGRES_DB=tweasel-postgres-test
POSTGRES_USER=tweasel-postgres-test
POSTGRES_PASSWORD=pw
POSTGRES_HOST=localhost
HOST_PORT=5437

@baltpeter
Member Author

Now for the interesting part. I ran the same query in both DBMS:

select count(distinct initiator) appCount, count(1) requestCount, endpointUrl from requests
where endpointUrl is not null and (platform='web')
group by endpointUrl  order by appCount desc limit 101;

In SQLite, this took 21s:

[2024-10-22 11:00:49] Connected
main> select count(distinct initiator) appCount, count(1) requestCount, endpointUrl from requests
      where endpointUrl is not null and (platform='web')
      group by endpointUrl  order by appCount desc limit 101
[2024-10-22 11:01:10] 101 rows retrieved starting from 1 in 21 s 161 ms (execution: 21 s 101 ms, fetching: 60 ms)

Whereas in Postgres, it took less than 7s:

[2024-10-22 11:02:27] Connected
data.public> select count(distinct initiator) appCount, count(1) requestCount, endpointUrl from requests
             where endpointUrl is not null and (platform='web')
             group by endpointUrl  order by appCount desc limit 101
[2024-10-22 11:02:33] 101 rows retrieved starting from 1 in 6 s 655 ms (execution: 6 s 625 ms, fetching: 30 ms)

While this is far from a scientific test, a 3x speed difference is certainly not nothing. Looks like I was wrong!

@baltpeter
Member Author

Testing a few other queries from our homepage.

select count(1) count, endpointUrl from requests where endpointUrl is not null
group by endpointUrl  order by count desc limit 101;

SQLite: 263ms, Postgres: 964ms


select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
where host like '%google-analytics%'
order by length(content) + length(path) + length(headers) + length(cookies) desc limit 101;

SQLite: 1s 17ms, Postgres: 1s 41ms


select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
where initiator like 'com.airbnb.android@%'
limit 101;

SQLite: 130ms, Postgres: 113ms


-- SQLite
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
WHERE content like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or path like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or headers like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or cookies like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%'
limit 101 collate nocase;

-- Postgres
select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies
from requests
WHERE content like '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or path ilike '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or headers ilike '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%' or cookies ilike '%ea70edc1-ac05-481c-8d2a-66b1be496a7e%'
limit 101;

SQLite: 723ms, Postgres: 153ms


select dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies 
from requests
where runType = 'consent-dialog-accepted' and initiator is not null and not exists (
  select * from requests r where r.initiator = requests.initiator and r.endpointUrl = requests.endpointUrl and r.runType = 'no-interaction'
)
limit 10;

SQLite: 1m 15s, Postgres: 677ms (!)


select length(content) + length(path) + length(headers) + length (cookies) as length, dataset, id, initiator, platform, runType, startTime, method, httpVersion, endpointUrl, scheme, host, port, path, content, headers, cookies from requests
order by length desc
limit 10;

SQLite: 1m 31s, Postgres: 7s (!)

@baltpeter
Member Author

baltpeter commented Oct 22, 2024

This is quite interesting. For the quick queries, SQLite can sometimes even win against Postgres. But some major queries that take ages in SQLite are lightning fast in Postgres.

@baltpeter
Member Author

But anyway, if there ever is support for Postgres in Datasette, it's quite a while away.

The reason I decided to look into this again is that I happened to stumble across DuckDB again a few days ago. And it sounds very relevant to our use case (quoting from https://duckdb.org/why_duckdb#fast):

DuckDB is designed to support analytical query workloads, also known as online analytical processing (OLAP). These workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example aggregations over entire tables or joins between several large tables. Changes to the data are expected to be rather large-scale as well, with several rows being appended, or large portions of tables being changed or added at the same time.

To efficiently support this workload, it is critical to reduce the amount of CPU cycles that are expended per individual value. The state of the art in data management to achieve this are either vectorized or just-in-time query execution engines. DuckDB contains a columnar-vectorized query execution engine, where queries are still interpreted, but a large batch of values (a "vector") are processed in one operation. This greatly reduces overhead present in traditional systems such as PostgreSQL, MySQL or SQLite which process each row sequentially. Vectorized query execution leads to far better performance in OLAP queries.

And notably, not only is Datasette support for DuckDB more likely than Postgres (simonw/datasette#968), there is even a plugin that enables support right now since DuckDB's API is quite similar to SQLite's: https://github.com/cldellow/datasette-parquet

And getting data from SQLite into DuckDB is trivial: https://duckdb.org/docs/guides/database_integration/sqlite.html

@baltpeter
Member Author

Importing the data into DuckDB was indeed easy enough:

❯ duckdb data.duckdb
v1.1.2 f680b7d08f
Enter ".help" for usage hints.
D install sqlite;
D load sqlite;
D attach 'datasette/data.db' as data_sqlite (type sqlite);
D select * from data_sqlite.requests limit 2;
┌─────────┬───────────────┬───────────┬──────────┬───┬──────────────────────┬──────────────────────┬──────────────────────┬─────────┐
│   id    │    dataset    │ initiator │ platform │ … │         path         │       content        │       headers        │ cookies │
│ varchar │    varchar    │  varchar  │ varchar  │   │       varchar        │         blob         │       varchar        │ varchar │
├─────────┼───────────────┼───────────┼──────────┼───┼──────────────────────┼──────────────────────┼──────────────────────┼─────────┤
│ 18944   │ do-they-track │           │ android  │ … │ /v1/projects/chime…  │ {\x22installation\…  │ [{"name": "Content…  │ []      │
│ 18945   │ do-they-track │           │ android  │ … │ /androidcheck/v1/a…  │ \x0A4\x0A$827d8162…  │ [{"name": "Content…  │ []      │
├─────────┴───────────────┴───────────┴──────────┴───┴──────────────────────┴──────────────────────┴──────────────────────┴─────────┤
│ 2 rows                                                                                                       16 columns (8 shown) │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
D show databases;
┌───────────────┐
│ database_name │
│    varchar    │
├───────────────┤
│ data          │
│ data_sqlite   │
└───────────────┘
D copy from database data_sqlite to data;
100% ▕████████████████████████████████████████████████████████████▏ 
D detach data_sqlite;
D show databases;
┌───────────────┐
│ database_name │
│    varchar    │
├───────────────┤
│ data          │
└───────────────┘
D 

First observation: The DuckDB database only weighs 4.4GB.

@baltpeter
Member Author

DataGrip also has basic support for DuckDB. Using:

DBMS: DuckDB (ver. v1.1.1)
Case sensitivity: plain=exact, delimited=exact
Driver: DuckDBJ (ver. 1.0, JDBC1.0)
Effective version: UNKNOWN (ver. 1.0)

@baltpeter
Member Author

Huh. I keep running into internal errors. For example:

D select count(distinct initiator) from requests;
┌───────────────────────────┐
│ count(DISTINCT initiator) │
│           int64           │
├───────────────────────────┤
│                     18396 │
└───────────────────────────┘

D select count(distinct initiator) from requests where endpointUrl is not null;
INTERNAL Error: Attempted to access index 0 within vector of size 0
This error signals an assertion failure within DuckDB. This usually occurs due to unexpected conditions or errors in the program's logic.
For more information, see https://duckdb.org/docs/dev/internal_errors

Am I doing something wrong or did the import maybe fail?

@baltpeter
Member Author

Seems like it was genuinely a problem with the import. I have now tried to adapt the make-database script to DuckDB and create the database that way. I must have gotten something wrong since it only imported 647,825 rows (probably in the hostname extraction from URLs – since there is no extension for that in DuckDB, I had to use a regex).

But at least with this database, all queries work fine. And it definitely seems worth investigating further. All queries I've tried running have returned almost instantly.

@baltpeter
Member Author

Okay, it was just a silly typo/c&p error. I had executed the regex on the wrong column (endpointUrl instead of initiator). The regex would probably work fine but I didn't actually test it.

Before I spotted the problem I tried registering a custom function:

db.register_udf('url_host', 'text', (url) => {
    try {
        const u = new URL(url);
        return u.hostname;
    } catch {
        return null;
    }
});

Even with needing to parse all those URLs using JS, creating the DB took less than 90s.
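
The registered function can then be used like any built-in scalar function. For example (just an illustrative query against the same db handle as above, not one from make-database):

// Illustrative query using the url_host() UDF registered above on the same
// duckdb Database instance; not an actual make-database query.
db.all(
    `select url_host(initiator) as initiatorHost, count(*) as requestCount
     from requests
     where platform = 'web'
     group by initiatorHost
     order by requestCount desc
     limit 10;`,
    (err, rows) => {
        if (err) throw err;
        console.log(rows);
    }
);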

@baltpeter
Member Author

And the performance is still really encouraging!

Even fetching the top 1000 endpoints is instantaneous:

Screencast.from.2024-10-22.13-24-29.mp4

@baltpeter
Member Author

I did some testing with datasette-parquet as well. One note: You'll need to also install pytz in the venv since I'm using timestamptz for the startTime column.

It's really fun to have the complex queries be instant. For example, the "most common endpoints on web" is "Queries took 80.526ms" vs "Queries took 21597.821ms". Tiny bit of a difference there.

Unfortunately, I'm also running into a few errors. Some of these I've been able to fix with query changes etc. But this is definitely not a drop-in replacement.

@baltpeter
Member Author

I'm definitely also seeing that DuckDB isn't better in all scenarios. For any sort of aggregation etc. it blows SQLite out of the water. But if you actually need to fetch the data from the rows, SQLite tends to be faster. Not much of a surprise given DuckDB's focus on analytical queries.

Here's an example from the query we're using in TrackHAR:

[screenshot]

When fetching all the data, the difference isn't too bad. But when fetching a subset of rows, there is quite the difference:

[screenshot]

[screenshot]

@baltpeter
Member Author

I think that's as far as it makes sense for me to go with this investigation for now.

  • Y'all were right that Postgres would be much better for our use case. While not as fast as DuckDB for analytical queries, it offers a good balance and is a lot faster than SQLite in all relevant cases. However, we (currently) have no viable path for moving to Postgres, so that unfortunately doesn't help us.
  • DuckDB is phenomenal for analytical queries, but from my limited testing, it seems like it would make the TrackHAR problems even worse. Also, while there is at least some chance of us being able to switch, I'm pretty sure that there would be quite a lot of work involved.

What do you think, @zner0L? Should we spend more time investigating DuckDB and trying to make it work for us? Or should we just do what we had already resigned ourselves to anyway (throw away everything other than the top 1k for web for now) and hope that Datasette one day offers native support for another DBMS?

@baltpeter
Member Author

For reference: My changes for testing datasette-parquet are in #7.

@baltpeter
Member Author

baltpeter commented Oct 22, 2024

Oh, I also wanted to mention this blog post I found: https://simonwillison.net/2022/Sep/1/sqlite-duckdb-paper/

Most notably, it mentions that SQLite 3.38.0 ships with speed ups for large analytical queries. We're currently running 3.37.2 on the server.

@baltpeter
Member Author

Alternative URL for checking the version: https://data.tweasel.org/-/versions

@baltpeter
Member Author

3.37.2 appears to be the version included with the system Python on our server (https://stackoverflow.com/a/68589898). As far as I can tell, there is no other sqlite installed on the system or via pip.

dsette@tweasel-data ~/data.tweasel.org (git)-[main] % python3
Python 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3; print(sqlite3.connect(':memory:').execute('SELECT sqlite_version();').fetchall())
[('3.37.2',)]

@zner0L
Contributor

zner0L commented Oct 24, 2024

I think we should focus on other things for now and go through with deleting the data. We should maybe upload a dump to Zenodo so that other people can still use it on their own machine if they want to. We could still try to update SQLite and see if that brings any improvement, though.

@zner0L
Contributor

zner0L commented Oct 24, 2024

At least for building tracker-wiki we do need aggregate queries after all.

@baltpeter
Member Author

Notes on how I trimmed the dataset in: tweaselORG/experiments#3 (comment)

Full dataset for the top 10k on Zenodo: https://zenodo.org/records/13990110

@baltpeter
Member Author

With that, the result of yarn make-database is now only 2.2GB with 461,653 rows, which is a good sign (but also sad :/).

@baltpeter
Member Author

baltpeter commented Oct 25, 2024

Running TrackHAR's debug-adapter against the Google Analytics adapter now takes 1:22mins instead of 3 hours (on my local machine!). Teeny-tiny bit of an improvement, there.

But of course, it also returns a lot less data. :/ We now have 8,488 matching requests instead of ~75k.

@baltpeter
Member Author

Deployed to data.tweasel.org, which is now usable again.

@baltpeter
Member Author

And debugging the GA adapter against the remote Datasette instance only takes 2mins!
