Massive slowdown when simply including a (large) column in the result #7187
Comments
try to replace
It makes the first query, which was already fast, around 30% faster, but doesn't do anything for the already slow one.
Can you show the output of the slow query with debug info: `set send_logs_level='trace';`
Sure:
This should probably be `CODEC(DoubleDelta, LZ4)`. It seems the reason is slow decompression of `ZSTD(1)`, and CH decompresses `repository_id`, `location` and `lines` before the `ORDER BY ... LIMIT`. Try this for a test (to check my guess):
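A rough sketch of such a rewrite, assuming a `files` table with `repository_id`, `location`, `lines` and `language` columns: resolve the `ORDER BY ... LIMIT` in a subquery that never touches the heavy column, then fetch `location` only for the matching key.

```sql
-- Hypothetical sketch, not the exact query suggested above:
-- the inner query finds the winning repository_id without reading `location`;
-- the outer query then reads `location` only for that repository's rows.
SELECT location, lines
FROM files
WHERE repository_id IN
(
    SELECT repository_id
    FROM files
    WHERE language = 'Go'
    ORDER BY lines DESC
    LIMIT 1
)
ORDER BY lines DESC
LIMIT 1;
```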
Thanks! Your query is even a bit faster than my first one, but I have to add the language constraint to the outer query, otherwise I sometimes get wrong results.
The only question that remains: could ClickHouse do better on this query, or is there something stopping it from doing something similar itself? Obviously it can't do exactly what you've done, since this optimization depends on noticing that the number of files per repository is generally very small, but it could for example lazily load the column, or at least not read all chunks from disk when it won't use 99.9% of them anyway. If this doesn't fit the ClickHouse dev team's agenda I think we can close this; otherwise let's leave it open as a tracking issue for this feature.
It is.
For now I would recommend you to do that (you can also consider T64), as DoubleDelta is extremely slow right now. :\ That problem will be addressed in future releases: #7082. Let's use #7082 to continue tracking that issue.
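For reference, a column codec can be changed with a plain ALTER; the table and column names below are assumptions, not taken from the original schema:

```sql
-- Hypothetical example of switching the codec on a numeric column;
-- T64 (mentioned above) could be used instead of DoubleDelta.
ALTER TABLE files MODIFY COLUMN lines UInt32 CODEC(DoubleDelta, LZ4);
```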
@filimonov the issue is not related to CODECs.
OK, reopened. But is it still the same issue as in the original issue description?
Yes, the same issue. It's sort of a feature request for a performance improvement.
It looks quite easy to work around with the option @den-crane proposed, or (with even better speed) with argMin, i.e. something like the following.
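A minimal sketch of such an argMin variant, assuming a `files` table with `location`, `lines` and `language` columns:

```sql
-- Hypothetical sketch: argMin returns the `location` of the row with the
-- smallest `lines`, so no full sort (and no full read of `location`) is needed.
SELECT argMin(location, lines)
FROM files
WHERE language = 'Go';
```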
But would such an optimisation be good "in general"?
It's not about that specific optimization (as I've said above) but about ClickHouse reading whole columns when it only needs a tiny number of values. The workaround is application-specific, but the solution would actually be generalizable.
@lorenz It is wonderful that we have this dataset publicly available! Is it OK if we copy and redistribute this dataset from our servers?
We have a similar optimization in mind: #5329. We will implement this optimization, but it will be disabled by default, because we don't use the same table snapshot to process subqueries.
@lorenz Could you please describe this dataset?
@alexey-milovidov The dataset is for directly attaching to ClickHouse; you should be able to extract it into ClickHouse's data directory. The data was not collected by me, it comes from here; I just asked the author for the raw data and then imported it into ClickHouse for testing and distribution (the original dataset is huge, around 1 TiB, since it is all JSON). From my side, feel free to do whatever you want with the data. The cluster which hosts the data has a few Gbps of unmetered bandwidth, so you don't need to be concerned about bandwidth on my side. I also host the raw data at https://blob.dolansoft.org/datasets/boyter-10m-repos/raw.tar.gz
EDIT: If you want to double-check some analyses against the original blog post, be aware that some of them are wrong because the author's code has floating-point accuracy issues. I've reported these, but he didn't update the post.
Yes, I have loaded the data but it shows strange numbers.
We encountered the same issue; changing WHERE to PREWHERE worked.
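A minimal sketch of that workaround, assuming the same hypothetical `files` table as above; with PREWHERE, ClickHouse evaluates the cheap condition first and reads the remaining columns only for blocks that pass it:

```sql
-- Hypothetical example (names assumed): the filter runs in PREWHERE,
-- so `location` is read only for rows matching the language condition.
SELECT location
FROM files
PREWHERE language = 'Go'
ORDER BY lines DESC
LIMIT 1;
```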
## Summary

This PR lowers the number of logs loaded from 100 to 50. There are a couple of reasons for this:

* Loading large amounts of data (which we see with `LogAttributes`) can cause ClickHouse to slow down significantly (see ClickHouse/ClickHouse#7187).
* If there are very few matching results (e.g. find me logs with a log body of "STRIPE_INTEGRATION_ERROR"), we want to terminate as early as possible.
* Loading less data results in a smaller payload size.
* Realistically, only about ~20-25 logs fit on the screen (dependent on screen size) given our UX. This reason is less important than the previous points, but it's worth mentioning that this change shouldn't break our UX.

## How did you test this change?

Verified that logs still load and pagination works as expected.

## Are there any deployment considerations?

N/A
Hi!
Hi,
We have a table with a set of small columns and one column containing binary blobs:
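A hypothetical stand-in with the same shape (one heavy blob column next to several small ones; all names invented) could be:

```sql
-- Hypothetical table, not the reporter's actual schema:
CREATE TABLE events
(
    id      UInt64,
    ts      DateTime,
    kind    LowCardinality(String),
    payload String CODEC(ZSTD(1))  -- large binary blob column
)
ENGINE = MergeTree
ORDER BY (kind, ts);
```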
When a query is executed to select the binary blob, it takes a long time even if it returns no results:
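Against that hypothetical table, the slow case would look something like:

```sql
-- Slow even though nothing matches, because the heavy `payload` column
-- is requested in the result:
SELECT payload FROM events WHERE kind = 'does-not-exist';
```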
It takes significantly longer than when selecting one of the smaller fields:
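Whereas selecting only a small column in the same hypothetical setup is much faster:

```sql
-- Fast: only the small `id` column needs to be read.
SELECT id FROM events WHERE kind = 'does-not-exist';
```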
Since we're relying on this, are there any plans to address the issue? I can see that it is currently on hold.
Describe the situation
Simply adding a column in the result (even if only used for a final lookup) slows the whole query down massively, since it gets read completely as far as I can tell.
Please note that I'm just evaluating ClickHouse and I might be doing something dumb when specifying the table layout or writing queries.
How to reproduce
Running on a single host with NVMe (read bandwidth ~2 GiB/s) with ClickHouse 19.15.2.2
Dataset is publicly available: https://blob.dolansoft.org/datasets/boyter-10m-repos/clickhouse-columns.tar (:warning: ~50GiB / 3.5B+ rows)
Query A (fast):
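A hypothetical reconstruction of the fast shape (names assumed: a `files` table with `repository_id`, `location`, `lines`, `language`):

```sql
-- Hypothetical Query A: the heavy `location` column is not selected.
SELECT repository_id, lines
FROM files
WHERE language = 'Go'
ORDER BY lines DESC
LIMIT 1;
```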
Query B (slow):
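A hypothetical reconstruction of the slow shape, identical except that the heavy column is also in the result:

```sql
-- Hypothetical Query B: also selecting `location` appears to force
-- reading the whole column, slowing the query down massively.
SELECT repository_id, lines, location
FROM files
WHERE language = 'Go'
ORDER BY lines DESC
LIMIT 1;
```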
Expected performance
I would expect both of these queries to take approximately the same time, since ClickHouse could ignore the location column until it has found the single match and then read just that chunk from the (much heavier) location column. Instead it looks like it tries to read the whole location column, which slows the query down around 8 times. I've also tried argMax() instead of `order by x limit 1`, but it seems to suffer from the same issue.

Originally I also had a join to `repositories` in there, but that did not change the performance of either query, so I've removed it in the interest of a more minimal reproducer.