feat(bigquery): implement CountDistinctStar #9470
Conversation
@tswast Out of curiosity, are there any performance concerns here? Exact count distinct is already expensive, but I'm curious whether the overhead of string encoding would show up for anything but smaller-scale queries.
There will certainly be an overhead, but it's difficult to answer how bad it is. From the experimentation I decided to run, the string encoding certainly shows up significantly for the more complex timestamp datatype.

TL;DR of the below: it seems to depend on data type. Testing a distinct on a significant number of rows (120M, predicated to ~30M) still led to queries mostly completing in 2 seconds, but with significant variation in slot-time consumed (which is a good proxy for resource requirements in BQ). I've found a small, easy optimization for my current implementation to avoid initializing an array, which I'll update the PR with shortly. Other optimizations would have to be on a type-by-type basis and might get quite complex with arrays and structs because they can be nested, whilst there's no point re-encoding existing string or bytes objects. The beauty of TO_JSON_STRING is that it natively handles nested datatypes itself.

Experimental Setup

Following this medium article, with some minor changes to make some diverse columns and generate 128M rows:
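(The setup query itself didn't survive into this thread; a minimal sketch of the kind of generator being described, with an assumed table name `perf_test.wide_rows` and an assumed column mix, might look like the following.)

```sql
-- Hypothetical generator: roughly 128M rows with a few differently typed columns.
-- Table and column names are placeholders, not the ones used in the experiment.
CREATE OR REPLACE TABLE perf_test.wide_rows AS
SELECT
  chunk * 1000000 + m AS int_col,
  CAST(chunk * 1000000 + m AS STRING) AS str_col,
  TIMESTAMP_ADD(TIMESTAMP '2020-01-01', INTERVAL m SECOND) AS ts_col
FROM UNNEST(GENERATE_ARRAY(1, 128)) AS chunk
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000000)) AS m;
```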
I tested encoding the int column and the timestamp column, separately and then together.

Int Column

TL;DR: there is no significant performance overhead from JSON-encoding an int.

First, comparing the overhead of running count distinct on the int column:
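(The exact queries aren't reproduced in this thread; the pair being compared is presumably of roughly this shape, with placeholder table and column names.)

```sql
-- Baseline: count distinct directly on the int column.
SELECT COUNT(DISTINCT int_col) FROM perf_test.wide_rows;

-- Same column, but JSON-encoded first, as this PR does.
SELECT COUNT(DISTINCT TO_JSON_STRING(int_col)) FROM perf_test.wide_rows;
```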
- 33 slot seconds, 252 MB shuffled
- 33 slot seconds, 208 MB shuffled

Timestamp column

TL;DR: string-encoding a timestamp column using TO_JSON_STRING is much slower than casting it to a string. Interestingly, turning it into an integer and then a string is much faster, presumably because encoding the date to the ISO format is so much slower.
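(Again, the original queries are not preserved; the variants behind the numbers below are presumably along these lines, with placeholder names, and the exact mapping of variant to result isn't recoverable from this thread.)

```sql
-- Plain count distinct on the timestamp column.
SELECT COUNT(DISTINCT ts_col) FROM perf_test.wide_rows;

-- JSON-encoded, as in the current implementation (ISO-8601 formatting is the costly part).
SELECT COUNT(DISTINCT TO_JSON_STRING(ts_col)) FROM perf_test.wide_rows;

-- Cast straight to STRING.
SELECT COUNT(DISTINCT CAST(ts_col AS STRING)) FROM perf_test.wide_rows;

-- Via an integer: UNIX_MICROS first, then cast to STRING.
SELECT COUNT(DISTINCT CAST(UNIX_MICROS(ts_col) AS STRING)) FROM perf_test.wide_rows;
```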
- 12 slot seconds, 29 MB shuffled
- 57 slot seconds, 88.83 MB shuffled
- 33 slot seconds, 88 MB shuffled
- 18 slot seconds, 69 MB shuffled

Both columns

Current implementation:
- 1 min 52 slot seconds, 1.02 GB shuffled

Optimization to submit (skip the array initialization):
- 1 min 28 slot seconds, 1.09 GB shuffled

As expected, the small saving from directly casting a timestamp to a string rather than JSON-encoding it also shows up here:
- 1 min 22 slot seconds, 936 MB shuffled

Finally, the significant saving from first transforming a timestamp into an int through unix_micros also translates to both columns. In order to implement this for all data types, I'd have to come up with a way of optimally string-encoding each data type and handle it in the PR:
- 56 slot seconds.

Conclusion

The problem here remains that BQ does not support count distinct on more than one column, so a single type is required to contain all the input types. I think the above quantifies the significant overhead that comes from string-encoding complex datatypes like TIMESTAMP, but there are future workarounds if this ends up being a problem.
Thanks for really digging in here; the analysis is much appreciated. I'm inclined to merge this as-is after review and address performance concerns as they arise. I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow-up IMO.
Great. Will push an updated version today with the minor performance improvement. Nitpicking: it's hard to get away from needing to string-encode: arrays aren't groupable, and it's not just heterogeneous types. If you need to distinct more than a single column and each column is the same groupable type, you still need to combine them. I agree you could skip string encoding for strings themselves, and JSON encoding is a catch-all.
No need to do that in this PR. I'd like to hear from @tswast before merging, but I think we can address performance concerns (to the extent possible) in a follow-up (or never if we don't hear about them!).
Yep! On second thought, I'm not sure you should do any additional work here until we hear from folks whose workflows are limited by the performance of this operation. The fact of the matter is that people are already working around the lack of support for this in BigQuery, or they are using approximate alternatives, which we already support, so this is at base an improvement.
Fine by me. I've removed the redundant array initialization in favour of a simple concat and left it at that, which itself saves a bit of time in the profiling above. Otherwise I think it's ready for review.
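(For reference, the change being described amounts to roughly the following difference in the generated SQL, sketched with placeholder names `t`, `a`, `b`; the exact output depends on the compiler.)

```sql
-- Before: build an array of encoded values only to join it back into one string.
SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], ''))
FROM t;

-- After: concatenate the encoded values directly, skipping the intermediate array.
SELECT COUNT(DISTINCT CONCAT(TO_JSON_STRING(a), TO_JSON_STRING(b)))
FROM t;
```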
Thanks!
We make the same workaround in BigQuery DataFrames for some operations; indeed, as @ssabdb says, there are some types that aren't groupable, so there isn't a great way around those. That said, TO_JSON_STRING is slower than some other conversion methods. I would recommend borrowing our implementation here, which uses more specific methods when available: https://github.com/googleapis/python-bigquery-dataframes/blob/6d947a2b2930cd34faf39e920711d0330b8a5651/bigframes/core/compile/default_ordering.py#L36-L50 Or maybe we update
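(As a rough illustration of that suggestion, not the bigframes code itself: the distinct key could use cheaper per-type conversions where they exist and keep TO_JSON_STRING only as the fallback. Column and table names here are placeholders.)

```sql
-- str_col is already a string; ts_col goes through UNIX_MICROS instead of ISO
-- formatting; arr_col has no cheap scalar conversion, so TO_JSON_STRING remains
-- the fallback. A real implementation would also need to handle NULLs and
-- escaping/separator collisions between the pieces.
SELECT COUNT(DISTINCT CONCAT(
  str_col,
  CAST(UNIX_MICROS(ts_col) AS STRING),
  TO_JSON_STRING(arr_col)
))
FROM t;
```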
The workaround sounds good to me. I think that can be done in a follow-up! @ssabdb, if you're feeling up for that, it would be greatly appreciated!
I'll fix any xfailing tests here and then merge.
OK, fixed up the xfails and implemented the count distinct star with filter case.
The tricky bit is that
BigQuery is passing.
Thanks @ssabdb, keep 'em coming!
Description of changes
Implements CountDistinctStar for BigQuery.
BigQuery does not support count(distinct a, b, c) or count(distinct (a, b, c)) (i.e. using a tuple, as is done with DuckDB), as expressions must be groupable.
Instead, convert the entire expression to a string:
SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], ''))
This works with all the BigQuery datatypes (source). Using an array of JSON-encoded values generates a unique, deterministic string for each combination of values (see JSON encoding).
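(A small, generic illustration of why the values are JSON-encoded rather than concatenated raw; this example is not taken from the PR's tests.)

```sql
-- Raw concatenation conflates distinct pairs; the JSON-encoded form keeps them apart.
SELECT
  CONCAT('ab', 'c') = CONCAT('a', 'bc') AS raw_collides,  -- true
  ARRAY_TO_STRING([TO_JSON_STRING('ab'), TO_JSON_STRING('c')], '')
    = ARRAY_TO_STRING([TO_JSON_STRING('a'), TO_JSON_STRING('bc')], '') AS encoded_collides;  -- false
```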
I do not know what the impact on cost or runtime is here, but there aren't many other ways of achieving a count distinct across multiple columns of mixed types.