
feat(bigquery): implement CountDistinctStar #9470

Merged: 3 commits, Jul 1, 2024

Conversation

ssabdb
Contributor

@ssabdb ssabdb commented Jun 28, 2024

Description of changes

Implements CountDistinctStar for BigQuery.

BigQuery does not support count(distinct a, b, c) or count(distinct (a, b, c)) (i.e. using a tuple, as is done with DuckDB), because the expressions must be groupable.

Instead, convert the entire expression to a string: SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], '')).

This works with all the BigQuery datatypes (source). JSON-encoding each column and joining the results through an array generates a unique, deterministic string for each combination of values (see JSON encoding).

I do not know what the impact on cost or runtime is here, but there aren't many other ways of achieving a count distinct over multiple columns.
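To illustrate the rewrite, here is a sketch with a hypothetical table t and columns a, b, c; the SQL actually generated by the PR may differ in its details.

-- Not allowed per the description above: COUNT(DISTINCT ...) arguments must be groupable
-- SELECT COUNT(DISTINCT a, b, c) FROM t;
-- SELECT COUNT(DISTINCT (a, b, c)) FROM t;

-- Workaround: JSON-encode each column and join the pieces into one string
SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b), TO_JSON_STRING(c)], ''))
FROM t;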

@ssabdb ssabdb marked this pull request as ready for review June 28, 2024 18:41
@ssabdb ssabdb force-pushed the main branch 5 times, most recently from 5d35f2d to 1aedb67 Compare June 29, 2024 09:21
@cpcloud cpcloud requested a review from tswast June 29, 2024 13:56
@cpcloud
Member

cpcloud commented Jun 29, 2024

@tswast Out of curiosity are there any performance concerns here? Exact count distinct is already expensive, but just curious if the overhead of string encoding would show up here for anything but smaller-scale queries.

@ssabdb
Contributor Author

ssabdb commented Jun 30, 2024

> @tswast Out of curiosity are there any performance concerns here? Exact count distinct is already expensive, but just curious if the overhead of string encoding would show up here for anything but smaller-scale queries.

There will certainly be an overhead, but it's difficult to say how bad it is. From the experimentation I ran, the string encoding shows up significantly for the more complex timestamp datatype.

TL;DR of the below: it seems to depend on the data type. Testing a distinct over a significant number of rows (120M, filtered down to ~30M by the predicate) still led to queries mostly completing in about 2 seconds, but with significant variation in slot time consumed (which is a good proxy for resource requirements in BQ).

I've found a small, easy optimization for my current implementation that avoids initializing an array, which I'll update the PR with shortly (see the CONCAT variant under "Both columns" below). Other optimizations would have to be done on a type-by-type basis and might get quite complex for arrays and structs because they can be nested, whilst there's no point re-encoding existing string or bytes objects. The beauty of TO_JSON_STRING is that it handles nested datatypes natively.

Experimental Setup

Following this Medium article, with some minor changes to add more diverse columns and generate 128M rows (credit to the article's author):

DECLARE i INT64 DEFAULT 0;

-- Seed table: 1M rows with an INT64 id, a minute-truncated timestamp, a country code, and a load time
create or replace table `my_bq_input_dataset.generated_array` as
select
  id,
  TIMESTAMP_TRUNC(timestamp_add(TIMESTAMP '2020-01-01T00:00:00', INTERVAL id SECOND), MINUTE) as time_id,
  case when mod(id,4) = 1 then 'TR'
  when mod(id,4) = 2 then 'DE'
  when mod(id,4) = 3 then 'GB'
  else 'US'
  end as country,
  current_timestamp() as _load_time
from unnest(generate_array(1,1000000)) as id;

-- Double the table seven times: 1M * 2^7 = 128M rows
REPEAT
  insert into my_bq_input_dataset.generated_array select * from my_bq_input_dataset.generated_array;
  SET i = i + 1;
  UNTIL i >= 7
END REPEAT;

I tested encoding the int column (id), then the more complex timestamp column time_id, and then tested combining the two.

Int Column

TL;DR: there is no significant performance overhead from JSON-encoding an int.

First comparing the overhead of running count distinct on the int column:

select count(distinct(TO_JSON_STRING(id))) from my_bq_input_dataset.generated_array
where country = 'TR'

33 slot seconds, 252MB shuffled

select count(distinct(id)) from my_bq_input_dataset.generated_array
where country = 'TR'

33 slot seconds, 208MB shuffled

Timestamp column

TL;DR: string-encoding a timestamp column using TO_JSON_STRING is much slower than casting it to a string. Interestingly, turning it into an integer and then a string is much faster, presumably because formatting the timestamp as an ISO string is the slow part.

select count(distinct(time_id)) from my_bq_input_dataset.generated_array
where country = 'TR'

12 slot seconds, 29MB shuffled

select count(distinct(TO_JSON_STRING(time_id))) from my_bq_input_dataset.generated_array
where country = 'TR'

57 slot seconds, 88.83MB shuffled

select count(distinct(CAST(time_id AS string))) from my_bq_input_dataset.generated_array
where country = 'TR'

33 slot seconds, 88MB shuffled

select count(distinct(TO_JSON_STRING(UNIX_MICROS(time_id)))) from my_bq_input_dataset.generated_array
where country = 'TR'

18 slot seconds, 69MB shuffled

Both columns

Current Implementation

select count(distinct(ARRAY_TO_STRING([TO_JSON_STRING(id), TO_JSON_STRING(time_id)], ''))) from my_bq_input_dataset.generated_array
where country = 'TR'

1 min 52 s of slot time, 1.02GB shuffled

Optimization to submit: skip the array initialization

select count(distinct(CONCAT(TO_JSON_STRING(id), TO_JSON_STRING(time_id)))) from my_bq_input_dataset.generated_array
where country = 'TR'

1 min 28 s of slot time, 1.09GB shuffled

As expected, directly casting the timestamp to a string rather than JSON-encoding it yields a small saving:

select count(distinct(CONCAT(CAST(id as STRING), CAST(time_id as STRING)))) from my_bq_input_dataset.generated_array
where country = 'TR'

1 min 22 s of slot time, 936MB shuffled

Finally, the significant saving from first transforming the timestamp into an int via unix_micros also carries over to the two-column case. To implement this for all data types, I'd have to come up with a way of optimally string-encoding each data type and handle it in the PR:

select count(distinct(CONCAT(id, unix_micros(time_id)))) from my_bq_input_dataset.generated_array
where country = 'TR'

56 slot seconds.

Conclusion

The problem remains that BQ does not support count distinct over more than one column, so a single groupable value is required to contain all the input columns. I think the above quantifies the significant overhead that comes from string-encoding complex datatypes like TIMESTAMP, but there are future workarounds if this ends up being a problem.

@cpcloud
Member

cpcloud commented Jun 30, 2024

Thanks for really digging in here, the analysis is much appreciated. I'm inclined to merge this as is after review and address performance concerns as they arise.

I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow up IMO.

@ssabdb
Contributor Author

ssabdb commented Jul 1, 2024

> Thanks for really digging in here, the analysis is much appreciated. I'm inclined to merge this as is after review and address performance concerns as they arise.
>
> I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow up IMO.

Great. Will push an updated version today with the minor performance improvement.

Nitpicking: it's hard to get away from needing to string-encode: arrays aren't groupable, and it's not just heterogeneous types. If you need a distinct over more than a single column, even when every column is the same groupable type, you still need to combine them somehow. I agree you could skip string encoding for strings themselves; JSON encoding is a catch-all.
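As a rough sketch of that follow-up idea (hypothetical table t with a STRING column s, an INT64 column n, and a non-groupable ARRAY column arr; this is not what the PR currently generates):

-- Only JSON-encode the non-groupable column; cast the rest directly
SELECT COUNT(DISTINCT CONCAT(s, CAST(n AS STRING), TO_JSON_STRING(arr)))
FROM t;

A real implementation would also need to ensure the concatenated pieces can't collide across rows, which is what the uniform JSON encoding is relied upon for in the current approach.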

@cpcloud
Member

cpcloud commented Jul 1, 2024

> > Thanks for really digging in here, the analysis is much appreciated. I'm inclined to merge this as is after review and address performance concerns as they arise.
> > I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow up IMO.

> Great. Will push an updated version today with the minor performance improvement.

No need to do that in this PR. I'd like to hear from @tswast before merging, but I think we can address performance concerns (to the extent possible) in a follow-up (or never if we don't hear about them!).

> Nitpicking: it's hard to get away from needing to string-encode: arrays aren't groupable, and it's not just heterogeneous types. If you need a distinct over more than a single column, even when every column is the same groupable type, you still need to combine them somehow. I agree you could skip string encoding for strings themselves; JSON encoding is a catch-all.

Yep! On second thought I'm not sure you should do any additional work here until we hear from folks whose workflows are limited by the performance of this operation. The fact of the matter is that people are already working around the lack of support for this in BigQuery, or they are using approximate alternatives, which we already support, so this is at base an improvement.

@ssabdb
Contributor Author

ssabdb commented Jul 1, 2024

Fine by me. I've removed the redundant array initialization in favour of a simple concat and left it at that, which itself saves a bit of time in the profiling above. Otherwise I think it's ready for review.

Member

@cpcloud cpcloud left a comment


Thanks!

Review comment on ibis/backends/bigquery/compiler.py (outdated, resolved)
@cpcloud cpcloud added this to the 9.2 milestone Jul 1, 2024
@cpcloud cpcloud added the feature and bigquery labels Jul 1, 2024
@tswast
Collaborator

tswast commented Jul 1, 2024

> Exact count distinct is already expensive, but just curious if the overhead of string encoding would show up here for anything but smaller-scale queries.

We use the same workaround in BigQuery DataFrames for some operations; indeed, as @ssabdb notes, there are some types that aren't groupable, so there isn't a great way around those.

That said, TO_JSON_STRING is slower than some other conversion methods. I would recommend borrowing our implementation here which uses more specific methods when available: https://github.com/googleapis/python-bigquery-dataframes/blob/6d947a2b2930cd34faf39e920711d0330b8a5651/bigframes/core/compile/default_ordering.py#L36-L50

Or maybe we update cast to support these workarounds for types that aren't directly castable to string normally?
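Translated to SQL, that kind of type-specific encoding might look roughly like the sketch below, reusing the UNIX_MICROS observation from the profiling earlier in this thread. The table t and its columns are hypothetical, and this is only a sketch of the idea, not the bigquery-dataframes implementation.

-- Hypothetical per-type encoding: cheap casts where possible, JSON only as a fallback
SELECT COUNT(DISTINCT CONCAT(
  CAST(int_col AS STRING),                 -- INT64: plain cast
  CAST(UNIX_MICROS(ts_col) AS STRING),     -- TIMESTAMP: integer micros, then cast
  TO_JSON_STRING(nested_col)               -- ARRAY/STRUCT: JSON fallback
))
FROM t;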

@cpcloud
Member

cpcloud commented Jul 1, 2024

The workaround sounds good to me. I think that can be done in a follow up! @ssabdb If you're feeling up for that, would be greatly appreciated!

@cpcloud
Member

cpcloud commented Jul 1, 2024

I'll fix any xfailing tests here and then merge.

@cpcloud
Member

cpcloud commented Jul 1, 2024

Ok, fixed up the xfails and implemented the count distinct star with filter case.

@cpcloud
Member

cpcloud commented Jul 1, 2024

The tricky bit is that COUNT(DISTINCT ...) can't use the usual aggregation filter syntax, so you have to do the "NULL out the rows the filter excludes" thing.
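For illustration, a sketch of that NULL-ing approach against the test table from the profiling above; the SQL actually emitted by the compiler may differ.

-- COUNT(DISTINCT ...) can't take the usual filtered-aggregate form, so rows the
-- filter excludes are turned into NULL, which COUNT(DISTINCT ...) then ignores
SELECT COUNT(DISTINCT IF(
  country = 'TR',
  CONCAT(TO_JSON_STRING(id), TO_JSON_STRING(time_id)),
  NULL
))
FROM my_bq_input_dataset.generated_array;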

@cpcloud
Member

cpcloud commented Jul 1, 2024

BigQuery is passing:

…/ibis on  ssabdb/main:main is 📦 v9.1.0 via 🐍 v3.10.14 via ❄️   impure (ibis-3.10.14-env)
❯ pytest -m bigquery -n 8 --dist loadgroup -q
bringing up nodes...
xxsssssssssssssssssssss...sssssssssssssssssssssssssss.x...x.x....x..x.....x.......x.....xx...x...x...x.......xx..x.......xx.x..x...........x...x......x.x....x.x.............................x......x [  9%]
......x........x...............xxx.xxxxxxx....x....x...xxxx.x.x.xx........x.xx.xxx..x..xxxx..xx.x.x.x..xxxxxxxxxxx.x..xxx..x.x.xxx......x......x...x..........xx.x..x..x..x.........x..............x. [ 19%]
.......x...x....x.........x..............................x.x............x.x............x.........x..........x.xx............x...................s..s......s.......................................... [ 29%]
......................................x.............s.........................x..............xsx.s..x.......s.........x..x...x........sx.....x........x........x.x....x..x........................... [ 39%]
..xx......x.......xx.x.x...x..............x..x.............x.....x......xxx...x........xx.......xxx....x.x........x..........x................x.xxxxxx...xx...xxxxxxxxx.x..xxxxxxxxxxx.xx.xxxxxxxxxxx [ 49%]
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx.xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx..xx................x...x.....................xx..................x.....................x...................x.. [ 59%]
x........................................x........x........................................................x................................x..............x......x...x..x..xxx...................... [ 69%]
xx.......xx........x................x.......................x..xx..x..x.x......x....x..........x..x..x....xx..x................x..x......xx........xx.......x......xx...................x............ [ 79%]
.........................................x..x.........................................................................................................x.............................................. [ 89%]
....................................................................................................................................................x................................................ [ 99%]
.............                                                                                                                                                                                         [100%]
1592 passed, 56 skipped, 335 xfailed in 688.16s (0:11:28)

@cpcloud cpcloud merged commit 273e4bc into ibis-project:main Jul 1, 2024
76 checks passed
@cpcloud
Member

cpcloud commented Jul 1, 2024

Thanks @ssabdb, keep 'em coming!
