feat(bigquery): implement CountDistinctStar #9470
Conversation
@tswast Out of curiosity, are there any performance concerns here? Exact count distinct is already expensive, but I'm curious whether the overhead of string encoding would show up for anything but smaller-scale queries.
There will certainly be an overhead, but it's difficult to answer how bad it is. From the experimentation I decided to run, the string encoding certainly shows up significantly for the more complex timestamp datatype.

TL;DR of the below: it seems to depend on data type. Testing a distinct on a significant number of rows (120M, predicated to ~30M) still led to queries mostly completing in 2 seconds, but with significant variation in slot-time consumed (which is a good proxy for resource requirements in BQ). I've found a small, easy optimization for my current implementation to avoid initializing an array, which I'll update the PR with shortly. Other optimizations would have to be on a type-by-type basis and might get quite complex with arrays and structs because they can be nested, whilst there's no point re-encoding existing string or bytes objects. The beauty of TO_JSON_STRING is that it natively handles nested datatypes itself.

Experimental Setup

Following this medium article, with some minor changes to make some diverse columns and generate 128M rows:
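(The setup query itself didn't survive into this thread; a minimal sketch of the kind of generator being described, with an assumed table name `perf_test.wide_rows` and an assumed column mix, might look like the following.)

```sql
-- Hypothetical generator: roughly 128M rows with a few differently typed columns.
-- Table and column names are placeholders, not the ones used in the experiment.
CREATE OR REPLACE TABLE perf_test.wide_rows AS
SELECT
  chunk * 1000000 + m AS int_col,
  CAST(chunk * 1000000 + m AS STRING) AS str_col,
  TIMESTAMP_ADD(TIMESTAMP '2020-01-01', INTERVAL m SECOND) AS ts_col
FROM UNNEST(GENERATE_ARRAY(1, 128)) AS chunk
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000000)) AS m;
```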
I tested encoding the int column and the timestamp column, separately and then together.

Int Column

TL;DR: there is no significant performance overhead from JSON-encoding an int.

First, comparing the overhead of running count distinct on the int column:
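(The exact queries aren't reproduced in this thread; the pair being compared is presumably of roughly this shape, with placeholder table and column names.)

```sql
-- Baseline: count distinct directly on the int column.
SELECT COUNT(DISTINCT int_col) FROM perf_test.wide_rows;

-- Same column, but JSON-encoded first, as this PR does.
SELECT COUNT(DISTINCT TO_JSON_STRING(int_col)) FROM perf_test.wide_rows;
```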
- 33 slot seconds, 252 MB shuffled
- 33 slot seconds, 208 MB shuffled

Timestamp column

TL;DR: string-encoding a timestamp column using TO_JSON_STRING is much slower than casting it to a string. Interestingly, turning it into an integer and then a string is much faster, presumably because encoding the date to the ISO format is so much slower.
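(Again, the original queries are not preserved; the variants behind the numbers below are presumably along these lines, with placeholder names, and the exact mapping of variant to result isn't recoverable from this thread.)

```sql
-- Plain count distinct on the timestamp column.
SELECT COUNT(DISTINCT ts_col) FROM perf_test.wide_rows;

-- JSON-encoded, as in the current implementation (ISO-8601 formatting is the costly part).
SELECT COUNT(DISTINCT TO_JSON_STRING(ts_col)) FROM perf_test.wide_rows;

-- Cast straight to STRING.
SELECT COUNT(DISTINCT CAST(ts_col AS STRING)) FROM perf_test.wide_rows;

-- Via an integer: UNIX_MICROS first, then cast to STRING.
SELECT COUNT(DISTINCT CAST(UNIX_MICROS(ts_col) AS STRING)) FROM perf_test.wide_rows;
```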
- 12 slot seconds, 29 MB shuffled
- 57 slot seconds, 88.83 MB shuffled
- 33 slot seconds, 88 MB shuffled
- 18 slot seconds, 69 MB shuffled

Both columns

Current implementation:
- 1 min 52 slot seconds, 1.02 GB shuffled

Optimization to submit (skip the array initialization):
- 1 min 28 slot seconds, 1.09 GB shuffled

As expected, the small saving from directly casting a timestamp to a string rather than JSON-encoding it also shows up here:
- 1 min 22 slot seconds, 936 MB shuffled

Finally, the significant saving from first transforming a timestamp into an int through unix_micros also translates to both columns. In order to implement this for all data types, I'd have to come up with a way of optimally string-encoding each data type and handle it in the PR:
- 56 slot seconds.

Conclusion

The problem here remains that BQ does not support count distinct on more than one column, so a single type is required to contain all the input types. I think the above quantifies the significant overhead that comes from string-encoding complex datatypes like TIMESTAMP, but there are future workarounds if this ends up being a problem.
Thanks for really digging in here; the analysis is much appreciated. I'm inclined to merge this as-is after review and address performance concerns as they arise. I suspect we could probably get pretty far by only encoding columns whose type is not groupable. Completely fine to do in a follow-up IMO.
Great. Will push an updated version today with the minor performance improvement. Nitpicking: it's hard to get away from needing to string-encode: arrays aren't groupable, and it's not just heterogeneous types. If you need to distinct more than a single column and each column is the same groupable type, you still need to combine them. I agree you could skip string encoding for strings themselves, and JSON encoding is a catch-all.
No need to do that in this PR. I'd like to hear from @tswast before merging, but I think we can address performance concerns (to the extent possible) in a follow-up (or never if we don't hear about them!).
Yep! On second thought, I'm not sure you should do any additional work here until we hear from folks whose workflows are limited by the performance of this operation. The fact of the matter is that people are already working around the lack of support for this in BigQuery, or they are using approximate alternatives, which we already support, so this is at base an improvement.
Fine by me. I've removed the redundant array initialization in favour of a simple concat and left it at that, which itself saves a bit of time in the profiling above. Otherwise I think it's ready for review.
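(For reference, the change being described amounts to roughly the following difference in the generated SQL, sketched with placeholder names `t`, `a`, `b`; the exact output depends on the compiler.)

```sql
-- Before: build an array of encoded values only to join it back into one string.
SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], ''))
FROM t;

-- After: concatenate the encoded values directly, skipping the intermediate array.
SELECT COUNT(DISTINCT CONCAT(TO_JSON_STRING(a), TO_JSON_STRING(b)))
FROM t;
```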
Thanks!
We make the same workaround in BigQuery DataFrames for some operations; indeed, as @ssabdb says, there are some types that aren't groupable, so there isn't a great way around those. That said, TO_JSON_STRING is slower than some other conversion methods. I would recommend borrowing our implementation here, which uses more specific methods when available: https://github.com/googleapis/python-bigquery-dataframes/blob/6d947a2b2930cd34faf39e920711d0330b8a5651/bigframes/core/compile/default_ordering.py#L36-L50 Or maybe we update
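(As a rough illustration of that suggestion, not the bigframes code itself: the distinct key could use cheaper per-type conversions where they exist and keep TO_JSON_STRING only as the fallback. Column and table names here are placeholders.)

```sql
-- str_col is already a string; ts_col goes through UNIX_MICROS instead of ISO
-- formatting; arr_col has no cheap scalar conversion, so TO_JSON_STRING remains
-- the fallback. A real implementation would also need to handle NULLs and
-- escaping/separator collisions between the pieces.
SELECT COUNT(DISTINCT CONCAT(
  str_col,
  CAST(UNIX_MICROS(ts_col) AS STRING),
  TO_JSON_STRING(arr_col)
))
FROM t;
```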
The workaround sounds good to me. I think that can be done in a follow-up! @ssabdb, if you're feeling up for that, it would be greatly appreciated!
I'll fix any xfailing tests here and then merge.
OK, fixed up the xfails and implemented the count distinct star with filter case.
The tricky bit is that
BigQuery is passing.
Thanks @ssabdb, keep 'em coming!
Description of changes
Implements CountDistinctStar for BigQuery.
BigQuery does not support count(distinct a, b, c) or count(distinct (a, b, c)) (i.e. using a tuple, as is done with DuckDB), as expressions must be groupable.
Instead, convert the entire expression to a string:
SELECT COUNT(DISTINCT ARRAY_TO_STRING([TO_JSON_STRING(a), TO_JSON_STRING(b)], ''))
This works with all the BigQuery datatypes (source). Using an array of JSON-encoded values generates a unique, deterministic string for each combination of values (see JSON encoding).
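(A small, generic illustration of why the values are JSON-encoded rather than concatenated raw; this example is not taken from the PR's tests.)

```sql
-- Raw concatenation conflates distinct pairs; the JSON-encoded form keeps them apart.
SELECT
  CONCAT('ab', 'c') = CONCAT('a', 'bc') AS raw_collides,  -- true
  ARRAY_TO_STRING([TO_JSON_STRING('ab'), TO_JSON_STRING('c')], '')
    = ARRAY_TO_STRING([TO_JSON_STRING('a'), TO_JSON_STRING('bc')], '') AS encoded_collides;  -- false
```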
I do not know what the impact on cost or runtime is here, but there aren't many other ways of achieving a count distinct across multiple columns of mixed types.