Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of COUNT (distinct x) for dictionary columns #258

Open
alamb opened this issue May 4, 2021 · 5 comments
Open

Improve performance of COUNT (distinct x) for dictionary columns #258

alamb opened this issue May 4, 2021 · 5 comments
Assignees
Labels
datafusion Changes in the datafusion crate enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented May 4, 2021

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I have large amounts of low cardinality string data (for example, 200 M rows, but only 20 distinct values). DictionaryArrays are very good for such data as they are space efficient.

#256 adds basic query support for distinct dictionary columns but it is not a very computationally efficient imlementation. It effectively unpacks the (likely mostly deduplicated) dictionary's values row by row into a hash set to deduplicate it again. That is a lot of extra hashing work.

Describe the solution you'd like
It would likely be much more efficient (especially for arrays that have a small number of distinct values in their dictionary) to look at the values from the dictionary directly, first checking that each entry in the dictionary was actually used.

@jaylmiller
Copy link
Contributor

jaylmiller commented Mar 11, 2023

I've made a little PR for this. But I'm not sure about how to go about measuring the performance improvements... @alamb do you know of any existing benches in the codebase that would measure this?

alternatively i guess i could add a new bench but i remember you saying you were weary about adding too many new benches 😅

@waynexia
Copy link
Member

To my knowledge, the current TPC-H and TPC-DS benchmarks do not include this scenario. While the data may be dictionary encoded in storage format, they are expanded (IIRC) to normal arrays after being read into memory.

@alamb
Copy link
Contributor Author

alamb commented Mar 12, 2023

I agree with @waynexia that this scenario is not covered by any existing datafusion benchmarks I know of

Clickbench has several queries that include count distinct (see for example #5276 (comment)) but I am not sure if the input is dictionary encoded.

> CREATE EXTERNAL TABLE hits STORED AS PARQUET LOCATION 'hits.parquet';

> SELECT "RegionID", SUM("AdvEngineID"), COUNT(*) AS c, AVG("ResolutionWidth"), COUNT(DISTINCT "UserID") FROM hits GROUP BY "RegionID" ORDER BY c DESC LIMIT 10;

However, I think with #5166 you could now create a dictionary encoded version with a command like the following (untested as I don't not to have the data downloaded -- data is here https://github.com/ClickHouse/ClickBench/tree/main#data-loading)

CREATE TABLE hits_dictionary as 
select 
  arrow_cast("RegionID", 'Dictionary(Int32, Utf8)') as "RegionID",
  "ResolutionWidth",
  "UserID",
FROM hits;

@mingmwang
Copy link
Contributor

I have one question for the parquet stored data. When the arrow parquet reader read data from parquet files, I remember even in parquet the files, the data is dictionary encoded(string types with low cardinality), the arrow parquet reader will not convert it to Arrow Dictionary type.
@liukun4515 Could you please help to confirm this ?

@alamb
Copy link
Contributor Author

alamb commented Mar 15, 2023

When the arrow parquet reader read data from parquet files, I remember even in parquet the files, the data is dictionary encoded(string types with low cardinality), the arrow parquet reader will not convert it to Arrow Dictionary type.

I believe @tustvold changed this a while ago so that the Dictionary is preserved (and then I think datafusion hydrates them) -- apache/arrow-rs#1180

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datafusion Changes in the datafusion crate enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants