-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance: Add "read strings as binary" option for parquet #12788
Comments
I want to try this issue, but I'm unsure if it's part of #12777. Is @jayzhan211 currently working on it? |
Thank you @goldmedal I think #12777 is version of this issue, but it changes the schema for all files, not just based on a config option |
Thanks, @alamb |
I think a POC would be wonderful |
You can take it if you would like to |
take |
@alamb @jayzhan211 let ctx = SessionContext::new();
ctx.sql(
r#"
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'benchmarks/data/hits_partitioned'
OPTIONS ('binary_as_string' 'true')
"#,
).await?.show().await?;
ctx.sql("describe hits").await?.show().await?;
ctx.sql(r#"select "Title" from hits limit 1"#).await?.show().await?; The result is
If you want to do some experiments, this PR is easy to use. Related IssueBy the way, I found an issue about casting let ctx = SessionContext::new();
ctx.sql(
r#"
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION '/Users/jax/git/datafusion/benchmarks/data/hits_partitioned'
OPTIONS ('binary_as_string' 'true', 'schema_force_view_types' 'true')
"#,
).await?.show().await?;
ctx.sql("describe hits").await?.show().await?;
ctx.sql(r#"select "Title" from hits limit 1"#).await?.show().await?;
----
Error: Error during planning: Cannot cast file schema field Title of type Binary to table schema field of type Utf8View It can be reproduced by
I guess it is an issue of Maybe we should file an issue on the arrow-rs repo. 🤔
|
Thank you @goldmedal -- I am checking it out now. |
Yes, we need to support binary -> utf8view in arrow cast |
Filed apache/arrow-rs#6531 |
What I hope / plan to do is to combine this PR with the code in #12792 and #12809 from @Rachelint and finally get a benchmark run that shows stringview speeding up ClickBench for hits_partitioned. Stay tuned. |
Casting from binary --> utf8view via BTW thinking more about this, I do think we need to support the cast, but in this PR we should effectively change the file schema (not just the table schema) when we setup the parquet reader (specifically with Here is the code that does so for datafusion/datafusion/core/src/datasource/physical_plan/parquet/opener.rs Lines 126 to 129 in b821929
I think we need to also allow the switch when the table schema is |
This idea sounds great. It seems if we can apply the new schema when reading file, we can save one time casting. Just read as string. I tried to follow the implementation of StringView to apply the new schema using
I can reprodcue this error on the arrow-rs side by added a test case in #[test]
fn test_cast_binary_utf8() {
let original_fields = Fields::from(vec![
Field::new("binary_to_utf8", ArrowDataType::Binary, false),
]);
let file = write_parquet_from_iter(vec![
(
"binary_to_utf8",
Arc::new(BinaryArray::from(vec![b"one".as_ref(), b"two".as_ref()])) as ArrayRef,
),
]);
let supplied_fields = Fields::from(vec![
Field::new("binary_to_utf8", ArrowDataType::Utf8, false),
]);
let options = ArrowReaderOptions::new().with_schema(Arc::new(Schema::new(supplied_fields)));
let mut arrow_reader = ParquetRecordBatchReaderBuilder::try_new_with_options(
file.try_clone().unwrap(),
options,
)
.expect("reader builder with schema")
.build()
.expect("reader with schema");
let batch = arrow_reader.next().unwrap().unwrap();
assert_eq!(batch.num_columns(), 1);
assert_eq!(batch.num_rows(), 2);
assert_eq!(
batch
.column(0)
.as_any()
.downcast_ref::<StringArray>()
.expect("downcast to string")
.iter()
.collect::<Vec<_>>(),
vec![Some("one"), Some("two")]
);
} The output is
I tired to fix it through adding more pattern match at (DataType::Binary, DataType::Utf8) => hint, It can work well but I'm not pretty sure if this way makes sense 🤔 BTW, this way might be not safe. If the data isn't a valid Utf8 binary like
This casting will fail by
Maybe we also need to make this behaivor be optional on the arrow-rs side? 🤔 |
Submitted apache/arrow-rs#6539 for it. |
Thank you @goldmedal -- that looks great. |
TLDR I would like to add a new
binary_as_string
option for paruetIs your feature request related to a problem or challenge?
The Real Problem
The primary problem is that the ClickBench queries slow down when we enable StringView by default only for the
hits_partitioned
version of the datasetOne of the reasons is that reading a column as a
BinaryViewArray
and then casting toUtf8ViewArray
is significantly slower than reading the data from parquet as aUtf8ViewArray
(due to optimizations in the parquet decoder).This is not a problem with
StringArray
-->BinaryArray
because reading a column as aBinaryArray
and then casting toUtf8Array
is about the same speed as reading as Utf8ArrayThe core issue is that for
hits_partitioned
the "String" columns in the schema are marked as binary (not Utf8) and thus the slower conversion path is used.The background:
hits_partitioned
has "string" columns marked as "Binary"Clickbench has 2 versions of the parquet dataset (see docs here)
hits.parquet
(a single 14G parquet file)athena_partitioned/hits_{0..99}.parquet
However, the SCHEMA is different between these two files
hits.parquet
hasString
s:DataFusion recognizes this as Utf8
hits_partitioned
has the string columns asBinary
:Which datafusion correctly interprets as
Binary
:Describe the solution you'd like
I would like a way to treat binary columns in the hits_partitioned dataset as Strings.
This is the right thing to do for the
hits_partitioned
dataset, but I am not sure it is the right thing to do for all files, so I think we need some flagDescribe alternatives you've considered
I propose adding a
binary_as_string
option to the parquet reader likeIn fact, it seems as if duckDB does exactly this note the (
binary_as_string=True
) item herehttps://duckdb.org/docs/data/parquet/overview.html#parameters
And it is set in clickbench scripts:
https://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/duckdb-parquet/create.sql#L6
Additional context
Trino also explictly specifies the types of files in the
hits_partitioned
datasethttps://github.com/ClickHouse/ClickBench/blob/a6615de8e6eae45f8c1b047df0073fe32f43499f/trino/create_partitioned.sql#L1-L107
We could argue that adding something just for benchmarking goes against the spirit of the test, however, I think we can make a reasonable argument on correctness grounds too
For example, if you run this query today
You get this arguably nonsensical result (the search phrase is shown as binary):
However if you run the same query with the proper schema (strings) you see a real search phrase
The text was updated successfully, but these errors were encountered: