Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential performance regression for TPCH q18 #13188

Open
Tracked by #13449
alamb opened this issue Oct 30, 2024 · 11 comments
Open
Tracked by #13449

Potential performance regression for TPCH q18 #13188

alamb opened this issue Oct 30, 2024 · 11 comments
Assignees
Labels
bug Something isn't working

Comments

@alamb
Copy link
Contributor

alamb commented Oct 30, 2024

Describe the bug

While enabling StringView reading from Parquet in #13101 @Dandandan noticed a slight regression for TPCH 18 #13101 (comment)

here is the query

select
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice,
    sum(l_quantity)
from
    customer,
    orders,
    lineitem
where
        o_orderkey in (
        select
            l_orderkey
        from
            lineitem
        group by
            l_orderkey having
                sum(l_quantity) > 300
    )
  and c_custkey = o_custkey
  and o_orderkey = l_orderkey
group by
    c_name,
    c_custkey,
    o_orderkey,
    o_orderdate,
    o_totalprice
order by
    o_totalprice desc,
    o_orderdate;

To Reproduce

To reproduce

Make data

# make the data and get to the correct location
cd datafusion/benchmarks
./bench.sh data tpch
cd data/tpch_sf1

Run query:

datafusion-cli -f ../../queries/q18.sql  | grep Elapsed
Elapsed 0.088 seconds.

When StringView is enabled it seems like it is slightly slower

Expected behavior

StringView should always be faster

Additional context

I took a brief look at the flamegraphs -- it seems like one difference could be BatchCoalescer::push_batch

Screenshot 2024-10-30 at 2 13 38 PM

There is a special case for StringView here:
https://github.com/apache/datafusion/blob/6034be42808b43e3f48f6e58ec38cc35fa253abb/datafusion/physical-plan/src/coalesce/mod.rs#L117-L116

Here are the explain plans for the query before and after the change

Here are the flamegraphs for the query before/after the change

  • q18-flamegraph-after
  • q18-flamegraph-before
@alamb alamb added the bug Something isn't working label Oct 30, 2024
@alamb
Copy link
Contributor Author

alamb commented Oct 30, 2024

This may be fixed by fixing #11628

@devanbenz
Copy link
Contributor

take

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 5, 2024

I will take a look on this issue too, how is your progress on this @devanbenz ?

@devanbenz
Copy link
Contributor

devanbenz commented Nov 5, 2024

I will take a look on this issue too, how is your progress on this @devanbenz ?

Looking at the following code path:

if actual_buffer_size > (ideal_buffer_size * 2) {
// We set the block size to `ideal_buffer_size` so that the new StringViewArray only has one buffer, which accelerate later concat_batches.
// See https://github.com/apache/arrow-rs/issues/6094 for more details.
let mut builder = StringViewBuilder::with_capacity(s.len());
if ideal_buffer_size > 0 {
builder = builder.with_fixed_block_size(ideal_buffer_size as u32);
}
for v in s.iter() {
builder.append_option(v);
}
let gc_string = builder.finish();
debug_assert!(gc_string.data_buffers().len() <= 1); // buffer count can be 0 if the `ideal_buffer_size` is 0
Arc::new(gc_string)
} else {

I noticed when writing up a small example using this code path locally.

Screenshot 2024-11-03 at 12 30 19 PM

When taking a look at the append_option source code it implements a copy.

I've identified where the performance fault is--but I'm unsure what next steps to take to try and alleviate it. I was going to maybe modify the coalescer to try and "change the RecordBatch to stringview if the datatype is Utf8View" as the input occurs around here (if that makes sense):

match input_batch {
Some(Ok(batch)) => match self.coalescer.push_batch(batch) {
CoalescerState::Continue => {}
CoalescerState::LimitReached => {
self.inner_state = CoalesceBatchesStreamState::Exhausted;
}
CoalescerState::TargetReached => {
self.inner_state =
CoalesceBatchesStreamState::ReturnBuffer;
}
},

I don't have a meaningful plan yet just have been doing some exploratory work as of right now while benchmarking locally.

@alamb
Copy link
Contributor Author

alamb commented Nov 5, 2024

#11628 might have some ideas. I think @XiangpengHao has also been thinking of some way to do this too

Basically one thing you might be able to try is to switch from using take to directly building the output (with a StringViewBuilder or something equvialent)

But as you are hinting at this is a non trivial thing to do

@alamb
Copy link
Contributor Author

alamb commented Nov 5, 2024

I finally filed a ticket upstream in arrow trying to explain what I have been thinking about:

@jayzhan211
Copy link
Contributor

Other than apache/arrow-rs#6692.
If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192). We can then aggregate those small batches in filter exec and produce a single large output for next step. Does this sound a possible improvement?

I think the downside is again we need to implement builder per type like GroupColumn and the improvement is unclear

@alamb
Copy link
Contributor Author

alamb commented Nov 7, 2024

If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192)

Would this be a specialization of a filter of the aggregates like for SELECT SUM(x) FROM ... HAVING SUM(x) > 5 ?

@jayzhan211
Copy link
Contributor

If we create filter version of GroupColumn that accumulate the filtered array into array builder for each column, output if the batch size reach target (i.e. 8192)

Would this be a specialization of a filter of the aggregates like for SELECT SUM(x) FROM ... HAVING SUM(x) > 5 ?

Probably, but I think it has potential to reduce copied too #11628. It aggregates at the first place similar to coalescer's role and no gc required since we aggregate the value without necessary buffer copied.

@devanbenz
Copy link
Contributor

If someone else would like to take this please let me know. I may not be able to look at this again until the upcoming weekend.

@jayzhan211
Copy link
Contributor

jayzhan211 commented Nov 20, 2024

If someone else would like to take this please let me know. I may not be able to look at this again until the upcoming weekend.

Feel free to work on it since this kind of task is usually not that trivial and everyone may come out a different solution (less chance of duplicated work)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants