Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

Closed
Tracked by #5637
zeodtr opened this issue Feb 2, 2023 · 19 comments
Closed
Tracked by #5637

Optimizer is slow: Avoid too many string cloning in the optimizer #5157

zeodtr opened this issue Feb 2, 2023 · 19 comments
Labels
enhancement New feature or request

Comments

@zeodtr
Copy link

zeodtr commented Feb 2, 2023

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

I'm not sure this is a feature request, but at least this is not a bug (albeit it's a performance problem), so I write this issue as a feature request.

I'm benchmarking the optimizers of DataFusion and Calcite.
I intended to compare the quality of the optimized plans between them, assuming that DataFusion's optimizing speed would be faster (since it's written in Rust).
But to my surprise, I found that Calcite's optimizer is way faster (~ 20x) in some cases.

The case is as follows:

  • Query statement is very simple: "select column_1 from table_1"
  • table_1 has about 700 columns (Yes, it has so many columns, that's the problem).

While Calcite finished the optimization in about 7 msec, DataFusion's optimizer took about 120 msec.
At first, the number was worse, but it settled to about 120 msec when I set the global allocator to mimalloc. (I've tried snmalloc and it was somewhat faster - about 100 msec. But somehow snmalloc did not play well with valgrind, I chose mimalloc at least temporarily)

I ran the test program with valgrind / callgrind and drew the call graph. The graph showed that about half of the execution time is being spent on <alloc::string::String as core::clone::Clone>::clone. The call count was 3,930,814.

I ran the optimizer for another table with fewer columns (about 200 columns), and it took much less time - about 12msec.

So, I suspect that the optimizer becomes slow (at least for a table with many columns) because it clones the strings related to the schema of the table too many times.

Describe the solution you'd like

Perhaps removing unnecessary cloning may help. Or, make the fields immutable and manage them with reference counted smart pointers.

Describe alternatives you've considered

No alternatives.

Additional context

The following attachment is the call graph in SVG format. It was created by gprof2dot.py and dot with callgrind's output data. 'batch_test' is the name of my test program. Somewhat contrary to the name, The program only tests one query statement.

out_mimalloc_simple_query

@zeodtr zeodtr added the enhancement New feature or request label Feb 2, 2023
@tustvold
Copy link
Contributor

tustvold commented Feb 2, 2023

Possibly related - #4680

@alamb
Copy link
Contributor

alamb commented Feb 2, 2023

I looked at the trace and here are my observations:

As @tustvold has said, if we can have DFSchema / DFField that don't copy the values #4680 around that would help immensly

A large amount of the allocations come from DFSchema::merge -- see https://github.com/apache/arrow-datafusion/blob/224c682101949da57aebc36e92e5a881ef3040d4/datafusion/common/src/dfschema.rs#L135-L151

Screenshot 2023-02-02 at 3 26 45 PM

And a large part of that is how it ignores errors with .ok() where were quite expensive to produce

It also appears there is copying going on in unwrap_cast_in_comparison and common subexpr eliminiate

@alamb
Copy link
Contributor

alamb commented Feb 2, 2023

@zeodtr Would you be willing to contribute your benchmark program -- we have https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/benches/sql_planner.rs but maybe we could have a more end to end plan creation test 🤔

@ygf11
Copy link
Contributor

ygf11 commented Feb 3, 2023

And a large part of that is how it ignores errors with .ok() where were quite expensive to produce

The result type of field_with_name is Result<&DFField>, it is not clear enough I think.

Giving the name, there are three results:

  1. No such field, return FieldNotFound error.
  2. Find only one field, return this field.
  3. Find one more field, return Ambiguous reference error.

The first and third both will return error, in most times we should distinguish them, but it is not easy. I am not sure if we should distinguish them.

Maybe we can change it to Result<Option<&Field>>.

@zeodtr
Copy link
Author

zeodtr commented Feb 3, 2023

@alamb
I've simplified my benchmark program as follows. (I would like to upload the file as an attachment, but it's impossible in my current environment) It is a single file, named main.rs. It is based on datafusion/sql/examples/sql.rs except for the optimizer part. The optimizer part is based on the documentation in https://crates.io/crates/datafusion-optimizer.

use std::{collections::HashMap, sync::Arc};

use datafusion::{
    arrow::datatypes::{DataType, Field, Schema},
    common::Result,
    config::ConfigOptions,
    error::DataFusionError,
    logical_expr::{
        logical_plan::builder::LogicalTableSource, AggregateUDF, LogicalPlan, ScalarUDF,
        TableSource,
    },
    optimizer::{optimizer::Optimizer, OptimizerContext, OptimizerRule},
    sql::{
        planner::{ContextProvider, SqlToRel},
        sqlparser::{dialect::GenericDialect, parser::Parser},
        TableReference,
    },
};

#[global_allocator]
static GLOBAL: mimalloc::MiMalloc = mimalloc::MiMalloc;

fn main() {
    let sql = "select column1 from table1";
    let schema_provider = TestSchemaProvider::new();

    let now = std::time::Instant::now();

    let dialect = GenericDialect {};
    let ast = Parser::parse_sql(&dialect, sql).unwrap();
    let statement = &ast[0];
    let sql_to_rel = SqlToRel::new(&schema_provider);
    let plan = sql_to_rel.sql_statement_to_plan(statement.clone()).unwrap();

    println!(
        "elapsed time after creating a logical plan: {}",
        now.elapsed().as_millis()
    );

    let config = OptimizerContext::default();
    let optimizer = Optimizer::new();
    let optimized_plan = optimizer.optimize(&plan, &config, observe).unwrap();

    println!(
        "elapsed time after optimization: {}\n",
        now.elapsed().as_millis()
    );

    println!("plan:\n{:?}\n", plan);
    println!("optimized plan:\n{:?}", optimized_plan);
}

fn observe(_plan: &LogicalPlan, _rule: &dyn OptimizerRule) {}

struct TestSchemaProvider {
    options: ConfigOptions,
    tables: HashMap<String, Arc<dyn TableSource>>,
}

impl TestSchemaProvider {
    pub fn new() -> Self {
        let mut tables = HashMap::new();
        tables.insert(
            "table1".to_string(),
            create_table_source({
                let mut fields = Vec::new();
                for num in 0..700 {
                    fields.push(Field::new(
                        format!("column{}", num + 1),
                        DataType::Int32,
                        false,
                    ))
                }
                fields
            }),
        );

        Self {
            options: Default::default(),
            tables,
        }
    }
}

fn create_table_source(fields: Vec<Field>) -> Arc<dyn TableSource> {
    Arc::new(LogicalTableSource::new(Arc::new(
        Schema::new_with_metadata(fields, HashMap::new()),
    )))
}

impl ContextProvider for TestSchemaProvider {
    fn get_table_provider(&self, name: TableReference) -> Result<Arc<dyn TableSource>> {
        match self.tables.get(name.table()) {
            Some(table) => Ok(table.clone()),
            _ => Err(DataFusionError::Plan(format!(
                "Table not found: {}",
                name.table()
            ))),
        }
    }

    fn get_function_meta(&self, _name: &str) -> Option<Arc<ScalarUDF>> {
        None
    }

    fn get_aggregate_meta(&self, _name: &str) -> Option<Arc<AggregateUDF>> {
        None
    }

    fn get_variable_type(&self, _variable_names: &[String]) -> Option<DataType> {
        None
    }

    fn options(&self) -> &ConfigOptions {
        &self.options
    }
}

And the content of Cargo.toml is as follows:

[package]
name = "simple_optimizer_test"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
datafusion = "17.0.0"
mimalloc = "0.1.34"

[profile.release]
debug = 2

BTW, It seems that the performance of the optimizer worsens more than linearly as the number of the columns increases.

For example, when the number of the column is 1000, the total elapsed time is about 200 msec (30 for creating the logical plan, 170 for the optimizer), but when the number of the column is 2000, the total elapsed time is about 800 msec (100 for creating the logical plan, 700 for the optimizer).

@alamb
Copy link
Contributor

alamb commented Feb 8, 2023

Thank you @zeodtr -- I intend to port this into the datafusion benchmark suite eventually.

@alamb alamb changed the title Optimizer: Avoid too many string cloning in the optimizer Optimizer is slow: Avoid too many string cloning in the optimizer Feb 11, 2023
@alamb
Copy link
Contributor

alamb commented Feb 11, 2023

Proposed adding in #5256

@alamb
Copy link
Contributor

alamb commented Feb 16, 2023

Here is one possible task that would help: #5309

@alamb
Copy link
Contributor

alamb commented Mar 18, 2023

Filed #5637 to track making the optimizer faster in general

@mslapek
Copy link
Contributor

mslapek commented Mar 19, 2023

TBH String interning would be the best ⭐️ - no cloning, quick Eq/Hash...

Instead of String we could use StringId:

#[derive(Hash, PartialEq, Eq, Copy, Clone)]
struct StringId(usize);

And some HashMap<String, StringId> to assign strings to unique IDs.

@alamb
Copy link
Contributor

alamb commented Mar 20, 2023

TBH String interning would be the best ⭐️ - no cloning, quick Eq/Hash...

Agreed though then we have to thread the HashMap through

Another possibility would be to use Arc<str> (aka refcounted strings) -- not quite as cheap to clone / create but also don't need any external context

@mslapek
Copy link
Contributor

mslapek commented Mar 20, 2023

Another possibility would be to use Arc<str> (aka refcounted strings) -- not quite as cheap to clone / create but also don't need any external context

Looks like a good tradeoff.

What do you think about 🎯 precomputed hash + Arc<str>?

Performance almost-like string interning.

@alamb
Copy link
Contributor

alamb commented Mar 20, 2023

What do you think about 🎯 precomputed hash + Arc?

Seems reasonable to me. Maybe as some intermediate state we could have a StringInterner structure that computed the hash + Arc<str> and tried to reuse any previously known about.

Then we could thread through a StringInterner when possible (e.g. on the optimizer or the sql planner) but also be able to make these strings easily without one (so we could incrementally update the code)

Any such solution, I think, should strive very hard to keep the burden of working in the DataFusion codebase low (e.g. should be easy to work with for anyone used to String, be well documented, etc)

@tustvold
Copy link
Contributor

tustvold commented Mar 26, 2023

apache/arrow-rs#3955 may also be relevant here, I'm optimistic that we can drastically reduce the overheads without needing to reach for exotic solutions like string interning, at least initially

@comphead
Copy link
Contributor

@alamb I'm wondering if #9020 closes this one

@alamb
Copy link
Contributor

alamb commented Jan 31, 2024

@alamb I'm wondering if #9020 closes this one

It is a good call @comphead -- It certainly helps 😅.

Speeds up benchmarks in sql_planner.rs by 18-75%.

(BTW these are the benchmarks that @zeodtr contributed ❤️ )

I am also quite excited about @matthewmturner 's work in #8905

@zeodtr / @mslapek I wonder if you have an opinion about this. Shall we keep it open? Shall we close it in favor of tickets that describe specific improvements? Do you have a chance to test with more recent versions of DataFusion?

@zeodtr
Copy link
Author

zeodtr commented Jan 31, 2024

@alarmb Since there is an epic issue(#5637), this issue can be closed in favor of more specific issues.
Thanks for your great work.
By the way, at the moment, I'm generally satisfied with the performance of logical planning and optimization routines after applying some optimizations here and there, as I explained in #7698 (comment). It would be nice if these optimizations could be incorporated into the official source code, perhaps with some adjustments if needed.

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

as I explained in #7698 (comment). I

I forgot out great that description was. FYI @matthewmturner for your inspiration

@alamb
Copy link
Contributor

alamb commented Jun 23, 2024

@alarmb Since there is an epic issue(#5637), this issue can be closed in favor of more specific issues.

Thanks again @zeodtr

BTW depending on the usecase we have made datafusion planning about 2x faster between 37.0.0 and 40.0.0 (will be released shortly). There were more pronounced gains for schemas with larger numbers of columns

@alamb alamb closed this as completed Jun 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants