Use btree to search fields in DFSchema #7870

Closed
wants to merge 1 commit into from

oleggator

Which issue does this PR close?

Part of #7698.

Rationale for this change

The current DFSchema implementation stores its fields in a vector, so looking up a column by name requires a linear scan over the fields.

What changes are included in this PR?

Use BTreeMap to index field qualifiers.

Are these changes tested?

Are there any user-facing changes?

No

@github-actions github-actions bot added the optimizer (Optimizer rules) and core (Core DataFusion crate) labels Oct 19, 2023
@oleggator oleggator force-pushed the dfschema-optimization-main branch from 1966c0c to 95fe304 Compare October 19, 2023 14:55
@crepererum
Contributor

Is there a reason to use a b-tree ( $\mathrm{O}(\log{n})$ ) vs a hash map ( $\mathrm{O}(1)$ )?

@@ -102,8 +217,12 @@ impl DFSchema {
));
}
}

let fields_index = build_index(&fields);
Contributor

If the index is built for all DFSchema that are created, I wonder if that will be too much overhead. Maybe we could consider creating it on first use 🤔 or finding some way to canonicalize / cache the map

Contributor

This might indeed be a good case for interior mutability, i.e. use an RwLock and initialize the lookup table on first use.

Contributor

Thinking about this more: instead of RwLock, this can be solved even more elegantly w/ OnceLock::get_or_init (this is usually used for static variables, but you can totally use that for struct members as well).
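
A minimal sketch of that idea, assuming a simplified stand-in struct: `LazySchema`, its fields, and its methods are hypothetical and are not DataFusion's API. The index lives in a `OnceLock` member, so construction does no indexing work and the map is built once, on the first lookup.

```rust
use std::collections::BTreeMap;
use std::sync::OnceLock;

/// Simplified stand-in for a schema (hypothetical, not the real `DFSchema`).
struct LazySchema {
    field_names: Vec<String>,
    /// Built on the first lookup, then reused by every later lookup.
    index: OnceLock<BTreeMap<String, Vec<usize>>>,
}

impl LazySchema {
    /// Construction only stores the names; no index is built here.
    fn new(field_names: Vec<String>) -> Self {
        Self { field_names, index: OnceLock::new() }
    }

    /// Returns the index, building it if this is the first call.
    fn index(&self) -> &BTreeMap<String, Vec<usize>> {
        self.index.get_or_init(|| {
            let mut map: BTreeMap<String, Vec<usize>> = BTreeMap::new();
            for (i, name) in self.field_names.iter().enumerate() {
                map.entry(name.clone()).or_default().push(i);
            }
            map
        })
    }

    /// Positions of all fields with the given name (empty if none).
    fn positions(&self, name: &str) -> &[usize] {
        self.index().get(name).map(Vec::as_slice).unwrap_or(&[])
    }

    /// True once the index has been built.
    fn is_indexed(&self) -> bool {
        self.index.get().is_some()
    }
}
```

Schemas that are created but never searched by name would pay nothing for the index under this scheme; whether that covers enough schemas in a real query plan is not something this sketch can show.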

@alamb
Contributor

alamb commented Oct 25, 2023

I plan to review this and related PRs tomorrow morning

@alamb
Contributor

alamb commented Oct 26, 2023

Related comment: #7698 (comment)

@oleggator
Author

Is there a reason to use a b-tree (O(log n)) vs a hash map (O(1))?

With a b-tree we can query all fields matching a "prefix" in one O(log n) lookup (column.*.*.*, column.table.*.*, column.table.schema.*, column.table.schema.catalog).
It is used in the fields_with_unqualified_name method to find all fields with a specific name.
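
A minimal sketch of such a prefix query, assuming a hypothetical key layout of (field name, table, schema, catalog). The `Key` alias and both functions are illustrative and are not the PR's `OwnedFieldReference`. Because the map is sorted by name first, every entry sharing a name is contiguous, so one `BTreeMap::range` call positions the cursor and the scan stops at the first different name.

```rust
use std::collections::BTreeMap;

/// Hypothetical index key: (field name, table, schema, catalog).
/// `None` sorts before any `Some(_)`, so the unqualified entry comes first.
type Key = (String, Option<String>, Option<String>, Option<String>);

/// Build an index from keys to their positions in the field list.
fn build_index(keys: &[Key]) -> BTreeMap<Key, Vec<usize>> {
    let mut index: BTreeMap<Key, Vec<usize>> = BTreeMap::new();
    for (i, key) in keys.iter().enumerate() {
        index.entry(key.clone()).or_default().push(i);
    }
    index
}

/// Positions of every field called `name`, whatever its qualifier.
/// Results come back in key order, not in field order.
fn positions_with_name(index: &BTreeMap<Key, Vec<usize>>, name: &str) -> Vec<usize> {
    // Smallest possible key with this name: all qualifiers are `None`.
    let start: Key = (name.to_string(), None, None, None);
    index
        .range(start..)
        // Keys are sorted by name first, so matches are contiguous.
        .take_while(|(key, _)| key.0 == name)
        .flat_map(|(_, positions)| positions.iter().copied())
        .collect()
}
```

The cost is one O(log n) descent plus one step per matching entry. A hash map keyed on the full tuple could not answer this query without scanning every key.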

Comment on lines +63 to +67
match self.field().cmp(other.field()) {
Ordering::Less => return Ordering::Less,
Ordering::Greater => return Ordering::Greater,
Ordering::Equal => {}
}
Member

Suggested change
- match self.field().cmp(other.field()) {
-     Ordering::Less => return Ordering::Less,
-     Ordering::Greater => return Ordering::Greater,
-     Ordering::Equal => {}
- }
+ let field_cmp = self.field().cmp(other.field());
+ if field_cmp != Ordering::Equal {
+     return field_cmp;
+ }

Comment on lines +70 to +74
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
Ordering::Less => return Ordering::Less,
Ordering::Greater => return Ordering::Greater,
Ordering::Equal => {}
},
Member

Suggested change
- (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
-     Ordering::Less => return Ordering::Less,
-     Ordering::Greater => return Ordering::Greater,
-     Ordering::Equal => {}
- },
+ (Some(lhs), Some(rhs)) => {
+     let cmp = lhs.cmp(rhs);
+     if cmp != Ordering::Equal {
+         return cmp;
+     }
+ }

Comment on lines +81 to +85
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
Ordering::Less => return Ordering::Less,
Ordering::Greater => return Ordering::Greater,
Ordering::Equal => {}
},
Member

Suggested change
- (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
-     Ordering::Less => return Ordering::Less,
-     Ordering::Greater => return Ordering::Greater,
-     Ordering::Equal => {}
- },
+ (Some(lhs), Some(rhs)) => {
+     let cmp = lhs.cmp(rhs);
+     if cmp != Ordering::Equal {
+         return cmp;
+     }
+ }

Comment on lines +92 to +96
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
Ordering::Less => return Ordering::Less,
Ordering::Greater => return Ordering::Greater,
Ordering::Equal => {}
},
Member

Suggested change
- (Some(lhs), Some(rhs)) => match lhs.cmp(rhs) {
-     Ordering::Less => return Ordering::Less,
-     Ordering::Greater => return Ordering::Greater,
-     Ordering::Equal => {}
- },
+ (Some(lhs), Some(rhs)) => {
+     let cmp = lhs.cmp(rhs);
+     if cmp != Ordering::Equal {
+         return cmp;
+     }
+ }
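
As a possible alternative to the explicit early returns in these suggestions, the standard library's `Ordering::then_with` chains comparisons and only evaluates the next one when everything before it compared equal. The sketch below uses a hypothetical, simplified `FieldRef` type. It also assumes that `Option`'s built-in ordering (`None` before `Some`) is acceptable for missing qualifiers, which may not match how the PR handles them.

```rust
use std::cmp::Ordering;

/// Hypothetical, simplified reference type; the PR's type may differ.
#[derive(Debug, PartialEq, Eq)]
struct FieldRef {
    field: String,
    table: Option<String>,
    schema: Option<String>,
}

impl Ord for FieldRef {
    fn cmp(&self, other: &Self) -> Ordering {
        self.field
            .cmp(&other.field)
            // Each closure runs only if everything before it compared equal.
            .then_with(|| self.table.cmp(&other.table))
            .then_with(|| self.schema.cmp(&other.schema))
    }
}

impl PartialOrd for FieldRef {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        // Delegate to `cmp` so the two orderings can never disagree.
        Some(self.cmp(other))
    }
}
```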

/// DFSchema wraps an Arrow schema and adds relation names
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DFSchema {
    /// Fields
    fields: Vec<DFField>,
    /// Fields index
    fields_index: BTreeMap<OwnedFieldReference, Vec<usize>>,
Member

I think we use BTree here because it ensures the order of the index 🤔.

Contributor

Why do you care about the index order? You either iterate over the fields in order (use self.fields.iter()) or you look up a field by name (use self.fields_index.get(...)). The index is ordered by field name, so this argument would only be valid if we OFTEN iterated over the fields in name order, which I don't think we do.

Member

Fair enough
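
The hash map alternative raised in this thread can be sketched as follows. `build_name_index` is a hypothetical helper, not part of DataFusion. It keys the map on the unqualified field name only, so a single average O(1) lookup returns every position that shares the name, while iteration in field order still goes through the original vector.

```rust
use std::collections::HashMap;

/// Hypothetical name-only index: unqualified field name -> positions.
fn build_name_index(field_names: &[&str]) -> HashMap<String, Vec<usize>> {
    let mut index: HashMap<String, Vec<usize>> = HashMap::new();
    for (i, name) in field_names.iter().copied().enumerate() {
        // Fields with the same name share one entry, in field order.
        index.entry(name.to_string()).or_default().push(i);
    }
    index
}
```

This sketch does not resolve qualifiers; a caller would still have to filter the returned positions by table, schema, or catalog.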

@crepererum
Contributor

Is there a reason to use a b-tree (O(log n)) vs a hash map (O(1))?

With a b-tree we can query all fields matching a "prefix" in one O(log n) lookup (column.*.*.*, column.table.*.*, column.table.schema.*, column.table.schema.catalog). It is used in the fields_with_unqualified_name method to find all fields with a specific name.

Is that such a common operation that it is worth keeping an expensive index on every single schema in the query graph? I think the planner that resolves these names can easily order the fields and build this index locally.

@oleggator
Author

oleggator commented Oct 27, 2023

Made a benchmark.

Baseline - DataFusion 32 (a0c5aff)

index_of_column_by_name 10
                        time:   [11.323 ns 11.325 ns 11.328 ns]
                        change: [-0.0714% +0.3045% +0.6180%] (p = 0.09 > 0.05)
                        No change in performance detected.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

index_of_column_by_name 20
                        time:   [4.1947 ns 4.1963 ns 4.1981 ns]
                        change: [-2.1038% -1.5880% -1.2714%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

index_of_column_by_name 50
                        time:   [34.841 ns 34.851 ns 34.871 ns]
                        change: [-0.2590% -0.1783% -0.0774%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) low severe
  4 (4.00%) low mild
  5 (5.00%) high mild
  3 (3.00%) high severe

index_of_column_by_name 100
                        time:   [88.736 ns 88.927 ns 89.119 ns]
                        change: [+4.6597% +5.0086% +5.3786%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild

index_of_column_by_name 500
                        time:   [403.20 ns 403.70 ns 404.29 ns]
                        change: [+1.5771% +1.6483% +1.7326%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  3 (3.00%) low mild
  4 (4.00%) high severe

index_of_column_by_name 1000
                        time:   [909.73 ns 910.11 ns 910.48 ns]
                        change: [-2.0626% -1.6648% -1.3588%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

DFSchema::new 10        time:   [328.91 ns 329.14 ns 329.38 ns]
                        change: [-0.8652% -0.8013% -0.7418%] (p = 0.00 < 0.05)
                        Change within noise threshold.

DFSchema::new 20        time:   [725.37 ns 725.93 ns 726.56 ns]
                        change: [+0.4542% +0.5177% +0.5841%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe

DFSchema::new 50        time:   [1.6864 µs 1.6892 µs 1.6924 µs]
                        change: [+1.3382% +1.4765% +1.6362%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

DFSchema::new 100       time:   [3.4953 µs 3.4965 µs 3.4978 µs]
                        change: [-3.4655% -3.2889% -3.1317%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low severe
  1 (1.00%) high mild
  2 (2.00%) high severe

DFSchema::new 500       time:   [23.470 µs 23.477 µs 23.485 µs]
                        change: [-1.8427% -1.7821% -1.7253%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

DFSchema::new 1000      time:   [45.504 µs 45.515 µs 45.528 µs]
                        change: [-2.8088% -2.6555% -2.4933%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

cargo bench  172.06s user 0.50s system 153% cpu 1:52.07 total

This PR

index_of_column_by_name 10
                        time:   [33.607 ns 33.663 ns 33.717 ns]
                        change: [+196.44% +196.92% +197.41%] (p = 0.00 < 0.05)
                        Performance has regressed.

index_of_column_by_name 20
                        time:   [21.509 ns 21.522 ns 21.535 ns]
                        change: [+412.46% +412.90% +413.42%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

index_of_column_by_name 50
                        time:   [43.590 ns 43.651 ns 43.713 ns]
                        change: [+24.956% +25.143% +25.325%] (p = 0.00 < 0.05)
                        Performance has regressed.

index_of_column_by_name 100
                        time:   [68.349 ns 68.373 ns 68.401 ns]
                        change: [-23.444% -23.221% -22.998%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  2 (2.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe

index_of_column_by_name 500
                        time:   [65.428 ns 65.444 ns 65.461 ns]
                        change: [-83.785% -83.768% -83.752%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) low severe
  1 (1.00%) low mild
  4 (4.00%) high mild
  3 (3.00%) high severe

index_of_column_by_name 1000
                        time:   [74.167 ns 74.174 ns 74.183 ns]
                        change: [-91.855% -91.850% -91.844%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  1 (1.00%) low mild
  3 (3.00%) high mild
  3 (3.00%) high severe

DFSchema::new 10        time:   [956.63 ns 957.20 ns 957.81 ns]
                        change: [+190.77% +191.00% +191.28%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

DFSchema::new 20        time:   [2.4375 µs 2.4384 µs 2.4393 µs]
                        change: [+235.82% +236.06% +236.36%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  1 (1.00%) high mild
  2 (2.00%) high severe

DFSchema::new 50        time:   [6.5247 µs 6.5275 µs 6.5303 µs]
                        change: [+287.52% +288.07% +288.63%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

DFSchema::new 100       time:   [15.298 µs 15.330 µs 15.368 µs]
                        change: [+337.14% +340.86% +347.06%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) low mild
  6 (6.00%) high mild
  5 (5.00%) high severe

DFSchema::new 500       time:   [92.211 µs 92.284 µs 92.361 µs]
                        change: [+292.82% +293.14% +293.47%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) low mild

DFSchema::new 1000      time:   [204.70 µs 204.87 µs 205.05 µs]
                        change: [+349.22% +349.78% +350.32%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

cargo bench  252.05s user 1.60s system 150% cpu 2:48.82 total

@karlovnv

Made a benchmark.

Could you please add a summary?

It seems that the b-tree provides an advantage with 100+ columns.
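
One way to check that reading is to divide the median times from the benchmark output above. In the sketch below the constants are the reported medians, with microseconds converted to nanoseconds; `slowdown` is a hypothetical helper. With these numbers the ratio for `index_of_column_by_name` is above 1.0 (the PR is slower) at 10, 20 and 50 columns and below 1.0 from 100 columns upward. For `DFSchema::new` the PR is roughly 3 to 4.5 times slower at every size.

```rust
/// Median times copied from the benchmark output above, in nanoseconds:
/// (number of columns, baseline, this PR).
const LOOKUP_NS: [(u32, f64, f64); 6] = [
    (10, 11.325, 33.663),
    (20, 4.1963, 21.522),
    (50, 34.851, 43.651),
    (100, 88.927, 68.373),
    (500, 403.70, 65.444),
    (1000, 910.11, 74.174),
];

/// `DFSchema::new` medians, with microseconds converted to nanoseconds.
const NEW_NS: [(u32, f64, f64); 6] = [
    (10, 329.14, 957.20),
    (20, 725.93, 2438.4),
    (50, 1689.2, 6527.5),
    (100, 3496.5, 15330.0),
    (500, 23477.0, 92284.0),
    (1000, 45515.0, 204870.0),
];

/// Ratio of PR time to baseline time; values above 1.0 mean the PR is slower.
fn slowdown(rows: &[(u32, f64, f64)]) -> Vec<(u32, f64)> {
    rows.iter().map(|&(n, base, pr)| (n, pr / base)).collect()
}
```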

@alamb
Contributor

alamb commented Oct 31, 2023

Thank you -- I plan to review this more carefully tomorrow

@karlovnv

karlovnv commented Nov 3, 2023

Thank you -- I plan to review this more carefully tomorrow

@alamb I think it would be a good idea to introduce a user-defined cache provider for both DFSchema and the Arrow Schema. It would let us benefit from the b-tree while avoiding building it when it is not necessary.
My assumption is that the user knows when a schema becomes invalid and can manage its invalidation in the cache.

Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the Stale (PR has not had any activity for some time) label Apr 25, 2024
@github-actions github-actions bot closed this May 3, 2024
Labels
core (Core DataFusion crate), optimizer (Optimizer rules), Stale (PR has not had any activity for some time)
5 participants