-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use btree to search fields in DFSchema #7870
Conversation
1966c0c
to
95fe304
Compare
Is there a reason to use a b-tree ( |
@@ -102,8 +217,12 @@ impl DFSchema { | |||
)); | |||
} | |||
} | |||
|
|||
let fields_index = build_index(&fields); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the index is built for all DFSchema that are created, I wonder if that will be too much overhead. Maybe we could consider creating it on first use 🤔 or finding some way to canonicalize / cache the map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this use case might indeed be a good call of interior mutability, i.e. use an RWLock and init the lookup table on the first use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this more: instead of RwLock
, this can be solved even more elegantly w/ OnceLock::get_or_init
(this is usually used for static variables, but you can totally use that for struct members as well).
I plan to review this and related PRs tomorrow morning |
Related comment: #7698 (comment) |
Using b-tree we can query all fields matching to a "prefix" in one O(logn) hop ( |
match self.field().cmp(other.field()) { | ||
Ordering::Less => return Ordering::Less, | ||
Ordering::Greater => return Ordering::Greater, | ||
Ordering::Equal => {} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
match self.field().cmp(other.field()) { | |
Ordering::Less => return Ordering::Less, | |
Ordering::Greater => return Ordering::Greater, | |
Ordering::Equal => {} | |
} | |
let field_cmp = self.field().cmp(other.field()); | |
if field_cmp != Ordering::Equal { | |
return field_cmp; | |
} |
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) { | ||
Ordering::Less => return Ordering::Less, | ||
Ordering::Greater => return Ordering::Greater, | ||
Ordering::Equal => {} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) { | |
Ordering::Less => return Ordering::Less, | |
Ordering::Greater => return Ordering::Greater, | |
Ordering::Equal => {} | |
}, | |
(Some(lhs), Some(rhs)) => { | |
let cmp = lhs.cmp(rhs); | |
if cmp != Ordering::Equal { | |
return cmp; | |
} | |
} |
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) { | ||
Ordering::Less => return Ordering::Less, | ||
Ordering::Greater => return Ordering::Greater, | ||
Ordering::Equal => {} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) { | |
Ordering::Less => return Ordering::Less, | |
Ordering::Greater => return Ordering::Greater, | |
Ordering::Equal => {} | |
}, | |
(Some(lhs), Some(rhs)) => { | |
let cmp = lhs.cmp(rhs); | |
if cmp != Ordering::Equal { | |
return cmp; | |
} | |
} |
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) { | ||
Ordering::Less => return Ordering::Less, | ||
Ordering::Greater => return Ordering::Greater, | ||
Ordering::Equal => {} | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Some(lhs), Some(rhs)) => match lhs.cmp(rhs) { | |
Ordering::Less => return Ordering::Less, | |
Ordering::Greater => return Ordering::Greater, | |
Ordering::Equal => {} | |
}, | |
(Some(lhs), Some(rhs)) => { | |
let cmp = lhs.cmp(rhs); | |
if cmp != Ordering::Equal { | |
return cmp; | |
} | |
} |
/// DFSchema wraps an Arrow schema and adds relation names | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct DFSchema { | ||
/// Fields | ||
fields: Vec<DFField>, | ||
/// Fields index | ||
fields_index: BTreeMap<OwnedFieldReference, Vec<usize>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we use BTree here because it ensures the order of the index 🤔.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do you care about the index order? You either iterate over the fields in order (use self.fields.iter()
) or you lookup a field by name (use self.field_index.get(...)
). The index is orderd by field name. So this argument would only be valid if we OFTEN iterate over the fields in name order, which I don't think we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough
Is that such a common operation that it is worth to keep an expensive index on every single schema in the query graph? I think the planner that resolves these names can easily order the fields and build this index locally. |
Made a benchmark. Baseline - Data Fusion 32 (a0c5aff)
This PR
|
Could you please add summary? It seems that btree provides an advantage with 100+ cols |
Thank you -- I plan to review this more carefully tomorrow |
@alamb I think it's a good idea to introduce user defined cacheprovider for both DFSchema and arrow Schema. It will allow to take benefits from btree and avoid building it when is not necessary. |
Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or this will be closed in 7 days. |
Which issue does this PR close?
Part of #7698.
Rationale for this change
Current DFSchema implementation uses vector to operate with fields. It makes search of a column by name algorithmically complex.
What changes are included in this PR?
Use BTreeMap to index field qualifiers.
Are these changes tested?
Are there any user-facing changes?
No