ARROW-11561: [Rust][DataFusion] Add Send + Sync to MemTable::load #9448
Codecov Report
@@ Coverage Diff @@
## master #9448 +/- ##
==========================================
- Coverage 82.14% 82.09% -0.06%
==========================================
Files 232 233 +1
Lines 54150 54371 +221
==========================================
+ Hits 44484 44637 +153
- Misses 9666 9734 +68
I think this is a good idea, though it might cause a small breaking API change.
@@ -107,7 +107,7 @@ impl MemTable {

    /// Create a mem table by reading from another data source
    pub async fn load(
-        t: &dyn TableProvider,
+        t: Box<dyn TableProvider + Send + Sync>,
-        t: Box<dyn TableProvider + Send + Sync>,
+        t: Arc<dyn TableProvider + Send + Sync>,
I suggest using `Arc` to follow the convention in ExecutionContext:
https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/context.rs#L565
I don't think there is any reason `load` needs exclusive ownership over the `TableProvider`.
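To make the ownership point concrete, here is a minimal sketch. The `TableProvider` trait below is a simplified stand-in, not DataFusion's real trait, and `InMemory`, `row_count`, and `load_twice` are hypothetical names invented for illustration. It shows that a function which only reads through `&self` can take an `Arc` and leave the caller with a usable handle:

```rust
use std::sync::Arc;

// Simplified stand-in for DataFusion's `TableProvider` (assumption: the real
// trait has a different, larger API). Only `&self` methods are needed here.
trait TableProvider {
    fn row_count(&self) -> usize;
}

// Hypothetical provider used only for this illustration.
struct InMemory {
    rows: Vec<i64>,
}

impl TableProvider for InMemory {
    fn row_count(&self) -> usize {
        self.rows.len()
    }
}

// A `load`-like function: it only calls `&self` methods, so shared ownership
// through `Arc` is sufficient and no exclusive `Box` is required.
fn load(t: Arc<dyn TableProvider + Send + Sync>) -> usize {
    t.row_count()
}

// The caller keeps its own handle and can pass clones as often as it likes.
fn load_twice() -> (usize, usize, usize) {
    let provider: Arc<dyn TableProvider + Send + Sync> =
        Arc::new(InMemory { rows: vec![1, 2, 3] });
    let first = load(Arc::clone(&provider));
    let second = load(Arc::clone(&provider));
    // Both clones were dropped when `load` returned, so one owner remains.
    (first, second, Arc::strong_count(&provider))
}
```

With a `Box` parameter the first call would consume the provider, and the second call would need a freshly constructed one.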
Thanks @alamb.
You make a good point, and it is confusing to me why `Box` is used in some places and `Arc` in others (given what I thought were the ideal characteristics of `Arc`), e.g.:
/// Execution context for registering data sources and executing queries
#[derive(Clone)]
pub struct ExecutionContextState {
/// Data sources that are registered with the context
pub datasources: HashMap<String, Arc<dyn TableProvider + Send + Sync>>,
vs
pub fn register_table(
&mut self,
name: &str,
provider: Box<dyn TableProvider + Send + Sync>,
I can take a shot at cleaning up all the APIs to take an Arc over the next few days. I think that would be the ideal outcome.
Great. I did replace everything with Arc locally for testing and tests pass at least.
This PR adds Send + Sync to the MemTable::load method to allow implementation of a `persist` method like Spark's Dataframe in an async function.

Closes apache#9448 from seddonm1/send-sync

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
…r> rather than Box and Arc

NOTE: This is a backwards incompatible change in DataFusion

Inspired by a conversation with @seddonm1 #9448 (comment) and #9445 as well as some upcoming needs in IOx (a consumer of DataFusion)

# Rationale:
* No `TableProvider` APIs actually require ownership of the `TableProvider` (they all take `&self`)
* Internally DataFusion was storing the TableProvider as an Arc already and inconsistently uses `Box`d and `Arc`d table providers (e.g. in [`LogicalPlan::TableScan`](https://github.com/apache/arrow/blob/437c9173c3e067712eb714c643ca839acc7ed7f6/rust/datafusion/src/logical_plan/plan.rs#L125))
* This change allows the same `TableProvider` instance to be reused easily for different `ExecutionContext`s

# Changes
* Change all uses of `TableProvider` to be wrapped in `Arc` rather than `Box`

Closes #9487 from alamb/alamb/arcd_table_provider

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
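The reuse point in the rationale can be sketched roughly as follows. This is not DataFusion's code: `Context` is a hypothetical, stripped-down stand-in for `ExecutionContext`, the `TableProvider` trait is reduced to a single made-up method, and `InMemory` and `share_between_contexts` exist only for illustration. The sketch registers one provider instance in two contexts by cloning the `Arc`:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for DataFusion's `TableProvider` (assumption).
trait TableProvider {
    fn row_count(&self) -> usize;
}

// Hypothetical provider used only for this illustration.
struct InMemory {
    rows: usize,
}

impl TableProvider for InMemory {
    fn row_count(&self) -> usize {
        self.rows
    }
}

// Hypothetical, stripped-down stand-in for `ExecutionContext`: it stores
// providers behind `Arc`, as the `datasources` map quoted in the thread does.
#[derive(Default)]
struct Context {
    datasources: HashMap<String, Arc<dyn TableProvider + Send + Sync>>,
}

impl Context {
    // Taking an `Arc` means registering does not consume the only handle.
    fn register_table(&mut self, name: &str, provider: Arc<dyn TableProvider + Send + Sync>) {
        self.datasources.insert(name.to_string(), provider);
    }

    fn row_count(&self, name: &str) -> Option<usize> {
        self.datasources.get(name).map(|p| p.row_count())
    }
}

// Register the same provider instance in two independent contexts.
fn share_between_contexts() -> (Option<usize>, Option<usize>, usize) {
    let provider: Arc<dyn TableProvider + Send + Sync> = Arc::new(InMemory { rows: 42 });
    let mut a = Context::default();
    let mut b = Context::default();
    a.register_table("t", Arc::clone(&provider));
    b.register_table("t", Arc::clone(&provider));
    // Three owners: the local handle plus one clone held by each context.
    (a.row_count("t"), b.row_count("t"), Arc::strong_count(&provider))
}
```

With a `Box` parameter, the second `register_table` call would need a separately constructed provider, since the first call would have taken ownership of the only one.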
This PR adds Send + Sync to the MemTable::load method to allow implementation of a `persist` method like Spark's Dataframe in an async function.
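The description does not spell out why the bounds matter in an async function. One plausible reading, stated here as an assumption, is that a future holding the provider across an `.await` is only `Send` if the provider is `Send + Sync`. The sketch below illustrates that rule with the standard library only. The `TableProvider` trait is a simplified stand-in, and `Dummy`, `yield_point`, `block_on`, and `load_on_other_thread` are hypothetical helpers, not DataFusion or executor APIs:

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

// Simplified stand-in for DataFusion's `TableProvider` (assumption).
trait TableProvider {
    fn name(&self) -> String;
}

// Hypothetical provider used only for this illustration.
struct Dummy;

impl TableProvider for Dummy {
    fn name(&self) -> String {
        "dummy".to_string()
    }
}

// Placeholder await point so that `t` is held across an `.await`.
async fn yield_point() {}

// A `load`-like async function. Because `t` lives across the await, the
// returned future is `Send` only if the trait object is `Send + Sync`.
async fn load(t: Arc<dyn TableProvider + Send + Sync>) -> String {
    yield_point().await;
    t.name()
}

// Waker that does nothing; enough for futures that never actually wait.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

// Minimal executor for this sketch: polls the future in a loop until ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

// `thread::spawn` requires its closure to be `Send`, so this compiles only
// because the future returned by `load` is `Send`.
fn load_on_other_thread() -> String {
    let fut = load(Arc::new(Dummy));
    thread::spawn(move || block_on(fut)).join().unwrap()
}
```

Removing `+ Send + Sync` from the parameter type of `load` makes `load_on_other_thread` fail to compile, because `Arc<dyn TableProvider>` is not `Send`.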