Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-11561: [Rust][DataFusion] Add Send + Sync to MemTable::load #9448

Closed
wants to merge 1 commit into from

Conversation

seddonm1
Copy link
Contributor

@seddonm1 seddonm1 commented Feb 8, 2021

This PR adds Send + Sync to the MemTable::load method to allow implementation of a persist method like Spark's Dataframe in an async function.

@github-actions
Copy link

github-actions bot commented Feb 8, 2021

@codecov-io
Copy link

Codecov Report

Merging #9448 (40e3b42) into master (a321cde) will decrease coverage by 0.05%.
The diff coverage is 66.52%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9448      +/-   ##
==========================================
- Coverage   82.14%   82.09%   -0.06%     
==========================================
  Files         232      233       +1     
  Lines       54150    54371     +221     
==========================================
+ Hits        44484    44637     +153     
- Misses       9666     9734      +68     
Impacted Files Coverage Δ
rust/arrow/src/datatypes.rs 78.33% <ø> (ø)
rust/benchmarks/src/bin/tpch.rs 12.15% <0.00%> (+0.06%) ⬆️
rust/datafusion/src/datasource/memory.rs 80.00% <ø> (ø)
...tafusion/src/physical_plan/distinct_expressions.rs 90.29% <0.00%> (-0.89%) ⬇️
...atafusion/src/physical_plan/expressions/average.rs 82.30% <0.00%> (-1.49%) ⬇️
.../datafusion/src/physical_plan/expressions/count.rs 85.86% <0.00%> (-1.91%) ⬇️
...atafusion/src/physical_plan/expressions/min_max.rs 91.54% <0.00%> (-1.76%) ⬇️
...st/datafusion/src/physical_plan/expressions/sum.rs 77.27% <0.00%> (-1.02%) ⬇️
rust/datafusion/src/physical_plan/mod.rs 86.00% <ø> (ø)
rust/datafusion/src/physical_plan/udaf.rs 75.00% <0.00%> (-3.95%) ⬇️
... and 13 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1c219e3...40e3b42. Read the comment docs.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a good idea, though it might cause a small breaking API change.

@@ -107,7 +107,7 @@ impl MemTable {

/// Create a mem table by reading from another data source
pub async fn load(
t: &dyn TableProvider,
t: Box<dyn TableProvider + Send + Sync>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
t: Box<dyn TableProvider + Send + Sync>,
t: Arc<dyn TableProvider + Send + Sync>,

I suggest using Arc to follow the convention in ExecutionContext:
https://github.com/apache/arrow/blob/master/rust/datafusion/src/execution/context.rs#L565

I don't think there is any reason load needs exclusive ownership over the TableProvider

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb .

You make a good point and it is confusing to me why Box is used in some places and Arc in others (given the I thought ideal characteristics of Arc): e.g.

/// Execution context for registering data sources and executing queries
#[derive(Clone)]
pub struct ExecutionContextState {
    /// Data sources that are registered with the context
    pub datasources: HashMap<String, Arc<dyn TableProvider + Send + Sync>>,

vs

    pub fn register_table(
        &mut self,
        name: &str,
        provider: Box<dyn TableProvider + Send + Sync>,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can take a shot at cleaning up all the APIs to take an Arc over the next few days. I think that would be the ideal outcome.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great. I did replace everything with Arc locally for testing and tests pass at least.

@alamb alamb closed this in 7c5cf92 Feb 11, 2021
nevi-me pushed a commit to nevi-me/arrow that referenced this pull request Feb 13, 2021
This PR adds Send + Sync to the MemTable::load method to allow implementation of a `persist` method like Spark's Dataframe in an async function.

Closes apache#9448 from seddonm1/send-sync

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Feb 17, 2021
This PR adds Send + Sync to the MemTable::load method to allow implementation of a `persist` method like Spark's Dataframe in an async function.

Closes apache#9448 from seddonm1/send-sync

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
jorgecarleitao pushed a commit that referenced this pull request Feb 17, 2021
…r> rather than Box and Arc

NOTE: This is a backwards incompatible change in DataFusion

Inspired by a conversation with @seddonm1  #9448 (comment) and #9445 as well as some upcoming needs in IOx (a consumer of DataFusion)

# Rationale:
* No `TableProvider` APIs actually require ownership of the `TableProvider` (they all take `&self`)
* Internally DataFusion was storing the TableProvider as an Arc already and inconsistently uses `Box`d and `Arc`d table providers  (e.g. in [`LogicalPlan::TableScan`](https://github.com/apache/arrow/blob/437c9173c3e067712eb714c643ca839acc7ed7f6/rust/datafusion/src/logical_plan/plan.rs#L125))
* This change allows the same `TableProvider` instance to be reused easily for different `ExecutionContext`s

# Changes
* Change all uses of `TableProvider` to be wrapped in `Arc` rather than `Box`

Closes #9487 from alamb/alamb/arcd_table_provider

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
@seddonm1 seddonm1 deleted the send-sync branch February 18, 2021 07:12
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
…r> rather than Box and Arc

NOTE: This is a backwards incompatible change in DataFusion

Inspired by a conversation with @seddonm1  apache/arrow#9448 (comment) and apache/arrow#9445 as well as some upcoming needs in IOx (a consumer of DataFusion)

# Rationale:
* No `TableProvider` APIs actually require ownership of the `TableProvider` (they all take `&self`)
* Internally DataFusion was storing the TableProvider as an Arc already and inconsistently uses `Box`d and `Arc`d table providers  (e.g. in [`LogicalPlan::TableScan`](https://github.com/apache/arrow/blob/f055d5e8ee8c6065f38d8351e3f668a43358cd98/rust/datafusion/src/logical_plan/plan.rs#L125))
* This change allows the same `TableProvider` instance to be reused easily for different `ExecutionContext`s

# Changes
* Change all uses of `TableProvider` to be wrapped in `Arc` rather than `Box`

Closes #9487 from alamb/alamb/arcd_table_provider

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This PR adds Send + Sync to the MemTable::load method to allow implementation of a `persist` method like Spark's Dataframe in an async function.

Closes apache#9448 from seddonm1/send-sync

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…r> rather than Box and Arc

NOTE: This is a backwards incompatible change in DataFusion

Inspired by a conversation with @seddonm1  apache#9448 (comment) and apache#9445 as well as some upcoming needs in IOx (a consumer of DataFusion)

# Rationale:
* No `TableProvider` APIs actually require ownership of the `TableProvider` (they all take `&self`)
* Internally DataFusion was storing the TableProvider as an Arc already and inconsistently uses `Box`d and `Arc`d table providers  (e.g. in [`LogicalPlan::TableScan`](https://github.com/apache/arrow/blob/437c9173c3e067712eb714c643ca839acc7ed7f6/rust/datafusion/src/logical_plan/plan.rs#L125))
* This change allows the same `TableProvider` instance to be reused easily for different `ExecutionContext`s

# Changes
* Change all uses of `TableProvider` to be wrapped in `Arc` rather than `Box`

Closes apache#9487 from alamb/alamb/arcd_table_provider

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
This PR adds Send + Sync to the MemTable::load method to allow implementation of a `persist` method like Spark's Dataframe in an async function.

Closes apache#9448 from seddonm1/send-sync

Authored-by: Mike Seddon <seddonm1@gmail.com>
Signed-off-by: Andrew Lamb <andrew@nerdnetworks.org>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
…r> rather than Box and Arc

NOTE: This is a backwards incompatible change in DataFusion

Inspired by a conversation with @seddonm1  apache#9448 (comment) and apache#9445 as well as some upcoming needs in IOx (a consumer of DataFusion)

# Rationale:
* No `TableProvider` APIs actually require ownership of the `TableProvider` (they all take `&self`)
* Internally DataFusion was storing the TableProvider as an Arc already and inconsistently uses `Box`d and `Arc`d table providers  (e.g. in [`LogicalPlan::TableScan`](https://github.com/apache/arrow/blob/437c9173c3e067712eb714c643ca839acc7ed7f6/rust/datafusion/src/logical_plan/plan.rs#L125))
* This change allows the same `TableProvider` instance to be reused easily for different `ExecutionContext`s

# Changes
* Change all uses of `TableProvider` to be wrapped in `Arc` rather than `Box`

Closes apache#9487 from alamb/alamb/arcd_table_provider

Authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants