[SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan. #939

Closed
wants to merge 3 commits into from

liancheng
Contributor

In cases like Limit and TakeOrdered, executeCollect() makes optimizations that execute().collect() will not.
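To illustrate why a specialized `executeCollect()` can do less work than `execute().collect()`, here is a minimal toy sketch. It does not use Spark's actual classes: `Plan`, `Scan`, `Limit`, and the `rowsRead` counter are all hypothetical stand-ins, with a sequence of iterators playing the role of RDD partitions.

```scala
// Toy model only: these are NOT Spark's classes, just stand-ins for illustration.
object ExecuteCollectSketch {
  type Row = Seq[Any]

  trait Plan {
    // Stands in for execute(): one lazy iterator per partition.
    def execute(): Seq[Iterator[Row]]
    // Default collect path: materialize every partition on the driver.
    def executeCollect(): Array[Row] = execute().flatMap(_.toList).toArray
  }

  // Leaf operator that counts how many rows are actually pulled from it.
  final class Scan(partitions: Seq[Seq[Row]]) extends Plan {
    var rowsRead = 0
    def execute(): Seq[Iterator[Row]] =
      partitions.map(p => p.iterator.map { r => rowsRead += 1; r })
  }

  final class Limit(n: Int, child: Plan) extends Plan {
    // Generic path: limit each partition, gather the results, limit again.
    def execute(): Seq[Iterator[Row]] = {
      val perPartition = child.execute().map(_.take(n).toList)
      Seq(perPartition.flatten.take(n).iterator)
    }
    // Specialized path: walk partitions in order and stop once n rows are seen.
    override def executeCollect(): Array[Row] =
      child.execute().iterator.flatMap(it => it).take(n).toArray
  }

  // Mirrors execute().collect(): runs the generic path and gathers everything.
  def genericCollect(plan: Plan): Array[Row] =
    plan.execute().flatMap(_.toList).toArray
}
```

In this sketch both paths return the same rows, but the generic path pulls up to `n` rows from every partition while the specialized path stops as soon as it has `n` rows in total.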

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@rxin
Contributor

rxin commented Jun 2, 2014

LGTM.

// Overridden RDD actions
// =======================================================================

override def collect() = queryExecution.executedPlan.executeCollect()
Contributor

yay

@rxin
Contributor

rxin commented Jun 2, 2014

Actually - please define the return type explicitly for public methods.

@AmplabJenkins

Merged build finished.

@AmplabJenkins

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15341/

@rxin
Contributor

rxin commented Jun 2, 2014

Actually a lot of tests are failing ...

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@liancheng
Contributor Author

@rxin Shame... I underestimated this issue and didn't run the full test suite locally :( I think the problem is that executeCollect() should copy row objects to keep the returned data immutable. Also, users now shouldn't call collect() on SchemaRDDs returned by a SQL/HiveQL command, since executeCollect() calls execute() and causes the command to be executed twice.
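As a rough illustration of the row-copying point, the sketch below assumes an operator that reuses a single mutable row object for every record it emits. `MutableRow` and `scan` are hypothetical names made up for this example, not Spark's API. Collecting without copying leaves every array slot pointing at the same object, so all entries show the last value written.

```scala
// Toy model only: MutableRow and scan are hypothetical, not Spark's classes.
object RowCopySketch {
  // Stands in for a row buffer that an operator mutates in place.
  final class MutableRow(var value: Int) {
    def copy(): MutableRow = new MutableRow(value)
  }

  // Emits the SAME row object for every input, overwriting its contents.
  def scan(values: Seq[Int]): Iterator[MutableRow] = {
    val buffer = new MutableRow(0)
    values.iterator.map { v => buffer.value = v; buffer }
  }

  // Every collected element aliases the shared buffer.
  def collectWithoutCopy(values: Seq[Int]): Array[Int] =
    scan(values).toArray.map(_.value)

  // Each row is copied before it is stored, so the values survive.
  def collectWithCopy(values: Seq[Int]): Array[Int] =
    scan(values).map(_.copy()).toArray.map(_.value)
}
```

With input `Seq(1, 2, 3)`, `collectWithoutCopy` yields three copies of the last value, while `collectWithCopy` preserves `1, 2, 3`.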

@AmplabJenkins

Merged build finished. All automated tests passed.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15349/

@rxin
Contributor

rxin commented Jun 2, 2014

Thanks. I've merged this in master & branch-1.0.

@asfgit asfgit closed this in d000ca9 Jun 2, 2014
asfgit pushed a commit that referenced this pull request Jun 2, 2014
…lect() on the underlying query plan.

In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #939 from liancheng/spark-1958 and squashes the following commits:

bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
8250976 [Cheng Lian] Added return type explicitly for public API
192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.

(cherry picked from commit d000ca9)
Signed-off-by: Reynold Xin <rxin@apache.org>
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014