[SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan. #939

liancheng · 2014-06-02T03:01:52Z

In cases like Limit and TakeOrdered, executeCollect() makes optimizations that execute().collect() will not.
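A minimal sketch of the kind of optimization the description refers to, using a toy model rather than Spark's real classes. `ToyPlan`, `ToyScan`, and `ToyLimit` are hypothetical names, and the partition handling is an assumption made for illustration: the generic path evaluates every partition, while a specialized `executeCollect()` can stop as soon as it has gathered enough rows.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model only: these types are hypothetical and are not Spark's SparkPlan API.
trait ToyPlan {
  // One deferred computation per partition.
  def execute(): Seq[() => Seq[Int]]
  // Generic collect: evaluates every partition and concatenates the rows.
  def executeCollect(): Array[Int] = execute().flatMap(part => part()).toArray
}

final class ToyScan(partitions: Seq[Seq[Int]]) extends ToyPlan {
  var partitionsRead = 0 // counts how many partitions were actually evaluated
  def execute(): Seq[() => Seq[Int]] =
    partitions.map(p => () => { partitionsRead += 1; p })
}

final class ToyLimit(n: Int, child: ToyPlan) extends ToyPlan {
  // Generic path: take n rows from every partition, then keep the first n overall.
  def execute(): Seq[() => Seq[Int]] = {
    val perPartition = child.execute().map(part => part().take(n)) // touches all partitions
    Seq(() => perPartition.flatten.take(n))
  }

  // Specialized path: stop evaluating partitions once n rows have been gathered.
  override def executeCollect(): Array[Int] = {
    val buf = ArrayBuffer.empty[Int]
    val parts = child.execute().iterator
    while (buf.size < n && parts.hasNext) buf ++= parts.next()().take(n - buf.size)
    buf.toArray
  }
}

object LimitDemo {
  def main(args: Array[String]): Unit = {
    val data = Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))

    val scanA = new ToyScan(data)
    val viaExecute = new ToyLimit(2, scanA).execute().flatMap(part => part()).toArray
    println(s"execute + collect: ${viaExecute.mkString(",")} (partitions read: ${scanA.partitionsRead})")

    val scanB = new ToyScan(data)
    val viaCollect = new ToyLimit(2, scanB).executeCollect()
    println(s"executeCollect:    ${viaCollect.mkString(",")} (partitions read: ${scanB.partitionsRead})")
  }
}
```

Both paths return the same two rows in this toy model, but the generic path reads all three partitions while the specialized one reads only the first.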

AmplabJenkins · 2014-06-02T03:02:58Z

Merged build triggered.

AmplabJenkins · 2014-06-02T03:03:07Z

Merged build started.

rxin · 2014-06-02T03:09:14Z

LGTM.

aarondav · 2014-06-02T03:09:31Z

sql/core/src/main/scala/org/apache/spark/sql/SchemaRDD.scala

+  // Overriden RDD actions
+  // =======================================================================
+
+  override def collect() = queryExecution.executedPlan.executeCollect()


rxin · 2014-06-02T03:11:35Z

Actually - please define the return type explicitly for public methods.

AmplabJenkins · 2014-06-02T04:13:58Z

Merged build finished.

AmplabJenkins · 2014-06-02T04:13:59Z

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15341/

rxin · 2014-06-02T04:16:18Z

Actually a lot of tests are failing ...

AmplabJenkins · 2014-06-02T15:32:58Z

Merged build triggered.

AmplabJenkins · 2014-06-02T15:33:08Z

Merged build started.

liancheng · 2014-06-02T16:03:22Z

@rxin Shame... I underestimated this issue and didn't run the full test suite locally :( I think the problem is that executeCollect() should copy row objects to keep the data immutable. Also, users now shouldn't call collect() on SchemaRDDs returned by a SQL/HiveQL command, since executeCollect() calls execute() and causes the command to be executed a second time.
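A small sketch of why copying rows matters, under the assumption (not spelled out in this thread) that an operator may reuse a single mutable row object for every record it emits. `ToyMutableRow` and `RowReuseDemo` are hypothetical names, not Spark's Row API.

```scala
// Hypothetical stand-in for a reusable mutable row; not Spark's Row API.
final class ToyMutableRow(var value: Int) {
  def copy(): ToyMutableRow = new ToyMutableRow(value)
}

object RowReuseDemo {
  // An operator that reuses one row object for every output record.
  def rows(data: Seq[Int]): Iterator[ToyMutableRow] = {
    val shared = new ToyMutableRow(0)
    data.iterator.map { v => shared.value = v; shared }
  }

  def main(args: Array[String]): Unit = {
    // Without copying, every array slot points at the same object,
    // so all slots show the last value written.
    val aliased = rows(Seq(1, 2, 3)).toArray.map(_.value)
    // Copying each row as it is produced preserves the individual values.
    val copied = rows(Seq(1, 2, 3)).map(_.copy()).toArray.map(_.value)

    println(aliased.mkString(",")) // prints 3,3,3
    println(copied.mkString(","))  // prints 1,2,3
  }
}
```

The commit "Copy rows to present immutable data to users" listed in this PR addresses the same concern in the real code; the sketch above only illustrates the aliasing effect.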

AmplabJenkins · 2014-06-02T16:48:31Z

Merged build finished. All automated tests passed.

AmplabJenkins · 2014-06-02T16:48:31Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15349/

rxin · 2014-06-02T19:09:30Z

Thanks. I've merged this in master & branch-1.0.

…lect() on the underlying query plan.

In cases like `Limit` and `TakeOrdered`, `executeCollect()` makes optimizations that `execute().collect()` will not.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #939 from liancheng/spark-1958 and squashes the following commits:

bdc4a14 [Cheng Lian] Copy rows to present immutable data to users
8250976 [Cheng Lian] Added return type explicitly for public API
192a25c [Cheng Lian] [SPARK-1958] Calling .collect() on a SchemaRDD should call executeCollect() on the underlying query plan.

(cherry picked from commit d000ca9)
Signed-off-by: Reynold Xin <rxin@apache.org>

[SPARK-1958] Calling .collect() on a SchemaRDD should call executeCol…

192a25c

…lect() on the underlying query plan.

aarondav reviewed Jun 2, 2014
liancheng added 2 commits June 2, 2014 17:25

Added return type explicitly for public API

8250976

Copy rows to present immutable data to users

bdc4a14

asfgit closed this in d000ca9 Jun 2, 2014

pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014

[SPARK-1958] Calling .collect() on a SchemaRDD should call executeCol…
