[SPARK-20960][SQL] make ColumnVector public #20116

cloud-fan · 2017-12-29T16:25:29Z

What changes were proposed in this pull request?

Move ColumnVector and related classes to org.apache.spark.sql.vectorized, and improve the documentation.

How was this patch tested?

existing tests.

cloud-fan · 2017-12-29T16:27:11Z

cc @hvanhovell @kiszk @viirya @gatorsmile

cloud-fan · 2017-12-29T16:29:57Z

sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java

@@ -87,19 +79,7 @@ public void remove() {
  }

  /**
-   * Resets the batch for writing.
-   */
-  public void reset() {


remove this as it's for WritableColumnVector only

SparkQA · 2017-12-29T19:21:48Z

Test build #85517 has finished for PR 20116 at commit 7dc4496.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2017-12-29T23:54:50Z

sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java

- * TODO:
- *  - There are many TODOs for the existing APIs. They should throw a not implemented exception.
- *  - Compaction: The batch and columns should be able to compact based on a selection vector.
+ * This class is a wrapper of multiple ColumnVectors and represents a table. It provides a row-view


a table -> a portion of a table?

viirya · 2017-12-30T00:00:25Z

sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnVector.java

@@ -14,32 +14,38 @@
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.sources.v2.vectorized;


Is the v2 package the proper place for them? Do ColumnVector and related classes belong to the data source v2 API?

viirya · 2017-12-30T00:17:59Z

sql/core/src/main/java/org/apache/spark/sql/sources/v2/vectorized/ColumnarBatch.java

-        ((WritableColumnVector) columns[i]).reset();
-      }
-    }
-    this.numRows = 0;


This can result in an incorrect numRows after the column vectors are reset. Could it be a potential error?

This doesn't matter. numRows is only used when calling rowIterator, and we always call setNumRows before calling rowIterator.
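The exchange above can be illustrated with a minimal sketch. The classes below (`ToyIntVector`, `ToyBatch`, `ToyReader`) are hypothetical, simplified stand-ins invented for this example, not Spark's real `WritableColumnVector`, `ColumnarBatch`, or Parquet reader. They only model the pattern under discussion: the reader resets the vector itself, and the batch's row count may be stale until `setNumRows` is called, which always happens before rows are read.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-ins for illustration; not Spark's real classes.
final class ToyIntVector {
    private int[] data;
    private int count;

    ToyIntVector(int capacity) { data = new int[Math.max(1, capacity)]; }

    // Forget the contents but keep the allocated buffer for reuse.
    void reset() { count = 0; }

    void append(int value) {
        if (count == data.length) data = Arrays.copyOf(data, count * 2);
        data[count++] = value;
    }

    int get(int row) { return data[row]; }
}

final class ToyBatch {
    private final ToyIntVector column;
    private int numRows; // may be stale between a vector reset and the next setNumRows

    ToyBatch(ToyIntVector column) { this.column = column; }

    void setNumRows(int numRows) { this.numRows = numRows; }

    // The only place numRows is read: copies the valid rows out of the vector.
    int[] rows() {
        int[] out = new int[numRows];
        for (int i = 0; i < numRows; i++) out[i] = column.get(i);
        return out;
    }
}

final class ToyReader {
    private final int[] source;
    private final int batchSize;
    private final ToyIntVector vector;
    private final ToyBatch batch;
    private int pos;

    ToyReader(int[] source, int batchSize) {
        if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
        this.source = source;
        this.batchSize = batchSize;
        this.vector = new ToyIntVector(batchSize);
        this.batch = new ToyBatch(vector);
    }

    // Advances to the next batch of rows. Returns false if there are no more.
    boolean nextBatch() {
        vector.reset(); // the reader resets the vector; the batch has no reset()
        if (pos >= source.length) return false;
        int n = Math.min(batchSize, source.length - pos);
        for (int i = 0; i < n; i++) vector.append(source[pos + i]);
        pos += n;
        batch.setNumRows(n); // always set before any rows are read
        return true;
    }

    ToyBatch batch() { return batch; }
}
```

In this sketch a caller loops on `nextBatch()` and reads `batch().rows()` only after it returns true, so the stale row count left behind by a reset is never observed.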

SparkQA · 2018-01-02T08:05:02Z

Test build #85587 has finished for PR 20116 at commit 4bf69d8.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-01-02T08:14:56Z

retest this please

SparkQA · 2018-01-02T10:36:38Z

Test build #85589 has finished for PR 20116 at commit 4bf69d8.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-01-02T11:52:14Z

retest this please

SparkQA · 2018-01-02T14:55:10Z

Test build #85593 has finished for PR 20116 at commit 4bf69d8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-01-02T15:32:20Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

 *
- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct type, it stores the actual


nit: child -> children

it's already children

Sorry for my mistake.

kiszk · 2018-01-02T15:33:49Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

+ * ColumnVector is expected to be reused during the entire data loading process, to avoid allocating
+ * memory again and again.
+ *
+ * ColumnVector is meant to maximize CPU efficiency and not storage footprint, implementations


nit: not -> not to minimize

kiszk · 2018-01-02T16:25:52Z

sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java

@@ -586,7 +587,7 @@ public final int appendStruct(boolean isNull) {
    if (isNull) {
      appendNull();
      for (ColumnVector c: childColumns) {
-        if (c.type instanceof StructType) {
+        if (c.dataType() instanceof StructType) {


nit: Which accessor should we use for ColumnVector.type: dataType() or type?
For example, OnHeapColumnVector.reserveInternal() uses type, while this line uses dataType().

SparkQA · 2018-01-03T08:05:01Z

Test build #85620 has finished for PR 20116 at commit e3a7a07.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-01-03T08:39:48Z

retest this please

SparkQA · 2018-01-03T11:43:56Z

Test build #85623 has finished for PR 20116 at commit e3a7a07.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile

LGTM

gatorsmile · 2018-01-03T16:23:23Z

...n/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java

@@ -248,7 +248,10 @@ public void enableReturningBatches() {
   * Advances to the next batch of rows. Returns false if there are no more.
   */
  public boolean nextBatch() throws IOException {
-    columnarBatch.reset();
+    for (WritableColumnVector vector : columnVectors) {


Remove the space before :

This is the standard Java foreach code style.

gatorsmile · 2018-01-03T16:30:18Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java

- *  - There are many TODOs for the existing APIs. They should throw a not implemented exception.
- *  - Compaction: The batch and columns should be able to compact based on a selection vector.
+ * This class is a wrapper of multiple ColumnVectors and represents a logical table-like data
+ * structure. It provides a row-view of this batch so that Spark can access the data row by row.


row-view -> row view

gatorsmile · 2018-01-03T16:39:15Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java

- * TODO:
- *  - There are many TODOs for the existing APIs. They should throw a not implemented exception.
- *  - Compaction: The batch and columns should be able to compact based on a selection vector.
+ * This class is a wrapper of multiple ColumnVectors and represents a logical table-like data


How about?

This class wraps multiple ColumnVectors as a row-wise table

gatorsmile · 2018-01-03T16:40:47Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnarBatch.java

-
-  /**
-   * Sets the number of rows that are valid.
+   * Sets the number of rows that are valid in this batch.


How about?

Sets the number of rows

gatorsmile · 2018-01-03T16:41:34Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

+ * memory again and again.
+ *
+ * ColumnVector is meant to maximize CPU efficiency but not to minimize storage footprint,
+ * implementations should prefer computing efficiency over storage efficiency when design the


implementations -> Implementations

gatorsmile · 2018-01-03T16:42:15Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

- * Maps are just a special case of a two field struct.
+ * ColumnVector supports all the data types including nested types. To handle nested types,
+ * ColumnVector can have children and is a tree structure. For struct type, it stores the actual
+ * data of each field in the corresponding child ColumnVector, and only store null information in


store -> stores

gatorsmile · 2018-01-03T16:42:27Z

sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java

+ * ColumnVector can have children and is a tree structure. For struct type, it stores the actual
+ * data of each field in the corresponding child ColumnVector, and only store null information in
+ * the parent ColumnVector. For array type, it stores the actual array elements in the child
+ * ColumnVector, and store null information, array offsets and lengths in the parent ColumnVector.


store -> stores
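The array layout described in the doc comment under review (elements in the child, null information plus offsets and lengths in the parent) can be sketched as follows. `ToyArrayColumn` is a hypothetical toy class written for this illustration and covers only an array-of-int column; it is not Spark's `ColumnVector` and makes no claim about how Spark implements the layout internally.

```java
import java.util.Arrays;

// Hypothetical toy model of an array<int> column; not Spark's real ColumnVector.
final class ToyArrayColumn {
    // Parent-level data: one entry per row.
    private final boolean[] isNull;
    private final int[] offsets;
    private final int[] lengths;
    // Child-level data: the elements of all rows, stored back to back.
    private final int[] child;

    ToyArrayColumn(int[][] rows) {
        int n = rows.length;
        isNull = new boolean[n];
        offsets = new int[n];
        lengths = new int[n];
        int total = 0;
        for (int[] row : rows) {
            if (row != null) total += row.length;
        }
        child = new int[total];
        int next = 0;
        for (int i = 0; i < n; i++) {
            offsets[i] = next;
            if (rows[i] == null) {
                isNull[i] = true; // a null row contributes no child elements
                continue;
            }
            lengths[i] = rows[i].length;
            System.arraycopy(rows[i], 0, child, next, rows[i].length);
            next += rows[i].length;
        }
    }

    boolean isNullAt(int row) { return isNull[row]; }

    // Rebuilds one row's array from the parent's offset and length and the child's elements.
    int[] getArray(int row) {
        if (isNull[row]) return null;
        return Arrays.copyOfRange(child, offsets[row], offsets[row] + lengths[row]);
    }

    int childSize() { return child.length; }
}
```

A struct column would follow the same idea with one child per field and only null flags in the parent; that case is left out here to keep the sketch short.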

SparkQA · 2018-01-03T19:45:54Z

Test build #85637 has finished for PR 20116 at commit c82fc5b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2018-01-03T23:26:54Z

LGTM

Thanks! Merged to master and 2.3.

## What changes were proposed in this pull request?

move `ColumnVector` and related classes to `org.apache.spark.sql.vectorized`, and improve the document.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20116 from cloud-fan/column-vector.

(cherry picked from commit b297029)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

cloud-fan commented Dec 29, 2017

viirya reviewed Dec 30, 2017

viirya mentioned this pull request Dec 30, 2017

[SPARK-16060][SQL] Support Vectorized ORC Reader #19943 (Closed)

cloud-fan added 2 commits January 2, 2018 13:57

make ColumnVector public (ea1bb78)
address comments (4bf69d8)

cloud-fan force-pushed the column-vector branch from 7dc4496 to 4bf69d8 January 2, 2018 06:00

kiszk reviewed Jan 2, 2018

address comments (e3a7a07)

gatorsmile approved these changes Jan 3, 2018

address comments (c82fc5b)

dongjoon-hyun approved these changes Jan 3, 2018

asfgit closed this in b297029 Jan 3, 2018
