Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20960][SQL] make ColumnVector public #20116

Closed
wants to merge 4 commits into from

Conversation

cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Dec 29, 2017

What changes were proposed in this pull request?

move ColumnVector and related classes to org.apache.spark.sql.vectorized, and improve the document.

How was this patch tested?

existing tests.

@cloud-fan
Copy link
Contributor Author

@@ -87,19 +79,7 @@ public void remove() {
}

/**
* Resets the batch for writing.
*/
public void reset() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove this as it's for WritableColumnVector only

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@SparkQA
Copy link

SparkQA commented Dec 29, 2017

Test build #85517 has finished for PR 20116 at commit 7dc4496.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* TODO:
* - There are many TODOs for the existing APIs. They should throw a not implemented exception.
* - Compaction: The batch and columns should be able to compact based on a selection vector.
* This class is a wrapper of multiple ColumnVectors and represents a table. It provides a row-view
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a table -> a portion of a table?

@@ -14,32 +14,38 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.execution.vectorized;
package org.apache.spark.sql.sources.v2.vectorized;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is v2 package proper for them? Are ColumnVector and related classes belong to data source v2 API?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

((WritableColumnVector) columns[i]).reset();
}
}
this.numRows = 0;
Copy link
Member

@viirya viirya Dec 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can result an incorrect numRows after the column vectors are reset. May it be a potential error?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't matter. The numRows is only used when calling rowsInterator, and we always call setNumRows before calling rowsIterator

@SparkQA
Copy link

SparkQA commented Jan 2, 2018

Test build #85587 has finished for PR 20116 at commit 4bf69d8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 2, 2018

Test build #85589 has finished for PR 20116 at commit 4bf69d8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jan 2, 2018

Test build #85593 has finished for PR 20116 at commit 4bf69d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* Maps are just a special case of a two field struct.
* ColumnVector supports all the data types including nested types. To handle nested types,
* ColumnVector can have children and is a tree structure. For struct type, it stores the actual
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: child -> children

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's already children

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for my mistake.

* ColumnVector is expected to be reused during the entire data loading process, to avoid allocating
* memory again and again.
*
* ColumnVector is meant to maximize CPU efficiency and not storage footprint, implementations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not -> not to minimize

@@ -586,7 +587,7 @@ public final int appendStruct(boolean isNull) {
if (isNull) {
appendNull();
for (ColumnVector c: childColumns) {
if (c.type instanceof StructType) {
if (c.dataType() instanceof StructType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Which access type will we use for ColumnVector.type? dataType() or type?
For example, OnHeapColumnVector.reserveInternal() uses type while this line uses dataType().

@SparkQA
Copy link

SparkQA commented Jan 3, 2018

Test build #85620 has finished for PR 20116 at commit e3a7a07.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Jan 3, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 3, 2018

Test build #85623 has finished for PR 20116 at commit e3a7a07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@gatorsmile gatorsmile left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -248,7 +248,10 @@ public void enableReturningBatches() {
* Advances to the next batch of rows. Returns false if there are no more.
*/
public boolean nextBatch() throws IOException {
columnarBatch.reset();
for (WritableColumnVector vector : columnVectors) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the space before :

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the standard java foreach code style

* - There are many TODOs for the existing APIs. They should throw a not implemented exception.
* - Compaction: The batch and columns should be able to compact based on a selection vector.
* This class is a wrapper of multiple ColumnVectors and represents a logical table-like data
* structure. It provides a row-view of this batch so that Spark can access the data row by row.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row-view -> row view

* TODO:
* - There are many TODOs for the existing APIs. They should throw a not implemented exception.
* - Compaction: The batch and columns should be able to compact based on a selection vector.
* This class is a wrapper of multiple ColumnVectors and represents a logical table-like data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about?

This class wraps multiple ColumnVectors as a row-wise table


/**
* Sets the number of rows that are valid.
* Sets the number of rows that are valid in this batch.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about?

Sets the number of rows

* memory again and again.
*
* ColumnVector is meant to maximize CPU efficiency but not to minimize storage footprint,
* implementations should prefer computing efficiency over storage efficiency when design the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

implementations -> Implementations

* Maps are just a special case of a two field struct.
* ColumnVector supports all the data types including nested types. To handle nested types,
* ColumnVector can have children and is a tree structure. For struct type, it stores the actual
* data of each field in the corresponding child ColumnVector, and only store null information in
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store -> stores

* ColumnVector can have children and is a tree structure. For struct type, it stores the actual
* data of each field in the corresponding child ColumnVector, and only store null information in
* the parent ColumnVector. For array type, it stores the actual array elements in the child
* ColumnVector, and store null information, array offsets and lengths in the parent ColumnVector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

store -> stores

@SparkQA
Copy link

SparkQA commented Jan 3, 2018

Test build #85637 has finished for PR 20116 at commit c82fc5b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

gatorsmile commented Jan 3, 2018

LGTM

Thanks! Merged to master and 2.3.

asfgit pushed a commit that referenced this pull request Jan 3, 2018
## What changes were proposed in this pull request?

move `ColumnVector` and related classes to `org.apache.spark.sql.vectorized`, and improve the document.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20116 from cloud-fan/column-vector.

(cherry picked from commit b297029)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
@asfgit asfgit closed this in b297029 Jan 3, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants