[SPARK-20960][SQL] make ColumnVector public #20116
@@ -87,19 +79,7 @@ public void remove() {
}

/**
* Resets the batch for writing.
*/
public void reset() {
Remove this, as it's for `WritableColumnVector` only.
+1
Test build #85517 has finished for PR 20116 at commit
* TODO:
* - There are many TODOs for the existing APIs. They should throw a not implemented exception.
* - Compaction: The batch and columns should be able to compact based on a selection vector.
* This class is a wrapper of multiple ColumnVectors and represents a table. It provides a row-view
a table -> a portion of a table?
@@ -14,32 +14,38 @@
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-package org.apache.spark.sql.execution.vectorized;
+package org.apache.spark.sql.sources.v2.vectorized;
Is the v2 package appropriate for them? Do `ColumnVector` and related classes belong to the data source v2 API?
+1
((WritableColumnVector) columns[i]).reset();
}
}
this.numRows = 0;
This can result in an incorrect `numRows` after the column vectors are reset. Could that be a potential error?
This doesn't matter. `numRows` is only used when calling `rowIterator`, and we always call `setNumRows` before calling `rowIterator`.
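For readers following the thread, the contract described in this reply can be sketched with toy classes. Everything below (`NumRowsSketch`, `ToyIntVector`, `ToyBatch`, `sumOfBatch`) is a hypothetical stand-in and not Spark's code; it only assumes, as the reply states, that the row count is read solely by the row iterator and is always set before iterating.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

// Hypothetical stand-ins for illustration; these are not Spark's classes.
final class NumRowsSketch {

  /** A toy int column backed by an array that is reused across batches. */
  static final class ToyIntVector {
    private final int[] data;

    ToyIntVector(int capacity) { data = new int[capacity]; }

    void putInt(int rowId, int value) { data[rowId] = value; }
    int getInt(int rowId) { return data[rowId]; }
  }

  /** A toy batch whose numRows only bounds the row iterator. */
  static final class ToyBatch {
    private final ToyIntVector column;
    private int numRows; // may be stale until setNumRows is called again

    ToyBatch(ToyIntVector column) { this.column = column; }

    void setNumRows(int numRows) { this.numRows = numRows; }
    int numRows() { return numRows; }

    Iterator<Integer> rowIterator() {
      final int maxRows = numRows; // snapshot taken when iteration starts
      return new Iterator<Integer>() {
        private int rowId = 0;

        @Override public boolean hasNext() { return rowId < maxRows; }

        @Override public Integer next() {
          if (rowId >= maxRows) throw new NoSuchElementException();
          return column.getInt(rowId++);
        }
      };
    }
  }

  /** Overwrites the vector, publishes the row count, then iterates. */
  static int sumOfBatch(ToyIntVector vector, ToyBatch batch, int[] values) {
    for (int i = 0; i < values.length; i++) {
      vector.putInt(i, values[i]);
    }
    batch.setNumRows(values.length); // always set before rowIterator()
    int sum = 0;
    Iterator<Integer> rows = batch.rowIterator();
    while (rows.hasNext()) {
      sum += rows.next();
    }
    return sum;
  }

  public static void main(String[] args) {
    ToyIntVector vector = new ToyIntVector(4);
    ToyBatch batch = new ToyBatch(vector);
    System.out.println(sumOfBatch(vector, batch, new int[] {1, 2, 3})); // 6
    System.out.println(sumOfBatch(vector, batch, new int[] {10}));      // 10
  }
}
```

Between the two calls the batch still reports three rows even though the vector is about to be overwritten, which is the stale state the question is about. It is harmless in this sketch only because nothing iterates before `setNumRows` runs again.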
Force-pushed from 7dc4496 to 4bf69d8.
Test build #85587 has finished for PR 20116 at commit
retest this please
Test build #85589 has finished for PR 20116 at commit
retest this please
Test build #85593 has finished for PR 20116 at commit
*
* Maps are just a special case of a two field struct.
* ColumnVector supports all the data types including nested types. To handle nested types,
* ColumnVector can have children and is a tree structure. For struct type, it stores the actual
nit: child -> children
it's already children
Sorry for my mistake.
* ColumnVector is expected to be reused during the entire data loading process, to avoid allocating
* memory again and again.
*
* ColumnVector is meant to maximize CPU efficiency and not storage footprint, implementations
nit: not -> not to minimize
@@ -586,7 +587,7 @@ public final int appendStruct(boolean isNull) {
if (isNull) {
appendNull();
for (ColumnVector c: childColumns) {
-if (c.type instanceof StructType) {
+if (c.dataType() instanceof StructType) {
nit: Which access type will we use for `ColumnVector.type`? `dataType()` or `type`?
For example, `OnHeapColumnVector.reserveInternal()` uses `type` while this line uses `dataType()`.
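As background for this nit, here is a minimal sketch of the two access styles, assuming the type is held in a protected field with a public accessor. All names (`AccessorSketch`, `ToyType`, `BaseVector`, `HeapVector`) are hypothetical and not Spark's. One Java rule may be relevant, though the thread does not say it is the reason for the change: a subclass in a different package can read an inherited protected field on itself, but not through a reference whose static type is the base class, so `c.type` on a base-typed child would stop compiling after a package move while `c.dataType()` keeps working. This single-file sketch lives in one package, so it only shows the two styles side by side.

```java
// Hypothetical stand-ins for illustration; these are not Spark's classes.
final class AccessorSketch {

  enum ToyType { INT, STRUCT }

  /** Base vector: the type is a protected field with a public accessor. */
  abstract static class BaseVector {
    protected final ToyType type;

    protected BaseVector(ToyType type) { this.type = type; }

    public final ToyType dataType() { return type; }
  }

  static final class HeapVector extends BaseVector {
    private final BaseVector[] children;

    HeapVector(ToyType type, BaseVector... children) {
      super(type);
      this.children = children;
    }

    /** Reads its own type through the inherited field, as a subclass may. */
    boolean isStruct() { return type == ToyType.STRUCT; }

    /** Reads the children's types through the public accessor. */
    int countStructChildren() {
      int count = 0;
      for (BaseVector c : children) {
        if (c.dataType() == ToyType.STRUCT) count++;
      }
      return count;
    }
  }

  public static void main(String[] args) {
    HeapVector leaf = new HeapVector(ToyType.INT);
    HeapVector inner = new HeapVector(ToyType.STRUCT, leaf);
    HeapVector outer = new HeapVector(ToyType.STRUCT, inner, leaf);
    System.out.println(outer.isStruct());            // true
    System.out.println(outer.countStructChildren()); // 1
  }
}
```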
Test build #85620 has finished for PR 20116 at commit
retest this please
Test build #85623 has finished for PR 20116 at commit
LGTM
@@ -248,7 +248,10 @@ public void enableReturningBatches() {
* Advances to the next batch of rows. Returns false if there are no more.
*/
public boolean nextBatch() throws IOException {
-columnarBatch.reset();
+for (WritableColumnVector vector : columnVectors) {
Remove the space before `:`
This is the standard Java foreach code style.
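The hunk under discussion moves the reset out of the batch and into the reader that owns the writable vectors. A minimal sketch of that split follows, assuming a single int column and pre-supplied batches. The names (`ReaderResetSketch`, `ReadableVector`, `WritableIntVector`, `Batch`, `Reader`, `pendingBatches`) are hypothetical and not Spark's API.

```java
// Hypothetical stand-ins for illustration; these are not Spark's classes.
final class ReaderResetSketch {

  /** Read-only view: what consumers of a batch see. */
  interface ReadableVector {
    int getInt(int rowId);
  }

  /** Writable vector: only the producer holds references of this type. */
  static final class WritableIntVector implements ReadableVector {
    private final int[] data;
    private int elementsAppended;

    WritableIntVector(int capacity) { data = new int[capacity]; }

    void reset() { elementsAppended = 0; }
    void appendInt(int value) { data[elementsAppended++] = value; }

    @Override public int getInt(int rowId) { return data[rowId]; }
  }

  /** Batch over read-only vectors; it has no reset() of its own. */
  static final class Batch {
    private final ReadableVector[] columns;
    private int numRows;

    Batch(ReadableVector[] columns) { this.columns = columns; }

    void setNumRows(int numRows) { this.numRows = numRows; }
    int numRows() { return numRows; }
    ReadableVector column(int ordinal) { return columns[ordinal]; }
  }

  /** Producer that owns the writable vectors and resets them per batch. */
  static final class Reader {
    private final WritableIntVector[] columnVectors;
    private final Batch batch;
    private final int[][] pendingBatches; // one int[] of values per batch
    private int nextIndex = 0;

    Reader(int capacity, int[][] pendingBatches) {
      this.columnVectors = new WritableIntVector[] { new WritableIntVector(capacity) };
      this.batch = new Batch(columnVectors);
      this.pendingBatches = pendingBatches;
    }

    /** Advances to the next batch. Returns false if there are no more. */
    boolean nextBatch() {
      for (WritableIntVector vector : columnVectors) {
        vector.reset(); // the reader resets the vectors, not the batch
      }
      batch.setNumRows(0);
      if (nextIndex >= pendingBatches.length) return false;
      int[] values = pendingBatches[nextIndex++];
      for (int value : values) {
        columnVectors[0].appendInt(value);
      }
      batch.setNumRows(values.length);
      return true;
    }

    Batch currentBatch() { return batch; }
  }

  public static void main(String[] args) {
    Reader reader = new Reader(4, new int[][] {{1, 2, 3}, {7}});
    while (reader.nextBatch()) {
      Batch b = reader.currentBatch();
      System.out.println(b.numRows() + " row(s), first = " + b.column(0).getInt(0));
    }
  }
}
```

Keeping `reset()` on the writable type means a consumer holding only the read-only view cannot clear data that the producer still owns.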
* - There are many TODOs for the existing APIs. They should throw a not implemented exception.
* - Compaction: The batch and columns should be able to compact based on a selection vector.
* This class is a wrapper of multiple ColumnVectors and represents a logical table-like data
* structure. It provides a row-view of this batch so that Spark can access the data row by row.
`row-view` -> `row view`
* TODO:
* - There are many TODOs for the existing APIs. They should throw a not implemented exception.
* - Compaction: The batch and columns should be able to compact based on a selection vector.
* This class is a wrapper of multiple ColumnVectors and represents a logical table-like data
How about: `This class wraps multiple ColumnVectors as a row-wise table`?

/**
-* Sets the number of rows that are valid.
+* Sets the number of rows that are valid in this batch.
How about: `Sets the number of rows`?
* memory again and again.
*
* ColumnVector is meant to maximize CPU efficiency but not to minimize storage footprint,
* implementations should prefer computing efficiency over storage efficiency when design the
`implementations` -> `Implementations`
* Maps are just a special case of a two field struct.
* ColumnVector supports all the data types including nested types. To handle nested types,
* ColumnVector can have children and is a tree structure. For struct type, it stores the actual
* data of each field in the corresponding child ColumnVector, and only store null information in
`store` -> `stores`
* ColumnVector can have children and is a tree structure. For struct type, it stores the actual
* data of each field in the corresponding child ColumnVector, and only store null information in
* the parent ColumnVector. For array type, it stores the actual array elements in the child
* ColumnVector, and store null information, array offsets and lengths in the parent ColumnVector.
`store` -> `stores`
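The nested layout described in the doc comment under review can be illustrated with a small sketch. It assumes int-only leaves and uses hypothetical class names (`NestedLayoutSketch`, `IntVector`, `StructVector`, `ArrayVector`) that are not Spark's. The struct parent keeps only null flags and delegates field data to one child per field. The array parent keeps null flags, offsets and lengths, and delegates the elements to a single child.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not Spark's classes.
final class NestedLayoutSketch {

  /** Leaf vector of ints. */
  static final class IntVector {
    private final int[] data;

    IntVector(int capacity) { data = new int[capacity]; }

    void putInt(int rowId, int value) { data[rowId] = value; }
    int getInt(int rowId) { return data[rowId]; }
  }

  /** struct<a:int, b:int>: the parent keeps nulls only; each field is a child. */
  static final class StructVector {
    private final boolean[] nulls;
    private final IntVector[] children;

    StructVector(int capacity) {
      nulls = new boolean[capacity];
      children = new IntVector[] { new IntVector(capacity), new IntVector(capacity) };
    }

    void putStruct(int rowId, int a, int b) {
      nulls[rowId] = false;
      children[0].putInt(rowId, a);
      children[1].putInt(rowId, b);
    }

    void putNull(int rowId) { nulls[rowId] = true; }
    boolean isNullAt(int rowId) { return nulls[rowId]; }
    IntVector child(int ordinal) { return children[ordinal]; }
  }

  /** array<int>: the parent keeps nulls, offsets and lengths; elements are one child. */
  static final class ArrayVector {
    private final boolean[] nulls;
    private final int[] offsets;
    private final int[] lengths;
    private final IntVector elements;
    private int elementsAppended = 0;

    ArrayVector(int rowCapacity, int elementCapacity) {
      nulls = new boolean[rowCapacity];
      offsets = new int[rowCapacity];
      lengths = new int[rowCapacity];
      elements = new IntVector(elementCapacity);
    }

    void putArray(int rowId, int... values) {
      nulls[rowId] = false;
      offsets[rowId] = elementsAppended; // where this row's elements start
      lengths[rowId] = values.length;
      for (int v : values) {
        elements.putInt(elementsAppended++, v);
      }
    }

    void putNull(int rowId) { nulls[rowId] = true; }
    boolean isNullAt(int rowId) { return nulls[rowId]; }

    List<Integer> getArray(int rowId) {
      List<Integer> out = new ArrayList<>(lengths[rowId]);
      for (int i = 0; i < lengths[rowId]; i++) {
        out.add(elements.getInt(offsets[rowId] + i));
      }
      return out;
    }
  }

  public static void main(String[] args) {
    StructVector structs = new StructVector(2);
    structs.putStruct(0, 1, 2);
    structs.putNull(1);
    System.out.println(structs.child(1).getInt(0)); // field b of row 0

    ArrayVector arrays = new ArrayVector(3, 8);
    arrays.putArray(0, 1, 2);
    arrays.putNull(1);
    arrays.putArray(2, 3);
    System.out.println(arrays.getArray(0) + " " + arrays.getArray(2));
  }
}
```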
Test build #85637 has finished for PR 20116 at commit
LGTM Thanks! Merged to master and 2.3.
## What changes were proposed in this pull request?

move `ColumnVector` and related classes to `org.apache.spark.sql.vectorized`, and improve the document.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #20116 from cloud-fan/column-vector.

(cherry picked from commit b297029)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>