Porting GpuRowToColumnar converters to InternalColumnarRDDConverter #4206

Merged

Conversation

@wjxiz1992 (Collaborator) commented Nov 24, 2021

  • allow converting array, map, struct and decimal
  • the converters in GpuRowToColumnarExec are for InternalRow, while this one is for Row (see the sketch below)

Signed-off-by: Allen Xu <allxu@nvidia.com>
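For context, a minimal sketch of the API difference the second bullet refers to (illustrative only; the helper names below are assumptions, not the PR's actual converter code). The existing GpuRowToColumnarExec converters read Catalyst values from InternalRow, while this converter receives external Row objects, so the accessors and value types differ:

  import org.apache.spark.sql.Row
  import org.apache.spark.sql.catalyst.InternalRow
  import org.apache.spark.sql.types.Decimal

  // Catalyst path (GpuRowToColumnarExec): values come from InternalRow and use
  // Catalyst types, e.g. org.apache.spark.sql.types.Decimal.
  def readDecimalFromInternalRow(row: InternalRow, ordinal: Int,
      precision: Int, scale: Int): Decimal =
    row.getDecimal(ordinal, precision, scale)

  // External path (InternalColumnarRDDConverter): values come from Row and use
  // external Java/Scala types, e.g. java.math.BigDecimal.
  def readDecimalFromRow(row: Row, column: Int): java.math.BigDecimal =
    row.getDecimal(column)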

- allow converting array, map, struct and decimal

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 requested review from wbo4958 and revans2 November 24, 2021 10:53
@wjxiz1992 wjxiz1992 self-assigned this Nov 24, 2021
@jlowe (Contributor) left a comment


It would be nice to add tests for this. To what extent has it been tested already?

@firestarman (Collaborator) left a comment


You only need to enable nested-type support for your case here, instead of updating the API definition for all cases.

  def convert(df: DataFrame): RDD[Table] = {
    val schema = df.schema
-   if (!GpuOverrides.areAllSupportedTypes(schema.map(_.dataType) :_*)) {
-     val unsupported = schema.map(_.dataType).filter(!GpuOverrides.isSupportedType(_)).toSet
+   val unsupported = schema.map(_.dataType).filter( d => 
+     !GpuOverrides.isSupportedType(d, allowArray=true, allowNesting=true, ....)
+   ).toSet
+   if (unsupported.nonEmpty) {
      throw new IllegalArgumentException(s"Cannot convert $df to GPU columnar $unsupported are " +
        s"not currently supported data types for columnar.")
    }
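For reference, a possible expansion of the reviewer's suggestion. Only allowArray and allowNesting appear in the snippet above, so the remaining flag names are assumptions based on the types listed in the PR description, not the verified GpuOverrides API:

  // Hedged sketch: flag names other than allowArray/allowNesting are assumptions.
  val unsupported = schema.map(_.dataType).filter { d =>
    !GpuOverrides.isSupportedType(d,
      allowDecimal = true,   // assumed flag
      allowArray   = true,
      allowMaps    = true,   // assumed flag
      allowStruct  = true,   // assumed flag
      allowNesting = true)
  }.toSet
  if (unsupported.nonEmpty) {
    throw new IllegalArgumentException(
      s"Cannot convert $df to GPU columnar: $unsupported are not currently supported data types.")
  }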

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 marked this pull request as draft November 26, 2021 09:49
Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@jlowe (Contributor) commented Nov 30, 2021

java.lang.UnsatisfiedLinkError: ai.rapids.cudf.ColumnView.getNativeValidPointerSize(I)J

The problem is caused by a bug in cudf where ColumnView does not guarantee the cudf native libraries are loaded before attempting a JNI call into those libraries. I've posted a fix at rapidsai/cudf#9800.

Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 wjxiz1992 marked this pull request as ready for review December 2, 2021 13:46
@wjxiz1992 (Collaborator, Author) commented:

build

@wjxiz1992 wjxiz1992 requested a review from jlowe December 2, 2021 13:51
@wjxiz1992 (Collaborator, Author) commented:

build

Signed-off-by: Allen Xu <allxu@nvidia.com>
Signed-off-by: Allen Xu <allxu@nvidia.com>
@wjxiz1992 (Collaborator, Author) commented:

build

@wjxiz1992 (Collaborator, Author) commented:

build

@wjxiz1992 wjxiz1992 requested a review from jlowe December 6, 2021 07:20
@wjxiz1992 (Collaborator, Author) left a comment


Hi @jlowe, it is not covered in the unit tests, but in the performance tests for PCA training I saw several "ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED" messages.
After enabling ai.rapids.refcount.debug, I can see the detailed logs:

21/12/06 16:54:40 ERROR ColumnVector: A DEVICE COLUMN VECTOR WAS LEAKED (ID: 503900 7f3d5c8a8730)
21/12/06 16:54:40 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/12/06 16:54:40 ERROR MemoryCleaner: Leaked vector (ID: 503900): 2021-12-06 16:54:08.0841 CST: INC
java.lang.Thread.getStackTrace(Thread.java:1556)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.ColumnVector.incRefCountInternal(ColumnVector.java:245)
ai.rapids.cudf.ColumnVector.<init>(ColumnVector.java:134)
ai.rapids.cudf.ColumnView$NestedColumnVector.createColumnVector(ColumnView.java:3835)
ai.rapids.cudf.HostColumnVector.copyToDevice(HostColumnVector.java:220)
ai.rapids.cudf.HostColumnVector$ColumnBuilder.buildAndPutOnDevice(HostColumnVector.java:1290)
com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilder.buildAndPutOnDevice(GpuColumnVector.java:402)
com.nvidia.spark.rapids.GpuColumnVector$GpuColumnarBatchBuilderBase.build(GpuColumnVector.java:277)
org.apache.spark.sql.rapids.execution.ExternalRowToColumnarIterator.buildBatch(InternalColumnarRddConverter.scala:614)
org.apache.spark.sql.rapids.execution.ExternalRowToColumnarIterator.next(InternalColumnarRddConverter.scala:578)
org.apache.spark.sql.rapids.execution.ExternalRowToColumnarIterator.next(InternalColumnarRddConverter.scala:561)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
...

2021-12-06 16:54:08.0841 CST: INC
java.lang.Thread.getStackTrace(Thread.java:1556)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.addRef(MemoryCleaner.java:82)
ai.rapids.cudf.ColumnVector.incRefCountInternal(ColumnVector.java:245)
ai.rapids.cudf.ColumnVector.incRefCount(ColumnVector.java:241)
ai.rapids.cudf.Table.<init>(Table.java:71)
com.nvidia.spark.rapids.GpuColumnVector.from(GpuColumnVector.java:610)
org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.$anonfun$convert$10(InternalColumnarRddConverter.scala:724)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
...

2021-12-06 16:54:08.0841 CST: DEC
java.lang.Thread.getStackTrace(Thread.java:1556)
ai.rapids.cudf.MemoryCleaner$RefCountDebugItem.<init>(MemoryCleaner.java:301)
ai.rapids.cudf.MemoryCleaner$Cleaner.delRef(MemoryCleaner.java:90)
ai.rapids.cudf.ColumnVector.close(ColumnVector.java:213)
com.nvidia.spark.rapids.GpuColumnVector.close(GpuColumnVector.java:1045)
org.apache.spark.sql.vectorized.ColumnarBatch.close(ColumnarBatch.java:48)
org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter$.$anonfun$convert$10(InternalColumnarRddConverter.scala:726)
scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
...

The pattern is two INCs followed by one DEC. Checking the code accordingly, the reference that never gets a DEC comes from this line, but the comments on the lines below say the consumer should be responsible for closing it. I'm confused here: should it be closed somewhere manually?

@revans2 (Collaborator) commented Dec 6, 2021

The InternalColumnarRDDConverter assumes that whatever is consuming the Tables it produces will close them. That is not happening in this case and the stack traces do not show enough information to know which part of the code is calling this and not closing the input Table. If it is only happening in PCA training then you need to look at that code.

@jlowe (Contributor) commented Dec 6, 2021

According to the leak stacktraces, the leak is caused by whoever is responsible for the resulting RDD[Table]. Note that we see a Table.<init> call that is incrementing references, but we do not see a corresponding Table.close call that is decrementing the column references. As documented in the ml integration docs, the resulting tables from the RDD[Table] must be closed, and that is not happening in this case. The PCA code probably has a bug where it's not always closing the tables it is using.
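To make the ownership contract concrete, here is a minimal consumer sketch (illustrative, not the actual PCA code), assuming convert is the method shown in the earlier snippet that produces the RDD[Table]; every Table pulled from that RDD must be closed by the caller:

  import ai.rapids.cudf.Table
  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.rapids.execution.InternalColumnarRddConverter

  // given some existing DataFrame `df`
  val tables: RDD[Table] = InternalColumnarRddConverter.convert(df)
  val rowCounts = tables.map { table =>
    try {
      table.getRowCount   // do whatever work is needed with the table
    } finally {
      table.close()       // required: the converter hands ownership to the consumer
    }
  }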

jlowe previously approved these changes Dec 6, 2021

@jlowe (Contributor) left a comment


Minor nit but otherwise lgtm.

}
}

override def getNullSize: Double = OFFSET + VALIDITY
(Contributor review comment on the diff)

Nit: Can use VALIDITY_N_OFFSET here, also applies to the instance a few lines below.
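Applied, the nit amounts to the following (assuming VALIDITY_N_OFFSET is the existing constant defined as VALIDITY + OFFSET):

  override def getNullSize: Double = VALIDITY_N_OFFSET  // instead of OFFSET + VALIDITY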

@wjxiz1992 (Collaborator, Author) commented:

build

@wjxiz1992 (Collaborator, Author) commented:

Oh yes, I created a table on my PCA side and didn't close it. Thanks for the analysis.


private object BooleanConverter extends TypeConverter {
override def append(row: Row,
column: Int,
(Collaborator review comment on the diff)

Suggested change: indent `column: Int,` by two more spaces.

(Collaborator review comment on the diff)

NIT: 2 more spaces for parameters would be better.

@wjxiz1992 wjxiz1992 merged commit df4a047 into NVIDIA:branch-22.02 Dec 7, 2021
@sameerz sameerz added the task Work required that improves the product but is not user facing label Dec 29, 2021