[SPARK-4505][Core] Add a ClassTag parameter to CompactBuffer[T] #3378
…T] when T is a primitive type
Test build #23661 has started for PR 3378 at commit
|
This seems like probably a great idea. Do you know what the overhead of including a classtag is? Does it mean an extra pointer per object? |
Sorry. Yes, the CompactBuffer will have one extra pointer for the ClassTag. |
It's weird. I just found both the sizes of old and new Then I added a field to the old CompactBuffer like this:
class CompactBuffer[T] extends Seq[T] with Serializable {
  val dummy: AnyRef = null
  // First two elements
  private var element0: T = _
  private var element1: T = _
|
This does seem like a good change, though I'll note that I think groupBy is the only current user of this API that is able to have a primitive ClassTag. Still worthwhile, especially for future usage. I do wonder if it could have a runtime impact due to increased primitive wrapping, possibly creating a lot of short-lived garbage if it were iterated over many times. |
Test build #23661 has finished for PR 3378 at commit
|
Test PASSed. |
Found the cause. My JVM enables |
Ping @rxin, since this seems like the sort of optimization that you'd be interested in. |
My motivation is that we encountered a skewed data set in which one hot key has too many values to fit into memory. Spilling does not help in this case, since groupBy puts all values of a key into a single CompactBuffer. With this optimization, my job could at least run within the same memory limit. |
We should definitely add a ClassTag since this can be used for primitive types. However, there might be places where we create a lot of CompactBuffers. I haven't had a chance to look at where CompactBuffers are used yet, but for those places, would it be possible to create a single ClassTag reference? |
Cogroup uses class CoGroupedRDD[K](@transient var rdds: Seq[RDD[_ <: Product2[K, _]]], part: Partitioner)
extends RDD[(K, Array[Iterable[_]])](rdds.head.context, Nil) Here |
@rxin Is it OK to merge? |
I don't understand the architecture here as well as @rxin but this change seems like a strict improvement in its current form, so I'm gonna pull it in. LGTM. |
Added a ClassTag parameter to CompactBuffer so that CompactBuffer[T] can create primitive arrays for primitive types. This reduces memory usage for primitive types significantly, at only a minor performance cost.
Here is my test code:
Using the previous CompactBuffer output
Using the new CompactBuffer output
In this case, the new CompactBuffer used only 20% of the memory of the previous one. This is really helpful for groupByKey when the values are of a primitive type.
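The effect described here, a ClassTag letting a generic buffer allocate a primitive array instead of an array of boxed objects, can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `TaggedBuffer`, `backingArrayClass`, and `TaggedBufferDemo` are hypothetical names, and the real CompactBuffer additionally keeps its first two elements in fields rather than in the array.

```scala
import scala.reflect.ClassTag

// Hypothetical, simplified stand-in for a ClassTag-aware buffer (not Spark's CompactBuffer).
// The context bound `T: ClassTag` makes `new Array[T](n)` allocate an array of the
// runtime class of T, e.g. a real int[] when T is Int.
class TaggedBuffer[T: ClassTag] {
  private var data: Array[T] = new Array[T](4)
  private var count = 0

  // Append a value, doubling the backing array when it is full.
  def +=(value: T): this.type = {
    if (count == data.length) {
      val grown = new Array[T](data.length * 2)
      System.arraycopy(data, 0, grown, 0, count)
      data = grown
    }
    data(count) = value
    count += 1
    this
  }

  def apply(i: Int): T = {
    if (i < 0 || i >= count) throw new IndexOutOfBoundsException(i.toString)
    data(i)
  }

  def size: Int = count

  // Exposed only so the sketch can show which kind of array was allocated.
  def backingArrayClass: Class[_] = data.getClass
}

object TaggedBufferDemo {
  def main(args: Array[String]): Unit = {
    val ints = new TaggedBuffer[Int]
    ints += 1
    ints += 2
    // The element type of the backing array is the primitive int, not java.lang.Integer.
    println(ints.backingArrayClass.getComponentType) // int
  }
}
```

Without the ClassTag, a generic buffer has to fall back to an `Array[AnyRef]`, so each `Int` is stored as a separate boxed object plus a reference to it. That is where the memory saving reported above would come from, at the cost of boxing values whenever they are read back through the generic `Seq[T]` interface.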