[SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance #3435
Conversation
Test build #23800 has started for PR 3435 at commit
Test build #23801 has started for PR 3435 at commit
Test build #23803 has started for PR 3435 at commit
@dbtsai Did you measure the performance gain from the following change? Could you post the generated bytecode?
Test build #23805 has started for PR 3435 at commit
Test build #23800 has finished for PR 3435 at commit. Test PASSed.
Test build #23801 has finished for PR 3435 at commit. Test PASSed.
Test build #23803 has finished for PR 3435 at commit. Test PASSed.
Test build #23805 has finished for PR 3435 at commit. Test PASSed.
Without the local reference copy of the `factor` and `shift` arrays:

DenseVector withMean and withStd: 18.15secs

The following code,

```scala
while (i < size) {
  values(i) = (values(i) - shift(i)) * factor(i)
  i += 1
}
```

will generate bytecode that reads `shift` and `factor` through their accessor methods on every iteration, while with a local reference to the `shift` and `factor` arrays, both array references stay on the stack, so the JVM can access the values without calling `INVOKESPECIAL`.
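The local-reference pattern under discussion can be sketched as a small self-contained example (the class and field names here are illustrative, not the actual `StandardScalerModel` code):

```scala
// Illustrative sketch of the local-reference trick discussed above.
// Reading a plain member val inside the loop goes through a generated
// accessor method on every iteration; copying the reference into a
// local first means one accessor call in total.
class ScalerSketch(mean: Array[Double], variance: Array[Double]) {
  private val factor: Array[Double] =
    variance.map(v => if (v != 0.0) 1.0 / math.sqrt(v) else 0.0)
  private val shift: Array[Double] = mean.clone()

  def transform(values: Array[Double]): Array[Double] = {
    val localShift = shift    // single accessor call
    val localFactor = factor  // single accessor call
    val size = values.length
    var i = 0
    while (i < size) {
      values(i) = (values(i) - localShift(i)) * localFactor(i)
      i += 1
    }
    values
  }
}
```

For example, with `mean = (1.0, 2.0)` and `variance = (4.0, 0.25)`, transforming `(3.0, 5.0)` yields `(1.0, 6.0)`.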
PS, we may want to go through the mllib codebase and find things like this. This issue impacts the performance quite a lot.
@dbtsai What if we mark them `private[this]`?
Wow, with

```scala
private[this] val factor: Array[Double] = {
  val f = Array.ofDim[Double](variance.size)
  var i = 0
  while (i < f.size) {
    f(i) = if (variance(i) != 0.0) 1.0 / math.sqrt(variance(i)) else 0.0
    i += 1
  }
  f
}

private[this] val shift: Array[Double] = mean.toArray
```

and

```scala
while (i < size) {
  values(i) = (values(i) - shift(i)) * factor(i)
  i += 1
}
```

I got different bytecode. It's slightly slower than the local reference version: instead of calling an accessor method, the generated code reads the field directly, but it still performs that field load on every access.
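A minimal contrast of the two declaration styles under discussion (an illustrative class, not the PR's code; compile it and run `javap -c` on the class file to see the generated instructions):

```scala
class AccessStyles(variance: Array[Double]) {
  // Plain `val`: Scala emits a private field plus an accessor method,
  // so in-class reads compile to a method call.
  val viaAccessor: Array[Double] = variance.map(math.sqrt)

  // `private[this] val`: no accessor method is generated; reads compile
  // to a direct field load, but still one load per access.
  private[this] val viaField: Array[Double] = variance.map(math.sqrt)

  def sum(): Double = {
    // Hoisting into a local turns the per-iteration field load into a
    // single read; the loop then works off a local slot.
    val local = viaField
    var s = 0.0
    var i = 0
    while (i < local.length) { s += local(i); i += 1 }
    s
  }
}
```

For example, `new AccessStyles(Array(4.0, 9.0)).sum()` returns `5.0`.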
By default, Scala generates Java methods for members, no matter whether you access them inside or outside the class. I like using the local reference version.
The review comment below attaches to this diff hunk:

```
@@ -87,6 +85,8 @@ class StandardScalerModel private[mllib] (
     f
   }

+  private lazy val shift: Array[Double] = mean.toArray
```

We don't need `lazy` here, because `mean.toArray` is not expensive.
Test build #23817 has started for PR 3435 at commit
Test build #23817 has finished for PR 3435 at commit. Test PASSed.
```scala
// By default, Scala generates Java methods for member variables. So every time
// the member variables are accessed, `invokespecial` will be called, which is
// expensive. This can be avoided by having a local reference of `shift`.
val localShift = shift
```
`shift` is only used in this branch. Shall we just put `val shift = mean.toArray` here instead of having a member variable?
Oh, I'll change it back to `lazy` since it will not be evaluated in those branches which don't use `shift`. I don't want to create the `shift` array/object for each sample since `shift` will always be the same.
`shift` only holds a reference to `mean.values`. We don't really need to define it as a member and make it lazy. It should give the same performance if we only define it inside the if branch.
For different implementations of vector, `toArray` can be very expensive. For example, `toArray` for a sparse vector requires creating a new array object and looping through all the non-zero values. As a result, we can have a global lazy `shift`, which prevents this from happening.
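The trade-off being argued here can be sketched with an instrumented counter (hypothetical names; in the real code the conversion is `mean.toArray` on an mllib vector):

```scala
object LazyShiftSketch {
  var conversions = 0 // instrumentation for this sketch only

  // `meanToArray` stands in for the potentially expensive `mean.toArray`.
  class Model(meanToArray: () => Array[Double], withMean: Boolean) {
    // A lazy val member is evaluated at most once, and only on first use:
    // code paths with withMean == false never pay for the conversion,
    // and repeated transform calls reuse the same array.
    private lazy val shift: Array[Double] = meanToArray()

    def transform(values: Array[Double]): Array[Double] =
      if (withMean) values.indices.map(i => values(i) - shift(i)).toArray
      else values
  }
}
```

Transforming many samples with `withMean = true` still triggers exactly one conversion, while a `withMean = false` model triggers none.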
Test build #23825 has started for PR 3435 at commit
Test build #23825 has finished for PR 3435 at commit. Test PASSed.
LGTM. Merged into master and branch-1.2. Thanks!
[SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance

The following optimizations are done to improve the StandardScaler model transformation performance:

1) Convert the Breeze dense vector to a primitive vector to reduce the overhead.
2) Since the mean can potentially be a sparse vector, we explicitly convert it to a dense primitive vector.
3) Keep a local reference to the `shift` and `factor` arrays so the JVM can locate the values with one operation.
4) In the pattern matching part, use the mllib SparseVector/DenseVector instead of Breeze's vectors to make the codebase cleaner.

Benchmark with the mnist8m dataset:

Before:
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR:
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copy of the `factor` and `shift` arrays, the runtime is almost three times slower:

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,

```scala
while (i < size) {
  values(i) = (values(i) - shift(i)) * factor(i)
  i += 1
}
```

will generate the bytecode

```
L13
 LINENUMBER 106 L13
 FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
 ILOAD 7
 ILOAD 6
 IF_ICMPGE L14
L15
 LINENUMBER 107 L15
 ALOAD 5
 ILOAD 7
 ALOAD 5
 ILOAD 7
 DALOAD
 ALOAD 0
 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
 ILOAD 7
 DALOAD
 DSUB
 ALOAD 0
 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
 ILOAD 7
 DALOAD
 DMUL
 DASTORE
L16
 LINENUMBER 108 L16
 ILOAD 7
 ICONST_1
 IADD
 ISTORE 7
 GOTO L13
```

while with a local reference to the `shift` and `factor` arrays, the bytecode will be

```
L14
 LINENUMBER 107 L14
 ALOAD 0
 INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
 ASTORE 9
L15
 LINENUMBER 108 L15
 FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
 ILOAD 8
 ILOAD 7
 IF_ICMPGE L16
L17
 LINENUMBER 109 L17
 ALOAD 6
 ILOAD 8
 ALOAD 6
 ILOAD 8
 DALOAD
 ALOAD 2
 ILOAD 8
 DALOAD
 DSUB
 ALOAD 9
 ILOAD 8
 DALOAD
 DMUL
 DASTORE
L18
 LINENUMBER 110 L18
 ILOAD 8
 ICONST_1
 IADD
 ISTORE 8
 GOTO L15
```

You can see that with the local references, both array references are on the stack, so the JVM can access the values without calling `INVOKESPECIAL`.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3435 from dbtsai/standardscaler and squashes the following commits:

85885a9 [DB Tsai] revert to have lazy in shift array.
daf2b06 [DB Tsai] Address the feedback
cdb5cef [DB Tsai] small change
9c51eef [DB Tsai] style
fc795e4 [DB Tsai] update
5bffd3d [DB Tsai] first commit

(cherry picked from commit bf1a6aa)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
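Optimization 4) above, matching on concrete mllib vector classes rather than Breeze types, can be sketched with simplified stand-in case classes (these are not the actual `org.apache.spark.mllib.linalg` types):

```scala
// Simplified stand-ins for the mllib DenseVector/SparseVector classes.
sealed trait Vec
final case class DenseVec(values: Array[Double]) extends Vec
final case class SparseVec(size: Int, indices: Array[Int],
                           values: Array[Double]) extends Vec

// Scaling without a mean shift: a sparse vector only needs its non-zero
// entries touched, which is one reason the SparseVector branch in the
// benchmark is so much faster than the dense ones.
def scale(v: Vec, factor: Array[Double]): Vec = v match {
  case DenseVec(vs) =>
    DenseVec(Array.tabulate(vs.length)(i => vs(i) * factor(i)))
  case SparseVec(size, idx, vs) =>
    SparseVec(size, idx,
      Array.tabulate(vs.length)(i => vs(i) * factor(idx(i))))
}
```

Matching directly on the concrete classes keeps each branch working on primitive arrays, with no Breeze conversion in the hot path.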