Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-4581][MLlib] Refactorize StandardScaler to improve the transformation performance #3435

Closed
wants to merge 6 commits into from
Closed

Conversation

dbtsai
Copy link
Member

@dbtsai dbtsai commented Nov 25, 2014

The following optimizations are done to improve the StandardScaler model
transformation performance.

  1. Covert Breeze dense vector to primitive vector to reduce the overhead.
  2. Since mean can be potentially a sparse vector, we explicitly convert it to dense primitive vector.
  3. Have a local reference to shift and factor array so JVM can locate the value with one operation call.
  4. In pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vector to
    make the codebase cleaner.

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copy of factor and shift arrays,
the runtime is almost three time slower.

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,

while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}

will generate the bytecode

   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13

, while with the local reference of the shift and factor arrays, the bytecode will be

   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15

You can see that with local reference, the both of the arrays will be in the stack, so JVM can access the value without calling INVOKESPECIAL.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23800 has started for PR 3435 at commit 5bffd3d.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23801 has started for PR 3435 at commit fc795e4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23803 has started for PR 3435 at commit 9c51eef.

  • This patch merges cleanly.

@mengxr
Copy link
Contributor

mengxr commented Nov 25, 2014

@dbtsai Did you measure the performance gain from the following change?

  1. Have a local reference to shift and factor array so JVM can locate the value with one operation call.

Could you post the generated bytecode?

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23805 has started for PR 3435 at commit cdb5cef.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23800 has finished for PR 3435 at commit 5bffd3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23800/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23801 has finished for PR 3435 at commit fc795e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23801/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23803 has finished for PR 3435 at commit 9c51eef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23803/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23805 has finished for PR 3435 at commit cdb5cef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23805/
Test PASSed.

@dbtsai
Copy link
Member Author

dbtsai commented Nov 25, 2014

@mengxr

Without the local reference copy of factor and shift arrays, the runtime is almost three time slower.

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,

while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}

will generate the bytecode

   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13

, while with the local reference of the shift and factor arrays, the bytecode will be

   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15

You can see that with local reference, the both of the arrays will be in the stack, so JVM can access the value without calling INVOKESPECIAL.

@dbtsai
Copy link
Member Author

dbtsai commented Nov 25, 2014

PS, we may want to go though the mllib codebase, and find things like this. This issue impacts the performance quite a lot.

@mengxr
Copy link
Contributor

mengxr commented Nov 25, 2014

@dbtsai What if we mark factor and shift as private[this]?

@dbtsai
Copy link
Member Author

dbtsai commented Nov 25, 2014

Wow, with

  private[this] val factor: Array[Double] = {
    val f = Array.ofDim[Double](variance.size)
    var i = 0
    while (i < f.size) {
      f(i) = if (variance(i) != 0.0) 1.0 / math.sqrt(variance(i)) else 0.0
      i += 1
    }
    f
  }

  private[this] val shift: Array[Double] = mean.toArray

and

            while (i < size) {
              values(i) = (values(i) - shift(i)) * factor(i)
              i += 1
            }

, I got different bytecode as the following

   L14
    LINENUMBER 108 L14
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L15
   L16
    LINENUMBER 109 L16
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 0
    GETFIELD org/apache/spark/mllib/feature/StandardScalerModel.shift : [D
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 0
    GETFIELD org/apache/spark/mllib/feature/StandardScalerModel.factor : [D
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L17
    LINENUMBER 110 L17
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L14

It's slightly slower than the local reference version.
DenseVector withMean and withStd: 5.92secs
DenseVector withMean and withoutStd: 5.36secs
DenseVector withoutMean and withStd: 5.51secs
SparseVector withoutMean and withStd: 1.30secs

Instead of calling INVOKESPECIAL, it's now calling GETFIELD.
What's difference between private[this] and private? Also, it doesn't
work with private [this] lazy val which will generate the same bytecode
as private lazy val. As a result, shift and factor will be always evaluated
when we create the model.

@mengxr
Copy link
Contributor

mengxr commented Nov 25, 2014

By default, Scala generates Java methods for members, no matter whether you use val or def. That's why you saw invokespecial for shift and factor. But if a member is marked as private[this], it generates the code as a private field, and hence you see getfield in the bytecode.

I like using private[this] instead of having local references for code simplicity, but no strong preference. The current form looks good to me.

@@ -87,6 +85,8 @@ class StandardScalerModel private[mllib] (
f
}

private lazy val shift: Array[Double] = mean.toArray
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need lazy here, because mean.toArray is not expensive.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23817 has started for PR 3435 at commit daf2b06.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23817 has finished for PR 3435 at commit daf2b06.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23817/
Test PASSed.

// By default, Scala generates Java methods for member variables. So every time when
// the member variables are accessed, `invokespecial` will be called which is expensive.
// This can be avoid by having a local reference of `shift`.
val localShift = shift
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shift is only used in this branch. Shall we just put val shift = mean.toArray here instead of having a member variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I'll change it back to lazy since it will not be evaluated in those branches which don't use shift. I don't want to create shift array/object for each sample since shift will always be the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shift only holds a reference to mean.values. We don't really need to define it as a member and make it lazy. It should give the same performance if we only define it inside the if branch.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For different implementation of vector, toArray can be very expensive. For example, toArray for sparse vector requires to create a new array object and loop through all the non zero values. As a result, we can have a global lazy shift which can prevent this happens.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23825 has started for PR 3435 at commit 85885a9.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 25, 2014

Test build #23825 has finished for PR 3435 at commit 85885a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23825/
Test PASSed.

@mengxr
Copy link
Contributor

mengxr commented Nov 25, 2014

LGTM. Merged into master and branch-1.2. Thanks!

@asfgit asfgit closed this in bf1a6aa Nov 25, 2014
asfgit pushed a commit that referenced this pull request Nov 25, 2014
…rmation performance

The following optimizations are done to improve the StandardScaler model
transformation performance.

1) Covert Breeze dense vector to primitive vector to reduce the overhead.
2) Since mean can be potentially a sparse vector, we explicitly convert it to dense primitive vector.
3) Have a local reference to `shift` and `factor` array so JVM can locate the value with one operation call.
4) In pattern matching part, we use the mllib SparseVector/DenseVector instead of breeze's vector to
make the codebase cleaner.

Benchmark with mnist8m dataset:

Before,
DenseVector withMean and withStd: 50.97secs
DenseVector withMean and withoutStd: 42.11secs
DenseVector withoutMean and withStd: 8.75secs
SparseVector withoutMean and withStd: 5.437secs

With this PR,
DenseVector withMean and withStd: 5.76secs
DenseVector withMean and withoutStd: 5.28secs
DenseVector withoutMean and withStd: 5.30secs
SparseVector withoutMean and withStd: 1.27secs

Note that without the local reference copy of `factor` and `shift` arrays,
the runtime is almost three time slower.

DenseVector withMean and withStd: 18.15secs
DenseVector withMean and withoutStd: 18.05secs
DenseVector withoutMean and withStd: 18.54secs
SparseVector withoutMean and withStd: 2.01secs

The following code,
```scala
while (i < size) {
   values(i) = (values(i) - shift(i)) * factor(i)
   i += 1
}
```
will generate the bytecode
```
   L13
    LINENUMBER 106 L13
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I] []
    ILOAD 7
    ILOAD 6
    IF_ICMPGE L14
   L15
    LINENUMBER 107 L15
    ALOAD 5
    ILOAD 7
    ALOAD 5
    ILOAD 7
    DALOAD
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.shift ()[D
    ILOAD 7
    DALOAD
    DSUB
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ILOAD 7
    DALOAD
    DMUL
    DASTORE
   L16
    LINENUMBER 108 L16
    ILOAD 7
    ICONST_1
    IADD
    ISTORE 7
    GOTO L13
```
, while with the local reference of the `shift` and `factor` arrays, the bytecode will be
```
   L14
    LINENUMBER 107 L14
    ALOAD 0
    INVOKESPECIAL org/apache/spark/mllib/feature/StandardScalerModel.factor ()[D
    ASTORE 9
   L15
    LINENUMBER 108 L15
   FRAME FULL [org/apache/spark/mllib/feature/StandardScalerModel org/apache/spark/mllib/linalg/Vector [D org/apache/spark/mllib/linalg/Vector org/apache/spark/mllib/linalg/DenseVector T [D I I [D] []
    ILOAD 8
    ILOAD 7
    IF_ICMPGE L16
   L17
    LINENUMBER 109 L17
    ALOAD 6
    ILOAD 8
    ALOAD 6
    ILOAD 8
    DALOAD
    ALOAD 2
    ILOAD 8
    DALOAD
    DSUB
    ALOAD 9
    ILOAD 8
    DALOAD
    DMUL
    DASTORE
   L18
    LINENUMBER 110 L18
    ILOAD 8
    ICONST_1
    IADD
    ISTORE 8
    GOTO L15
```

You can see that with local reference, the both of the arrays will be in the stack, so JVM can access the value without calling `INVOKESPECIAL`.

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3435 from dbtsai/standardscaler and squashes the following commits:

85885a9 [DB Tsai] revert to have lazy in shift array.
daf2b06 [DB Tsai] Address the feedback
cdb5cef [DB Tsai] small change
9c51eef [DB Tsai] style
fc795e4 [DB Tsai] update
5bffd3d [DB Tsai] first commit

(cherry picked from commit bf1a6aa)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
@dbtsai dbtsai deleted the standardscaler branch November 25, 2014 21:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants