[SPARK-48892][ML] Avoid per-row param read in Tokenizer #47342

Closed
wants to merge 2 commits into apache:master from zhengruifeng:opt_tokenizer

Conversation

zhengruifeng (Contributor)

What changes were proposed in this pull request?

Inspired by #47258, I checked the other ML implementations and found that `Tokenizer` can be optimized in the same way.

Why are the changes needed?

The function `createTransformFunc` builds the UDF used by `UnaryTransformer.transform`:

```scala
val transformUDF = udf(this.createTransformFunc)
```

The existing implementation reads the params for each row.
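
For illustration, here is a minimal, self-contained Scala sketch of the difference; `param` below is a hypothetical stand-in for a `$(...)` param read, not Spark's actual API:

```scala
object ParamReadSketch {
  // Hypothetical stand-in for a Spark ML Param read such as $(minTokenLength);
  // imagine it performs a lookup plus default resolution on every call.
  private val params = Map("minTokenLength" -> 1)
  private def param(name: String): Int = params(name)

  // Per-row read: param(...) is evaluated inside the lambda, once per input row.
  val perRow: String => Seq[String] =
    str => str.split("\\s+").toSeq.filter(_.length >= param("minTokenLength"))

  // Hoisted read: the value is resolved once when the function is built;
  // the returned lambda closes over a plain local instead.
  val hoisted: String => Seq[String] = {
    val minLength = param("minTokenLength")
    str => str.split("\\s+").toSeq.filter(_.length >= minLength)
  }
}
```

Both functions produce the same output; the only difference is how often the param is read.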

Does this PR introduce any user-facing change?

No

How was this patch tested?

CI and manual tests:

create test dataset

```scala
spark.range(1000000).select(uuid().as("uuid")).write.mode("overwrite").parquet("/tmp/regex_tokenizer.parquet")
```

duration

```scala
val df = spark.read.parquet("/tmp/regex_tokenizer.parquet")
import org.apache.spark.ml.feature._
val tokenizer = new RegexTokenizer().setPattern("-").setInputCol("uuid")
Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()) // warm up
val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
```

result (before this PR)

```
scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
val tic: Long = 1720613235068
val res5: Long = 50397
```

result (after this PR)

```
scala> val tic = System.currentTimeMillis; Seq.range(0, 1000).foreach(i => tokenizer.transform(df).count()); System.currentTimeMillis - tic
val tic: Long = 1720612871256
val res5: Long = 43748
```
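
For reference, a quick check of the relative improvement from the two totals above (simple arithmetic, nothing Spark-specific):

```scala
// Wall-clock totals (ms) for the 1000 transform-and-count iterations reported above.
val before = 50397.0
val after = 43748.0
val reduction = (before - after) / before // ≈ 0.132, i.e. roughly 13% less wall-clock time
```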

Was this patch authored or co-authored using generative AI tooling?

No

@srowen srowen (Member) left a comment

Looks fine; it adds some complexity, but not much.

@JoshRosen JoshRosen (Contributor) left a comment

I'll wager that the expensive part was probably the configuration check itself plus the regex compilation, but not the branching (since those would predict well). Therefore I predict that you can get almost the whole speedup if you did something like

```scala
override protected def createTransformFunc: String => Seq[String] = {
  // Resolve the params once when the transform function is created, not once per row.
  // (toImmutableArraySeq comes from org.apache.spark.util.ArrayImplicits._)
  val re = $(pattern).r
  val _toLowercase = $(toLowercase)
  val _gaps = $(gaps)
  val minLength = $(minTokenLength)

  originStr => {
    // scalastyle:off caselocale
    val str = if (_toLowercase) originStr.toLowerCase() else originStr
    // scalastyle:on caselocale
    val tokens = if (_gaps) re.split(str).toImmutableArraySeq else re.findAllIn(str).toSeq
    tokens.filter(_.length >= minLength)
  }
}
```

Basically, I think it might be overkill or unnecessary to fully inline and expand the cross product like this. My suggested approach is easier to understand and probably nearly equivalent in performance.

@zhengruifeng (Contributor, Author)

> I'll wager that the expensive part was probably the configuration check itself plus the regex compilation […] My suggested approach is easier to understand and probably nearly equivalent in performance.

Makes sense; let me simplify the changes.

@zhengruifeng (Contributor, Author)

merged to master

@zhengruifeng zhengruifeng deleted the opt_tokenizer branch July 16, 2024 23:18
jingz-db pushed a commit to jingz-db/spark that referenced this pull request Jul 22, 2024
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024