[SPARK-50963][ML][PYTHON][CONNECT] Support Tokenizers, SQLTransform and StopWordsRemover on Connect #49624

zhengruifeng · 2025-01-23T09:57:25Z

What changes were proposed in this pull request?

Support a group of text processing algorithms:

- Tokenizer
- RegexTokenizer
- SQLTransform
- StopWordsRemover

Why are the changes needed?

For feature parity between Spark Connect and PySpark Classic.

Does this PR introduce any user-facing change?

yes

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng · 2025-01-23T10:02:12Z

python/pyspark/ml/feature.py

@@ -5072,7 +5073,7 @@ def __init__(
        self._setDefault(
            stopWords=StopWordsRemover.loadDefaultStopWords("english"),
            caseSensitive=False,
-            locale=self._java_obj.getLocale(),
+            locale="en_US" if isinstance(self._java_obj, str) else self._java_obj.getLocale(),


PySpark Classic calls the JVM side to get the default value of `locale`:

spark/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala

Lines 129 to 138 in e7e0826

    private val getDefaultOrUS: Locale = {
      if (Locale.getAvailableLocales.contains(Locale.getDefault)) {
        Locale.getDefault
      } else {
        logWarning(log"Default locale set was [${MDC(LogKeys.LOCALE, Locale.getDefault)}]; " +
          log"however, it was not found in available locales in JVM, falling back to en_US locale. " +
          log"Set param `locale` in order to respect another locale.")
        Locale.US
      }
    }

which is not available in Connect Mode.
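
To make the fallback in the diff concrete, here is a minimal, self-contained sketch. It assumes, based only on the `isinstance(self._java_obj, str)` check, that under Connect `_java_obj` is a plain string rather than a JVM handle. `resolve_default_locale` and `FakeJavaStopWordsRemover` are hypothetical names for illustration, not PySpark APIs.

```python
class FakeJavaStopWordsRemover:
    """Hypothetical stand-in for the JVM-backed object used by PySpark Classic."""

    def getLocale(self) -> str:
        # In Classic mode the real call is answered by the JVM (getDefaultOrUS).
        # "fr_FR" is an arbitrary value chosen for this illustration.
        return "fr_FR"


def resolve_default_locale(java_obj) -> str:
    """Pick the default `locale` without requiring a JVM handle."""
    if isinstance(java_obj, str):
        # Assumed Connect case: only a string placeholder is available,
        # so use the value the JVM side falls back to as its last resort.
        return "en_US"
    # Classic case: delegate to the JVM-side default.
    return java_obj.getLocale()


# Assumed Connect case: a string placeholder instead of a JVM object.
connect_locale = resolve_default_locale("org.apache.spark.ml.feature.StopWordsRemover")
# Classic case: an object that can answer getLocale().
classic_locale = resolve_default_locale(FakeJavaStopWordsRemover())
```

One consequence of this approach is that the hard-coded `en_US` may differ from the default that the server-side JVM would pick for itself.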

Let me rethink this: if we skip the `_setDefault` for `locale`, then `remover.getLocale()` fails when the user has not explicitly set it, but the JVM side will still fall back to its own default value.
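
The trade-off can be sketched with a simplified, hypothetical param holder. `MiniParams` is not the PySpark `Params` class, and the idea that only explicitly set values are sent to the server is an assumption made for this illustration.

```python
class MiniParams:
    """Hypothetical, simplified stand-in for a Params-style object."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})
        self._values = {}

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        if name in self._values:
            return self._values[name]
        # Raises KeyError when there is neither an explicit value nor a default.
        return self._defaults[name]

    def explicit_params(self):
        # Assumption: only explicitly set values travel to the server,
        # so the server applies its own default for anything missing.
        return dict(self._values)


# Option A: keep a client-side default (what the diff does with "en_US").
with_default = MiniParams(defaults={"locale": "en_US"})

# Option B: skip the client-side default entirely.
without_default = MiniParams()
```

With option A the getter always works on the client, but the client-side value may not match the server's default. With option B the server default is respected, but reading `locale` on the client fails until the user sets it.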

…nd StopWordsRemover on Connect

### What changes were proposed in this pull request?
Support a group of text processing algorithms:
- Tokenizer
- RegexTokenizer
- SQLTransform
- StopWordsRemover

### Why are the changes needed?
for feature parity

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
added tests

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #49624 from zhengruifeng/ml_connect_tokenizer.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>
(cherry picked from commit 42b15c9)
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>

zhengruifeng · 2025-01-24T00:15:37Z

thanks. merged to master/4.0

init

0ef1267

github-actions bot added ML MLLIB PYTHON labels Jan 23, 2025

init

676ea83

zhengruifeng commented Jan 23, 2025

zhengruifeng requested a review from HyukjinKwon January 23, 2025 10:10

init

7858635

HyukjinKwon approved these changes Jan 24, 2025

zhengruifeng closed this in 42b15c9 Jan 24, 2025

zhengruifeng deleted the ml_connect_tokenizer branch January 24, 2025 00:15
