-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini #17407
[SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini #17407
Conversation
8951fe1
to
78aa380
Compare
Thanks for catching this bug! I recommend we not modify the value when the setter is called, but rather convert to lowercase internally. In this case, it looks like we can just change ImpurityCalculator.getCalculator. |
Test build #3611 has finished for PR 17407 at commit
|
@jkbradley Thanks. I agree with your advice. Modifying the value is a little aggressive, while changing ImpurityCalculator.getCalculator is more moderate. However, I'm afraid that the similar bugs, because of mismatch, won't leave if we don't fix it at first. Anyway, I'd like to take a try according to your advice. |
604dcd1
to
05c998c
Compare
05c998c
to
80f3306
Compare
I agree it'd be nice to come up with a generic fix for making String Params robust to case, but I don't have a good solution right now. I'll think about how we might put some generic testing in place... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the updates! My comments are just for small cleanups.
val rdd = TreeTests.getTreeReadWriteData(sc) | ||
|
||
val categoricalData: DataFrame = | ||
TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To simplify this, you can write TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2)
.
Since this is not testing categorical features, there's no need to throw them in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
val categoricalData: DataFrame = | ||
TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2) | ||
|
||
// BUG: see SPARK-20043 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put the JIRA number in the test title.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the comment.
@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite | |||
testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ Map("maxDepth" -> 0), | |||
allParamSettings ++ Map("maxDepth" -> 0), checkModelData) | |||
} | |||
|
|||
test("read/write: ImpurityCalculator builder did not recognize impurity type: Gini") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be more specific, how about:
"SAPRK-20043: ImpurityCalculator builder fails for uppercase impurity type Gini in model read/write"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2) | ||
|
||
// BUG: see SPARK-20043 | ||
val dt = new DecisionTreeClassifier().setImpurity("Gini") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make this faster, set maxDepth = 2 (something small)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
set maxDepth = 2.
Test build #3613 has finished for PR 17407 at commit
|
@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite | |||
testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ Map("maxDepth" -> 0), | |||
allParamSettings ++ Map("maxDepth" -> 0), checkModelData) | |||
} | |||
|
|||
test("SAPRK-20043: " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I had a typo: SPARK, not SAPRK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, copy & paste. thanks for correction.
Test build #3616 has finished for PR 17407 at commit
|
@jkbradley Hi, tests passed. Is it good enough to be merged? By the way, String Params are fragile when saving/loading model, as setParam and getParam methods are useless in such case. |
.setMaxDepth(2) | ||
|
||
val model = dt.fit(data) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the second unit test seems redundant for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed.
.setMaxDepth(2) | ||
|
||
val model = dt.fit(data) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The blank lines kinda stand out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete some blank lines to keep compact.
|
||
val model = dt.fit(data) | ||
|
||
testDefaultReadWrite(model, false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about setting testParams=true for this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
LGTM pending tests |
Test build #3618 has finished for PR 17407 at commit
|
Merging with master and branch-2.1 |
… for uppercase impurity type Gini Fix bug: DecisionTreeModel can't recongnize Impurity "Gini" when loading TODO: + [x] add unit test + [x] fix the bug Author: 颜发才(Yan Facai) <facai.yan@gmail.com> Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity. (cherry picked from commit 7d432af) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
Fix bug: DecisionTreeModel can't recongnize Impurity "Gini" when loading
TODO: