Descale feature contribution for Linear Regression & Logistic Regression #345
Conversation
… & the features are standardized during training
Codecov Report
@@ Coverage Diff @@
## master #345 +/- ##
==========================================
+ Coverage 86.79% 86.79% +<.01%
==========================================
Files 336 336
Lines 10895 10921 +26
Branches 347 570 +223
==========================================
+ Hits 9456 9479 +23
- Misses 1439 1442 +3
@TuanNguyen27 can you also please explain what problem you are solving in the PR description?
Please add some more tests to cover the new scenarios.
case Continuous(_, _, _, variance) => math.sqrt(variance)
// for (binary) logistic regression we only need to multiply by feature standard deviation
case Discrete(domain, prob) =>
  def computeVariance(domain: Seq[String], prob: Seq[Double]): Double = {
you don't need the inner function and you can do it faster. Simply do:
case Discrete(domain, prob) =>
  val (weighted, sqweighted) = (domain zip prob).foldLeft((0.0, 0.0)) { case ((weightSum, sqweightSum), (d, p)) =>
    val floatD = d.toDouble
    val weight = floatD * p
    val sqweight = floatD * weight
    (weightSum + weight, sqweightSum + sqweight)
  }
  sqweighted - weighted * weighted
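As a standalone sketch of that single-pass variance computation (Var(X) = E[X^2] - E[X]^2; the function name and the sanity check below are mine, not from the PR):

```scala
// Single-pass variance of a discrete distribution over a string-typed
// domain: Var(X) = E[X^2] - E[X]^2, accumulated with one foldLeft.
def discreteVariance(domain: Seq[String], prob: Seq[Double]): Double = {
  val (mean, meanSq) = domain.zip(prob).foldLeft((0.0, 0.0)) {
    case ((m, m2), (d, p)) =>
      val x = d.toDouble // assumes every domain value parses as a number
      (m + x * p, m2 + x * x * p)
  }
  meanSq - mean * mean
}

// A fair coin over {0, 1}: mean 0.5, variance 0.25.
println(discreteVariance(Seq("0", "1"), Seq(0.5, 0.5)))  // 0.25
```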
Another question: since `domain` is `Seq[String]`, how can we be sure that `d.toDouble` won't throw an error?! Should we handle it gracefully?
For regression problems where there are too few unique labels (and they get treated as categoricals), `d.toDouble` should work fine. I'm not sure how it will behave on classification. For classification I assume that `domain = Array("0", "1")`, is this correct?
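If we did want to guard the parse, one graceful option (purely a sketch of mine, not what the PR does) is `scala.util.Try`, which turns non-numeric domain values into `None` instead of a thrown `NumberFormatException`:

```scala
import scala.util.Try

// Defensive parse of a string-typed label domain: numeric entries
// become Some(value), anything unparseable becomes None.
def parseDomain(domain: Seq[String]): Seq[Option[Double]] =
  domain.map(d => Try(d.toDouble).toOption)

println(parseDomain(Seq("0", "1", "")))  // List(Some(0.0), Some(1.0), None)
```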
Why did we decide to define `domain` as `Seq[String]` in the first place? @Jauntbox @leahmcguire
If you have the raw label rather than the indexed label you will get strings - this was designed to support that
Is it safe to do `toDouble` in this case?
Any value which is not numeric will throw an exception, e.g. `"".toDouble`.
But will there ever be a non-numeric string in the label field...? I thought they should be filtered out.
Just fix the error message and then LGTM
One small comparison thing I'd like you to check, then LGTM.
I don't want to hold this up, lgtm
Thanks for the contribution! Before we can merge this, we need @tovbinm to sign the Salesforce.com Contributor License Agreement.
Thanks for the contribution! It looks like @leahmcguire is an internal user so signing the CLA is not required. However, we need to confirm this.
Bug fixes:
- Ensure correct metrics despite model failures on some CV folds [#404](#404)
- Fix flaky `ModelInsight` tests [#395](#395)
- Avoid creating `SparseVector`s for LOCO [#377](#377)

New features / updates:
- Model combiner [#385](#399)
- Added new sample for HousingPrices [#365](#365)
- Test to verify that custom metrics appear in model insight metrics [#387](#387)
- Add `FeatureDistribution` to `SerializationFormat`s [#383](#383)
- Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378)
- Improve json serde error in `evalMetFromJson` [#380](#380)
- Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354)
- Making model selectors robust to failing models [#372](#372)
- Use compact and compressed model json by default [#375](#375)
- Descale feature contribution for Linear Regression & Logistic Regression [#345](#345)

Dependency updates:
- Update tika version [#382](#382)
Thanks for the contribution! Unfortunately we can't verify the commit author(s): Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request.
Problem context
Spark returns feature contribution on the original scale of the feature, making it hard to compare relative importance between two features of different scales. During training for linear regression, Spark normalizes features and label to zero mean and unit variance, and we want to display feature contribution (which is the model's coefficient) on this normalized scale.
Note: Spark does the same standardization for logistic regression, but it is unclear if the same problem applies to feature contribution for logistic regression.
Describe the proposed solution
Descale feature contribution by multiplying Spark's returned weight by the respective feature's standard deviation and dividing by the label's standard deviation. This change only touches ModelInsights and does not affect scoring.
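The descaling itself amounts to a one-line rescaling of each coefficient. A minimal sketch for the linear-regression case (function and parameter names are mine; the actual change lives in ModelInsights):

```scala
// Rescale a coefficient Spark reports on the original scale so that
// contributions are comparable across features: multiply by the
// feature's standard deviation and divide by the label's.
def descaleWeight(rawWeight: Double, featureStd: Double, labelStd: Double): Double =
  rawWeight * featureStd / labelStd

// A weight of 2.0 on a feature with sd 0.5, against a label with sd 4.0:
println(descaleWeight(2.0, 0.5, 4.0))  // 0.25
```

For (binary) logistic regression there is no label standard deviation to divide by, which matches the code comment above about only multiplying by the feature standard deviation.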
Describe alternatives you've considered
N/A. The descaling needs access to the best trained model, label summary stats & feature summary stats.