-
Notifications
You must be signed in to change notification settings - Fork 397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Model combiner #385
Model combiner #385
Conversation
Codecov Report
@@ Coverage Diff @@
## master #385 +/- ##
===========================================
- Coverage 86.82% 49.75% -37.08%
===========================================
Files 336 337 +1
Lines 10962 11076 +114
Branches 346 588 +242
===========================================
- Hits 9518 5511 -4007
- Misses 1444 5565 +4121
Continue to review full report at Codecov.
|
core/src/main/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerMetadata.scala
Show resolved
Hide resolved
override def hashCode(): Int = super.hashCode() | ||
} | ||
|
||
object CombinationStrategy extends Enum[CombinationStrategy] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this enum should be defined in feature
module and registered with the json in OpPipelineStageReadWriteFormats
core/src/main/scala/com/salesforce/op/stages/impl/selector/SelectedCombiner.scala
Outdated
Show resolved
Hide resolved
core/src/main/scala/com/salesforce/op/stages/impl/preparators/SanityCheckerMetadata.scala
Show resolved
Hide resolved
@@ -313,10 +316,22 @@ case class Correlations | |||
corrMeta.putString(SanityCheckerNames.CorrelationType, corrType.sparkName) | |||
corrMeta.build() | |||
} | |||
|
|||
private[op] def +(corr: Correlations): Correlations = | |||
new Correlations(featuresIn ++ corr.featuresIn, values ++ corr.values, nanCorrs ++ corr.nanCorrs, corrType) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would the Correlations have to have the same correlation type for this to make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes- but I am not sure it makes sense to error out on that... should I maybe log a warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added custom to reflect if have multiple corr types
if (m2e1.nonEmpty) { | ||
(getMetricValue(summary1.trainEvaluation, eval1), m2e1, eval1) | ||
} else if (m1e2.nonEmpty) { | ||
(m1e2, getMetricValue(summary2.trainEvaluation, eval2)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you need an eval2
at the end of the tuple here? How did the Scala compiler miss that actually??
Can you add some tests that cover some of these weird edge cases? I feel like there's a lot of new stuff being added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
lazy val labelColName: String = in1.name | ||
|
||
@transient private var evaluatorList: Seq[OpEvaluatorBase[_ <: EvaluationMetrics]] = Seq.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does this need to be transient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that we can save the model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leahmcguire I think you need to write a custom serializer for SelectedCombinerModel
that would serialize and deserialize the evaluators top make this work.
I wonder why OpEstimatorSpec
base class did not catch it?!?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
matthew this is exactly how it works in the model selector - we dont need these in the serialized model as they are only called during training
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So no we do not need a custom serializer - evaluators do not need to be serialized
core/src/main/scala/com/salesforce/op/stages/impl/selector/SelectedCombiner.scala
Outdated
Show resolved
Hide resolved
} | ||
} | ||
|
||
object CombinationStrategy extends Enum[CombinationStrategy] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
features/src/main/scala/com/salesforce/op/utils/spark/OpVectorMetadata.scala
Show resolved
Hide resolved
override def entryName: String = name.toLowerCase | ||
} | ||
def withName(name: String, isLargerBetter: Boolean): OpEvaluatorNames = | ||
Try(super.withName(name)).getOrElse(Custom(name, name, isLargerBetter)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid Try and exceptions in simple program flows, use super.withNameOption
instead
|
||
def withNameInsensitive(name: String, isLargerBetter: Boolean): OpEvaluatorNames = | ||
super.withNameInsensitiveOption(name).getOrElse(Custom(name, name, isLargerBetter)) | ||
override def withNameInsensitive(name: String): OpEvaluatorNames = withNameInsensitive(name, true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we want to have the default of true
for isLargerBetter
value? perhaps let's avoid specifying any defaults for withName
and withNameInsensitive
methods.
Instead it's better to have the method:
def withNameOrDefault(name: String, default: String => OpEvaluatorNames => name => Custom(name, name, isLargerBetter)): OpEvaluatorNames = {
super.withNameInsensitiveOption(name).getOrElse(default)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are base methods @tovbinm , hence the override. if they want the fallback behavior your method provides they can use my definitions - but what happens when people call the base methods and there is no default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See some of my comments.
Another question: would you think we should add a shortcut for it?
perhaps?
val (pred1, pred2) = ...
val combinedPred = label.combinePredictions(pred1, pred2)
* @param operationName name of operation | ||
* @param uid stage uid | ||
*/ | ||
class SelectedCombiner |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SelectedModelCombiner
or ModelCombiner
perhaps? it's difficult to understand what this stage does otherwise.
isValid = (in: String) => CombinationStrategy.values.map(_.entryName).contains(in) | ||
) | ||
def setCombinationStrategy(value: CombinationStrategy): this.type = set(combinationStrategy, value.entryName) | ||
def getCombinationStrategy(): CombinationStrategy = CombinationStrategy.namesToValuesMap($(combinationStrategy)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CombinationStrategy.withNameIncensitive($(combinationStrategy))
|
||
def getMetricValue(metrics: EvaluationMetrics, name: EvalMetric) = | ||
metrics.toMap.collectFirst{ | ||
case (k, v) if k.contains(name.humanFriendlyName) || k.contains(name.entryName) => v.asInstanceOf[Double]} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
metrics.toMap.collectFirst {
case (k, v: Double) if k.contains(name.humanFriendlyName) || k.contains(name.entryName) => v
}
} else (None, None, eval1) | ||
} | ||
|
||
def makeMeta(model: SelectedCombinerModel): Unit = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you please make these private methods on the stage instead of them being nested inside fit
?
dataPrepResults = summary1.dataPrepResults.orElse(summary2.dataPrepResults), | ||
evaluationMetric = metricName, | ||
problemType = summary1.problemType, | ||
bestModelUID = summary1.bestModelUID + " " + summary2.bestModelUID, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is space a good separator here? perhaps _
or -
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space will be parsable to get them separated while "-" or "_" would not
|
||
val strategy = getCombinationStrategy() | ||
val (weight1, weight2) = strategy match { | ||
case CombinationStrategy.Best => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a method on CombinationStrategy
itself. e.g. val (weight1, weight2) = getCombinationStrategy().computeWeights(metricValue1, metricValue2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it actually cant unless I move the EvalMetric class to the features module
|
||
lazy val labelColName: String = in1.name | ||
|
||
@transient private var evaluatorList: Seq[OpEvaluatorBase[_ <: EvaluationMetrics]] = Seq.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@leahmcguire I think you need to write a custom serializer for SelectedCombinerModel
that would serialize and deserialize the evaluators top make this work.
I wonder why OpEstimatorSpec
base class did not catch it?!?
…I into lm/modelCombiner
Bug fixes: - Ensure correct metrics despite model failures on some CV folds [#404](#404) - Fix flaky `ModelInsight` tests [#395](#395) - Avoid creating `SparseVector`s for LOCO [#377](#377) New features / updates: - Model combiner [#385](#399) - Added new sample for HousingPrices [#365](#365) - Test to verify that custom metrics appear in model insight metrics [#387](#387) - Add `FeatureDistribution` to `SerializationFormat`s [#383](#383) - Add metadata to `OpStandadrdScaler` to allow for descaling [#378](#378) - Improve json serde error in `evalMetFromJson` [#380](#380) - Track mean & standard deviation as metrics for numeric features and for text length of text features [#354](#354) - Making model selectors robust to failing models [#372](#372) - Use compact and compressed model json by default [#375](#375) - Descale feature contribution for Linear Regression & Logistic Regression [#345](#345) Dependency updates: - Update tika version [#382](#382)
Thanks for the contribution! Unfortunately we can't verify the commit author(s): leahmcguire <l***@s***.com> Leah McGuire <l***@s***.com>. One possible solution is to add that email to your GitHub account. Alternatively you can change your commits to another email and force push the change. After getting your commits associated with your GitHub account, refresh the status of this Pull Request. |
Related issues
Want to do different feature engineering for different model types or do model combination. Basically this allows two different feature engineering flows to be passed into different model selectors (eg you want to do one kind of feature eng for random forest and one for regression but want grid search for each). The accuracy of the models can be compared and then the predictions willbe combined by either taking the best model, equal combination, or weighted combination of the predictions.
Describe the proposed solution
Made a stage which can combine the results of two model selectors to produce a new score. The inputs are the two prediction outputs and the label. The output is a new prediction which is the combined result of the input predictions. The metadata is updated to reflect the combined prediction.