-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] adds multi-class feature importance support #53803
[ML] adds multi-class feature importance support #53803
Conversation
Pinging @elastic/ml-core (:ml) |
run elasticsearch-ci/packaging-sample-matrix-unix |
} | ||
|
||
public static FeatureImportance forClassification(String featureName, Map<String, Double> classImportance) { | ||
return new FeatureImportance(featureName, classImportance.values().stream().mapToDouble(Math::abs).sum(), classImportance); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not 100% convinced this should be abs
.
We don't write the feature importance value on the native side by looking at the norm
of the vector.
Do we want to make this the norm too? Or do we thing abs
is good enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please provide more context. What are you calculating here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@valeriy42 @tveasey this is calculating the "overall importance" of all the classes combined for a given feature. This is so we can measure "most important feature" independent of the classes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
norm
would make it an L2 norm, abs
makes it an L1 norm. Either way is suitable. I think, abs
is better, since norm
over-treats larger importances and ignores smaller once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 abs
Feature importance is already calculated for multi-class models. This commit adjusts the output sent to ES so that multi-class importance can be explored. Feature importance objects are now mapped as follows (logistic) Regression: ``` { "feature_name": "feature_0", "importance": -1.3 } ``` Multi-class [class names are `foo`, `bar`, `baz`] ``` { “feature_name”: “feature_0”, “importance”: 2.0, // sum(abs()) of class importances “foo”: 1.0, “bar”: 0.5, “baz”: -0.5 }, ``` Java side change: elastic/elasticsearch#53803
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions but LGTM.
Quite a lot of code really
return unsortedFeatureImportances; | ||
} | ||
return unsortedFeatureImportances.stream() | ||
.sorted((l, r)-> Double.compare(Math.abs(r.getImportance()), Math.abs(l.getImportance()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the abs necessary when the score is a norm? If the score can be -ve why is it wrong to use the -ve value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Score is not absolutely the norm. Additionally, we want to have the MOST influential values, regardless of direction. We could have feature importances like this:
{
A: -1.2,
B: -0.2,
C: 0.5
}
If we want the top two influential features, we want A and C.
The getImportance
is only the norm when it comes to multi-class. This is not the case for (logistic) regression.
...e/src/main/java/org/elasticsearch/xpack/core/ml/inference/trainedmodel/InferenceHelpers.java
Outdated
Show resolved
Hide resolved
this.featureName = in.readString(); | ||
this.importance = in.readDouble(); | ||
if (in.readBoolean()) { | ||
this.classImportance = in.readLinkedHashMap(StreamInput::readString, StreamInput::readDouble); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why this has to be a linked hash map? I'm assuming to preserve insertion order but why? If this was ever serialisable to xcontent the ordering could not be guaranteed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ToXContent
does not factor here. We are concerned about the order when the values are written to the ingest document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this sort of thing is already done with Object maps. Just cannot do it with specific stream inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about it more and looking more into the ingest doc code. I agree with you. This seems superfluous for now. If ordering becomes a concern for usability, we can add it in the future. The reading from the wire for both LinkedHashMap and HashMap would be exactly the same, so BWC is not a concern.
) Feature importance is already calculated for multi-class models. This commit adjusts the output sent to ES so that multi-class importance can be explored. Feature importance objects are now mapped as follows (logistic) Regression: ``` { "feature_name": "feature_0", "importance": -1.3 } ``` Multi-class [class names are `foo`, `bar`, `baz`] ``` { “feature_name”: “feature_0”, “importance”: 2.0, // sum(abs()) of class importances “foo”: 1.0, “bar”: 0.5, “baz”: -0.5 }, ``` Java side change: elastic/elasticsearch#53803
… github.com:benwtrent/elasticsearch into feature/ml-inference-multi-class-feature-importance
@elasticmachine update branch |
Adds multi-class feature importance calculation. Feature importance objects are now mapped as follows (logistic) Regression: ``` { "feature_name": "feature_0", "importance": -1.3 } ``` Multi-class [class names are `foo`, `bar`, `baz`] ``` { “feature_name”: “feature_0”, “importance”: 2.0, // sum(abs()) of class importances “foo”: 1.0, “bar”: 0.5, “baz”: -0.5 }, ``` For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines) ``` "ml.inference.feature_importance": { "type": "nested", "dynamic": true, "properties": { "feature_name": { "type": "keyword" }, "importance": { "type": "double" } } } ``` The mapping field name is as follows `ml.<inference.target_field>.<inference.tag>.feature_importance` if `inference.tag` is not provided in the processor definition, it is not part of the field path. `inference.target_field` is defaulted to `ml.inference`. //cc @lcawl ^ Where should we document this? If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
Adds multi-class feature importance calculation. Feature importance objects are now mapped as follows (logistic) Regression: ``` { "feature_name": "feature_0", "importance": -1.3 } ``` Multi-class [class names are `foo`, `bar`, `baz`] ``` { “feature_name”: “feature_0”, “importance”: 2.0, // sum(abs()) of class importances “foo”: 1.0, “bar”: 0.5, “baz”: -0.5 }, ``` For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines) ``` "ml.inference.feature_importance": { "type": "nested", "dynamic": true, "properties": { "feature_name": { "type": "keyword" }, "importance": { "type": "double" } } } ``` The mapping field name is as follows `ml.<inference.target_field>.<inference.tag>.feature_importance` if `inference.tag` is not provided in the processor definition, it is not part of the field path. `inference.target_field` is defaulted to `ml.inference`. //cc @lcawl ^ Where should we document this? If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
Adds multi-class feature importance calculation.
Feature importance objects are now mapped as follows
(logistic) Regression:
Multi-class [class names are
foo
,bar
,baz
]For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
The mapping field name is as follows
ml.<inference.target_field>.<inference.tag>.feature_importance
if
inference.tag
is not provided in the processor definition, it is not part of the field path.inference.target_field
is defaulted toml.inference
.//cc @lcawl ^ Where should we document this?
If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.