Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] adds multi-class feature importance support #53803

Conversation

benwtrent
Copy link
Member

Adds multi-class feature importance calculation.

Feature importance objects are now mapped as follows
(logistic) Regression:

{
   "feature_name": "feature_0",
   "importance": -1.3
}

Multi-class [class names are foo, bar, baz]

{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // abs(sum()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},

For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)

 "ml.inference.feature_importance": {
          "type": "nested",
          "dynamic": true,
          "properties": {
            "feature_name": {
              "type": "keyword"
            },
            "importance": {
              "type": "double"
            }
          }
        }

The mapping field name is as follows
ml.<inference.target_field>.<inference.tag>.feature_importance
if inference.tag is not provided in the processor definition, it is not part of the field path.
inference.target_field is defaulted to ml.inference.
//cc @lcawl ^ Where should we document this?

If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Copy link
Member Author

run elasticsearch-ci/packaging-sample-matrix-unix

}

public static FeatureImportance forClassification(String featureName, Map<String, Double> classImportance) {
return new FeatureImportance(featureName, classImportance.values().stream().mapToDouble(Math::abs).sum(), classImportance);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not 100% convinced this should be abs.

We don't write the feature importance value on the native side by looking at the norm of the vector.

Do we want to make this the norm too? Or do we thing abs is good enough?

@tveasey @valeriy42

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please provide more context. What are you calculating here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@valeriy42 @tveasey this is calculating the "overall importance" of all the classes combined for a given feature. This is so we can measure "most important feature" independent of the classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

norm would make it an L2 norm, abs makes it an L1 norm. Either way is suitable. I think, abs is better, since norm over-treats larger importances and ignores smaller once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 abs

benwtrent added a commit to elastic/ml-cpp that referenced this pull request Mar 23, 2020
Feature importance is already calculated for multi-class models. This commit adjusts the output sent to ES so that multi-class importance can be explored.

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},
```
Java side change: elastic/elasticsearch#53803
Copy link
Member

@davidkyle davidkyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some questions but LGTM.

Quite a lot of code really

return unsortedFeatureImportances;
}
return unsortedFeatureImportances.stream()
.sorted((l, r)-> Double.compare(Math.abs(r.getImportance()), Math.abs(l.getImportance())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the abs necessary when the score is a norm? If the score can be -ve why is it wrong to use the -ve value?

Copy link
Member Author

@benwtrent benwtrent Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidkyle

Score is not absolutely the norm. Additionally, we want to have the MOST influential values, regardless of direction. We could have feature importances like this:

{
A: -1.2,
B: -0.2,
C: 0.5
}

If we want the top two influential features, we want A and C.

The getImportance is only the norm when it comes to multi-class. This is not the case for (logistic) regression.

this.featureName = in.readString();
this.importance = in.readDouble();
if (in.readBoolean()) {
this.classImportance = in.readLinkedHashMap(StreamInput::readString, StreamInput::readDouble);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why this has to be a linked hash map? I'm assuming to preserve insertion order but why? If this was ever serialisable to xcontent the ordering could not be guaranteed

Copy link
Member Author

@benwtrent benwtrent Mar 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ToXContent does not factor here. We are concerned about the order when the values are written to the ingest document.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, this sort of thing is already done with Object maps. Just cannot do it with specific stream inputs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it more and looking more into the ingest doc code. I agree with you. This seems superfluous for now. If ordering becomes a concern for usability, we can add it in the future. The reading from the wire for both LinkedHashMap and HashMap would be exactly the same, so BWC is not a concern.

benwtrent added a commit to elastic/ml-cpp that referenced this pull request Mar 23, 2020
)

Feature importance is already calculated for multi-class models. This commit adjusts the output sent to ES so that multi-class importance can be explored.

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},
```
Java side change: elastic/elasticsearch#53803
@benwtrent
Copy link
Member Author

@elasticmachine update branch

@benwtrent benwtrent merged commit 756a297 into elastic:master Mar 23, 2020
@benwtrent benwtrent deleted the feature/ml-inference-multi-class-feature-importance branch March 23, 2020 20:53
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Mar 23, 2020
Adds multi-class feature importance calculation. 

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},
```

For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
 "ml.inference.feature_importance": {
          "type": "nested",
          "dynamic": true,
          "properties": {
            "feature_name": {
              "type": "keyword"
            },
            "importance": {
              "type": "double"
            }
          }
        }
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?

If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
benwtrent added a commit that referenced this pull request Mar 23, 2020
Adds multi-class feature importance calculation. 

Feature importance objects are now mapped as follows
(logistic) Regression:
```
{
   "feature_name": "feature_0",
   "importance": -1.3
}
```
Multi-class [class names are `foo`, `bar`, `baz`]
```
{ 
   “feature_name”: “feature_0”, 
   “importance”: 2.0, // sum(abs()) of class importances
   “foo”: 1.0, 
   “bar”: 0.5, 
   “baz”: -0.5 
},
```

For users to get the full benefit of aggregating and searching for feature importance, they should update their index mapping as follows (before turning this option on in their pipelines)
```
 "ml.inference.feature_importance": {
          "type": "nested",
          "dynamic": true,
          "properties": {
            "feature_name": {
              "type": "keyword"
            },
            "importance": {
              "type": "double"
            }
          }
        }
```
The mapping field name is as follows
`ml.<inference.target_field>.<inference.tag>.feature_importance`
if `inference.tag` is not provided in the processor definition, it is not part of the field path.
`inference.target_field` is defaulted to `ml.inference`.
//cc @lcawl ^ Where should we document this?

If this makes it in for 7.7, there shouldn't be any feature_importance at inference BWC worries as 7.7 is the first version to have it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants