-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Job and datafeed mappings with index template #32594
Conversation
Pinging @elastic/ml-core |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments and questions.
* is stored | ||
* @return The index name | ||
*/ | ||
public static String jobConfigIndexName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe just configIndexName()
, as it will also store datafeed configs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
.field(TYPE, KEYWORD) | ||
.endObject() | ||
.startObject(Detector.CUSTOM_RULES_FIELD.getPreferredName()) | ||
.field(TYPE, NESTED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need the nested document functionality here? If not then OBJECT
would be lighter-weight. (Maybe at this level NESTED
is useful - I'm not sure.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Custom rules are an array which is why I used nested
.field(ENABLED, false) | ||
.endObject() | ||
.startObject(DetectionRule.CONDITIONS_FIELD.getPreferredName()) | ||
.field(TYPE, NESTED) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at this level NESTED
is definitely overkill and should be OBJECT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again conditions are an array. It's arguable that the extensive mappings are overkill anyway, not all fields need to be searchable and these mappings come with a maintenance burden. Is it conceivable that someone will want to search for a job where RuleCondition.AppliesTo == ACTUAL && RuleCondition.Operator == GT
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess there will be relatively few documents in the index, so we can get away with high-cost mappings. We have NESTED
in our results index where it is much more costly due to the higher document volume and it's never been found to be the root cause of any problem, so it's probably fine.
.endObject() | ||
.endObject() | ||
.startObject(Detector.DETECTOR_INDEX.getPreferredName()) | ||
.field(TYPE, LONG) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is INTEGER
in other mappings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@@ -175,14 +277,19 @@ | |||
*/ | |||
public static boolean isValidFieldName(String fieldName) { | |||
String[] segments = DOT_PATTERN.split(fieldName); | |||
return !RESERVED_FIELD_NAMES.contains(segments[0]); | |||
return RESERVED_RESULT_FIELD_NAMES.contains(segments[0]) == false && RESERVED_CONFIG_FIELD_NAMES.contains(segments[0]) == false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary for the config field names to be included here? It will prevent fields being added to results documents just because they clash with a config field name. But that's not necessary because the config and results are in different indices. Is there another reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
.startObject(Job.RESULTS_INDEX_NAME.getPreferredName()) | ||
.field(TYPE, KEYWORD) | ||
.endObject() | ||
.startObject(Job.DELETED.getPreferredName()) // TODO can this be removed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an interesting question. At the moment this is what other APIs use to determine that a job is in the process of being deleted. Maybe storing that flag in an index won't be sufficiently atomic in the future. An alternative might be to make the job deletion process a persistent task, and use the existence of that persistent task to determine whether a job is being deleted. This field is probably just one of several places where indices won't give the same ordering guarantee that cluster state gave us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the loose eventually consistent guarantees this field is useless, we have considered other ways to mark the fact that a job is in the process of deletion I'd like to remove this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed the DELETED
field from the mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Mappings for job and datafeed configurations and the index template.
For moving the configuration classes out of the clusterstate