[ML] Add log structure finder functionality #32788
Conversation
This change adds a library to ML that can be used to deduce a log file's structure given only a sample of the log file. Eventually this will be used to add an endpoint to ML to make the functionality available to end users, but this will follow in a separate change. The functionality is split into a library so that it can also be used by a command line tool without requiring the command line tool to include all server code.
I've marked this […]
This was a lot to grok (pun intended). I have read through most of it once :D
```java
if (wildcardRequired && others.stream().allMatch(String::isEmpty) == false) {
```
`others.stream().noneMatch(String::isEmpty)` is more readable.
I don't think `others.stream().noneMatch(String::isEmpty)` is logically equivalent. Consider the case where `others` contains one empty string and one non-empty string. `allMatch(String::isEmpty)` would be `false`, but so would `noneMatch(String::isEmpty)`. So ditching the `== false` would yield the opposite result.

I think the logically equivalent condition is this:

`others.stream().anyMatch(s -> s.isEmpty() == false)`

Or alternatively, if you don't like `== false` anywhere:

`others.stream().filter(String::isEmpty).findFirst().isPresent()`

But IntelliJ warns that this last one could be replaced with an `anyMatch()`.

Since you found the `allMatch()` confusing I'll replace it with an `anyMatch()`.
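A minimal sketch of the equivalence being discussed, using made-up values for `others`:

```java
import java.util.Arrays;
import java.util.List;

public class MatchEquivalenceDemo {
    public static void main(String[] args) {
        // One empty and one non-empty string - the case discussed above.
        List<String> others = Arrays.asList("", "abc");

        // Original condition: not every string is empty -> true here.
        boolean original = others.stream().allMatch(String::isEmpty) == false;

        // Suggested noneMatch: no string is empty -> false here, so NOT equivalent.
        boolean noneMatch = others.stream().noneMatch(String::isEmpty);

        // Equivalent rewrite: at least one string is non-empty -> true here.
        boolean anyMatch = others.stream().anyMatch(s -> s.isEmpty() == false);

        System.out.println(original + " " + noneMatch + " " + anyMatch);
    }
}
```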
```java
boolean wildcardRequired = true;
for (int i = 0; i < driver.length(); ++i) {
    char ch = driver.charAt(i);
    if (PUNCTUATION_OR_SPACE.indexOf(ch) >= 0 && others.stream().allMatch(other -> other.indexOf(ch) >= 0)) {
```
Whenever I see a loop like this, my initial gut reaction is "Why isn't this a HashSet?" Are we worried about premature optimization, or is there some Java thing I am missing?
I removed `PUNCTUATION_OR_SPACE` and `NEEDS_ESCAPING` and replaced them with a `HashMap` that enables both things to be checked with one lookup.
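A rough sketch of that kind of single-lookup table; the character sets below are illustrative, not the actual implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class PunctuationTableSketch {

    // One map answers both questions with a single lookup:
    // containsKey() -> is this character punctuation or a space?
    // get() == TRUE -> does it need escaping when used in a regex?
    private static final Map<Character, Boolean> PUNCTUATION_OR_SPACE_NEEDS_ESCAPING = new HashMap<>();
    static {
        String punctuationOrSpace = "\"'`~!@#$%^&*()-_=+[]{}\\|;:,.<>/? ";
        String needsEscaping = "\\|()[]{}^$*?+.";
        for (char c : punctuationOrSpace.toCharArray()) {
            PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put(c, needsEscaping.indexOf(c) >= 0);
        }
    }

    static boolean isPunctuationOrSpace(char ch) {
        return PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.containsKey(ch);
    }

    static boolean needsEscaping(char ch) {
        return Boolean.TRUE.equals(PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.get(ch));
    }
}
```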
```java
@Override
public boolean equals(Object other) {
    if (other == null) {
```
#nit: add the equal object predicate:

```java
if (this == other) {
    return true;
}
```
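A hedged sketch of how that reference-equality check typically slots into a full `equals()`; the class and fields here are illustrative, not the PR's actual class:

```java
import java.util.Objects;

public class StructureSketch {

    private final String grokPattern;
    private final String timestampField;

    StructureSketch(String grokPattern, String timestampField) {
        this.grokPattern = grokPattern;
        this.timestampField = timestampField;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;   // the suggested short-circuit for comparing an object with itself
        }
        if (other == null || getClass() != other.getClass()) {
            return false;
        }
        StructureSketch that = (StructureSketch) other;
        return Objects.equals(this.grokPattern, that.grokPattern)
            && Objects.equals(this.timestampField, that.timestampField);
    }

    @Override
    public int hashCode() {
        return Objects.hash(grokPattern, timestampField);
    }
}
```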
Set of comments as I got about one third of the way into the review.
```java
 * or more contributor license agreements. Licensed under the Elastic License;
 * you may not use this file except in compliance with the Elastic License.
 */
package org.elasticsearch.xpack.ml.logstructure;
```
I wonder if the package and the library name should match, i.e. should both be `logstructure` or `logstructurefinder`? Just raising it to ensure we double-check this.
OK cool. I renamed the package to `logstructurefinder`.
```java
import java.util.List;
import java.util.Map;

public abstract class AbstractLogFileStructureFinder {
```
I think we should consider dropping `File` from those class names. Whether a log is in a file or not is irrelevant, as the log could equally be stored in Elasticsearch or some other data store.
```java
private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
```
I recommend storing the explanation in a member field. Then we also add a `getExplanations` or `explain` method to the interface of the structure finder, plus a protected `addExplanation(String)` method for sub-classes to use. The benefit is that we won't have to pass the explanation list around as an argument. Also, in theory, that would make it easy to, say, allow enabling/disabling explanations, etc.
It would be nicer in some ways, but it would cause quite a lot of side effects. At the moment `explanation` is built up by 3 pieces of code:

- The character set finder
- The high level structure determination (i.e. the `*LogFileStructure::canCreateFromSample()` methods)
- The detailed structure determination

The unit tests test static methods that update explanations without having to create concrete objects. And the unit tests also print the explanation or partial explanation for whatever they've tested, which means that for unit tests we want the explanation to be owned by the test code.
I think one solution would be to introduce an `Explanation` class that encapsulates the list of strings. Then that could be a member of `AbstractLogFileStructureFinder`, which would avoid passing it around the member functions of that hierarchy so much. But it would be documented that other classes might be adding to the same explanation.
It's also tricky because the `GrokPatternCreator` needs to find mappings, relying on the fact that `guessScalarMapping()` is `static`.
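A rough sketch of the `Explanation` wrapper idea floated above (this class does not exist in the PR; it is just one way the encapsulation could look):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical wrapper around the list of explanation strings, so it could be held
 * as a member of the structure finder hierarchy instead of being passed to every
 * method. Other classes might still append to the same explanation.
 */
public class Explanation {

    private final List<String> entries = new ArrayList<>();

    public void add(String entry) {
        entries.add(entry);
    }

    public List<String> asList() {
        return Collections.unmodifiableList(entries);
    }

    @Override
    public String toString() {
        return String.join("\n", entries);
    }
}
```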
```java
private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
```
Also, it would be nice to add a javadoc for this method.
+1 for Javadoc
```java
    }
}

else if (fieldValues.stream().allMatch(IP_GROK::match)) {
```
This `else if` is getting away! Should it just be `if`?
It should be `else if`, but I'll remove the accidental extra blank line.
```java
SortedMap<String, Object> mappings = new TreeMap<>();

if (sampleRecords != null) {
```
also unnecessary null check?
```java
for (Map<String, ?> sampleRecord : sampleRecords) {
    for (String fieldName : sampleRecord.keySet()) {
        mappings.computeIfAbsent(fieldName, key -> guessMapping(explanation, fieldName,
            sampleRecords.stream().flatMap(record -> {
```
Perhaps it would be more readable to store this in a local variable.
I think that would negate the benefit of using `computeIfAbsent()`. This is in an inner loop, and if 1000 records all contain the same fields we don't want to do the complex processing 1000 times more than necessary.

If it's really hard to read as it stands, I guess an alternative approach would be to first build a `HashSet` of all the unique field names across all sample records, then iterate through that to drive creation of the mappings for each field.
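A sketch of that alternative: collect the unique field names first, then guess each mapping exactly once (`guessMapping` here is a placeholder standing in for the real deduction logic):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

public class MappingGuessSketch {

    static SortedMap<String, Object> guessMappings(List<Map<String, ?>> sampleRecords) {
        // First pass: collect every distinct field name across all sample records.
        Set<String> fieldNames = new HashSet<>();
        for (Map<String, ?> sampleRecord : sampleRecords) {
            fieldNames.addAll(sampleRecord.keySet());
        }

        // Second pass: guess each field's mapping exactly once, no matter how many records contain it.
        SortedMap<String, Object> mappings = new TreeMap<>();
        for (String fieldName : fieldNames) {
            mappings.put(fieldName, guessMapping(fieldName, sampleRecords));
        }
        return mappings;
    }

    // Placeholder standing in for the real mapping deduction logic.
    static Object guessMapping(String fieldName, List<Map<String, ?>> sampleRecords) {
        return "keyword";
    }
}
```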
```java
    return Collections.singletonMap(MAPPING_TYPE_SETTING, "object");
}
throw new IllegalArgumentException("Field [" + fieldName +
    "] has both object and non-object values - this won't work with Elasticsearch");
```
nit: rephrase to `- this is not supported by Elasticsearch`?
```java
}

/**
 * This must only be called if {@link #matchesAll} returns <code>true</code>.
```
As this is only called in one place, right after `matchesAll` was called, I think we could move the `matchesAll` check into this method and so remove the risk of it being called without fulfilling the `matchesAll` contract.
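A generic sketch of the guard being suggested, with invented method names since the real method isn't shown in this excerpt:

```java
public class GuardSketch {

    // Before: every caller had to remember to call matchesAll() first.
    // After: the method enforces its own precondition in one place.
    String processMatch(String text) {
        if (matchesAll(text) == false) {
            throw new IllegalStateException("processMatch() called with non-matching text: [" + text + "]");
        }
        // ... the original body, which previously relied on the caller's matchesAll() check ...
        return text;
    }

    boolean matchesAll(String text) {
        return text != null && text.isEmpty() == false;   // placeholder condition
    }
}
```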
```java
 * matched section and the pieces before and after it. Recurse to find more matches in the pieces
 * before and after and update the supplied string builder.
 */
private static void processCandidateAndSplit(List<String> explanation, Map<String, Integer> fieldNameCountStore,
```
The long arg lists in these methods make me wonder if we could refactor this as an object whose members capture those parameters. It's hard to come up with a more concrete suggestion without actually trying it, but these long arg lists and the set of static methods make me wonder if we could simplify things by going more OO here. This class seems like it is doing a lot and could be split up into several smaller classes with reduced responsibilities.
Good idea - it's done in 37a92f7. This class got progressively more complex over time. When I started out it made sense to just have a couple of static methods, but you're right that it now needs to be an object with members.
```java
/**
 * The timestamp patterns are complex and it can be slow to prove they do not
 * match anywhere in a long message. Many of the timestamps have similar and
```
'Many of the timestamps have similar and' -> 'are similar'
```java
private static final List<Pattern> QUICK_RULE_OUT_PATTERNS = Arrays.asList(
    Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2} "),
    Pattern.compile("\\d \\d{2}:\\d{2}\\b"),
    Pattern.compile(" \\d{2}:\\d{2}:\\d{2} ")
```
The first is YYYY-MM-DD and the last is a time of day, but I don't know what the second is checking. Can you add a short comment for each pattern?
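A hedged reading of the three patterns, written as the per-pattern comments being requested; the interpretation of the second pattern is my guess, not the author's:

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class QuickRuleOutSketch {

    private static final List<Pattern> QUICK_RULE_OUT_PATTERNS = Arrays.asList(
        // ISO-style date followed by a space, e.g. the "2018-08-14 " in "2018-08-14 12:34:56".
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2} "),
        // A day-of-month digit, a space, then hours:minutes, e.g. the "4 12:34" in syslog-style "Aug  4 12:34:56".
        Pattern.compile("\\d \\d{2}:\\d{2}\\b"),
        // Hours:minutes:seconds surrounded by spaces, e.g. " 12:34:56 ".
        Pattern.compile(" \\d{2}:\\d{2}:\\d{2} ")
    );
}
```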
```java
if (rows.size() < 3) {
    explanation.add("Too little data to accurately assess whether header is in sample - guessing it is");
} else {
    isCsvHeaderInFile = isFirstRowUnusual(explanation, rows);
```
So, we want to automatically determine whether a CSV header is given in the sample or not. Here are a couple of quick checks that we may want to try before doing expensive statistical checking and Levenshtein distances. Of course, this depends on how flexible we want to be.

- Are all entries in the header row unique? You should not have duplicate header entries.
- Do we want to support `null` header fields? If not, it stands to reason that none should be null in the sample.
- Do we want to support whitespace-only or empty string header fields?
- Are all entries actual `string` values? I am not sure it is valid to have a header that is just a number or a blank string.
There's really no distinction between `null` and empty string, as CSV cannot distinguish them. Whether the values are `null` or empty string in the code depends on the CSV parser implementation and any subsequent post-processing.

The only reason why I'm hesitant to outlaw empty fields in the header is that people have sent me examples of files that were CSV but where the header row was shorter than the data rows.

But the duplicates check is a good idea. I'll add an extra initial test: if the first row contains duplicate non-empty field values then it's not allowed to be a header row.
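A minimal sketch of that extra initial test (the method name is made up; empty fields are deliberately tolerated for the reason given above):

```java
import java.util.HashSet;
import java.util.Set;

public class HeaderCheckSketch {

    /**
     * Returns false if the candidate header row contains the same non-empty value twice,
     * in which case it cannot be a header row. Empty fields are tolerated because real
     * CSV files sometimes have header rows that are shorter than the data rows.
     */
    static boolean couldBeHeaderRow(String[] firstRow) {
        Set<String> seen = new HashSet<>();
        for (String field : firstRow) {
            if (field != null && field.isEmpty() == false && seen.add(field) == false) {
                return false;   // duplicate non-empty value
            }
        }
        return true;
    }
}
```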
The duplicates test is added in cf2d675
Some more comments after looking at the JSON and XML finders.
```java
private final String grokPattern;
private final List<String> timestampFormats;
private final String timestampField;
private final boolean needClientTimezone;
```
What is the reason this is not optional (not a `Boolean`)?
Because it makes sense to include this in the output in all cases, whereas some of the other boolean flags only make sense when a particular high level structure has been found. For example, `has_header_row` does not make sense at all for JSON.
```java
public LogFileStructure build() {

    if (numLinesAnalyzed <= 0) {
        throw new IllegalArgumentException("Number of lines analyzed must be positive.");
```
nit: In other classes we have such validations in the corresponding builder setter. Arguably there are different benefits to doing it either way. Just raising it to see what we think about the two approaches.
I put it here because the most likely error is failing to call the builder setter at all. This error would occur during development/adding a new log file structure. It won't be encountered by end users because they should never be building one of these objects.
```java
public LogFileStructure build() {

    if (numLinesAnalyzed <= 0) {
        throw new IllegalArgumentException("Number of lines analyzed must be positive.");
```
Another inconsistency here is whether the error messages contain a natural description of the field vs. the actual field name.
I think it's academic in this case, because it's our code supplying the values rather than the end user. So this check won't trigger as a result of a user calling a REST endpoint unless there's a bug in our code.
```java
JsonLogFileStructureFinder(List<String> explanation, String sample, String charsetName, Boolean hasByteOrderMarker) throws IOException {

    List<Map<String, ?>> sampleRecords = new ArrayList<>();
```
It seems that the Finder classes' constructors are doing the bulk of the work here. I wonder if it would be better to avoid that by introducing a method that clearly communicates that processing is going to occur.
I thought about this, but then the member variables cannot be `final`, and if you have one of these objects there's a potential doubt about whether the processing method has been called or not.
Good point. The other idea would be to go with static factory methods. But I don't feel strongly about it.
Good idea - I moved the major chunks of processing to static factory methods in baf863f. Refactoring of `GrokPatternCreator` and the explanation handling is still outstanding.
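A sketch of the static-factory shape being discussed, using a made-up finder class (the real constructors and parsing are considerably more involved):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class JsonStructureFinderSketch {

    private final List<Map<String, ?>> sampleRecords;

    // The heavy lifting happens here rather than in the constructor, so the fields
    // can stay final and any object you hold has definitely been fully processed.
    static JsonStructureFinderSketch makeFromSample(String sample) throws IOException {
        List<Map<String, ?>> sampleRecords = new ArrayList<>();
        // ... parse the sample and populate sampleRecords ...
        return new JsonStructureFinderSketch(sampleRecords);
    }

    private JsonStructureFinderSketch(List<Map<String, ?>> sampleRecords) {
        this.sampleRecords = Collections.unmodifiableList(sampleRecords);
    }

    List<Map<String, ?>> getSampleRecords() {
        return sampleRecords;
    }
}
```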
```java
}

TimestampMatch singleMatch = null;
for (String fieldValue : fieldValues) {
```
Is this checking for a date mapping? It took me a while to figure that out; could you add a comment?

This `for` loop has an `if ... else` and 2 `break` statements; there must be a way to simplify it. Why not call `TimestampFormatFinder.findFirstFullMatch` on the first element of the list and, if that is non-null, perform the `singleMatch.equals(TimestampFormatFinder...)` check on every other element?
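A sketch of that simplification, with placeholder stand-ins for the `TimestampFormatFinder` API (the `findFirstFullMatch` signature is assumed from the comment above, not checked against the code):

```java
import java.util.Collection;
import java.util.Iterator;

public class TimestampFieldCheckSketch {

    /**
     * Returns the single timestamp format that fully matches every field value, or null if
     * the values do not all share one format.
     */
    static TimestampMatch findCommonTimestampMatch(Collection<String> fieldValues) {
        Iterator<String> it = fieldValues.iterator();
        if (it.hasNext() == false) {
            return null;
        }
        TimestampMatch singleMatch = findFirstFullMatch(it.next());
        if (singleMatch == null) {
            return null;
        }
        while (it.hasNext()) {
            if (singleMatch.equals(findFirstFullMatch(it.next())) == false) {
                return null;
            }
        }
        return singleMatch;
    }

    // Placeholders standing in for the real TimestampFormatFinder API.
    static TimestampMatch findFirstFullMatch(String fieldValue) {
        return null;
    }

    static final class TimestampMatch {
    }
}
```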
```java
if (shouldTrimFields != null) {
    throw new IllegalArgumentException("Should trim fields may not be specified for [" + format + "] structures.");
}
// $FALL-THROUGH$
```
Curious, does this suppress warnings?
It does in Eclipse - see https://stackoverflow.com/a/13145718

However, I had to add the `@SuppressWarnings` on the whole method as the Gradle build runs with `-Werror`, so this isn't actually necessary to suppress the warning. I can change it to a more human-readable `// Fallthrough` if you prefer.
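For reference, a minimal sketch of method-level suppression of javac's fall-through lint warning, as described above (the switch contents are invented for illustration):

```java
public class FallThroughSketch {

    @SuppressWarnings("fallthrough")
    static void validate(String format, Boolean shouldTrimFields) {
        switch (format) {
            case "json":
            case "xml":
                if (shouldTrimFields != null) {
                    throw new IllegalArgumentException(
                        "Should trim fields may not be specified for [" + format + "] structures.");
                }
                // Fallthrough - deliberately continue with the checks shared with "csv".
            case "csv":
                // ... checks common to all of the above formats ...
                break;
            default:
                throw new IllegalArgumentException("Unknown format [" + format + "]");
        }
    }
}
```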
```java
for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
    spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
}
if (containsZeroBytes && spaceEncodingContainsZeroByte == false) {
```
I'm learning a lot about character encoding
```java
Tuple<Boolean, String[]> headerInfo = findHeaderFromSample(explanation, rows);
boolean isCsvHeaderInFile = headerInfo.v1();
String[] csvHeader = headerInfo.v2();
```
nit: `delineatedHeader` or `separatedValuesHeader`
I changed it to just `header`. `csv` still appears in a few places because SuperCSV has the same problem of using "CSV" in its name while being a generic delimited values parser.
Major processing is now done in static factory methods and the constructors are trivial
I think I've now addressed all the review comments so far. The only thing I'm left wondering is whether, after baf863f, it would make sense to get rid of the […]. I think I'll make this change tomorrow unless anyone has a better idea, and then hopefully the PR will be in a fit state to merge.
These methods are now in a utils class
@elasticmachine test this please
LGTM