
[ML] Add log structure finder functionality #32788

Merged
droberts195 merged 10 commits into elastic:master on Aug 15, 2018

Conversation

droberts195 (Contributor):

This change adds a library to ML that can be used to deduce a log
file's structure given only a sample of the log file.

Eventually this will be used to add an endpoint to ML to make the
functionality available to end users, but this will follow in a
separate change.

The functionality is split into a library so that it can also be
used by a command line tool without requiring the command line
tool to include all server code.

droberts195 (Contributor Author):

I've marked this >non-issue as it's invisible to the end user in this state. It will get added to the release notes when the corresponding endpoint is added.

benwtrent (Member) left a comment:

This was a lot to grok (pun intended). I have read through most of it once :D

}
}

if (wildcardRequired && others.stream().allMatch(String::isEmpty) == false) {
Member:

others.stream().noneMatch(String::isEmpty) is more readable.

Contributor Author:

I don't think others.stream().noneMatch(String::isEmpty) is logically equivalent. Consider the case where others contains one empty string and one non-empty string. allMatch(String::isEmpty) would be false, but so would noneMatch(String::isEmpty). So ditching the == false would yield the opposite result.

I think the logically equivalent condition is this:

others.stream().anyMatch(s -> s.isEmpty() == false)

Or alternatively if you don't like == false anywhere:

others.stream().filter(String::isEmpty).findFirst().isPresent()

But IntelliJ warns that this last one could be replaced with an anyMatch().

Since you found the allMatch() confusing I'll replace it with an anyMatch().
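A standalone sketch (class and variable names invented for illustration) showing why the two predicates diverge for a mixed list:

import java.util.Arrays;
import java.util.List;

public class MatchDemo {
    public static void main(String[] args) {
        List<String> others = Arrays.asList("", "abc"); // one empty, one non-empty

        // Original condition: "not every element is empty" -> true for this list
        boolean notAllEmpty = others.stream().allMatch(String::isEmpty) == false;

        // Suggested replacement: "no element is empty" -> false for this list, so not equivalent
        boolean noneEmpty = others.stream().noneMatch(String::isEmpty);

        // Logically equivalent rewrite: "at least one element is non-empty" -> true for this list
        boolean anyNonEmpty = others.stream().anyMatch(s -> s.isEmpty() == false);

        System.out.println(notAllEmpty + " " + noneEmpty + " " + anyNonEmpty); // prints: true false true
    }
}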

boolean wildcardRequired = true;
for (int i = 0; i < driver.length(); ++i) {
char ch = driver.charAt(i);
if (PUNCTUATION_OR_SPACE.indexOf(ch) >= 0 && others.stream().allMatch(other -> other.indexOf(ch) >= 0)) {
Member:

Whenever I see a loop like this, my initial gut reaction is "Why isn't this a HashSet?" Are we worried about premature optimization, or is there some Java thing I am missing?

Contributor Author:

I removed PUNCTUATION_OR_SPACE and NEEDS_ESCAPING and replaced them with a HashMap that enables both things to be checked with one lookup.
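Roughly the kind of combined lookup being described (a sketch only; the character sets and names here are illustrative, not the actual ones in the PR):

import java.util.HashMap;
import java.util.Map;

public class CharClassLookup {
    // One lookup answers both questions: is the character punctuation/space,
    // and if so, does it need escaping when used in a regular expression?
    private static final Map<Character, Boolean> PUNCTUATION_OR_SPACE_NEEDS_ESCAPING = new HashMap<>();
    static {
        String punctuationOrSpace = "\"'`~:;,.<>()[]{}^$*?|\\/!@#%= \t";
        String needsEscaping = "\\|()[]{}^$*?.+";
        for (char ch : punctuationOrSpace.toCharArray()) {
            PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.put(ch, needsEscaping.indexOf(ch) >= 0);
        }
    }

    static boolean isPunctuationOrSpace(char ch) {
        return PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.containsKey(ch);
    }

    static boolean needsEscaping(char ch) {
        return PUNCTUATION_OR_SPACE_NEEDS_ESCAPING.getOrDefault(ch, false);
    }
}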


@Override
public boolean equals(Object other) {
if (other == null) {
Member:

#nit add the equal object predicate.

if (this == other) {
    return true;
}
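For reference, the conventional full shape of equals() with that identity check up front (a generic sketch on a placeholder class, not the class under review):

import java.util.Objects;

// MyValue is a placeholder used only to illustrate the usual equals()/hashCode() shape.
public final class MyValue {
    private final String field;

    public MyValue(String field) {
        this.field = field;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;                          // same object, trivially equal
        }
        if (other == null || getClass() != other.getClass()) {
            return false;                         // null or a different type cannot be equal
        }
        MyValue that = (MyValue) other;
        return Objects.equals(field, that.field); // compare the significant fields
    }

    @Override
    public int hashCode() {
        return Objects.hash(field);               // keep hashCode consistent with equals
    }
}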

dimitris-athanasiou (Contributor) left a comment:

A set of comments from the first third of the review.

* or more contributor license agreements. Licensed under the Elastic License;
* you may not use this file except in compliance with the Elastic License.
*/
package org.elasticsearch.xpack.ml.logstructure;
Contributor:

I wonder if the package and the library name should match, i.e. whether both should be logstructure or logstructurefinder. Just raising it to ensure we double-check this.

Contributor Author:

OK cool. I renamed the package to logstructurefinder.

import java.util.List;
import java.util.Map;

public abstract class AbstractLogFileStructureFinder {
Contributor:

I think we should consider dropping File from those class names. Whether a log is in a file or not is irrelevant, as the log could just as easily be stored in Elasticsearch or some other storage.

private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
Contributor:

I recommend storing explanation in a member field. Then we also add a getExplanations or explain method to the interface of the structure finder, plus a protected addExplanation(String) method for sub-classes to use. The benefit is that we won't have to keep passing the explanation list around as an argument. Also, in theory, that would make it easy to, say, allow enabling/disabling explanations, etc.

Contributor Author:

It would be nicer in some ways, but it would cause quite a lot of side effects. At the moment explanation is built up by 3 pieces of code:

  1. The character set finder
  2. The high level structure determination (i.e. the *LogFileStructure::canCreateFromSample() methods)
  3. The detailed structure determination

The unit tests test static methods that update explanations without having to create concrete objects. And the unit tests also print the explanation or partial explanation for whatever they've tested, which means that for unit tests we want the explanation to be owned by the test code.

I think one solution would be to introduce an Explanation class that encapsulates the list of strings. Then that could be a member of AbstractLogFileStructureFinder, which would avoid passing it around the member functions of that hierarchy so much. But it would be documented that other classes might be adding to the same explanation.

Contributor Author:

It's also tricky because the GrokPatternCreator needs to find mappings, relying on the fact that guessScalarMapping() is static.
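One possible shape for the Explanation wrapper floated above (purely hypothetical; the PR kept passing a List<String> around instead):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical wrapper for the explanation strings discussed in this thread; not part of the PR.
public class Explanation {
    private final List<String> entries = new ArrayList<>();
    private boolean enabled = true;

    public void add(String entry) {
        if (enabled) {
            entries.add(entry);    // record one step of the structure-finding reasoning
        }
    }

    public void setEnabled(boolean enabled) {
        this.enabled = enabled;    // would allow explanations to be switched off cheaply
    }

    public List<String> asList() {
        return Collections.unmodifiableList(entries);
    }
}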

private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
Contributor:

Also, it would be nice to add a javadoc for this method.

Member:

+1 for Javadoc

}
}

else if (fieldValues.stream().allMatch(IP_GROK::match)) {
Contributor:

this else if is getting away! Should it just be if ?

Contributor Author:

It should be else if, but I'll remove the accidental extra blank line.


SortedMap<String, Object> mappings = new TreeMap<>();

if (sampleRecords != null) {
Contributor:

also unnecessary null check?

for (Map<String, ?> sampleRecord : sampleRecords) {
for (String fieldName : sampleRecord.keySet()) {
mappings.computeIfAbsent(fieldName, key -> guessMapping(explanation, fieldName,
sampleRecords.stream().flatMap(record -> {
Contributor:

Perhaps it would be more readable to store this in a local variable.

Contributor Author:

I think that would negate the benefit of using computeIfAbsent(). This is in an inner loop and if 1000 records all contain the same fields we don't want to do the complex processing 1000 times more than necessary.

If it's really hard to read as it stands I guess an alternative approach would be to first build a HashSet of all the unique field names across all sample records, then iterate through that to drive creation of the mappings for each field.
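A sketch of that alternative (simplified signatures; guessMapping here is a trivial stand-in for the real method):

import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class MappingSketch {

    // Stand-in for the real guessMapping(); always returns a keyword mapping for illustration.
    static Map<String, String> guessMapping(List<String> explanation, String fieldName, List<Object> fieldValues) {
        return Collections.singletonMap("type", "keyword");
    }

    static SortedMap<String, Object> guessMappings(List<String> explanation, List<Map<String, ?>> sampleRecords) {
        // First collect the distinct field names across all sample records...
        Set<String> fieldNames = new HashSet<>();
        for (Map<String, ?> sampleRecord : sampleRecords) {
            fieldNames.addAll(sampleRecord.keySet());
        }
        // ...then guess a mapping exactly once per field, not once per record/field pair.
        SortedMap<String, Object> mappings = new TreeMap<>();
        for (String fieldName : fieldNames) {
            List<Object> fieldValues = sampleRecords.stream()
                .map(record -> record.get(fieldName))
                .filter(Objects::nonNull)
                .collect(Collectors.toList());
            mappings.put(fieldName, guessMapping(explanation, fieldName, fieldValues));
        }
        return mappings;
    }
}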

return Collections.singletonMap(MAPPING_TYPE_SETTING, "object");
}
throw new IllegalArgumentException("Field [" + fieldName +
"] has both object and non-object values - this won't work with Elasticsearch");
Contributor:

nit: rephrase to - this is not supported by Elasticsearch ?

}

/**
* This must only be called if {@link #matchesAll} returns <code>true</code>.
Contributor:

As this is only called in one place, right after matchesAll is called, I think we could move the matchesAll check into this method and so remove the risk of it being called without fulfilling the matchesAll contract.

* matched section and the pieces before and after it. Recurse to find more matches in the pieces
* before and after and update the supplied string builder.
*/
private static void processCandidateAndSplit(List<String> explanation, Map<String, Integer> fieldNameCountStore,
Contributor:

The long arg lists in these methods make me wonder if we could refactor them as an object whose members capture those parameters. It's hard to come up with a more concrete suggestion without trying to do it, but the long arg lists and set of static methods make me wonder if we could simplify by going more OO here. This class seems like it is doing a lot and could be split up into several smaller classes with reduced responsibilities.

Contributor Author:

Good idea - it's done in 37a92f7

This class got progressively more complex over time. When I started out it made sense to just have a couple of static methods but now you're right it needs to be an object with members.


/**
* The timestamp patterns are complex and it can be slow to prove they do not
* match anywhere in a long message. Many of the timestamps have similar and
Member:

'Many of the timestamps have similar and' -> 'are similar'

private static final List<Pattern> QUICK_RULE_OUT_PATTERNS = Arrays.asList(
Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2} "),
Pattern.compile("\\d \\d{2}:\\d{2}\\b"),
Pattern.compile(" \\d{2}:\\d{2}:\\d{2} ")
Member:

The first is YYYY-MM-DD, the last is a timestamp, but I don't know what the second is checking. Can you add a short comment for each pattern?
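A hedged guess at what those per-pattern comments could say (the intent of the second pattern is inferred from its shape, not confirmed in the PR):

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class QuickRuleOutSketch {
    // Cheap pre-checks: if none of these match, the expensive timestamp Grok patterns cannot match either.
    private static final List<Pattern> QUICK_RULE_OUT_PATTERNS = Arrays.asList(
        // An ISO-style date followed by a space, e.g. "2018-08-15 "
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2} "),
        // A trailing day-of-month digit followed by an HH:MM time, e.g. the "5 17:04" in "Aug 15 17:04" (inferred)
        Pattern.compile("\\d \\d{2}:\\d{2}\\b"),
        // An HH:MM:SS time surrounded by spaces, e.g. " 17:04:29 "
        Pattern.compile(" \\d{2}:\\d{2}:\\d{2} ")
    );
}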

if (rows.size() < 3) {
explanation.add("Too little data to accurately assess whether header is in sample - guessing it is");
} else {
isCsvHeaderInFile = isFirstRowUnusual(explanation, rows);
benwtrent (Member) commented on Aug 13, 2018:

So, we want to automatically determine if a CSV header is given in the sample or not. Here are a couple of quick checks that we may want to try before doing expensive statistical checking and Levenshtein distances. Of course, this depends on how flexible we want to be.

  • Are all entries unique in the header row? You should not have duplicate header entries
  • Do we want to support null header fields? If not, it stands to reason that none should be null in the sample
  • Do we want to support just white space or empty string header fields?
  • Are all entries actual string values? I am not sure it is valid to have a header that is just a number or a blank string.

Contributor Author:

There's really no distinction between null and empty string, as CSV cannot distinguish them. Whether the values are null or empty string in the code depends on the CSV parser implementation and any subsequent post processing.

The only reason why I'm hesitant to outlaw empty fields in the header is that people have sent me examples of files that were CSV but where the header row was shorter than the data rows.

But the duplicates check is a good idea. I'll add an extra initial test that if the first row contains duplicate non-empty field values then it's not allowed to be a header row.
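A minimal sketch of that kind of initial duplicates test (illustrative only; the actual check added in cf2d675 may differ):

import java.util.HashSet;
import java.util.Set;

public class HeaderChecks {

    // Returns true if the candidate header row contains the same non-empty value more than once,
    // in which case it is not allowed to be treated as a header row.
    static boolean hasDuplicateNonEmptyFields(String[] firstRow) {
        Set<String> seen = new HashSet<>();
        for (String field : firstRow) {
            if (field != null && field.isEmpty() == false && seen.add(field) == false) {
                return true;   // add() returning false means this non-empty value was seen before
            }
        }
        return false;
    }
}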

Contributor Author:

The duplicates test is added in cf2d675

dimitris-athanasiou (Contributor) left a comment:

Some more comments after looking at the JSON and XML finders.

private final String grokPattern;
private final List<String> timestampFormats;
private final String timestampField;
private final boolean needClientTimezone;
Contributor:

What is the reason this is not optional (not a Boolean)?

Contributor Author:

Because it makes sense to include this in the output in all cases, whereas some of the other boolean flags only make sense when a particular high level structure has been found. For example, has_header_row does not make sense at all for JSON.

public LogFileStructure build() {

if (numLinesAnalyzed <= 0) {
throw new IllegalArgumentException("Number of lines analyzed must be positive.");
Contributor:

nit: In other classes we have such validations in the corresponding builder setter. Arguably there are different benefits doing it either way. Just raising to see what we think about the 2 ways.

Contributor Author:

I put it here because the most likely error is failing to call the builder setter at all. This error would occur during development/adding a new log file structure. It won't be encountered by end users because they should never be building one of these objects.

public LogFileStructure build() {

if (numLinesAnalyzed <= 0) {
throw new IllegalArgumentException("Number of lines analyzed must be positive.");
Contributor:

Another inconsistency here is whether the error messages contain a natural description of the field vs. the actual field name.

Contributor Author:

I think it's academic in this case, because it's our code supplying the values rather than the end user. So this check won't trigger as a result of a user calling a REST endpoint unless there's a bug in our code.


JsonLogFileStructureFinder(List<String> explanation, String sample, String charsetName, Boolean hasByteOrderMarker) throws IOException {

List<Map<String, ?>> sampleRecords = new ArrayList<>();
Contributor:

It seems that the Finder classes' constructors are doing the bulk of the work here. I wonder if it would be better to avoid that by introducing a method that clearly communicates that processing is going to occur.

Contributor Author:

I thought about this, but then the member variables cannot be final, and if you have one of these objects there's a potential doubt about whether the processing method has been called or not.

Contributor:

Good point. The other idea would be to go with static factory methods. But I don't feel strongly about it.

Contributor Author:

Good idea - I moved the major chunks of processing to static factory methods in baf863f

Refactoring of GrokPatternCreator and explanation still outstanding
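The general shape of that factory pattern, sketched with placeholder names rather than the real constructor signatures:

import java.util.Collections;
import java.util.List;
import java.util.Map;

// Placeholder class illustrating "static factory does the work, constructor only assigns" as discussed above.
public class JsonStructureFinderSketch {

    private final List<Map<String, ?>> sampleRecords;

    // The heavy lifting (parsing the sample) happens here, before the object exists...
    static JsonStructureFinderSketch makeJsonStructureFinder(String sample) {
        List<Map<String, ?>> sampleRecords = parseSample(sample);   // hypothetical parsing step
        return new JsonStructureFinderSketch(sampleRecords);
    }

    // ...so the constructor only assigns final fields and a half-built object is never visible.
    private JsonStructureFinderSketch(List<Map<String, ?>> sampleRecords) {
        this.sampleRecords = sampleRecords;
    }

    private static List<Map<String, ?>> parseSample(String sample) {
        return Collections.emptyList();   // stand-in for real JSON parsing of the sample text
    }

    public List<Map<String, ?>> getSampleRecords() {
        return sampleRecords;
    }
}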
private static final int KEYWORD_MAX_LEN = 256;
private static final int KEYWORD_MAX_SPACES = 5;

protected static Map<String, String> guessScalarMapping(List<String> explanation, String fieldName, Collection<String> fieldValues) {
Member:

+1 for Javadoc

}

TimestampMatch singleMatch = null;
for (String fieldValue : fieldValues) {
Member:

Is this checking for a date mapping? It took me a while to figure that out; could you add a comment?

This for loop has an if ... else and 2 break statements; there must be a way to simplify it. Why not call TimestampFormatFinder.findFirstFullMatch on the first element of the list and, if that is non-null, perform the singleMatch.equals(TimestampFormatFinder... check on every other element?
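Roughly the simplification being suggested (TimestampMatch and TimestampFormatFinder.findFirstFullMatch are from the PR; the enclosing method and its List parameter are a simplified stand-in):

// Sketch only: returns true when every sample value matches the same timestamp format,
// using the first value's match as the reference, as suggested above.
static boolean allValuesShareOneTimestampFormat(List<String> fieldValues) {
    if (fieldValues.isEmpty()) {
        return false;
    }
    TimestampMatch singleMatch = TimestampFormatFinder.findFirstFullMatch(fieldValues.get(0));
    if (singleMatch == null) {
        return false;   // the first value is not a timestamp, so this is not a date field
    }
    for (String fieldValue : fieldValues.subList(1, fieldValues.size())) {
        if (singleMatch.equals(TimestampFormatFinder.findFirstFullMatch(fieldValue)) == false) {
            return false;   // a later value has no match or a different format
        }
    }
    return true;
}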

if (shouldTrimFields != null) {
throw new IllegalArgumentException("Should trim fields may not be specified for [" + format + "] structures.");
}
// $FALL-THROUGH$
Member:

Curious, does this suppress warnings?

Contributor Author:

It does in Eclipse - see
https://stackoverflow.com/a/13145718

However, I had to add the @SuppressWarnings on the whole method as the Gradle build runs with -Werror, so this comment isn’t actually necessary to suppress the warning. I can change it to a more human-readable // Fallthrough if you prefer.
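For context, a hedged sketch of the two mechanisms being discussed, using made-up format names (javac only honours the annotation; the marker comment is recognised by Eclipse but is just a comment to javac):

public class FallThroughSketch {

    @SuppressWarnings("fallthrough")   // needed for javac when the build runs with -Werror
    static void validate(String format, Boolean shouldTrimFields) {
        switch (format) {
            case "json":
                if (shouldTrimFields != null) {
                    throw new IllegalArgumentException(
                        "Should trim fields may not be specified for [" + format + "] structures.");
                }
                // $FALL-THROUGH$  Eclipse-style marker; the json checks continue into the shared checks below
            case "text":
                // checks shared by both formats would go here
                break;
            default:
                throw new IllegalArgumentException("Unknown format [" + format + "].");
        }
    }
}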

for (int i = 0; i < spaceBytes.length && spaceEncodingContainsZeroByte == false; ++i) {
spaceEncodingContainsZeroByte = (spaceBytes[i] == 0);
}
if (containsZeroBytes && spaceEncodingContainsZeroByte == false) {
Member:

I'm learning a lot about character encoding


Tuple<Boolean, String[]> headerInfo = findHeaderFromSample(explanation, rows);
boolean isCsvHeaderInFile = headerInfo.v1();
String[] csvHeader = headerInfo.v2();
Member:

nit delineatedHeader or separatedValuesHeader

Contributor Author:

I changed it to just header. csv still appears in a few places because SuperCSV has the same problem of using "CSV" in its name while being a generic delimited values parser.

droberts195 (Contributor Author):

I think I've now addressed all the review comments so far.

The only thing I'm left wondering is whether, after baf863f, it would make sense to get rid of the abstract base classes and move the methods (which are all now static) into a new LogStructureUtils class, as it's a bit weird to have abstract base classes with only static methods. I could also move a static method from the GrokPatternCreator into the new utility class to break a circular dependency with the TextLogStructureFinder.

I think I'll make this change tomorrow unless anyone has a better idea and then hopefully the PR will be in a fit state to merge.

droberts195 (Contributor Author):

@elasticmachine test this please

dimitris-athanasiou (Contributor) left a comment:

LGTM

droberts195 merged commit 5ba04e2 into elastic:master Aug 15, 2018
droberts195 deleted the log-structure-finder-library branch August 15, 2018 17:04
droberts195 added a commit that referenced this pull request Aug 15, 2018
jasontedor added a commit that referenced this pull request Aug 15, 2018
* master:
  Fix global checkpoint listeners test
  HLRC: adding machine learning open job (#32860)
  [ML] Add log structure finder functionality (#32788)
  INGEST: Add Configuration Except. Data to Metdata (#32322)
jasontedor added a commit that referenced this pull request Aug 15, 2018
* 6.x:
  Fix global checkpoint listeners test
  HLRC: adding machine learning open job (#32860)
  [ML] Add log structure finder functionality (#32788)
  Use generic AcknowledgedResponse instead of extended classes (#32859)
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Aug 24, 2018
…-checkpoint-poll

* 'master' of github.com:elastic/elasticsearch:
  Fix docs for fixed filename for heap dump path (elastic#32882)
  Painless: Special Case def (elastic#32871)
  AwaitFix FullClusterRestartIT#testRollupIDSchemeAfterRestart.
  [Test] Fix DuelScrollIT#testDuelIndexOrderQueryThenFetch
  HLRC: adding machine learning delete job (elastic#32820)
  [DOCS] Update WordPress plugins links (elastic#32194)
  Remove passphrase support from reload settings API (elastic#32889)
  AwaitFix AckIT.
  Mutes test in DuelScrollIT
  CharArraysTests: Fix test bug.
  [ML] Choose seconds to fix intermittent DatafeeedConfigTest failure
  Test: Fix unpredictive merges in DocumentSubsetReaderTests
  [DOCS] Clarify sentence in network-host.asciidoc (elastic#32429)
  Docs enhancement: added reference to cluster-level setting `search.default_allow_partial_results` (elastic#32810)
  [DOCS] Fixing cross doc link to Stack Overview security topic.
  Move CharArrays to core lib (elastic#32851)
  Fix global checkpoint listeners test
  HLRC: adding machine learning open job (elastic#32860)
  [ML] Add log structure finder functionality (elastic#32788)
  INGEST: Add Configuration Except. Data to Metdata (elastic#32322)
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Aug 28, 2018
jimczi added the v7.0.0-beta1 label and removed the v7.0.0 label on Feb 7, 2019