expect regex extracted tokens in database bloom filters #103

Merged
kortemik merged 26 commits into teragrep:main on Nov 11, 2024

Conversation

Contributor

@elliVM elliVM commented Oct 22, 2024

  • Use regex to extract tokens from search term. Use tokens to select joined tables and generated bloommatch conditions.
  • Move bloom filter related classes to bloomfilter package
  • DatabaseTables interface
  • Tokenizable interface and decorator TokensAsStrings
  • CategoryTable interface refactored to only have create() method
  • Renaming of classes and variables and refactoring to improve code clarity

@elliVM elliVM self-assigned this Oct 22, 2024
@elliVM elliVM requested a review from 51-code October 22, 2024 07:04
@elliVM elliVM marked this pull request as ready for review October 22, 2024 07:05
Contributor

@51-code 51-code left a comment

LGTM

@elliVM elliVM requested a review from eemhu October 22, 2024 07:21
"Null field while creating bloom filter expected <{}>, fpp <{}>, pattern <{}>, search term <{}>",
expected, fpp, pattern, searchTerm
);
throw new RuntimeException("Object field was null");
Contributor

this exception message could be a bit clearer

Contributor Author

Clarified the exception messages, added tests for exceptions, and removed the use of the .longValue() method in the constructor, which would lead to an NPE.
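
The fix described above can be sketched roughly as follows. This is a minimal illustration and not the code from the pull request: the class name `FilterParametersSketch`, its accessors, and the choice of `IllegalStateException` are assumptions made for the example. The idea is that the constructor only stores the boxed values, and each accessor names the missing field in its exception message.

```java
// Hypothetical sketch; names and exception type are assumptions, not the PR's code.
final class FilterParametersSketch {

    private final Long expected; // boxed, may be null when read from a nullable column
    private final Double fpp;    // boxed, may be null as well

    FilterParametersSketch(final Long expected, final Double fpp) {
        // No unboxing and no .longValue() here, so construction cannot throw a NullPointerException.
        this.expected = expected;
        this.fpp = fpp;
    }

    long expectedItems() {
        if (expected == null) {
            // The message names the field that was missing.
            throw new IllegalStateException("Filter field 'expected' was null");
        }
        return expected;
    }

    double falsePositiveProbability() {
        if (fpp == null) {
            throw new IllegalStateException("Filter field 'fpp' was null");
        }
        return fpp;
    }
}
```

With this shape, a test can assert on the exact exception type and message for each missing field.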

@@ -78,7 +78,7 @@ public IndexStatementCondition(String value, ConditionConfig config, Condition c

public Condition condition() {
if (!config.bloomEnabled()) {
- LOGGER.debug("Indexstatement reached with bloom disabled");
+ LOGGER.warn("Indexstatement reached with bloom disabled");
Contributor

would this be more suitable to be a debug log?

Contributor Author

Yes, that is better. I lowered it to debug.


public TokenizedValue(String value) {
this(
value,
new HashSet<>(new Tokenizer(32).tokenize(new ByteArrayInputStream(value.getBytes(StandardCharsets.UTF_8))))
Contributor

not 100% sure if calling tokenizer in constructor is optimal, perhaps it should only be done when tokens are required?

Contributor Author

Refactored so that the tokenizer is called only when tokens are needed.
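
A minimal sketch of that idea, assuming a hypothetical class `LazyTokenizedValueSketch` and a whitespace split standing in for the project's `Tokenizer` (whose API is not shown in this thread): the constructor only stores the value, and tokenization happens inside `tokens()`.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch; the real class uses the project's Tokenizer instead of split().
final class LazyTokenizedValueSketch {

    private final String value;

    LazyTokenizedValueSketch(final String value) {
        // The constructor does no work beyond storing the field.
        this.value = value;
    }

    String value() {
        return value;
    }

    Set<String> tokens() {
        // Tokenization runs only when a caller asks for tokens.
        // Assumption: whitespace splitting is a placeholder for the real tokenizer.
        return new LinkedHashSet<>(Arrays.asList(value.trim().split("\\s+")));
    }
}
```

Callers that only need `value()` never pay for tokenization, and the object stays immutable because nothing is cached in a mutable field.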

Table<?> target = DSL.table(DSL.name("target"));
String searchTerm = "Pattern";
BloomFilterFromRecord filter = new BloomFilterFromRecord(dynamicRecord, target, searchTerm);
Assertions.assertDoesNotThrow(filter::bytes);
Contributor

no assertions for bytes() result?

Contributor Author

Removed, since the test was redundant; the bytes() method is covered by the other tests.

RegexExtractedValue regexValue = new RegexExtractedValue(value, regex);
Set<String> tokens = regexValue.tokens();
Assertions.assertEquals(2, tokens.size());
Assertions.assertTrue(tokens.contains("(important)") && tokens.contains("(very important)"));
Contributor

separate into two assertions for clarity

Contributor Author

Separated into two assertions.

@elliVM elliVM requested a review from eemhu October 22, 2024 09:45
public byte[] bytes() {
final BloomFilter filter = create();
final ByteArrayOutputStream filterBAOS = new ByteArrayOutputStream();
try {
Member

why not try-with-resources?

Contributor Author

refactored to use try-with-resources
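
For illustration, a try-with-resources version of such a method might look like the sketch below. The `WritableFilter` interface is a hypothetical stand-in for the bloom filter type, introduced only so the example is self-contained; it is not the PR's code.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical sketch of a bytes() method using try-with-resources.
final class FilterBytesSketch {

    // Assumption: stand-in for a filter that can serialize itself to a stream.
    interface WritableFilter {
        void writeTo(OutputStream out) throws IOException;
    }

    private final WritableFilter filter;

    FilterBytesSketch(final WritableFilter filter) {
        this.filter = filter;
    }

    byte[] bytes() {
        // The stream is closed automatically, on both the normal and the exceptional path.
        try (final ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            filter.writeTo(out);
            return out.toByteArray();
        }
        catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The resource declaration replaces the manual close call and the finally block that a plain try would need.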

* Filter types of a table that can be inserted into the tables category table
*/
public final class TableFilters {
public final class FilterFromRecordToCategoryTableConsumer implements Consumer<Record> {
Member

Please use the object-oriented way instead of the functional way for iteration, meaning a for loop instead of a Consumer.

Contributor Author

refactored to use for loop


public final class RegexExtractedValue {

private final Matcher matcher;
Member

Matcher is stateful (https://docs.oracle.com/javase/8/docs/api/java/util/regex/Matcher.html#find--); this object is therefore mutable, and mutability is a no-go.

Contributor Author

Replaced Matcher with Pattern that is stateless and immutable.
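
One way to picture the change, as a sketch rather than the merged code: the object holds only the input string and a compiled `Pattern`, and every call to `tokens()` creates its own short-lived `Matcher`. The class name `RegexTokensSketch` and the sample regex are assumptions for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the RegexExtractedValue class from the pull request.
final class RegexTokensSketch {

    private final String value;
    private final Pattern pattern; // Pattern instances are immutable and safe to share

    RegexTokensSketch(final String value, final String regex) {
        this(value, Pattern.compile(regex));
    }

    RegexTokensSketch(final String value, final Pattern pattern) {
        this.value = value;
        this.pattern = pattern;
    }

    Set<String> tokens() {
        // A fresh Matcher per call keeps the find() position out of the object's state.
        final Matcher matcher = pattern.matcher(value);
        final Set<String> tokens = new LinkedHashSet<>();
        while (matcher.find()) {
            tokens.add(matcher.group());
        }
        return tokens;
    }

    public static void main(final String[] args) {
        final RegexTokensSketch sketch = new RegexTokensSketch(
                "some (important) and (very important) text",
                "\\(.*?\\)"
        );
        System.out.println(sketch.tokens());
    }
}
```

Because no `Matcher` is stored, repeated calls to `tokens()` return the same result, which a field-level `Matcher` would not guarantee after `find()` had advanced.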

}
final BloomFilter filter = BloomFilter.create(expected.longValue(), fpp);
// if no pattern use tokenized value (currently BLOOMDB.FILTERTYPE.PATTERN is NOT NULL)
if (pattern == null) {
Member

object is configurable

Contributor Author

Refactored the object so that it is no longer configurable.

@@ -64,18 +65,6 @@ void testSingleToken() {
Assertions.assertEquals(e, condition.toString());
}

@Test
Member

Removing tests is not a good idea. Please comment on why the tests were removed.

Contributor Author

Previously, the result of condition() varied with the number of tokens generated from the input given to PatternMatchCondition(input). After the change to support regex-extracted tokens, the input is no longer tokenized, which makes the removed test a duplicate of the first test.

…d and make it unconfigurable, make matcher immutable
@elliVM elliVM requested a review from kortemik October 23, 2024 10:14
"Trying to insert empty filter, pattern match joined table should always have tokens"
);
}
final BloomFilter filter = BloomFilter.create(1000, 0.01);
Member

1000 and 0.01: why not expected and fpp?

Contributor

@51-code 51-code left a comment

Added some change requests. Also, some classes are missing equals or hashCode methods, at least CategoryTableImpl and CategoryTableWithFilters.
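
As a reminder of the usual shape of such methods, here is a minimal sketch on a hypothetical immutable class. `NamedTableSketch` and its two fields are invented for the example and do not reflect the fields of CategoryTableImpl or CategoryTableWithFilters.

```java
import java.util.Objects;

// Hypothetical value object; fields are assumptions chosen for illustration.
final class NamedTableSketch {

    private final String name;
    private final long filterTypeId;

    NamedTableSketch(final String name, final long filterTypeId) {
        this.name = name;
        this.filterTypeId = filterTypeId;
    }

    @Override
    public boolean equals(final Object o) {
        if (this == o) {
            return true;
        }
        if (o == null || getClass() != o.getClass()) {
            return false;
        }
        final NamedTableSketch other = (NamedTableSketch) o;
        // Equality is defined by the same fields that feed hashCode().
        return filterTypeId == other.filterTypeId && Objects.equals(name, other.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, filterTypeId);
    }
}
```

Keeping both methods based on the same set of fields preserves the contract that equal objects have equal hash codes.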

return filterBAOS.toByteArray();
}
catch (IOException e) {
throw new UncheckedIOException(new IOException("Error writing filter bytes: " + e.getMessage()));
Contributor

Instead of creating a new IOException here, the existing one could be passed to the UncheckedIOException using the other constructor: UncheckedIOException(String message, IOException cause)

Contributor Author

refactored to use constructor
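
The suggested constructor, `UncheckedIOException(String message, IOException cause)`, keeps the caught exception as the cause instead of copying only its message into a new one. A small sketch, with a hypothetical helper class `WrapCauseSketch` and an invented sample message:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, written only to illustrate the two-argument constructor.
final class WrapCauseSketch {

    static UncheckedIOException wrap(final IOException cause) {
        // The original exception, including its stack trace, stays reachable via getCause().
        return new UncheckedIOException("Error writing filter bytes", cause);
    }

    public static void main(final String[] args) {
        final IOException original = new IOException("disk full");
        final UncheckedIOException wrapped = wrap(original);
        System.out.println(wrapped.getMessage());
        System.out.println(wrapped.getCause() == original);
    }
}
```

Wrapping a freshly created IOException, as in the reviewed line, would discard the original stack trace, which is the information most useful when diagnosing the failure.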

DSL.val(record.getValue(BLOOMDB.FILTERTYPE.ID), ULong.class),
DSL.val(filter.bytes(), byte[].class)
};
ctx.insertInto(categoryTable).columns(insertFields).values(valueFields).execute();
Contributor

I looked into the docs, and I wonder if it's possible to return a Query from this function (probably what the values() function returns before the execute function is called). Currently it feels wrong that the filters are executed here, which makes this kind of a utility class for the CategoryTable. If it returned the Query (in my mind this is the table filter itself), then it could be executed in the CategoryTableWithFilters object. Does this make any sense?

Contributor Author

TableFilters now returns a SafeBatch class that wraps a jooq.Batch class, CategoryTableWithFilters will execute the SafeBatch.

@elliVM elliVM requested a review from 51-code October 29, 2024 13:12
@elliVM elliVM requested review from eemhu and kortemik October 31, 2024 12:21
pom.xml (outdated, resolved)
@elliVM elliVM requested a review from kortemik November 1, 2024 13:29
Member

kortemik commented Nov 5, 2024

@eemhu @51-code please re-review

Contributor

@eemhu eemhu left a comment

changes lgtm

Contributor

@51-code 51-code left a comment

LGTM

@elliVM elliVM requested a review from kortemik November 11, 2024 08:57
@kortemik kortemik merged commit 51c80ca into teragrep:main Nov 11, 2024
1 check passed
4 participants