Right now, when you have two records (identified by the same key+timestamp), the one from the most recent commit takes precedence. This issue is going to decide how to support aggregating those conflicts as opposed to just discarding the old value.
User stories
One common use case is counting events, for example recording the number of events once per day. If you have multiple sources of this data, each source accumulates into the same counter.
Maybe a user measures temperature. In this case, we want to store the minimum and maximum values, which means min and max are the aggregation functions.
Maybe the user stores actual error messages. If you can receive more than one message per timestamp, you might want to concatenate them into a single string. Therefore, it'd be best to have a "join with delimiter" aggregation.
By combining a record that has two summed fields, one holding a count and one holding a total, you also have enough information to produce the mean.
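As a purely illustrative example: three sources reporting (count=2, total=10.0), (count=3, total=21.0) and (count=1, total=4.0) would aggregate to (count=6, total=35.0), from which the mean 35.0 / 6 ≈ 5.83 can be computed at read time.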
File format
I think it makes sense to store the aggregation method in the format string. The aggregation method should never change, and the format string is only stored once, so it's efficient.
The format strings right now are single-character codes, like uff representing an unsigned 32-bit integer and two 32-bit floats. I propose a prefix or suffix on each code indicating the aggregation:
For example, +u9f0f could represent "addition for the u", "maximum for the first f" and "minimum for the second f". I'm not too attached to the particular representation or even to it being constrained to single characters (in fact, it can't be if you need to specify the delimiter). A more complete list, with a few combined examples after it:
+ sum
9 maximum
0 minimum
| join with delimiter. The following character must then be " followed by the actual delimiter, backslash-escaped, and then another ". For example, |"," for delimiting with a comma.
No character at all, which means "replace".
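To show how the markers compose, here are a few hypothetical format strings under the proposed syntax (illustrative only, not existing sonnerie syntax):
+u9f0f sum the u32, take the maximum of the first f32 and the minimum of the second f32
|","s+u join the string column with a comma and sum the u32
|"\""s join with a literal double quote, using the backslash escape
uff no markers at all, so every column keeps the current "replace" behavior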
API
Right now, you can make records with record. We would need a new function like record_agg which generates the format string with the appropriate marker. For example:
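A very rough sketch of what that API could look like (everything below, including record_agg, Agg and add_record, is a placeholder for illustration, not sonnerie's actual API):
// Hypothetical builder: each column carries its value plus an aggregation
// marker, and the format string ("+u9f0f") is derived from the markers
// rather than typed by hand.
let rec = record_agg()
    .u32(Agg::Sum, 3)      // contributes "+u"
    .f32(Agg::Max, 32.0)   // contributes "9f"
    .f32(Agg::Min, 19.0);  // contributes "0f"
tx.add_record("key", timestamp, rec)?;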
Applying the aggregate
Right now, Merge::discard_repetitions will just keep on reading values from all the transactions until it gets the last one for a given key+timestamp. Instead, Merge should apply the correct aggregate for each column.
A compaction uses Merge directly, so compaction doesn't need any special behavior.
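A rough sketch of the per-column merge logic (the Value and Aggregate types here are illustrative stand-ins, not sonnerie's internal representation):
#[derive(Clone)]
enum Value { U64(u64), F64(f64), Str(String) }

enum Aggregate { Replace, Sum, Min, Max, Join(String) }

// Fold a newer colliding record into the accumulated one, column by
// column and in commit order, instead of discarding the older record.
// Only a few type combinations are shown.
fn merge_columns(acc: &mut [Value], newer: &[Value], aggs: &[Aggregate]) {
    for ((a, n), agg) in acc.iter_mut().zip(newer).zip(aggs) {
        match (agg, &mut *a, n) {
            (Aggregate::Sum, Value::U64(x), Value::U64(y)) => *x += y,
            (Aggregate::Sum, Value::F64(x), Value::F64(y)) => *x += y,
            (Aggregate::Min, Value::F64(x), Value::F64(y)) => *x = (*x).min(*y),
            (Aggregate::Max, Value::F64(x), Value::F64(y)) => *x = (*x).max(*y),
            (Aggregate::Join(d), Value::Str(x), Value::Str(y)) => {
                x.push_str(d);
                x.push_str(y);
            }
            // Replace, or mismatched types with no lossless widening:
            // fall back to keeping the newer value.
            (_, a, n) => *a = n.clone(),
        }
    }
}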
When the aggregate is impossible to apply in some manner
What if the data types don't match? For example, you're using the "summation" operator but one field is an integer and the other is a float, or one is numeric and the other is a string. I think the solution is to "try to do the correct thing" and then fall back on replacing the value.
What this means is that if we can guarantee a lossless conversion, then the operator can still apply. For example, if you're doing addition on an f32 and an f64, we can convert that f32 into an f64 and still do the summation.
In the case of that lossless conversion, the datatype should then become the "wider" of the two, even if the wider of the two is in the later transaction. That is because, if a long-running program commits its transaction after newer processes, it would be surprising for your data to suddenly become corrupt just because of that commit order.
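A minimal sketch of that widening rule for a summed column (Num is a stand-in type for illustration):
#[derive(Clone, Copy)]
enum Num { F32(f32), F64(f64) }

// Adding an f32 to an f64 is lossless once the f32 is widened, so the sum
// still happens and the result keeps the wider type, no matter which
// transaction (older or newer) carried the f64.
fn sum_widening(older: Num, newer: Num) -> Num {
    match (older, newer) {
        (Num::F32(a), Num::F32(b)) => Num::F32(a + b),
        (Num::F64(a), Num::F64(b)) => Num::F64(a + b),
        (Num::F32(a), Num::F64(b)) | (Num::F64(b), Num::F32(a)) => {
            Num::F64(f64::from(a) + b)
        }
    }
}
This matches the "Support for widening" example below, where +f 1.0, +F 2.0 and +f 3.0 read back as +F 6.0.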
When the aggregate itself conflicts
The order of transactions isn't defined until commit time. That means that if multiple transactions specify different aggregates for the same records, it's probably just user error, because there's no way to make mathematical sense of it. Practically speaking, when the merging occurs there is a defined order to the records, so the aggregate can just be applied in that order. No special work needs to occur.
CLI
The CLI expects the user to enter valid format strings. We can just leave that as it is until we provide a more user-friendly UI.
sonnerie-serve
sonnerie-serve, like the CLI, accepts format strings in the stream. Therefore nothing special needs to be done there either.
Hmm. My first thought is that this makes the format strings nigh unreadable, and my second is to bikeshed.
But I think it might be more helpful to focus on one specific use case. Suppose we are collecting latencies, perhaps for database queries. How would we collect the mean, max, 99th percentile, 90th percentile, etc?
Examples
Support for widening
If you create three separate transactions, the final value is the result of combining the values with the aggregate function:
key 2023-01-01T00:00:00 +f 1.0
key 2023-01-01T00:00:00 +F 2.0
key 2023-01-01T00:00:00 +f 3.0
You should read back one record:
key 2023-01-01T00:00:00 +F 6.0
Strings
String columns have their values joined with the delimiter:
key 2023-01-01T00:00:00 |","s One
key 2023-01-01T00:00:00 |","s Two
key 2023-01-01T00:00:00 |","s Three
Read back:
key 2023-01-01T00:00:00 |","s One,Two,Three
Multiple columns
Each column has its own aggregation:
key 2023-01-01T00:00:00 +u9f0f 3 32.0 19.0
key 2023-01-01T00:00:00 +u9f0f 5 48.0 21.0
key 2023-01-01T00:00:00 +u9f0f 7 23.0 6.0
Read back:
key 2023-01-01T00:00:00 +u9f0f 15 48.0 6.0
Conflicting data types
If there's a conflict in the data type and widening can't occur, then just retain the value from the newest transaction:
key 2023-01-01T00:00:00 +u 12
key 2023-01-01T00:00:00 +f 19.0
Read back:
key 2023-01-01T00:00:00 +f 19.0
Retain old behavior
Columns without an aggregate marker just take the value from the most recent transaction:
key 2023-01-01T00:00:00 f+u 4.0 4
key 2023-01-01T00:00:00 f+u 2.0 6
Read back:
key 2023-01-01T00:00:00 f+u 2.0 10