Right now, when you have two records (identified by the same key+timestamp), the one from the most recent commit takes precedence. This issue is going to decide how to support aggregating those conflicts as opposed to just discarding the old value.
User stories
One common use case is counting events, for example recording the number of events once per day. If you have multiple sources of this data, each source accumulates into the same counter.
Maybe a user measures temperature. In this case, we want to store the minimum and maximum values, which means min and max are the aggregation functions.
Maybe the user stores actual error messages. If you can receive more than one message per timestamp, you might want to concatenate them into a single string. Therefore, it'd be best to have a "join with delimiter" aggregation.
By combining a record that has two summed fields, one holding a count and one holding a total, you also have enough information to produce the mean.
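As a purely illustrative example: three sources reporting (count=2, total=10.0), (count=3, total=21.0) and (count=1, total=4.0) would aggregate to (count=6, total=35.0), from which the mean 35.0 / 6 ≈ 5.83 can be computed at read time.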
File format
I think it makes sense to store the aggregation method in the format string. The aggregation method should never change, and the format string is only stored once, so it's efficient.
The format strings right now are single-character codes, like uff representing an unsigned 32-bit integer and two 32-bit floats. I propose a prefix or suffix on each code indicating the aggregation:
For example, +u9f0f could represent "addition for the u", "maximum for the first f" and "minimum for the second f". I'm not too attached to the particular representation or even to it being constrained to single characters (in fact, it can't be if you need to specify the delimiter). A more complete list, with a few combined examples after it:
+ sum
9 maximum
0 minimum
| join with delimiter. The following character must then be " followed by the actual delimiter, backslash-escaped, and then another ". For example, |"," for delimiting with a comma.
No character at all, which means "replace".
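To show how the markers compose, here are a few hypothetical format strings under the proposed syntax (illustrative only, not existing sonnerie syntax):
+u9f0f sum the u32, take the maximum of the first f32 and the minimum of the second f32
|","s+u join the string column with a comma and sum the u32
|"\""s join with a literal double quote, using the backslash escape
uff no markers at all, so every column keeps the current "replace" behavior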
API
Right now, you can make records with record. We would need a new function like record_agg which generates the format string with the appropriate marker. For example:
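A very rough sketch of what that API could look like (everything below, including record_agg, Agg and add_record, is a placeholder for illustration, not sonnerie's actual API):
// Hypothetical builder: each column carries its value plus an aggregation
// marker, and the format string ("+u9f0f") is derived from the markers
// rather than typed by hand.
let rec = record_agg()
    .u32(Agg::Sum, 3)      // contributes "+u"
    .f32(Agg::Max, 32.0)   // contributes "9f"
    .f32(Agg::Min, 19.0);  // contributes "0f"
tx.add_record("key", timestamp, rec)?;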
Applying the aggregate
Right now, Merge::discard_repetitions will just keep on reading values from all the transactions until it gets the last one for a given key+timestamp. Instead, Merge should apply the correct aggregate for each column.
A compaction uses Merge directly, so compaction doesn't need any special behavior.
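A rough sketch of the per-column merge logic (the Value and Aggregate types here are illustrative stand-ins, not sonnerie's internal representation):
#[derive(Clone)]
enum Value { U64(u64), F64(f64), Str(String) }

enum Aggregate { Replace, Sum, Min, Max, Join(String) }

// Fold a newer colliding record into the accumulated one, column by
// column and in commit order, instead of discarding the older record.
// Only a few type combinations are shown.
fn merge_columns(acc: &mut [Value], newer: &[Value], aggs: &[Aggregate]) {
    for ((a, n), agg) in acc.iter_mut().zip(newer).zip(aggs) {
        match (agg, &mut *a, n) {
            (Aggregate::Sum, Value::U64(x), Value::U64(y)) => *x += y,
            (Aggregate::Sum, Value::F64(x), Value::F64(y)) => *x += y,
            (Aggregate::Min, Value::F64(x), Value::F64(y)) => *x = (*x).min(*y),
            (Aggregate::Max, Value::F64(x), Value::F64(y)) => *x = (*x).max(*y),
            (Aggregate::Join(d), Value::Str(x), Value::Str(y)) => {
                x.push_str(d);
                x.push_str(y);
            }
            // Replace, or mismatched types with no lossless widening:
            // fall back to keeping the newer value.
            (_, a, n) => *a = n.clone(),
        }
    }
}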
When the aggregate is impossible to apply in some manner
What if the data types don't match? For example, you're using the "summation" operator but one field is an integer and the other is a float, or one is numeric and the other is a string. I think the solution is to "try to do the correct thing" and then fall back on replacing the value.
What this means is that if we can guarantee a lossless conversion, then the operator can still apply. For example, if you're doing addition on an f32 and an f64, we can convert that f32 into an f64 and still do the summation.
In the case of that lossless conversion, the datatype should then become the "wider" of the two, even if the wider of the two is in the later transaction. That is because, if a long-running program commits its transaction after newer processes, it would be surprising for your data to suddenly become corrupt just because of that commit order.
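A minimal sketch of that widening rule for a summed column (Num is a stand-in type for illustration):
#[derive(Clone, Copy)]
enum Num { F32(f32), F64(f64) }

// Adding an f32 to an f64 is lossless once the f32 is widened, so the sum
// still happens and the result keeps the wider type, no matter which
// transaction (older or newer) carried the f64.
fn sum_widening(older: Num, newer: Num) -> Num {
    match (older, newer) {
        (Num::F32(a), Num::F32(b)) => Num::F32(a + b),
        (Num::F64(a), Num::F64(b)) => Num::F64(a + b),
        (Num::F32(a), Num::F64(b)) | (Num::F64(b), Num::F32(a)) => {
            Num::F64(f64::from(a) + b)
        }
    }
}
This matches the "Support for widening" example below, where +f 1.0, +F 2.0 and +f 3.0 read back as +F 6.0.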
When the aggregate itself conflicts
The order of transactions isn't defined until commit time. That means that if multiple transactions specify different aggregates for the same records, it's probably just user error, because there's no way to make mathematical sense of it. Practically speaking, when the merging occurs there is a defined order to the records, so the aggregate can just be applied in that order. No special work needs to occur.
CLI
The CLI expects the user to enter valid format strings. We can just leave that as it is until we provide a more user-friendly UI.
sonnerie-serve
sonnerie-serve, like the CLI, accepts format strings in the stream. Therefore nothing special needs to be done there either.
Hmm. My first thought is that this makes the format strings nigh unreadable, and my second is to bikeshed.
But I think it might be more helpful to focus on one specific use case. Suppose we are collecting latencies, perhaps for database queries. How would we collect the mean, max, 99th percentile, 90th percentile, etc?
Examples
Support for widening
If you create three separate transactions, the final value is the result of combining the values with the aggregate function:
key 2023-01-01T00:00:00 +f 1.0
key 2023-01-01T00:00:00 +F 2.0
key 2023-01-01T00:00:00 +f 3.0
You should read back one record:
key 2023-01-01T00:00:00 +F 6.0
Strings
String columns have their values joined with the delimiter:
key 2023-01-01T00:00:00 |","s One
key 2023-01-01T00:00:00 |","s Two
key 2023-01-01T00:00:00 |","s Three
Read back:
key 2023-01-01T00:00:00 |","s One,Two,Three
Multiple columns
Each column has its own aggregation:
key 2023-01-01T00:00:00 +u9f0f 3 32.0 19.0
key 2023-01-01T00:00:00 +u9f0f 5 48.0 21.0
key 2023-01-01T00:00:00 +u9f0f 7 23.0 6.0
Read back:
key 2023-01-01T00:00:00 +u9f0f 15 48.0 6.0
Conflicting data types
If there's a conflict in the data type and widening can't occur, then just retain the value from the newest transaction:
key 2023-01-01T00:00:00 +u 12
key 2023-01-01T00:00:00 +f 19.0
Read back:
key 2023-01-01T00:00:00 +f 19.0
Retain old behavior
Columns without an aggregate marker just take the value from the most recent transaction:
key 2023-01-01T00:00:00 f+u 4.0 4
key 2023-01-01T00:00:00 f+u 2.0 6
Read back:
key 2023-01-01T00:00:00 f+u 2.0 10