
storage: switch sources to columnated DataflowError #27833

Merged

Conversation

@petrosagg (Contributor) commented Jun 24, 2024

Motivation

This change enables storage dataflows to make use of columnated/lgalloced containers when moving data between operators. We could have gone the other way and implemented a columnation region for the `SourceReaderError` type, but since there aren't any actual downsides to switching to `DataflowError`, we go for this simpler approach.
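
As a rough illustration of the simpler approach, here is a minimal sketch of converting the source-specific error into the shared dataflow error at the operator boundary, so that only the shared type needs columnation-aware container support. The type definitions are simplified stand-ins rather than the actual Materialize definitions, and `into_dataflow_result` is a hypothetical helper, not a function in the codebase.

```rust
// Simplified stand-ins for the real Materialize types; the actual
// definitions carry more variants and structured payloads.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct SourceReaderError {
    pub inner: String,
}

#[derive(Clone, Debug, PartialEq, Eq)]
pub enum DataflowError {
    SourceError(String),
    DecodeError(String),
}

impl From<SourceReaderError> for DataflowError {
    fn from(err: SourceReaderError) -> Self {
        DataflowError::SourceError(err.inner)
    }
}

/// Hypothetical helper: convert at the edge of the source operator so that
/// downstream operators only ever exchange `DataflowError`, the type that
/// gets the columnated/lgalloced container support.
pub fn into_dataflow_result<T>(
    result: Result<T, SourceReaderError>,
) -> Result<T, DataflowError> {
    result.map_err(DataflowError::from)
}
```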

Tips for reviewer

Checklist

@petrosagg requested a review from a team as a code owner June 24, 2024 11:46
@@ -38,7 +37,7 @@ message ProtoSourceErrorDetails {
 }
 
 message ProtoSourceError {
-    mz_repr.global_id.ProtoGlobalId source_id = 1;
+    reserved 1;
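
(`reserved 1;` prevents the removed field number from ever being reassigned to a new field, which could otherwise cause previously encoded messages to be misread.)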

@petrosagg (PR author) commented on this change:

Since persist consolidates based on byte contents while at runtime we consolidate based on the Rust Eq implementation, this change has the potential to leave permanently uncompacted values in shards that contain errors written before the upgrade that get retracted after the upgrade.

This is purely storage overhead; nothing changes as far as dataflow results are concerned. The expectation is that the number of errors is too small to be something to worry about. Eventually we want persist's compaction to work identically to the in-memory one.
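
To make the mismatch concrete, here is a small self-contained sketch (illustrative only; the byte encodings below are made up rather than persist's actual codec output) of consolidating by encoded bytes versus by decoded value. An error written with the old schema and retracted with the new one encodes to different byte strings, so the +1/-1 pair never cancels under byte-based consolidation, while Eq-based consolidation of the decoded values cancels it.

```rust
use std::collections::BTreeMap;

// Consolidate by encoded bytes, roughly how persist's compaction sees the data.
fn consolidate_by_bytes(updates: &[(Vec<u8>, i64)]) -> Vec<(Vec<u8>, i64)> {
    let mut acc: BTreeMap<Vec<u8>, i64> = BTreeMap::new();
    for (bytes, diff) in updates {
        *acc.entry(bytes.clone()).or_insert(0) += *diff;
    }
    acc.into_iter().filter(|(_, d)| *d != 0).collect()
}

// Consolidate by decoded value, roughly how in-memory consolidation works via Eq/Ord.
fn consolidate_by_value<T: Ord + Clone>(updates: &[(T, i64)]) -> Vec<(T, i64)> {
    let mut acc: BTreeMap<T, i64> = BTreeMap::new();
    for (value, diff) in updates {
        *acc.entry(value.clone()).or_insert(0) += *diff;
    }
    acc.into_iter().filter(|(_, d)| *d != 0).collect()
}

fn main() {
    // Hypothetical encodings of the "same" error before and after the schema
    // change: the old bytes still carry the now-removed source_id field.
    let old_bytes = b"source_id=u1;error=oops".to_vec();
    let new_bytes = b"error=oops".to_vec();

    // Byte-based consolidation cannot cancel the +1 / -1 pair ...
    let in_shard = consolidate_by_bytes(&[(old_bytes, 1), (new_bytes, -1)]);
    assert_eq!(in_shard.len(), 2);

    // ... while value-based consolidation cancels it, because Eq considers
    // the decoded errors equal.
    let decoded = "oops".to_string();
    let in_memory = consolidate_by_value(&[(decoded.clone(), 1), (decoded, -1)]);
    assert!(in_memory.is_empty());
}
```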

A reviewer (Member) replied:

> The expectation is that the number of errors is too small to be something to worry about.

History tells us this is a wrong assumption (a while ago a customer had 50+ GiB of errors).

A more important question: did we ever write this down in persist? If yes, this might be a scary change to make because it has the potential to change how we surface errors. Or at least, we need to show how it changes.

@petrosagg replied:

> History tells us this is a wrong assumption (a while ago a customer had 50+ GiB of errors).

I should have been more specific. Those 50+ GiB are most likely decode errors (i.e. not this error variant). This error variant is almost always produced when a source hits a fatal error; after that the source advances to the empty frontier and no more output is produced. Beyond those fatal cases, some sources (MySQL and Postgres) have the ability to retract this specific error variant in some cases (e.g. if we encounter malformed UTF-8 data). These cases are very rare, if they happen at all.

So it is true that this specific error variant is both:

  1. Not produced in large quantities
  2. When it is produced it is not retracted anyway

> A more important question: did we ever write this down in persist? If yes, this might be a scary change to make because it has the potential to change how we surface errors. Or at least, we need to show how it changes.

We do. The only change is that the error message of this particular error variant will no longer show the global ID of the source, which is in line with every other error that can happen.

@petrosagg requested a review from rjobanp June 24, 2024 11:50
@petrosagg merged commit 6b86f93 into MaterializeInc:main Jun 24, 2024
76 checks passed
@petrosagg deleted the sources-with-dataflow-error branch June 24, 2024 15:06