changefeedccl: Async flushes may produce out of order events #89683

Closed
miretskiy opened this issue Oct 10, 2022 · 16 comments · Fixed by #90011
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc

Comments

@miretskiy
Contributor

miretskiy commented Oct 10, 2022

Failure observed on master:

Failed
```
=== RUN   TestChangefeedNemeses/cloudstorage
    helpers_test.go:743: making server as system tenant
    helpers_test.go:830: making cloudstorage feed factory
    nemeses_test.go:40: topic foo partition : saw new row timestamp 1665420895637060333.0000000000 after 1665420896550554241.0000000000 was seen
    nemeses_test.go:40: topic foo partition : saw new row timestamp 1665420895561568242.0000000000 after 1665420896550554241.0000000000 was seen
    --- FAIL: TestChangefeedNemeses/cloudstorage (11.99s)
```

The test ran with parallel workers, and async flushing enabled.

My best guess is that async flushing could produce the following:

* k@t1 is written to sink and causes flush f1
* k@t2 is written to sink and causes flush f2
* f2 completes before f1
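
A minimal sketch of that guessed race, with made-up helper names (not the actual sink code): two asynchronous flushes are started in order, but nothing forces them to complete in order.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// flushAsync simulates writing one file to cloud storage in the background.
// Write latency is random, so a later flush can finish before an earlier one.
func flushAsync(wg *sync.WaitGroup, done chan<- string, file string) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond) // simulated blob-store latency
		done <- file                                                // file becomes visible to readers here
	}()
}

func main() {
	var wg sync.WaitGroup
	done := make(chan string, 2)

	flushAsync(&wg, done, "f1 (k@t1)") // older version of key k
	flushAsync(&wg, done, "f2 (k@t2)") // newer version of key k

	wg.Wait()
	close(done)

	// A reader listing the bucket between the two completions may see f2
	// (and thus k@t2) while f1 (k@t1) does not exist yet.
	for file := range done {
		fmt.Println("flush completed:", file)
	}
}
```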

(thanks @ajwerner for reporting this)

Jira issue: CRDB-20371

Epic CRDB-11732

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels Oct 10, 2022
@blathers-crl

blathers-crl bot commented Oct 10, 2022

cc @cockroachdb/cdc

@miretskiy
Contributor Author

Come to think of it, I don't know if the above analysis is correct:

The destination file name is generated synchronously, so in the above scenario:

* k@t1 is written to sink and causes flush f1
* k@t2 is written to sink and causes flush f2
* f2 completes before f1

f2, even if it completes before f1, would have a (lexicographically) later file name.
Hmm... The plot thickens.
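
A toy illustration of the synchronous-naming point, with made-up names and format (not the sink's real naming scheme): even when f2's upload finishes first, its name still sorts after f1's in a bucket listing.

```go
package main

import (
	"fmt"
	"sort"
)

var fileSeq int64

// nextFileName is called synchronously, before the async flush is started,
// so names are assigned in emit order even if uploads complete out of order.
func nextFileName(topic string) string {
	fileSeq++
	return fmt.Sprintf("%020d-%s.ndjson", fileSeq, topic)
}

func main() {
	f1 := nextFileName("foo") // will hold k@t1
	f2 := nextFileName("foo") // will hold k@t2

	// Even if f2 lands in the bucket first, a consumer that sorts the final
	// listing lexicographically still reads f1 before f2.
	listing := []string{f2, f1} // bucket listing once both flushes complete
	sort.Strings(listing)
	fmt.Println(listing)
}
```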

@miretskiy
Contributor Author

/cc @jayshrivastava The above failure definitely occurred w/ multiple workers active.

@jayshrivastava
Contributor

Is the nemeses test aware of file names? It looks like it just validates ordering in terms of rows.

@miretskiy
Contributor Author

miretskiy commented Oct 12, 2022

I haven't looked at the code recently -- do you know if what it's doing is correct (i.e. do we know that it reads the files in order)?

@ajwerner
Contributor

I think it does know about the file names, but it's wrong if a later file shows up before an earlier file, no?

@miretskiy
Contributor Author

miretskiy commented Oct 12, 2022

Well, it's not wrong if it happens before RESOLVED, is it?
That is: I think it's fine for an earlier file to show up after a later one, as long as they both show up before RESOLVED?
@ajwerner ?

@miretskiy
Contributor Author

I don't know; I think making things appear in order is good regardless.

@ajwerner
Contributor

> Well, it's not wrong if it happens before RESOLVED, is it?

I think it is wrong. Regardless of RESOLVED, we say that you won't see a newer version of a row before seeing an unseen older version.

@jayshrivastava
Contributor

I think correctness depends on how you define a new row and how you implement the end consumer. Before, you could consume files in the order they came. Post #88395, you need to check for monotonically increasing file IDs (which may not be so easy due to retries / skipped IDs) and consume them in that order.

I don't believe we added this information to our docs, so we should add it. We could push some updates on our end to make implementing a consumer easier.

@ajwerner
Contributor

> Post #88395, you need to check for monotonically increasing file IDs (which may not be so easy due to retries / skipped IDs) and consume them in that order.

I don't think your language is sufficiently precise. If the sequence is 1, 3, the IDs are monotonically increasing. I don't think we should accept this redefinition of correctness.

@miretskiy
Contributor Author

> I think it is wrong.

Things might not be as simple as that. Adding a queue or some such is not a problem. The problem is that
such a change will largely remove the benefits of having async flushes in the first place -- you only gain
the ability to buffer the next set of data, but you won't be able to write multiple files concurrently.

Our docs describe the resolved option as:

> Resolved timestamp notifications on every Kafka partition can be used to provide strong ordering and global consistency guarantees by buffering records in between timestamp closures. Use the "resolved" timestamp to see every row that changed at a certain time.

The resolved option is what guided me in the first place: that's the real oracle I was concerned about.
I don't know if we implied stronger guarantees in our docs.
I think at best, it's a grey area.

@renatolabs
Contributor

> I don't know if we implied stronger guarantees in our docs.

FWIW, I have been reading the docs on this recently as part of working on cdc/mixed-versions, and the ordering guarantee (regardless of resolved) seems pretty explicit to me:

> Once a row has been emitted with some timestamp, no previously unseen versions of that row will be emitted with a lower timestamp. That is, you will never see a new change for that row at an earlier timestamp.
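
That is essentially the invariant the nemeses failure above tripped. A minimal validator sketch of that per-row guarantee (hypothetical code, not the test's actual implementation):

```go
package main

import "fmt"

// orderValidator checks the documented per-row ordering guarantee: once a
// key has been seen at timestamp t, no previously unseen version of that
// key may arrive with a timestamp lower than t.
type orderValidator struct {
	maxSeen map[string]int64          // highest timestamp emitted so far, per key
	seen    map[string]map[int64]bool // all timestamps emitted so far, per key
}

func newOrderValidator() *orderValidator {
	return &orderValidator{
		maxSeen: map[string]int64{},
		seen:    map[string]map[int64]bool{},
	}
}

func (v *orderValidator) note(key string, ts int64) error {
	if v.seen[key] == nil {
		v.seen[key] = map[int64]bool{}
	}
	// Re-emitting an already-seen version is fine; a *new* older version is not.
	if ts < v.maxSeen[key] && !v.seen[key][ts] {
		return fmt.Errorf("saw new row timestamp %d after %d was seen", ts, v.maxSeen[key])
	}
	v.seen[key][ts] = true
	if ts > v.maxSeen[key] {
		v.maxSeen[key] = ts
	}
	return nil
}

func main() {
	v := newOrderValidator()
	fmt.Println(v.note("foo", 2)) // <nil>
	fmt.Println(v.note("foo", 1)) // error: previously unseen older version after a newer one
}
```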

@ajwerner
Contributor

I agree with Renato.

> The problem is that such a change will largely remove the benefits of having async flushes in the first place -- you only gain the ability to buffer the next set of data, but you won't be able to write multiple files concurrently.

I think the ability to overlap writing one file while filling another is worth something. If you want more concurrency, then you need to create more ChangeAggregators or do something else. I don't feel okay giving up this correctness claim without at least opting into it. It seems pretty major to me.

@ajwerner
Contributor

@miretskiy and I spoke a bit offline. It seems like in the initial scan/export stage of things, we can safely flush in parallel because we know that the files have disjoint key sets. In that case, we can avoid any queuing. That's the high-throughput case we know about. In the steady state, where it's unsafe, we almost certainly can't sustain a higher throughput than a single write to the blob store anyway, so it probably doesn't matter.

There are some details to be worked out with regard to pulling this off, but it seems like the right answer.

@ajwerner
Contributor

One simple approach would be to track the minimum and maximum timestamp of rows in a file, and also track the files which are currently being written. If the file you want to write only has data at exactly the same timestamp as all the files currently being written, then it can skip the queue: if all the data is at the same timestamp, there is no hazard. That also happens to be the situation when you are doing an initial scan or export.
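
A rough sketch of that check, assuming (this is not spelled out further in the issue) that each file records the min/max timestamps of its rows:

```go
package main

import "fmt"

// fileSpan tracks the minimum and maximum row timestamps contained in a file.
type fileSpan struct {
	minTS, maxTS int64
}

// singleTS reports whether all rows in the file share one timestamp.
func (f fileSpan) singleTS() bool { return f.minTS == f.maxTS }

// canSkipQueue returns true when flushing `next` concurrently with the
// in-flight files cannot reorder row versions: every file involved holds
// data at exactly the same single timestamp (e.g. an initial scan or export).
func canSkipQueue(next fileSpan, inFlight []fileSpan) bool {
	if !next.singleTS() {
		return false
	}
	for _, f := range inFlight {
		if !f.singleTS() || f.minTS != next.minTS {
			return false
		}
	}
	return true
}

func main() {
	backfill := []fileSpan{{minTS: 100, maxTS: 100}}
	fmt.Println(canSkipQueue(fileSpan{minTS: 100, maxTS: 100}, backfill)) // true: same single timestamp, no hazard
	fmt.Println(canSkipQueue(fileSpan{minTS: 100, maxTS: 250}, backfill)) // false: must queue behind in-flight flushes
}
```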

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 14, 2022
When async flushing is enabled, the following sequence of events is
possible (even if very unlikely):

* k@t1 is emitted, causing async flush to write file f1
* k@t2 is emitted, causing async flush to write file f2
* f2 is written out before f1

In this unlikely scenario -- and the reason it is unlikely is that we
have to generate megabytes of data to cause a flush, unless the file_size
parameter was pathologically small -- if a client were to read the
contents of the directory right after step 2, and then read the directory
again after step 3, the client would first observe k@t2 and then observe
k@t1 -- a violation of the ordering guarantees.

This PR fixes the issue by adopting a queue so that if there is a
pending flush in flight, the next flush is queued behind it.

It is possible that this simple approach may result in a throughput
decrease.  That would happen if the underlying storage (S3 or GCP)
cannot keep up with writing out data before the next file comes in.
However, at this point, such a situation is unlikely for two reasons:
one, the rest of the changefeed machinery must be able to generate one
or more files' worth of data before the previous flush completes -- no
trivial task in and of itself -- and two, the changefeed must have
enough memory allocated to it that the pushback mechanism does not
trigger.  This assumption was validated in a reasonably sized test --
i.e., a non-zero queue depth was never observed.  Nonetheless, this PR
also adds a log message which may help detect the situation where the
sink cannot keep up with the incoming data rate.

A more complex solution -- for example, allowing unlimited in-flight
requests during backfill -- may be revisited later if the above
assumption proves incorrect.

Fixes cockroachdb#89683

Release note: None.
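
A minimal sketch of the queueing behavior the commit message describes (hypothetical types; not the actual cloud storage sink code): flushes are drained by a single worker, so at most one upload is in flight and files land in emit order, while a non-empty queue is logged as a sign the sink is not keeping up.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncFlusher serializes flushes: a single worker drains the queue, so
// files are written out in the order they were enqueued, while the caller
// is free to start filling the next file.
type asyncFlusher struct {
	queue chan string // file names pending upload (buffer = allowed queue depth)
	wg    sync.WaitGroup
}

func newAsyncFlusher(depth int) *asyncFlusher {
	f := &asyncFlusher{queue: make(chan string, depth)}
	f.wg.Add(1)
	go func() {
		defer f.wg.Done()
		for file := range f.queue {
			upload(file) // only one upload in flight at a time
		}
	}()
	return f
}

// enqueue blocks when the queue is full -- the pushback alluded to above --
// and logs when flushes start piling up behind a slow sink.
func (f *asyncFlusher) enqueue(file string) {
	if len(f.queue) > 0 {
		fmt.Println("warning: sink not keeping up; queue depth", len(f.queue))
	}
	f.queue <- file
}

func (f *asyncFlusher) close() {
	close(f.queue)
	f.wg.Wait()
}

func upload(file string) { fmt.Println("uploaded", file) }

func main() {
	f := newAsyncFlusher(4)
	f.enqueue("f1") // k@t1
	f.enqueue("f2") // k@t2 -- cannot land before f1
	f.close()
}
```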
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 14, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 15, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 15, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 17, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 18, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 20, 2022
craig bot pushed a commit that referenced this issue Oct 20, 2022
89966: sql: add tenant names r=postamar,stevendanna,ecwall a=knz

As spelled out by #84700, this PR adds:

-  a new `name` column to `system.tenants`;
-  a new overload to `create_tenant()` which takes the tenant name as parameter;
-  a new built-in function `rename_tenant()`.

See the enclosed commits for details. I am welcoming alternate approaches to create the name column as long as it properly enforces uniqueness, enables by-name point lookups and simplifies the task of keeping it coherent with the info payload.

Epic: CRDB-14537

90011: changefeedccl: Ensure correct file ordering. r=miretskiy a=miretskiy

Fixes #89683

Release note: None.

90236: vendor: bump Pebble to 0090519bf0bb r=itsbilal a=coolcom200

```
0090519b metrics: expose fsync latency as prometheus histogram
```

Release note: None
Epic: None

90284: multiregionccl: add a datadriven test for secondary region failover r=arulajmani a=rafiss

fixes #90169

This test shows that the lease holder and voters fail over to the secondary region when the primary region is taken offline.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Leon Fattakhov <leon.fattakhov@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@craig craig bot closed this as completed in 8f7a011 Oct 20, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 31, 2022