changefeedccl: Async flushes may produce out of order events #89683

Closed
miretskiy opened this issue Oct 10, 2022 · 16 comments · Fixed by #90011
Labels
A-cdc Change Data Capture C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cdc

Comments

@miretskiy
Contributor

miretskiy commented Oct 10, 2022

Failure observed on master:

Failed
```
=== RUN   TestChangefeedNemeses/cloudstorage
    helpers_test.go:743: making server as system tenant
    helpers_test.go:830: making cloudstorage feed factory
    nemeses_test.go:40: topic foo partition : saw new row timestamp 1665420895637060333.0000000000 after 1665420896550554241.0000000000 was seen
    nemeses_test.go:40: topic foo partition : saw new row timestamp 1665420895561568242.0000000000 after 1665420896550554241.0000000000 was seen
    --- FAIL: TestChangefeedNemeses/cloudstorage (11.99s)
```

The test ran with parallel workers, and async flushing enabled.

My best guess is that async flushing could produce the following:

* k@t1 is written to sink and causes flush f1
* k@t2 is written to sink and causes flush f2
* f2 completes before f1
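
A minimal sketch of that guessed race, with made-up helper names (not the actual sink code): two asynchronous flushes are started in order, but nothing forces them to complete in order.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// flushAsync simulates writing one file to cloud storage in the background.
// Write latency is random, so a later flush can finish before an earlier one.
func flushAsync(wg *sync.WaitGroup, done chan<- string, file string) {
	wg.Add(1)
	go func() {
		defer wg.Done()
		time.Sleep(time.Duration(rand.Intn(50)) * time.Millisecond) // simulated blob-store latency
		done <- file                                                // file becomes visible to readers here
	}()
}

func main() {
	var wg sync.WaitGroup
	done := make(chan string, 2)

	flushAsync(&wg, done, "f1 (k@t1)") // older version of key k
	flushAsync(&wg, done, "f2 (k@t2)") // newer version of key k

	wg.Wait()
	close(done)

	// A reader listing the bucket between the two completions may see f2
	// (and thus k@t2) while f1 (k@t1) does not exist yet.
	for file := range done {
		fmt.Println("flush completed:", file)
	}
}
```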

(thanks @ajwerner for reporting this)

Jira issue: CRDB-20371

Epic CRDB-11732

@miretskiy miretskiy added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-cdc Change Data Capture T-cdc labels Oct 10, 2022
@blathers-crl

blathers-crl bot commented Oct 10, 2022

cc @cockroachdb/cdc

@miretskiy
Contributor Author

Come to think of it, I don't know if the above analysis is correct:

The destination file name is generated synchronously, so in the above scenario:

* k@t1 is written to sink and causes flush f1
* k@t2 is written to sink and causes flush f2
* f2 completes before f1

f2, even if it completes before f1, would have a (lexicographically) later file name.
Hmm... The plot thickens.
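
A toy illustration of the synchronous-naming point, with made-up names and format (not the sink's real naming scheme): even when f2's upload finishes first, its name still sorts after f1's in a bucket listing.

```go
package main

import (
	"fmt"
	"sort"
)

var fileSeq int64

// nextFileName is called synchronously, before the async flush is started,
// so names are assigned in emit order even if uploads complete out of order.
func nextFileName(topic string) string {
	fileSeq++
	return fmt.Sprintf("%020d-%s.ndjson", fileSeq, topic)
}

func main() {
	f1 := nextFileName("foo") // will hold k@t1
	f2 := nextFileName("foo") // will hold k@t2

	// Even if f2 lands in the bucket first, a consumer that sorts the final
	// listing lexicographically still reads f1 before f2.
	listing := []string{f2, f1} // bucket listing once both flushes complete
	sort.Strings(listing)
	fmt.Println(listing)
}
```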

@miretskiy
Contributor Author

/cc @jayshrivastava The above failure definitely occurred w/ multiple workers active.

@jayshrivastava
Contributor

Is the nemeses test aware of file names? It looks like it just validates ordering in terms of rows.

@miretskiy
Contributor Author

miretskiy commented Oct 12, 2022

I haven't looked at the code recently -- do you know if what it's doing is correct (i.e. do we know that it reads the files in order)?

@ajwerner
Contributor

I think it does know about the file names, but it's wrong if a later file shows up before an earlier file, no?

@miretskiy
Contributor Author

miretskiy commented Oct 12, 2022

Well, it's not wrong if it happens before RESOLVED, is it?
That is: I think it's fine for an earlier file to show up after a later one, as long as they both show up before RESOLVED?
@ajwerner ?

@miretskiy
Contributor Author

I don't know; I think making things appear in order is good regardless.

@ajwerner
Contributor

> Well, it's not wrong if it happens before RESOLVED, is it?

I think it is wrong. Regardless of RESOLVED, we say that you won't see a newer version of a row before seeing an unseen older version.

@jayshrivastava
Contributor

I think correctness depends on how you define a new row and how you implement the end consumer. Before, you could consume files in the order they came. Post #88395, you need to check for monotonically increasing file IDs (which may not be so easy due to retries / skipped IDs) and consume them in that order.

I don't believe we added this information to our docs, so we should add it. We could push some updates on our end to make implementing a consumer easier.

@ajwerner
Contributor

> Post #88395, you need to check for monotonically increasing file IDs (which may not be so easy due to retries / skipped IDs) and consume them in that order.

I don't think your language is sufficiently precise. If the sequence is 1, 3, the IDs are monotonically increasing. I don't think we should accept this redefinition of correctness.

@miretskiy
Contributor Author

> I think it is wrong.

Things might not be as simple as that. Adding a queue or some such is not a problem. The problem is that
such a change will largely remove the benefits of having async flushes in the first place -- you only gain
the ability to buffer the next set of data, but you won't be able to write multiple files concurrently.

Our docs describe the resolved option as:

> Resolved timestamp notifications on every Kafka partition can be used to provide strong ordering and global consistency guarantees by buffering records in between timestamp closures. Use the "resolved" timestamp to see every row that changed at a certain time.

The resolved option is what guided me in the first place: that's the real oracle I was concerned about.
I don't know if we implied stronger guarantees in our docs.
I think at best, it's a grey area.

@renatolabs
Contributor

> I don't know if we implied stronger guarantees in our docs.

FWIW, I have been reading the docs on this recently as part of working on cdc/mixed-versions, and the ordering guarantee (regardless of resolved) seems pretty explicit to me:

> Once a row has been emitted with some timestamp, no previously unseen versions of that row will be emitted with a lower timestamp. That is, you will never see a new change for that row at an earlier timestamp.
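
That is essentially the invariant the nemeses failure above tripped. A minimal validator sketch of that per-row guarantee (hypothetical code, not the test's actual implementation):

```go
package main

import "fmt"

// orderValidator checks the documented per-row ordering guarantee: once a
// key has been seen at timestamp t, no previously unseen version of that
// key may arrive with a timestamp lower than t.
type orderValidator struct {
	maxSeen map[string]int64          // highest timestamp emitted so far, per key
	seen    map[string]map[int64]bool // all timestamps emitted so far, per key
}

func newOrderValidator() *orderValidator {
	return &orderValidator{
		maxSeen: map[string]int64{},
		seen:    map[string]map[int64]bool{},
	}
}

func (v *orderValidator) note(key string, ts int64) error {
	if v.seen[key] == nil {
		v.seen[key] = map[int64]bool{}
	}
	// Re-emitting an already-seen version is fine; a *new* older version is not.
	if ts < v.maxSeen[key] && !v.seen[key][ts] {
		return fmt.Errorf("saw new row timestamp %d after %d was seen", ts, v.maxSeen[key])
	}
	v.seen[key][ts] = true
	if ts > v.maxSeen[key] {
		v.maxSeen[key] = ts
	}
	return nil
}

func main() {
	v := newOrderValidator()
	fmt.Println(v.note("foo", 2)) // <nil>
	fmt.Println(v.note("foo", 1)) // error: previously unseen older version after a newer one
}
```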

@ajwerner
Contributor

I agree with Renato.

> The problem is that such a change will largely remove the benefits of having async flushes in the first place -- you only gain the ability to buffer the next set of data, but you won't be able to write multiple files concurrently.

I think the ability to overlap writing one file while filling another is worth something. If you want more concurrency, then you need to create more ChangeAggregators or do something else. I don't feel okay giving up this correctness claim without at least opting into it. It seems pretty major to me.

@ajwerner
Contributor

@miretskiy and I spoke a bit offline. It seems like in the initial scan/export stage of things, we can safely flush in parallel because we know that the files have disjoint key sets. In that case, we can avoid any queuing. That's the high-throughput case we know about. In the steady state, where it's unsafe, we almost certainly can't sustain a higher throughput than a single write to the blob store anyway, so it probably doesn't matter.

There are some details to be worked out with regard to pulling this off, but it seems like the right answer.

@ajwerner
Contributor

One simple approach would be to track the minimum and maximum timestamp of rows in a file, and also track the files which are currently being written. If the file you want to write only has data at exactly the same timestamp as all the files currently being written, then it can skip the queue: if all the data is at the same timestamp, there is no hazard. That also happens to be the situation when you are doing an initial scan or export.
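
A rough sketch of that check, assuming (this is not spelled out further in the issue) that each file records the min/max timestamps of its rows:

```go
package main

import "fmt"

// fileSpan tracks the minimum and maximum row timestamps contained in a file.
type fileSpan struct {
	minTS, maxTS int64
}

// singleTS reports whether all rows in the file share one timestamp.
func (f fileSpan) singleTS() bool { return f.minTS == f.maxTS }

// canSkipQueue returns true when flushing `next` concurrently with the
// in-flight files cannot reorder row versions: every file involved holds
// data at exactly the same single timestamp (e.g. an initial scan or export).
func canSkipQueue(next fileSpan, inFlight []fileSpan) bool {
	if !next.singleTS() {
		return false
	}
	for _, f := range inFlight {
		if !f.singleTS() || f.minTS != next.minTS {
			return false
		}
	}
	return true
}

func main() {
	backfill := []fileSpan{{minTS: 100, maxTS: 100}}
	fmt.Println(canSkipQueue(fileSpan{minTS: 100, maxTS: 100}, backfill)) // true: same single timestamp, no hazard
	fmt.Println(canSkipQueue(fileSpan{minTS: 100, maxTS: 250}, backfill)) // false: must queue behind in-flight flushes
}
```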

miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 14, 2022
When async flushing is enabled, the following sequence of events is
possible (even if very unlikely):

* k@t1 is emitted, causing async flush to write file f1
* k@t2 is emitted, causing async flush to write file f2
* f2 is written out before f1

In this unlikely scenario -- and the reason it is unlikely is that we
have to generate megabytes of data to cause a flush, unless the file_size
parameter was pathologically small -- if a client were to read the
contents of the directory right after step 2, and then read the directory
again after step 3, the client would first observe k@t2 and then observe
k@t1 -- a violation of the ordering guarantees.

This PR fixes the issue by adopting a queue so that if there is a
pending flush in flight, the next flush is queued behind it.

It is possible that this simple approach may result in a throughput
decrease.  That would happen if the underlying storage (S3 or GCP)
cannot keep up with writing out data before the next file comes in.
However, at this point, such a situation is unlikely for two reasons:
one, the rest of the changefeed machinery must be able to generate one
or more files' worth of data before the previous flush completes -- no
trivial task in and of itself -- and two, the changefeed must have
enough memory allocated to it that the pushback mechanism does not
trigger.  This assumption was validated in a reasonably sized test --
i.e., a non-zero queue depth was never observed.  Nonetheless, this PR
also adds a log message which may help detect the situation where the
sink cannot keep up with the incoming data rate.

A more complex solution -- for example, allowing unlimited in-flight
requests during backfill -- may be revisited later if the above
assumption proves incorrect.

Fixes cockroachdb#89683

Release note: None.
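
A minimal sketch of the queueing behavior the commit message describes (hypothetical types; not the actual cloud storage sink code): flushes are drained by a single worker, so at most one upload is in flight and files land in emit order, while a non-empty queue is logged as a sign the sink is not keeping up.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncFlusher serializes flushes: a single worker drains the queue, so
// files are written out in the order they were enqueued, while the caller
// is free to start filling the next file.
type asyncFlusher struct {
	queue chan string // file names pending upload (buffer = allowed queue depth)
	wg    sync.WaitGroup
}

func newAsyncFlusher(depth int) *asyncFlusher {
	f := &asyncFlusher{queue: make(chan string, depth)}
	f.wg.Add(1)
	go func() {
		defer f.wg.Done()
		for file := range f.queue {
			upload(file) // only one upload in flight at a time
		}
	}()
	return f
}

// enqueue blocks when the queue is full -- the pushback alluded to above --
// and logs when flushes start piling up behind a slow sink.
func (f *asyncFlusher) enqueue(file string) {
	if len(f.queue) > 0 {
		fmt.Println("warning: sink not keeping up; queue depth", len(f.queue))
	}
	f.queue <- file
}

func (f *asyncFlusher) close() {
	close(f.queue)
	f.wg.Wait()
}

func upload(file string) { fmt.Println("uploaded", file) }

func main() {
	f := newAsyncFlusher(4)
	f.enqueue("f1") // k@t1
	f.enqueue("f2") // k@t2 -- cannot land before f1
	f.close()
}
```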
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 14, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 15, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 15, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 17, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 18, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 20, 2022
craig bot pushed a commit that referenced this issue Oct 20, 2022
89966: sql: add tenant names r=postamar,stevendanna,ecwall a=knz

As spelled out by #84700, this PR adds:

-  a new `name` column to `system.tenants`;
-  a new overload to `create_tenant()` which takes the tenant name as parameter;
-  a new built-in function `rename_tenant()`.

See the enclosed commits for details. I am welcoming alternate approaches to create the name column as long as it properly enforces uniqueness, enables by-name point lookups and simplifies the task of keeping it coherent with the info payload.

Epic: CRDB-14537

90011: changefeedccl: Ensure correct file ordering. r=miretskiy a=miretskiy

Fixes #89683

Release note: None.

90236: vendor: bump Pebble to 0090519bf0bb r=itsbilal a=coolcom200

```
0090519b metrics: expose fsync latency as prometheus histogram
```

Release note: None
Epic: None

90284: multiregionccl: add a datadriven test for secondary region failover r=arulajmani a=rafiss

fixes #90169

This test shows that the lease holder and voters fail over to the secondary region when the primary region is taken offline.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Yevgeniy Miretskiy <yevgeniy@cockroachlabs.com>
Co-authored-by: Leon Fattakhov <leon.fattakhov@cockroachlabs.com>
Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>
@craig craig bot closed this as completed in 8f7a011 Oct 20, 2022
miretskiy pushed a commit to miretskiy/cockroach that referenced this issue Oct 31, 2022