
roachtest/cdc: increase cloud storage assume role acceptable latency #97394

Closed

Conversation

jayshrivastava
Contributor

Previously, we would observe failures due to the latency jumping to ~1m15s:
- https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/8757437
- https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_Nightlies_RoachtestNightlyGceBazel/8534805

The most likely explanation is network blips. The maximum acceptable latency was previously 1 minute; this change bumps it to 90 seconds to reduce how often we see these flakes.

Fixes: #96330
Epic: none

Release note: None
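
In effect, the change only raises the threshold handed to the test's latency check. A minimal sketch of that idea, using a hypothetical `cdcTestArgs` struct and `maxLatency` field rather than the actual roachtest code:

```go
// Hypothetical sketch of the threshold bump; the real roachtest wires this
// value through its own test-spec and latency-verifier types.
package main

import (
	"fmt"
	"time"
)

// cdcTestArgs is a stand-in for the roachtest's configuration.
type cdcTestArgs struct {
	sinkURI    string
	assumeRole string
	// maxLatency is the highest steady-state changefeed latency the
	// verifier tolerates before failing the test.
	maxLatency time.Duration
}

func main() {
	args := cdcTestArgs{
		sinkURI:    "gs://example-bucket/roachtest?AUTH=implicit", // placeholder sink
		assumeRole: "example-role@example.iam.gserviceaccount.com", // placeholder role
		// Previously 1 * time.Minute; observed flakes peaked around 1m15s,
		// so the limit is raised to 90s.
		maxLatency: 90 * time.Second,
	}
	fmt.Printf("max acceptable latency: %s\n", args.maxLatency)
}
```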

@jayshrivastava jayshrivastava requested a review from a team February 21, 2023 16:38
@jayshrivastava jayshrivastava marked this pull request as ready for review February 21, 2023 16:38
@jayshrivastava jayshrivastava requested a review from a team as a code owner February 21, 2023 16:38
@jayshrivastava jayshrivastava requested review from herkolategan and srosenberg and removed request for a team February 21, 2023 16:38
@cockroach-teamcity
Member

This change is Reviewable

@miretskiy
Contributor

I'm okay bumping the limit for now, but I'm not convinced there isn't some underlying issue at play.
Why would assume role need a minute to run? Is there some throttling happening? There has to be an underlying explanation.

@jayshrivastava
Contributor Author

We've been seeing error messages like this since 2019: #37307. The failures span a variety of roachtests and sinks (Kafka, cloud storage, pubsub).

This particular roachtest was added in July 2022; the oldest failure on record is from August 2022.

@jayshrivastava
Contributor Author

I thought catchup scans were the reason, but this failure over the weekend shows it can happen at any time. Every time we see this kind of issue, we close it as a flake. If we really think these are flakes, then maybe the latency verifier should allow some percentage of error: e.g. if you specify 60 seconds, it tolerates up to 120 seconds before failing. A sketch of that idea follows.
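
Here is a minimal sketch of the "percentage error" idea, with hypothetical `latencyVerifier` and `tolerance` names (not the verifier's actual API):

```go
// Hypothetical sketch: the verifier fails only once observed latency
// exceeds the target by a configured tolerance factor.
package main

import (
	"fmt"
	"time"
)

type latencyVerifier struct {
	targetLatency time.Duration
	// tolerance is a multiplier on targetLatency; 2.0 means a 60s target
	// only fails the test once latency exceeds 120s.
	tolerance float64
}

func (v latencyVerifier) check(observed time.Duration) error {
	limit := time.Duration(float64(v.targetLatency) * v.tolerance)
	if observed > limit {
		return fmt.Errorf("latency %s exceeded limit %s (target %s, tolerance %.1fx)",
			observed, limit, v.targetLatency, v.tolerance)
	}
	return nil
}

func main() {
	v := latencyVerifier{targetLatency: 60 * time.Second, tolerance: 2.0}
	fmt.Println(v.check(75 * time.Second))  // within tolerance: <nil>
	fmt.Println(v.check(130 * time.Second)) // over the 120s limit: error
}
```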

@jayshrivastava
Contributor Author

On second thought, we should add more logging to better understand the issue rather than naively increase the limit.
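
One hedged sketch of what "more logging" could look like (soft and hard limits and the helper name are hypothetical, not the roachtest's actual code): warn with context as soon as latency crosses the old threshold, and fail only at the hard limit, so flakes leave a trail to investigate.

```go
// Hypothetical sketch of "log first, fail later".
package main

import (
	"log"
	"time"
)

const (
	softLimit = 1 * time.Minute  // log a warning with context
	hardLimit = 90 * time.Second // fail the test
)

func checkLatency(observed time.Duration) bool {
	if observed > hardLimit {
		log.Printf("FAIL: latency %s exceeded hard limit %s", observed, hardLimit)
		return false
	}
	if observed > softLimit {
		log.Printf("WARN: latency %s exceeded soft limit %s; capture sink/network state here", observed, softLimit)
	}
	return true
}

func main() {
	checkLatency(75 * time.Second)  // warns, does not fail
	checkLatency(100 * time.Second) // fails
}
```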
