
DefaultClientRetryPolicy overflows to past #697

Open
hypnoglow opened this issue Dec 20, 2024 · 2 comments
hypnoglow commented Dec 20, 2024

In the retry mechanism, if we increase MaxAttempts, then DefaultClientRetryPolicy will overflow at some point and NextRetry() will return a date in the past:

"message": "jobExecutor: Retry policy returned invalid next retry before current time; using default retry policy instead",
"next_retry_scheduled_at": "1732-09-09T07:24:19.414Z",
"now": "2024-12-19T07:11:36.269Z",

Update: I've revised my previous hypothesis.

It seems the overflow only happens in the logs, where a conversion to Unix time takes place.

I checked the database and the date is fine:

id           | 201477
state        | retryable
attempt      | 305
max_attempts | 1024
attempted_at | 2024-11-20 19:54:13.52567+00
created_at   | 2024-11-08 08:30:37.197066+00
finalized_at | <null>
scheduled_at | 2274-04-14 17:57:33.177419+00

But then I also don't understand why our custom policy, which is designed to cap the retry time, doesn't work. Here's the code:

func (p *LimitingRetryPolicy) NextRetry(job *rivertype.JobRow) time.Time {
	nextTime := p.origin.NextRetry(job) // p.origin is &DefaultClientRetryPolicy{}
	maxTime := p.now().Add(p.threshold) // p.threshold is e.g. time.Hour

	if maxTime.Before(nextTime) {
		return maxTime
	}

	return nextTime
}

With this policy, we still get retry times far beyond our threshold (such as in the logs above).

brandur commented Dec 21, 2024

I looked into this a bit, and although I couldn't get the next retry time to overflow, there are some oddities around the handling of very large durations that we should at least be explicit about. The maximum length of time that can be stored in a duration is about 292 years, and once you start going over that, you're in danger of overflowing it and can get some odd behavior.

I'm not sure how you even got there, though. You start hitting duration limits at retry 310, but before 310 you're already well into exponential territory, so you'll never actually be hitting those next retries because they're hundreds of years out. I see that you got to attempt no. 305 in your record above, but if you're using the default policy, isn't the distance between attempts 304 and 305 still measured in months?

brandur added a commit that referenced this issue Dec 21, 2024
… counts

This one's here to address #697, where it's been reported that it may be
possible for retry schedules to overflow to the past when reaching
degenerately high numbers of attempts. I couldn't reproduce the reported
problem, but it is possible for `time.Duration` to overflow, so here we
shore up `DefaultClientRetryPolicy` so that we're explicitly defining
what behavior should occur under these conditions.

The maximum length of time that can go in a `time.Duration` is about 292
years. Here's a sample program that demonstrates an overflow happening:

    package main

    import (
        "fmt"
        "time"
    )

    func main() {
        const maxDuration time.Duration = 1<<63 - 1
        var maxDurationSeconds = float64(maxDuration / time.Second)

        notOverflowed := time.Duration(maxDurationSeconds) * time.Second
        fmt.Printf("not overflowed: %+v\n", notOverflowed)

        overflowed := time.Duration(maxDurationSeconds+1) * time.Second
        fmt.Printf("overflowed: %+v\n", overflowed)
    }

    not overflowed: 2562047h47m16s
    overflowed: -2562047h47m16.709551616s

Here, modify `DefaultClientRetryPolicy` so that if it were to return a
next retry duration greater than what would fit in `time.Duration`, we
just return 292 years instead. The maximum bound occurs at 310 retries,
so every retry after 310 increments by 292 years.

I didn't bother putting a maximum bound on the time returned by
`NextRetry` because the maximum `time.Time` in Go is somewhere in the
year 219250468, so even in 292 year increments, you'll never get there.
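The clamping described in the commit message can be sketched as follows. The attempt^4 growth is my assumption about the default policy's shape (jitter omitted), and clampedBackoff is an illustrative name, not the library's actual function — see #698 for the real change:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const maxDuration time.Duration = 1<<63 - 1

var maxDurationSeconds = float64(maxDuration / time.Second)

// clampedBackoff sketches the fix described above: compute the backoff in
// seconds and clamp anything that would overflow time.Duration to the
// maximum representable value (~292 years). The attempt^4 formula is an
// assumption about the default policy's shape; jitter is omitted.
func clampedBackoff(attempt int) time.Duration {
	seconds := math.Pow(float64(attempt), 4)
	if seconds > maxDurationSeconds {
		seconds = maxDurationSeconds
	}
	return time.Duration(seconds) * time.Second
}

func main() {
	fmt.Println(clampedBackoff(309)) // still fits in time.Duration
	fmt.Println(clampedBackoff(310)) // clamped: positive, not overflowed
}
```

With 310^4 ≈ 9.23 billion seconds exceeding the ~9.22 billion seconds that fit in time.Duration, 310 is the first attempt where the clamp kicks in, matching the commit message.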
brandur added a commit that referenced this issue Dec 21, 2024
… counts (#698)

This one's here to address #697, where's it been reported that it may be
possible for retry schedules to overflow to the past when reaching
degenerately high numbers of attempts. I couldn't reproduce the reported
problem, but it is possible for `time.Duration` to overflow, so here we
shore up `DefaultClientRetryPolicy` so that we're explicitly defining
what behavior should occur under these conditions.

The maximum length of time that can go in a `time.Duration` is about 292
years. Here's a sample program that demonstrates an overflow happening:

    func main() {
        const maxDuration time.Duration = 1<<63 - 1
        var maxDurationSeconds = float64(maxDuration / time.Second)

        notOverflowed := time.Duration(maxDurationSeconds) * time.Second
        fmt.Printf("not overflowed: %+v\n", notOverflowed)

        overflowed := time.Duration(maxDurationSeconds+1) * time.Second
        fmt.Printf("overflowed: %+v\n", overflowed)
    }

    not overflowed: 2562047h47m16s
    overflowed: -2562047h47m16.709551616s

Here, modify `DefaultClientRetryPolicy` so that if it were to return a
next retry duration greater than what would fit in `time.Duration`, we
just return 292 years instead. The maximum bound occurs at 310 retries,
so every retry after 310 increments by 292 years.

I didn't bother putting a maximum bound on the time returned by
`NextRetry` because the maximum `time.Time` in Go is somewhere in the
year 219250468, so even in 292 year increments, you'll never get there.