Timeout and restart #72

Open
MOZGIII opened this issue Sep 1, 2022 · 6 comments · May be fixed by #75

Comments

@MOZGIII

MOZGIII commented Sep 1, 2022

The GitHub cache server sometimes crashes, and sometimes it gets super slow, serving the cache (which is about 2 GB for us) at less than 1 Mbps.

In both cases, the step eventually fails (with an error in the first case, and with an overall job timeout in the second).

It would be nice to extend the logic of this action with:

  • timeout on cache download / upload - throw an error if the operation takes longer than a configured amount of time
  • retry on error - an in-action retry on networking errors or on a timeout from the above, so that the rest of the work in the job is not lost; the retry can use a configurable number of attempts (a rough sketch of both is below)
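
Roughly, I'm imagining something like the sketch below (made-up helper names, just to illustrate the shape, not actual code from this action):

```typescript
// Rough sketch with made-up names: run an operation with a per-attempt
// timeout and retry it a configured number of times.
async function withTimeout<T>(op: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timedOut = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs} ms`)), timeoutMs);
  });
  try {
    // The losing promise is not cancelled, only abandoned.
    return await Promise.race([op(), timedOut]);
  } finally {
    clearTimeout(timer);
  }
}

async function withRetries<T>(op: () => Promise<T>, attempts: number, timeoutMs: number): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await withTimeout(op, timeoutMs);
    } catch (error) {
      lastError = error;
      console.warn(`attempt ${attempt}/${attempts} failed: ${error}`);
    }
  }
  throw lastError;
}
```

The restore/save calls would then be wrapped as something like `withRetries(() => restoreCache(), attempts, timeoutMs)`, with both knobs coming from the action's inputs.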

Thanks for the great action btw, it does wonders for us already!

@Swatinem
Owner

Swatinem commented Sep 4, 2022

I would say that any logic like this would need to be developed upstream, as part of the GitHub cache action / library.
I'm also unsure whether a retry would help in these cases: if something between the runner and the artifacts server is broken, I doubt a retry would fix it.

@MOZGIII
Author

MOZGIII commented Sep 4, 2022

In practice, retrying the job manually actually works. These issues can have a variety of causes, and a good portion of them is solved by a retry (an HTTP retry will likely land on a different load balancer, may connect to a healthier upstream, etc.).


I've also considered having these features implemented upstream, but in the end, what would the implementation on their side amount to, a loop and a racing timer? And the configuration for both would still have to be provided by the user, so it would have to pass through this action's config anyway. It might make sense to just implement it here; it would also certainly be quicker.

Ultimately, I figured it'd be easier to address this here, in case you'd be ok with the implementation living in this action, at least initially. It could then be moved out into an external reusable package, for instance.

My main rationale for going here instead of upstream is that I figured it would be quicker to land.

@MOZGIII MOZGIII linked a pull request Sep 4, 2022 that will close this issue
@MOZGIII
Author

MOZGIII commented Sep 4, 2022

Figured I'd go ahead and try to do it, and ended up with #75.

The downsides of this implementation are:

  • We can't cancel in-flight save/restore operations, which is relevant for the timeouts.
    However, there are some timeouts implemented at https://github.com/actions/toolkit/blob/main/packages/cache/src/options.ts now, so we can expose these options (see the sketch after this list).
    We may (or may not) keep the overall timeout as a failsafe (for the case where the download is not stuck but is just very slow). This will leak the outgoing connection, but that might not be too bad, given that it's "stuck" anyway.

  • The logic to tell whether an error is retriable is left out, since it will likely take some fault injection or statistics collection to determine which errors are retriable and how to detect them. I've set all errors to be retriable for now.
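
For the first point, exposing the toolkit timeouts would look roughly like this (a sketch only; it assumes the `DownloadOptions` argument of `restoreCache` in `@actions/cache`, with option names taken from the linked options.ts, and placeholder paths/key):

```typescript
import * as cache from "@actions/cache";

// Sketch: forward the toolkit's own download options instead of (or in
// addition to) an action-level timeout. These options only bound the
// download itself; they don't make the operation cancellable.
async function restoreWithToolkitTimeouts(): Promise<string | undefined> {
  return cache.restoreCache(
    ["target/", "~/.cargo/registry/"], // placeholder paths
    "placeholder-cache-key",
    [],
    {
      timeoutInMs: 60_000,    // assumed option: abort a stuck download request
      downloadConcurrency: 8, // assumed option: number of parallel connections
    },
  );
}
```

This resolves with the matched cache key, or undefined on a miss, same as a plain `restoreCache` call.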


I also noticed my editor reformats the source a bit via prettier (it does so automatically on save). I've omitted these changes from the PR, but I suggest adopting prettier in this project for code formatting consistency.

@Swatinem
Owner

Swatinem commented Sep 5, 2022

I also noticed my editor reformats the source a bit via prettier (it does so automatically on save). I've omitted these changes from the PR, but I suggest adopting prettier in this project for code formatting consistency.

I have that as well, so I’m surprised it uses different formatting?

@thomaseizinger

I think I am running into the same problem. Here is an example:

[Screenshot from 2023-01-24 15-27-50]

Link here: https://github.com/libp2p/rust-libp2p/actions/runs/3992478158/jobs/6848975863

The restore job in the screenshot has been running for ~45 minutes.

@tisonkun

tisonkun commented Nov 1, 2023

Ditto - https://github.com/streamnative/pulsar-rs/actions/runs/6709247198/job/18231865369

In my case the cache should be optional (best effort), so I'd like a timeout to give up on restoring the cache and just start building.
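
A best-effort variant could treat a slow restore as a plain cache miss, something like this sketch (again assuming `restoreCache` from `@actions/cache`; this is not an existing option of the action):

```typescript
import * as cache from "@actions/cache";

// Sketch: race the restore against a timer and treat a timeout as a cache
// miss, so the build just starts from scratch. The abandoned restoreCache
// call is not cancelled; it keeps running in the background.
async function bestEffortRestore(
  paths: string[],
  key: string,
  timeoutMs: number,
): Promise<string | undefined> {
  const miss = new Promise<undefined>((resolve) =>
    setTimeout(() => resolve(undefined), timeoutMs),
  );
  // Resolves with the matched key, or undefined if nothing was restored in time.
  return Promise.race([cache.restoreCache(paths, key), miss]);
}
```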
