publish_sdks workflow needs to be retry-able by language #1043

Open
t0yv0 opened this issue Jul 22, 2024 · 9 comments
Labels
kind/enhancement Improvements or new features

Comments

@t0yv0
Member

t0yv0 commented Jul 22, 2024

AWS v6.45.0 failed to publish Java artifacts to Maven Central due to a CI issue.

https://github.com/pulumi/pulumi-aws/actions/runs/9962834625

As maintainers, we would like to retry Java SDK publishing, and only that, now that the credentials are fixed. However, SDK publishing is currently a monolithic step involving every language.

@pulumi-bot pulumi-bot added the needs-triage Needs attention from the triage team label Jul 22, 2024
@guineveresaenger guineveresaenger added kind/enhancement Improvements or new features and removed needs-triage Needs attention from the triage team labels Jul 23, 2024
@guineveresaenger
Contributor

@t0yv0 can you link a bit more context on the CI issue here? Presumably this was not a Maven issue?

@t0yv0
Member Author

t0yv0 commented Jul 23, 2024

Maven Central credentials required rotation.

@danielrbradley
Member

danielrbradley commented Jul 23, 2024

Oh hey there .. I think this overlaps with this issue:

The intention within pulumi-package-publisher is that creating each release is idempotent and can safely be retried, because it will just skip if already created. However, this interacts badly with the fact that we don't fail when Java fails, because of historical flakiness. This means that Java just gets skipped and the job is marked as complete, so we can't then retry the failure.

I think the solution here is to:

  • Stop skipping Java errors.
  • Build in retries for Java to alleviate the pain when it does flake.
  • Just use the GHA failed job retry mechanism for retries.
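As a rough illustration of the first two points, here is a minimal Python sketch of a retry wrapper that re-raises the final error instead of swallowing it, so the job would fail and could be re-run through GHA. The `retry` helper and its parameters are hypothetical names for illustration, not part of pulumi-package-publisher.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action` up to `attempts` times with exponential backoff.

    Hypothetical helper: the last error is re-raised rather than ignored,
    so the surrounding CI job fails and stays retryable.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception:
            if attempt == attempts:
                raise  # do not skip the failure; let the job fail
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting.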

I think this work could be included in the epic to cut a GA of the pulumi-package-publisher action.

@t0yv0
Member Author

t0yv0 commented Jul 23, 2024

To me as a user, it seems like a separate issue from silently ignoring failures. I need to be able to retry Java publishing manually without retrying other SDKs that published successfully. I don't think publishing is idempotent in general; it 100% is not for Maven, and I'd love for us not to count on it being idempotent.

@guineveresaenger
Contributor

I believe most publishing processes can be retried idempotently at this point: PyPI and npm do so out of the box, and we run nuget push with --skip-duplicate. I'm not sure about Go, but I think Go is idempotent too.

I think the two issues are related, and maybe it boils down to a design decision on whether we're able/willing to have separate publishing runs for each relevant language.

@t0yv0
Member Author

t0yv0 commented Jul 23, 2024

What's the reason these are coupled currently? Even if other languages are idempotent, rerunning them just to get Java to publish is not ideal.

@guineveresaenger
Contributor

I believe if we publish them with the same runner, we a) save runners and b) cut down on artifact download time overall, but I may be overestimating how much of an issue that would be.

@danielrbradley
Member

I think it's reasonable to assume we can implement idempotent behaviour here even if the service doesn't support it directly. Checking if a package version exists should be possible in all package managers, and failing that, it should just be first write wins, with the re-pushed package ignored.
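The check-before-publish idea can be sketched in Python as follows. This is only an illustration under assumptions: `exists` and `publish` are hypothetical callables standing in for whatever lookup and upload a given package manager offers, and nothing here reflects the publisher's actual code.

```python
from typing import Callable


def publish_if_absent(
    version: str,
    exists: Callable[[str], bool],
    publish: Callable[[str], None],
) -> str:
    """Publish `version` only if the registry does not already have it.

    `exists` and `publish` are hypothetical stand-ins for real registry
    calls. Returns "skipped" or "published" so the caller can log it.
    """
    if exists(version):
        # First write wins: a repeated push is treated as a no-op.
        return "skipped"
    publish(version)
    return "published"
```

With an in-memory set as a stand-in registry, `publish_if_absent("6.45.0", registry.__contains__, registry.add)` publishes on the first call and skips on the second. The sketch assumes the existence check sees a completed publish immediately.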

Publishing in a single job is almost certainly going to be faster overall for us than using separate parallel jobs due to runner contention and the overheads.

What we've got is pretty good and working well, so we should just focus on making the Java release reliable and automatically retryable, not ignoring errors, and allowing a retry of the whole job when one or more languages fail.

@t0yv0
Member Author

t0yv0 commented Jul 24, 2024

It's not reasonable for Maven Central. There are hours of delay in the OSSRH<->Central publishing pipe. The only chance of an idempotent solution is to try publishing and then interpret error codes meaning "already published" as success, or else to use a side channel such as an S3 sentinel to make the step artificially idempotent.
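Both options can be combined in a small Python sketch. Everything named here is an assumption for illustration: `AlreadyPublished` is a hypothetical exception the caller would raise after recognizing a duplicate-version error, and `sentinels` is any set-like store standing in for the S3 sentinel.

```python
from typing import Callable, MutableSet


class AlreadyPublished(Exception):
    """Hypothetical: the registry rejected the upload as a duplicate."""


def publish_once(
    key: str,
    publish: Callable[[], None],
    sentinels: MutableSet[str],
) -> str:
    """Make a non-idempotent publish step safe to re-run.

    `sentinels` is a stand-in for a side channel such as an S3 marker.
    Any other error from `publish` propagates, leaving no sentinel behind.
    """
    if key in sentinels:
        return "skipped"  # a previous run already finished this step
    try:
        publish()
    except AlreadyPublished:
        # Count the duplicate rejection as success and record it.
        sentinels.add(key)
        return "already-published"
    sentinels.add(key)
    return "published"
```

The sentinel is written only after the publish call returns or reports a duplicate, so a genuine failure leaves the step retryable.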

I concede that reliability (pulumi/pulumi-package-publisher#16) is more important to work on first, but I'm really wondering why we are prioritizing runner contention over usability. I am guessing that in an ideal world GHA would allow steps to be scheduled on a single runner yet be independently retryable, so this could be resolved as a win-win. However, as things stand, does adding 4 more steps to a 30-step workflow really have any observable effect on runner contention? I think separate GHA steps would also make it much easier for the operator to locate errors and logs. At the very least, maybe break the languages into separate steps; e.g. see how they currently all run in a single step, mixing up the logs: https://github.com/pulumi/pulumi-aws/actions/runs/10064400756/job/27825467506#step:4:82


4 participants