
GC fails with "invalid checksum digest format" from registry when deleting manifests #15970

Closed
dkulchinsky opened this issue Nov 8, 2021 · 30 comments

Comments

@dkulchinsky
Contributor

If you are reporting a problem, please make sure the following information is provided:

Expected behavior and actual behavior:

GC should delete all manifests and blobs marked for removal; if an error is encountered, it should log the issue, skip the offending manifest/blob, and continue with the rest. Instead, GC fails to delete some manifests because the registry returns a 500 with the following error message:

err.message="invalid checksum digest format"

and the whole GC job fails and stops.

Steps to reproduce the problem:
I don't know what causes it, so I'm not sure how to reproduce it.

Versions:
Please specify the versions of the following systems.

  • harbor version: 2.3.3
  • docker engine version: N/A
  • docker-compose version: N/A

Additional context:

Registry logs:

time="2021-11-08T00:41:24.515110036Z" level=info msg="authorized request" go.version=go1.15.12 http.request.host="harbor-registry:5000" http.request.id=c5e9fa5c-804e-46b8-a610-1c523bda67ad http.request.method=DELETE http.request.remoteaddr="127.0.0.1:50242" http.request.uri="/v2/<project>/<repo>/manifests/sha256:875c0867bae86b06e7c4e098eff8877663c13859eb760f36b451354c755e591d" http.request.useragent=harbor-registry-client vars.name="<project>/<repo>" vars.reference="sha256:875c0867bae86b06e7c4e098eff8877663c13859eb760f36b451354c755e591d" 

time="2021-11-08T00:43:20.098404428Z" level=error msg="response completed with error" auth.user.name="harbor_registry_user" err.code=unknown err.message="invalid checksum digest format" go.version=go1.15.12 http.request.host="harbor-registry:5000" http.request.id=c5e9fa5c-804e-46b8-a610-1c523bda67ad http.request.method=DELETE http.request.remoteaddr="127.0.0.1:50242" http.request.uri="/v2/<project>/<repo>/manifests/sha256:875c0867bae86b06e7c4e098eff8877663c13859eb760f36b451354c755e591d" http.request.useragent=harbor-registry-client http.response.contenttype="application/json; charset=utf-8" http.response.duration=1m55.650187865s http.response.status=500 http.response.written=70 vars.name="<project>/<repo>" vars.reference="sha256:875c0867bae86b06e7c4e098eff8877663c13859eb760f36b451354c755e591d" 

jobservice log:

2021-11-08T00:43:20Z [ERROR] [/jobservice/runner/redis.go:113]: Job 'GARBAGE_COLLECTION:4d498cc8dea08115a5d2a92d' exit with error: run error: failed to delete manifest with v2 API: <project>/<repo>, sha256:875c0867bae86b06e7c4e098eff8877663c13859eb760f36b451354c755e591d: retry timeout: http status code: 500, body: {"errors":[{"code":"UNKNOWN","message":"unknown error","detail":{}}]}
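For reference, "invalid checksum digest format" is the validation error defined by the go-digest library that distribution relies on for digest parsing. The sketch below (assuming the github.com/opencontainers/go-digest module; standalone, not Harbor code) shows how a malformed digest string produces the same message. Presumably the registry hits this validation error internally while handling the DELETE, and it surfaces to Harbor as the 500 above.

```go
package main

import (
	"fmt"

	"github.com/opencontainers/go-digest"
)

func main() {
	// A well-formed digest parses cleanly.
	d, err := digest.Parse("sha256:875c0867bae86b06e7c4e098eff8877663c13859eb760f36b451354c755e591d")
	fmt.Println(d, err) // prints the digest and <nil>

	// A malformed digest string fails validation with the same message
	// seen in the registry log: "invalid checksum digest format".
	_, err = digest.Parse("sha256:")
	fmt.Println(err)
}
```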
@dkulchinsky
Contributor Author

@wy65701436 I came across this issue (distribution/distribution#3018) and PR (distribution/distribution#3019) in upstream Docker Distribution that seem to have identified the same problem causing the errors we're seeing.

AFAICT the PR was abandoned and the issue is still open with no resolution. GitLab seems to have forked distribution and started using their own version due to the inability to push changes upstream.

With distribution still (?) using a Google SDK from 2015, it's really worrying, and it looks like it's contributing to a lot of the issues we're seeing, including this one?

@wy65701436
Contributor

Thanks @dkulchinsky, I'll update the Google SDK for upstream distribution. That said, Harbor still cannot leverage the fix until we get a new distribution release.

@wy65701436 wy65701436 self-assigned this Nov 11, 2021
@dkulchinsky
Contributor Author

Thanks @wy65701436!

Can we consider using GitLab's fork of distribution? It seems to be in much better shape in terms of reliability, performance and overall maintenance; Docker's distribution last had a release in January 2019 😱

Is there anything you can suggest for my situation? We already have over 10,000 artifacts waiting for GC (and the number grows daily), and the GC job keeps failing, either due to this issue or the other GC issues I've reported (mostly #15822).

What can we do? getting really desperate with this 😞

@wy65701436
Contributor

We have no plan to leverage another fork of distribution. We (the distribution maintainers) are working on issuing a new release of upstream distribution, but I cannot give a date.

  • For this issue, I'll raise a PR with the fix upstream. Harbor needs to wait for the distribution 3.0 release.
  • For the other GC issues, I'll go through them and maybe we can do some enhancements on the Harbor side. But, so far, most of the problems are caused by the backend storage.

@dkulchinsky
Contributor Author

Thanks again for replying @wy65701436! I appreciate it 👍🏼

We have no plan to leverage another fork of distribution. We (the distribution maintainers) are working on issuing a new release of upstream distribution, but I cannot give a date.

  • For this issue, I'll raise a PR with the fix upstream. Harbor needs to wait for the distribution 3.0 release.

Looking forward to it 👍🏼

  • For the other GC issues, I'll go through them and maybe we can do some enhancements on the Harbor side. But, so far, most of the problems are caused by the backend storage.

I think allowing GC to skip blobs/manifests that fail to be removed due to persistent errors such as 404 and 500 would help mitigate this issue considerably, or at least allow us to GC the majority of artifacts.

Perhaps this behaviour could be an optional configuration, so that it isn't a breaking change and remains opt-in.
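For illustration, a minimal Go sketch of what such an opt-in skip-on-persistent-error behaviour could look like. The names (sweep, Candidate, deleteManifest, skipFailures) are hypothetical and this is not Harbor's actual GC code:

```go
// Hypothetical sketch only; not Harbor's actual GC implementation.
package gc

import (
	"errors"
	"fmt"
	"log"
	"net/http"
)

// Candidate identifies a manifest marked for deletion by the mark phase.
type Candidate struct {
	Repository string
	Digest     string
}

// httpError carries the status code returned by the registry.
type httpError struct{ StatusCode int }

func (e *httpError) Error() string {
	return fmt.Sprintf("registry returned HTTP %d", e.StatusCode)
}

// sweep deletes every candidate. With skipFailures enabled, persistent
// registry errors (404/500) are logged and skipped instead of failing
// the whole job.
func sweep(candidates []Candidate, skipFailures bool,
	deleteManifest func(repo, dgst string) error) error {
	var skipped int
	for _, c := range candidates {
		err := deleteManifest(c.Repository, c.Digest)
		if err == nil {
			continue
		}
		var herr *httpError
		persistent := errors.As(err, &herr) &&
			(herr.StatusCode == http.StatusNotFound ||
				herr.StatusCode == http.StatusInternalServerError)
		if skipFailures && persistent {
			log.Printf("skipping %s@%s: %v", c.Repository, c.Digest, err)
			skipped++
			continue
		}
		return fmt.Errorf("failed to delete manifest %s@%s: %w", c.Repository, c.Digest, err)
	}
	log.Printf("sweep finished, skipped %d of %d manifests", skipped, len(candidates))
	return nil
}
```

With skipFailures disabled the loop fails on the first error, as GC does today, so the new behaviour would stay strictly opt-in.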

@wy65701436
Contributor

Yes, allowing failures to be skipped could be an option. BTW, for the performance issue, we may do some enhancements on the distribution side.

@dkulchinsky
Contributor Author

dkulchinsky commented Nov 12, 2021

Yes, allowing failures to be skipped could be an option.

That would be great @wy65701436 🤝 It would be greatly appreciated, as it would really help our current situation; I hope this can be implemented sooner rather than later 🙏🏼

BTW, for the performance issue, we may do some enhancements on the distribution side.

👏🏼

@stale

stale bot commented Apr 16, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Stale label Apr 16, 2022
@dkulchinsky
Contributor Author

this is still being tracked I believe?

@stale stale bot removed the Stale label Apr 17, 2022
@github-actions

github-actions bot commented Jul 5, 2022

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jul 5, 2022
@dkulchinsky
Contributor Author

still relevant

@github-actions github-actions bot removed the Stale label Jul 7, 2022
@github-actions

github-actions bot commented Sep 6, 2022

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Sep 6, 2022
@dkulchinsky
Contributor Author

still relevant

@github-actions github-actions bot removed the Stale label Sep 7, 2022
@github-actions

github-actions bot commented Nov 6, 2022

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Nov 6, 2022
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Nov 7, 2022
@github-actions

github-actions bot commented Jan 7, 2023

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jan 7, 2023
@github-actions

github-actions bot commented Mar 9, 2023

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Mar 9, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions

github-actions bot commented May 9, 2023

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label May 9, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jul 10, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Sep 10, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Sep 11, 2023
@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Nov 11, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Nov 12, 2023
@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jan 12, 2024
@github-actions

This issue was closed because it has been stalled for 30 days with no activity. If this issue is still relevant, please re-open a new issue.

@github-actions github-actions bot closed this as not planned Feb 12, 2024
@github-project-automation github-project-automation bot moved this from Issues to Completed in GC Improvement Activities Feb 12, 2024
@dkulchinsky
Contributor Author

Looks like I missed the stale notifications and didn't post here.

This is definitely still an issue and should be tracked.

@wy65701436 can you please re-open?
