
GC fails after deleting many tags from same repository #15807

Open
dkulchinsky opened this issue Oct 18, 2021 · 37 comments

@dkulchinsky
Contributor

dkulchinsky commented Oct 18, 2021

Expected behavior and actual behavior:
We deleted over 20,000 tags from a repository (these tags are auto-generated during our periodic CI job that tests the registry and CI). We expected GC to clean up the related blobs and manifests, but now GC fails consistently with the following error:

2021-10-18T12:59:19Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:236]: 46566 blobs and 23266 manifests eligible for deletion
2021-10-18T12:59:19Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:237]: The GC could free up 4094 MB space, the size is a rough estimation.
2021-10-18T12:59:19Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: <project>/<repo>/demo-go-app, application/vnd.docker.distribution.manifest.v2+json, sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8
2021-10-18T13:22:10Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:264]: failed to delete manifest with v2 API, <project>/<repo>/demo-go-app, sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8, retry timeout: http status code: 500, body: {"errors":[{"code":"UNKNOWN","message":"unknown error","detail":{}}]}
2021-10-18T13:22:10Z [ERROR] [/jobservice/job/impl/gc/garbage_collection.go:166]: failed to execute GC job at sweep phase, error: failed to delete manifest with v2 API: <project>/<repo>/demo-go-app, sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8: retry timeout: http status code: 500, body: {"errors":[{"code":"UNKNOWN","message":"unknown error","detail":{}}]}

We caught the following log entries in the registry.

DELETE request:

time="2021-10-18T12:59:19.582953425Z" level=info msg="authorized request" go.version=go1.15.12 http.request.host="harbor-registry:5000" http.request.id=3c90b874-a989-41c1-b60f-e8c72c447002 http.request.method=DELETE http.request.remoteaddr="127.0.0.1:38784" http.request.uri="/v2/<project>/<repo>/demo-go-app/manifests/sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8" http.request.useragent=harbor-registry-client vars.name="<project>/<repo>/demo-go-app" vars.reference="sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8"

and the 500 error ~23 minutes later:

time="2021-10-18T13:22:10.756233562Z" level=error msg="response completed with error" auth.user.name="harbor_registry_user" err.code=unknown err.message="invalid checksum digest format" go.version=go1.15.12 http.request.host="harbor-registry:5000" http.request.id=3c90b874-a989-41c1-b60f-e8c72c447002 http.request.method=DELETE http.request.remoteaddr="127.0.0.1:38784" http.request.uri="/v2/<project>/<repo>/demo-go-app/manifests/sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8" http.request.useragent=harbor-registry-client http.response.contenttype="application/json; charset=utf-8" http.response.duration=22m51.241153516s http.response.status=500 http.response.written=70 vars.name="<project>/<repo>/demo-go-app" vars.reference="sha256:b86d808fd22197eb01f4aeecff490a5a7c50c06db7828afb00f7fc06d40172a8" 

Steps to reproduce the problem:

  1. Use GCS for registry backend storage
  2. Generate many (thousands?) of tags for the same repository (see the sketch after this list)
  3. Delete all/most tags (we use a retention policy)
  4. Run GC and observe the above errors
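For step 2, a rough sketch of how such a repository can be populated (placeholders throughout, and not our actual CI job; it assumes direct basic-auth access to the registry): fetch one existing manifest and PUT the same bytes under many new tag names via the v2 API.

```go
// Rough sketch: re-tag one existing image many times through the registry v2
// API, so a single repository ends up with thousands of tags that all point
// at the same manifest. Everything below is a placeholder, not our CI job.
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
)

const (
	registry  = "http://harbor-registry:5000"  // placeholder
	repo      = "myproject/myrepo/demo-go-app" // placeholder
	srcTag    = "latest"                       // placeholder: an existing tag
	mediaType = "application/vnd.docker.distribution.manifest.v2+json"
	numTags   = 20000
)

func main() {
	// Fetch the manifest bytes of the source tag once.
	get, _ := http.NewRequest(http.MethodGet,
		fmt.Sprintf("%s/v2/%s/manifests/%s", registry, repo, srcTag), nil)
	get.Header.Set("Accept", mediaType)
	get.SetBasicAuth("user", "password") // placeholder credentials
	resp, err := http.DefaultClient.Do(get)
	if err != nil {
		log.Fatal(err)
	}
	manifest, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		log.Fatalf("GET manifest: %d %s", resp.StatusCode, manifest)
	}

	// PUT the same manifest bytes under many new tag names.
	for i := 0; i < numTags; i++ {
		put, _ := http.NewRequest(http.MethodPut,
			fmt.Sprintf("%s/v2/%s/manifests/ci-%06d", registry, repo, i),
			bytes.NewReader(manifest))
		put.Header.Set("Content-Type", mediaType)
		put.SetBasicAuth("user", "password") // placeholder credentials
		r, err := http.DefaultClient.Do(put)
		if err != nil {
			log.Fatal(err)
		}
		io.Copy(io.Discard, r.Body)
		r.Body.Close()
		if r.StatusCode != http.StatusCreated { // 201 Created is the expected status
			log.Fatalf("PUT tag ci-%06d: %d", i, r.StatusCode)
		}
	}
}
```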

Versions:
Please specify the versions of the following systems.

  • harbor version: v2.3.3
  • docker engine version: N/A
  • docker-compose version: N/A

Additional context:

  • We use GCS for registry backend storage
@dkulchinsky
Contributor Author

This is potentially related to #12948, which we already hit in the past, but it looks like there's been no progress there, so I'm hoping there's some new insight.

@wy65701436
Contributor

The performance issue happens on the distribution side when looking up and removing tags. We will do some investigation on this, but there is no specific plan so far.

@dkulchinsky
Contributor Author

The performance issue happens on the distribution side when looking up and removing tags. We will do some investigation on this, but there is no specific plan so far.

Thanks @wy65701436, is there a workaround? I was thinking of deleting these artifacts from the artifact_trash table so GC won't pick them up. I realize we would end up with these manifests and blobs left behind in GCS, but given the situation that seems better than the alternative, where GC is completely broken.
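For reference, a rough sketch of that workaround (untested, use with care; it assumes direct access to Harbor's Postgres and that artifact_trash has a repository_name column; the DSN and repository name are placeholders):

```go
// Rough sketch of the artifact_trash workaround (untested; use with care).
// Assumes direct access to Harbor's Postgres and that artifact_trash has a
// repository_name column; the DSN and repository name are placeholders.
package main

import (
	"database/sql"
	"log"

	_ "github.com/lib/pq" // Postgres driver
)

func main() {
	// Placeholder DSN pointing at Harbor's database.
	db, err := sql.Open("postgres",
		"host=harbor-db user=postgres password=changeit dbname=registry sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Drop the trash records for the affected repository so GC stops trying
	// to delete those manifests.
	res, err := db.Exec(
		`DELETE FROM artifact_trash WHERE repository_name = $1`,
		"myproject/myrepo/demo-go-app") // placeholder repository
	if err != nil {
		log.Fatal(err)
	}
	n, _ := res.RowsAffected()
	log.Printf("removed %d rows from artifact_trash", n)
}
```

The manifests and blobs would remain in GCS, so this only unblocks GC rather than reclaiming space.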

I realize this is sort of a corner case, but I can imagine that others may end up in a similar situation, so hopefully you folks can find some cycles soon to take a look at fixing this 🙏🏼

@dkulchinsky
Contributor Author

@wy65701436 on another instance of Harbor we run, we noticed that a repository with ~4000 tags takes about 2 minutes to delete a single manifest.

It seems to me like there's a significant performance issue during GC for repositories that have several thousand tags and more.

for example:

2021-10-20T11:10:52Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: <project>/<repo>, application/vnd.docker.distribution.manifest.v2+json, sha256:fe582557fdb5eb00ca114e263784be44661f35ba1f7f15c764f0f43567a69939
2021-10-20T11:12:36Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:273]: delete manifest from storage: sha256:fe582557fdb5eb00ca114e263784be44661f35ba1f7f15c764f0f43567a69939

2021-10-20T11:12:37Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:261]: delete the manifest with registry v2 API: <project>/<repo>, application/vnd.docker.distribution.manifest.v2+json, sha256:c7faa9c6517dd640432b9172b832284b19a10324cde9782c1f16a793d8a9d041
2021-10-20T11:14:20Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:273]: delete manifest from storage: sha256:c7faa9c6517dd640432b9172b832284b19a10324cde9782c1f16a793d8a9d041

It also appears that these operations are sequential; perhaps some form of parallelism could be introduced to speed this up? That said, I think the root constraint needs to be addressed to support any sizable deployment.
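To illustrate the kind of parallelism I mean, a rough sketch (not Harbor's actual GC code) that issues the same v2 manifest deletes through a bounded worker pool; the registry address, credentials and digest list are placeholders:

```go
// Sketch of bounded-concurrency manifest deletion (not Harbor's GC code).
// Registry address, credentials, repository and digests are placeholders.
package main

import (
	"fmt"
	"log"
	"net/http"

	"golang.org/x/sync/errgroup"
)

func deleteManifest(registry, repo, digest string) error {
	req, err := http.NewRequest(http.MethodDelete,
		fmt.Sprintf("%s/v2/%s/manifests/%s", registry, repo, digest), nil)
	if err != nil {
		return err
	}
	req.SetBasicAuth("user", "password") // placeholder credentials
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusAccepted { // 202 Accepted on success
		return fmt.Errorf("delete %s: unexpected status %d", digest, resp.StatusCode)
	}
	return nil
}

func main() {
	const registry = "http://harbor-registry:5000" // placeholder
	const repo = "myproject/myrepo"                // placeholder
	digests := []string{ /* manifests eligible for deletion */ }

	var g errgroup.Group
	g.SetLimit(10) // cap in-flight deletes so the registry is not overwhelmed
	for _, d := range digests {
		d := d // capture loop variable (needed before Go 1.22)
		g.Go(func() error { return deleteManifest(registry, repo, d) })
	}
	if err := g.Wait(); err != nil {
		log.Fatal(err)
	}
}
```

Even with a concurrency cap like this, each individual delete still pays the per-manifest tag scan on the distribution side, so the root constraint would still need to be addressed.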

@heww
Contributor

heww commented Oct 25, 2021

One option to resolve this problem: keep the artifacts untagged in the distribution and manage tags only on the Harbor core side.

@wy65701436
Contributor

In v2.5 we introduced an option to skip deletion failures. This can work around the timeout.

@github-actions

github-actions bot commented Jul 5, 2022

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jul 5, 2022
@dkulchinsky
Contributor Author

still relevant

@github-actions github-actions bot removed the Stale label Jul 6, 2022
@github-actions

github-actions bot commented Sep 5, 2022

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Sep 5, 2022
@dkulchinsky
Contributor Author

still an issue

@github-actions

github-actions bot commented Nov 6, 2022

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Nov 6, 2022
@dkulchinsky
Contributor Author

definitely not stale

@github-actions github-actions bot removed the Stale label Nov 7, 2022
@github-actions

github-actions bot commented Jan 7, 2023

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jan 7, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Sep 27, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Sep 28, 2023
@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Nov 28, 2023
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Nov 29, 2023
@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jan 29, 2024
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Jan 30, 2024
@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Mar 30, 2024
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Mar 31, 2024
@kingnarmer

I am getting the same error with 2.10.2.

@github-actions

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Jun 15, 2024
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Jun 16, 2024
@kingnarmer

not stale.

@github-actions

github-actions bot commented Sep 6, 2024

This issue is being marked stale due to a period of inactivity. If this issue is still relevant, please comment or remove the stale label. Otherwise, this issue will close in 30 days.

@github-actions github-actions bot added the Stale label Sep 6, 2024
@dkulchinsky
Contributor Author

not stale

@github-actions github-actions bot removed the Stale label Sep 7, 2024
@wy65701436 wy65701436 added the never-stale (Do not stale) label Oct 10, 2024
@Reyzorcat

Reyzorcat commented Oct 14, 2024

@wy65701436 We faced the same issue after moving to GCS. Is there any PR or possible workaround here?

@wy65701436
Contributor

@dkulchinsky do you mind if we close this issue and use #12948 to track the performance problem, if they are the same?

We are seeing some users report the same performance problem with artifact deletion during GC. I will keep posting updates in #12948.
