Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve sync performance for pull-mirrors #19125

Merged
merged 25 commits into from
Mar 31, 2022

Conversation

petergardfjall
Copy link
Contributor

@petergardfjall petergardfjall commented Mar 18, 2022

This PR addresses #18352

It aims to improve performance (and resource use) of the SyncReleasesWithTags operation for pull-mirrors.

For large repositories with many tags, SyncReleasesWithTags can be a costly operation (taking several minutes to complete). The reason is two-fold:

  1. on sync, every upstream repo tag is compared (for changes) against existing local entries in the release table to ensure that they are up-to-date.

  2. the procedure for getting each tag involves a series of git operations

     git show-ref --tags -- v8.2.4477
     git cat-file -t 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
     git cat-file -p 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
     git rev-list --count 29ab6ce9f36660cffaad3c8789e71162e5db5d2f

    of which the git rev-list --count can be particularly heavy.

This PR optimizes performance for pull-mirrors. We utilize the fact that a pull-mirror is always identical to its upstream and rebuild the entire release table on every sync and use a batch git for-each-ref .. refs/tags call to retrieve all tags in one go.

For large mirror repos, with hundreds of annotated tags, this brings down the duration of the sync operation from several minutes to a few seconds. A few unscientific examples run on my local machine:

I added a foreachref package which contains a flexible way of specifying which reference fields are of interest (git-for-each-ref(1)) and to produce a parser for the expected output. These could be reused in other places where for-each-ref is used. I'll add unit tests for those if the overall PR looks promising.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Mar 18, 2022
@6543 6543 added the performance/speed performance issues with slow downs label Mar 18, 2022
@6543 6543 added this to the 1.17.0 milestone Mar 18, 2022
@lunny
Copy link
Member

lunny commented Mar 23, 2022

I think tests for Parser is necessary.

modules/git/repo_tag.go Outdated Show resolved Hide resolved
@petergardfjall
Copy link
Contributor Author

I think tests for Parser is necessary.

  • Added tests for Parser here: a8020d0c0
  • Added tests for Format here: db3550383

@petergardfjall petergardfjall requested a review from lunny March 24, 2022 11:40
modules/git/repo_tag.go Outdated Show resolved Hide resolved
@zeripath
Copy link
Contributor

#19235 might make this unnecessary.

In particular the call to git commit-graph write might just represent the biggest performance improvement in Gitea's mirror handling for a long time.

@petergardfjall
Copy link
Contributor Author

#19235 might make this unnecessary.

In particular the call to git commit-graph write might just represent the biggest performance improvement in Gitea's mirror handling for a long time.

I haven't studied the referenced PR in detail. The main merit of this PR is that it, for pull-mirrors, reduces the number of git calls needed to sync tags with releases to one (instead of C * <number of tags> where C is a constant like 5 or so presently). This has a huge impact when syncing repositories with lots of annotated tags.

@petergardfjall
Copy link
Contributor Author

petergardfjall commented Mar 29, 2022

I did take #19235 for a spin and the mirror cloning/sync does indeed look better with that PR (nice work!), but it still does a number of git calls that is directly proportional to the number of tags in the repo.

I tried a few of the repos benchmarked in the description of this PR and noted that although the situation is better after #19235, this PR still gives an additional 6 time improvement (roughly speaking). So for the case of pull-mirrors it feels like there is still quite a lot to be gained, both in time and in system resources.

@petergardfjall petergardfjall requested a review from lunny March 29, 2022 12:36
@GiteaBot GiteaBot added lgtm/need 1 This PR needs approval from one additional maintainer to be merged. and removed lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. labels Mar 31, 2022
For large repositories with many tags, SyncReleasesWithTags can be a costly
operation (taking several minutes to complete). The reason is two-fold

1. on sync, every upstream repo tag is compared (for changes) against existing
local entries in the release table to ensure that they are up-to-date.

2. the procedure for getting each tag involves several git operations

     git show-ref --tags -- v8.2.4477
     git cat-file -t 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
     git cat-file -p 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
     git rev-list --count 29ab6ce9f36660cffaad3c8789e71162e5db5d2f

  of which the 'git rev-list --count' can be particularly heavy.

This commit optimizes performance for pull-mirrors. We utilize the fact that a
pull-mirror is always identical to its upstream and rebuild the entire release
table on every sync and use a batch 'git for-each-ref .. refs/tags' call to
retrieve all tags in one go.

For large mirror repos, with hundreds of annotated tags, this brings down the
duration of the sync operation from several minutes to a few seconds.

Signed-off-by: Peter Gardfjäll <peter.gardfjall.work@gmail.com>
For large repositories with many tags, SyncReleasesWithTags can be a costly
operation (taking several minutes to complete). The reason is two-fold

1. on sync, every upstream repo tag is compared (for changes) against existing
local entries in the release table to ensure that they are up-to-date.

2. the procedure for getting each tag involves several git operations

     git show-ref --tags -- v8.2.4477
     git cat-file -t 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
     git cat-file -p 29ab6ce9f36660cffaad3c8789e71162e5db5d2f
     git rev-list --count 29ab6ce9f36660cffaad3c8789e71162e5db5d2f

  of which the 'git rev-list --count' can be particularly heavy.

This commit optimizes performance for pull-mirrors. We utilize the fact that a
pull-mirror is always identical to its upstream and rebuild the entire release
table on every sync and use a batch 'git for-each-ref .. refs/tags' call to
retrieve all tags in one go.

For large mirror repos, with hundreds of annotated tags, this brings down the
duration of the sync operation from several minutes to a few seconds.

Signed-off-by: Peter Gardfjäll <peter.gardfjall.work@gmail.com>
@petergardfjall petergardfjall force-pushed the pull-mirror-tag-sync-optimization branch from b10d9f3 to ef6352c Compare March 31, 2022 06:40
Copy link
Contributor

@wxiaoguang wxiaoguang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool. just some small questions from my side.

modules/git/foreachref/parser.go Outdated Show resolved Hide resolved
modules/git/foreachref/parser.go Outdated Show resolved Hide resolved
modules/git/repo_tag.go Outdated Show resolved Hide resolved
Copy link
Contributor

@wxiaoguang wxiaoguang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me.

The only nit is that the return len(data), bytes.TrimSpace(data), nil is really strange .....

If it is a must and the test case is correct (which means git has strange behaviors), here we need a clear comment to describe why the TrimSpace is needed.

@GiteaBot GiteaBot added lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. and removed lgtm/need 1 This PR needs approval from one additional maintainer to be merged. labels Mar 31, 2022
@codecov-commenter
Copy link

Codecov Report

Merging #19125 (fd7f58c) into main (43332a4) will increase coverage by 0.05%.
The diff coverage is 82.92%.

@@            Coverage Diff             @@
##             main   #19125      +/-   ##
==========================================
+ Coverage   47.50%   47.56%   +0.05%     
==========================================
  Files         931      934       +3     
  Lines      130513   130681     +168     
==========================================
+ Hits        61998    62152     +154     
- Misses      61043    61056      +13     
- Partials     7472     7473       +1     
Impacted Files Coverage Δ
modules/git/repo_tag.go 66.66% <71.21%> (+5.91%) ⬆️
modules/repository/repo.go 45.67% <73.33%> (+1.06%) ⬆️
modules/git/foreachref/parser.go 93.84% <93.84%> (ø)
modules/git/foreachref/format.go 100.00% <100.00%> (ø)
modules/queue/workerpool.go 52.21% <0.00%> (-1.57%) ⬇️
models/release.go 48.00% <0.00%> (-1.46%) ⬇️
models/repo_list.go 75.66% <0.00%> (-0.49%) ⬇️
models/error.go 36.48% <0.00%> (-0.36%) ⬇️
models/issue.go 54.16% <0.00%> (-0.03%) ⬇️
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 43332a4...fd7f58c. Read the comment docs.

@6543 6543 changed the title improve sync performance for pull-mirrors Improve sync performance for pull-mirrors Mar 31, 2022
@6543 6543 merged commit e28cc79 into go-gitea:main Mar 31, 2022
@petergardfjall petergardfjall deleted the pull-mirror-tag-sync-optimization branch March 31, 2022 13:21
zjjhot added a commit to zjjhot/gitea that referenced this pull request Apr 1, 2022
* giteaoffical/main:
  Fix broken of team create (go-gitea#19288)
  Remove `git.Command.Run` and `git.Command.RunInDir*` (go-gitea#19280)
  Performance improvement for add team user when org has more than 1000 repositories (go-gitea#19227)
  [skip ci] Updated translations via Crowdin
  Update JS dependencies (go-gitea#19281)
  Fix container download counter (go-gitea#19287)
  go.mod: update kevinburke/ssh_config to v1.2.0 (go-gitea#19286)
  Fix global packages enabled avaiable (go-gitea#19276)
  Add Goroutine stack inspector to admin/monitor (go-gitea#19207)
  Move checks for pulls before merge into own function (go-gitea#19271)
  Restore user autoregistration with email addresses (go-gitea#19261)
  Improve sync performance for pull-mirrors (go-gitea#19125)
  Refactor `git.Command.Run*`, introduce `RunWithContextString` and `RunWithContextBytes` (go-gitea#19266)
  Move reaction to models/issues/ (go-gitea#19264)
@go-gitea go-gitea locked and limited conversation to collaborators Apr 28, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/done This PR has enough approvals to get merged. There are no important open reservations anymore. performance/speed performance issues with slow downs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants