Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make token deduplication optional for the repo indexer #7826

Closed
wants to merge 2 commits into from

Conversation

guillep2k
Copy link
Member

@guillep2k guillep2k commented Aug 11, 2019

This PR solves #7825 by optionally removing the unique filter from the repository indexer analyzer.

The thing is: the current behaviour directly tracks to #3452, where the unique filter was added for the sake of space saving. So, I guess there is a case where this is an acceptable trade-off. That's why I introduced an app.ini option (REPO_INDEXER_DEDUPLICATE) to make the change optional to avoid introducing problems to users that count on this.

However, I didn't notice much of a setback in the index size with this change in my tests (the resulting index was even a little smaller in my case):

  • With the unique filter active (no patch): 26,935,296
  • With the unique filter removed (with this patch): 26,136,576

I guess it really depends on the actual repository.

If the app.ini option is not desirable, I'll just remove it.

@guillep2k
Copy link
Member Author

guillep2k commented Aug 11, 2019

Pondering about why I didn't see the kind of improvements in size that @ethantkoenig (the author of #3452) found, one thought came to mind: could it be possible that gains he found were caused by him rebuilding the index, and not thanks to adding the unique filter? Perhaps Bleve has some kind of fragmenting issue that justifies considering rebuilding it from time to time to claim space.

@GiteaBot GiteaBot added the lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging. label Aug 11, 2019
@codecov-io
Copy link

Codecov Report

Merging #7826 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #7826      +/-   ##
=========================================
+ Coverage   41.39%   41.4%   +<.01%     
=========================================
  Files         474     474              
  Lines       63780   63785       +5     
=========================================
+ Hits        26404   26412       +8     
+ Misses      33940   33935       -5     
- Partials     3436    3438       +2
Impacted Files Coverage Δ
modules/setting/indexer.go 100% <100%> (ø) ⬆️
modules/indexer/repo.go 71.11% <100%> (+0.88%) ⬆️
modules/log/event.go 64.61% <0%> (-1.03%) ⬇️
models/gpg_key.go 55.83% <0%> (-0.84%) ⬇️
models/repo_list.go 73.09% <0%> (+1.01%) ⬆️
routers/repo/view.go 43.25% <0%> (+1.01%) ⬆️
models/unit.go 67.56% <0%> (+5.4%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e9bb75d...3ecbdbe. Read the comment docs.

@guillep2k
Copy link
Member Author

Or perhaps the gains came mostly from the fact that the _all mapping was removed as well. It would be nice to test this with a larger set of repositories.

@lunny
Copy link
Member

lunny commented Aug 14, 2019

@guillep2k I think the size improvement is because removed _all mapping

@guillep2k
Copy link
Member Author

@guillep2k I think the size improvement is because removed _all mapping

@lunny I think so too, but I couldn't find anyone with a large index to try this change. If that's the case, then making the "unique" token optional makes no sense: it just should be removed. What you think?

@lafriks
Copy link
Member

lafriks commented Aug 15, 2019

I also think that there is not much point in adding such things as configurable. We already have too many things in app.ini :)

@guillep2k
Copy link
Member Author

I also think that there is not much point in adding such things as configurable. We already have too many things in app.ini :)

@lunny, @lafriks Please check out my other PR, then: #7878. Much simpler.

@guillep2k guillep2k closed this Aug 15, 2019
@guillep2k guillep2k deleted the repo-idx-unique-optional branch August 16, 2019 01:07
@go-gitea go-gitea locked and limited conversation to collaborators Nov 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
lgtm/need 2 This PR needs two approvals by maintainers to be considered for merging.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants