Sporadic test failure: scan::appmaker::scan_workflow_from_git_url
#179
I was able to reproduce locally! I ran the failing test in a loop on a 14-core M3 MacBook Pro, and after 99 attempts, saw a similar error message.

This is difficult to reproduce: I've had to run the test hundreds or thousands of times in a loop to see that error message again.
Interestingly, the failure seems to show up only when scanning with multiple worker threads. The version of gix in use at the time was 0.62.
@Byron FYI, there appears to be a race condition somewhere in gix.

Specifically, when scanning a Git repository, Nosey Parker first enumerates all object IDs, building a vector of blob IDs. Then, in a later stage, that vector of blob IDs is used in a Rayon pipeline to read each blob and scan it, in parallel. At a rate of around one in every few hundred attempts when using multiple worker threads, I see a failure where a blob that was enumerated in the first stage cannot be found when it is read in the second stage.

Last week I tried isolating where the behavior is coming from by trying to disable all caching mechanisms in gix.

This is not a super high priority Nosey Parker issue for me right now, but I wanted to make you aware of it. I'm happy to do further investigation if you have suggestions to better debug this. Thank you!
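(For context, a minimal sketch of the two-phase shape described above. This is not the actual Nosey Parker code: `enumerate_blob_ids` and `scan_blob` are hypothetical stand-ins, and the sketch assumes `gix`, `rayon`, and `anyhow` as dependencies.)

```rust
use rayon::prelude::*;

// Hypothetical stand-ins for Nosey Parker's real enumeration and scanning.
fn enumerate_blob_ids(_repo: &gix::Repository) -> anyhow::Result<Vec<gix::ObjectId>> { unimplemented!() }
fn scan_blob(_data: &[u8]) { unimplemented!() }

fn scan_repo(path: &std::path::Path) -> anyhow::Result<()> {
    // Phase 1: open the repository once and collect every blob ID.
    let blob_ids: Vec<gix::ObjectId> = {
        let repo = gix::open(path)?;
        enumerate_blob_ids(&repo)?
    };

    // Phase 2: worker threads re-open the repository and read each blob.
    blob_ids.par_iter().try_for_each_init(
        || gix::open(path).expect("repository opens"),
        |repo, id| -> anyhow::Result<()> {
            // The sporadic failure shows up here: a blob that was
            // enumerated in phase 1 occasionally cannot be found.
            let blob = repo.find_object(*id)?;
            scan_blob(&blob.data);
            Ok(())
        },
    )
}
```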
Thanks so much for letting me know! I agree that it's most definitely a race condition where somehow a thread won't notice that a new pack-index file has become available that contains the object, probably because another thread has just loaded it. My plan here is to write a program to hopefully reproduce the issue, ideally with a repository with a lot of pack files to make its occurrence more likely. From there, it should be possible to fix it or at least improve the situation.
@Byron Thank you! I could look at this more next week. Let me know if I can help.
Thank you! I am slowly getting back into it and noticed something interesting: the code to get the initial set of objects also uses a parallel implementation. After that initial iteration, the whole ODB would be loaded/memmapped, indices and packs, but here it seems the repository is re-opened, so it starts at zero.

Chances are that the underlying issue is somewhere in the code that dynamically loads indices, as all of a sudden it fails to find an index with the object, at least in one thread (maybe another thread did successfully load it, or is in the process of doing so, and somehow the first thread doesn't get to wait for that).

Hopefully I will have a repro soon.
Yes, you're right — Nosey Parker enumerates repos in one phase, and then in the next phase reopens each repo and reads the blobs from it. This re-opening is done to avoid running out of memory, as otherwise it might need to keep thousands of repos open at once. I intend to rework this scheme eventually, to start scanning immediately rather than first enumerating everything and then scanning it. But that's a medium-term project for me. |
Great news! I could reproduce the issue and can do so consistently. Having fewer packs seems to be the key here. From what I can tell, this situation will put threads into a waiting loop, and my guess is that something there goes wrong so they don't actually pick up newly loaded indices.
With the latest modification, I cannot reproduce the issue anymore even after a thousand runs, whereas previously it would usually trigger within fewer than 50 iterations.
After a little more 'tuning' I think this issue is a thing of the past, as it can't be reproduced anymore even in 10,000 runs.

It's not impossible that more races will be discovered, maybe in long-running servers, but we'll get there when we get there; the trampling-herd problem here is now certainly resolved.
…ng a single index. The motivating example is here: praetorian-inc/noseyparker#179

Previously, it was possible for a trampling herd of threads to consolidate the disk state. Most of them would be 'needs-init' threads which could notice that the initialization already happened and just use that. But a thread might be late for the party and somehow manage to not get any newly loaded index, and thus try to consolidate with what's on disk again. Then it would again determine no change, and return nothing, causing the caller to abort and not find objects it should find, because it wouldn't see the index that it should have seen.

The reason the thread got into this mess is that the 'is-load-ongoing' flagging was racy itself, so it would not wait for ongoing loads and would just conclude that nothing happened. An extra delay (by yielding) now assures it either sees the loading state and waits for it, or sees the newly loaded indices.

Note that this issue can be reproduced with:

```
./target/release/gix -r repo-with-one-pack -t10 --trace odb stats --extra-header-lookup
```
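(A simplified illustration of the kind of fix described above, using hypothetical types rather than gix's real internals: a thread that missed the new indices must wait for an in-flight load instead of concluding that nothing changed.)

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::RwLock;

// Hypothetical stand-in for the shared object-database index state.
struct SharedIndices {
    load_in_progress: AtomicBool,
    indices: RwLock<Vec<String>>, // e.g. names of loaded pack indices
}

impl SharedIndices {
    /// Called by a thread that failed to find an object and wants to
    /// re-check the on-disk state. Without the yield loop, it could read
    /// `indices` while another thread is mid-load, see the old set, and
    /// wrongly conclude that there is nothing new.
    fn indices_after_waiting_for_load(&self) -> Vec<String> {
        while self.load_in_progress.load(Ordering::Acquire) {
            std::thread::yield_now();
        }
        self.indices.read().unwrap().clone()
    }
}
```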
Thank you @Byron for the speedy response! I will give the main branch of gitoxide a try in Nosey Parker.

Gix v0.63 was just released, and it should fix this issue for good :).
Yes, upgrading gix from 0.62 to 0.63 seems to fix this problem for me. Thank you @Byron! |
Describe the bug
Sometimes, rarely, the scan::appmaker::scan_workflow_from_git_url test case fails.

To Reproduce
🥴
Expected behavior
The test case should always succeed.
Actual behavior
See this CI run log:
Output of noseyparker --version
Additional context
I think I've only ever seen this in GitHub Actions CI, though I'm not certain.
I suspect there is some race condition in the gix library.