mac: runtime: failed to create new OS thread #554

Closed
shubhindia opened this issue Jun 10, 2022 · 22 comments

@shubhindia

shubhindia commented Jun 10, 2022

We are using the latest release available.

2022/06/10 10:46:01 bazel-remote built with go1.18 from git commit f774f6dc82c22540511241a3db58462898edc2be.

We are using bazel-remote as a cache backend for iOS builds. It ran fine for about 3 weeks, but now we get the error below after starting the server.

runtime: failed to create new OS thread

There are no logs apart from the access logs, only the error above. We are running it on an M1 Mac mini with a 512GB SSD.

Host OS: Mac OS
Version: Mac OS 12.2.1
RAM: 16GB

@mostynb
Collaborator

mostynb commented Jun 10, 2022

Hi, a quick web search for "runtime: failed to create new OS thread golang m1 mac" gives me several mentions of people having similar trouble with docker. Are you using docker here too? Is that the entire log line, or does it have something like (have 2 already; errno=22) at the end?

If you are using docker, I suggest trying the bazel-remote-2.3.7-darwin-arm64 binary from https://github.com/buchgr/bazel-remote/releases without docker.

@shubhindia
Author

shubhindia commented Jun 13, 2022

Hi @mostynb, no, we are not using docker. And yes, that is the entire log line. There is nothing else before or after it. I even tried with the latest merges (I built a binary locally with the latest changes) and the issue still persists. I am pretty certain this started happening when we reached the storage limit we have set for the cache. I was hoping #480 would fix it, but the issue persists. Somehow it is not able to clean up files as fast as the cache is being written.

Screenshot 2022-06-13 at 12 20 32 PM

For example, in the screenshot above you can see that the current cache size is around 400GB, but our limit is set to 375GB. The first crash happened when we crossed the initial 300GB limit, so we increased it to 375GB, and the cache crossed that limit too.

@mostynb
Collaborator

mostynb commented Jun 13, 2022

Is this a HDD or an SSD?

@shubhindia
Author

@mostynb it's an SSD.

@shubhindia
Author

Also, the storage metrics it reports are incorrect. I manually cleaned up some old files by running find . -type f -mtime +9 -delete and the actual usage went down to 304GB, but it's still reporting 402GB.

Screenshot 2022-06-13 at 1 16 56 PM

Screenshot 2022-06-13 at 1 22 30 PM

@mostynb
Collaborator

mostynb commented Jun 13, 2022

bazel-remote doesn't scan the data dir every time the disk usage metrics are updated (because that's really slow) - if you manually remove files from the data dir the metrics will be wrong until bazel-remote is restarted. I wonder if you can try restarting bazel-remote with a smaller size limit, say 300G?

@shubhindia
Author

That's what I figured. We already had a limit of 300GB and it went above that. Hence, I had to increase the limit.

@mostynb
Collaborator

mostynb commented Jun 13, 2022

We already had a limit of 300GB and it went above that. Hence, I had to increase the limit.

bazel-remote only uses an approximation of the actual disk space used (because some is lost due to filesystem overheads, which can vary between filesystems). If you set a size limit and bazel-remote let the actual usage go too far over that, then you should reduce bazel-remote's limit (not increase it).
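
As a rough illustration of the filesystem overhead mentioned above, the Go sketch below compares the sum of file sizes with the space those files occupy when each one is rounded up to a whole block. The 4096-byte block size and the roundUp helper are assumptions made for this example; the thread does not say how bazel-remote computes its estimate.

```go
package main

import "fmt"

// blockSize is an assumed filesystem allocation unit. Real values vary
// between filesystems.
const blockSize int64 = 4096

// roundUp returns size rounded up to the next multiple of block.
func roundUp(size, block int64) int64 {
	return (size + block - 1) / block * block
}

func main() {
	sizes := []int64{1, 4096, 4097, 100000}

	var logical, onDisk int64
	for _, s := range sizes {
		logical += s
		onDisk += roundUp(s, blockSize)
	}
	fmt.Printf("sum of file sizes: %d bytes\n", logical)
	fmt.Printf("block-rounded estimate: %d bytes\n", onDisk)
}
```

A cache with many small files shows the largest gap between the two numbers, which is one reason to set the limit below the space that is really available.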

Do you run anything else on this mac mini? Is it close to running out of space on the SSD?

@shubhindia
Author

We already had a limit of 300GB and it went above that. Hence, I had to increase the limit.

bazel-remote only uses an approximation of the actual disk space used (because some is lost due to filesystem overheads, which can vary between filesystems). If you set a size limit and bazel-remote let the actual usage go too far over that, then you should reduce bazel-remote's limit (not increase it).

Do you run anything else on this mac mini? Is it close to running out of space on the SSD?

Reducing the limit is what introduced the error in the first place. I reduced the limit from 400GB to 300GB because the cache was way above the limit, and I thought reducing it would make the LRU cleanup more aggressive. Now we are at around 340GB with the limit set to 375GB, and we haven't seen the error yet, though I assume it will come back once the size passes the 375GB limit.
Yes, apart from bazel-remote, node-exporter and prometheus are running on the Mac mini. We are at 85% disk usage currently.

@mostynb
Collaborator

mostynb commented Jun 13, 2022

Looking into the go runtime source makes me think this might be a mac-specific limitation, like the one mentioned here:
https://stackoverflow.com/questions/52861928/max-number-of-threads-mac-allows/59520907#59520907

I wonder if you could try enabling performance mode on your mac mini and see if that helps? https://support.apple.com/en-us/HT202528

@shubhindia
Author

Looking into the go runtime source makes me think this might be a mac-specific limitation, like the one mentioned here: https://stackoverflow.com/questions/52861928/max-number-of-threads-mac-allows/59520907#59520907

I wonder if you could try enabling performance mode on your mac mini and see if that helps? https://support.apple.com/en-us/HT202528

Those threads are about macOS Server, and we are running vanilla macOS. I still tried the suggested commands, and they give an error. Since we want to increase the per-process thread limit, I will check whether there is another way to do this.

@shubhindia
Author

OK, I managed to increase the limit by using some custom plists. Right now the cache is at 340GB and the limit is set to 375GB, so it's not going to crash anytime soon. I will decrease the limit to 300GB on the weekend and test then, since builds are currently using this cache. If the cache comes down to less than 300GB, or even close to it, we can assume the fix worked. Otherwise I will have to debug more.

@shubhindia
Author

@mostynb CMIIW, according to https://github.com/buchgr/bazel-remote/blob/master/cache/disk/disk.go#L142 , bazel-remote might spawn 5000 threads to clean up the disk in the worst case. Since this value is still more than what is allowed for a user (at least on macOS), this might be the reason it is crashing for us. If so, I will look for a way to set this dynamically.

@shubhindia
Author

Update: I reduced the limit to 300 and it failed again.

2022/06/14 20:54:53 bazel-remote built with go1.18 from git commit f774f6dc82c22540511241a3db58462898edc2be.
2022/06/14 15:24:53 Initial RLIMIT_NOFILE cur: 2560 max: 5000
2022/06/14 15:24:53 Setting RLIMIT_NOFILE cur: 5000 max: 5000
2022/06/14 15:24:53 Loading existing files in /Users/runner/builds/bazel_cache.
2022/06/14 15:28:53 Sorting cache files by atime.
2022/06/14 15:28:54 Building LRU index.
runtime: failed to create new OS thread
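
For context on the two RLIMIT_NOFILE lines in this log: they concern the open-file limit, which is separate from any thread limit. A minimal Go sketch of raising the soft limit to the hard limit could look like the following. The raiseNoFile name and the exact log wording are assumptions modelled on the log above, not bazel-remote's source.

```go
package main

import (
	"log"
	"syscall"
)

// raiseNoFile raises the soft RLIMIT_NOFILE value to the hard maximum and
// returns the limits seen before and after the change.
func raiseNoFile() (before, after syscall.Rlimit, err error) {
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &before); err != nil {
		return before, after, err
	}

	want := before
	want.Cur = want.Max // soft limit may go up to, but not beyond, the hard limit
	if err = syscall.Setrlimit(syscall.RLIMIT_NOFILE, &want); err != nil {
		return before, after, err
	}

	err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &after)
	return before, after, err
}

func main() {
	before, after, err := raiseNoFile()
	if err != nil {
		// Some systems reject the request, for example when the hard limit
		// is reported as unlimited.
		log.Printf("could not raise RLIMIT_NOFILE: %v", err)
		return
	}
	log.Printf("Initial RLIMIT_NOFILE cur: %d max: %d", before.Cur, before.Max)
	log.Printf("Current RLIMIT_NOFILE cur: %d max: %d", after.Cur, after.Max)
}
```

Raising this limit does not change how many OS threads the process may create, so it would not be expected to affect the crash shown in the last log line.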

@mostynb
Collaborator

mostynb commented Jun 14, 2022

CMIIW, according to https://github.com/buchgr/bazel-remote/blob/master/cache/disk/disk.go#L142 , bazel-remote might spawn 5000 threads to clean up the disk in the worst case. Since this value is still more than what is allowed for a user (at least on macOS), this might be the reason it is crashing for us. If so, I will look for a way to set this dynamically.

That's approximately correct, yes. bazel-remote attempts to create no more than 5,000 simultaneous file removal operations, because Go has a 10,000 OS thread limit and each file removal might block and cause the creation of another OS thread. Besides those potentially 5,000 blocked OS threads, there will be some other OS threads used by bazel-remote (but hopefully many fewer than 5k).

Your most recent log shows the crash occurring when bazel-remote was still building the LRU index. Since you have reduced your cache limit, I suspect that bazel-remote was removing a bunch of files, and hit some limit that is less than 5000 os threads. So I think we'll need to do a bit of research to figure out what that limit is on your machine, and how we might be able to raise it. This also means that your disk is having trouble keeping up with all these file removals.

Is this an Intel Mac mini or an M1 Mac mini? Which version of macOS is it running? Can you share the output from launchctl limit?

@shubhindia
Author

shubhindia commented Jun 15, 2022

Screenshot 2022-06-15 at 2 05 44 PM

This is the output of launchctl limit. I tried raising maxproc to more than 4000 but couldn't. I was also looking for a way to set the semaphore dynamically instead of hardcoding it, but couldn't find a suitable syscall.
This is an M1 Mac mini running macOS 12.2.1 with 16GB RAM and a 512GB SSD.

@shubhindia
Author

This also means that your disk is having trouble keeping up with all these file removals.

Yeah, when I checked disk I/O, it did spike when I reduced the limit.

Screenshot 2022-06-15 at 2 49 45 PM

@shubhindia
Author

Update: I reduced the semaphore value to 3000. Disk I/O did spike, but the server did not crash, and it cleaned up the cache as well.

Screenshot 2022-06-15 at 8 57 22 PM

Screenshot 2022-06-15 at 9 00 20 PM

@mostynb
Collaborator

mostynb commented Jun 15, 2022

Interesting- what is the semaphore setting, and how did you change it?

@shubhindia
Author

Interesting- what is the semaphore setting, and how did you change it?

The line I was talking about. shubhindia@8a072cb

mostynb added a commit to mostynb/bazel-remote that referenced this issue Jun 16, 2022
This also adds a startup log line mentioning the value, to help debugging.

This might avoid buchgr#554.
@mostynb
Collaborator

mostynb commented Jun 16, 2022

Oh, right. If this works for you, how about we lower this to 3000 only for mac, and add a log line for debugging: #556 ?
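
The proposal above can be sketched in Go as below. The values 5000 and 3000 come from this thread; the function name maxConcurrentRemovals and the wording of the log line are assumptions for illustration and are not taken from #556.

```go
package main

import (
	"log"
	"runtime"
)

// maxConcurrentRemovals returns the cap on simultaneous file removals for
// the given operating system name (as reported by runtime.GOOS).
func maxConcurrentRemovals(goos string) int {
	if goos == "darwin" {
		// macOS appears to allow fewer threads per process, so use the
		// lower value that worked in this thread.
		return 3000
	}
	return 5000
}

func main() {
	limit := maxConcurrentRemovals(runtime.GOOS)
	// A startup log line like this makes the effective value visible when
	// debugging reports such as this one.
	log.Printf("Limiting concurrent file removals to %d", limit)
}
```

Taking the OS name as a parameter keeps the choice easy to check on any platform, while the caller passes runtime.GOOS in normal use.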

mostynb changed the title from "runtime: failed to create new OS thread" to "mac: runtime: failed to create new OS thread" on Jun 16, 2022
@shubhindia
Author

Closing this, as setting the semaphore limit to a lower value fixes the issue.
