-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: time.Sleep problem on Windows 10, but not Windows 7 or Linux #61042
Comments
CC @golang/windows |
is this #44343 ? |
I think that it's possible that they might be related, but the symptoms seem pretty different. |
I suspect they are the same issue (especially given the Go version ranges). Could you elaborate on which of the symptoms are different? Thanks. Also, if you're so inclined, it should be straightforward to identify if it's the same issue by trying 8fdc79e and the commit before. |
I stand corrected. I didn't read it carefully enough. Sorry. My gut feeling is that there are 2 issues happening at the same time. One is the Windows auto-switch from 15.6ms to 1ms is broken. And the other is the "noisy" results, some even LESS than the period (that should only happen on a random sleep, not then they are consecutive (except for the first). Still, it's VERY strange that I don't see either issue when I run the test on Windows 7. |
No worries at all. Thanks for looking into the difference between Windows versions.
I'm not sure what the right mental model is for blocking on an event queue is on Windows (which is what happens in Go 1.20) but I suspect the way it works is that the queue only gets checked at specific time intervals. So you can get a short sleep if your timeout runs out and you were added to the queue just before a tick is about to fire. But I'm just guessing.
I agree, but I'm not really sure what to make of that. That suggests to me that the same Windows API we use to block a thread and wait for a timer had better resolution on Windows 7 than it does in Windows 10, which is somewhat surprising. |
I'll close this issue as a duplicate of #44343 and link to it. Let me know if you disagree! |
Remember that literally the same v1.20.5 exe works on Windows 7. Is there Windows version-specific code in runtime? If not, then perhaps the new method isn't well supported by Windows 10. I don't have a good understanding of how the new method works. If memory serves, the old method just depended on the Windows system timer - which normally runs either 15.6ms or 1ms. When I try to explain why a "random" (not sync'ed with the system timer) Sleep can be too short I use this analogy: Think of a HH:MM digital clock. When you first look at it the delay before the MM changes could be anything between 0 and 1 minute. But as you keep watching it, successive delays will always be 1 minute. For the old method - The claim in the doc's that a Sleep delay will always be greater than or equal to what it is called with is not always correct, because it doesn't take this into account. I believe it should be stated as the Sleep delay will always be greater than or equal to what it is called with minus one system clock tick. I proposed making this change, or avoiding the problem by adding 1 system clock tick to the desired delay. But it wasn't done. This isn't a big deal for Linux, for example. If your Sleep is 100ns short you will probably never notice it. Unless you wanted to Sleep for 100ns and ended up with 1ns instead. With Windows, on the other hand, if you are expecting a 1ms Sleep and end up with 1ns instead it could mess you up. The other Windows issue is when you use something like a 1us Sleep, not realizing that it's probably going to be lots longer than that. Even if your avoid this in your own code (always Sleep 2ms or longer to avoid both surprises), what about 3rd party stuff that you are running? Years ago, when I last looked, there were lots of 1us Sleeps in runtime, for example. Imagine if one of those was in a spin-lock, for example. Ouch!
Perhaps it might be best to go back to the old method for Windows till the new method can be made to work? For me, 15.6 ms is a breaking change for some of my code. I understand the desire to use the 15.6ms clock for low-power situations. The compromise which was finally adopted was the auto-switch - which is based on how often you use Sleep at runtime. And the easy way to avoid the slowdown on even the first Sleep was to init a goroutine which did nothing but a Sleep(50 ms) in a loop. That would keep the system clock at 1ms. |
Sounds good to me. |
I agree this looks like a duplicate of #44343 to me. @johnrs If you haven't seen them yet, these two comments from that issue should give enough information to understand what change caused the regression starting with Go 1.16 and a lot of detail about how the runtime timers are interacting with Windows at the moment:
I don't understand one of your points from above, though:
What I don't understand is when you are seeing sleeps shorter than the requested duration. I invested quite a lot of time understanding how the runtime timers code works a few years ago and I can't think of any way for that to actually happen. As I noted in #44343 (comment) I've actually seen the linux kernal wakeup earlier than Go asked it to, but the Go runtime knew it was too early and just went to sleep again. I don't see any examples of "short sleeps" in the example you gave in the top post. All the calls to If I understand right, you are suggesting that sometimes a call like
That's an extreme simplification, but it highlights the fact that |
About why I believe it works on Windows 7 and not on Windows 10. @johnrs mentioned this:
But, since Windows 10, the time resolution set by a process using timeBeginPeriod does not affect time resolution of other processes:
Therefore, Go applications running on Windows 10 are more prone to suffer #44343. The workaround is to call |
Thanks. It seems like the runtime timers have changed a lot in the last 6 years. :) It's just coincidental that the last time-sensitive Go code I worked with, until a few days ago, was when v1.15.13 was current - just before these changes. If memory serves, Austin Clements came up with the idea of auto-switching between 1ms and 15.6ms, and he made the changes, in mid 2017. It resolved the "which clock speed to use" issue which had been ongoing since late 2014.
It's like glancing at a digital clock. It probably won't take a full minute to change the first time, rather something between 0 and 1 minute. If you keep watching, then you will see changes every 1 minute. Please realize that the runtime was decrementing the sleep counters. The Windows tick just triggered the runtime to do this. If you request a sleep of x clock ticks, the actual sleep will be between x-1 and x clock ticks. Thus, if you sleep for 1ms on Windows, when the system clock is set to tick every 1ms, your sleep will last between 0 and 1ms. If you are doing this in a tight loop, you will only notice the partial 1 tick decrease on the first pass of the loop. After than the sleep requests will be pretty much in sync with the system clock, just a bit after it. So all of the remaining sleeps will be the requested length. If you are sleeping for 1 sec then you probably aren't going to notice that the first pass is between 0.999 sec and 1.0 sec. It's only when your sleep request is close to the actual tick rate that it shows up. Since the tick rate on Windows is so slow, it's more likely to be a problem than on a Linux system, for example. Note: I put a sleep just before the loop to "sync" with the clock to avoid this from showing up in the results. The first result being off can be distracting.
I now understand that a more complex scheme was used starting with v1.16. And, if I understand correctly, it has more than one source of time info. That sounds dangerous. Different time sources generally don't tick at the same time, even if they are set to the same rate.
Interesting point. Under the old scheme, 1 should yield between 0 and the shortest possible sleep - on the first pass of the loop. After that, however, the requests are made just after a tick, hence another almost-full tick will pass before the sleep is satisfied. I believe this was the case up to v1.15. With the scheme in place now I don't know what should happen in this case. So the shorter-than-1ms results in the middle of the results might not indicate a problem now. Indeed, it would be nice if they would all be smaller! :) But the ones around 15.6ms make it clear that 1ms sleeps wouldn't work.
I ran a test. You are correct. It didn't go below 20ms in the few dozen runs I did. Sleep( 20ms ) Just for grins, I also checked 15ms and 16ms. Sleep( 15ms ) Sleep( 16ms ) At first glance, these both look good, too. But there are a few places where the result was more than twice the request. For 15ms - 30.377ms, 30.5662ms, 30.1143ms. For 16ms - 34.8855ms.
I'm guessing that this primitive isn't very tightly in sync with the system clock, probably lagging. That does guarantee "at least" but sometimes could be +1 tick.
Agreed. It does sacrifice accuracy, resulting in some "jitter". The +1 tick cases are worse - but they are only going to be noticeable when you are sleeping close to the tick rate. It seems that this primitive is running at a 15.6ms rate, however. That reduces the resolution from the previous 1ms rate. Please correct me if I don't have this right... Overall the new scheme avoids the -1 tick problem - and that's good. But it has a +1 tick problem sometimes, a lot more jitter than before, and runs with a resolution of 15.6ms instead of 1ms - and these are all bad. Are there any other advantages to the new scheme? I saw some mention of scheduling changes. Are these tied to the new scheme, or could they work with the old scheme? |
Ah! One mystery solved.
I'll have to give that a try. In the past, this worked with some versions of Go, but not with others. It seems that the runtime was using it, itself. This resulted in conflicts. It was part of what led to having the runtime do it when requested by the auto-switch logic. |
I resurrected an old test program and revised it a bit to cover the current situation. I then ran it on both Windows 7 and on Windows 10 (3 modes). The chart below shows the 4 sets of results. The test modes are:
PLEASE NOTE: I am NOT suggesting that anyone use the Windows timeBeginPeriod method except for testing. In the past this method has worked with some versions of Go, but not with others. Go's runtime uses it internally. Also, Windows wants all Begins to have Ends. What I did leaves Begin on, hence when the runtime wants to turn it on also, there will be 2 Begins in a row. Windows claims to not like that. But perhaps Go checks the current state before changing it. If so, it would avoid turning it on then, but would turn it off later - thus defeating our use of it.
My observations ... Windows 7: OK. Windows 10, in all 3 modes: Shows that time.Now(), both wall clock time and monotonic time, no longer ticks at a steady rate. The times are still "correct", just not "periodic", as they used to be. I don't if this will cause any problems or not, but it sure is strange! Windows 10, Default mode: Shows the problem: 15ms instead of 1ms resolution. There seems to be less jitter than my previous test's results, but that's because this test uses the QPC clock, not the jittery time.Now(), to measure periods. Windows 10, Go Auto Switch mode: Doing a continuous stream of 25ms sleeps in a separate goroutine did not trigger the auto switch. Looking at the Windows runtime code, this is handled by line 426: osRelax(), but only if there is no High Resolution Timer. Line 464 says that all 386 and AMD64 architectures do have it. So, if I understand correctly, this prevents auto switch from engaging. Windows 10, Windows timeBeginPeriod mode: Bingo! This makes 1ms timing work. Thus, it seems that the problem isn't the code's handling of timing events properly, but that the Windows clock isn't being set to 1ms to start with (or someone is turning it off when they shouldn't). I'm sure that there is plenty I don't know about or understand, but right now it seems to me like either Auto Switch needs to be allowed to engage or the Windows clock needs to be set-and-left at 1ms. |
@johnrs I believe your data agrees with the best information from #44343. The fact osRelax isn't called in the presence of a high res timer is identified, including the commit when that change was made and how it interacts poorly with the contribution I made to reduce worst case time latency in Go 1.16. There are suggestions to restore the use of osRelax, and counter arguments against that. A good way to thread the needle has been proven recently, but it depends on an undocumented Windows API and I believe there are attempts to get that documented underway. |
@ChrisHines Interesting. When I search #44343 for osRelax I don't see much mention of it. Could you point me to where the counter arguments against restoring it are? Thanks. |
@johnrs Looking for mentions of Here are a few comments worth reading:
I admit there is less debate about it in that thread than I remembered. I think I was confusing it with this other thread that I had come across contemporary to my previous work on runtime timers. You will probably remember this thread, as you commented there more than once. The part I think is most relevant starts here: |
@ChrisHines Thanks! The relevant part that you referred to is what the auto switch (os.Relax()) mostly resolved. Also, given the current Windows 10, a Go program's Windows clock setting shouldn't affect any other programs. I certainly like the goal of avoid using the Windows clock entirely, but I don't understand why os.Relax() hasn't been put back in use till the eventual goal is achieved. |
I can't answer that, but I think it's a valid question. |
What's the procedure to make it happen? Open a new issue proposing it? |
Perhaps. Or add a new comment to #44343 with your suggestion. I suspect a new issue would get marked as a duplicate of that one again but maybe it wouldn't if it was written as a proposal for a specific change rather than a problem report. |
Thanks. Will post on #44343 and see what happens. |
No response. What's the procedure for a proposal? Thanks. |
@johnrs The procedure is documented here: https://github.com/golang/proposal |
Thanks! |
I'm working in an autoclicker in Windows 11 and I'm having a similar problem. If I execute mousedown and mouseup without a |
|
What version of Go are you using (
go version
)?go version go1.20.5 windows/amd64 - "new"
go version go1.15.13 windows/amd64 - "old"
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (
go env
)?windows_amd64 - Windows 10 Pro, 22H2, Build 19045.2965
windows_amd64 - Windows 7 Ultimate, 6.1.7601, Service Pack 1, Build 7601
What did you do?
What did you expect to see?
Results using Go v1.15.13 or v1.20.5 code on Windows 7.
Time Resolution is fine. Sleep Resolution is fine.
What did you see instead?
Results using Go v1.20.5 code on Windows 10.
Time Resolution is fine. Sleep Resolution is bad. It should be ~1ms.
Results using Go v1.15.13 code on Windows 10.
Time Resolution is ok. Sleep Resolution is ok. Neither are as good as when run on Windows 7, but they are certainly usable.
=== Comments
Comments:
A problem I was having with a program I was working on led me to this issue.
So yes, it makes a difference in the real world.
I don't have a Windows 11 PC.
It would be interesting to see if it has the problem or not.
The "new" code was built on the Windows 10 PC.
The "old" code was build on the Windows 7 PC.
Both Windows PC clocks were running at 1ms due to other software on them.
This was verified using TimerTool.exe.
But this is moot, since the very first Sleep should have caused Go's auto-clock-switch to force it to 1ms and keep it there till 50ms after the last Sleep.
The Windows 10 PC is about 5 times faster than the much older Windows 7 PC.
Yet the results from the older PC are much "cleaner" (less latency error).
The "new" (I didn't try old) code was fine on both a Linux server and Raspberry Pi 4.
The text was updated successfully, but these errors were encountered: