Verify consistency guarantees of file operations on Windows #8840
Comments
There's a report of corrupted state files after unclean shutdowns of Windows in the community forum: https://community.icinga.com/t/icinga2-sevice-does-not-start-after-unclean-shutdown/7692/4 So I'd say it's plausible that there's actually a problem to find here. |
Thank you Julian for pointing me in the right direction. I can confirm that after a Windows crash or unclean shutdown we end up with a file full of empty characters, which prevents the service from starting. After removing this file, a new file is created and the service starts normally. This situation has arisen many times and affects random monitored hosts. Our environment consists of Windows Server 2019 DTC, 2016 STD, and 2012R2; all operating systems are affected. |
Or probably even a better approach than checking what we currently do (I think we can tell by now that it's unreliable): figure out how to atomically replace a file on Windows and just implement this instead of what we currently do (Boost file system operations where you have to look through the boost source to even know which syscalls they use). |
We are currently investigating a suspicious Icinga Agent (2.11.7).
|
When we did a config-check with the agent it looked all good. Maybe it would be a good idea to verify the integrity of the statefile as part of |
Did the startup issue with the corrupted log file appear shortly after that log message? Or is that just an older message you found in the logs? I wouldn't expect that to lead to a corrupted state file. From what we know, errors like that might appear when, for example, anti-virus software scans the corresponding files while we want to rename them. So with version 2.11.9, we added a workaround for that (#8770). Also, was there an unclean shutdown of the machine recently? If I remember correctly, that happened in all cases where we've seen this type of file corruption so far. |
This was indeed an older log message. The problem occurred a few days ago. There were no corresponding log messages. So maybe it's not related. Thanks for the hint with the workaround! 2.11.10 agents are already on the todo list. We are investigating whether an unclean shutdown occurred. |
Random thought I had I want to write down so that I don't forget: The rename operation might be fine as is (at least I read documentation on that and what we're doing didn't immediately sound broken) but maybe we're missing something like |
Since 2.11.11 we have had 2 occurrences of a corrupted state file with a base of 3200 Windows agents. The workaround seems to work. |
It is still an issue with agent-version v2.12.3 |
Wild guess: Adding a flush at icinga2/lib/base/configobject.cpp, lines 504 to 508 in 18c8b4a, might improve things. But I'm just guessing that the rename operation is fine (or good enough for our requirements here) and that the actual file contents are corrupt, so chances are good that some flushing of the file to disk is missing, and FlushFileBuffers() sounds like it performs that operation.
However, easier said than done: The |
What about std::flush? |
One could give it a try, after all I'm just guessing at this point as well. But I wouldn't expect |
The change I just merged should fix this and is also planned to be released in 2.13.5. Please report back if you're still seeing this issue after that release is out. |
I will report back. Thanks a lot for the effort. |
Icinga uses the pattern of writing a new version of a file to a temporary file location and once that's done, moving it to the final location replacing the old version. We should evaluate if this also gives the desired consistency on Windows, given that things might work differently and there's Boost in between potentially hiding which syscalls are actually used behind the scenes.
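For readers unfamiliar with the pattern, it can be sketched as follows. This is an illustration in standard C++17, not Icinga's actual code (which goes through Boost); the helper name `SaveViaTempFile` and the `.tmp` suffix are assumptions made for the example.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical helper illustrating the pattern: write the new version to a
// temporary file, then rename it over the final location.
bool SaveViaTempFile(const std::filesystem::path& target, const std::string& contents)
{
	std::filesystem::path tmp = target;
	tmp += ".tmp"; // assumed naming scheme for the temporary file

	{
		std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
		out << contents;
		out.flush(); // hands the data to the OS; does not by itself force it to disk
		if (!out)
			return false;
	} // the stream is closed here, before the rename

	std::error_code ec;
	std::filesystem::rename(tmp, target, ec); // replaces an existing target file
	return !ec;
}
```

Note that nothing in this sketch forces the file contents to disk before the rename. After an unclean shutdown the new name could therefore refer to a file whose data was never written, which would be consistent with the symptom reported below.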
Rationale: #8528 reported that an agent on Windows didn't start because the state file contained all spaces, which sounds unlikely to be caused by a hardware fault or user error.