Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify consistency guarantees of file operations on Windows #8840

Closed
julianbrost opened this issue Jun 22, 2021 · 15 comments · Fixed by #9451
Closed

Verify consistency guarantees of file operations on Windows #8840

julianbrost opened this issue Jun 22, 2021 · 15 comments · Fixed by #9451
Labels
area/windows Windows agent and plugins core/crash Shouldn't happen, requires attention core/evaluate Analyse/Evaluate features and problems
Milestone

Comments

@julianbrost
Copy link
Contributor

Icinga uses the pattern of writing a new version of a file to a temporary file location and once that's done, moving it to the final location replacing the old version. We should evaluate if this also gives the desired consistency on Windows, given that things might work differently and there's Boost in between potentially hiding which syscalls are actually used behind the scenes.

Rationale: #8528 reported that an agent on Windows didn't start because the state file contained all spaces, which sounds unlikely to be caused a hardware fault or user error.

@julianbrost
Copy link
Contributor Author

There's a report of corrupted state files after unclean shutdowns of Windows in the community forum: https://community.icinga.com/t/icinga2-sevice-does-not-start-after-unclean-shutdown/7692/4

So I'd say it's plausible that there's actually a problem to find here.

@nocturneop15
Copy link

Thank you Julian for pointing me the right direction. I can confirm we have full file of empty characters in case of windows crash/unclean shutdown, that prevents service start. After removing this file, new file is created and service starts normally.

This situation arrised many times, affects random monitored hosts. Our environment consist of Windows Server 2019 DTC, 2016 STD, 2012R2, all operating systems are affected.

@julianbrost julianbrost added the core/crash Shouldn't happen, requires attention label Jul 16, 2021
@julianbrost
Copy link
Contributor Author

Or probably even a better approach than checking what we currently do (I think we can tell by now that it's unreliable): figure out how to atomically replace a file on Windows and just implement this instead of what we currently do (Boost file system operations where you have to look through the boost source to even know which syscalls they use).

@K0nne
Copy link
Contributor

K0nne commented Aug 4, 2021

We are currently investigating an suspicious Icinga Agent (2.11.7).
When I looked in its logs I found the following errror, which looks like the statefile-problem:

[2021-06-26 07:49:57 +0200] information/ConfigObject: Dumping program state to file 'C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state'
[2021-06-26 07:49:58 +0200] critical/ThreadPool: Exception thrown in event handler:
Error: boost::filesystem::rename: Der Prozess kann nicht auf die Datei zugreifen, da sie von einem anderen Prozess verwendet wird: "C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state.yG7o8p", "C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state"

@K0nne
Copy link
Contributor

K0nne commented Aug 4, 2021

Its statefile was corrupted:
04-08-_2021_11-03-03

When we did a config-check with the agent it looked all good. Maybe it whould be a good idea to verify the integrity of the statefile as part of daemon -C. We can't monitor the statefile if the agent is not working. A malevolent person could sabotage the monitoring by corrupting this file.

@julianbrost
Copy link
Contributor Author

Did the startup issue with the corrupted log file appear shortly after that log message? Or is that just an older message you found in the logs? I wouldn't expect that to lead to a corrupted state file. From what we know, errors like that might appear when for example anti-virus software scans the corresponding files while we want to rename then. So with version 2.11.9, we added a workaround for that (#8770).

Also was there an unclean shutdown of the machine recently? If I remember correctly, that happened in all cases where we've seen this type of file corruption so far.

@K0nne
Copy link
Contributor

K0nne commented Aug 4, 2021

This was indeed an older log-message. The problem occured a few days ago. There were no corresponding log messages. So maybe its not related. Thanks for the hint with the workaround! 2.11.10 agents are allready on the todo list.

We are investigating if a unclean shutdown occured.

@julianbrost
Copy link
Contributor Author

Random thought I had I want to write down so that I don't forget: The rename operation might be fine as is (at least I read documentation on that and what we're doing didn't immediately sound broken) but maybe we're missing something like fsync(2) on the file before renaming it?

@julianbrost julianbrost removed their assignment Oct 13, 2021
@Al2Klimov Al2Klimov added the area/windows Windows agent and plugins label Oct 26, 2021
@K0nne
Copy link
Contributor

K0nne commented Nov 5, 2021

Since 2.11.11 we had 2 occurrences of a corrupted state-file with a base of 3200 windows agents. The workaround seems to work.

@tectumopticum
Copy link

It is still an issue with agent-version v2.12.3

@julianbrost
Copy link
Contributor Author

Wild guess: Adding FlushFileBuffers() right before that RenameFile() in

sfp->Close();
fp.close();
Utility::RenameFile(tempFilename, filename);

might improve things. But I'm just guessing that the rename operation is fine (or good enough for our requirements here) and the actual file contents are corrupt, so chances are good that some flushing of the file to disk is missing and FlushFileBuffers() sounds like it performs that operation.

However, easier said than done: The std::fstream in use there provides no access to the OS file handle, so one would have to figure out if opening a second handle and flushing on that handle also flushes writes done using other handles. If not, more rework of the code would be needed to use a write mechanism that allows access to the underlying handles.

@Al2Klimov
Copy link
Member

What about std::flush?

@julianbrost
Copy link
Contributor Author

One could give it a try, after all I'm just guessing at this point as well. But I wouldn't expect flush to do much more than close is doing already.

@julianbrost
Copy link
Contributor Author

The change I just merged should fix this and is also planned to be released in 2.13.5. Please report back if you're still seeing this issue after that release is out.

@K0nne
Copy link
Contributor

K0nne commented Aug 1, 2022

I will report back. Thanks a lot for the effort.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/windows Windows agent and plugins core/crash Shouldn't happen, requires attention core/evaluate Analyse/Evaluate features and problems
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants