Verify consistency guarantees of file operations on Windows #8840

julianbrost · 2021-06-22T09:49:37Z

Icinga uses the pattern of writing a new version of a file to a temporary file location and once that's done, moving it to the final location replacing the old version. We should evaluate if this also gives the desired consistency on Windows, given that things might work differently and there's Boost in between potentially hiding which syscalls are actually used behind the scenes.

Rationale: #8528 reported that an agent on Windows didn't start because the state file contained only spaces, which sounds unlikely to be caused by a hardware fault or user error.

julianbrost · 2021-07-16T08:26:35Z

There's a report of corrupted state files after unclean shutdowns of Windows in the community forum: https://community.icinga.com/t/icinga2-sevice-does-not-start-after-unclean-shutdown/7692/4

So I'd say it's plausible that there's actually a problem to find here.

nocturneop15 · 2021-07-16T10:24:19Z

Thank you Julian for pointing me in the right direction. I can confirm that after a Windows crash or unclean shutdown, the file is completely filled with empty characters, which prevents the service from starting. After removing this file, a new file is created and the service starts normally.

This situation has arisen many times and affects random monitored hosts. Our environment consists of Windows Server 2019 DTC, 2016 STD, and 2012R2; all of these operating systems are affected.

julianbrost · 2021-07-16T12:06:41Z

Or probably an even better approach than checking what we currently do (I think we can tell by now that it's unreliable): figure out how to atomically replace a file on Windows and implement that instead of what we currently do (Boost file system operations, where you have to look through the Boost source to even know which syscalls they use).
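As a rough illustration of what calling the OS directly could look like, here is a minimal sketch. `ReplaceFileAtomically` is a hypothetical helper, not Icinga's code. It assumes `MoveFileExA` with `MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH` on Windows and `rename(2)` elsewhere. Whether the Windows call gives the consistency wanted here is exactly the open question of this issue, so treat it as a starting point for evaluation only.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper (not part of Icinga): replace `target` with `temp`
// using a single OS-level rename call instead of going through Boost.
bool ReplaceFileAtomically(const std::string& temp, const std::string& target)
{
    assert(!temp.empty() && !target.empty());
#ifdef _WIN32
    // MOVEFILE_REPLACE_EXISTING overwrites an existing target file.
    // MOVEFILE_WRITE_THROUGH makes the call return only after the move
    // has actually been carried out on disk.
    return MoveFileExA(temp.c_str(), target.c_str(),
        MOVEFILE_REPLACE_EXISTING | MOVEFILE_WRITE_THROUGH) != 0;
#else
    // POSIX rename(2) replaces an existing target atomically when both
    // paths are on the same file system.
    return std::rename(temp.c_str(), target.c_str()) == 0;
#endif
}
```

Note that this only covers the rename step; it says nothing about whether the contents of the temporary file have reached the disk.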

K0nne · 2021-08-04T08:21:52Z

We are currently investigating a suspicious Icinga Agent (2.11.7).
When I looked in its logs I found the following error, which looks like the state file problem:

[2021-06-26 07:49:57 +0200] information/ConfigObject: Dumping program state to file 'C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state'
[2021-06-26 07:49:58 +0200] critical/ThreadPool: Exception thrown in event handler:
Error: boost::filesystem::rename: The process cannot access the file because it is being used by another process: "C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state.yG7o8p", "C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state"

K0nne · 2021-08-04T09:08:18Z

Its statefile was corrupted:

When we did a config check with the agent, everything looked good. Maybe it would be a good idea to verify the integrity of the state file as part of daemon -C. We can't monitor the state file if the agent is not working. A malevolent person could sabotage the monitoring by corrupting this file.
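A very small part of such an integrity check could be to detect the specific corruption reported in this thread, a file consisting of nothing but blank characters. The following is a sketch only: `StateFileLooksBlank` is a hypothetical name, and treating both spaces and NUL bytes as "blank" is an assumption, since the reports here only speak of "spaces" and "empty characters".

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical check (not part of Icinga): returns true if the file can be
// opened and contains no byte other than ' ' or NUL. An empty file also
// counts as blank. Returns false if the file cannot be opened at all.
bool StateFileLooksBlank(const std::string& path)
{
    assert(!path.empty());

    std::ifstream in(path, std::ios::binary);
    if (!in)
        return false;

    char c;
    while (in.get(c)) {
        if (c != ' ' && c != '\0')
            return false; // found real content
    }
    return true;
}
```

A check like this would not detect other kinds of damage, such as a truncated but non-blank file.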

julianbrost · 2021-08-04T09:21:27Z

Did the startup issue with the corrupted state file appear shortly after that log message? Or is that just an older message you found in the logs? I wouldn't expect that error to lead to a corrupted state file. From what we know, errors like that can appear when, for example, anti-virus software scans the corresponding files while we want to rename them. So with version 2.11.9, we added a workaround for that (#8770).
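One generic way to tolerate a transient "file in use" error is to retry the operation a few times with a short pause. The sketch below shows that idea only; it is an assumption for illustration and is not taken from #8770, whose actual approach is not described in this thread. `RetryWithDelay` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: call `op` until it reports success or `attempts`
// calls have been made, sleeping `delay` between failed attempts.
bool RetryWithDelay(const std::function<bool()>& op, int attempts,
    std::chrono::milliseconds delay)
{
    assert(attempts > 0);

    for (int i = 0; i < attempts; ++i) {
        if (op())
            return true;

        // No need to sleep after the final failed attempt.
        if (i + 1 < attempts)
            std::this_thread::sleep_for(delay);
    }
    return false;
}
```

The rename call would be passed in as `op`, for example as a lambda that returns whether the rename succeeded.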

Also was there an unclean shutdown of the machine recently? If I remember correctly, that happened in all cases where we've seen this type of file corruption so far.

K0nne · 2021-08-04T09:30:56Z

This was indeed an older log message. The problem occurred a few days ago. There were no corresponding log messages, so maybe it's not related. Thanks for the hint about the workaround! 2.11.10 agents are already on the to-do list.

We are investigating whether an unclean shutdown occurred.

julianbrost · 2021-09-03T09:30:27Z

Random thought I want to write down so that I don't forget it: the rename operation might be fine as is (at least I read the documentation on it, and what we're doing didn't immediately sound broken), but maybe we're missing something like fsync(2) on the file before renaming it?
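On POSIX systems, the ordering hinted at here would look roughly like the sketch below: write the temporary file, call `fsync(2)` on it, and only then rename it over the final name. `WriteSyncRename` is a hypothetical helper using plain POSIX calls; it is not Icinga's code and does not apply to Windows as written.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

#include <fcntl.h>
#include <unistd.h>

// Hypothetical POSIX-only helper: write `data` to `temp`, force it to
// stable storage, then rename `temp` over `target`.
bool WriteSyncRename(const std::string& temp, const std::string& target,
    const std::string& data)
{
    assert(!temp.empty() && !target.empty());

    int fd = open(temp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return false;

    const char* p = data.data();
    std::size_t left = data.size();
    bool ok = true;

    // write(2) may write fewer bytes than requested, so loop until done.
    // (EINTR handling is omitted to keep the sketch short.)
    while (ok && left > 0) {
        ssize_t n = write(fd, p, left);
        if (n <= 0) {
            ok = false;
        } else {
            p += n;
            left -= static_cast<std::size_t>(n);
        }
    }

    // The step in question: flush the file contents to disk BEFORE the
    // rename makes the new file visible under its final name.
    ok = ok && fsync(fd) == 0;
    ok = (close(fd) == 0) && ok;

    if (!ok) {
        unlink(temp.c_str());
        return false;
    }

    return std::rename(temp.c_str(), target.c_str()) == 0;
}
```

Without the `fsync` call, a crash shortly after the rename could leave the final name pointing at a file whose data never reached the disk.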

K0nne · 2021-11-05T09:56:15Z

Since 2.11.11 we have had 2 occurrences of a corrupted state file across a base of 3200 Windows agents. The workaround seems to work.

tectumopticum · 2022-05-06T13:39:29Z

It is still an issue with agent-version v2.12.3

julianbrost · 2022-06-07T12:21:47Z

Wild guess: Adding FlushFileBuffers() right before that RenameFile() in

icinga2/lib/base/configobject.cpp, lines 504 to 508 in 18c8b4a:

    sfp->Close();

    fp.close();

    Utility::RenameFile(tempFilename, filename);

might improve things. But I'm just guessing that the rename operation is fine (or good enough for our requirements here) and the actual file contents are corrupt, so chances are good that some flushing of the file to disk is missing and FlushFileBuffers() sounds like it performs that operation.

However, easier said than done: The std::fstream in use there provides no access to the OS file handle, so one would have to figure out if opening a second handle and flushing on that handle also flushes writes done using other handles. If not, more rework of the code would be needed to use a write mechanism that allows access to the underlying handles.
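For comparison, a write mechanism that does expose the underlying handle is C stdio: on Windows, `_fileno()` and `_get_osfhandle()` lead from a `FILE*` to the OS handle that `FlushFileBuffers()` expects, and on POSIX `fileno()` gives the descriptor for `fsync()`. The sketch below assumes the code were changed to write through a `FILE*`; `FlushToDisk` is a hypothetical helper and not necessarily what any later fix in Icinga does.

```cpp
#include <cassert>
#include <cstdio>

#ifdef _WIN32
#include <io.h>
#include <windows.h>
#else
#include <unistd.h>
#endif

// Hypothetical helper: push everything written to `fp` down to the disk.
bool FlushToDisk(std::FILE* fp)
{
    assert(fp != nullptr);

    // Step 1: move data from the stdio buffer to the operating system.
    if (std::fflush(fp) != 0)
        return false;

#ifdef _WIN32
    // Step 2 (Windows): ask the OS to write its buffers for this handle.
    HANDLE h = reinterpret_cast<HANDLE>(_get_osfhandle(_fileno(fp)));
    return h != INVALID_HANDLE_VALUE && FlushFileBuffers(h) != 0;
#else
    // Step 2 (POSIX): same idea via fsync(2) on the file descriptor.
    return fsync(fileno(fp)) == 0;
#endif
}
```

The helper would be called after the last write and before closing the file and renaming it.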

Al2Klimov · 2022-06-14T07:26:25Z

What about std::flush?

julianbrost · 2022-06-14T07:38:38Z

One could give it a try, after all I'm just guessing at this point as well. But I wouldn't expect flush to do much more than close is doing already.

julianbrost · 2022-08-01T15:24:45Z

The change I just merged should fix this and is also planned to be released in 2.13.5. Please report back if you're still seeing this issue after that release is out.

K0nne · 2022-08-01T17:15:35Z

I will report back. Thanks a lot for the effort.

julianbrost added the core/evaluate Analyse/Evaluate features and problems label Jun 22, 2021

julianbrost self-assigned this Jun 22, 2021

julianbrost mentioned this issue Jun 22, 2021

Icinga agent (windows) is unable to start when the state file is corrupted #8528

Closed

julianbrost added the core/crash Shouldn't happen, requires attention label Jul 16, 2021

julianbrost removed their assignment Oct 13, 2021

Al2Klimov added the area/windows Windows agent and plugins label Oct 26, 2021

julianbrost mentioned this issue Nov 5, 2021

Icinga DB: decouple environment from Icinga 2 Environment constant #9036

Merged

julianbrost mentioned this issue Jul 18, 2022

Windows: explicitly flush state file #9446

Closed

julianbrost linked a pull request Aug 1, 2022 that will close this issue

Dump state file atomically not to corrupt it #9451

Merged

julianbrost closed this as completed in #9451 Aug 1, 2022

icinga-probot bot added this to the 2.14.0 milestone Aug 1, 2022
