mission record is not properly cleaned up (when missions fail) #256

DaveyBiggers · 2016-08-04T10:23:24Z

Example in ALE:
Exception in closing of MissionRecord: boost::filesystem::remove: Directory not empty: "./mission_records/c8ba55fd-cf22-49e5-b20f-82e5f7677d66"
Similarly, in Malmo, mission records are not properly destroyed when a mission fails ("Call to write failed").

DaveyBiggers · 2016-08-05T18:05:47Z

Here's one scenario I've managed to reproduce:

Create a MissionRecordSpec. This creates a temp directory and chooses path names.
Call AgentHost.startMission, passing the MissionRecordSpec. This creates a new MissionRecord object from the MissionRecordSpec, which takes its paths.
The mission fails to start.
Retry: call AgentHost.startMission, passing the same MissionRecordSpec as before. This creates a new MissionRecord object, but with the same temp folder as before.
The AgentHost's old MissionRecord object gets destroyed as it creates the new one.
The MissionRecord destructor calls close(), which tars and zips everything up, and deletes the temp folder.
The mission now runs with the new MissionRecord object.
The mission ends, the new MissionRecord object closes, and attempts to zip up a temp folder that no longer exists, resulting in the error "Attempt to write to non-existent directory"
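
The sequence above can be modelled in a few lines of Python. This is only a toy sketch of the ownership problem, not Malmo code: `OldStyleSpec`, `OldStyleRecord` and `reproduce` are made-up names, and the tar/zip step is reduced to a directory check followed by a delete.

```python
import os
import shutil
import tempfile


class OldStyleSpec:
    """Toy stand-in for the old MissionRecordSpec: it owns the temp folder."""

    def __init__(self):
        self.temp_dir = tempfile.mkdtemp(prefix="mission_record_")


class OldStyleRecord:
    """Toy stand-in for the old MissionRecord: it borrows the spec's folder."""

    def __init__(self, spec):
        self.temp_dir = spec.temp_dir

    def close(self):
        # The real close() tars and zips first; here we only model the delete.
        if not os.path.isdir(self.temp_dir):
            raise RuntimeError("Attempt to write to non-existent directory")
        shutil.rmtree(self.temp_dir)


def reproduce():
    spec = OldStyleSpec()
    first = OldStyleRecord(spec)    # first startMission; the mission fails
    second = OldStyleRecord(spec)   # retry with the same spec, same folder
    first.close()                   # old record destroyed, folder deleted
    try:
        second.close()              # mission ends, folder is already gone
    except RuntimeError as err:
        return str(err)
    return None
```

Calling `reproduce()` returns the error text from the second `close()`, because both records were handed the same folder by the shared spec.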

The work-around is to make sure there is a fresh MissionRecordSpec object for every call to startMission - our samples are wrong in this regard.
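
A minimal sketch of that work-around as a retry helper, assuming the caller supplies two hypothetical callables: `make_record_spec`, which builds a new spec each time, and `start_mission`, which is assumed to raise `RuntimeError` when the mission fails to start. Neither name comes from the Malmo API.

```python
def start_with_retries(start_mission, make_record_spec, max_attempts=3):
    """Call start_mission with a brand-new spec on every attempt.

    Both callables are hypothetical and supplied by the caller;
    start_mission is assumed to raise RuntimeError on failure.
    Returns the number of the attempt that succeeded.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        spec = make_record_spec()      # fresh spec, so a fresh temp folder
        try:
            start_mission(spec)
            return attempt
        except RuntimeError as err:
            last_error = err
    raise RuntimeError("mission did not start") from last_error
```

The point of the design is that the spec is constructed inside the loop, so no two attempts can share a temp folder.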

So far I can't reproduce the "Call to write failed" error. Still looking...

DaveyBiggers · 2016-08-08T12:35:42Z

The "Call to write failed" message comes from PosixFrameWriter::doWrite, so this appears to be something to do with attempting to write frames after the ffmpeg pipe has been closed. Not currently sure what effect this has on anything - frames arriving after the mission end are to be expected, and the system ought to be resilient to them...
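
As an illustration of what "resilient to late frames" could mean, here is a rough sketch of a writer that counts and discards frames once its sink is closed instead of failing. `TolerantFrameWriter` is a hypothetical class for this example only; it is not how PosixFrameWriter is implemented.

```python
import io


class TolerantFrameWriter:
    """Hypothetical writer that drops frames once its sink has been closed."""

    def __init__(self, sink):
        self.sink = sink
        self.dropped = 0

    def write(self, frame: bytes) -> bool:
        if self.sink.closed:
            self.dropped += 1   # late frame: count it and carry on
            return False
        try:
            self.sink.write(frame)
            return True
        except (BrokenPipeError, ValueError):
            # The reader went away, or the sink closed between check and write.
            self.dropped += 1
            return False

    def close(self):
        self.sink.close()
```

With an `io.BytesIO()` sink, a write before `close()` succeeds and a write after it is counted in `dropped` rather than raising.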

DaveyBiggers · 2016-08-09T10:07:09Z

Building on the above scenario, where we end up using a temp folder that no longer exists:

On Windows, the ffmpeg process and the pipe are both created with no errors, even though ffmpeg has been told to write to a non-existent file.
The mission starts running, and frames start being pumped to ffmpeg.
Eventually ffmpeg's buffer fills and it attempts to write to file. At this point it fails, and the process ends.
The pipe seems to stay active, but there is now nothing taking data out of the other end.
WindowsFrameWriter::doWrite calls WriteFile, but the pipe is now full. WriteFile helpfully waits until the pipe clears, which it never will since ffmpeg has fled in betrayal - so doWrite never returns.
Meanwhile, the mission ends, and a MissionEnded message is sent to the AgentHost.
AgentHost::onMissionControlMessage is called, and acquires the world_state_mutex.
onMissionControlMessage then calls AgentHost::close.
AgentHost::close calls VideoServer::stopRecording...
...which calls WindowsFrameWriter::close...
...which calls VideoFrameWriter::close...
...which calls join on the frame writer thread...
but the frame writer thread is still stuck waiting for doWrite to return.
So AgentHost::onMissionControlMessage never returns - and never releases the world_state_mutex.
Meanwhile the agent still wants to know what is going on, so calls getWorldState...
...which waits for the world_state_mutex.

And then everything is stuck.
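
The shape of that hang can be shown with a small threading model. This is a simplified sketch, not the AgentHost code: the blocked `WriteFile` call is modelled by waiting on an event that is never set, and timeouts are added purely so the demonstration terminates (the real join and the real mutex wait have none). All names inside `demo` are illustrative.

```python
import threading


def demo(join_timeout=0.5, poll_timeout=0.1):
    world_state_mutex = threading.Lock()
    pipe_drained = threading.Event()   # never set while the demo runs
    holding_mutex = threading.Event()
    outcome = {}

    def frame_writer():
        # Models doWrite blocked on a full pipe with no reader.
        pipe_drained.wait()

    def on_mission_ended():
        # Models onMissionControlMessage -> close -> ... -> join,
        # all while holding the mutex.
        with world_state_mutex:
            holding_mutex.set()
            writer.join(join_timeout)
            outcome["writer_stuck"] = writer.is_alive()

    writer = threading.Thread(target=frame_writer, daemon=True)
    handler = threading.Thread(target=on_mission_ended)
    writer.start()
    handler.start()

    # Models the agent calling getWorldState while the handler holds the mutex.
    holding_mutex.wait()
    got_state = world_state_mutex.acquire(timeout=poll_timeout)
    if got_state:
        world_state_mutex.release()

    handler.join()
    pipe_drained.set()   # clean up: let the writer thread exit
    writer.join()
    return outcome["writer_stuck"], got_state
```

`demo()` reports that the writer thread was still alive when the join gave up, and that the poll for world state could not get the mutex in the meantime. Remove the timeouts and neither call would ever return.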

DaveyBiggers · 2016-08-09T14:50:53Z

I have changed the way things work. The MissionRecordSpec is now just that: a spec. It's now safe to reuse as often as you like. The temp folder is now created and owned by the MissionRecord object, so there will always be a new folder for each call to startMission. This should fix many of the problems we've been having.
The API for getting the location of the temp folder has been moved to the AgentHost, since the MissionRecordSpec no longer knows anything about it.
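
A toy model of the new ownership, for comparison with the earlier failure scenario. `RecordSpec` and `Record` are illustrative names rather than the real classes, and `destination` is a hypothetical setting standing in for whatever the spec carries.

```python
import os
import shutil
import tempfile


class RecordSpec:
    """Toy spec: plain settings only, no folder, safe to reuse."""

    def __init__(self, destination):
        self.destination = destination


class Record:
    """Toy record: creates and owns its own temp folder."""

    def __init__(self, spec):
        self.spec = spec
        self.temp_dir = tempfile.mkdtemp(prefix="mission_record_")

    def close(self):
        # Real code would archive to spec.destination before deleting.
        shutil.rmtree(self.temp_dir, ignore_errors=True)
```

Two records built from the same spec now get different folders, so closing the first one leaves the second one's folder untouched.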

DaveyBiggers self-assigned this Aug 5, 2016

DaveyBiggers added this to the Dolphin milestone Aug 5, 2016

DaveyBiggers added the P1 label Aug 5, 2016

DaveyBiggers mentioned this issue Aug 9, 2016

Client sends MALMOOK, but mission never starts #236

Open

DaveyBiggers mentioned this issue Aug 9, 2016

Off the record #270

Merged

DaveyBiggers closed this as completed in #270 Aug 10, 2016
