Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mission record is not properly cleaned up (when missions fail) #256

Closed
DaveyBiggers opened this issue Aug 4, 2016 · 4 comments · Fixed by #270
Closed

mission record is not properly cleaned up (when missions fail) #256

DaveyBiggers opened this issue Aug 4, 2016 · 4 comments · Fixed by #270
Assignees
Labels
Milestone

Comments

@DaveyBiggers
Copy link
Member

Example in ALE:
Exception in closing of MissionRecord: boost::filesystem::remove: Directory not empty: "./mission_records/c8ba55fd-cf22-49e5-b20f-82e5f7677d66"
Similarly, in Malmo, mission records are not properly destroyed when a mission fails ("Call to write failed").

@DaveyBiggers DaveyBiggers self-assigned this Aug 5, 2016
@DaveyBiggers DaveyBiggers added this to the Dolphin milestone Aug 5, 2016
@DaveyBiggers DaveyBiggers added the P1 label Aug 5, 2016
@DaveyBiggers
Copy link
Member Author

Here's one scenario I've managed to reproduce:

  • Create a MissionRecordSpec. This creates a temp directory and chooses path names.
  • Call AgentHost.startMission, passing the MissionRecordSpec. This creates a new MissionRecord object from the MissionRecordSpec, which takes its paths.
  • Mission fails to start
  • Rety: call AgentHost.startMission passing the same MissionRecordSpec as before. This creates a new MissionRecord object but with the same temp folder as before
  • The AgentHost's old MissionRecord object gets destroyed as it creates the new one.
  • The MissionRecord destructor calls close(), which tars and zips everything up, and deletes the temp folder
  • The mission now runs with the new MissionRecord object
  • The mission ends, the new MissionRecord object closes, and attempts to zip up a temp folder that no longer exists, resulting in the error "Attempt to write to non-existent directory"

The work-around is to make sure there is a fresh MissionRecordSpec object for every call to startMission - our samples are wrong in this regard.

So far I can't reproduce the "Call to write failed" error. Still looking...

@DaveyBiggers
Copy link
Member Author

The "Call to write failed" message comes from PosixFrameWriter::doWrite, so this appears to be something to do with attempting to write frames after the ffmpeg pipe has been closed. Not currently sure what effect this has on anything - frames arriving after the mission end are to be expected, and the system ought to be resilient to them...

@DaveyBiggers
Copy link
Member Author

Building on the above scenario, where we end up using a temp folder that no longer exists:

  • On windows, the ffmpeg process and the pipe are both created with no errors, even though ffmpeg has been told to write to a non-existent file.
  • The mission will start running, frames will start being pumped to ffmpeg
  • Eventually ffmpeg's buffer fills and it attempts to write to file. At this point it fails, and the process ends.
  • The pipe seems to stay active, but there is now nothing taking data out of the other end.
  • WindowsFrameWriter::doWrite calls WriteFile, but the pipe is now full. WriteFile helpfully waits until the pipe clears, which it never will since ffmpeg has fled in betrayal - so doWrite never returns.
  • Meanwhile, the mission ends, and a MissionEnded message is sent to the AgentHost.
  • AgentHost::onMissionControlMessage is called, and acquires the world_state_mutex
  • onMissionControlMessage then calls AgentHost::close
  • AgentHost::close calls VideoServer::stopRecording...
  • ...which calls WindowsFrameWriter::close...
  • ...which calls VideoFrameWriter::close...
  • ...which calls join on the frame writer thread...
  • but the frame writer thread is still stuck waiting for doWrite to return.
  • So AgentHost::onMissionControlMessage never returns - and never releases the world_state_mutex.
  • Meanwhile the agent still wants to know what is going on, so calls getWorldState...
  • ...which waits for the world_state_mutex.

And then everything is stuck.

@DaveyBiggers
Copy link
Member Author

Have changed the way things work. The MissionRecordSpec is now just that - a spec. It's now safe to reuse as often as you like. The temp folder is now created and owned by the MissionRecord object, so there will always be a new folder for each call to StartMission. This should fix many of the problems we've been having.
The API for getting the location of the temp folder has been moved to the AgentHost, since the MissionRecordSpec no longer knows anything about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant