podman-machine-default.json gets corrupted on improper Windows shutdown #18011

Closed
nefarius opened this issue Apr 2, 2023 · 12 comments · Fixed by containers/common#1397 or #18035
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. machine remote Problem is in podman-remote windows issue/bug on Windows

Comments

@nefarius

nefarius commented Apr 2, 2023

Issue Description

The file C:\Users\<username>\.config\containers\podman\machine\wsl\podman-machine-default.json gets corrupted (essentially filled with NUL bytes) after an improper Windows shutdown, such as a BSOD. I had about three of them while debugging a driver project, and the JSON file was corrupted every time, even though I had not run any podman command or changed anything in Podman Desktop during the session.

Steps to reproduce the issue

  1. Set up podman on Windows as per the guide
  2. Crash the system (e.g. using NotMyFault)
  3. After a reboot, podman no longer works because podman-machine-default.json is corrupted

Describe the results you received

Podman is basically completely dead until it is re-initialized.
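
To confirm that a config file shows the corruption described above, a small check along these lines may help. This is a sketch only: `classifyConfig` is a hypothetical helper, not part of podman, and it merely distinguishes an empty file, a file consisting entirely of NUL bytes, and otherwise invalid JSON.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"os"
)

// classifyConfig is a hypothetical helper (not podman code) that reports
// what kind of damage, if any, a JSON config file appears to have.
func classifyConfig(data []byte) string {
	switch {
	case len(data) == 0:
		return "empty"
	case len(bytes.Trim(data, "\x00")) == 0:
		// Every byte is NUL: the symptom reported in this issue.
		return "all NUL bytes"
	case !json.Valid(data):
		return "invalid JSON"
	default:
		return "ok"
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: checkconfig <path-to-json>")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Println(classifyConfig(data))
}
```

Pass the path of the machine config file as the only argument; the program prints one of the four classifications.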

Describe the results you expected

I'd like to understand how a file that is not actively written to during the crash can get mangled so badly.

podman info output

host:
  arch: amd64
  buildahVersion: 1.29.0
  cgroupControllers:
  - cpuset
  - cpu
  - cpuacct
  - blkio
  - memory
  - devices
  - freezer
  - net_cls
  - perf_event
  - net_prio
  - hugetlb
  - pids
  - rdma
  cgroupManager: cgroupfs
  cgroupVersion: v1
  conmon:
    package: conmon-2.1.5-1.fc36.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.5, commit: '
  cpuUtilization:
    idlePercent: 99.91
    systemPercent: 0.07
    userPercent: 0.02
  cpus: 16
  distribution:
    distribution: fedora
    variant: container
    version: "36"
  eventLogger: journald
  hostname: BLYAT-PC
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 5.10.102.1-microsoft-standard-WSL2
  linkmode: dynamic
  logDriver: journald
  memFree: 53134725120
  memTotal: 53831020544
  networkBackend: netavark
  ociRuntime:
    name: crun
    package: crun-1.8.1-1.fc36.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.1
      commit: f8a096be060b22ccd3d5f3ebe44108517fbf6c30
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +WASM:wasmedge +YAJL
  os: linux
  remoteSocket:
    exists: true
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: true
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.0-0.2.beta.0.fc36.x86_64
    version: |-
      slirp4netns version 1.2.0-beta.0
      commit: 477db14a24ff1a3de3a705e51ca2c4c1fe3dda64
      libslirp: 4.6.1
      SLIRP_CONFIG_VERSION_MAX: 3
      libseccomp: 2.5.3
  swapFree: 13958643712
  swapTotal: 13958643712
  uptime: 0h 28m 16.00s
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  volume:
  - local
registries:
  search:
  - docker.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 269490393088
  graphRootUsed: 815177728
  graphStatus:
    Backing Filesystem: extfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "true"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.4.1
  Built: 1676629882
  BuiltTime: Fri Feb 17 11:31:22 2023
  GitCommit: ""
  GoVersion: go1.18.10
  Os: linux
  OsArch: linux/amd64
  Version: 4.4.1

Podman in a container

No

Privileged Or Rootless

Privileged

Upstream Latest Release

Yes

@nefarius nefarius added the kind/bug Categorizes issue or PR as related to a bug. label Apr 2, 2023
@github-actions github-actions bot added the remote Problem is in podman-remote label Apr 2, 2023
@vrothberg
Member

Thanks for reaching out, @nefarius!

@ashley-cui @baude PTAL

@Luap99 Luap99 added machine windows issue/bug on Windows labels Apr 3, 2023
@Luap99
Member

Luap99 commented Apr 3, 2023

@n1hility ideas?

@vrothberg
Member

The WSL backend writes the file atomically (i.e., into a tmp file and then does a rename). The QEMU backend does not and writes directly (should be fixed).

@nefarius which backend are you using?

@Luap99
Member

Luap99 commented Apr 3, 2023

@vrothberg Windows only supports the WSL backend at the moment.

@nefarius
Author

nefarius commented Apr 3, 2023

Using WSL2.

@vrothberg
Member

vrothberg commented Apr 3, 2023

Ah, "Even within the same directory, on non-Unix platforms Rename is not an atomic operation" (see os.Rename docs).

@vrothberg
Member

@n1hility, is it even possible to atomically-rename a file on Windows? golang/go#22397 (comment)

@vrothberg
Member

vrothberg commented Apr 3, 2023

OK, Windows has a syscall for it that is being used in https://github.com/natefinch/atomic

@n1hility
Member

n1hility commented Apr 3, 2023

@vrothberg Go has used MOVEFILE_REPLACE_EXISTING since Go 1.5: https://github.com/golang/go/blob/master/src/internal/syscall/windows/syscall_windows.go#L317

(same mechanism as in that library)

If the underlying file system is NTFS and the rename stays on the same volume, it's a metadata write, which is atomic.

However, it's not atomic/transacted relative to the previous tmp write. Looking at it, we aren't flushing there, so that's a problem; I will do a patch there.

@nefarius
Author

nefarius commented Apr 3, 2023

It just happened again 🤣 but hey, good news, I found the faulty driver in the meantime; it's not my own but the GPU drivers causing the crash, yay! 🎉

@n1hility
Member

n1hility commented Apr 3, 2023

Glad to hear it, @nefarius!

Note that the linked PR is only one of two parts. While testing various system failures, I was able to trigger a similar problem with containers.conf, which is written to during init (less common than the VM config updates, which happen on all life cycle operations, but still a concern). The second issue will require a patch to containers/common.

@n1hility
Member

n1hility commented Apr 8, 2023

Not fully fixed yet; one more PR to go.

@n1hility n1hility reopened this Apr 8, 2023
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Aug 27, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 27, 2023
4 participants