Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"mv: replace 'blah', overriding mode 0007?" hidden prompt seemingly hangs opam download #4516

Closed
kwshi opened this issue Jan 30, 2021 · 7 comments

Comments

@kwshi
Copy link

kwshi commented Jan 30, 2021

I'll preface this with a concession that I am trying to use opam in a very non-standard setting, and that may be the reason behind this bug. But I wanted to open an issue anyway to see if people more familiar with the internal workings of opam have better insight on why this is happening, or at least what is happening under the hood to help me find a more direct way to reproduce this error.

The issue in question: while trying to build a Podman container containing opam and a few packages, opam appears to hang indefinitely on

Processing  4/4: [dune-configurator.2.8.2: dl]

until I press the Enter key, which promptly causes it to proceed and fail with

[ERROR] The sources of the following couldn't be obtained, aborting:
          - dune-configurator.2.8.2

Error: error building at STEP "RUN opam install -y dune-configurator": error while running runtime: exit status 40

Per advice from #3970, I reran the install command with -vvv and found that the command causing the "hang" was a prompt from mv, saying:

+ /usr/sbin/mv "/home/opam/.opam/download-cache/sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp.part" "/home/opam/.opam/download-cache/sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp"
- /usr/sbin/mv: replace '/home/opam/.opam/download-cache/sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp', overriding mode 0007 (------rwx)?

Indeed, simply pressing Enter causes it to fail, but typing y and pressing Enter actually allows the install to succeed. But clearly this isn't intended behavior, since (1) I ran opam with -y and (2) this confirmation prompt is coming from mv, not opam.

It's worth noting that dune-configurator is not the only package that causes this issue; I've also encountered it, depending on the order in which I install packages, the same issue with menhirLib, menhirSdk, and ocp-index. Installing dune-configurator first just happens to be the easiest way I've found to reproduce this message. That's why I think this issue is not specific to a single package but might have something to do with opam in general.

There are two really strange things about this:

  • There actually is no file named /home/opam/.opam/download-cache-sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp, so it totally baffles me why mv thinks it's "replacing" it.
  • The error may or may not appear, depending on the order in which you install packages. If you first run opam install -y ocp-index, and then run opam install -y dune-configurator, then the prompt/hang/error doesn't appear at all.

I am of the belief that the specific cause of this issue comes from Podman's rootless filesystem implementation. Nevertheless, my capacity to narrow down the exact source and find a clearer way to reproduce this bug is limited by the fact that I have no idea what exactly opam is doing under the hood to possibly induce this error, so I am seeking help here on figuring out exactly what is going on and why this error's appearance seems to vary depending on the order of package installations.

I also filed containers/buildah#2951 describing the same issue (describing observations on the Podman side of things).

opam config report

# opam config report
# opam-version      2.0.7
# self-upgrade      no
# system            arch=x86_64 os=linux os-distribution=void os-version=rolling
# solver            builtin-mccs+glpk
# install-criteria  -removed,-count[version-lag,request],-count[version-lag,changed],-changed
# upgrade-criteria  -removed,-count[version-lag,solution],-new
# jobs              11
# repositories      1 (http) (default repo at 0c15cf29)
# pinned            0
# current-switch    default

For reference, a complete log of the Podman build (together with the opam install -vvvy dune-configurator log) is available here: https://github.com/kwshi/buildah-bug-reports/blob/main/opam-help/fail-log

@dra27
Copy link
Member

dra27 commented Jan 30, 2021

Thank you for the extremely detailed log report! The detail and checking you've done leads very quickly to the problem.

The issue is a race condition, combined with unusual permissions.

There actually is no file named /home/opam/.opam/download-cache-sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp, so it totally baffles me why mv thinks it's "replacing" it.

If I may be so bold as to offer debugging advice! Faced with this message:

replace '/home/opam/.opam/download-cache/sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp', overriding mode 0007 (------rwx)?

There are only two possibilities: /home/opam/.opam/download-cache/sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp does exist, or coreutils mv has a bug... I think it's better to begin assuming that it's the first one!

The issue here is a race condition (the file isn't apparent afterwards because one of the racing processes has renamed it again). When opam downloads files, the downloader is instructed to download to a .tmp.part file (see https://github.com/kwshi/buildah-bug-reports/blob/main/opam-help/fail-log#L62). On completion, this is renamed to .tmp and the checksums are verified. On success, this file has the .tmp extension removed and is formally cached.

dune-configurator and dune (and menhir and menhirSdk - I don't think ocp-index is involved here, other than helping the timing of the race condition) come from the same upstream package. opam is attempting to do the same download (see https://github.com/kwshi/buildah-bug-reports/blob/main/opam-help/fail-log#L62 and https://github.com/kwshi/buildah-bug-reports/blob/main/opam-help/fail-log#L81) and the mv commands are racing (https://github.com/kwshi/buildah-bug-reports/blob/main/opam-help/fail-log#L71 and https://github.com/kwshi/buildah-bug-reports/blob/main/opam-help/fail-log#L88).

You can see the mv effect fairly simply with:

touch foo bar
chmod 6 bar
mv foo bar

I haven't been able to persuade my system to reproduce the issue (but your logs mean I have no doubt that it's real!). opam 2.1 generalises the scheduler so that builds and downloads can take place at the same time - if you're able to experiment with the 2.1 beta4 binary that would be quite handy, although I expect that it may make the race condition even harder to surface.

This issue has always been present, but we wouldn't see this normally because with the expected 664 permissions on the files, mv would just succeed. There are two fixes required to opam:

  1. The mv calls should be mv -f to prevent the risk of prompts. This is easy and should certainly go in 2.1 and 2.0.8 (possibly 2.0.9 at this stage).
  2. When assembling the job list, opam should join the identical jobs together (i.e. dune and dune-configurator should depend on one download job, not each have their own racing ones)

xref #3741

@kwshi
Copy link
Author

kwshi commented Jan 30, 2021

Thanks for the response!

There are only two possibilities: /home/opam/.opam/download-cache/sha256/e2/e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp does exist, or coreutils mv has a bug... I think it's better to begin assuming that it's the first one!

the file isn't apparent afterwards because one of the racing processes has renamed it again

On that note, it's worth mentioning that even when the file isn't apparent afterwards, mv continues to give the same prompt (this is what happens if you launch the container before it runs the opam install -vvvy dune-configurator command, then run it manually from within the shell, then suspend the process and manually cd to the path in question and type the following):

$ ls -la
total 2152
drwxr-xr-x 2 opam opam    4096 Jan 30 16:33 .
drwxr-xr-x 5 opam opam    4096 Jan 30 16:33 ..
-rw-r--r-- 1 opam opam 1097153 Jan 30 16:33 e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e
-rw-r--r-- 1 opam opam 1097153 Jan 30 16:33 e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp.part
$ mv e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp.part e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp
mv: replace 'e2c4e8230f7c96236503fd75f22bdbc263639971bf104509e446855ded35ae1e.tmp', overriding mode 0007 (------rwx)?

I do wonder if that teeters us a little bit closer to the "coreutils mv has a bug" possibility (or, for that matter, the underlying filesystem implementation)...

@kwshi
Copy link
Author

kwshi commented Jan 30, 2021

touch foo bar
chmod 6 bar
mv foo bar

Ah!! Your description, along with that example, is exactly what I needed to find a minimal example reproducing this bug, independent of opam. Here's how it works:

  • Create a file a and give it 006 permissions.
  • Rename a to something else.
  • Try to mv another file to a. This should work, because a has already been renamed away, but mv prompts anyway about replacing the permissions on a.

Here is a sequence of commands implementing the above:

touch a b
chmod 6 a
mv a x
ls -la
mv b a

and output:

$ cd /tmp
$ touch a b
$ chmod 6 a
$ mv a x
$ ls -la
total 8
drwxrwxrwt  2 root root 4096 Jan 30 16:49 .
drwxr-xr-x 17 root root 4096 Jan 30 16:49 ..
-rw-rw-r--  1 test test    0 Jan 30 16:49 b
-------rw-  1 test test    0 Jan 30 16:49 x
$ mv b a
mv: replace 'a', overriding mode 5710 (rws--x--T)?

I have written some containerfiles demonstrating this issue on various base systems. (To be clear, this is not a bug I can reproduce on my host system either--only from within a podman/buildah container. Which is why I'm led to believe there is an error in the underlying filesystem implementation.) What's especially interesting is that, among those examples, Alpine is the only OS not exhibiting the bug, and incidentally Alpine is the only OS not using coreutils mv (instead they use busybox). Hmm...

@AltGr
Copy link
Member

AltGr commented Feb 1, 2021

@rjbou will confirm, but I think that in 2.1 she already implemented that part:

2 When assembling the job list, opam should join the identical jobs together (i.e. dune and dune-configurator should depend on one download job, not each have their own racing ones)

which should avoid the issue.

@dra27
Copy link
Member

dra27 commented Feb 2, 2021

@kwshi - how interesting that a bug in opam surfaced a bug in fuse-overlayfs!

You ought to be seeing a similar message from busybox mv. It turns out that POSIX mandates a message in that situation if stdin is a tty (and the busybox code includes it: https://git.busybox.net/busybox/tree/coreutils/mv.c#n98).

@kwshi
Copy link
Author

kwshi commented Feb 6, 2021

Apart from alpine, what's another distro that uses busybox? Also, perhaps it has to do with differences in how the "already exists" check is done, namely

access(dest, W_OK) < 0

that enables one to be more robust against the bug than the other? (After all, the correct behavior is that it should not think the file already exists and therefore should not give the message.)

@dra27
Copy link
Member

dra27 commented Jun 25, 2021

The multiple download issue still exists, but as this is tracked in another issue I'm closing this one

@dra27 dra27 closed this as completed Jun 25, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants