
Build Hangs for libxml2 with MSYS/MinGW in Windows Docker Container #77

Closed
wtchappell opened this issue Dec 21, 2021 · 40 comments

@wtchappell

I'm noticing an odd hang while trying to build libxml2 with MSYS + MinGW that I'm really struggling to dig into...

I'm using MSYS/MinGW within a Windows Docker container - place both of these files into a directory and build with Windows Docker Desktop using docker build -t test-image . while in said directory.
Dockerfile
mirrorupgrade.hook

Once I'm running that image as a container I'm doing:

cd /tmp
wget http://xmlsoft.org/download/libxml2-2.9.12.tar.gz
tar xf libxml2-2.9.12.tar.gz
cd libxml2-2.9.12
./configure --prefix=/usr/local
make -j4

The build reliably hangs trying to link:

  <snip>
  CCLD     testdso.la
  CCLD     libxml2.la
C:\tools\msys64\mingw64\bin\ar.exe: `u' modifier ignored since `D' is the default (see `U')

Here's output from a build using make V=1 instead in order to show the actual compilation commands:
make_output_libxml2.txt

I suspect the issue is something happening within libtool or is related to that warning message from ar when building non-verbosely - if I build libxml2 with CMake instead, everything works out fine. I've browsed the patches at https://github.com/msys2/MINGW-packages/tree/master/mingw-w64-libxml2, but none of them seem relevant to this failure mode - and I've seen a very similar failure mode happen in the same container while trying to build sqlite3 as well.

I'll note that building other Autotools packages on the same image - notably libffi, libyaml, and openssl seem to work fine.

Any ideas on what could be going on? I'm not sure how to proceed on debugging this further.

@mmuetzel
Contributor

Are you sure linking hangs indefinitely for you? Linking can take pretty long on MinGW, especially if you are low on RAM and the OS starts swapping to disk...

@wtchappell
Author

Are you sure linking hangs indefinitely for you? Linking can take pretty long on MinGW, especially if you are low on RAM and the OS starts swapping to disk...

libxml2 normally builds in just a minute or so on the host using a non-MSYS build of the MinGW toolchain, and building it on the same image with CMake and the MSYS toolchain takes similarly little time. I've let the build process described above run overnight without seeing any progress.

Even if it were to succeed after more than 12 hours, I'd say something is still not working quite right when building the same source with a different build system is several orders of magnitude faster.

@maxnasonov

I'm sorry if my issue is not relevant to the topic, but it seems to me that it might have the same root cause: we've started experiencing very similar hangs, at least for libapr builds, after updating our MSYS2 build environment a week ago. It hangs on one of the libtool executions; here is the part of the script where the hang happens:

+ eval func_win32_libid '"$potlib"'
++ func_win32_libid C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a
++ :
++ win32_libid_type=unknown
+ /usr/bin/sed -e 10q
+++ file -L C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a
+ /usr/bin/grep -E '^x86 archive import|^x86 DLL'
++ win32_fileres='C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a: current ar archive'
++ case $win32_fileres in
++ eval objdump -f C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a
+++ objdump -f C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a
++ /usr/bin/sed -e 10q
++ /usr/bin/grep -E 'file format (pei*-i386(.*architecture: i386)?|pe-arm-wince|pe-x86-64|coff-arm|coff-arm64|coff-i386|coff-x86-64)'
++ case $nm_interface in
++ func_to_tool_file C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a func_convert_file_msys_to_w32
++ :
++ case ,$2, in
++ func_to_tool_file_result=C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a
+++ eval /mingw64/bin/nm -B -f posix -A '"C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a"'
++++ /mingw64/bin/nm -B -f posix -A C:/tools/msys64/mingw64/x86_64-w64-mingw32/lib/libshell32.a
+++ /usr/bin/sed -n -e '
	    1,100{
		/ I /{
		    s|.*|import|
		    p
		    q
		}
	    }'

But for some reason this only happens for builds triggered by Jenkins - manual executions work fine. The builds are run directly on Windows VMs, without Docker. The previous build environment built from scratch on December 3rd works with Jenkins without problems.

@jeremyd2019
Member

There seem to be some pipe deadlocks going on in Cygwin. Their pipe code has been getting an overhaul lately; I'm not sure whether all of the current fixes are in 3.3.3 or whether there have been more since then.

@wtchappell
Author

Is there any way I could help track down if this is indeed related to pipe deadlocks in the runtime?

@jeremyd2019
Member

If you install msys2-runtime-devel you'll have debug symbols for the runtime. With msys2 gdb you could attach to a hung process (use ps -e to get its pid and run gdb --pid <pid>) and run thread apply all bt to see where it's stuck.
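The debugging workflow above, spelled out as commands (a sketch only; it assumes an MSYS2 shell, and `<pid>` is a placeholder for the hung process's PID):

```shell
# Debug symbols for the runtime, plus the debugger itself.
pacman -S --needed msys2-runtime-devel gdb

# List processes to find the PID of the stuck one (make, sh, nm, ...).
ps -e

# Attach to it and dump backtraces for every thread, then detach.
gdb --pid <pid> --batch -ex "thread apply all bt"
```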

@wtchappell
Author

Here's the pstree image during the hang from using make without any -j flag - uploading as an image because it doesn't paste very well as text in GitHub:
image

And then these are the results of running gdb against each of the processes descended from the initial make command:
3606-make.txt
3609-make.txt
3610-sh.txt
3625-sh.txt
3626-make.txt
4510-sh.txt
4671-sh.txt
4672-sed.txt
4674-grep.txt
4680-sh.txt
4681-sh.txt
4683-nm.txt

The leaf processes have threads with either ntdll!ZwWaitForMultipleObjects or ntdll!ZwWaitForSingleObject at the top of their stacks.

@jeremyd2019
Member

Well, a cursory look at the last 5 files shows sed and grep apparently blocked on reads from pipes. nm, being native, doesn't show much but appears to be blocked on a write. The intervening sh processes are just waiting on child processes.

@wtchappell
Author

That does sound like it's related to the changes in msys2-runtime you alluded to earlier, right? I've never really looked at the Cygwin code before - it definitely reads like a pretty specialized codebase.

I might build an msys2-runtime from master and see if the issue still happens.

@jeremyd2019
Member

There is no real master for msys2-runtime; each time a Cygwin release comes out, the MSYS2-specific patches are rebased on top of that release. You could maybe rebase the MSYS2-specific patches on top of cygwin/master, I guess.

@jeremyd2019 transferred this issue from msys2/MINGW-packages Dec 25, 2021
@jeremyd2019
Member

/cc @tyan0 any thoughts on this?

@wtchappell
Author

There is no real master for msys2-runtime; each time a Cygwin release comes out, the MSYS2-specific patches are rebased on top of that release. You could maybe rebase the MSYS2-specific patches on top of cygwin/master, I guess.

I'll take a look, and I'll give that a shot if there's been any work on pipe handling since the last msys2-runtime release.

@jeremyd2019
Member

jeremyd2019 commented Dec 31, 2021

I don't know if this is related or not, but I just saw some hangs with g-ir-scanner calling bash -c 'cmd //c echo something'. Not only is that a horribly convoluted way to try to get a string, it seems to reliably hang.

/clangarm64/lib/gobject-introspection/giscanner/utils.py: p = subprocess.Popen([shell, '-c', 'cmd //C echo ' + arg], stdout=subprocess.PIPE)

@jeremyd2019
Member

No, I think this may be arm64 specific.

@jeremyd2019
Member

The g-ir-scanner thing seems to have gotten better after a reboot.

@Marc-Aldorasi-Imprivata

I am having the same issue when building mingw. I managed to reduce the issue to the following shell script:

#!/bin/sh
seq 1 99999 > big_file
eval '$(eval cmd.exe //c "type big_file" | : )'

When running as a normal user this completes immediately, but when run as a system service it hangs forever. The issue appears to be that when running under the SYSTEM account, the third sh process holds open the read end of the pipe. Since cmd has too much output to write all at once, it waits until the pipe's buffer has room to write more, but since sh isn't actually reading from the pipe, this hangs forever. When running as a normal user the read end of the pipe is not kept open, and so cmd.exe gets an error when attempting to write and exits immediately.

My suspicion is that this is caused by f79a461 (which keeps the read end of the pipe open) and b531d6b (which changes the behavior depending on whether or not the program is running as a service).
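For what it's worth, the hang mechanism described above can be demonstrated in a bounded way, using `timeout` so the demo itself can't hang (a sketch of the generic pipe behavior only, assuming GNU coreutils `timeout`; not the Cygwin-internal code path):

```shell
seq 1 999999 > big_file            # more data than one pipe buffer holds

# Read end held open but never read: the writer fills the pipe buffer,
# then blocks forever; timeout has to kill it (exit code 124).
timeout 2 sh -c 'cat big_file | { sleep 10; }'
blocked_rc=$?

# Read end closed immediately: the writer takes SIGPIPE and exits at
# once, so the pipeline finishes well within the timeout.
timeout 2 sh -c 'cat big_file | :'
closed_rc=$?

echo "blocked_rc=$blocked_rc closed_rc=$closed_rc"
rm -f big_file
```

The first case models the SYSTEM/service behavior (a process holding the read end open without reading); the second models the normal-user behavior where the writer errors out and exits.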

@jeremyd2019
Member

Thanks for the detailed investigation! That could explain why I've never seen this (I've never managed to find the time to learn how to set up and use Docker on Windows; I was thinking about trying to leverage GHA to test it...). I think the next step would be to verify that this dupes on upstream Cygwin (I expect it would), and report it there.

I wonder why they avoided the better code if running as SYSTEM... I could see service accounts being sandboxed so they wouldn't have access, but SYSTEM should have all the rights...

@jeremyd2019
Member

jeremyd2019 commented Mar 21, 2022

Dupes on upstream cygwin. I've sent a report to them: https://cygwin.com/pipermail/cygwin/2022-March/251097.html

@wtchappell
Author

Interesting. I'm running things as a regular user, but I wonder if the fact that Docker is also involved is triggering the same SYSTEM account behavior somewhere in the chain.

@jeremyd2019
Member

Does that test case hang in a Docker container? Can you run (Windows) whoami /all in a Docker container?

@jeremyd2019
Member

https://cygwin.com/pipermail/cygwin/2022-March/251100.html

Question is: Does the docker invoke the command using SYSTEM
account? Or is the processes in docker determined as running
as a service?

I confirmed the processes in Windows docker are running as
well_known_service_sid. Let me consider a while.

And a proposed patch:
https://cygwin.com/pipermail/cygwin-patches/2022q1/011856.html

@jeremyd2019
Member

If it would be helpful, I can open a PR with that patch applied, so that a binary for testing will be available in the CI artifacts.

@jeremyd2019
Member

#88 has the proposed patch applied

@jeremyd2019
Member

As requested upstream, a repo with a Github action that reproduces the hang (despite the proposed patch applied in #88):
https://github.com/jeremyd2019/msys2-pipe-hang-test

@jeremyd2019
Member

The latest proposed patch (https://cygwin.com/pipermail/cygwin-patches/2022q1/011859.html) was applied to #88. This worked as expected in my test action.

@pananton (or anyone else who experiences this issue): please test with the current msys-2.0.dll from the artifacts of #88 (this would be https://github.com/msys2/msys2-runtime/suites/5802411523/artifacts/193963886 assuming the URL is stable)

@pananton

pananton commented Mar 26, 2022

I can confirm that my problem is fixed with this patch.

aharpervc added a commit to veracross/tiny_tds that referenced this issue Mar 28, 2022
@jeremyd2019
Member

A patch for this issue has landed upstream: e9c96f0.

@Pro

Pro commented Apr 12, 2022

I ran into the exact same issue as @pananton when trying to build a package with conan in a Windows Docker Gitlab CI container.

In my case I have issues with the m4 package:
https://github.com/conan-io/conan-center-index/tree/master/recipes/m4/all

I already tried the msys-2.0.dll provided in the previous comments, but that leads to a new problem: the configure step no longer gets stuck, but it now produces an error when calling make.

make  all-am
make[3]: Entering directory '/c/.conan/ec4526/1/lib'
  CC       asyncsafe-spin.obj
  CC       openat-proc.obj
openat-proc.c
asyncsafe-spin.c
  CC       gl_avltree_oset.obj
  CC       basename-lgpl.obj
gl_avltree_oset.c
basename-lgpl.c
  CC       binary-io.obj
  CC       bitrotate.obj
/c/.conan/ec4526/1/source_subfolder/build-aux/depcomp: line 526: : No such file or directory
grep: : No such file or directory
make[3]: *** [Makefile:2877: bitrotate.obj] Error 1
make[3]: *** Waiting for unfinished jobs....
binary-io.c
make[3]: Leaving directory '/c/.conan/ec4526/1/lib'
make[2]: *** [Makefile:2481: all] Error 2
make[2]: Leaving directory '/c/.conan/ec4526/1/lib'
make[1]: *** [Makefile:2018: all-recursive] Error 1
make[1]: Leaving directory '/c/.conan/ec4526/1'
make: *** [Makefile:1974: all] Error 2
m4/1.4.19: 
m4/1.4.19: ERROR: Package '01edd76db8e16db9b38c3cca44ec466a9444c388' build failed

Locally on a Windows PC the same m4 conan recipe with the same msys2 conan package (cci.latest) works without any issue.

The version of this msys2 conan package is this one: http://repo.msys2.org/distrib/x86_64/msys2-base-x86_64-20220118.tar.xz

My interpretation of the error message above is that the file bitrotate.obj does not exist. Maybe it's also related to some broken piping, and the expected content is just not piped to that file?

If I restart the job, which file fails is not deterministic, but it is always one of the first files it tries to compile.

I will now try to use an older msys2 conan package, as mentioned by @pananton here:
msys2/MSYS2-packages#2893 (comment)

@pananton

pananton commented Apr 12, 2022

@Pro I can confirm that I also had problems building m4, but I used libiconv as my example. And sadly, simply replacing msys-2.0.dll did not help. As you've mentioned, the build no longer gets stuck with the new version, but it then fails during make. Actually, at least for libiconv, it fails from time to time but sometimes succeeds - that's why I mistakenly commented earlier that the problem was solved. It's pretty annoying that the m4 conan recipe build fails, because it is used for building some other recipes.

@jeremyd2019
Member

If there's some other issue besides the hang, I recommend opening a new issue here with details/standalone steps to reproduce like in this issue.

grep: : No such file or directory

My interpretation of the error message above is, that the file bitrotate.obj does not exist.

My interpretation of the message above is that something is trying to access an empty filename:

$ grep foo ''
grep: : No such file or directory

@Pro

Pro commented Apr 12, 2022

My interpretation of the message above is, that something is trying to access an empty filename:

Jep, but this output is misleading. It tricked me too. After looking into the depcomp script, you will see the following lines:

https://git.savannah.gnu.org/cgit/gnulib.git/tree/build-aux/depcomp#n525

  if test "$libtool" = yes; then
    showIncludes=-Wc,-showIncludes
  else
    showIncludes=-showIncludes
  fi
  "$@" $showIncludes > "$tmpdepfile"
  stat=$?
  grep -v '^Note: including file: ' "$tmpdepfile"

And the "$@" is a call to cl.exe with some additional parameters. This cl.exe is somehow failing, so nothing is produced and redirected to the file; the subsequent grep then fails.

Nonetheless, thanks for your hint! I was able to solve my issue:


I was finally able to build the m4 package on a Windows Docker runner via Gitlab CI.

There are two things which were the final solution:

  1. As mentioned in this issue, there is a problem with the current msys2-runtime package. I modified the conan recipe of the msys2/cci.latest package to downgrade msys2-runtime to 3.2.0 using pacman (patch below).

  2. The second problem (i.e., this make error) was likely caused by an additional installation of the mingw package inside the Docker container: the image has choco install mingw -y --version 11.2.0.07112021 preinstalled.
    After I removed this line and rebuilt the Windows Docker image, the build for m4 ran through.

---
 recipes/msys2/all/conanfile.py | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/recipes/msys2/all/conanfile.py b/recipes/msys2/all/conanfile.py
index 797bec9..440a3cf 100644
--- a/recipes/conan-center/msys2/all/conanfile.py
+++ b/recipes/conan-center/msys2/all/conanfile.py
@@ -74,6 +74,8 @@ class MSYS2Conan(ConanFile):
                 self._kill_pacman()
                 self.run('bash -l -c "pacman --debug --noconfirm --ask 20 -Syuu"')  # Normal update
                 self._kill_pacman()
+                self.run('bash -l -c "pacman --debug --noconfirm --ask 20 -U https://repo.msys2.org/msys/x86_64/msys2-runtime-3.2.0-8-x86_64.pkg.tar.zst https://repo.msys2.org/msys/x86_64/msys2-runtime-devel-3.2.0-8-x86_64.pkg.tar.zst"')
+                self._kill_pacman()
                 self.run('bash -l -c "pacman --debug -Rc dash --noconfirm"')
             except ConanException:
                 self.run('bash -l -c "cat /var/log/pacman.log || echo nolog"')
@@ -179,6 +181,6 @@ class MSYS2Conan(ConanFile):
 
         self.output.info("Appending PATH env var with : " + msys_bin)
         self.env_info.path.append(msys_bin)
-        
+
         self.conf_info["tools.microsoft.bash:subsystem"] = "msys2"
         self.conf_info["tools.microsoft.bash:path"] = os.path.join(msys_bin, "bash.exe")
-- 
2.17.1

@jeremyd2019
Member

It should only fail like that if the $tmpdepfile variable is empty.

@Pro

Pro commented Apr 12, 2022

It should only fail like that if the $tmpdepfile variable is empty.

Ah, correct!
Just checked the script again, tmpdepfile is set here:

https://git.savannah.gnu.org/cgit/gnulib.git/tree/build-aux/depcomp#n126

depfile=${depfile-`echo "$object" |
  sed 's|[^\\/]*$|'${DEPDIR-.deps}'/&|;s|\.\([^.]*\)$|.P\1|;s|Pobj$|Po|'`}
tmpdepfile=${tmpdepfile-`echo "$depfile" | sed 's/\.\([^.]*\)$/.T\1/'`}

Not sure why, but I guess one of these commands (maybe sed) was in conflict with the pre-installed mingw package from Chocolatey and did not use the one from msys2.

I can only guess, but now it's solved for me, at least when downgrading msys2-runtime and making sure there is no mingw installed.
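For reference, the quoted depcomp logic can be traced by hand for one of the failing objects (a hypothetical walk-through, assuming `object=bitrotate.obj` and `DEPDIR` left at its `.deps` default; the real script's `${depfile-...}` defaulting is simplified away here):

```shell
object=bitrotate.obj

# Same substitutions as the quoted depcomp lines, with DEPDIR=.deps:
#   1. prefix the basename with .deps/
#   2. turn the last extension into .P<ext>
#   3. collapse the MSVC-style Pobj suffix to Po
depfile=$(echo "$object" | sed 's|[^\\/]*$|.deps/&|;s|\.\([^.]*\)$|.P\1|;s|Pobj$|Po|')
tmpdepfile=$(echo "$depfile" | sed 's/\.\([^.]*\)$/.T\1/')

echo "$depfile"     # .deps/bitrotate.Po
echo "$tmpdepfile"  # .deps/bitrotate.TPo
```

So a healthy run should compute non-empty names; an empty `$tmpdepfile` points at one of these sed/echo invocations going wrong, consistent with the mingw-from-Chocolatey conflict theory.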

@tyan0
Contributor

tyan0 commented Apr 13, 2022

@Pro Does the msys-2.0.dll you used have the latest patch e9c96f0 applied? Where did you download the countermeasure version of msys-2.0.dll from? The binaries at https://github.com/msys2/msys2-runtime/suites/5802411523/artifacts/193963886 seem to have an older version of the patch applied (perhaps v3). The latest patch is v6.

@jeremyd2019
Member

It doesn't answer your question as to which was used, but I have kept #88 up to date with the iterations (it should currently be sitting with the committed version cherry-picked). I was thinking about trying to cherry-pick this and maybe some of the other console patches currently on the cygwin-3_3-branch, since there are some important fixes there, but it sounded like we were going to wait for a cygwin release and rebase onto that instead.

@tyan0
Contributor

tyan0 commented Apr 13, 2022

It doesn't answer your question as to which was used, but I have kept #88 up to date with the iterations (it should currently be sitting with the committed version cherry-picked).

How can I know the URL of the latest artifact of #88?

@jeremyd2019
Member

How can I know the URL of the latest artifact of #88?

Go to the 'Checks' tab, hit the 'Artifacts' dropdown near the top right, and 'install' is the only artifact. (that's https://github.com/msys2/msys2-runtime/suites/5900014396/artifacts/200749613)

@tyan0
Contributor

tyan0 commented Apr 13, 2022

@jeremyd2019 I can confirm the latest artifact seems to have the v6 patch applied. Thanks.

@jeremyd2019
Member

This should be fixed now in 3.3.5
