windows-2022 lose network connection when using wsl1 (regression from windows-2019) #5151

Closed
1 of 7 tasks
ssbarnea opened this issue Feb 28, 2022 · 13 comments

Comments

@ssbarnea

ssbarnea commented Feb 28, 2022

Description

We were able to identify that certain commands run inside WSL can crash the windows-2022 runners, leaving us with no way to debug the crash.

It seems to always happen with windows-2022 and never happened with windows-2019.

I spent a few DAYS trying to narrow down the issue that causes the windows-2022 runner to stop responding, and I started to believe that it is a crash caused by some instability.

I even recently had another run get stuck at https://github.com/ssbarnea/bug-gha-windows-2022/runs/5362695105?check_suite_focus=true, in a task that normally works without problems.

Please note that I am not the only engineer that faces these problems and I know at least two others that reported similar issues.

Virtual environments affected

  • [ ] Ubuntu 18.04
  • [ ] Ubuntu 20.04
  • [ ] macOS 10.15
  • [ ] macOS 11
  • [ ] Windows Server 2016
  • [ ] Windows Server 2019
  • [x] Windows Server 2022

Image version and build link

Environment: windows-2022
  Version: 20220220.1
  Included Software: actions/virtual-environments@win22/20220220.1/images/win/Windows2022-Readme.md
  Image Release: actions/virtual-environments@win22%2F20220220.1 (release)

Build link https://github.com/ssbarnea/bug-gha-windows-2022/runs/5362695105?check_suite_focus=true

Is it regression?

YES

Expected behavior

Not a crash.

Actual behavior

The runner stops responding and the job appears stuck at the step that was running until the timeout kicks in.

It should be noted that the step timeout does not work in this case; only the job-level timeout seems to work.

Any attempt to cancel the workflow will not do anything.
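
A minimal sketch of the two timeout levels mentioned above, assuming the standard `timeout-minutes` key at both job and step level. The workflow name, job name, and minute values are placeholders, and `Vampire/setup-wsl@v1` with the `wsl-bash` shell is taken from the workaround posted later in this thread, not from the linked run:

```yaml
name: wsl-timeout-sketch        # hypothetical workflow name
on: [push]

jobs:
  wsl-test:                     # hypothetical job name
    runs-on: windows-2022
    timeout-minutes: 15         # job-level limit; the one reported to still take effect
    steps:
      - name: Activate WSL1
        uses: Vampire/setup-wsl@v1

      - name: Command that may hang the runner
        shell: "wsl-bash {0}"
        timeout-minutes: 5      # step-level limit; reported as not firing once the runner stops responding
        run: |
          sudo apt-get update -y
```

With only the job-level limit being honored, keeping it low is the one way to bound how long a stuck run occupies a runner.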

Repro steps

Try https://github.com/ssbarnea/bug-gha-windows-2022/pull/1/files which has a simple workflow that was used as a way to reproduce the problems with minimal amount of code.

Keep in mind: same actions with older windows-2019 seem to be working.
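
The comparison between the two images could be sketched as a matrix job like the one below. This is not the workflow from the linked pull request; the workflow name, job name, timeout, and installed package are assumptions chosen for illustration, and the WSL setup action is the one used elsewhere in this thread:

```yaml
name: compare-windows-images    # hypothetical workflow name
on: [push]

jobs:
  wsl:                          # hypothetical job name
    strategy:
      fail-fast: false          # let each image run to completion independently
      matrix:
        os: [windows-2019, windows-2022]
    runs-on: ${{ matrix.os }}
    timeout-minutes: 20         # placeholder job-level limit
    steps:
      - name: Activate WSL1
        uses: Vampire/setup-wsl@v1

      - name: Same commands on both images
        shell: "wsl-bash {0}"
        run: |
          sudo apt-get update -y
          sudo apt-get install -y --no-install-recommends git
```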

Related thread:

@briantist

I have also spent a lot of time trying to figure this out with windows-2022 runners + WSL and the hard crashes make it impossible to really do any troubleshooting. The debug logs don't have any information, there's nowhere left to go from the outside, so we really need some engineering help here.

@al-cheb
Contributor

al-cheb commented Mar 1, 2022

Hey @ssbarnea,
We will take a look at it.

@miketimofeev
Contributor

@ssbarnea quick update — we were unable to investigate the issue in the last three weeks due to unpredictable circumstances. We will continue the investigation soon. Sorry for the delay!

@ssbarnea
Author

@miketimofeev Thanks for the update. Luckily for us the old windows-2019 is still running. Hopefully you will narrow down the source of these issues.

@al-cheb
Contributor

al-cheb commented Apr 13, 2022

@ssbarnea, I am able to reproduce this issue on a self-hosted agent. After activating WSL and msys2 we get a BSOD (The computer has rebooted from a bugcheck. The bugcheck was: 0x00000139 (0x0000000000000003, 0xfffff6039e476b60, 0xfffff6039e476ab8, 0x0000000000000000). A dump was saved in: C:\Windows\MEMORY.DMP. Report Id: 30a9fc97-44f7-494a-81c7-329d38b6899f.) and the runner stops responding.

Maybe it makes sense to ask the mxschmitt/action-tmate maintainers to add native WSL support without running tmate in msys2: mxschmitt/action-tmate#86

As a workaround you can run tmate in WSLv1 - https://github.com/mxschmitt/action-tmate/blob/master/src/index.js:

    runs-on: windows-2022
    steps:
           
      - name: Activate WSL1
        uses: Vampire/setup-wsl@v1

      - name: Install tools
        shell: "wsl-bash {0}"
        run: |
          sudo apt-get remove -y ansible pipx || true
          sudo apt-get update -y
          sudo apt-get install -y --no-install-recommends -o=Dpkg::Use-Pty=0 git python3-venv python3-pip
          pip3 install ansible-core
          ansible --version
      
      - name: Run tmate
        shell: "wsl-bash {0}"
        run: |
          sudo apt-get -y install tmate
          tmate -S /tmp/tmate.sock new-session -d
          tmate -S /tmp/tmate.sock wait tmate-ready
          while true; do
              tmate -S /tmp/tmate.sock display -p '#{tmate_ssh}'
              tmate -S /tmp/tmate.sock display -p '#{tmate_web}'
              sleep 20
          done

@al-cheb
Contributor

al-cheb commented Apr 13, 2022

I am going to close the thread as an external issue.

@al-cheb al-cheb closed this as completed Apr 13, 2022
@ssbarnea
Author

Let me be clear: we attempted to add tmate because the runner was crashing anyway. Originally we did not have tmate on it because we did not need it.

@al-cheb
Contributor

al-cheb commented Apr 13, 2022

> Let me be clear: we attempted to add tmate because the runner was crashing anyway. Originally we did not have tmate on it because we did not need it.

In that case you should create an issue in the https://github.com/microsoft/WSL repo, or provide steps to reproduce the issue without the tmate step.

@ssbarnea
Author

Not really, because using WSL2 on my own Windows machine on Azure works fine, without these problems.

AFAIK, this issue is 100% related to GitHub runners, and it is a clear regression. Keep in mind that many users might get WSL2 automatically as a side effect of environments being upgraded.

Also, due to the nature of the service our hands are tied as we cannot debug the lost network connectivity ourselves.

@miketimofeev
Contributor

miketimofeev commented Apr 14, 2022

@ssbarnea the issue is reproducible when installing the GitHub runner agent on a windows-2022 VM, so I believe it can be debugged locally.

@al-cheb
Contributor

al-cheb commented Apr 14, 2022

> Not really, because using WSL2 on my own Windows machine on Azure works fine, without these problems.
>
> AFAIK, this issue is 100% related to GitHub runners, and it is a clear regression. Keep in mind that many users might get WSL2 automatically as a side effect of environments being upgraded.
>
> Also, due to the nature of the service our hands are tied as we cannot debug the lost network connectivity ourselves.

Windows Server 2022 doesn't support WSLv2 - microsoft/WSL#6301 (comment)

@ssbarnea
Author

Is "not supported" a euphemism for "broken"? I think we need clearer messaging here. All users would prefer to know whether a team is working on or planning a fix, or not.

https://docs.microsoft.com/en-us/windows/wsl/install-on-server does not indicate in any way that this is not supported on Windows Server editions. AFAIK, WSL2 is the default on newer OS versions when you run wsl --install; it is not as if you would get v1 instead of v2.

If you run install on 2022, you get v2, not v1, and without any red warning about this being an unsupported platform.

https://github.com/actions/virtual-environments#available-environments does not list any Windows non-server options.

Putting these together, should we expect that Microsoft/GitHub do not provide any hosted runners that can run WSL2 under GitHub Actions? If that is true, maybe it is time to state this clearly on that page, preferably in bold letters.

To clarify, WSL2 is required by any POSIX tools that use containers, as container engines (podman or docker) do not run under WSL1.

Somehow I have the impression that the dead cat is being sent back and forth between the virtual-environments and WSL teams, with neither willing to address the issue or at least to acknowledge working on it in some way, for example by adding a runner like windows-10 or windows-11, which apparently are not affected by these issues.

As a maintainer of a VS Code Ansible extension, I find it harder and harder to support the use of the Microsoft Windows operating system, because it is impossible to run GitHub Actions CI/CD under it. Should we start pushing everyone to avoid Windows and go for either Linux or macOS? I really do not want to end up having to put up a big popup that says "Use of this extension under Windows is unsupported, please do not file any bug reports about it."

I hope we can find a solution for this.

@al-cheb
Contributor

al-cheb commented Apr 14, 2022

> Is "not supported" a euphemism for "broken"? I think we need clearer messaging here. All users would prefer to know whether a team is working on or planning a fix, or not.

We could try to help if you provide steps to reproduce the issue with WSLv1 without using the msys2 subsystem.

> If you run install on 2022, you get v2, not v1, and without any red warning about this being an unsupported platform.

Currently, only WSLv1 is supported on Windows Server 2022: microsoft/WSL#6301 (comment)

> Putting these together, should we expect that Microsoft/GitHub do not provide any hosted runners that can run WSL2 under GitHub Actions? If that is true, maybe it is time to state this clearly on that page, preferably in bold letters.

We have never mentioned WSLv2 support on GitHub Actions.

> To clarify, WSL2 is required by any POSIX tools that use containers, as container engines (podman or docker) do not run under WSL1.

In that case you should use self-hosted runners, which support WSLv2. There is nothing we can do on our side.
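
For readers who go the self-hosted route, a job targeting such a runner might look roughly like the sketch below. It assumes a self-hosted Windows runner registered with the default `self-hosted` and `windows` labels and a WSL distribution already installed on that machine; the job name and timeout are placeholders:

```yaml
name: wsl-version-check         # hypothetical workflow name
on: [push]

jobs:
  wsl2-check:                   # hypothetical job name
    runs-on: [self-hosted, windows]   # default labels of a self-hosted Windows runner
    timeout-minutes: 10         # placeholder job-level limit
    steps:
      - name: Show installed distributions and their WSL version
        shell: powershell
        run: wsl --list --verbose
```

The `VERSION` column printed by `wsl --list --verbose` shows whether each distribution runs under WSL1 or WSL2, which is the distinction this thread turns on.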
