
Broken C++ runtime on windows-2022 version 20240603.1.0 #10020

Closed
2 of 14 tasks
lelegard opened this issue Jun 8, 2024 · 34 comments
Comments

@lelegard

lelegard commented Jun 8, 2024

Description

Something was broken in runner image windows-2022 version 20240603.1.0, maybe in the VC++ runtime.

When running test programs in a workflow, using a C++ mutex (std::mutex) immediately terminates the application with error 1.

As a consequence, all continuous integration pipelines and non-regression test suites for C++ applications are broken and unusable as soon as the application uses a mutex.

This is a very serious issue which should be addressed with a high priority.

The problem appears after the upgrade of windows-2022 (aka "windows-latest")

  • from runner version 2.316.1, image version 20240514.3.0
  • to runner version 2.317.0, image version 20240603.1.0

Platforms affected

  • Azure DevOps
  • GitHub Actions - Standard Runners
  • GitHub Actions - Larger Runners

Runner images affected

  • Ubuntu 20.04
  • Ubuntu 22.04
  • Ubuntu 24.04
  • macOS 11
  • macOS 12
  • macOS 13
  • macOS 13 Arm64
  • macOS 14
  • macOS 14 Arm64
  • Windows Server 2019
  • Windows Server 2022

Image version and build link

The problem is demonstrated in the following simple repo: https://github.com/lelegard/gh-runner-lock

The log with the sample failure: https://github.com/lelegard/gh-runner-lock/actions/runs/9431982526/job/25981214253

Is it regression?

Last worked on windows-2022 runner version 2.316.1, image version 20240514.3.0

Expected behavior

The C++ applications which are built as part of a workflow should not crash during the subsequent test phase.

Actual behavior

See repro section.

Repro steps

The problem is demonstrated in the following simple repo: https://github.com/lelegard/gh-runner-lock

The log with the sample failure: https://github.com/lelegard/gh-runner-lock/actions/runs/9431982526/job/25981214253

The C++ program is quite simple:

#include <mutex>
#include <iostream>
#include <cstdlib>  // for EXIT_SUCCESS
int main()
{
    std::cout << "1" << std::flush << std::endl;
    std::mutex m;
    std::cout << "2" << std::flush << std::endl;
    std::lock_guard<std::mutex> lock(m);
    std::cout << "3" << std::flush << std::endl;
    return EXIT_SUCCESS;
}

Of course, being so simple, this program works well everywhere, including on local Windows development systems.

When executed in a GitHub workflow, starting with Windows runner version 2.317.0, it fails in the lock step. The above-mentioned log contains this:

1
2
Error: Process completed with exit code 1.

Background

The problem initially appeared after that upgrade on the TSDuck project, where all workflows suddenly failed on Windows platforms.

The project is quite big (~ 350,000 lines of C++ code). Everything works fine on local Windows development systems. Only the GitHub CI failed. Because of the size of the project and the absence of direct interaction with the GitHub runner, identifying the reason for the failure was quite hard. I spent hours on test repos and ran 52 versions of the CI workflow to understand the nature of the problem.

@grafikrobot

We worked around the problem by switching to the static MSVC runtime (bfgroup/b2@0075644). We plan on staying with the static runtime to avoid hitting this problem again and again, as it's not the first time such botched updates have happened.

@rzblue

rzblue commented Jun 9, 2024

Looks like the same as #10004, there are some workarounds in that issue.

@lelegard
Author

lelegard commented Jun 9, 2024

@grafikrobot

We worked around the problem by switching to using the static msvc runtime

Glad it works for you. However, I have many DLL's and executables in the project. We can't link with the static runtime. If we did so, each process would end up with as many instances of the runtime as DLL's (+1 for the .exe), which does not work.

@rzblue

Looks like the same as #10004, there are some workarounds in that issue.

Thanks for pointing this out. This is a very recent one too. They opened it while I was chasing the reason for the problem (it took a long time to identify the mutex issue).

@mmomtchev

Glad it works for you. However, I have many DLL's and executables in the project. We can't link with the static runtime. If we did so, each process would end up with as many instances of the runtime as DLL's (+1 for the .exe), which does not work.

In fact, unless you have total control over your users' machines, you should be grateful to GitHub for showing you the problem. The expression "Windows DLL hell" exists for a reason, and the only viable solution when shipping binaries for Windows is to either go static or ship this DLL yourself. As for me, this is the last time I get burned by the MSVC runtime.

@paulhiggs

There have definitely been some changes in the 'default construction' of a mutex. Extracts from various recent versions...

14.32.31326

    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }

14.39.35519

class _Mutex_base { // base class for all mutex types
public:
#ifdef _ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
    constexpr _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_storage._Critical_section = {};
        _Mtx_storage._Thread_id        = -1;
        _Mtx_storage._Type             = _Flags | _Mtx_try;
        _Mtx_storage._Count            = 0;
    }
#else // ^^^ _ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR / !_ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR vvv
    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }
#endif // ^^^ !_ENABLE_CONSTEXPR_MUTEX_CONSTRUCTOR ^^^

14.40.33807 - this seems to be the latest

#ifdef _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR
    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }
#else // ^^^ defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) / !defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) vvv
    constexpr _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_storage._Critical_section = {};
        _Mtx_storage._Thread_id        = -1;
        _Mtx_storage._Type             = _Flags | _Mtx_try;
        _Mtx_storage._Count            = 0;
    }
#endif // ^^^ !defined(_DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR) ^^^

In 14.32 and 14.39, the default behaviour (when no compiler directive is specified) is

    _Mutex_base(int _Flags = 0) noexcept {
        _Mtx_init_in_situ(_Mymtx(), _Flags | _Mtx_try);
    }

Not defining the _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR macro when using 14.40 gives you a different (constexpr) constructor.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev

It's worse than that. I tried to install the same VC runtime as used during the build and the problem remains the same, see #10004 (comment)

@mmomtchev

This is normal, if you build with this version of MSVC, you need the new runtime.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev,

This is normal, if you build with this version of MSVC, you need the new runtime.

Precisely: I explicitly install on the runner system the MSVC runtime with which I built the code, so the same runtime is used for compilation and execution. But it does not work. Running the application still fails on locking the mutex.

There is an inconsistency somewhere. Either the VCRedist package in the VS tree is not the same as the one used by the compiler, or some already installed MSVC runtime takes precedence because of PATH settings. So, this is either an inconsistency in VS or an inconsistency in the GH runner.

@mmomtchev

MSVC does not build with the runtime. MSVC produces code that expects to find its runtime. It would be the same with gcc on Linux: install an older shared-library runtime and build with a new compiler that uses something that does not exist in this older runtime; it won't work. The difference is that on Linux these days you get an error saying the dynamic linker cannot find its symbols, while on Windows you get a crash. Microsoft should definitely find a solution to this problem. The Linux solution was also retrofitted much later than the original design.

@lelegard
Author

lelegard commented Jun 9, 2024

@mmomtchev

MSVC does not build with the runtime. MSVC produces code that expects to find its runtime.

You misinterpreted what I meant. The compiler and the RTL work together, always. The compiler (well, the compilation environment at large) provides header files. These header files contain specific definitions, here the variants of the mutex constructor. The binary of the runtime must be compatible with these definitions. If a new version of the compiler (and compilation environment) introduces a new version of a constructor, the corresponding code must be in the RTL. Therefore, there is a new version of the RTL which comes (somehow) with the compiler.

When you install Visual Studio, the installed tree of files contains a VCRedist package, a package to install on target systems to make sure that they will be able to run applications which are compiled by this compiler.

This is what I mean: when you build with a given version of Visual Studio, the headers which are used during the compilation must be compatible with the provided VCRedist package. This is why, when you package an application, you typically include this VCRedist in the package of your application and install both at the same time. Thus, you can be confident that your application will work on the target system, even if it is a bit older (up to a point).

This is why, in my test, I explicitly installed the VCRedist that is found in the Visual Studio setup of the GH runner, before running the application (before compiling in fact). The expected result is that the application which is built is compatible with the RTL that we just installed.

And this is what fails... Therefore, something is rotten in the state of GitHub.

@mmomtchev

You misinterpreted what I meant.

Yes, indeed. I agree: they updated the compiler without updating the VCRedist package. However, this is also a huge wake-up call for everyone who ships Windows binaries. In my case, it is a Node.js addon that is installed through npm and does not have an installer. In those cases, the only viable solution is /MT.

@RaviAkshintala
Contributor

@lelegard We are looking into the issue and will get back to you after investigating.

@lelegard
Author

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

@MarkCallow
Copy link

Thanks for your confirmation, we are closing the issue as completed.

All we've confirmed is that the runner image is still broken. I am therefore in total agreement with @lelegard.

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

@mprather

@lelegard

Thanks for your confirmation, we are closing the issue as completed.

@RaviAkshintala, I don't understand. The confirmation indicated that the updated image is still unreliable and does not offer a stable, reliable build platform. It was clearly not a confirmation that everything is working once again. Why is this closed?

@lelegard
Author

@RaviAkshintala, you initially wrote:

Actually we look into the issue.
Thanks for your confirmation, we are closing the issue as completed.

Then I wrote this:

@RaviAkshintala, so you say that you "close this issue as completed" while you "look into the issue".
Seriously? Are you kidding?

And, after my comment, you edited your previous comment and you removed the sentence "Actually we look into the issue". It is fortunate that the editing history of posts is available to demonstrate this.

Let me say that this is extremely offensive and dishonest.

As @MarkCallow and @mprather confirmed with me, the problem is NOT fixed. Not only did you close the issue without a complete fix, but you also erased the part of the discussion which exhibits this.

@alemuntoni

I can confirm that the problem is not solved. Please reopen this issue and solve it ASAP, at least by reverting to the old working runner.

We have been experiencing broken runners for two weeks and this is very unprofessional.

@RaviAkshintala
Contributor

Hi @lelegard, we apologise for the mistake and will look into this carefully. Thanks.

@MarkCallow

If GitHub supported it, the right thing to do would be to mark this as a duplicate of #10004 so there aren't multiple threads of discussion going on.

The description I gave earlier of the JNI failure is what remains of the original problem since deployment of 20240610.1.0. Actually #10055 was opened specifically re the JNI issue. That too, in my view, is a duplicate of #10004.

@lelegard
Author

@RaviAkshintala and all GitHub folks,

Because characterizing the problem was only possible in a GitHub Actions runner context, I had to run many workflows on a copy of a big repo to pin down the C++ std::mutex issue.

Because of this problem, which was created by GitHub with a careless, insufficiently tested upgrade, I burnt all my Actions credits:

You've used 100% of included services for GitHub Actions.
To continue using Actions & Packages uninterrupted, update your spending limit.

This is the first time this has happened to me in 11 years of GitHub usage.

GitHub cannot credit back the many hours of my time I lost on this issue (and many others' time as well). However, it would be fair of GitHub to restore my Actions credits. Again, this credit was lost because of a GitHub bug, not because of my own usage or the usage of my project.

So, please consider recrediting my Actions quota.

@lelegard
Author

To all: 7 days after complaining that investigating GitHub's problem burnt all my GH Actions credits, and asking for a refill, I still have nothing. My GH Actions credit is still zero. All burnt to do what GH should have done: investigate the problem that they created. And the problem is still not fixed. The contempt and disregard of GH for its users seems to have no limit.

@connorjclark

@lelegard Did you file a customer support request? Those have always been resolved to my satisfaction. A comment in this thread won't get you the help you seek.

alemuntoni added a commit to alemuntoni/vclib that referenced this issue Jul 2, 2024
alemuntoni added a commit to cnr-isti-vclab/vclib that referenced this issue Jul 2, 2024
@aaron-bray

aaron-bray commented Jul 2, 2024

I have been googling a similar issue, so maybe this is the same problem?

It seems to be the JVM in general....

corretto/corretto-8#510

@MarkCallow

MarkCallow commented Jul 3, 2024

I have been googling a similar issue, so maybe this is the same problem?

Almost certainly. See #10055. As I noted there, since you can't know which JVM your users are running, you need to define the macro that avoids the new constexpr mutex constructor, to make your JNI modules compatible with older versions of msvcp140.dll.

@RaviAkshintala
Contributor

@RaviAkshintala

Thanks for the update. It works "a bit better" but all JNI applications (Java Native Interface) are still crashing. So, no, the updated runner is not acceptable.

Let me explain: In my project, I have C++ DLL's and C++ executables. There are also Python and Java bindings. All C++ applications now work correctly. Same for Python applications (the Python interpreter successfully loads my DLL when Python applications call my Python bindings).

However, when a Java application calls the Java bindings, loading the DLL fails with this error:

Exception in thread "main" java.lang.UnsatisfiedLinkError: D:\a\tsduck\tsduck\bin\Release-x64\tsduck.dll:
A dynamic link library (DLL) initialization routine failed

Log here: https://github.com/tsduck/tsduck/actions/runs/9518023948/job/26237929549

That is exactly the same symptom as with the previous update: the initialization routines of the DLL load stuff using non-thread-safe system functions and they use a std::mutex. That is what fails for everyone.

Note that the crash occurs only with 64-bit applications. The 32-bit version works with Java (probably not the same mixture of VC runtime DLL's).

I re-enabled the workaround I implemented earlier, defining _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR, and everything works again, Java / JNI applications included.

So, please consider adding JNI test cases in your validation suites.

@lelegard - I hope your issue got resolved, as I can see a workaround was used:
tsduck/tsduck@c537f0b
We are closing this issue. Thank you.

@lelegard
Author

lelegard commented Jul 9, 2024

@lelegard - I hope your issue got resolved, as I can see a workaround was used:
tsduck/tsduck@c537f0b
We are closing this issue. Thank you.

@RaviAkshintala, do you understand the difference between "fixing a bug" and "implementing a workaround"?

"Implementing a workaround" is a user's temporary survival effort to mitigate the consequences of a bug. Under no circumstances should implementing a workaround be considered a bug fix, or a license to close the corresponding issue.

The problem is demonstrated in project lelegard/gh-runner-lock. The workflow ci.yml tests various configurations, including two "default" configurations (meaning "normal" projects with "default" options) which should work out of the box. As long as at least one of these two default configurations is flagged with a red tick (as in this workflow log), the problem is NOT FIXED.

That is the second time you attempt to close this issue without fixing the problem. If you define your objectives as "closing issues" rather than "fixing problems", this is very unprofessional, at best.

@alemuntoni

@RaviAkshintala the issue hasn't been resolved yet.

My workflow is still encountering segmentation faults (segfaults) when running some tests that involve mutexes: https://github.com/alemuntoni/vclib/actions/runs/9854078683/job/27206058584

This issue doesn't occur on any of my local Windows machines, nor on any of the runners prior to version 20240603.1.0.

Please, reopen this issue and address it directly. Workarounds aren't a permanent solution and shouldn't be considered a fix.

@lelegard
Author

lelegard commented Jul 9, 2024

@alemuntoni, they also closed #10055 with the same type of comment ("thank you for reporting this, we see that you have found a workaround, bla, bla"). It seems they want to bury all open symptoms of the issue, without fixing it.

This is silly because the new constexpr constructor for the mutex was a good idea. Not only does it speed up the mutex constructor, but more importantly it solves potential determinism issues in the initialization order of static mutexes. However, because of a few guys whose KPI seems to be the issue closing rate, we will have to disable it forever.

norihiro added a commit to norihiro/obs-face-tracker that referenced this issue Jul 10, 2024
A crash was caused by a combination of old msvcp140.dll and new SDK.
Details were discussed on the issue [1].

This commit should be reverted once this repository starts to require
OBS Studio 30.2 or later, since OBS Studio 30.2 will require the updated
DLL [2] so that the crash won't happen.

- [1] actions/runner-images#10020
- [2] obsproject/obs-studio@00c68981ab9
@alemuntoni

This is completely unprofessional. Unfortunately, I haven't been able to find a workaround within my repository. As a result, I'm forced to disable the tests on Windows, as they consistently fail due to an issue that GitHub hasn't addressed yet.

@RaviAkshintala please reopen and address the issue properly.

@lelegard
Author

@alemuntoni

I haven't been able to find a workaround within my repository.

If you have control over the C++ code, defining _DISABLE_CONSTEXPR_MUTEX_CONSTRUCTOR during compilation should be a workaround. But if you collect binary DLL's which were compiled with a recent compiler, without that symbol, then you have a problem.

I also tried to clean up C:\hostedtoolcache, removing all MSVC runtime DLLs, using this command before running the tests:

Get-ChildItem -Path "C:\hostedtoolcache\windows" -File -Include msvcp*.dll,concrt*.dll,vccorlib*.dll,vcruntime*.dll -Recurse | Remove-Item -Force -Verbose

I do not know what C:\hostedtoolcache is. It seems to be specific to GitHub runners and the cause of the problem. Purging all MSVC runtime DLLs from it worked for me (but maybe not in all contexts). However, the above cleanup command is awfully slow: it takes 19 minutes! See this log. This slows down all your workflows and burns your GitHub Actions credits, as I bitterly experienced.

So, this is hardly an acceptable workaround.

This issue was closed.