
Occasional attempting free on address which was not malloc()-ed crash in CI #78749

Closed
RedworkDE opened this issue Jun 27, 2023 · 20 comments · Fixed by #80296

Comments

@RedworkDE
Member

RedworkDE commented Jun 27, 2023

Godot version

master CI builds

System information

Linux / Editor with doubles and GCC sanitizers

Issue description

The Linux / Editor with doubles and GCC sanitizers build can fail with an attempting free on address which was not malloc()-ed crash in the Open and close editor (Vulkan) step.

https://github.com/godotengine/godot/actions/runs/5366302204/jobs/9735786296#step:15:138
https://github.com/godotengine/godot/actions/runs/5389199469/jobs/9783070708?pr=78740#step:15:138

The failure doesn't seem related to either PR (in fact, one of them is just a docs change). I have only seen this issue those two times, but it is still probably worth looking into.

Steps to reproduce

Open and close https://github.com/godotengine/regression-test-project/archive/4.0.zip using the Vulkan renderer.

Minimal reproduction project

https://github.com/godotengine/regression-test-project/archive/4.0.zip

@AThousandShips
Member

I've come across this several times, and it was discussed some on rocketchat. Since re-running the failed check almost always works for me, I suspect it's a memory issue causing some corruption on the runner, similar to how we previously had issues with compilation in SCU builds.

@akien-mga
Member

This is failing on the SCU build, and these errors started after enabling it, so it's probably related.

@akien-mga akien-mga added this to the 4.2 milestone Jun 27, 2023
@akien-mga akien-mga self-assigned this Jun 27, 2023
@lawnjelly
Member

lawnjelly commented Jun 28, 2023

Sorry I've been out of action for a bit (my new PC arrived yesterday and is almost set up for compiling). We are planning to modify the SCU build slightly soon so that the CI can specify a maximum SCU file size, which will limit RAM use; at the moment it will be very high, especially with sanitizers. I can help investigate this next week, but feel free to turn off SCU for this action temporarily, or use it on a simpler (non sanitizer / tests) build.

SCU build on sanitizer / test build?

If we want to run SCU in CI, we need to verify that running it on the sanitizer / tests run (because it is the slowest) is actually the best choice. Although in theory SCU should accelerate CI too, when I looked at the actual build times (for CI anyway) they weren't necessarily faster than the regular build. This may be due to e.g. the CI caching, the larger translation units, or simply running low on RAM. There's also no guarantee that the SCU build will behave exactly like the regular build if there are code errors. The gold standard for sanitizer errors should probably be the regular build imo (although the SCU sanitizer build will also reveal different bugs).

Some ideas for what might be happening

An SCU build has fewer, larger files compared to a normal build with lots of small files, so the two can stress things in different ways. If the build itself is working OK but the sanitizers are failing on a test, that would seem to indicate e.g. a corrupt build, maybe an order of construction / destruction problem, or a timing difference exposing a race condition.

I'm not an expert on GitHub Actions, but here are some possible ideas:

  • Running out of RAM on certain runs, causing spurious errors (very likely)
  • Problem with the SCons caching
  • Timing issue with files in the CI and file locks
  • Undefined order of construction / destruction problem, which is revealed by SCU build and the sanitizers

Some things we might try to figure out the culprit:

  • limiting the number of files per SCU (lower RAM)
  • turning off CI caching

Order of construction / destruction

(this has only a small chance of being the problem, but it is worth considering if the reader is not familiar)

Order of construction / destruction bugs (usually globals / statics / possibly singletons) can be extremely nasty. This is why many programmers prefer explicit construction / destruction functions for globals rather than relying on constructors / destructors - i.e. the order is defined in code rather than determined by the compiler. Whether or not this is causing the problem, we should probably consider having a test for order of construction / destruction in the CI.

This is an Achilles heel of C++: the order of construction and destruction of globals between translation units is undefined, and it can change from build to build, resulting in bugs in some builds but not in others.

See:
https://en.cppreference.com/w/cpp/language/siof
https://stackoverflow.com/questions/335369/finding-c-static-initialization-order-problems
https://github.com/google/sanitizers/wiki/AddressSanitizerInitializationOrderFiasco

With an SCU build in particular, the order of any globals within the larger translation unit is now determined by where they appear in the "uber file", so it can expose existing order of construction bugs.
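
To make that concrete, here is a minimal sketch of the static initialization order fiasco; the file names and globals (g_log_prefix, g_banner, log_prefix()) are illustrative, not Godot code:

// log.cpp -- one translation unit in a regular build.
#include <string>
std::string g_log_prefix = "[engine] ";

// user.cpp -- another translation unit; whether g_log_prefix has been
// constructed before g_banner is undefined across translation units.
extern std::string g_log_prefix;
std::string g_banner = g_log_prefix + "starting up";

// Common fix: a function-local static is constructed on first use, so the
// order no longer depends on the linker or on how SCU groups the files.
std::string &log_prefix() {
	static std::string prefix = "[engine] ";
	return prefix;
}

In an SCU build both files end up in the same "uber file", so the order becomes whatever the grouping dictates; that can either hide or newly expose a latent bug like this.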

@akien-mga
Member

akien-mga commented Jul 1, 2023

If we want to run SCU in CI, we need to verify that running it on the sanitizer / tests run (because it is slowest) is actually the best choice. Although in theory SCU should accelerate CI too, when I looked at the actual build times (for CI anyway) they weren't necessarily faster than regular build.

For a scratch build, it does have a significant impact.

But indeed when there is a valid cache and the change doesn't impact core, it might not make as big a difference.

@RedworkDE
Member Author

After a lot of running stuff in GH Actions:

  1. The crash seems to happen somewhat more often with single precision builds than with double precision builds (~1/6 instead of ~1/16).
  2. The crash only happens on the first run; if the caches (and/or settings) exist, it never crashes.
  3. The crash is not specific to an executable / some compiler ordering choices.
  4. The crash is NOT exclusive to SCU; the same crash also happens with scu_build=no.
  5. The crash is exclusive to Vulkan; neither OpenGL nor headless has crashed in 240 open attempts.
  6. I can't reproduce on my machine; I'm not sure if I'm doing something wrong or if something is just different on my machine so that these crashes don't happen.
  7. Possibly related: sometimes this ASan CHECK failed message appears instead of the normal crash (this doesn't fail CI):

==13718==AddressSanitizer CHECK failed: ../../../../src/libsanitizer/asan/asan_allocator.cc:141 "((m->chunk_state)) == ((CHUNK_QUARANTINE))" (0x65, 0x3)
    #0 0x7f13257208be in AsanCheckFailed ../../../../src/libsanitizer/asan/asan_rtl.cc:72
    #1 0x7f132573eeee in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cc:77
    #2 0x7f132563350d in __asan::QuarantineCallback::Recycle(__asan::AsanChunk*) ../../../../src/libsanitizer/asan/asan_allocator.cc:141
    #3 0x7f132563389d in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::DoRecycle(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback) ../../../../src/libsanitizer/sanitizer_common/sanitizer_quarantine.h:191
    #4 0x7f1325633db4 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Recycle(unsigned long, __asan::QuarantineCallback) ../../../../src/libsanitizer/sanitizer_common/sanitizer_quarantine.h:179
    #5 0x7f1325635b14 in __sanitizer::Quarantine<__asan::QuarantineCallback, __asan::AsanChunk>::Put(__sanitizer::QuarantineCache<__asan::QuarantineCallback>*, __asan::QuarantineCallback, __asan::AsanChunk*, unsigned long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_quarantine.h:111
    #6 0x7f1325635b14 in __asan::Allocator::QuarantineChunk(__asan::AsanChunk*, void*, __sanitizer::BufferedStackTrace*) ../../../../src/libsanitizer/asan/asan_allocator.cc:598
    #7 0x7f13257163d8 in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:127
    #8 0x7f131386430b  (/usr/lib/x86_64-linux-gnu/libvulkan_lvp.so+0x6430b)
    #9 0x55ab3f590934 in VmaAllocator_T::FreeVulkanMemory(unsigned int, unsigned long, VkDeviceMemory_T*) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43aa2934)
    #10 0x55ab3f55118f in VmaDeviceMemoryBlock::Destroy(VmaAllocator_T*) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43a6318f)
    #11 0x55ab3f559459 in VmaBlockVector::~VmaBlockVector() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43a6b459)
    #12 0x55ab3f5a4c25  (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43ab6c25)
    #13 0x55ab3f57b194 in VmaAllocator_T::~VmaAllocator_T() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43a8d194)
    #14 0x55ab3f5a512b  (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43ab712b)
    #15 0x55ab3f598138 in vmaDestroyAllocator (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x43aaa138)
    #16 0x55ab3ecd90d4 in RenderingDeviceVulkan::finalize() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x431eb0d4)
    #17 0x55ab36fd4025 in DisplayServerX11::~DisplayServerX11() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b4e6025)
    #18 0x55ab371b3a75 in void memdelete<DisplayServer>(DisplayServer*) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b6c5a75)
    #19 0x55ab37128b12 in finalize_display() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b63ab12)
    #20 0x55ab37190485 in Main::cleanup(bool) (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b6a2485)
    #21 0x55ab36eac969 in main (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b3be969)
    #22 0x7f1324956082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #23 0x55ab36eac32d in _start (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x3b3be32d)

  8. Branch with my experiments: https://github.com/RedworkDE/godot/tree/scu-stress-test (nothing too interesting here) and the script I used for local testing (run it in the repo root with the test project zip present; warning: it (tries to) delete all Godot config and caches):

#!/bin/bash -ex

# Build the sanitizer editor once, then repeatedly open and close the test
# project from a clean state until the CI log check reports a failure.
i=0
scons platform=linuxbsd target=editor tests=no verbose=yes warnings=extra werror=yes module_text_server_fb_enabled=yes dev_build=yes scu_build=yes debug_symbols=no use_asan=yes use_ubsan=yes linker=gold
while true; do
    ((i+=1))
    echo "Attempt $i"

    # Wipe the previous attempt's project plus all Godot config/cache directories.
    rm -rf test_project
    rm -rf ~/.local/share/godot/
    rm -rf ~/.config/godot/
    rm -rf ~/.cache/godot/
    unzip 4.0.zip
    mv "regression-test-project-4.0" "test_project"

    # Open and immediately quit the editor, capturing sanitizer output; stop
    # looping as soon as the CI log checker finds an error.
    bin/godot.linuxbsd.editor.dev.x86_64.san --audio-driver Dummy --editor --quit --path test_project 2>&1 | tee sanitizers_log.txt || true
    misc/scripts/check_ci_log.py sanitizers_log.txt || break
done

@akien-mga
Member

akien-mga commented Jul 1, 2023

Thanks for testing, so the SCU stuff was a red herring and just a coincidence.

This occasional CI crash is still a recent regression that started around the time we merged the SCU CI change (#78462), so it might come from another PR merged around that time, i.e. in the past 2-3 weeks. Or possibly a change to the GitHub Actions environment.

@RedworkDE
Member Author

RedworkDE commented Jul 1, 2023

@RedworkDE
Member Author

CC @warriormaster12 @RandomShaper
Best I can tell, the multithreaded saving of the shader cache causes memory corruption, possibly of ASan's tracking data and/or by the Vulkan driver. Calling _save_pipeline_cache_threaded synchronously seems to solve the issue, but I don't really understand what is happening and will now stop looking into this.
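
For illustration, a minimal sketch (not Godot's actual code) of the difference between the two approaches; PipelineCacheSaver and its members are hypothetical names:

#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

struct PipelineCacheSaver {
	std::mutex mutex;
	std::vector<uint8_t> cache_data;

	// Synchronous variant: the save finishes before the caller continues,
	// so it cannot overlap with device or allocator teardown.
	void save_synchronously() {
		std::lock_guard<std::mutex> lock(mutex);
		// ... fetch the cache data and write it to disk here ...
	}

	// Threaded variant: if the detached worker is still running while the
	// owning objects are destroyed, it touches freed memory, which is the
	// kind of invalid free / use-after-free that ASan reports above.
	void save_on_worker_thread() {
		std::thread([this] { save_synchronously(); }).detach();
	}
};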

@warriormaster12
Contributor

@RandomShaper
Member

RandomShaper commented Aug 5, 2023

@RedworkDE @RandomShaper could be that this flag was missing 🤔

https://registry.khronos.org/vulkan/specs/1.3-extensions/man/html/VkPipelineCacheCreateFlagBits.html

That flag reduces thread safety, so to speak. It puts the responsibility for ensuring that writes to the PSO cache are synchronized on the application. In Godot, we can use it since the *_pipeline_create() functions are mutex-protected. In fact, I'm adding it in my PR #80296, but the thing is that, corruption-wise, it can't make things better; only worse if misused.
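
A minimal sketch of what that looks like on the application side, assuming a Vulkan header where the core VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT name is available (older headers only have the _EXT spelling); this is not Godot's implementation, just the pattern described above:

#include <mutex>
#include <vulkan/vulkan.h>

// With EXTERNALLY_SYNCHRONIZED, the driver skips its own locking of the cache,
// so every command that reads or writes it must be guarded by the application.
static std::mutex pso_cache_mutex;

VkPipelineCache create_externally_synchronized_cache(VkDevice device) {
	VkPipelineCacheCreateInfo info = {};
	info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
	info.flags = VK_PIPELINE_CACHE_CREATE_EXTERNALLY_SYNCHRONIZED_BIT;
	VkPipelineCache cache = VK_NULL_HANDLE;
	vkCreatePipelineCache(device, &info, nullptr, &cache);
	return cache;
}

VkResult create_pipeline_guarded(VkDevice device, VkPipelineCache cache,
		const VkGraphicsPipelineCreateInfo *create_info, VkPipeline *pipeline) {
	// Same idea as the mutex-protected *_pipeline_create() paths mentioned above.
	std::lock_guard<std::mutex> lock(pso_cache_mutex);
	return vkCreateGraphicsPipelines(device, cache, 1, create_info, nullptr, pipeline);
}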

@akien-mga
Member

Sadly we still get intermittent crashes on ASAN builds on CI with the same error.

Here's an example from a PR which was rebased to include #80296:
https://github.com/godotengine/godot/actions/runs/6068373363/job/16461258316?pr=70329

The stacktrace is different:

================================================================
==11828==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x604000019350 in thread T0
    #0 0x7f70ada8840f in __interceptor_free ../../../../src/libsanitizer/asan/asan_malloc_linux.cc:122
    #1 0x7f70a3183058 in FcStrSetDestroy (/lib/x86_64-linux-gnu/libfontconfig.so.1+0x26058)
    #2 0x7f70a3169079 in FcConfigDestroy (/lib/x86_64-linux-gnu/libfontconfig.so.1+0xc079)
    #3 0x557999447ef2 in OS_LinuxBSD::~OS_LinuxBSD() (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x30a97ef2)
    #4 0x55799941cfaf in main (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x30a6cfaf)
    #5 0x7f70accc8082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #6 0x55799941c76d in _start (/home/runner/work/godot/godot/bin/godot.linuxbsd.editor.dev.double.x86_64.san+0x30a6c76d)

0x604000019350 is located 0 bytes inside of 41-byte region [0x604000019350,0x604000019379)

================================================================
handle_crash: Program crashed with signal 11
Engine version: Godot Engine v4.2.dev.custom_build (77827979c1aafefb694ef0a71b13d38d00bc6233)
Dumping the backtrace. 
[1] ./bin/godot.linuxbsd.editor.dev.double.x86_64.san(+0x30a6d41c) [0x55799941d41c] (??:?)
[2] /lib/x86_64-linux-gnu/libc.so.6(+0x43090) [0x7f70acce7090] (??:0)
[3] /lib/x86_64-linux-gnu/libasan.so.5(+0x11b913) [0x7f70ada96913] (??:0)
[4] /lib/x86_64-linux-gnu/libasan.so.5(+0x2f03a) [0x7f70ad9aa03a] (??:0)
[5] /lib/x86_64-linux-gnu/libasan.so.5(+0x2ff34) [0x7f70ad9aaf34] (??:0)
[6] /lib/x86_64-linux-gnu/libasan.so.5(+0x11774e) [0x7f70ada9274e] (??:0)
[7] /lib/x86_64-linux-gnu/libasan.so.5(+0x115935) [0x7f70ada90935] (??:0)
[8] /lib/x86_64-linux-gnu/libasan.so.5(__interceptor_free+0xa9) [0x7f70ada883d9] (??:0)
[9] /lib/x86_64-linux-gnu/libfontconfig.so.1(FcStrSetDestroy+0x39) [0x7f70a3183059] (??:0)
[10] /lib/x86_64-linux-gnu/libfontconfig.so.1(FcConfigDestroy+0x4a) [0x7f70a316907a] (??:0)
[11] OS_LinuxBSD::~OS_LinuxBSD() (??:?)
[12] ./bin/godot.linuxbsd.editor.dev.double.x86_64.san(main+0x787) [0x55799941cfb0] (??:?)
[13] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f70accc8083] (??:0)
[14] ./bin/godot.linuxbsd.editor.dev.double.x86_64.san(_start+0x2e) [0x55799941c76e] (??:?)
-- END OF BACKTRACE --
================================================================
Aborted (core dumped)
FATAL ERROR: An incorrectly used memory was found.

@warriormaster12
Contributor

@RandomShaper What do you think, would an option to disable the pipeline cache completely for GitHub Actions be a feasible solution?

@RandomShaper
Member

So is this still caused by the Vulkan PSO cache? I thought the new crash was unrelated. Can we please try a build with it disabled, to be sure?

@akien-mga
Member

The original intermittent crash was caused by #76348, and #80296 changed the symptoms, but we still get intermittent crashes. So unless another PR merged in the meantime introduced a different intermittent memory corruption issue that was hidden by the previous crash, it's likely that the PSO cache is still the issue.

How can we disable it?

@warriormaster12
Contributor

At the moment, the only option is to comment out code related to pipeline caching. Commenting out the function _save_pipeline_cache_threaded could be a start. A better option, of course, would be a launch argument and maybe a project setting?
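
As a rough sketch of what such a guard could look like (the --disable-pipeline-cache argument and the surrounding names are hypothetical, not existing Godot options):

#include <cstring>

// Hypothetical: parse a --disable-pipeline-cache flag once at startup and
// consult it before any pipeline cache save is scheduled.
static bool g_pipeline_cache_enabled = true;

void parse_pipeline_cache_args(int argc, char **argv) {
	for (int i = 1; i < argc; i++) {
		if (std::strcmp(argv[i], "--disable-pipeline-cache") == 0) {
			g_pipeline_cache_enabled = false;
		}
	}
}

void maybe_save_pipeline_cache() {
	if (!g_pipeline_cache_enabled) {
		return; // Nothing is written, so nothing can be corrupted on CI.
	}
	// ... call the existing (threaded or synchronous) cache save here ...
}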

@warriormaster12
Contributor

Have there been new cases of the editor crashing, or has the issue by any chance resolved itself? I might be very wrong, but from what I've looked at so far, none of the new pull request actions have experienced this issue.

@akien-mga
Member

I haven't seen it in the past couple of weeks, while it used to be very frequent. Weird.

@warriormaster12
Contributor

I guess we could follow the situation and close the issue if nothing changes maybe in another couple of weeks?

@bitsawer
Member

My pet theory is that #81771 fixed these CI Vulkan pipeline cache crashes. Earlier, I noticed some weirdness with Vulkan when run with --gpu-validation: Godot immediately terminated when also run with --gpu-abort, and the validation layer error message explicitly mentioned invalid pipeline cache use. I was thinking about reporting this, but that PR fixed the errors and seemed to land around the same time the CI errors stopped, too. But this is just a guess.

@akien-mga
Member

That sounds plausible indeed. Either way, we can close this as resolved.
