-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory corruption and crash when streaming to zlib #45268
Comments
I can reproduce on Node.js main but it seems to be fixed by #44412. |
I can no longer reproduce the issue with Node.js 19.1.0. Can you please confirm? We also landed #45387 (not included in a release yet) and I also cannot reproduce the issue with Node.js main. |
Scratch that. I can reproduce with both Node.js 19.1.0 and Node.js main on Linux. |
This patch seems to fix the issue diff --git a/deps/zlib/zlib.gyp b/deps/zlib/zlib.gyp
index 547143e19d..50c8ae8be4 100644
--- a/deps/zlib/zlib.gyp
+++ b/deps/zlib/zlib.gyp
@@ -82,7 +82,6 @@
'defines': [
'ADLER32_SIMD_SSSE3',
'INFLATE_CHUNK_SIMD_SSE2',
- 'CRC32_SIMD_SSE42_PCLMUL',
'DEFLATE_SLIDE_HASH_SSE2'
],
'sources': [
but I' not sure if the issue is Node.js or in the zlib fork we use. |
I can not reproduce it in |
@ywave620 I can still reproduce with Node.js main. |
Then it's weird🤔️, I ran the script with the main version of Node in |
I can reproduce on every run on WSL and sometimes on macOS (13.1). I've also received this from a failed assertion in a recent run
I'm not sure if it is related. |
After a stress test in
|
Given that the original crash message provided by MaddieLowe, I suspect that bad memory allocation/deallocation is to blame. Since node performs (de)compression in bg thread, I wonder are |
Yes, they are thread safe in mac(See https://opensource.apple.com/source/libmalloc/libmalloc-53.1.1/src/malloc.c.auto.html) |
@lpinca reported removing Just a hunch, but the xmm0-xmm15 registers are caller-saved. The SIMD version of zlib's crc32 definitely clobbers them. Maybe something somewhere is not saving the xmm registers when it should? An easy way to test is to run under qemu twice, once with CPU = Intel Nehalem, then CPU = Intel Westmere. If the first one works okay and the second one crashes, then that's a pretty good indicator; Westmere was when PCLMULQDQ was introduced. |
@MaddieLowe Could you share some thought on why you write this project in the first place? what problem you want to conquer? how you come up with the |
@ywave620 from the issue description
|
Hey, yeah, I've given some background in the Additional Information section, but I can tell you more about the anonymized data. We ingest a lot of patient data with the PHI stripped out of it, and some of the data triggers this bug and some does not. When I was trying to find the smallest amount of code to reproduce this issue I tried just sending random bytes over the socket in a structure similar to what we're using for our data, but that did not reproduce the issue. There is something about this specific data that causes it and I'm not sure what. I think I talked about how the data is structured in some comments in the code but I can provide some more info. It's a size indicator followed by that number of bytes. For example, if I wanted to send my name and a message, this is what the data would look like: Another thing I can tell you is that it seems like it's necessary to send the data in chunks in order to reproduce the issue. Initially when I was trying to reproduce it I sent the whole file over the socket, and that didn't crash. But when I started breaking the data into random amounts of bytes (trying to make it more like real life) it caused the crash. |
We're experiencing an issue that might be similar or the same issue. We are extracting zip files with Word DOCX's in them in nodejs, using the After quite some debugging, we resolved the issue by pinning node to v18.12.1. Doing the same for our CI dockerfile (which runs our unit tests which include extracting the zipfile then trying to parse the DOCX files) fixed it there too. So, seems like a regression between 18.12.1 and 18.16.0. I see that for example upstream zlib was pulled in and some SIMD related compiler options were changed after those versions. If it helps, I can pin the exact nodejs version the regression happens for us? |
@credcore-dave that'd be helpful, yes. Can you try the qemu thing from #45268 (comment) once you've identified the first failing release? You don't need to set up a whole vm, just running it through qemu-user-x86_64 is sufficient. |
For our CI server: 18.15.0 is the last known good version of nodejs. 18.16.0 results in corrupted files. |
I checked out node and on the main branch, built the project. When I ran our application with the compiled from source node and tried to ingest a zip file, the files were corrupt, and the xml parser we use to read the extracted files failed. I then attempted to replicate the linked diff by removing the This time, I get a different error. The Forgot to mention: I'm on MacOS. My colleague, and our CI machine, are Linux. We've experienced both these issues on Linux too. Edit: I went back to the previous version of zlib.gyp (https://raw.githubusercontent.com/nodejs/node/978cfad00502e43dff6df2c773b3f3d43453280e/deps/zlib/zlib.gyp) and recompiled, I still get the FILE_ENDED error. |
There were no changes to deps/zlib or the zlib module in that time frame. I looked through all the other commits and there are a few that could conceivably cause this kind of fallout but it's hard to say for sure. Is it an option for you to find the offending commit with I had a look at that unzipper package but there's too much source code to figure out at a quick glance what exactly it's doing. |
Sorry for the delay in replying @bnoordhuis . I'll try and do some bisecting when I have a bit more free time and get back to you. |
I've run the test case under qemu with
|
cc: @Adenilson |
This issue is now blocking us from upgrading nodejs in production. Could the CPU-specific code causing it be reverted please? 18.15.0 is still the last known good node for us. |
It seems that the optimization causes memory corruption. Disable it until the issue is fixed upstream. Fixes: nodejs#45268
It seems that the optimization causes memory corruption. Disable it until the issue is fixed upstream. Fixes: nodejs#45268
It seems that the optimization causes memory corruption. Disable it until the issue is fixed upstream. Fixes: nodejs#45268
It seems that the optimization causes memory corruption. Disable it until the issue is fixed upstream. Fixes: nodejs#45268
I got a few questions that I hope could better insulate what the problem is, please check below: a) The report of reproducing on Mac, is it a M1 (Apple Silicon) macintosh or an Intel based? b) Is there a standalone repro case (i.e. relies only on zlib proper) that I could have access? The macro CRC32_SIMD_SSE42_PCLMUL guards the x86 vectorized implementation of CRC-32 in Chromium zlib. I would be quite surprised if that is the culprit for the observed misbehavior, since we have shipped it since 2018. There is one macro that guards an optimization that can be impacted by misbehaved client code though i.e. code that relies on undefined behavior present in canonical zlib that has changed thanks to this specific optimization. The macro is INFLATE_CHUNK_SIMD_SSE2, and I even mentioned it in the source code (i.e. https://github.com/nodejs/node/blob/main/deps/zlib/contrib/optimizations/inflate.c#L1256). What happens if you disable the above macro? |
Intel based Mac (3,2 GHz 8-Core Intel Xeon W). I have just tried again compiling the main branch, and it crashed after 2/3 runs
I don't have one.
I'm currently compiling Node.js on WSL where the issue is easier to reproduce. I will let you know. |
I have to sleep, I'll update tomorrow. |
With #49511 reverted I can still reproduce the issue after applying this patch diff --git a/deps/zlib/zlib.gyp b/deps/zlib/zlib.gyp
index 49de2a6c6d..70a05e40db 100644
--- a/deps/zlib/zlib.gyp
+++ b/deps/zlib/zlib.gyp
@@ -134,43 +134,6 @@
'<!@pymod_do_main(GN-scraper "<(ZLIB_ROOT)/BUILD.gn" "\\"zlib_crc32_simd\\".*?sources = ")',
],
}, # zlib_crc32_simd
- {
- 'target_name': 'zlib_inflate_chunk_simd',
- 'type': 'static_library',
- 'conditions': [
- ['target_arch in "ia32 x64" and OS!="ios"', {
- 'defines': [ 'INFLATE_CHUNK_SIMD_SSE2' ],
- 'conditions': [
- ['target_arch=="x64"', {
- 'defines': [ 'INFLATE_CHUNK_READ_64LE' ],
- }],
- ],
- }],
- ['arm_fpu=="neon"', {
- 'defines': [ 'INFLATE_CHUNK_SIMD_NEON' ],
- 'conditions': [
- ['target_arch=="arm64"', {
- 'defines': [ 'INFLATE_CHUNK_READ_64LE' ],
- }],
- ],
- }],
- ],
- 'include_dirs': [ '<(ZLIB_ROOT)' ],
- 'direct_dependent_settings': {
- 'conditions': [
- ['target_arch in "ia32 x64" and OS!="ios"', {
- 'defines': [ 'INFLATE_CHUNK_SIMD_SSE2' ],
- }],
- ['arm_fpu=="neon"', {
- 'defines': [ 'INFLATE_CHUNK_SIMD_NEON' ],
- }],
- ],
- 'include_dirs': [ '<(ZLIB_ROOT)' ],
- },
- 'sources': [
- '<!@pymod_do_main(GN-scraper "<(ZLIB_ROOT)/BUILD.gn" "\\"zlib_inflate_chunk_simd\\".*?sources = ")',
- ],
- }, # zlib_inflate_chunk_simd
{
'target_name': 'zlib',
'type': 'static_library',
@@ -199,8 +162,11 @@
}],
# Incorporate optimizations where possible.
['(target_arch in "ia32 x64" and OS!="ios") or arm_fpu=="neon"', {
- 'dependencies': [ 'zlib_inflate_chunk_simd' ],
- 'sources': [ '<(ZLIB_ROOT)/slide_hash_simd.h' ]
+ # 'dependencies': [ 'zlib_inflate_chunk_simd' ],
+ 'sources': [
+ '<(ZLIB_ROOT)/inflate.c',
+ '<(ZLIB_ROOT)/slide_hash_simd.h',
+ ]
}, {
'defines': [ 'CPU_NO_SIMD' ],
'sources': [ '<(ZLIB_ROOT)/inflate.c' ],
Also, if the issue was with |
It's quite possible chromium's zlib is blameless. It's been a while since I last looked at this issue but I remember seeing hints of xmm registers not being saved by callers. I have some jotted-down notes where I was apparently suspicious of xmm7 getting clobbered across calls. |
@bnoordhuis It seems that disabling CRC32_PCLMUL will be a regression of around -20% in decompression speed and -15% in compression speed. Could you (or someone from the node.js team) elaborate further on how the calls are performed between node.sj <--> zlib? |
The relevant code is in src/node_zlib.cc. We're not doing anything out of the ordinary, our C++ code just call zlib's public API. I entertained the idea that we may be miscompiling node or zlib somehow (missing or wrong compiler flags) but nothing Obviously Wrong stood out to me at the time. |
I think there is something wrong in our benchmarks or I ran the job incorrectly. See #49511 (comment). |
@lpinca I measured using zlib_bench (https://source.chromium.org/chromium/chromium/src/+/main:third_party/zlib/contrib/bench/zlib_bench.cc), where the data is all in memory and it calculates raw compression/decompression speeds of zlib. If values are smaller than what I've reported, that probably means that there are other bottlenecks probably in moving data between node.js <--> zlib, such that data integrity checks (i.e. CRC32) is not the limiting factor. |
@Adenilson yes I understand. What I find surprising is that our benchmarks show improvements without the CRC32 SIMD optimization. They just compare before and after. |
@lpinca what would be distribution of lengths of data vectors passed to zlib (i.e. the 'len' in crc32_z, https://source.chromium.org/chromium/chromium/src/+/main:third_party/zlib/crc32.c;l=702) during the benchmarks? If lengths are short (i.e. smaller than the threshold to enter vectorized code), we fallback to the portable code. This is really hot code and by disabling CRC32_SIMD_SSE42_PCLMUL quite a few branches are removed from the the hot path. |
It seems that the optimization causes memory corruption. Disable it until the issue is fixed upstream. Fixes: nodejs#45268 PR-URL: nodejs#49511 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
Version
v19.0.0
Platform
Linux 5.4.0-131-generic #147-Ubuntu SMP Fri Oct 14 17:07:22 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
Subsystem
No response
What steps will reproduce the bug?
The bug happens when I send certain data in random-sized chunks over a socket and then pipe it to gzip. I tried to reproduce it with a smaller data sample, but unfortunately it seems that very specific data is needed to reproduce this bug. The script and data to reproduce the issue are here: https://github.com/morpheus-med/node-defect. Repro steps are:
unzip anonymized-data.zip
node index.js
How often does it reproduce? Is there a required condition?
It should crash in 1-2 runs. It crashes every time for me.
What is the expected behavior?
The expected behaviour is for the program to read and pipe the data to gzip without crashing.
What do you see instead?
The program crashes with:
double free or corruption (!prev)
Aborted (core dumped)
or
node: malloc.c:4036: _int_malloc: Assertion (unsigned long) (size) >= (unsigned long) (nb)' failed.
Aborted (core dumped)
or
corrupted size vs. prev_size
Aborted (core dumped)
Additional information
The application this is currently affecting is used for streaming medical imaging data to a viewer. We've currently got around the problem by downgrading node, but we would like to be able to upgrade node in the future to keep up with security updates.
The problem isn't reproducible with node 12.16.3, but is reproducible with versions 12.17.0 and later. I've done a git bisect and it seems to have been introduced with this commit: 9e33f97
Thanks for your help
The text was updated successfully, but these errors were encountered: