Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Buffer.concat and Buffer.copy silently produce invalid results when the operation involves indices equal or greater than 2^32 #55422

Open
rotemdan opened this issue Oct 17, 2024 · 17 comments · May be fixed by #55492
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. help wanted Issues that need assistance from volunteers or PRs that need help to proceed. regression Issues related to regressions.

Comments

@rotemdan
Copy link

Version

v22.9.0, v23.0.0

Platform

Windows 11 x64

Microsoft Windows NT 10.0.22631.0 x64

Subsystem

Buffer

What steps will reproduce the bug?

const largeBuffer = Buffer.alloc(2 ** 32 + 5)
largeBuffer.fill(111)

const result = Buffer.concat([largeBuffer])
console.log(result)

How often does it reproduce? Is there a required condition?

Consistent in v22.9.0 and v23.0.0

What is the expected behavior? Why is that the expected behavior?

All bytes of the return buffer produced by Buffer.concat([largeBuffer]) should be identical to the source:

In this example:

111, 111, 111, 111, 111, 111, 111, 111, 111, 111, 111, ....

What do you see instead?

In the returned buffer, first 5 bytes are 111, and all following ones are 0.

111, 111, 111, 111, 111, 0, 0, 0, 0, 0, 0, ....

The console.log(result) output looks like:

<Buffer 6f 6f 6f 6f 6f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ... 4294967251
 more bytes>

Additional information

No response

@targos targos added the buffer Issues and PRs related to the buffer subsystem. label Oct 17, 2024
@rotemdan rotemdan changed the title Buffer.concat silently produces invalid output when its output size is greater than 4GB Buffer.concat silently produces invalid output when its output size is greater than 4GiB Oct 17, 2024
@avivkeller avivkeller added the confirmed-bug Issues with confirmed bugs. label Oct 17, 2024
@rotemdan
Copy link
Author

My current workaround (tested to produce correct results with sizes greater than 4 GiB):

export function concatBuffers(buffers: Buffer[]) {
	let totalLength = 0

	for (const buffer of buffers) {
		totalLength += buffer.length
	}

	const resultBuffer = Buffer.alloc(totalLength)

	if (totalLength === 0) {
		return resultBuffer
	}

	let writeOffset = 0

	for (const buffer of buffers) {
		resultBuffer.set(buffer, writeOffset)

		writeOffset += buffer.length
	}

	return resultBuffer
}

@avivkeller
Copy link
Member

avivkeller commented Oct 17, 2024

The issue started in v22.7.0. I'll start bisecting. Maybe #54087?

@avivkeller avivkeller added the regression Issues related to regressions. label Oct 17, 2024
@avivkeller
Copy link
Member

avivkeller commented Oct 17, 2024

I've finished bisecting. This was indeed caused by #54087 cc @ronag.

9f8f26eb2ff36f9352dd85643073af876b9d6b46 is the first bad commit
commit 9f8f26eb2ff36f9352dd85643073af876b9d6b46 (HEAD)
Author: Robert Nagy <ronagy@icloud.com>
Date:   Fri Aug 2 11:19:41 2024 +0200

    buffer: use native copy impl
    
    PR-URL: https://github.com/nodejs/node/pull/54087
    Reviewed-By: Yagiz Nizipli <yagiz@nizipli.com>
    Reviewed-By: Matteo Collina <matteo.collina@gmail.com>
    Reviewed-By: Benjamin Gruenbaum <benjamingr@gmail.com>
    Reviewed-By: Daniel Lemire <daniel@lemire.me>

 benchmark/buffers/buffer-copy.js |  6 ------
 lib/buffer.js                    | 11 ++++++-----
 src/node_buffer.cc               | 56 +++++++++++++++++++++++++++-----------------------------
 src/node_external_reference.h    |  9 +++++++++
 4 files changed, 42 insertions(+), 40 deletions(-)

@ronag
Copy link
Member

ronag commented Oct 21, 2024

Anyone care to open a PR? I think this could be a simple case of just switching to .set(srcBuffer) (instead of using native methods) in the case where the total length exceeds e.g. 2 GB.

@ronag ronag added the help wanted Issues that need assistance from volunteers or PRs that need help to proceed. label Oct 21, 2024
@duncpro
Copy link
Contributor

duncpro commented Oct 21, 2024

I reproduced this on macOS.

@ronag I'd like to try and tackle this one.

@MrJithil
Copy link
Member

I reproduced this on macOS.

@ronag I'd like to try and tackle this one.

good luck.

@rotemdan
Copy link
Author

This call to _copy is possibly the reason:

function _copyActual(source, target, targetStart, sourceStart, sourceEnd) {
  if (sourceEnd - sourceStart > target.byteLength - targetStart)
    sourceEnd = sourceStart + target.byteLength - targetStart;

  let nb = sourceEnd - sourceStart;
  const sourceLen = source.byteLength - sourceStart;
  if (nb > sourceLen)
    nb = sourceLen;

  if (nb <= 0)
    return 0;

  _copy(source, target, targetStart, sourceStart, nb); // <--

  return nb;
}

_copy is imported from some sort of internal binding.

const {
  byteLengthUtf8,
  compare: _compare,
  compareOffset,
  copy: _copy, // <--
  fill: bindingFill,
  isAscii: bindingIsAscii,
  isUtf8: bindingIsUtf8,
  indexOfBuffer,
  indexOfNumber,
  indexOfString,
  swap16: _swap16,
  swap32: _swap32,
  swap64: _swap64,
  kMaxLength,
  kStringMaxLength,
  atob: _atob,
  btoa: _btoa,
} = internalBinding('buffer');

A thorough solution is to ensure this method correctly handles large array sizes, or fails.

Just working around it by falling back to TypedArray.set, would leave the possibility of a future issue if some other code calls _copy.

@duncpro
Copy link
Contributor

duncpro commented Oct 21, 2024

So the root cause of this problem is 32-bit integer overflow in SlowCopy in node_buffer.cc here.

const auto target_start = args[2]->Uint32Value(env->context()).ToChecked();
const auto source_start = args[3]->Uint32Value(env->context()).ToChecked();
const auto to_copy = args[4]->Uint32Value(env->context()).ToChecked();

Apparently Uint32Value performs a wrapping conversion. So that's why in the example below the target buffer only gets filled with 5 bytes.

const largeBuffer = Buffer.alloc(2 ** 32 + 5)
largeBuffer.fill(111)

const result = Buffer.concat([largeBuffer])
console.log(result); // 6f 6f 6f 6f 6f 00 00 00 ...
                     // 1  2  3  4  5

Simply replacing Uint32Value with IntegerValue will fix this barring edge cases I've yet to fully consider.

@rotemdan
Copy link
Author

rotemdan commented Oct 21, 2024

I'm not sure what exactly the binding refers to, but I found a candidate method in the C++ code (at node/src/node_buffer.cc) that treats all arguments as Uint32:

// Assume caller has properly validated args.
void SlowCopy(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);

  ArrayBufferViewContents<char> source(args[0]);
  SPREAD_BUFFER_ARG(args[1].As<Object>(), target);

  const auto target_start = args[2]->Uint32Value(env->context()).ToChecked();
  const auto source_start = args[3]->Uint32Value(env->context()).ToChecked();
  const auto to_copy = args[4]->Uint32Value(env->context()).ToChecked();

  memmove(target_data + target_start, source.data() + source_start, to_copy);
  args.GetReturnValue().Set(to_copy);
}

Regardless on whether it's the method used in the binding, using Uint32Value to extract the arguments doesn't seem right.

This method follows, also taking in uint32_ts:

uint32_t FastCopy(Local<Value> receiver,
                  const v8::FastApiTypedArray<uint8_t>& source,
                  const v8::FastApiTypedArray<uint8_t>& target,
                  uint32_t target_start,
                  uint32_t source_start,
                  uint32_t to_copy) {
  uint8_t* source_data;
  CHECK(source.getStorageIfAligned(&source_data));

  uint8_t* target_data;
  CHECK(target.getStorageIfAligned(&target_data));

  memmove(target_data + target_start, source_data + source_start, to_copy);

  return to_copy;
}

@duncpro
Copy link
Contributor

duncpro commented Oct 21, 2024

@rotemdan this is correct

@rotemdan
Copy link
Author

rotemdan commented Oct 21, 2024

If you simply search for the string "uint32" in node/src/node_buffer.cc, you'd realize that many other methods assume that indices are uint32 (4 GiB max). Examples I've found:

  • CopyArrayBuffer
  • Fill
  • StringWrite
  • FastByteLengthUtf8
  • SlowIndexOfNumber (makes assumption that needle is uint32 - not the index)
  • FastIndexOfNumber (makes assumption that needle is uint32 - not the index)
  • WriteOneByteString
  • FastWriteString
  • ...

@ronag
Copy link
Member

ronag commented Oct 21, 2024

I think the fast methods won't get called with anything that doesn't fit into uint32.

@ronag
Copy link
Member

ronag commented Oct 21, 2024

It's the slow methods that need fixing I guess. Should we even support 4G+ Buffers? @jasnell

@rotemdan
Copy link
Author

rotemdan commented Oct 21, 2024

It already supports large typed arrays (new Uint8Array(>= 4 GiB)) and buffers (Buffer.alloc(>= 4 GiB)) since version 22 (or earlier? not sure), which I think is great because it opened up many use cases that were limited before (in my case audio processing of multi-hour audio, and loading large machine-learning models, etc).

Fixing the methods in node/src/node_buffer.cc, by itself, isn't really that hard. It's more about ensuring that the code works correctly in various 32 bit and 64 bit platforms and processor architectures that are currently supported by Node.js.

As an intermediate solution, you could allow large ArrayBuffers but disallow large Buffer objects, but eventually you'd want to fix the Buffer objects to match the capabilities of ArrayBuffers (unless Buffer would be entirely deprecated at some point, or something like that).

@rotemdan
Copy link
Author

rotemdan commented Oct 22, 2024

The fix should be really simple (couldn't test because I don't really know how to compile Node.js at the moment):

In SlowCopy: change ->Uint32Value to ->IntegerValue, which would cause target_start, source_start, to_copy to receive int64_t values:

// Assume caller has properly validated args.
void SlowCopy(const FunctionCallbackInfo<Value>& args) {
  Environment* env = Environment::GetCurrent(args);

  ArrayBufferViewContents<char> source(args[0]);
  SPREAD_BUFFER_ARG(args[1].As<Object>(), target);

  const auto target_start = args[2]->IntegerValue(env->context()).ToChecked();
  const auto source_start = args[3]->IntegerValue(env->context()).ToChecked();
  const auto to_copy = args[4]->IntegerValue(env->context()).ToChecked();

  memmove(target_data + target_start, source.data() + source_start, to_copy);
  args.GetReturnValue().Set(to_copy);
}

The signature of memmove is:

_VCRTIMP void* __cdecl memmove(
    _Out_writes_bytes_all_opt_(_Size) void*       _Dst,
    _In_reads_bytes_opt_(_Size)       void const* _Src,
    _In_                              size_t      _Size
    );

This means there's an implicit cast here from int64_t to size_t. If we pre-validate (in JavaScript) that these are all positive JavaScript safe integers (between 0 and Number.MAX_SAFE_INTEGER), then the cast should be safe (could add static_cast<size_t>(...) but it wouldn't necessarily do anything).

For extra safety for 32-bit platforms, we could ensure they are all in the range of 0 to SIZE_MAX (pointer size), but those checks can also be done in JavaScript.

It's also easy fix FastCopy by changing uint32_t to size_t:

// Assume caller has properly validated args.
size_t FastCopy(Local<Value> receiver,
                  const v8::FastApiTypedArray<uint8_t>& source,
                  const v8::FastApiTypedArray<uint8_t>& target,
                  size_t target_start,
                  size_t source_start,
                  size_t to_copy) {
  uint8_t* source_data;
  CHECK(source.getStorageIfAligned(&source_data));

  uint8_t* target_data;
  CHECK(target.getStorageIfAligned(&target_data));

  memmove(target_data + target_start, source_data + source_start, to_copy);

  return to_copy;
}

These kind of changes are really simple to do.

I definitely think they are worth it.

Anyway, 4 GiB+ contiguous ArrayBuffers should be important (essential?) for WASM64, I believe (and so many other great applications, of course, like memory mapping, databases, machine-learning, large vectors/matrices etc.), and based on my observations of the Node.js code, the amount of effort that would be required to try to artificially restrict Buffer to 32-bit pointers may also be significant by itself, and may turn out to be problematic due to various subtle interactions with underlying ArrayBuffers.

Deprecating Buffer entirely would be orders of magnitudes larger in terms of effort (would require changing massive amounts of code, like string processing, networking, streams, etc.). These fixes aren't that much in comparison. I'm willing to participate, once I figure out how to compile Node.js.

@rotemdan
Copy link
Author

rotemdan commented Oct 22, 2024

I've verified that changing:

  const auto target_start = args[2]->Uint32Value(env->context()).ToChecked();
  const auto source_start = args[3]->Uint32Value(env->context()).ToChecked();
  const auto to_copy = args[4]->Uint32Value(env->context()).ToChecked();

To:

  const auto target_start = args[2]->IntegerValue(env->context()).ToChecked();
  const auto source_start = args[3]->IntegerValue(env->context()).ToChecked();
  const auto to_copy = args[4]->IntegerValue(env->context()).ToChecked();

Seems to fix the issue (tested on Windows 11 x64)

Before:
Screenshot_1

After:
Screenshot_2

Screenshot_6

I'll try to do some more testing before I'll give a pull request.


I also made a fix for CopyArrayBuffer, which is also an exported method.

I've had trouble with fixing other methods that required changing the signature, like FastCopy and FastByteLengthUtf8 since the compiler was giving errors I didn't fully understand. Maybe their signature is encoded somewhere, and I need to modify it there. I couldn't really find so far.

There are also two other minor fixes I looked at:

In StringWrite and SlowWriteString:

uint32_t written = 0;

May be changed to:

size_t written = 0;

Since those assignments may be casting from size_t anyway.

I'll try to work on each fix separately for now. Not all at once.

@rotemdan rotemdan changed the title Buffer.concat silently produces invalid output when its output size is greater than 4GiB Buffer.concat and Buffer.copy silently produce invalid results when the operation involves indices equal or greater than 2^32 Oct 23, 2024
@rotemdan
Copy link
Author

rotemdan commented Oct 23, 2024

Based on observations on the code, I realized the same problem should also occur in Buffer.copy. Turns out it did actually:

Before fix:
Screenshot_1

After fix:
Screenshot_2

Buffer.copy is defined as:

Buffer.prototype.copy =
  function copy(target, targetStart, sourceStart, sourceEnd) {
    return copyImpl(this, target, targetStart, sourceStart, sourceEnd);
  };

copyImpl (also called from Buffer.concat) calls _copyActual which calls the C++ function SlowCopy via the binding:

function _copyActual(source, target, targetStart, sourceStart, sourceEnd) {
  if (sourceEnd - sourceStart > target.byteLength - targetStart)
    sourceEnd = sourceStart + target.byteLength - targetStart;

  let nb = sourceEnd - sourceStart;
  const sourceLen = source.byteLength - sourceStart;
  if (nb > sourceLen)
    nb = sourceLen;

  if (nb <= 0)
    return 0;

  _copy(source, target, targetStart, sourceStart, nb); // <------- Binds to SlowCopy

  return nb;
}

Fixing the C++ method (SlowCopy) should also fix this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
buffer Issues and PRs related to the buffer subsystem. confirmed-bug Issues with confirmed bugs. help wanted Issues that need assistance from volunteers or PRs that need help to proceed. regression Issues related to regressions.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants