String vs Buffer memory usage + performance #4506

Open
2 tasks done
znewsham opened this issue Nov 21, 2024 · 1 comment

Comments

@znewsham

Node.js Version

v18-v22

NPM Version

v10.8.2

Operating System

Linux zacknewsham-xps 6.8.0-48-generic #48~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 7 11:24:13 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Subsystem

buffer, string_decoder, v8

Description

I'm trying to understand why my application has high baseline memory usage - in doing this I discovered something I can't explain - strings seem to cost >10x more memory per character than the equivalent buffer. Some amount of this is expected (~2x given UTF-16 nature of JS strings) - but not on this scale.

A secondary question is why the setup time of a String->String map is so much slower (8x) than a map that takes that string and converts it to a buffer before storing.

Below is a minimal reproduction. The commented-out lines in test let you toggle between the string->string map and the string->buffer map.

I run it with --expose-gc (needed for the gc() call) just to get a valid heap snapshot at the end. The total string size stored is (17 + 1000) * 100,000 characters, so the absolute minimum memory usage would be around 100 MB (a trivial C++ implementation of the same takes 114 MB).
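To make that lower bound concrete, here is a small sketch of the arithmetic. It assumes one byte per character, which only holds because the generated ids are ASCII; the variable names mirror the reproduction but the snippet itself is illustrative.

```javascript
// Lower bound on the stored payload, assuming 1 byte per character.
// The ids are ASCII-only, so this assumption holds for UTF-8 or Latin-1 storage.
const keyCount = 100_000;
const keyLength = 17;
const valueLength = 1000;

const payloadBytes = (keyLength + valueLength) * keyCount;
const payloadMiB = payloadBytes / 1024 / 1024;

console.log(payloadBytes);          // 101700000
console.log(payloadMiB.toFixed(1)); // 97.0
```

Anything above roughly 100 MB is therefore overhead from the map, the object headers, or the string representation.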

When running with the string->string map, the memory cost is around 3.2 GB and the setup time (to populate the map) is ~11s. When running as a string->buffer map, the memory cost is 280 MB and the setup time is ~1.3s. The difference in the reported "time" (the lookup loop) is completely explicable: it is the cost of decoding the buffer back into a string on every get.
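For reference, this is roughly what the buffer-backed map does to each value, sketched in isolation. The sample value is made up; Buffer.from and toString are the same calls ConvertToBufferMap uses in the reproduction.

```javascript
// One round trip as performed by the buffer-backed map (illustrative value).
const value = 'a'.repeat(1000);

// set(): encode once; ASCII text takes 1 byte per character in UTF-8.
const stored = Buffer.from(value, 'utf-8');
console.log(stored.length); // 1000

// get(): decode on every lookup, which is where the extra lookup time goes.
const roundTripped = stored.toString('utf-8');
console.log(roundTripped === value); // true
```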

Minimal Reproduction

import { setTimeout } from "timers/promises";

const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// random alpha numeric strings of a specific length
function makeid(length) {
  let result = '';
  const charactersLength = characters.length;
  let counter = 0;
  while (counter < length) {
    result += characters.charAt(Math.floor(Math.random() * charactersLength));
    counter += 1;
  }
  return result;
}

// test setup - 100,000 keys 17 chars long, 1mn iterations, values are 1000 chars long
const keyCount = 100000;
const iterations = 1_000_000;
const keyLength = 17;
const valueLength = 1000;
const keys = new Array(keyCount).fill(0).map(() => makeid(keyLength));

function testMap(map) {
  const startSetup = performance.now();
  keys.forEach(key => map.set(key, makeid(valueLength)));
  const endSetup = performance.now();
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const key = keys[Math.floor(Math.random() * keys.length)];
    const value = map.get(key);

    // v8 optimisation busting - without this the loop is 4x faster due to optimising out the get call
    globalThis.value = value;
  }
  const end = performance.now();
  return { time: end - start, setup: endSetup - startSetup };
}

// a naive implementation that keeps the API the same but converts values into buffers
class ConvertToBufferMap extends Map {
  set(key, value) {
    // return the map, as Map.prototype.set does, so calls can still be chained
    return super.set(key, Buffer.from(value, "utf-8"));
  }
  get(key) {
    return super.get(key)?.toString("utf-8");
  }
}


async function test() {
  // const map = new Map();
  // console.log("map", testMap(map));
  const bufferMap = new ConvertToBufferMap();
  console.log("bufferMap", testMap(bufferMap));
  gc();
  // resident set size in MiB
  console.log("Memory usage: ", process.memoryUsage().rss / 1024 / 1024);

  // pause to go get a heap snapshot or whatever
  await setTimeout(100000);
}

test();

Output

bufferMap { time: 705.9530600000003, setup: 1303.258812 }
Memory usage:  279.30078125

map { time: 83.8109829999994, setup: 10450.127824000001 }
Memory usage:  3195.6953125

Before You Submit

  • I have looked for issues that already exist before submitting this
  • My issue follows the guidelines in the README file, and follows the 'How to ask a good question' guide at https://stackoverflow.com/help/how-to-ask
@znewsham
Author

As is so often the case, 10 mins after I ask the question I figure it out :( (at least partially). The problem is the test setup: it looks like makeid leaks memory. I was aware that the incremental string building would allocate a lot of memory, but thought the call to gc would clear it up. Evidently not. Additionally, it seems that any conversion (e.g., to a buffer) releases that accumulated memory.

If I change makeid to use an array of a pre-allocated length + join at the end, the setup performance and memory usage both drop below that of a buffer. I ran valgrind on it, and it didn't show any lost bytes, so the memory does get cleaned up, just not by the gc call.
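The modified function was not posted, so the following is only a sketch of what an array + join version might look like. The name makeidJoined is hypothetical; the character set is the one from the reproduction.

```javascript
const characters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789';

// Hypothetical rewrite of makeid: fill a pre-sized array, then join once,
// instead of appending to a string one character at a time.
function makeidJoined(length) {
  const parts = new Array(length);
  for (let i = 0; i < length; i++) {
    parts[i] = characters.charAt(Math.floor(Math.random() * characters.length));
  }
  return parts.join('');
}

const id = makeidJoined(1000);
console.log(id.length); // 1000
```

Swapping this in for makeid in the reproduction should leave the rest of the test unchanged, since it takes the same argument and returns a string of the same length.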
