
Zlib compression needs to be sync #1369

Open
ghost opened this issue Apr 23, 2018 · 48 comments

@ghost

ghost commented Apr 23, 2018

I've read and followed the zlib "memory fragmentation" issue down to where people blame the Linux kernel, glibc, and whatnot.

My two cents:

  • There is nothing wrong with the Linux kernel
  • There is nothing wrong with glibc
  • There is nothing wrong with Node.js

The issue is with websockets/ws.

Doing CPU-bound tasks "async" is nothing more than a delusion - there is no such thing as an async CPU-bound task. If it takes 1 second of CPU time, it takes 1 second of CPU time no matter when you perform it. Passing a CPU-bound task to some thread pool doesn't improve anything; in fact, it only adds more overhead:

  • Threading overhead
  • You need to copy the data one extra time (major overhead!)
  • You're basically just postponing work you need to do right now.

If you cannot keep up with what you buffer up, you're obviously going to blow up the process with too much memory usage, since you produce more than you consume.

You have two solutions:

  1. Replace all zlib calls with sync calls. All problems go away, you lower latencies, and you actually do the work you need to do instead of fooling yourself that you're doing things "in the background".

OR if you really want to try and use multithreading here (which goes against how Node.js is built and deployed):

  1. Only buffer up NUMBER_OF_LOGICAL_CPU_CORES tasks with the async version and after that do sync calls until the buffer contains fewer than NUMBER_OF_LOGICAL_CPU_CORES tasks. Aka - you NEED to block at some point because CPU work cannot be "async" like networking. This solution is more complex and goes against what Node.js is, but you get to choose (a sketch follows below).
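A minimal sketch of that second option, assuming a hypothetical compress() helper and using os.cpus().length as the core count; this only illustrates capping in-flight async zlib jobs and falling back to sync calls, it is not how ws is structured:

const os = require('os');
const zlib = require('zlib');

const MAX_IN_FLIGHT = os.cpus().length;
let inFlight = 0;

// Hypothetical helper: while fewer than MAX_IN_FLIGHT async jobs are pending,
// use the thread pool; otherwise block and do the work synchronously.
function compress(buf, done) {
  if (inFlight < MAX_IN_FLIGHT) {
    inFlight++;
    zlib.deflateRaw(buf, (err, data) => {
      inFlight--;
      done(err, data);
    });
  } else {
    done(null, zlib.deflateRawSync(buf));
  }
}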
@lpinca
Member

lpinca commented Apr 23, 2018

This issue nodejs/node#8871 (comment) is the problem in ws, and yes, it will probably go away if sync calls are used, but it is still valid because the async usage is what is actually recommended.

As written in the comment a different allocator can work around the issue and in fact, there is no leak on macOS.

I don't know if it's viable to use the Node.js zlib sync methods. It will certainly require extensive testing and a big production environment to validate the results.

@lpinca
Member

lpinca commented Apr 24, 2018

I'm not going to argue with you either. I've tried in the past, it's useless, there is no constructive discussion.
Thank you for the suggestion. If anyone wants to experiment with sync zlib calls, cool. Please do that! Run benchmarks, use it in a heavily loaded production environment, analyse the event loop lag, etc. Then eventually open a PR.

@ghost ghost closed this as completed Apr 24, 2018
@lpinca lpinca reopened this Apr 24, 2018
@lpinca
Member

lpinca commented Apr 24, 2018

Again, experimentation with this in a production environment would be greatly appreciated. Just back it up with numbers.

@lpinca
Member

lpinca commented Apr 24, 2018

Where? BitMEX? In that case why isn't @STRML proposing a patch? They proposed multiple patches in the past to work around the issue. Or are they no longer interested now since they are using uws?

Open a PR, convince maintainers that it is the right thing to do (with numbers), PR merged, discussion ends, all users of ws benefit from this. How awesome is that.

@STRML
Contributor

STRML commented Apr 24, 2018

Hey @lpinca - sorry for not updating you further. I'm happy to contribute a sync patch and put it through our benchmarks for completeness' sake. Even with concurrency=1 we still saw fragmentation-related blowups pretty much daily in production. I have not yet tested if sync will fix it but it rings true. Unfortunately this is a difficult bug to reproduce in test.

@lpinca
Member

lpinca commented Apr 24, 2018

Cool, looking forward to that. My biggest fear is that using sync calls will fix the memory fragmentation issue but introduce other issues like blocking the event loop for too long. This is what I meant above with "will probably go away".

@STRML
Contributor

STRML commented Apr 24, 2018

Yeah it depends on how you've clustered out your app. If you're doing one app per hyperthread it's already pretty close to counterproductive to thread out zlib, and the fragmentation bug takes it over the edge.

@lpinca
Member

lpinca commented Apr 24, 2018

blocking with CPU work is always happening to some degree and inevitable.

Well, this is why many computationally intensive tasks are delegated to the libuv thread pool, no?

@STRML
Contributor

STRML commented Apr 24, 2018

Well, they can be. Zlib is a bit of an outlier because the standard task delegated to the libuv pool is I/O, not CPU. If you assume that any process can actually spawn 1-5 CPU-intensive threads, this completely breaks the assumptions made when clustering. Ideally you want to spawn one cluster worker per hyperthread otherwise.
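For context, a minimal sketch of the one-cluster-worker-per-hyperthread layout mentioned above, using Node's built-in cluster and os modules (the port and handler are arbitrary):

const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // One worker per logical CPU; the assumption breaks down if each worker
  // can also keep several libuv thread-pool threads busy with zlib work.
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  http.createServer((req, res) => res.end('ok')).listen(8080);
}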

@lpinca
Member

lpinca commented Apr 24, 2018

And here we are again with that arrogant attitude. The point is that it's not ws's fault. Any Node.js app/library that uses the asynchronous zlib deflate methods is potentially affected by that memory fragmentation issue.

Is it viable to use the sync methods without incurring other issues? I don't know. I'd be very happy if that is the case, as the code will also be a lot easier to maintain.

@ghost ghost closed this as completed Apr 24, 2018
@lpinca
Member

lpinca commented Apr 24, 2018

The issue should stay open, and I already said thank you for your suggestion; not sure what else I should do haha. Please explain to me exactly what ws is doing wrong, because I'm shortsighted. Is using a recommended and public API ws's fault? Is it ws's fault if that API leads to memory fragmentation?

@luislobo

So, #1204 does not fix the issue completely, right?

@lpinca
Member

lpinca commented May 3, 2018

@luislobo according to the above comments, yes that's right.

@rahbari

rahbari commented Aug 21, 2018

After some testing, I saw that the async version of gzip takes up to 10 times longer than the sync version, depending on the content size:

  • 100 KB: the sync version takes 5 ms and the async version 20 ms
  • 2 KB: the sync version takes 0.3 ms and the async version 3 ms

As the size increases, the overhead becomes more acceptable.

As WebSocket is usually used for small chunks of JSON data, I think that regardless of the memory leak it's better to use sync deflate for small data, considering that sub-millisecond operations are negligible.

I suggest adding another threshold option for sync compression (syncThreshold maybe) with a default value of 10 KB or more. This way anyone who wants to go fully sync can increase this value.
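A rough sketch of what such an option could look like, assuming a hypothetical syncThreshold value and using Node's zlib directly; the option does not exist in ws and this ignores context takeover, it is only an illustration of the decision point:

const zlib = require('zlib');

// Hypothetical decision point: at or below syncThreshold, compress on the
// spot; above it, hand the work to the libuv thread pool.
function deflate(buf, syncThreshold, done) {
  if (buf.length <= syncThreshold) {
    done(null, zlib.deflateRawSync(buf));
  } else {
    zlib.deflateRaw(buf, done);
  }
}

// Usage: messages up to 10 KiB are compressed synchronously.
deflate(Buffer.from('{"msg":"hello"}'), 10 * 1024, (err, data) => {
  if (!err) console.log(data.length);
});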

@mk-pmb

mk-pmb commented Aug 22, 2018

Sorry for not having read it properly a minute ago. 👍 for the threshold option.

@piranna

piranna commented Aug 24, 2018 via email

@ghost
Author

ghost commented Aug 24, 2018

I'm assuming you toss the compressed variant if it turns out bigger than the original.....

@mk-pmb

mk-pmb commented Aug 25, 2018

I'll assume that for each length of input, gzip achieves maximum efficiency if and only if all input bytes are the same. In this case, and using values from 0…127, the break-even is at 23 bytes.
Inputs shorter than 23 bytes will result in larger outputs. For these, we shouldn't waste any CPU cycles on gzip and the length comparison.
Compression becomes efficient starting at 24 bytes of input. For these we should compare lengths. Maybe even use a threshold a little bit higher, assuming that most messages won't be the same 7-bit character repeated. I see how people might want to tweak this threshold based on their message characteristics.
Example: If I know that most of my messages will contain at least {"nick":"","msg":""}, I'd want to not spend extra CPU time for inputs up to 42 bytes long.
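A quick way to check where the break-even actually sits on a given Node.js version is to measure it; this is only a measurement sketch for the repeated-byte case described above:

const zlib = require('zlib');

// Find the smallest input length (all bytes identical) for which the gzipped
// output is no longer larger than the input.
for (let n = 1; n <= 64; n++) {
  const input = Buffer.alloc(n, 0x61); // n copies of 'a'
  const output = zlib.gzipSync(input);
  if (output.length <= n) {
    console.log(`break-even at ${n} bytes (output: ${output.length} bytes)`);
    break;
  }
}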

@lpinca
Member

lpinca commented Aug 25, 2018

Node.js does not have an official API to use inflate/deflate streams synchronously.

We can't use https://nodejs.org/api/zlib.html#zlib_zlib_deflaterawsync_buffer_options as that does not work as intended for fragmented messages or if context takeover is enabled. We need the ability to reuse the same https://nodejs.org/api/zlib.html#zlib_class_zlib_deflateraw instance.

In order to use a deflate stream synchronously, we need to copy this internal function https://github.com/nodejs/node/blob/5d210d4ac938a16d132a1433857927c5c58a8954/lib/zlib.js#L459-L542 but of course that can change at any time.

As said in #1369 (comment) experimentation (backed by numbers) is welcomed.

@lpinca
Member

lpinca commented Aug 25, 2018

Yes, but where is the code that produced those results? How many deflate instances were used (we have one per socket)? How long is the event loop blocked when doing that for, say, 50k clients? I would like to see this kind of stuff.
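One way to put a number on the event loop blocking is Node's perf_hooks event loop delay monitor (available since Node.js 11.10); a minimal sketch, with the sync compression load simulated by a timer:

const { monitorEventLoopDelay } = require('perf_hooks');
const zlib = require('zlib');

const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

// Simulate compressing 1000 small payloads synchronously every 100 ms.
const payload = Buffer.alloc(2048, 'x');
const timer = setInterval(() => {
  for (let i = 0; i < 1000; i++) zlib.deflateRawSync(payload);
}, 100);

setTimeout(() => {
  clearInterval(timer);
  histogram.disable();
  console.log('max event loop delay (ms):', histogram.max / 1e6);
}, 3000);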

@lpinca
Member

lpinca commented Aug 25, 2018

2KB: sync version takes 0.3ms and async version 3ms

@rahbari can you please share your code?

I did a dummy and irrelevant quick test and this is what I get:

Code
'use strict';

const crypto = require('crypto');
const zlib = require('zlib');

const buf = crypto.randomBytes(2048);

console.time('deflate sync');
for (let i = 0; i < 1000; i++) {
  zlib.deflateRawSync(buf);
}
console.timeEnd('deflate sync');

const deflate = zlib.createDeflateRaw();
deflate.resume();

let runs = 0;

function runAsync() {
  deflate.write(buf);
  deflate.flush(() => {
    if (++runs === 1000) {
      console.timeEnd('deflate async');
      return;
    }

    runAsync();
  });
}

console.time('deflate async');
runAsync();
Results
$ for i in $(seq 1 10); do node index.js; echo ""; done
deflate sync: 148.891ms
deflate async: 169.522ms

deflate sync: 149.968ms
deflate async: 164.356ms

deflate sync: 150.329ms
deflate async: 164.349ms

deflate sync: 144.671ms
deflate async: 162.748ms

deflate sync: 163.011ms
deflate async: 164.543ms

deflate sync: 147.325ms
deflate async: 163.505ms

deflate sync: 172.986ms
deflate async: 169.949ms

deflate sync: 148.711ms
deflate async: 167.773ms

deflate sync: 146.679ms
deflate async: 168.316ms

deflate sync: 153.743ms
deflate async: 166.653ms

@mk-pmb

mk-pmb commented Aug 25, 2018

I think it would be nicer to run the benchmark a few hundred times in the same Node.js process for warmup and only then collect measurements. V8 might be able to optimize functions once it knows we want to use them a lot.
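A minimal sketch of that warmup for the sync half of the benchmark above; the iteration counts are arbitrary:

const crypto = require('crypto');
const zlib = require('zlib');

const buf = crypto.randomBytes(2048);

// Warm up first so V8 can optimize the hot path before measuring.
for (let i = 0; i < 500; i++) zlib.deflateRawSync(buf);

console.time('deflate sync (warmed up)');
for (let i = 0; i < 1000; i++) zlib.deflateRawSync(buf);
console.timeEnd('deflate sync (warmed up)');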

@lpinca
Member

lpinca commented Aug 25, 2018

@piranna please read the full comment. That creates a new instance for each buffer, see https://github.com/nodejs/node/blob/5d210d4ac938a16d132a1433857927c5c58a8954/lib/zlib.js#L692-L706. We can't use that for fragmented messages or if context takeover is enabled.

@piranna

piranna commented Aug 25, 2018

But is compression done at the fragment level? Isn't it at the message level?

@lpinca
Member

lpinca commented Aug 25, 2018

It is done at the frame/fragment level, yes.

@lpinca
Member

lpinca commented Aug 25, 2018

The option to enable compression based on the message byte size is available since version 2.0.0, see https://github.com/websockets/ws/blob/1e78aba9b9b29407517ec7ba1da97427c1678e5c/doc/ws.md#new-websocketserveroptions-callback

threshold {Number} Payloads smaller than this will not be compressed. Defaults to 1024 bytes.
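For reference, a minimal server configuration that raises the threshold; it only exercises the documented option:

const WebSocket = require('ws');

const wss = new WebSocket.Server({
  port: 8080,
  perMessageDeflate: {
    threshold: 4096 // only compress payloads of 4 KiB or more
  }
});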

@lpinca
Member

lpinca commented Aug 26, 2018

I noticed that there is Zlib.prototype._processChunk(). The API is "private" and undocumented but it seems to work and it is used in user land so it will probably not go away anytime soon.

const deflate = new zlib.DeflateRaw();
const data = deflate._processChunk(chunk, zlib.Z_SYNC_FLUSH);
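Extending that two-liner, a sketch of how the sliding window could be preserved across messages by reusing the same instances; _processChunk() is private, so this may break on any Node.js release, and the returned slice is copied because it may reuse the stream's internal output buffer:

const zlib = require('zlib');

const deflate = new zlib.DeflateRaw();
const inflate = new zlib.InflateRaw();

// Reusing the same DeflateRaw/InflateRaw instances keeps the LZ77 sliding
// window between calls, which is what context takeover requires.
function compress(buf) {
  return Buffer.from(deflate._processChunk(buf, zlib.constants.Z_SYNC_FLUSH));
}

function decompress(buf) {
  return Buffer.from(inflate._processChunk(buf, zlib.constants.Z_SYNC_FLUSH));
}

const first = compress(Buffer.from('{"msg":"hello"}'));
const second = compress(Buffer.from('{"msg":"hello"}')); // smaller: context is reused
console.log(first.length, second.length, decompress(first).toString());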

@fungiboletus

fungiboletus commented Oct 24, 2019

I made the following code to reproduce the problem (still happening on alpine linux or debian buster with ws@7.2.0 and node 12 or 13).

const WebSocket = require('ws');

const wss = new WebSocket.Server({
  port: 8080,
  perMessageDeflate: {
    zlibDeflateOptions: {
      memLevel: 3,
    },
    serverMaxWindowBits: 10,
    threshold: 16,
  },
});

wss.on('connection', (ws) => {
  ws.on('message', (message) => {
    ws.send(message);
  });
  ws.on('close', () => {
    console.log(`close, ${wss.clients.size} listeners left`);
  });
});

const message = Buffer.alloc(256);

function sendMessage(ws, j) {
  if (j < 100) {
    setTimeout(() => {
      ws.send(message, sendMessage.bind(null, ws, j + 1));
    }, 16);
  } else {
    ws.close();
  }
}

for (let i = 0; i < 1000; ++i) {
  const ws = new WebSocket('ws://localhost:8080/');
  ws.on('open', () => {
    sendMessage(ws, 0);
  });
}

I tried to make the zlib compression sync, but I only managed to break the unit tests. However, from an external point of view of the ws project, I feel like the permessage-deflate.js file is already very dependent on the Node.js zlib.js implementation. So I think it wouldn't be a huge issue to go further into custom code that will break if the Node.js zlib.js changes, because it will already break with the current code.

By the way, the memory leak doesn't happen on Windows 10. I don't feel good about deploying Windows servers to fix a memory leak though.

@mk-pmb

mk-pmb commented Oct 24, 2019

Would it help to deflate/inflate buffers instead of the stream? That way we could use zlib.deflateSync. ("Added in: v0.11.12")

@fungiboletus

The difficulty is to support "context takeover".

The RFC defines it like this:

The term "LZ77 sliding window" used in this section means the
buffer used by the DEFLATE algorithm to store recently processed
input. The DEFLATE compression algorithm searches the buffer for a
match with the following input.

The term "use context takeover" used in this section means that the
same LZ77 sliding window used by the endpoint to build frames of the
previous sent message is reused to build frames of the next message
to be sent.

In my opinion, context takeover is what makes WebSocket compression very cool. You can compress based on the previous messages, which is very convenient when you have many similar messages, and I think that is very common with WebSockets.

So, ws should keep the sliding window between messages, and the easiest way to do that was to code something on top of zlib streams.
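A small sketch of why the shared sliding window matters, comparing a fresh deflateRawSync per message (no context takeover) with a reused deflate stream flushed with Z_SYNC_FLUSH; the exact byte counts will vary:

const zlib = require('zlib');

const msg = Buffer.from('{"nick":"alice","msg":"hello from the chat room"}');

// Without context takeover every message starts from an empty sliding window,
// so compressing the same message twice costs the same both times.
console.log('fresh stream:', zlib.deflateRawSync(msg).length, 'bytes per message');

// With context takeover the same stream keeps the LZ77 window between
// messages, so the second identical message compresses to just a few bytes.
const deflate = zlib.createDeflateRaw();
let bytes = 0;
deflate.on('data', (chunk) => { bytes += chunk.length; });

deflate.write(msg);
deflate.flush(zlib.constants.Z_SYNC_FLUSH, () => {
  const firstMessage = bytes;
  deflate.write(msg);
  deflate.flush(zlib.constants.Z_SYNC_FLUSH, () => {
    console.log('shared stream:', firstMessage, 'bytes, then', bytes - firstMessage, 'bytes');
  });
});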

@lpinca
Member

lpinca commented Nov 14, 2019

@fungiboletus I can't reproduce on Ubuntu 18.04 using your example and Node.js 13.1.0:

server.js
'use strict';

const WebSocket = require('ws');

const message = Buffer.alloc(256);

const wss = new WebSocket.Server({
  port: 8080,
  perMessageDeflate: {
    zlibDeflateOptions: {
      memLevel: 3
    },
    serverMaxWindowBits: 10,
    threshold: 16
  }
});

function onClose() {
  if (wss.clients.size === 0) {
    console.log('connections: 0');
  }
}

function broadcast() {
  console.log('Broadcasting message');

  for (const ws of wss.clients) {
    ws.send(message);
    ws.close();
  }
}

wss.on('connection', (ws) => {
  if (wss.clients.size === 1000) {
    console.log('connections: 1000');
    broadcast();
  }

  ws.on('close', onClose);
});

wss.on('listening', function() {
  console.log('Listening on *:8080');
});

setInterval(function() {
  console.log(process.memoryUsage());
}, 10000);
clients.js
'use strict';

const WebSocket = require('ws');

const total = 1000;
let connected = 0;

function connect() {
  const ws = new WebSocket('ws://localhost:8080/');

  ws.on('open', function() {
    if (++connected < total) connect();

    ws.on('close', function() {
      if (--connected === 0) setTimeout(connect, 5000);
    });
  });
}

connect();

Memory usage stabilizes at ~100 MiB.

@LongTengDao
Contributor

LongTengDao commented Mar 18, 2020

Why not send and receive compressed buffers in userland?

@chanced

chanced commented Jul 27, 2020

nodejs/node#34048

This PR is an experiment which hopefully mitigates #8871.
Moves zlib initialization to the point of execution on worker thread pool. This allows to postpone the corresponding memory allocation (the computation) to a later point in time which may be critical in situations when many zlib streams are created and operated simultaneously.

@timotejroiko

timotejroiko commented Aug 28, 2020

Greetings,

I am experimenting with synchronous zlib compression in ws.
This implementation is built on top of Zlib.prototype._processChunk(), but because it's also released as a standalone library, I added it to ws via a peer dependency to avoid tampering with the existing solution as much as possible. The code is still a bit rough and may be missing some error handling, but here are some early test results made with a modified version of the /bench/speed.js test:

(windows 10 build 19041, node.js 14.8.0, intel i5 7300HQ 2.5ghz)

> node speedzlib.js
Generating 100 MiB of test data...
Testing ws on [::]:8181
10000 roundtrips of 64 B binary data:   2.9s    437.94 KiB/s
5000 roundtrips of 16 KiB binary data:  2.8s    55.45 MiB/s
1000 roundtrips of 128 KiB binary data: 13.5s   18.48 MiB/s
100 roundtrips of 1 MiB binary data:    9.6s    20.87 MiB/s
1 roundtrips of 100 MiB binary data:    9.4s    21.23 MiB/s
10000 roundtrips of 64 B text data:     3.6s    352.02 KiB/s
5000 roundtrips of 16 KiB text data:    2.6s    59.87 MiB/s
1000 roundtrips of 128 KiB text data:   13.2s   18.95 MiB/s
100 roundtrips of 1 MiB text data:      9.8s    20.34 MiB/s
1 roundtrips of 100 MiB text data:      9.5s    20.96 MiB/s

> npm install fast-zlib
> node speedzlib.js
Generating 100 MiB of test data...
Testing ws on [::]:8181
10000 roundtrips of 64 B binary data:   1.1s    1.11 MiB/s
5000 roundtrips of 16 KiB binary data:  1.5s    101.33 MiB/s
1000 roundtrips of 128 KiB binary data: 10.1s   24.77 MiB/s
100 roundtrips of 1 MiB binary data:    7.3s    27.53 MiB/s
1 roundtrips of 100 MiB binary data:    8s      24.93 MiB/s
10000 roundtrips of 64 B text data:     1.2s    1.05 MiB/s
5000 roundtrips of 16 KiB text data:    1.6s    96.19 MiB/s
1000 roundtrips of 128 KiB text data:   9.7s    25.65 MiB/s
100 roundtrips of 1 MiB text data:      7.7s    26.09 MiB/s
1 roundtrips of 100 MiB text data:      8.2s    24.29 MiB/s

Synchronous zlib appears to provide performance gains of 20% to 40%+ depending on packet size. If anyone wants to experiment with it and test for memory issues, the code is available in my fork of ws.

Cheers!

@lpinca
Member

lpinca commented Aug 29, 2020

fast-zlib is a bit hacky, but that is not a problem; we can propose an API to use inflate/deflate streams synchronously in Node.js core.

The problem as I see it is doing this synchronously when there are thousands of clients. Yes, it will be slower but doing it asynchronously does not block the event loop.

The benchmark above uses only a single client and reflects the results of running this benchmark #1369 (comment). There is no need to use ws or transmit data over the network to benchmark this.

One more thing, it doesn't make sense to compress small messages. It is actually counterproductive. This is why ws does not compress messages whose size is < 1024 bytes by default. The threshold can be customized of course.

@timotejroiko

timotejroiko commented Aug 29, 2020

Indeed, I intentionally disabled the threshold to test zlib in all situations and to avoid having mixed compressed and non-compressed results.

@piranna

piranna commented Aug 29, 2020

The problem as I see it is doing this synchronously when there are thousands of clients. Yes, it will be slower but doing it asynchronously does not block the event loop.

What about a flag to let the user decide whether to use sync or async?

@lpinca
Member

lpinca commented Aug 29, 2020

It's an option but it complicates everything :(.

The user can already compress the message synchronously before sending it in a WebSocket message. It's not the same, but close enough. This also makes it possible to optimize a broadcast a lot, because only one frame has to be created and the same frame is sent to all clients. See the discussion on #617.
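A minimal sketch of that userland approach, with permessage-deflate disabled and clients expected to inflate the payload themselves; this is application-level compression, not the permessage-deflate extension:

const zlib = require('zlib');
const WebSocket = require('ws');

const wss = new WebSocket.Server({ port: 8080, perMessageDeflate: false });

function broadcast(obj) {
  // Compress once, synchronously, and reuse the same buffer for every client.
  const payload = zlib.deflateRawSync(JSON.stringify(obj));
  for (const client of wss.clients) {
    if (client.readyState === WebSocket.OPEN) client.send(payload);
  }
}

// Receiving side: JSON.parse(zlib.inflateRawSync(data).toString())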

@lpinca
Member

lpinca commented Aug 29, 2020

99% of all WS apps always send messages smaller than 1KB, so this rule is essentially disabling compression altogether.

What is this based on? Probably true for stock trading apps but 99% of all WS apps seems a bit too broad. The default value can be changed but:

  1. Compression is already disabled by default. I think there is not much need for it when using an efficient binary serialization/deserialization format.
  2. Users can customize the threshold to fit their needs.

@lpinca
Member

lpinca commented Aug 29, 2020

I've seen and worked on WS apps used to transfer huge files, used to stream audio/video data, etc. Those apps used messages way bigger than 1 KB (64 KB / 16 KB).

@ghost

ghost commented Aug 29, 2020

Yes, some use it like that. Strange. But it doesn't invalidate the observation that the vast majority of use cases don't. And it doesn't matter, really; all that matters is that sending small messages is a huge use case with WS. So having an optimized/stable execution path that works well for small messages is important. That's all.

@mk-pmb

mk-pmb commented Aug 29, 2020

I agree, we need an optimized path for applications that are basically signalling or chat, because that's how I will primarily use it. (Of course we shall still support large media transfer as well.)

@lpinca
Member

lpinca commented Aug 29, 2020

@mk-pmb to stay on topic, from my experience compressing very small messages (regardless of how compression is done, sync or async) does not make an app faster but slower.

If you are not happy with ws performance when dealing with very small messages, you are in the wrong repository, because the bottleneck is not ws but Node.js {net,tls}.Socket, so you should bring this up in the Node.js core repository.

@mk-pmb

mk-pmb commented Aug 29, 2020

compressing very small messages […] does not make an app faster but slower.

Thanks for the reminder. I should have expressed myself more clearly. I meant that even in scenarios where network traffic is the bottleneck, where a few bytes saved make a difference and it's thus acceptable to invest some more CPU cycles, our compression should still try to be economical in how many extra CPU cycles users need to invest.

You're right that most of my chat-like apps won't usually run in such transmission-restricted circumstances, though.

@fungiboletus

The discussion about performance is interesting, but I want to mention that the memory leak is still there with ws@7.3.1 and Node 14.9.0. The current implementation may be faster in some use cases and slower in others, but it does leak memory.

@lpinca
Member

lpinca commented Aug 29, 2020

@fungiboletus I was not able to reproduce the memory fragmentation issue using your code above, see #1369 (comment).

Also it should be better now as nodejs/node#34048 mitigated the original issue.
