
Publish event doesn't work with no error message #666

Closed
Gianluca-Casagrande-Stiga opened this issue Sep 13, 2021 · 41 comments · Fixed by mcollina/fastparallel#17

@Gianluca-Casagrande-Stiga
Contributor

Hi everyone,
I have an MQTTS server running 24/7 with about 20 clients, each publishing a 500 B payload every second.
At a random time (3-5 times a week) the publish event stops working: on the client side there is no error and the publish command still succeeds, but the broker no longer emits any publish event, so the subscribing clients obviously receive no messages.
There is no aedes error message; the only solution I've found is to restart the server.
Has anyone had the same problem?

I'm using version 0.46.1

thank you

@robertsLando
Member

Could you post some test code that reproduces the issue?

@Gianluca-Casagrande-Stiga
Contributor Author

Hi, I'm not able to reproduce the problem because I haven't found the cause yet. It happens 3-5 times a week for no apparent reason. I would like to know if anyone else has encountered the same issue.
I'm working on it, and as soon as I find a possible cause I will update this issue.

@robertsLando
Member

What persistence/mqemitter are you using?

@casagrandeg

import aedes from "aedes";
import tls from "tls";

const broker = aedes({
    // Handler body elided here; the real implementation must call
    // callback(err, successful), otherwise client connections never complete.
    authenticate: (client, username, password, callback) => {
    }
});

const port = 8883;
const options = {
    // PEM contents are provided via environment variables
    key: process.env.MQTT_BROKER_CERT_KEY,
    cert: process.env.MQTT_BROKER_CERT_CRT,
    requestCert: true, // mutual TLS: clients must present a certificate
    ca: process.env.MQTT_CLIENT_CA
};

const server = tls.createServer(options, broker.handle);
server.listen(port, () => {
});

// broker events (handler bodies elided here)
broker.on('publish', async function (packet, client) {
});
broker.on('subscribe', async function (subscriptions, client) {
});
broker.on('unsubscribe', async function (unsubscriptions, client) {
});
broker.on('keepaliveTimeout', async function (client) {
});
broker.on('client', async function (client) {
});
broker.on('clientReady', async function (client) {
});
broker.on('clientDisconnect', async function (client) {
});
broker.on('connectionError', async function (client, error) {
});
broker.on('clientError', async function (client, error) {
});
broker.on('closed', async function () {
});

Hi Daniel, this is the skeleton of my MQTT broker built with aedes.

@robertsLando
Member

OK, so you are using the in-memory mqemitter/persistence, not Redis or Mongo, right? How do the clients connect to the broker? Do they have the clean flag set to true or false? What QoS?

@casagrandeg

Yes, I'm using only the server memory. The pub/sub QoS is always 0. I haven't set the clean flag, so I think it is at its default value of true, but I will check every client.

@robertsLando
Member

I suggest double-checking the clean flag and making sure it is set to true; otherwise that could be the problem: memory fills up after a week and the broker stops processing new messages.
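
For reference, a minimal sketch of what an explicit clean session looks like on the client side (assuming the clients use MQTT.js; the URL, topics and certificate handling are placeholders):

// Minimal sketch, assuming MQTT.js clients; URL, topics and certificates are placeholders.
import mqtt from 'mqtt'

const client = mqtt.connect('mqtts://broker.example.com:8883', {
  clean: true  // clean session: the broker keeps no offline state for this client
  // key, cert and ca would also be needed here, since the broker sets requestCert: true
})

client.on('connect', () => {
  client.subscribe('sensors/+', { qos: 0 })
  setInterval(() => client.publish('sensors/1', 'payload', { qos: 0 }), 1000)
})

client.on('message', (topic, payload) => {
  console.log(topic, payload.toString())
})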

@robertsLando
Member

@Gianluca-Casagrande-Stiga check LinkedIn when you have time ;)

@Gianluca-Casagrande-Stiga
Contributor Author

Hi, we still haven't found the cause of this bug. I've only noticed that it happens considerably less often when I set requestCert to false in the TLS server options, but it is still present.
I have a question: could the cause be the Node.js version? What is the recommended Node version for aedes 0.46.1? We are using the node:12.16-alpine Docker image.
Thanks

@robertsLando
Member

@Gianluca-Casagrande-Stiga Did you check the clean flag when the clients connect? I also suggest checking process memory and CPU usage to see if there is a memory leak somewhere.

@Gianluca-Casagrande-Stiga
Contributor Author

Yes, I've checked that the clean flag is set on each client.
The server memory is constant over time (roughly 50 MB), so it doesn't seem correlated with the bug.

@robertsLando
Member

@getlarge Any clue what could be the root cause of this? I cannot find any error and no memory leak is detected; after some time the server just stops responding without any error.

@getlarge
Member

getlarge commented Oct 18, 2021

@robertsLando Unfortunately I don't have any idea. TBH I encountered the same issue (at least the symptoms look very similar) a few weeks ago and had no time to investigate.
I was using the Redis persistence and emitter, so we can probably rule out the persistence as the culprit.

@robertsLando
Member

@getlarge This user isn't using the Redis/Mongo persistence, he's using the built-in one.

@getlarge
Member

getlarge commented Oct 18, 2021

Yes, I noticed. If the error is the same but the persistence is not, then it's tempting to conclude it's not persistence/emitter related.

@robertsLando
Member

Are you using TLS too?

@getlarge
Member

Not from Node, but there is a proxy that does SSL termination.

@robertsLando
Member

I'm thinking it could also be an issue related to the mqtt-packet parser.

@JaosnHsieh

Related #553

@JaosnHsieh

JaosnHsieh commented Oct 19, 2021

I've just found that ignoring $SYS topics that include new/clients in mqemitter's emit prevents the publish failure.


@robertsLando
Member

@JaosnHsieh Your fix doesn't resolve the main bug; why does ignoring the $SYS emits work?

@robertsLando
Member

robertsLando commented Oct 19, 2021

After double-checking the code, this can only happen when using the in-memory emitter; it cannot happen with mqemitter-mongo/redis, as they don't use concurrency. What's missing in the in-memory emitter is a way to release the queue when it gets full.

cc @mcollina
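
To make the queue mechanism concrete, here is a minimal sketch using only the public mqemitter API (the topic and payload are made up): with concurrency > 0, each emit occupies a slot until its listeners call done, and if a callback leaks the slot is never freed and later emits pile up in the internal queue.

import mq from 'mqemitter'

// With concurrency > 0 the emitter limits how many messages are in flight;
// further emits are queued until a slot is released.
const emitter = mq({ concurrency: 2 })

emitter.on('some/topic', function (message, done) {
  // If done() is never reached (e.g. a callback leaks upstream),
  // the in-flight counter never goes down...
  done()
})

// ...and once `concurrency` slots are stuck, every later emit just sits
// in the internal queue and is never delivered.
emitter.emit({ topic: 'some/topic', payload: 'hello' }, function (err) {
  if (err) throw err
})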

@robertsLando
Member

I have submitted a PR on mqemitter: mcollina/mqemitter#94

@Gianluca-Casagrande-Stiga
Contributor Author

Hi, in the meantime I'm deploying my broker with the Docker image node:14.18.1-buster instead of node:12.16-alpine, to rule out a correlation with the Node version.

@robertsLando
Member

robertsLando commented Oct 19, 2021

@Gianluca-Casagrande-Stiga If possible, just to make sure this is not the problem, try setting the concurrency to 0.
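
For reference, a minimal sketch of what that could look like. This passes an explicit in-memory mqemitter with concurrency 0 via the mq option, which avoids relying on how aedes forwards its own concurrency option:

// Minimal sketch: disable concurrency on the broker's emitter.
import aedes from 'aedes'
import mq from 'mqemitter'

const broker = aedes({
  mq: mq({ concurrency: 0 })  // 0 means no concurrency limit, so emits are never queued internally
})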

@JaosnHsieh

JaosnHsieh commented Oct 19, 2021

@JaosnHsieh Your fix doesn't resolve the main bug; why does ignoring the $SYS emits work?

Not sure. I found that the released function is not called when the $SYS... new/clients topic is involved.

Thanks for your work on the mqemitter PR.

I'm wondering, is it correct to call this._released() there?
Shouldn't the root cause be the reason released is not being called on time, rather than calling it an extra time when the queue is full? Anyway, I'll follow the PR.

@robertsLando
Member

robertsLando commented Oct 19, 2021

I have fixed my PR and added a test for it. BTW, I don't think it will fix this issue.

The problem was that when the queue was full there was nothing that could release it. I have tested my PR with your test code and it's working.

@JaosnHsieh

JaosnHsieh commented Oct 19, 2021

I have fixed my PR and added a test for it. BTW, I don't think it will fix this issue.

The problem was that when the queue was full there was nothing that could release it. I have tested my PR with your test code and it's working.

I think this._parallel(..) should trigger released(...) every time.
I think the issue is that released(...) is not being called after this._parallel(..).

  this.current++
  this._parallel(this, matches, message, callback)

Could it be something wrong with the functions returned by matches?

  const matches = this._matcher.match(message.topic)

Code copied from your PR.

@robertsLando
Member

@JaosnHsieh parallel always calls released once it has finished, check line 22; released is a fastparallel option.

@robertsLando
Member

Just checked my edited PR again with your code and it seems it's not working, so I think there is a bug elsewhere...

@JaosnHsieh

JaosnHsieh commented Oct 19, 2021

@JaosnHsieh parallel always calls released once it has finished, check line 22; released is a fastparallel option.

Either it didn't finish or it wasn't called.

I tested it by adding a call count before calling this._parallel(this, matches, message, callback) and a separate releaseCount.

When it's hanging, releaseCount is lower than the this._parallel call count by the concurrency value, let's say 1 in the example repo.

$ git clone --branch reproduce-bug git@github.com:JaosnHsieh/aedes-unable-to-publish.git
$ npm i
$ npm run server
$ npm run client

You might see something like this in the terminal when publishing hangs:

$ releaseCount 30
$ callCount+unmatchedCallCount 31

@robertsLando
Member

robertsLando commented Oct 19, 2021

My thought is that somewhere in aedes we are missing a cb or something similar, and at a certain point this gets everything stuck; unfortunately I cannot find where it happens. @JaosnHsieh you mentioned the $SYS/../new/clients topic: actually, in the aedes code we publish to that topic only in aedes.js, and the second argument is noop.

@robertsLando
Member

After some digging and adding some logs, this is the state of the queue when it gets stuck:

starting parallel.. current: 1
starting parallel.. queue: Packet {
  cmd: 'publish',
  brokerId: '5bdea25f-a04d-4d5f-a19d-ddbb1d104410',
  brokerCounter: 32,
  topic: '$SYS/5bdea25f-a04d-4d5f-a19d-ddbb1d104410/new/clients',
  payload: <Buffer 38 33 35 33 39 33>,
  qos: 0,
  retain: false,
  dup: false
}
starting parallel.. callback: [Function: release]

I will try to check the fastparallel source code to see if I can find out more. BTW, it seems that by removing the publish on new/clients the bug doesn't happen.

@robertsLando
Member

robertsLando commented Oct 19, 2021

OK, that's really strange... it seems the problem is the closeSameClients function: that function is triggered every time a new client connects, and so those are the leaked callbacks that cause the hang.

The strange thing is that by digging into the fastparallel code I reached this point:

https://github.com/mcollina/fastparallel/blob/master/parallel.js#L111

Where I have added a log like:

console.log('Calling function', toCall, toCall.length, holder.release)

That outputs:

Calling function [Function: closeSameClients] 2 [Function (anonymous)]

After that nothing happens, as if the function were never called (but it actually is, since it calls done). I would like to know @mcollina's opinion on this: could it be a problem in mqemitter/fastparallel, or something else?
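
For readers not familiar with fastparallel, a minimal sketch of its API (function names and values here are illustrative): the released option is invoked whenever a parallel run completes and its holder is recycled, which, as discussed above, is the hook the in-memory mqemitter uses to free a slot in its queue.

import makeParallel from 'fastparallel'

const parallel = makeParallel({
  released: () => console.log('holder released'),  // called when a run's holder is recycled
  results: false
})

function taskA (arg, cb) {
  setImmediate(cb)  // asynchronous completion
}

function taskB (arg, cb) {
  cb()  // synchronous completion: the code path that exposed the bug discussed here
}

parallel(
  {},              // the `this` for the task functions
  [taskA, taskB],  // functions to run in parallel
  42,              // argument passed to each task
  () => console.log('all done')
)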

robertsLando added a commit to robertsLando/fastparallel that referenced this issue Oct 19, 2021
@robertsLando
Member

OK, I have found the reason: it seems the problem lies in fastparallel. Wondering if this could be the root cause of this issue too. @Gianluca-Casagrande-Stiga please try using another mqemitter (Redis or Mongo).
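
For reference, a minimal sketch of swapping in mqemitter-redis (host and port are placeholders; the rest of the broker setup stays as above). Note that this only replaces the emitter; persistence stays in memory unless a Redis/Mongo persistence is configured as well.

// Minimal sketch: plug mqemitter-redis into aedes. Host/port are placeholders.
import aedes from 'aedes'
import mqemitterRedis from 'mqemitter-redis'

const mq = mqemitterRedis({ host: '127.0.0.1', port: 6379 })
const broker = aedes({ mq })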

@Gianluca-Casagrande-Stiga
Contributor Author

I confirm that I have the same problem with the publish event when using node:14.18.1-buster too, so it is not a Node-version-related problem. As you suggested, I've just deployed the broker using mqemitter-redis. I'll let you know whether it solves the problem.
For the moment I've left the concurrency value at its default.
Thanks!

mcollina pushed a commit to mcollina/fastparallel that referenced this issue Oct 19, 2021
* fix: ensure release is called

Fixes: moscajs/aedes#666

* fix: typo

* fix: the real fix

* fix: add test

* chore: add ci and dependabot

* docs: ci badge

* chore: add missing nodejs versions

* fix: use string for versions

* fix: remove let/const and other
@robertsLando
Member

Running an npm update should fix the issue now.

@mcollina
Collaborator

Amazing @robertsLando, you have been able to track down this issue - this is kind of a degree in async debugging!

@robertsLando
Member

It cost me a headache to find that single line but... it's OK 😆

Thanks, Matteo! 🚀

@JaosnHsieh

JaosnHsieh commented Oct 20, 2021

@JaosnHsieh Your fix doesn't resolve the main bug; why does ignoring the $SYS emits work?

@robertsLando

Thanks for your fix, first of all!! I'm still curious why it's related to the $SYS ... /new/clients topic.

I found that changing done() to setImmediate(done) in aedes also fixed the issue.

diff --git a/node_modules/aedes/aedes.js b/node_modules/aedes/aedes.js
index 5f74455..9a7547f 100755
--- a/node_modules/aedes/aedes.js
+++ b/node_modules/aedes/aedes.js
@@ -151,7 +151,7 @@ function Aedes (opts) {
     if (that.clients[clientId] && serverId !== that.id) {
       that.clients[clientId].close(done)
     } else {
-      done()
+      setImmediate(done)
     }
   })
 

I have the same confusion as this issue on fastparallel.

I thought fastparallel is faster mostly because of its internal V8 caching mechanism, as the docs say.

It seems like we have to make the callbacks asynchronous when using fastparallel.
Do you have any idea why it has to be like that?

You can reproduce the fix with:

$ git clone --branch why-setimmediate-needed-on-fastparallel git@github.com:JaosnHsieh/aedes-unable-to-publish.git
$ npm i
$ npm run server
$ npm run client

@robertsLando
Member

robertsLando commented Oct 20, 2021

I found that changing done() to setImmediate(done) in aedes also fixed the issue.

It's obvious: using setImmediate doesn't run the done code immediately but on the next turn of the event loop, which avoids the bug in fastparallel because the index has time to be reset.

TL;DR: yes, that would also have fixed the problem on the aedes side, but the problem in fastparallel would still have been there (and could have shown up somewhere else in the future).

@jdiamond I suggest to take a look at: https://nodejs.org/en/docs/guides/event-loop-timers-and-nexttick/
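
To illustrate the timing difference with a generic sketch (not aedes code): a synchronous callback runs before the caller's remaining synchronous code, while setImmediate defers it to a later event-loop turn, after that bookkeeping has finished.

// Generic sketch of the ordering difference, unrelated to aedes internals.
function run (label, invoke) {
  console.log(`${label}: before callback`)
  invoke(() => console.log(`${label}: callback ran`))
  console.log(`${label}: after callback`)
}

// Synchronous: "callback ran" is printed before "after callback".
run('sync', (cb) => cb())

// Deferred: "callback ran" is printed only after the current synchronous code finishes.
run('deferred', (cb) => setImmediate(cb))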
