Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: fixed the error handler & improve stability #448

Merged
merged 11 commits into from
Aug 2, 2022

Conversation

habibalkhabbaz
Copy link
Contributor

@habibalkhabbaz habibalkhabbaz commented Jul 28, 2022

Description

Based on Node.js documentation, when there is an uncaught exception found using process.on('uncaughtException') it's better to solve the actual problem or use try/catch in the right places

'uncaughtException' is a crude mechanism for exception handling intended to be used only as a last resort. The event should not be used as an equivalent to On Error Resume Next. Unhandled exceptions inherently mean that an application is in an undefined state. Attempting to resume application code without properly recovering from the exception can cause additional unforeseen and unpredictable issues.

Reference: https://nodejs.org/api/process.html#warning-using-uncaughtexception-correctly

and based on that, I updated the error handler and used try/catch in the appropriate places to improve the stability.

Motivation and Context

This pull request has improvements based on my last pull request #434, including:

  • implement errorHandlerWrapper and use it in the places where it should be
  • refactor process.on('uncaughtException') and use it only for the missed uncaught exceptions that should be solved (if any) and report them to slack
  • add await in some places to make try/catch work as expected
  • fix a possible bug while retrieving symbols using getCacheTrailingTradeSymbols with an undefined page or less than 1

How Has This Been Tested?

I ran the bot for two days and tried to inspect the crashes that may happen or happened and I applied the fixes accordingly.
Test files were also provided.

Thanks

@habibalkhabbaz
Copy link
Contributor Author

habibalkhabbaz commented Jul 28, 2022

@chrisleekr

I just increased the redlock retryDelay from 20ms to 200ms which is the optimal value in their documentation.
The reason behind that when we use a lower value like 20ms it increases the possibility of promise rejection because we are doing this:

await Promise.all([
    setupUserWebsocket(logger),
    setupCandlesWebsocket(logger, symbols),
    setupATHCandlesWebsocket(logger, symbols),
    setupTickersWebsocket(logger, symbols)
  ]);

https://github.com/habibalkhabbaz/binance-trading-bot/blob/dce8ca118dde8e7196a9c87486fd679e041e6d7d/app/server-binance.js#L43

All of them use getConfiguration inside their implementation which can cause a redlock error because there is no enough delay and they are running concurrently.
If there is no enough delay and promise rejection arises, the bot will crash on boot or when we restart the bot.
It cannot be solved with a simple try/catch because this is one time call and it must be running smoothly even without try/catch.
That's why I increase the delay.

@habibalkhabbaz
Copy link
Contributor Author

habibalkhabbaz commented Jul 28, 2022

@chrisleekr

One more update: based on the above code, we have to increase the retryCount to 3.

Why?

All promises mentioned above have getConfiguration which has lock except setupUserWebsocket. In worst case, one of them will fail 3 times.
Let's assume the following scenario for one specific symbol

First try, all of them fail because of redlock

  • ❌ setupCandlesWebsocket
  • ❌ setupATHCandlesWebsocket
  • ❌ setupTickersWebsocket

Next try, one of them processed the lock, the others fail

  • ✅ setupCandlesWebsocket
  • ❌ setupATHCandlesWebsocket
  • ❌ setupTickersWebsocket

Next try, two of them processed the lock, others fail

  • ✅ setupCandlesWebsocket
  • ✅ setupATHCandlesWebsocket
  • ❌ setupTickersWebsocket

Next try, all processed the lock successfully

  • ✅ setupCandlesWebsocket
  • ✅ setupATHCandlesWebsocket
  • ✅ setupTickersWebsocket

As we can see, setupTickersWebsocket failed 3 times.

Suggested Solutions

  1. Increase retryCount to 3 or 4 to cover all cases.
  2. Or, remove Promise.all and wait for each call synchronously.
  3. Refactor the code in another way? I don't have an idea about this.

Do you agree with me? If you agree let me know which solution you prefer or if you have another idea.
I will be happy to work on this. Or if you like and you have free time you can do it.

Finally, sorry for the long story 😆 , but I am sure this case will happen to someone because I am facing this as well.

@chrisleekr
Copy link
Owner

Hi @habibalkhabbaz

Sorry for the late reply. I've been busy with work and family.

Firstly, thank you for your contributions. It's amazing.

Let me review your code and I will give you some feedback if required.

@chrisleekr
Copy link
Owner

@habibalkhabbaz

This looks very good and I can see the code will stabilise the bot. 👍
I did leave some feedback, nothing major.

I will try to run it on my server and see how it goes.
If everything goes well, we can release it next week.

@habibalkhabbaz
Copy link
Contributor Author

Thanks @chrisleekr for your feedback. It is indeed a valuable feedback.
I just pushed a performance fix as per your feedback.

Morever, I pushed a fix for Redlock which increases the retryCount to 4 based on my comment above.
If you have any feedback or suggestion let me know. I will work on it for sure!

Thanks again!

@chrisleekr
Copy link
Owner

@habibalkhabbaz

Thanks! As always. :)

Let me run it on my machine and see how it goes.

Morever, I pushed a fix for Redlock which increases the retryCount to 4 based on my comment above.

To be honest, I am also not sure what retryCount will be best. It would be depending on the machine's spec, I guess.
Let's monitor with it.

@habibalkhabbaz
Copy link
Contributor Author

Hello @chrisleekr
Yes, let's test!
I can see it is running smoothly on my pc till now.

With retryCount: 1 most probably I have crash on boot because of concurrently calls to redis and having +130 symbols 😆. That's why I increase it.

I will try to read more about retryCount and will see if it will decrease the performance or not.

Last thing, can we include a fix for #450 and #451 in this PR? I already wrote explanation there and the fix can be easily implemented. Or it's better to create a new PR?

@chrisleekr
Copy link
Owner

+130 symbols

Whoa... that is a lot of symbols.

Last thing, can we include a fix for #450 and #451 in this PR? I already wrote explanation there and the fix can be easily implemented. Or it's better to create a new PR?

@habibalkhabbaz However, I assume the changes will be conflicted with this PR. In that case, put in this PR.
If the changes are not in conflict with this PR, I prefer to have a new PR.

@habibalkhabbaz
Copy link
Contributor Author

@chrisleekr

Whoa... that is a lot of symbols.

Haha, I like trying and doing such things.

@habibalkhabbaz However, I assume the changes will be conflicted with this PR. In that case, put in this PR.
If the changes are not in conflict with this PR, I prefer to have a new PR.

Done.
Pushed a fix on this PR to not have conflicts.

Thanks!


const execute = async (rawLogger, symbol) => {
const logger = rawLogger.child({ jobName: 'trailingTrade' });

await errorHandlerWrapper(logger, 'Trailing Trade', async () => {
if ((await cache.hget(`execute-trailing-trade`, symbol)) === 'true') {
// do nothing, there is another execution task running
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can set a flag in cache here to mark there was an attempt to execute processing and check for that flag later to reexecute processing again.

Copy link
Owner

@chrisleekr chrisleekr Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I resolved it accidentally.
It looks like @uhliksk suggesting the queue?

If implements the queue, it would need to check whether should re-queue or not; otherwise, there could be the situation of queueing the execution multiple times if the bot receives the tick multiple times during the lock.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisleekr
Yes, @uhliksk is suggesting use a queue instead of simply do nothing.
While I am worried about the delayed execution because of the queue.
To be honest, I don't know which solution is most suitable.

If you have a better idea, drop it here and I will work on it :)

Copy link
Owner

@chrisleekr chrisleekr Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@habibalkhabbaz

Since we have changed the bot to use WebSocket, this kind of issue is anticipated. Although, I was expecting the tick happens almost every second. But the reality was some coins are not very active.

One of the best Redis queue tools I've used is called Bull queue - https://github.com/OptimalBits/bull
Easy to implement, there is UI as well.

Since you've added the code to prevent multiple executions, adding a queue can be adding extra stability.
The logic would be like,

  1. Check if the symbol is current in the execution - https://github.com/chrisleekr/binance-trading-bot/pull/448/files/1bd376ac271322d07df190e5ed8ec2b4c9d6f97d..d5a41033b5d5224386bd802a6a4bfd06be3b1b59#
  2. If true, then queue to Bull. The Bull has a mechanism that can prevent duplicate jobs by setting ID.

Now I am thinking, it would be much better if adding the execution job to Bull whenever receiving the tick.
In addition, all those manual executions we added when the manual task is initiated from the UI can be added to the queue as well.

Although it would require some refactoring...

Copy link
Contributor Author

@habibalkhabbaz habibalkhabbaz Aug 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisleekr

Thanks for the inspiration!
In fact, at the beginning of refactoring the code to use the WebSocket, I use the queue but I thought it is not necessary and nothing critical will happen if I remove it and I removed it.
However, after I read your comment I started to rethink and I found it's better to use it, especially when we trigger manual actions as you said.
It can improve stability and management of actions.

I can work on the changes. Will start with small changes when I have time.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@habibalkhabbaz
That is awesome.
Sorry for not being able to help you code. Family first...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chrisleekr
Nothing wrong. Enjoy with your family. We will do what we can do with this bot :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for understanding. :)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@habibalkhabbaz

I was thinking, if you are going to implement the queue, let's merge this PR and make a new PR for the queue.
I think your code is stable.

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @chrisleekr
Yes, I see it's fine, we can merge now and make a new PR later for the queue.

@chrisleekr
Copy link
Owner

Pushed a fix on this PR to not have conflicts.

Cool, I've pulled the change and will run in my machine.

@chrisleekr chrisleekr mentioned this pull request Aug 1, 2022
@chrisleekr chrisleekr added the enhancement New feature or request label Aug 1, 2022
@chrisleekr chrisleekr merged commit 5d15757 into chrisleekr:master Aug 2, 2022
@habibalkhabbaz habibalkhabbaz deleted the fix/error-handling branch August 5, 2022 11:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants