
Kafka microservice health check does not work #1690

Closed
2 of 4 tasks
jirawat-tokenx opened this issue Feb 20, 2022 · 13 comments

Comments

jirawat-tokenx commented Feb 20, 2022

Is there an existing issue for this?

  • I have searched the existing issues

Current behavior

Configuring a microservice ping check, like:

this.kafka.pingCheck('KAFKA', {
  name: configService.getKafkaClientName(),
  transport: Transport.KAFKA,
  options: {
    client: {
      brokers: configService.getKafkaHost(),
    },
    consumer: {
      groupId: configService.getKafkaGroupId(),
    },
  },
});

The check always returns "timeout of x ms exceeded", no matter how large the timeout is set.

But my service still works fine with the same microservice config.

Minimum reproduction code

this.kafka.pingCheck('KAFKA', {
  name: configService.getKafkaClientName(),
  transport: Transport.KAFKA,
  options: {
    client: {
      brokers: configService.getKafkaHost(),
    },
    consumer: {
      groupId: configService.getKafkaGroupId(),
    },
  },
});

Steps to reproduce

No response

Expected behavior

The health check should return an UP status when Kafka can be connected to.

Package version

8.0.4

NestJS version

8.2.6

Node.js version

14.17.3

In which operating systems have you tested?

  • macOS
  • Windows
  • Linux

Other

No response

@jirawat-tokenx (Author)

Hi, if there is anything I can provide for this issue, please let me know.

Tony133 (Contributor) commented Feb 27, 2022

Hi @jirawat-tokenx, I advise you to provide a minimal reproduction in a clonable git repository so that the core team can evaluate the problem you have reported.

fer8a commented Mar 25, 2022

The check always returns "timeout of x ms exceeded", no matter how large the timeout is set.

This is also the case for me out of the box; however, I can get a successful response by increasing the timeout directly in the options (I don't see a timeout in the options you shared, @jirawat-tokenx).

This is what I'm doing to make it work.

const kafkaOptions = {
  transport: kafkaConfig.transport,
  options: kafkaConfig.options,
  timeout: 10000,
};

this.microservice.pingCheck<KafkaOptions>('Kafka', kafkaOptions);

That being said, the response time is unmanageable. A ping check that takes ~10 seconds to respond is far too slow. Is that something that can be looked at, @Tony133?

For context, the DB and HTTP health indicators I'm using (from this same package) have a response time of ~1 second.

Tony133 (Contributor) commented Mar 25, 2022

Hi @fer8a, I think that's something that could be improved; however, if you can create a minimal reproduction in a clonable git repository, the Nest core team can better assess the issue.

fer8a commented Mar 30, 2022

Hello @Tony133, sorry for the late reply. These have been some busy days.

Take a look at the following repository.
It's a simple example exposing the health indicator at the base route /.
You can use it for the tests.

Tony133 (Contributor) commented Mar 31, 2022

Hi all, I tried @fer8a's minimal reproduction. In my tests, setting the timeout to anything up to 5140 ms gives the following error (see screenshot):

[screenshot: health check error response]

Log terminal:
[screenshot: terminal log]

Setting the timeout to 5200 ms or higher, the status check works correctly (see screenshot):

[screenshot: health check success response]

Log terminal:
[screenshot: terminal log]

This is a brief summary of the tests I ran.
We need to work out whether this is a problem specific to the @nestjs/microservices package or whether it also affects the @nestjs/terminus package as a whole, since Terminus internally uses the @nestjs/microservices package for microservice checks.

P.S. For the Kafka service I used a Docker container.

@ohudenkoongage

Hi everyone. I found that the Kafka pingCheck works if the producerOnlyMode: true flag is added. I think this could be helpful for those who are looking for a solution.

Example:

this.microservice.pingCheck('kafka', {
  transport: Transport.KAFKA,
  options: {
    ...this.configService.get('kafka.options'),
    producerOnlyMode: true,
  },
}),
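
For anyone wanting to see this wired into a full endpoint, here is a minimal sketch of a health controller using the flag. The 'kafka.options' config key and producerOnlyMode come from the snippet above; the HealthController class name, the /health route, the 'kafka' key, and the 10000 ms timeout are illustrative assumptions, not part of this thread.

import { Controller, Get } from '@nestjs/common';
import { Transport, KafkaOptions } from '@nestjs/microservices';
import { HealthCheck, HealthCheckService, MicroserviceHealthIndicator } from '@nestjs/terminus';
import { ConfigService } from '@nestjs/config';

@Controller('health')
export class HealthController {
  constructor(
    private readonly health: HealthCheckService,
    private readonly microservice: MicroserviceHealthIndicator,
    private readonly configService: ConfigService,
  ) {}

  @Get()
  @HealthCheck()
  check() {
    return this.health.check([
      () =>
        this.microservice.pingCheck<KafkaOptions>('kafka', {
          transport: Transport.KAFKA,
          timeout: 10000, // illustrative; see the timeout discussion above
          options: {
            // Spread whatever client/consumer options the app already uses.
            ...this.configService.get('kafka.options'),
            producerOnlyMode: true,
          },
        }),
    ]);
  }
}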

@jlahtinen

@Tony133 If you have time, please check https://github.com/jlahtinen/nestjskafkaterminus

I think this is about as simple as it can get to reproduce the described problem. I have no idea how to fix this.

If producerOnlyMode: true is the fix, it should be documented, and the consequences of using it should be made clear (for me everything seems to work with it on localhost, but on Kubernetes everything starts to fail again, just as it does without producerOnlyMode).

Tony133 (Contributor) commented Oct 28, 2022

Hi @jlahtinen, I tried your minimal reproduction in a local environment with Docker and managed to get it working, but I had to increase the timeout to 9000 ms or more, using NestJS v9.x and Terminus v9.x (compared with my message above, where I used NestJS v8.x and Terminus v8.x).

I attach two screenshots here:

[screenshot 1]

[screenshot 2]

I hope this has been of some help.

@jlahtinen

@Tony133 Thank you. It works as you described.

Ugzuzg commented Jan 10, 2023

A big problem with this is that the check creates a new connection, causing Kafka to rebalance the consumers, then closes it, which triggers another rebalance.

I think it's an issue with the approach itself: instead of checking whether the existing connection is healthy, the indicator checks whether Kafka is connectable, so we don't really know if the microservice itself is still connected.
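
One way to act on this observation (not something proposed in this thread, just a sketch): a custom health indicator that keeps a single long-lived kafkajs admin connection open and pings over it, so the check itself never creates or tears down a consumer and never triggers a rebalance. It still only tells you the brokers are reachable, not that the microservice's own consumer is healthy. The class name, broker address, and the choice of describeCluster() as the probe are all assumptions.

import { Injectable, OnModuleInit, OnModuleDestroy } from '@nestjs/common';
import { HealthIndicator, HealthIndicatorResult, HealthCheckError } from '@nestjs/terminus';
import { Admin, Kafka } from 'kafkajs';

@Injectable()
export class KafkaHealthIndicator extends HealthIndicator implements OnModuleInit, OnModuleDestroy {
  // One long-lived admin connection, reused by every check, so the health
  // check never joins or leaves a consumer group (no rebalances).
  private readonly admin: Admin;

  constructor() {
    super();
    this.admin = new Kafka({ brokers: ['localhost:9092'] }).admin();
  }

  async onModuleInit() {
    await this.admin.connect();
  }

  async onModuleDestroy() {
    await this.admin.disconnect();
  }

  async isHealthy(key: string): Promise<HealthIndicatorResult> {
    try {
      // Lightweight metadata request over the already-open connection.
      await this.admin.describeCluster();
      return this.getStatus(key, true);
    } catch (err) {
      throw new HealthCheckError(
        'Kafka check failed',
        this.getStatus(key, false, { message: (err as Error).message }),
      );
    }
  }
}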

@EugeneKorshenko

Yeah, I agree that it's not actually a Terminus issue; it's the natural state of things for Kafka.
If you want to check both the producer and consumer connections, you have to take into account that it takes some time to join the consumer group and for the consumer connection to become stable.
But you can just use producerOnlyMode to significantly speed things up (as suggested above). It should give you the necessary amount of confidence that your microservice is able to reach the Kafka cluster.

@Tony133 I'd probably consider introducing a default consumer config for the Kafka transport with producerOnlyMode set to true.
Or, if we don't want to mutate the health check config from inside the library, we could think about adding a warning note to the documentation.
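
A rough sketch of what such a default could look like (a hypothetical helper, not the actual Terminus code): merge producerOnlyMode: true underneath whatever the caller passes, so an explicit user value still wins.

import { KafkaOptions } from '@nestjs/microservices';

// Hypothetical helper: apply a producerOnlyMode default to user-supplied
// Kafka options; an explicit user setting takes precedence via the spread.
function withHealthCheckDefaults(options: KafkaOptions['options'] = {}): KafkaOptions['options'] {
  return {
    producerOnlyMode: true,
    ...options,
  };
}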

@BrunnerLivio (Member)

producerOnlyMode will now be set to true by default. We've also added additional e2e tests to ensure this behavior.
If you explicitly set producerOnlyMode to false, it will override the default behavior.
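
For anyone who still wants the old behavior (consumer connection included in the check), the opt-out is an explicit false, e.g. as a sketch using the same pingCheck call shown earlier in the thread; the broker address and groupId are illustrative:

this.microservice.pingCheck<KafkaOptions>('kafka', {
  transport: Transport.KAFKA,
  options: {
    client: { brokers: ['localhost:9092'] },
    consumer: { groupId: 'health-check' },
    // Explicitly opting out of the new default restores the pre-10.1.0 behavior.
    producerOnlyMode: false,
  },
});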

Released with v10.1.0 🎉
