As a PDS operator, I want to know the health of the registry API service #336

Closed
jimmie opened this issue May 16, 2023 · 9 comments

Comments

@jimmie
Member

jimmie commented May 16, 2023

This indication will come from a new API endpoint dedicated to reporting the health of the service. Its primary purpose is to give a comprehensive assessment of whether the service is healthy, for use as the healthcheck for the service listener.

Sending a request to this endpoint will provide a summary of the following:

  • OpenSearch connectivity
  • Spring Boot metrics, if possible
  • Non-sensitive config details: active remotes (in CCS context), active indices (in new MT context)
  • JVM information: Runtime.freeMemory(), Runtime.maxMemory(), Runtime.version()
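
As a rough illustration of the JVM item above, the three values can be read from the standard `java.lang.Runtime` API. This is only a sketch: the class name `JvmInfo` and the map keys are hypothetical and not taken from the registry API code. Note that `freeMemory()` and `maxMemory()` are instance methods reached through `Runtime.getRuntime()`, while `Runtime.version()` is static (Java 9+).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: gathers the JVM values listed in the issue.
final class JvmInfo {
    static Map<String, Object> snapshot() {
        Runtime rt = Runtime.getRuntime();
        Map<String, Object> info = new LinkedHashMap<>();
        // Free memory in the JVM, in bytes.
        info.put("freeMemory", rt.freeMemory());
        // Maximum memory the JVM will attempt to use, in bytes.
        info.put("maxMemory", rt.maxMemory());
        // Version of the running Java runtime (static method, Java 9+).
        info.put("version", Runtime.version().toString());
        return info;
    }

    public static void main(String[] args) {
        System.out.println(snapshot());
    }
}
```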

A failure in any of the checks above, or a value past a given threshold, will result in a non-200 return code.

The determined state of health will be:

  • returned in a JSON response payload. The listener group healthcheck will disregard the payload, but it will give operators key information when they invoke the endpoint directly
  • recorded in the service log file (i.e. stdout/stderr which is recorded in CloudWatch logs)
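
A minimal sketch of how the determined state could map to a return code and a JSON payload, and be written to stdout. Everything here is an assumption for illustration: the class `HealthReport`, its field names, the free-memory threshold, and in particular the 503 code, since this issue leaves the off-nominal response code(s) as TBD.

```java
import java.util.Locale;

// Hypothetical value object; not the actual registry API implementation.
final class HealthReport {
    // Assumption: the off-nominal code is TBD in the issue; 503 is a placeholder.
    static final int ASSUMED_UNHEALTHY_STATUS = 503;

    final boolean openSearchReachable;
    final long freeMemoryBytes;
    final long minFreeMemoryBytes; // hypothetical threshold chosen by the operator

    HealthReport(boolean openSearchReachable, long freeMemoryBytes, long minFreeMemoryBytes) {
        this.openSearchReachable = openSearchReachable;
        this.freeMemoryBytes = freeMemoryBytes;
        this.minFreeMemoryBytes = minFreeMemoryBytes;
    }

    // 200 only when every check passes; otherwise the assumed non-200 code.
    int httpStatus() {
        boolean healthy = openSearchReachable && freeMemoryBytes >= minFreeMemoryBytes;
        return healthy ? 200 : ASSUMED_UNHEALTHY_STATUS;
    }

    // JSON body for operators; a listener healthcheck would only read the status code.
    String toJson() {
        return String.format(Locale.ROOT,
                "{\"status\":%d,\"openSearchReachable\":%b,\"freeMemoryBytes\":%d}",
                httpStatus(), openSearchReachable, freeMemoryBytes);
    }

    public static void main(String[] args) {
        HealthReport report = new HealthReport(true, Runtime.getRuntime().freeMemory(), 1L);
        // Writing to stdout is what would end up in the service log.
        System.out.println(report.toJson());
    }
}
```

Keeping the status decision in one method means the payload and the return code cannot disagree.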

Once the endpoint is available, we will need to update the Terraform scripts to include it in the listener healthcheck definition.

Acceptance Criteria

Given a nominal running Registry API and OpenSearch registry
When I perform a query of the healthcheck/ endpoint
Then I expect to receive a 200 response and metadata indicating a successful running application

Given a running Registry API and OpenSearch registry, with an off-nominal state for the Registry API
When I perform a query of the healthcheck/ endpoint
Then I expect to receive TBD response code(s) and applicable metadata

Given a running Registry API and Registry (OpenSearch), with an off-nominal state for the Registry (OpenSearch)
When I perform a query of the healthcheck/ endpoint
Then I expect to receive TBD response code(s) and applicable metadata

Sub-tasks

@jimmie jimmie self-assigned this May 16, 2023
@jimmie
Member Author

jimmie commented May 16, 2023

Derived from registry-api #297

@jordanpadams
Member

@jimmie can we also be sure to include a check that non-zero results are returned? Basically, just making sure there is actually data in the registry too, not just that everything is running.

@jimmie
Member Author

jimmie commented May 18, 2023

Sure, we can do that. Maybe get a count of documents and include that in the return payload? Note that a zero count will not constitute a healthcheck failure (i.e. there may be zero documents, but if everything else checks out OK a 200 will still be returned), since the intention is to tell ECS/Fargate whether the task needs to be recycled.
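
The rule described in this comment could look roughly like the sketch below: the document count is reported in the payload but never changes the return code. The class and method names are hypothetical, and the 503 value is an assumed placeholder because the off-nominal code is still TBD in this ticket.

```java
// Hypothetical sketch of the "count is informational only" rule.
final class DocumentCountPolicy {
    // Assumption: placeholder for the TBD off-nominal response code.
    static final int ASSUMED_UNHEALTHY_STATUS = 503;

    // documentCount is accepted but deliberately ignored when choosing the status,
    // because an empty registry does not mean the task should be recycled.
    static int statusFor(boolean serviceChecksPassed, long documentCount) {
        return serviceChecksPassed ? 200 : ASSUMED_UNHEALTHY_STATUS;
    }

    // The count still appears in the payload so operators can see it.
    static String payloadFor(boolean serviceChecksPassed, long documentCount) {
        return "{\"status\":" + statusFor(serviceChecksPassed, documentCount)
                + ",\"documentCount\":" + documentCount + "}";
    }

    public static void main(String[] args) {
        System.out.println(payloadFor(true, 0L));
    }
}
```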

@tloubrieu-jpl
Member

@jimmie what do you have in mind regarding the Spring Boot metrics?

@jimmie
Member Author

jimmie commented May 24, 2023

@tloubrieu-jpl I am not 100% sure. I did a quick (very quick) scan and saw hints that there may be some useful information, but it may prove very difficult to access. I was hoping for something like number of requests, response rates, etc. to include in the response payload, but maybe I'm being too optimistic.

@tloubrieu-jpl
Member

Thanks @jimmie. I saw that Spring Boot can provide a specific URL for those metrics; maybe we could expose it separately, perhaps with login/password protection. It is not critical and not part of this ticket anyway.

@jordanpadams jordanpadams changed the title As a PDS operator, I need a comprehensive indication of registry API service health As a PDS operator, I want to know the health of the registry API service Jun 20, 2023
@jordanpadams
Member

@jimmie can you help with the off-nominal response code(s) expected for the 2 failure acceptance criteria described in the original ticket above? Or if there are other acceptance criteria you think are worth noting (e.g. 404 vs 418 vs 501 vs ...)

@tloubrieu-jpl
Member

Waiting for @jimmie to create a PR (I pinged him on Slack today).

@jordanpadams
Member

Endpoint done. The tests still needed have been added to the icebox.


5 participants