Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#175014199] Health checks #90

Merged
merged 18 commits into from
Oct 6, 2020
Merged

[#175014199] Health checks #90

merged 18 commits into from
Oct 6, 2020

Conversation

balanza
Copy link
Contributor

@balanza balanza commented Sep 30, 2020

Introducing health checks for the application.

  • a utils/healthcheck module
  • check for wrong configuration and unreachable resources (cosmosdb, storage, urls)
  • add healthcheck on /info endpoint
  • add /info endpoint to openapi spec

@balanza balanza requested a review from gunzip September 30, 2020 15:53
@pagopa-github-bot
Copy link
Contributor

pagopa-github-bot commented Sep 30, 2020

Affected stories

  • ⚙️ #175014199: Ideare un meccanismo di health check strutturato per le functions

Generated by 🚫 dangerJS

@codecov-commenter
Copy link

codecov-commenter commented Sep 30, 2020

Codecov Report

Merging #90 into master will decrease coverage by 1.17%.
The diff coverage is 54.83%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #90      +/-   ##
==========================================
- Coverage   83.94%   82.77%   -1.18%     
==========================================
  Files          44       47       +3     
  Lines        1489     1556      +67     
  Branches      124      127       +3     
==========================================
+ Hits         1250     1288      +38     
- Misses        234      263      +29     
  Partials        5        5              
Impacted Files Coverage Δ
utils/healthcheck.ts 41.02% <41.02%> (ø)
utils/config.ts 75.00% <75.00%> (ø)
Info/handler.ts 83.33% <85.71%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3918cfd...ba5b0ed. Read the comment docs.

.fold<IResponseSuccessJson<IInfo> | IResponseErrorInternal>(
problems => ResponseErrorInternal(problems.join("\n")),
_ =>
ResponseSuccessJson({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be useful to return all the statuses even in case of success (to build a status dashboard of all services). we should think about some common vocabulary for status checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return all the statuses even in case of success

What d'you mean? We are checking only the current application status - at most, if it can reach an external resources. So I think we just have a single pass/fail status

common vocabulary for status checks.

I agree indeed.

something like this: https://tools.ietf.org/html/draft-inadarei-api-health-check-04

I like this approach, but I notice that the above is a draft. Meanwhile, we are using a problem+json which is already a standard: https://tools.ietf.org/html/rfc7807.

As for now, we need:

  • a machine-readable encoding of the service pass/fail status
    |> we use http status code for that
  • a human-readable report of failures, for throubleshooting
    |> we user detail section of the ProblemJson schema for that

We can model further and define a proper encoding for failure reports, to make them machine-readable. However, I see no use for that so far.

utils/config.ts Outdated
@@ -0,0 +1,97 @@
/**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you point this PR versus the other one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both this PR and #89 have just one file in common, but very different impact. I'd rather handle them separately and take myself the burden of keeping the lone file in sync.

utils/config.ts Outdated Show resolved Hide resolved
Comment on lines 29 to 76
/**
* Check the application can connect to an Azure CosmosDb instances
*
* @param dbUri uri of the database
* @param dbUri connection string for the storage
*
* @returns either true or an array of error messages
*/
export const checkAzureCosmosDbHealth = (
dbUri: string,
dbKey?: string
): HealthCheck<true> =>
tryCatch(() => {
const client = new CosmosClient({
endpoint: dbUri,
key: dbKey
});
return client.getDatabaseAccount();
}, toHealthProblems).map(_ => true);

/**
* Check the application can connect to an Azure Storage
*
* @param connStr connection string for the storage
*
* @returns either true or an array of error messages
*/
export const checkAzureStorageHealth = (connStr: string): HealthCheck =>
tryCatch(
() =>
new Promise<azurestorageCommon.models.ServiceStats>((resolve, reject) =>
createBlobService(connStr).getServiceStats((err, result) =>
err ? reject(err) : resolve(result)
)
),
toHealthProblems
).map(_ => true);

/**
* Check a url is reachable
*
* @param url url to connect with
*
* @returns either true or an array of error messages
*/
export const checkUrlHealth = (_: string): HealthCheck =>
// TODO: implement this check
taskEither.of(true);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions may be moved into io-functions-commons (or even io-ts-commons)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

functions-commons is ok (ts-commons is used by the app as well)

Copy link
Contributor

@AleDore AleDore left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, apart form minor discussion

utils/config.ts Outdated
Comment on lines 94 to 132
COSMOSDB_KEY: NonEmptyString,
COSMOSDB_NAME: NonEmptyString,
COSMOSDB_URI: NonEmptyString,

SERVICE_PRINCIPAL_CLIENT_ID: NonEmptyString,
SERVICE_PRINCIPAL_SECRET: NonEmptyString,
SERVICE_PRINCIPAL_TENANT_ID: NonEmptyString,

AZURE_APIM: NonEmptyString,
AZURE_APIM_HOST: NonEmptyString,
AZURE_APIM_RESOURCE_GROUP: NonEmptyString,
AZURE_SUBSCRIPTION_ID: NonEmptyString,

ADB2C_CLIENT_ID: NonEmptyString,
ADB2C_CLIENT_KEY: NonEmptyString,
ADB2C_TENANT_ID: NonEmptyString,

UserDataBackupStorageConnection: NonEmptyString,

MESSAGE_CONTAINER_NAME: NonEmptyString,
USER_DATA_BACKUP_CONTAINER_NAME: NonEmptyString,
USER_DATA_CONTAINER_NAME: NonEmptyString,

StorageConnection: NonEmptyString,
SubscriptionFeedStorageConnection: NonEmptyString,
UserDataArchiveStorageConnection: NonEmptyString,

PUBLIC_API_KEY: NonEmptyString,
PUBLIC_API_URL: NonEmptyString,

PUBLIC_DOWNLOAD_BASE_URL: NonEmptyString,

SESSION_API_KEY: NonEmptyString,
SESSION_API_URL: NonEmptyString,

LOGOS_URL: NonEmptyString,

SUBSCRIPTIONS_FEED_TABLE: NonEmptyString,
USER_DATA_DELETE_DELAY_DAYS: NonEmptyString,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to remember that in case of a global env variable creation or deletion or renaming we have to update this

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem in creation - you'll be forced to add it to the model. When deleting it is tricky, and we'd must rely on our discipline imho

*
* @returns either true or an array of error messages
*/
export const checkAzureStorageHealth = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename this method checkAzureBlobStorageHealth? Because only Blob Storage connection check is executed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This checks the storage resource itself and if the app can correctly connect with it. However I actually found no better solution than try to connect to a blob storage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave the scope of the test to be "the whole storage resource" (thus, I'd leave the name) and I'd rather improve the implementation. Suggestions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could extend the check on lower resources es. for Blob check that the containers exist (api), for table storage that all required tables exist and so on. checkAzureStorageHealth should regroup all these storage health checks. Attention on autogenerated resources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check that the containers exist

That is a deeper level of check which will introduce additional considerations, for example many containers are lazily created.

Attention on autogenerated resources.

What do you mean?

Copy link
Contributor

@BurnedMarshal BurnedMarshal Oct 5, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean?

The same as you for "lazily created containers". They don't need to be checked. Must be checked only required resource for application start and run. We can improve this logic later if hard to implements.

Comment on lines 37 to 49
it("should decode configuration for sendgrid", () => {
const rawConf = {
MAIL_FROM: aMailFrom,
NODE_ENV: "production",
SENDGRID_API_KEY: "a-sg-key"
};
const result = MailerConfig.decode(rawConf);

expectRight(result, value => {
expect(value.SENDGRID_API_KEY).toBe("a-sg-key");
expect(typeof value.MAILUP_USERNAME).toBe("undefined");
});
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a duplicate test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do yo mean the whole case or just expect(typeof value.MAILUP_USERNAME).toBe("undefined");?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole test is exactly the same as the previous one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, didn't notice that. Good catch!

Comment on lines 78 to 80
// check types
const _: NonEmptyString = value.MAILUP_SECRET;
const __: NonEmptyString = value.MAILUP_USERNAME;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually a canary to catch possible problems when/if we'll edit the config model. I found that several times the programmatic check doesn't ensure the correct type inference (hence, you have green tests but the project will fail to build). Not exhaustive, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't understand. At runtime we have IConfig.decode({ ...process.env, isProduction: process.env.NODE_ENV === "production" }); to check the correct type of required values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you need a type check inside the tests is better to create an explicit type

const MailUpConfig = t.interface({
    MAILHOG_HOSTNAME: t.undefined,
    MAILUP_SECRET: NonEmptyString,
    MAILUP_USERNAME: NonEmptyString,
    MAIL_TRANSPORTS: t.undefined,
    NODE_ENV: t.literal("production"),
    SENDGRID_API_KEY: t.undefined
})

And check the response. We can use these types even inside the logic with MailUpConfig.is(config)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the check is all about "let me se if the interface I designed actually results in the correct type once I exit the io-ts world".
In the case above, it happened that MAILUP_USERNAME could not be correctly narrowed to NonEmptyString but to string | NonEmptyString, although the interface was programmatically good.
Maybe I could do NonEmptyString.is(value.MAILUP_USERNAME)

return () =>
healthCheck
.fold<IResponseSuccessJson<IInfo> | IResponseErrorInternal>(
problems => ResponseErrorInternal(problems.join("\n\n")),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure than any critical information come from healthCheck is filtered before sending it through the public API Info?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. For the case I tested I saw no problems (resource unavailable, wrong credentials). Still it's important to point out, thanks.

Do you see any case which can be problematic? Please also consider that, for what concerns to functions, the endpoint is unauthenticated but not public, it's reachable only from our private network.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could anyway return a generic Error message of what type of error occurs problems.map(_ => _.__source); and logs the complete error for detail. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm as soon as it'll be easy to troubleshoot. @gunzip thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 8 to 20
const expectRight = <L, R>(e: Either<L, R>, t: (r: R) => void = noop) =>
e.fold(
_ =>
fail(`Expecting right, received left. Value: ${JSON.stringify(e.value)}`),
_ => t(_)
);

const expectLeft = <L, R>(e: Either<L, R>, t: (l: L) => void = noop) =>
e.fold(
_ => t(_),
_ =>
fail(`Expecting left, received right. Value: ${JSON.stringify(e.value)}`)
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice approach. Can we move these methods to a test utility library to share this logic across all the projects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to write a jest extension module with custom matchers for io-ts/fp-ts. Maybe one day ;)
There are still quirks with the above functions (failures aren't reported very well)


describe("InfoHandler", () => {
it("should return an internal error if the application is not healthy", async () => {
const healthCheck: HealthCheck = fromLeft(["failure 1", "failure 2"]);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a type error in here, the test fails.

@balanza balanza marked this pull request as ready for review October 5, 2020 15:49
@gunzip
Copy link
Contributor

gunzip commented Oct 6, 2020

once rebased, is this ready to be merged?

@balanza
Copy link
Contributor Author

balanza commented Oct 6, 2020

once rebased, is this ready to be merged?

yes

@gunzip gunzip merged commit 57ed674 into master Oct 6, 2020
@gunzip gunzip deleted the 175014199-healthcheck branch October 6, 2020 10:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants