
adding health checks #432

Merged
merged 35 commits into dotnet:master from areller:health
May 20, 2020

Conversation

@areller (Contributor) commented Apr 30, 2020

Adding health checks, as proposed in #19.

Changes

  1. Replicas can be cancelled independently of the service.
  2. There are two new configurations for a service: liveness and readiness. Each can hold these settings (a fuller service example follows this list):

     ```yaml
     http:
       path: /health
       port: 80            # default: pick the first binding of the service
       protocol: http      # default: none
       headers:
         - name: A
           value: B
     initialDelay: 5       # default: 0
     period: 1             # default: 1
     timeout: 1            # default: 1
     successThreshold: 1   # default: 1
     failureThreshold: 3   # default: 3
     ```

  3. There are two new replica states: Healthy and Ready (Ready = passed both liveness and readiness probing; Healthy = passed liveness but not readiness).
     If neither liveness nor readiness is provided, the replica is promoted to Ready upon creation. If only liveness is provided, the replica is promoted to Ready upon passing the liveness probe. If only readiness is provided, the replica is promoted to Healthy upon creation.
  4. There is a ReplicaMonitor that reads the liveness and readiness probe configuration, and kills the replica or changes its state according to probe results.
  5. Proxy and Ingress only forward traffic to Ready replicas.
  6. The generated Kubernetes manifest contains the liveness and readiness configuration.
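
For illustration, a minimal tye.yaml service using both probes might look like the sketch below. The application/service names, image, and paths are hypothetical; the replicas/bindings values echo the sample that appears later in this review:

```yaml
name: myapp                # hypothetical application name
services:
  - name: backend          # hypothetical service name
    image: myimage         # hypothetical image
    replicas: 3
    bindings:
      - port: 8004
    liveness:              # failing this probe kills/restarts the replica
      http:
        path: /health
    readiness:             # failing this probe removes the replica from proxy rotation
      http:
        path: /ready
```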

@@ -219,210 +219,215 @@ private async Task StartContainerAsync(Application application, Service service,

async Task RunDockerContainer(IEnumerable<(int ExternalPort, int Port, int? ContainerPort, string? Protocol)> ports)
{
    var hasPorts = ports.Any();
    while (!dockerInfo.StoppingTokenSource.IsCancellationRequested)
@davidfowl (Member):

Why is there a loop here?

@areller (Contributor, Author):

When I implement the monitor, it might decide to kill a container because it's unhealthy. In that case I'd want the DockerRunner to recreate the replica.
Another way to do it is to kill the container by calling docker restart; then there is no need for that loop. What do you think?

@areller (Contributor, Author):

The argument for the former is that it's consistent with how I'll do it in ProcessRunner.

@davidfowl (Member):

The docker runner already restarts the replica, albeit with the same name.

@areller (Contributor, Author) commented May 1, 2020:

@davidfowl I think the problem with doing docker restart is that the health monitor won't be able to tell that the container was restarted (because there won't be Stopped and Started events). Unless I explicitly send ReplicaState.Stopped and ReplicaState.Started events whenever I restart the container, but I don't like that approach.

@jkotalik (Contributor):

From what I can tell, with this loop you would call docker rm and other commands I don't think you want to run:

result = await ProcessUtil.RunAsync("docker", $"rm {containerId}", throwOnError: false, cancellationToken: timeoutCts.Token);

The restart loop for rerunning docker is here:

var logsRes = await ProcessUtil.RunAsync("docker", $"logs -f {containerId}", ...

Can you modify this loop instead?

@areller (Contributor, Author):

@jkotalik yeah, makes sense. I'll try doing it that way.

@areller (Contributor, Author):

@jkotalik changed the DockerRunner to do docker restart when StoppingTokenSource is canceled.
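
For readers following the thread, here is a rough sketch of the shape this lands on. ProcessUtil.RunAsync and dockerInfo.StoppingTokenSource appear in the diff above; the control flow and the serviceStoppingToken name are illustrative assumptions, not the exact merged code:

```csharp
// Illustrative sketch only; the merged code differs in details.
// serviceStoppingToken (hypothetical) cancels when the whole service stops;
// dockerInfo.StoppingTokenSource cancels when just this replica is killed.
while (!serviceStoppingToken.IsCancellationRequested)
{
    // Follow the container's logs; this returns when the container exits
    // or when the per-replica token is cancelled (e.g. by the health monitor).
    var logsRes = await ProcessUtil.RunAsync(
        "docker", $"logs -f {containerId}",
        throwOnError: false,
        cancellationToken: dockerInfo.StoppingTokenSource.Token);

    if (dockerInfo.StoppingTokenSource.IsCancellationRequested)
    {
        // The replica was killed independently of the service: restart the
        // same container in place so Stopped/Started events still reach the
        // ReplicaMonitor, then loop back and re-attach to the logs.
        // (In real code the per-replica token source would be reset here.)
        await ProcessUtil.RunAsync("docker", $"restart {containerId}",
            throwOnError: false);
    }
}
```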

@areller (Contributor, Author) commented May 7, 2020:

@davidfowl there are two major things left:

  1. Generating k8s YAMLs with liveness and readiness.
  2. I don't forward traffic to non-Ready replicas unless there is a single replica. Proxying with a single replica causes some tests that use external docker images (e.g. redis) to fail, and I need to check why.

Maybe it's best to start reviewing this PR now, and I can submit the rest later on this PR or in another one?

areller marked this pull request as ready for review May 7, 2020 03:50
replicas: 3
bindings:
  - port: 8004
liveness:
@davidfowl (Member):

These two probes are bigger than the entire yaml; that makes me 😢

@areller (Contributor, Author):

@davidfowl yeah... I thought it best to go with something flexible, like in k8s, but you can get away with just providing a path; the rest have default values.
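
For example, the shortest useful probe under this scheme would be just a path; every other setting falls back to the defaults listed in the PR description:

```yaml
liveness:
  http:
    path: /health
```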

@areller (Contributor, Author) commented May 12, 2020:

I'll probably add the k8s manifest generation tomorrow.

areller changed the title from "[WIP] adding health checks" to "adding health checks" on May 13, 2020
@jkotalik (Contributor):

@areller this change is amazing. Got my coffee ☕ and will be reviewing now.

@jkotalik (Contributor) left a review comment:

Great start. I think there are some small cleanup tasks for code readability in the replica state management, but overall this looks like the right idea.


if (builderHttp.Protocol != null)
{
    httpGet.Add("scheme", builderHttp.Protocol!.ToUpper());
@jkotalik (Contributor):

I think you don't need ! here; there are other similar cases in this file as well.

Suggested change:
- httpGet.Add("scheme", builderHttp.Protocol!.ToUpper());
+ httpGet.Add("scheme", builderHttp.Protocol.ToUpper());

@areller (Contributor, Author):

Yep. I removed it and it didn't produce any warnings, probably because it's inside a block that only runs if builderHttp.Protocol != null.
It's weird, because at times the compiler will emit a warning when the ! operator is missing, even though the expression is in a context where it's guaranteed not to be null...

@areller (Contributor, Author):

Found an example: if you go to ReplicaMonitor.cs:113 and remove the ! after serviceDesc.Readiness, the compiler will emit a warning, even though there is no way for that variable to be null at that point.
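
As a general illustration of the phenomenon being discussed (this is not the ReplicaMonitor code), C#'s nullable flow analysis does not carry a null check into lambdas, so a value that is provably non-null at the use site can still warn without !:

```csharp
#nullable enable
using System;

class Probe { public string? Path; }

class Example
{
    private readonly Probe _probe = new Probe { Path = "/health" };

    public void Start()
    {
        if (_probe.Path != null)
        {
            // Flow analysis tracks _probe.Path after the check: no warning.
            Console.WriteLine(_probe.Path.ToUpper());

            // Inside a lambda the analysis resets, since the field could in
            // principle change before the lambda runs, so the compiler warns
            // (CS8602) unless non-nullness is asserted with '!'.
            Action log = () => Console.WriteLine(_probe.Path!.ToUpper());
            log();
        }
    }
}
```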

else
{
    // If port is not given, we pull default port
    var binding = service.Bindings.First(b => builderHttp.Protocol is null || b.Protocol == builderHttp.Protocol);
@jkotalik (Contributor):

It does feel a bit awkward to have health checks per service rather than per endpoint, but it seems to be what k8s does. Or maybe we require a port?

@areller (Contributor, Author):

Requiring a port would indeed make it more compatible with k8s, but it would make the tye.yaml more verbose and less maintainable, in my opinion. One thing k8s does to help with maintainability is to allow specifying a named port, e.g.

port: liveness-port
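
For reference, this is the Kubernetes pattern being alluded to: the container port is given a name, and the probe refers to it by name rather than by number (the values here are illustrative):

```yaml
ports:
  - name: liveness-port
    containerPort: 8080
livenessProbe:
  httpGet:
    path: /health
    port: liveness-port
```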

@@ -50,7 +50,7 @@ static int GetNextPort()
binding.Port = GetNextPort();
}

-if (service.Description.Replicas == 1)
+if ((service.Description.Readiness is null || service.Description.RunInfo is IngressRunInfo) && service.Description.Replicas == 1)
@jkotalik (Contributor):

Why add the check for service.Description.RunInfo is IngressRunInfo?

@areller (Contributor, Author):

Because I forgot to remove it when I added service.Description.Readiness is null :)

private void StartLivenessProbe(Probe probe, ReplicaState moveToOnSuccess = ReplicaState.Healthy)
{
    // currently only HTTP is available
    if (probe.Http is null)
@jkotalik (Contributor):

Suggested change:
- if (probe.Http is null)
+ if (probe.Http == null)

@jkotalik (Contributor):

And for the rest of the file.

// currently only HTTP is available
if (probe.Http is null)
{
    _logger.LogWarning("cannot start probing replica {name} because probe configuration is not set", _replica.Name);
@jkotalik (Contributor):

Suggested change:
- _logger.LogWarning("cannot start probing replica {name} because probe configuration is not set", _replica.Name);
+ _logger.LogWarning("Cannot start probing replica {name} because probe configuration is not set", _replica.Name);

@jkotalik (Contributor):

And for the rest of the file.


private void MoveToHealthy(ReplicaState from)
{
    _logger.LogInformation("replica {name} is moving to an healthy state", _replica.Name);
@jkotalik (Contributor):

These logs should probably be at debug level.


public override void Start()
{
    Func<ReplicaBinding, bool> bindingClosure = (_httpProberSettings.Port.HasValue, _httpProberSettings.Protocol != null) switch
@jkotalik (Contributor):

This probably deserves a comment as well.
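
For context, the construct here is a tuple switch that picks a binding predicate based on which probe settings were supplied. A self-contained, commented sketch of that shape follows; the type and member names are stand-ins, and the exact predicates are guesses informed by the earlier service.Bindings.First(...) excerpt:

```csharp
#nullable enable
using System;

// Stand-ins for the PR's ReplicaBinding and prober settings types.
record Binding(int Port, string? Protocol);

class ProberSettings
{
    public int? Port;
    public string? Protocol;
}

class Example
{
    private readonly ProberSettings _settings = new() { Protocol = "http" };

    public Func<Binding, bool> MakeBindingFilter() =>
        // Tuple switch: choose the predicate from which settings are present.
        (_settings.Port.HasValue, _settings.Protocol != null) switch
        {
            (true, _)      => b => b.Port == _settings.Port,         // explicit port wins
            (false, true)  => b => b.Protocol == _settings.Protocol, // else match on protocol
            (false, false) => _ => true                              // else take the first binding
        };
}
```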

@jkotalik (Contributor) left a review comment:

I think this overall looks good. There is some follow-up I'd like us to do for polish, but it can be done in a separate PR, including:

  • Sample
  • Docs
  • Recipe

src/Microsoft.Tye.Core/CoreStrings.resx (outdated, resolved)
src/Microsoft.Tye.Core/KubernetesManifestGenerator.cs (outdated, resolved)
@@ -187,6 +201,165 @@ private static void HandleServiceVolumes(YamlSequenceNode yamlSequenceNode, List
}
}

private static void HandleServiceProbe(YamlMappingNode yamlMappingNode, ConfigProbe probe)
@jkotalik (Contributor):

nit: return a ConfigProbe instead of passing it in, so that you can set service.Liveness/service.Readiness directly.

@areller (Contributor, Author):

Yeah, that would probably make more sense, but it would break consistency with the rest of the methods. What do you think?

@@ -187,6 +201,165 @@ private static void HandleServiceVolumes(YamlSequenceNode yamlSequenceNode, List
}
}

private static void HandleServiceProbe(YamlMappingNode yamlMappingNode, ConfigProbe probe)
{
    foreach (var child in yamlMappingNode.Children)
@jkotalik (Contributor):

Can we add some parsing tests for these? Much faster for checking these kinds of conditions, like making sure initialDelay can't be negative.

@areller (Contributor, Author):

I'll try to come up with something. Do you want it as part of this PR, or broken out into a different one?

@jkotalik (Contributor):

I think following the structure here: https://github.com/dotnet/tye/blob/master/test/UnitTests/TyeDeserializationTests.cs#L25.

Mostly just parsing/validation of tye.yaml to make sure we cover these edge cases correctly.

No preference whether we add them in this PR now or follow up, I'd just like them in general 😄
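
A rough sketch of what one such test might look like, loosely following the shape of TyeDeserializationTests; the ParseConfig helper is a hypothetical stand-in for the repo's real YAML deserialization entry point, and the exception type is an assumption:

```csharp
using System;
using Xunit;

public class ProbeValidationTests
{
    [Fact]
    public void NegativeInitialDelay_IsRejected()
    {
        // Hypothetical tye.yaml input with an invalid probe setting.
        var input = @"
name: single-project
services:
  - name: test-project
    liveness:
      http:
        path: /health
      initialDelay: -5";

        // Assumes the parser surfaces invalid probe values as an exception;
        // the real validation hook and exception type may differ.
        Assert.ThrowsAny<Exception>(() => ParseConfig(input));
    }

    // Stand-in for the repo's actual deserialization entry point.
    private static void ParseConfig(string yaml) =>
        throw new NotImplementedException("replace with the real parser call");
}
```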

{
    public class ReplicaMonitor : IApplicationProcessor
    {
        private ILogger _logger;
@jkotalik (Contributor):

@davidfowl to review this on top of me.

areller and others added 2 commits May 18, 2020 13:57
Co-authored-by: Justin Kotalik <jukotali@microsoft.com>
Co-authored-by: Justin Kotalik <jukotali@microsoft.com>
@jkotalik (Contributor):

I'm going to go ahead and merge this; let's do some follow-up on top of it 😄.

Again, thanks @areller for all the work you have done here.

jkotalik merged commit f27905b into dotnet:master May 20, 2020
@areller (Contributor, Author) commented May 20, 2020:

@jkotalik thank you! I'll create a follow-up PR in the coming days.

areller deleted the health branch May 23, 2020 03:12