Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Runtime Agnostic Healthcheck #1740

Merged
merged 7 commits into from
Nov 14, 2024
Merged

Conversation

schnie
Copy link
Member

@schnie schnie commented Nov 1, 2024

Description

This PR aims to provide a consistent local dev startup experience across container runtimes. Today, astro dev start will print the status update and attempt to open the browser to the airflow dashboard once its up and running. This is nice and it works for Docker Desktop. It does not work as expected for orbstack or podman. For these, the command line hangs and we end up opening the browser to a URL that isn't ready yet. This is a bad experience.

We're exploring the idea of using podman as our default runtime going forward and this is another small fix towards preparing for something like that. This fix will improve todays UX either way.

🎟 Issue(s)

Related to https://github.com/astronomer/astro/issues/24344

🧪 Functional Testing

Have tested this using astro dev start on:

  • Mac OS, Docker Desktop
  • Mac OS, Podman
  • Mac OS, Orbstack
  • Windows OS, Podman
  • Windows OS, Docker Desktop

📸 Screenshots

Add screenshots to illustrate the validity of these changes.

📋 Checklist

  • Rebased from the main (or release if patching) branch (before testing)
  • Ran make test before taking out of draft
  • Ran make lint before taking out of draft
  • Added/updated applicable tests
  • Tested against Astro-API (if necessary).
  • Tested against Houston-API and Astronomer (if necessary).
  • Communicated to/tagged owners of respective clients potentially impacted by these changes.
  • Updated any related documentation

@schnie schnie force-pushed the universal-webserver-healthcheck branch 3 times, most recently from 075590e to e737fd9 Compare November 1, 2024 02:04
@schnie schnie force-pushed the universal-webserver-healthcheck branch from e737fd9 to d088cd2 Compare November 1, 2024 02:12
}

// If we've successfully gotten a healthcheck response, print the status.
err = printStatus(settingsFile, envConns, project, d.composeService, airflowDockerVersion, noBrowser)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

printStatus() was previously called by the original checkWebserverHealth() function. That function has been removed and replaced and doesn't concern itself with printing the message. That now just happens at the "top-level" here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call.

} else {
fmt.Println("\nProject is running! All components are now available.")
}
fmt.Println("\nProject is running! All components are now available.")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The healthcheck now works for all runtimes so there's no need to differentiate here.

@@ -1433,7 +1397,7 @@ func printStatus(settingsFile string, envConns map[string]astrocore.EnvironmentO
fmt.Println("\nUnable to open the webserver URL, please visit the following link: " + webserverURL)
}
}
return errComposeProjectRunning
return nil
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we were returning an error for success, but we don't anymore.

Comment on lines -112 to -117
healthcheck:
test: curl --fail http://webserver:8080/health || exit 1
interval: 2s
retries: 15
start_period: 5s
timeout: 60s
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This original check built into the compose file does not work for orbstack or podman. We remove it here and it's been replaced by the new functionality.

@@ -19,7 +17,7 @@ type Suite struct {
origCmdExec func(cmd string, stdout, stderr io.Writer, args ...string) error
origGetDockerClient func() (client.APIClient, error)
origInitSettings func(id, settingsFile string, envConns map[string]astrocore.EnvironmentObjectConnection, version uint64, connections, variables, pools bool) error
origCheckWebserverHealth func(settingsFile string, envConns map[string]astrocore.EnvironmentObjectConnection, project *types.Project, composeService api.Service, airflowDockerVersion uint64, noBrowser bool, timeout time.Duration) error
origCheckWebserverHealth func(url string, timeout time.Duration) error
Copy link
Member Author

@schnie schnie Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just updating to the signature.

@schnie schnie force-pushed the universal-webserver-healthcheck branch 7 times, most recently from 25af892 to ca5e07d Compare November 1, 2024 03:48
@schnie schnie force-pushed the universal-webserver-healthcheck branch from ca5e07d to 47a5644 Compare November 1, 2024 03:51
@@ -82,7 +81,6 @@ const (
var (
errNoFile = errors.New("file specified does not exist")
errSettingsPath = "error looking for settings.yaml"
errComposeProjectRunning = errors.New("project is up and running")
Copy link
Member Author

@schnie schnie Nov 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was an error that we used to indicate success for some reason. This isn't used anymore as we just return nil if things work (no error).

Comment on lines -1464 to -1494
// CheckM1Image checks if the CLI is currently running on M1 architecture
// next it checks if the runtime version of the image building is above 6.0.4
// if the image is M1 architecture and runtime version is less than or equal to 6.0.4 it will print true
var CheckM1Image = func(imageLabels map[string]string) bool {
if !isM1(runtime.GOOS, runtime.GOARCH) {
// the architecture is not arm64 no need to print message
return false
}
runtimeVersion, ok := imageLabels[runtimeVersionLabelName]
if !ok {
// cannot determine runtime version print message by default
return true
}

return versions.LessThanOrEqualTo(runtimeVersion, M1ImageRuntimeVersion)
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is purely used to decide if the astro dev start --wait flag defaults to 1m or 5m. Thinking we just simplify and use 5m. The CLI progresses as soon as it sees the 200 from the webserver /health endpoint.

@schnie schnie force-pushed the universal-webserver-healthcheck branch 3 times, most recently from bed5d87 to 4caa877 Compare November 1, 2024 19:19
Comment on lines -292 to -305
startupTimeout = waitTime
// check if user provided a waitTime
// default is 1 minute
if waitTime != 1*time.Minute {
startupTimeout = waitTime
} else if CheckM1Image(imageLabels) {
// user did not provide a waitTime
// if running darwin/M1 architecture
// we wait for a longer startup time
startupTimeout = 5 * time.Minute
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing a chunk of logic around this M1 architecture check and runtime comparison. The only reason we have all this is to bump the startup timeout from 1m to 5m. I've just made 5m the default across the board.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we feel strongly about not changing the default across the board, we could revert this piece. It just feels like unnecessary cruft at this point, and this feels like a fine time to remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cutting out these legacy "M1" bits, it is definitely time to remove it.

I believe we added the 5 minute timeout because when Apple Silicon first came out we did not have arm64 support in the Astro Runtime image and so the macOS emulation was really slow. We have long since added that arm64 support, so I'm wondering if the default timeout should be 1 minute?

Copy link
Member Author

@schnie schnie Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I'm fine with making it 1m across the board now. @neel-astro that sound ok to you?

@@ -52,7 +51,6 @@ const (
defaultAirflowVersion = uint64(0x2) //nolint:gomnd
triggererAllowedRuntimeVersion = "4.0.0"
triggererAllowedAirflowVersion = "2.2.0"
M1ImageRuntimeVersion = "6.0.4"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 6.x version of runtime has been deprecated as of March 2024 as stated in our docs.

Screenshot 2024-11-01 at 3 43 13 PM

Comment on lines 1352 to 1405
var checkWebserverHealth = func(settingsFile string, envConns map[string]astrocore.EnvironmentObjectConnection, project *types.Project, composeService api.Service, airflowDockerVersion uint64, noBrowser bool, timeout time.Duration) error {
if config.CFG.DockerCommand.GetString() == podman {
err := printStatus(settingsFile, envConns, project, composeService, airflowDockerVersion, noBrowser)
if err != nil {
if !errors.Is(err, errComposeProjectRunning) {
return err
}
}
} else {
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
// check if webserver is healthy for user
err := composeService.Events(ctx, project.Name, api.EventsOptions{
Services: []string{WebserverDockerContainerName}, Consumer: func(event api.Event) error {
marshal, err := json.Marshal(map[string]interface{}{
"action": event.Status,
})
if err != nil {
return err
}
if string(marshal) == `{"action":"health_status: healthy"}` {
err := printStatus(settingsFile, envConns, project, composeService, airflowDockerVersion, noBrowser)
if err != nil {
return err
}
}

return nil
},
})
if err != nil {
if errors.Is(err, context.DeadlineExceeded) {
fmt.Printf("\n")
return fmt.Errorf("there might be a problem with your project starting up. The webserver health check timed out after %s but your project will continue trying to start. Run 'astro dev logs --webserver | --scheduler' for details.\n\nTry again or use the --wait flag to increase the time out", timeout)
}
if !errors.Is(err, errComposeProjectRunning) {
return err
}
}
}
return nil
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is being removed in favor of the new one declared in its own file below - health_check.go.

@schnie schnie changed the title Adds Runtime Agnostic Webserver Healthcheck Adds Runtime Agnostic Healthcheck Nov 1, 2024
@schnie schnie marked this pull request as ready for review November 1, 2024 19:53
@schnie schnie requested a review from neel-astro as a code owner November 1, 2024 19:53
@schnie schnie force-pushed the universal-webserver-healthcheck branch from 4caa877 to 0f96cb3 Compare November 1, 2024 19:55
@schnie schnie force-pushed the universal-webserver-healthcheck branch from 0f96cb3 to d5c9fcf Compare November 1, 2024 19:56
@schnie schnie requested a review from pritt20 November 4, 2024 15:10
@@ -407,7 +407,7 @@ func (s *AirflowSuite) TestAirflowStart() {

mockContainerHandler := new(mocks.ContainerHandler)
containerHandlerInit = func(airflowHome, envFile, dockerfile, imageName string) (airflow.ContainerHandler, error) {
mockContainerHandler.On("Start", "", "airflow_settings.yaml", "", "", false, false, 1*time.Minute, map[string]astrocore.EnvironmentObjectConnection(nil)).Return(nil).Once()
mockContainerHandler.On("Start", "", "airflow_settings.yaml", "", "", false, false, 5*time.Minute, map[string]astrocore.EnvironmentObjectConnection(nil)).Return(nil).Once()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally unit tests should use the constants.

Copy link
Contributor

@jeremybeard jeremybeard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is excellent, it simplifies the code and gains more functionality as a result! Appreciate the proactive PR comments too, makes it all very clear.

Approving ahead of a couple of small comments to look into.

Copy link
Contributor

@neel-astro neel-astro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a step in the right direction, but I am slightly concerned about potentially breaking the behavior for folks using docker-compose override or custom docker-compose in their local setup (applies to both docker and podman users, and I have seen a decent amount of examples where we suggest customers use it to customize any change), because in those cases we can't for sure say that the port mapping would be 8080 or even the same as that of the webserver.port in config (because changing port number is quite a common occurrence).
We need to think through a way to keep supporting those. Maybe adjust our error messaging to ask the user to set the webserver port in the config even if they are using custom docker-compose or override.

airflow/docker.go Outdated Show resolved Hide resolved
// This fires on every tick of our timer to run the healthcheck.
// We return successfully from this function when we get a 200 status code.
case <-ticker.C:
statusCode, _ := healthCheck(ctx, client, url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: either we could log the error or drop the error from the return arguments

airflow/health_check_test.go Outdated Show resolved Hide resolved
cmd/airflow.go Outdated Show resolved Hide resolved
@schnie
Copy link
Member Author

schnie commented Nov 5, 2024

@neel-astro I've updated the healthcheck URL to get the port from the config now. I also attempted to override compose with:

webserver:
  ports:
  - 9090:9090

This results in both our standard 8080 and 9090 being bound, so the override file can't wipe out our existing port and it'll always be the first in the list of ports (or the overridden port from the config).

Also, the healthcheck that was defined previously via the compose file had 8080 hard-coded, so this should actually result in an improvement here now that we grab it from the config.

@schnie schnie force-pushed the universal-webserver-healthcheck branch from feedeb0 to 7352506 Compare November 5, 2024 18:32
Copy link
Contributor

@neel-astro neel-astro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this change would still be slightly disruptive for anyone using astro dev start --compose-file <custom-compose-file> with a different webserver port, but there is a workaround for them by setting astro config set webserver.port <webserver-port>. But since this solves the problem for a bunch of other broken use cases, I am on board with the change.

@schnie
Copy link
Member Author

schnie commented Nov 6, 2024

@neel-astro do you have an example of what that compose file would look like to override the websever port in a problematic way? As far as I know, any ports defined in a custom compose file are added to the standard definition we provide as the base, so 8080 (or whatever the user configures) will always be bound.

Or are you referring to overriding the airflow setting to change the port that airflow webserver listens on?

@neel-astro
Copy link
Contributor

@neel-astro do you have an example of what that compose file would look like to override the webserver port in a problematic way? As far as I know, any ports defined in a custom compose file are added to the standard definition we provide as the base, so 8080 (or whatever the user configures) will always be bound.

@schnie there is a functionality within astro dev commands to export the current docker-compose file locally, edit it, and use that local file to spin components up. This is a bit advanced use case, and it is also a backdoor for us to unblock users running into various non-common issues due to how they are running local setups.
astro dev object export --compose --compose-file <compose-file> to export the compose file
astro dev start --compose-file <compose-file> to use the local compose file to spin up components.

So theoretically, users could export the compose file, edit it to replace the existing 8080 webserver port, and spin up the local setup.

@schnie
Copy link
Member Author

schnie commented Nov 6, 2024

Hmm, yea I was able to recreate that here locally. Does seem like a bit of an edge case.

I did have some code I was playing with that actually grabbed the port from the webserver service in the compose project. ports is a list, so if someone is really messing with things at this level they could also have multiple ports listed, in which case the port to healthcheck on becomes ambiguous.. unless we scanned all of them, but that doesn't feel quite right either.

Do you know of any use-case where someone out there would be doing this kind of change? If we know of someone actually doing this on the webserver maybe we think about it some more. Otherwise, it does feel like a lot for someone to go through for something like this.

@neel-astro
Copy link
Contributor

Yeah, I think it's fine because in such cases the new way to handle the case would be for them to set the webserver.port in CLI config.
The other way around might be for CLI to fetch the port from the container named "webserver", i.e. parse and get it from the docker port webserver output.
I have seen use cases for custom compose file used by the users (mostly around volume mounting), but I am not sure if they would be using that with a custom webserver port or not.

@schnie schnie merged commit 0d3145f into main Nov 14, 2024
2 of 3 checks passed
@schnie schnie deleted the universal-webserver-healthcheck branch November 14, 2024 16:21
@lzdanski
Copy link
Contributor

@yanmastin-astro Add to docs release notes?

"Additional improvements: Streamlined the behavior of astro dev start for Podman users."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants