[Draft] Observability and Reliability Batcher #1451

Closed
wants to merge 35 commits into from

@LPetro LPetro commented Jul 20, 2024

Fixes #N/A

Description
This is a draft PR to start getting code and design feedback on the 'ORB' batch logger.

It extracts the more dynamic inputs to provisioning scheduling and logs their salient fields to a mounted PersistentVolume (PV). The logged data is meant to help debug provisioning and disruption/consolidation.

Milestones completed:

  • Scheduling inputs logged from provisioning scheduler.
  • Scheduling action metadata (from provisioning and disruption) sent to batcher and logged.
  • Scheduling inputs reduced to subset of fields for more concise logging.
  • Heaps built for batching via reconcile loops of the controller.
  • Diff functions built to compare these reduced fields for more efficient logging.
  • Logs serialized to protobuf.
  • Reconstruction functions to deserialize from protobuf.

Intended functionality not yet included:

  • Generalized setup of the PV, including mounting and access to the mount location (currently hard-coded to my dev S3 bucket and the default '/data' mount location), plus a log retention policy.
  • Feature flagging the logging.
  • Command-line tool to pull logs back from PV and present them to the user.
  • Reconstruct scheduling input from logged baseline+diffs based on a time input to tool.
  • (Stretch Goal) Use reconstructed inputs to re-simulate a scheduling decision, to say "these inputs yield this set of nodeclaims" for verification/debugging.

How was this change tested?
Tests are in progress and not yet included. The PR does include a suite_test.go file, but it is only a blank template for now.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

linux-foundation-easycla bot commented Jul 20, 2024

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LPetro
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. label Jul 20, 2024
@k8s-ci-robot
Contributor

Welcome @LPetro!

It looks like this is your first PR to kubernetes-sigs/karpenter 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/karpenter has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @LPetro. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 20, 2024
@LPetro LPetro marked this pull request as draft July 20, 2024 04:29
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 20, 2024
Comment on lines 64 to 66
SIHeap := orb.NewSchedulingInputHeap()
SMHeap := orb.NewSchedulingMetadataHeap()
p := provisioning.NewProvisioner(kubeClient, recorder, cloudProvider, cluster, SIHeap, SMHeap)
Contributor

Wdyt about using a generic orb.Inject()

Contributor

Or, I think what happens here is that orb is instantiated; we can then call Inject() for a NewProvisioner and pass a new orb controller to []controller.Controller{. See the metricsnode controller with its cluster param.

}

func NewProvisioner(kubeClient client.Client, recorder events.Recorder,
cloudProvider cloudprovider.CloudProvider, cluster *state.Cluster,
schedulingInputHeap *orb.SchedulingInputHeap, schedulingMetadataHeap *orb.SchedulingMetadataHeap,
Contributor

Why not have orb added like cloudProvider?
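
One way to read this suggestion is to wrap both queues in a single object and hand that to the constructor as one dependency, the way cloudProvider is passed. The sketch below only illustrates that shape; recorder, newRecorder, and sketchProvisioner are hypothetical names, and the string slices stand in for the real heaps.

```go
package main

// recorder is a hypothetical facade that owns both ORB queues, so callers
// depend on one object instead of two separate heaps.
type recorder struct {
	inputs   []string // stand-in for the scheduling input heap
	metadata []string // stand-in for the scheduling metadata heap
}

func newRecorder() *recorder { return &recorder{} }

// logInput queues one scheduling input for later batching.
func (r *recorder) logInput(s string) { r.inputs = append(r.inputs, s) }

// logMetadata queues one scheduling action record for later batching.
func (r *recorder) logMetadata(s string) { r.metadata = append(r.metadata, s) }

// pending reports how many inputs and metadata records are queued.
func (r *recorder) pending() (int, int) { return len(r.inputs), len(r.metadata) }

// sketchProvisioner is a stand-in for the real provisioner; it takes the
// recorder as a single constructor argument.
type sketchProvisioner struct {
	orb *recorder
}

func newSketchProvisioner(orb *recorder) *sketchProvisioner {
	return &sketchProvisioner{orb: orb}
}

// schedule records each pod as an input and one metadata entry per call.
func (p *sketchProvisioner) schedule(pods []string) {
	for _, pod := range pods {
		p.orb.logInput(pod)
	}
	p.orb.logMetadata("provisioning")
}
```

With this shape, the same recorder instance could also be handed to a separate controller that drains the queues, so the constructor signature grows by one parameter rather than two.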

)

// These are the inputs to the scheduling function (scheduler.NewSchedule) which change more dynamically
type SchedulingInput struct {
Contributor

Is this just Input?

//"google.golang.org/protobuf/proto"
)

type SchedulingMetadata struct {
Contributor

Is this just metadata?

Comment on lines 133 to 247
func reducePodConditions(conditions []v1.PodCondition) []v1.PodCondition {
	reducedConditions := []v1.PodCondition{}
	for _, condition := range conditions {
		reducedCondition := v1.PodCondition{
			Type:    condition.Type,
			Status:  condition.Status,
			Reason:  condition.Reason,
			Message: condition.Message,
		}
		reducedConditions = append(reducedConditions, reducedCondition)
	}
	return reducedConditions
}
Contributor

@rschalo rschalo Jul 23, 2024

There isn't any filtering here, could we just use pod.Status.Conditions on schedulinginputs.go:125?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 30, 2024
… tracking for easier reconstruction later. Simplified some diff logic.
…ends to file for inspection. Fixed protobuf version I was using.
…e representation, but I think they're required for reconstruction later. Added an empty test file based on Provisioning which I'll start populating soon.
…I definition, to ToStrings, to proto, to the proto-reconstruct functions and overhauled diff functions to have better wrappers for simplicity and robustness.
…nd associated changes. Updated SchedulingInput String function to just take the String representation of the protobuf. Iteration code clean-up.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 7, 2024
@k8s-ci-robot
Contributor

PR needs rebase.


@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 13, 2024
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 27, 2024
@github-actions github-actions bot closed this Sep 10, 2024
Labels
cncf-cla: no (Indicates the PR's author has not signed the CNCF CLA.)
do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
lifecycle/closed
lifecycle/stale (Denotes an issue or PR has remained open with no activity and has become stale.)
needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/XXL (Denotes a PR that changes 1000+ lines, ignoring generated files.)
3 participants