Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow specifying a kubeconfig per task in the internal API #746

Merged
merged 8 commits into from
Sep 25, 2020

Conversation

DazWorrall
Copy link
Member

What are you trying to accomplish with this PR?

Being primarily a cli tool, Krane currently relies on some global state in the environment - notable for this issue is the kubeconfig path. Deploying our main monolith at Shopify is a bit of a pathological case - we use Krane as a library, and deploy to multiple targets simultaneously using the internal API and threads.

Occasionally we are seeing some errors I believe are ultimately thread safety issues - e.g. apparent corruption of the kubeconfig as contexts are 'missing' (when we check the files, they are not), or base64 certificate data is invalid and un-decodeable. I have not been able to determine the precise cause here (e.g. fd leaks) as reproducing is very difficult, but would like to at least work around the symptom we're seeing by giving each deploy task its own kubeconfig instead of sharing one everywhere. This PR alters the internal API to allow this, while not changing the behaviour for cli users, and in doing so remove a little of this global state.

How is this accomplished?

Having the various Task classes take kubeconfig as a parameter, and ensuring that is passed through to Kubectl and Kubeclient instances. If kubeconfig is nil, the value of KUBECONFIG in the environment is used, preserving the current behaviour.

There is some prior art here. An explicit --kubeconfig option was added in #428 and then reverted in #468 as the cli flag did not support parsing multiple files. This approach overrides the KUBECONFIG envvar in the kubectl child process, which should avoid that issue.

I have removed a bunch of memoizing from the test suite as some tests currently rely on mutating global state, notably the multiple config file integration test. It's likely we can keep some of this optimisation by reworking these tests, but it could use someone with better knowledge of the test suite than me.

The various Task classes share a common set of options and setup, it might be worth refactoring these to share a base class and dedup some of the changes here, but I consider that out of scope for this PR.

What could go wrong?

As I noted above a change similar to this was attempted before and abandoned. I believe this approach of modifying the subprocess env avoids the previous issue, but maybe there's some other problem. I hope to test this branch by deploying our monolith to check for other issues, and that it actually resolves this issue we're seeing.

I have done some sanity 🎩 locally with minikube and fixtures.

Rakefile Outdated
end

desc("Run cli tests")
Rake::TestTask.new(:cli_test) do |t|
t.libs << "test"
t.libs << "lib"
t.test_files = FileList['test/exe/**/*_test.rb']
t.warning = false
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was really struggling to crawl through the test output with the warnings on. Happy to move this elsewhere though if there's more to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What warnings were you seeing? When I ran the tests, the only output I got was the test suite summaries, and a large log output from a failed assertion upon the log string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember exactly, it's possible it was just in CI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alrighty. I'll open a separate draft PR (targeting this one) with the original rakefile and I'll see if I can repro.

@jpfourny
Copy link
Contributor

jpfourny commented Sep 8, 2020

This looks reasonable to me, but folks like @timothysmith0609 and @kddeisz may have stronger opinions.

Copy link
Contributor

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally +1 on this PR, just a couple of architectural changes I'd like to see first.

Rakefile Outdated Show resolved Hide resolved
lib/krane/deploy_task.rb Outdated Show resolved Hide resolved
test/unit/krane/deploy_task_test.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really good. Only thing I think left is don't delegate directly to the ivar, instead delegate to the getter method.

Reason being is if we end up changing that thing to wrap the object in some way, it'll require further changes, whereas if it's just a method call it won't.

@ajshepley ajshepley self-assigned this Sep 21, 2020
@ajshepley ajshepley marked this pull request as ready for review September 21, 2020 23:07
@ajshepley ajshepley requested a review from a team as a code owner September 21, 2020 23:07
@ajshepley
Copy link
Contributor

Since deploying this last week, we haven't seen any of the corruption issues crop up in core.

Copy link
Contributor

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this is looking good. Last thing for me is don't delegate directly to the ivar. i.e., change:

delegate :kubeclient_builder, to: :@task_config

to

delegate :kubeclient_builder, to: :task_config

@ajshepley
Copy link
Contributor

Ah, I missed that and only thought about the task config internally. Whoops. Will fix, thanks.

Copy link
Contributor

@kddnewton kddnewton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@ajshepley
Copy link
Contributor

I accidentally overlooked some lint errors (the buildkite job spits out a looooot of stuff and the error is in the middle somewhere) but it should be good now :)

@ajshepley ajshepley merged commit 4239cde into master Sep 25, 2020
@ajshepley ajshepley deleted the internal-api-kubeconfig branch September 30, 2020 17:47
ajshepley added a commit that referenced this pull request Oct 5, 2020
Adds 2.0.1 section and info about #746
@timothysmith0609 timothysmith0609 temporarily deployed to rubygems October 6, 2020 17:34 Inactive
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants