-
Notifications
You must be signed in to change notification settings - Fork 202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add proper handling for golang ssh #1136
Conversation
6ec4168
to
e361e89
Compare
I have no idea why, but when I run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I’d really prefer if we didn’t do any of these special cases. But, if we have to, let’s do that for
podman machine
only. - It would be nice for $someone to enumerate the cases we care about, to make sure that we didn’t forget anything, and then to see how that works (E.g. create a machine + use it; create a non-machine connection and use it; is
scp
special enough that we need to discuss it separately? Also see how any of those work when starting with an empty~/.ssh
.) - All the time here we are talking about
connection_golang.go
. Is there a reason why this does not need to be handled inconnection_native.go
? Is that perhaps a non-default mode that users only opt into? Or is that able to interactively prompt for confirmation on the TTY, even if we redirect stdout/stderr as we seem to do?
@mtrmac do you have any guess as to when the next release is? If this works in podman I think I can make some of these changes if we are planning to cut a release next week, but some of these might not get done before then. |
cf3097c
to
9a2200f
Compare
/hold |
0aaa2a1
to
55b1e51
Compare
You rever to the boolean in multiple different ways which I find confusing. Should the boolean be IsMachine? |
sure, I'll make all of the references the same |
pkg/ssh/connection_golang.go
Outdated
if os.IsNotExist(err) && insecureMachine { | ||
logrus.Warn("please create a known_hosts file and recreate the machine to add the public key entry") | ||
return nil | ||
} | ||
return err | ||
} | ||
hErr := known(host, remote, pubKey) | ||
if errors.As(hErr, &keyErr) && len(keyErr.Want) > 0 { | ||
logrus.Warnf("ssh host key mismatch for host %s, got key %s of type %s", host, ssh.FingerprintSHA256(pubKey), pubKey.Type()) | ||
return keyErr | ||
} else if errors.As(hErr, &keyErr) && len(keyErr.Want) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you write down, in words, the intended policy?
This might be correct but it seems surprising. In particular I didn’t expect that a missing known_hosts
, and an existing known_hosts
without an entry for the host in question, don’t cause the same behavior. Why does it warn only on a missing file? Why does it add an entry only on an existing file with a missing entry, and not with an empty file?
Is there, maybe, some expectation that in machine mode, we would never get in the len(keyErr.Want) == 0
situation? If so, this should loudly fail if we do get in that situation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, since we are adding the machine keys here too, this will fail for machine as well. I am adding comments to explain each condition
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICS if known_hosts
does not exist, we either warn and succeed, or don’t warn and fail, but we never get to the known(
call and never get the opportunity to add a new known_hosts entry.
To be a bit more explicit, the “the intended policy” was asking about what is the ~higher-level wanted behavior, maybe I’d phase it as asking what are the guarantees that users can rely on, and what are the compatibility compromises we have decided to use that weaken security or usability.
Explaining the code line by line might not be the same thing — i.e. please focus on why the code is using these conditions to decide, not just re-explain the details of the effects of the Go code in words. At least assuming I’ve been reading the code correctly (which might not be the case), I don’t think we are quite in disagreement in what the code does on any code path — but I find it difficult to figure out what is the overall effect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some intense documentation to this section and the code here is good to go, resolving this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The added line-by-line documentation answers not a single one of the specific questions from this thread:
- I didn’t expect that a missing known_hosts, and an existing known_hosts without an entry for the host in question, don’t cause the same behavior. Why does it warn only on a missing file? Why does it add an entry only on an existing file with a missing entry, and not with an empty file?
- what are the guarantees that users can rely on
- what are the compatibility compromises we have decided to use that weaken security or usability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will un-resolve this so we can talk about it. @mtrmac would you be opposed to getting this to a mergeable state but me filing an issue to keep track of the remaining design discussions we want to have?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You’re right, let’s collect the outstanding ideas somewhere. #1152 , feel free to edit / add entries.
logrus.Warnf("ssh host key mismatch for host %s, got key %s of type %s", host, ssh.FingerprintSHA256(pubKey), pubKey.Type()) | ||
return keyErr | ||
} else if errors.As(hErr, &keyErr) && len(keyErr.Want) == 0 { | ||
// write to known_hosts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean, this is an improvement over the status quo but I’ve been talking about the need to interactively get user’s approval first, for some time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok I will prompt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtrmac this will actually break the machine tests since there is no way to force through the prompt. suggestions?
and most podman-remote tests actually
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For machine, maybe that’s consistent with the isMachine
bool in general. But that rather depends on the specifics of the semantics of that field.
For “most podman-remote tests”, how does that fail? (It might very well be possible but just a lot of work.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would a simple command to a new "machine" in the e2e tests work in the testing suite if we need to prompt for y/n to continue?
I could do a timeout default yes here, where if the user starts a machine or executes podman-remote and doesn't answer y/n we default to yes after 15 seconds.
Other than that, this is a huge hindrance to regular usage and people who use this in applied scenarios... a very breaking change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how would a simple command to a new "machine" in the e2e tests work in the testing suite if we need to prompt for y/n to continue?
Whatever the current equivalent of expect(1)
is. I assume (without checking) that we have some tests of podman run -it
, that’s the same issue.
I could do a timeout default yes here, where if the user starts a machine or executes podman-remote and doesn't answer y/n we default to yes after 15 seconds.
No thank you. That’s not an explicit expression of intent.
Other than that, this is a huge hindrance to regular usage and people who use this in applied scenarios... a very breaking change.
How is prompting on podman connection add
, and never again, a huge hindrance?
I’ve been talking about the need to interactively prompt since the very start of the existence of this package, and bringing up the need to walk though the UX to ensure it is practical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree a prompt is best but that is not something that can be introduced in 4.3 I do not think. if you exec podman-remote
or any podman machine command and it idles, we will break a lot of existing builds.
@mtrmac the |
7fd7c2c
to
ace5da1
Compare
knownFile, err := writeKnownHosts(host, pubKey) | ||
if err != nil { | ||
if os.IsNotExist(err) { | ||
logrus.Warn("podman will soon require a known_hosts file to function properly.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this code is confident enough to in the public key to add an entry, why would it be reluctant to create a new file with a single line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought our desired path was to warn people for a few podman versions and then maybe talk about creating a file? I think this is exiting the scope of this PR.
This fix is needed for 4.3 next week
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @mtrmac. The stock behavior for podman machine is that a file does not need to be created. Granted, it was because it wasn't in use, but from a UX perspective, we should avoid having podman machine instances break after the upgrade. This is also the stock behavior for stock ssh (silent file creation). Change-wise should be safe and small (just adding O_CREATE to flags)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@n1hility This path is not really a podman machine
concern in this sense. If the file is entirely missing, we get into the other path with the insecureIsMachineConnection
check.
Even this path does not break machines, it warns and continues.
It’s the non-machine case where the behavior between existing file with no match / no file is hard for me to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, my original comment is way off base.
If we get to the path that start with a known()
call at all, we know that the file did exist when creating known
. So the only way we can get here is if someone removed known_hosts
, or a parent directory, between the read and this write.
In that case I don’t think we really need any special handling for os.IsNotExist
(though using O_CREATE
would, I think, be closer to the right thing to do [with the absolutely right thing being worrying about locking the known_hosts
writers across disparate implementations??? ugh]).
But also, the current warning text seems very unrelated to the situation at hand: it is warning about a known_hosts
file that did exist just a few milliseconds ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtrmac @cdoern aha sorry! I just skimmed the shown snippets without looking at the full code. Ok, that alleviates my primary concern.
That said I agree it would make sense to auto-create in the non-machine case as well, and this could simplify the logic a bit.
_ need any special handling for
os.IsNotExist
(though usingO_CREATE
would, I think, be closer to the right thing to do [with the absolutely right thing being worrying about locking theknown_hosts
writers across disparate implementations??? ugh]).
I think its ok relying on O_APPENDs atomicity as long as its always one write for the record. I just double checked the openssh source and they dont use cooperative file locking on known hosts, so its effect would be limited to just concurrent usage of podman
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am mocking up a version where we create a known_hosts file which removes the need for insecureIsMachineConnection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think its ok relying on O_APPENDs atomicity as long as its always one write for the record. I just double checked the openssh source and they dont use cooperative file locking on known hosts
Thanks for taking the time to actually check. Yes, looking at add_host_to_hostfile
in https://cvsweb.openbsd.org/cgi-bin/cvsweb/src/usr.bin/ssh/hostfile.c?rev=1.93&content-type=text/x-cvsweb-markup , we don’t need to do any better than that:
- Just write text+
\n
, don’t worry about files that don’t end with a newline. - Use O_CREATE | O_APPEND, don’t do anything special about concurrency (but potentially create
~/.ssh
, and in that case it is fairly important to use mode 0700).
I guess that’s for a future version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtrmac do you think we should merge this as is for 4.3? I modified it locally to create ~/.ssh/known_hosts, I can push that if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdoern I apologize, for some reason I thought that our 4.2 baseline is #1094 , and that this PR is a net decrease in security. I have only today actually checked, and 4.2 has the pre-c/common ssh implementation if I’m not mistaken.
Assuming that’s the case, any version of this is an improvement over 4.2. So, at this point I think the priorities are, in this order:
- Make sure this + the corresponding Podman PRs is merged into 4.3, so that we have a working Podman machine in 4.3.
- Get the implementation of the feature set as is right now correct and usable. (I think the PR as is is fine in isolation, but can we rely on tests for all relevant features? I.e. I think that rather than add more last-minute features, it would be more useful to do a manual walk-through of all the relevant use cases [and to capture that for posterity], and ensure that we haven’t overlooked any corner cases / interactions. (E.g. is there really no intersection between
ssh_native
and theIsMachineConnection
cases?)
And for 4.4, let’s first document what we want to do, then see what pieces are missing, and finally finish this.
/hold cancel |
@containers/podman-maintainers PTAL, this needs reviews and merging before weeks end. I think this is in a state where we can merge it as it fixes the machine issues. Of course, everything here is up for debate and probably will be but for 4.3, this is a crucial fix that needs to be brought into podman cc/ @n1hility |
@containers/podman-maintainers PTAL! all of the review comments are for a different version. |
5eaaab5
to
cd1d646
Compare
My concerns were addressed so LGTM |
@mtrmac PTAL and merge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The known_hosts parts LGTM, just 1+2 small things.
The fairly recent changes relating to ssh://
made me look at the parser (both before and after this PR), and … I’m worried. It’s non-obvious, with a completely undocumented intent, and generally gives me the impression of a tangle that we are going to be struggling with for a few years, or something that will be just rewritten from scratch (reintroducing old long-fixed bugs) because it’s easier than to fully understand all the interactions.
Maybe this PR should be split into two, if there is no relationship between these two major themes, then? I don’t know that the split would help decrease the total time to get both in, but at least we could stop thinking about the known_hosts part for 4.3.
@@ -84,12 +85,15 @@ func golangConnectionDial(options ConnectionDialOptions) (*ConnectionDialReport, | |||
} | |||
|
|||
func golangConnectionExec(options ConnectionExecOptions) (*ConnectionExecReport, error) { | |||
if !strings.HasPrefix(options.Host, "ssh://") { | |||
options.Host = "ssh://" + options.Host |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throughout, in several instances of these changes: Why does it make sense to set a “host” field to something URI-like?
IMHO fields/variables, should almost always clear and unambiguous semantics — the obvious semantics. If there are two possible formats, have two separate variables/fields.
It would at least be somewhat manageable if the ConnectionExecOptions.Host
field were clearly documented that way, but… it isn’t. How is someone in the future (3 years into the future? or maybe just next week) going to update this undocumented code, without exhaustively enumerating every single caller to understand what formats are acceptable? Or perhaps even validating many past versions of Podman to see what they were writing into persistent config files?
If you currently know these things, please write them down.
If the user’s input is always supposed to be a host name and not an URI, but Validate
uses an URI parser for some reason, I’d expect Validate
to do this internally (or, maybe, not to use an URI parser at all).
If the user’s input can either be a URI or a non-URI, do we have to have that ambiguity? Maybe that ambiguity should only exist at some top CLI level (possibly as a helper provided by this package) but not here inside the fairly low levels of the API?
If the user’s input can be a URI, what happens if the input uses a different scheme than ssh
?
@@ -111,11 +115,15 @@ func golangConnectionScp(options ConnectionScpOptions) (*ConnectionScpReport, er | |||
return nil, err | |||
} | |||
|
|||
// removed for parsing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don’t understand what this comment is trying to say.
pkg/ssh/connection_golang.go
Outdated
return nil, fmt.Errorf("creating host key callback function for %s: %w", keyFilePath, err) | ||
|
||
var callback ssh.HostKeyCallback | ||
var keyErr *knownhosts.KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move keyErr
inside the callback so that we clearly don’t need to worry about reuse of the value across callback invocations.
(This is the only blocking part WRT known_hosts changes.)
} | ||
return err | ||
} | ||
case hErr != nil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: Simplifying this to default: return hErr
would make it a tiny bit clearer that we never return nil
without being intentional about it.
|
||
// addKnownHostsEntry adds (host, pubKey) to user’s known_hosts. | ||
func addKnownHostsEntry(host string, pubKey ssh.PublicKey) error { | ||
hd := homedir.Get() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely non-blocking: The caller already computed this, so it could pass a path instead of this function recomputing.
(Right now this makes ~no difference, but it will eventually help if we start maintaining per-machine known_hosts files in non-default locations.)
@@ -21,6 +21,7 @@ func Validate(user *url.Userinfo, path string, port int, identity string) (*conf | |||
if strings.Contains(path, "/run") { | |||
sock = strings.Split(path, "/run")[1] | |||
} | |||
// url.Parse NEEDS ssh://, if this ever fails or returns some nonsense, that is why. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See elsewhere, why do the callers need to deal with that?
Why is a “path” parameter being parsed as an URI? What is the actually intended semantics of that value? Please name parameters exactly correctly, or if that’s impossible (preferably restructure code to make that possible, or at least) clearly, and exhaustively, document the expected formats, in the function’s/field’s doc string.
@@ -21,6 +21,7 @@ func Validate(user *url.Userinfo, path string, port int, identity string) (*conf | |||
if strings.Contains(path, "/run") { | |||
sock = strings.Split(path, "/run")[1] | |||
} | |||
// url.Parse NEEDS ssh://, if this ever fails or returns some nonsense, that is why. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the “sometimes we are not going to have a path” comment below?
- What’s an example? I guessed
ssh://foo
andssh://user@foo
, but that didn’t match the code. If there is a a corner case (ssh:foo
??), it should be documented clearly — perhaps in a comment, but ideally actually documented also by a unit test that ensures the behavior is not accidentally broken. - How exactly does “not going to have a path” relate to the presence, or not, of a user designation? I can‘t see the relationship.
@@ -21,6 +21,7 @@ func Validate(user *url.Userinfo, path string, port int, identity string) (*conf | |||
if strings.Contains(path, "/run") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking, about all of Validate
:
This kind of parsing/transformation function, overall, seems to me like a pretty good candidate for writing fairly exhaustive unit tests: it is
- non—trivial, with quite a few code paths and, apparently, corner cases (assuming they can’t be eliminated in the first place)
- very isolated, so cheap to test
@@ -21,6 +21,7 @@ func Validate(user *url.Userinfo, path string, port int, identity string) (*conf | |||
if strings.Contains(path, "/run") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Non-blocking: BTW naming a transformative parser like this “validate” is a bit misleading.
Non-blocking: And why are native callers passing options.Identity
to this function, and immediately afterwards to ValidateAndConfigure
? It seems to me that this could be simplified, e.g. by this function returning a struct (resolvedConnectionInfo
[1]?), and callers can then pass around that single struct and access dest.URI
and dest.Identity
.
[1] I guess that struct can’t be *config.Destination
, because the on-disk format and the normalized internal data seem to differ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Regardless of the name, AFAICS Validate
does not need to be public at all; so, with the recent plans to make the c/common API stable, it should start private. We can easily make things public later, but we can’t take back public API nearly as easily. This applies to other functions in this package as well.)
@@ -33,9 +34,9 @@ func Validate(user *url.Userinfo, path string, port int, identity string) (*conf | |||
|
|||
if uri.Port() == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if both uri.Port()
and port
is set?
Looking at various Podman callers, so many (but not all of them) call url.Parse
themselves already. Why do that twice, and then need to worry about these port/port conflicts that might not be actually possible in practice?
Maybe there should be a single ssh.Destination
type, containing all relevant information in a native support optimized for further use (whether that’s an url.URL
, individual values, or some combination), and helpers vaguely like DestinationFromConfigDestination(*config.Destination)
, DestinationFromURI
, DestinationFromCLIString
, DestinationFromPodmanConnectionAddComponents
— each with appropriate heuristics, without several round-trips back and forth.
@mtrmac for the 4.3 version would you like all of the above comments addressed? I am unsure if I can get to them before weeks end due to exams I have |
From my POV worrying about authentication, the For the rest… *shrug*. It’s not something I can prioritize over my own work near-term. To the extent that c/common is vendored into Podman 4.3 and not exposed externally, the unclear data semantics and formats are an internal technical debt (unless they show up as user-visible bugs, or user-visible unexpected behavior users start relying on) that can be fixed any time, and I don’t think the 4.3 release boundary is a major influence. I think there would be a notable difference between getting these things well defined and clearly documented before the discussed API freeze of c/common vs. afterwards. That’s a much more relevant deadline to me. |
@cdoern FYI I’ll prepare updates of the two ssh PRs so that they can be merged early, without last-minute scramble. |
#1163 is the updated variant of this PR. |
…mes to key verification this ensures that podman machine will still work (until we want to make this mandatory). I made the call back function more verbose so we know what is happening from now on. Signed-off-by: Charlie Doern <cdoern@redhat.com>
/lgtm |
@mtrmac: changing LGTM is restricted to collaborators In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
@rhatdan ^ |
/approve |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: cdoern, mtrmac, rhatdan The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
podman machine and podman-remote need some softer handling when it comes to key verification
this ensures that podman machine will still work (until we want to make this mandatory). I made the call back function more verbose so we know what is happening from now on.
Signed-off-by: Charlie Doern cdoern@redhat.com