
Proposal: Make SSH usage more obvious #701

Closed
DanielHabenicht opened this issue Nov 8, 2023 · 9 comments

@DanielHabenicht
Contributor

DanielHabenicht commented Nov 8, 2023

Is your feature request related to a problem? Please describe.

As a new user it is not completely clear that this module uses SSH under the hood for certain resources.
This could be because of various reasons:

  • found a snippet somewhere and used it
  • followed somebody's tutorial
  • used AI-generated code

In all of these cases, they would never come into contact with the documentation.

This makes finding the root cause of certain problems harder such as:

  1. Error: failed to open SSH client: failed to dial 192.168.178.147:22: ssh: handshake failed: REMOTE HOST IDENTIFICATION HAS CHANGED for host 192.168.178.147:22! This may indicate a MitM attack)
  2. Error: failed to open SSH client: failed to dial 10.0.0.2:22: dial tcp 10.0.0.2:22: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

This is also closely related to issues where the automatic detection of the IP address for SSH is not working and leaves the user clueless.

Describe the solution you'd like
Make it more obvious that SSH is used, e.g. by

Option 1

Require SSH to be explicitly enabled in the provider configuration, and throw an error during terraform plan when resources with SSH functionality are used.

  • This is a BREAKING CHANGE, so if it is adopted, it should probably be introduced gradually: first warn for some amount of time (half a year?) and only then make it mandatory.

This could be done by introducing a new argument in the ssh provider block.

provider "proxmox" {
  endpoint  = var.virtual_environment_endpoint
  # The ssh block must be specified if resources that use SSH are used.
  ssh {
    # One of the arguments "node_ip_detection" or "node" is required.
    # New argument:
    node_ip_detection = "automatic"
    # ...which is mutually exclusive with:
    node {
      name    = "pve1"
      address = "192.168.10.1"
    }
  }
}

Option 2

Fail early by adding a connection check during the plan phase (for resources using SSH).

  • Any other ideas?

Describe alternatives you've considered
Live with it. Their fault if they did not read the f****** manual. :D

@otopetrik
Contributor

Requiring one of node or node_ip_detection is wrong.

Displaying an error message early (missing provider configuration argument) instead of later (failure to use SSH when it is needed) is not really an improvement:

  • The SSH error message is full of useful details. Anything that could be added to the early error message (e.g. a link to documentation) could be added to the later one.
  • Terraform should handle (and expect) that resource creation can fail and result in an incomplete resource; such a tainted resource should be deleted and re-created on the next apply.

A surprise caused by the provider successfully using an existing SSH setup should not really happen: the user has to either enter the ssh password on terraform apply, or enable the agent in the provider configuration.

It may be possible for this to work silently if the user's private ssh key in ~/.ssh is passwordless (in that case the user obviously does not care about security).

>   • Any other ideas?
>
> In all of the cases, they would never be in contact with the documentation.
>
> This makes finding the root cause of certain problems harder such as:
>
> 1. `Error: failed to open SSH client: failed to dial 192.168.178.147:22: ssh: handshake failed: REMOTE HOST IDENTIFICATION HAS CHANGED for host 192.168.178.147:22! This may indicate a MitM attack)`
>
> 2. `Error: failed to open SSH client: failed to dial 10.0.0.2:22: dial tcp 10.0.0.2:22: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.`

The user gets precise information about what was attempted and why it failed, and that is great!
The user might not understand what it means, but that could be improved by adding a link to the SSH documentation at the very end of the error message.

Btw. linking the section directly does not seem to work (tested in Firefox and Chrome; the page does not scroll to the SSH section), so in addition to the link, the error message should contain a note to scroll down to the "SSH connection" section.

Splitting the "Authentication" section into "Authentication - API" and "Authentication - SSH" would place the text "SSH" very early in the page: in the page's table of contents on the right edge.

It is unlikely that a user can effectively use the provider without referencing the documentation, but in case of an error, the message should direct the user to the specific part of the documentation relating to that error.

Not sure if it would be worth the effort to have a separate documentation page for every possible SSH client error: in addition to "Resources" and "Data Sources", add a section "Error Codes". If each ssh error message ended with a different "[Error code PVE-TF-SSH-0123: see http://...url.../]", it would make things easier for users: in case of issues understanding the target page, they would have a keyword (PVE-TF-SSH-0123) to search for.

> Describe the solution you'd like
> Make it more obvious that SSH is used, e.g. by
>
> • Requiring to enable SSH in the provider, throws an error when using resources with SSH functionality (on terraform plan)

Is there any advantage to having the error at plan time?
Terraform should handle failures to create resources anyway, it should delete and recreate tainted resources on next apply.

Use of SSH depends on resource arguments:

The majority of proxmox_virtual_environment_vm uses probably require SSH (initial disk contents imported using file_id), but not all.
Same with proxmox_virtual_environment_file: uploading a snippet (like a cloud-init configuration) requires SSH, while uploading an ISO image should not.

A really determined user could run a possibly useful setup without the provider using SSH. It would be needlessly complex, but it could work:

  • Use a http server (and a terraform provider) for uploading and hosting cloud init files (maybe S3 or something S3-compatible).
  • Create VMs not by importing boot disk, but by cloning a template VM (that should work without SSH).
  • Use smbios.serial to point cloud-init to the server.
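Such an SSH-free setup could be sketched roughly as follows (a hypothetical illustration: the external cloud-init server URL, the template VM id, and the exact shape of the clone/smbios arguments are assumptions, not a tested configuration):

```hcl
# Hypothetical sketch: avoid the provider's SSH path entirely by cloning a
# template (no file_id import) and serving cloud-init data over HTTP.
resource "proxmox_virtual_environment_vm" "no_ssh_vm" {
  node_name = "pve1"
  name      = "no-ssh-vm"

  # Clone an existing template VM instead of importing a boot disk image
  # (disk import via file_id is what requires SSH).
  clone {
    vm_id = 9000 # assumed id of a pre-created template
  }

  # Point cloud-init's NoCloud datasource at the external HTTP server
  # through the SMBIOS serial, as described above.
  smbios {
    serial = "ds=nocloud-net;s=http://cloudinit.example.internal/no-ssh-vm/"
  }
}
```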

Possible future:

  • The Proxmox API seems to have a way to copy a disk image without SSH (POST tab), but it is root-only and comes with a rather strong warning: "Copy a volume. This is experimental code - do not use."

  • The Terraform provider for "libvirt" generates the cloud-init configuration locally using mkisofs.
    (Unlike cloud-init file snippets, it should be possible to upload the final cloud-init configuration .iso using the API.)

If both are implemented, it might make sense to add a use_ssh (default true) provider configuration option, to be set to false by users who really do not want to use ssh.

On the other hand, because the provider already uses ssh, it might be possible to have a pve-zsync terraform resource.

@DanielHabenicht
Contributor Author

DanielHabenicht commented Nov 16, 2023

> Requiring one of node or node_ip_detection is wrong.
>
> Displaying error message early (missing provider configuration argument) instead of later (failure to use SSH when it is needed) is not really an improvement:
>
>   • SSH error message is full of useful details. Anything that could be added to the early error message (e.g. link to documentation) could be added to the later one.
>   • Terraform should handle (and expect) that resource creation can fail and result in incomplete resource, such tainted resource should be deleted and create on next apply.

I think failing earlier is better and an improvement. Sure, the later error messages could be improved. But when I run terraform plan first and then apply, I would expect it not to fail in the second step because of a simple connection error (sure, there might be other errors, but a connection error after having already connected during the plan step is unexpected).
Another way to counteract this would be to try to connect during the plan step (if a resource with SSH configuration is used). This would also satisfy my requirement to fail early and would not be a breaking change, yet still achieve functionally the same result.

> Surprise caused by the provider successfully using existing SSH setup should not really happen - user has to either enter ssh password on terraform apply, or enable agent in provider configuration.
>
> It may be possible for this to work silently, if user's private ssh key in ~/.ssh is passwordless (in that case user obviously does not care about security).

That is not completely true: by default the provider tries to connect with the username and password given for the normal API, which many users probably also use for SSH. This means no additional configuration is needed for most new users.

> Is there any advantage at having the error at plan time ?
> Terraform should handle failures to create resources anyway, it should delete and recreate tainted resources on next apply.

Yes, it saves time and protects from possibly inconsistent states. I agree that Terraform should be able to handle it, but that's not always the case (although I haven't encountered it with this provider yet), so why provoke errors when checking beforehand would prevent them?

Going forward, I mostly agree with your proposal of adding better error messages and hoping that the problem resolves itself™ as the API improves, and I withdraw my proposal of adding extra configuration options for SSH usage.
Until then, how about adding a check for the SSH connection in the Read functions of resources that use the SSH connection?

@otopetrik
Contributor

> But when I run a terraform plan first and then apply I would expect that it does not fail in the second step because of a simple connection error (sure there might be other errors but a connection error after already having connected during the plan step is unexpected).

That would be the ideal behavior, but even the big providers do not guarantee that a provider configuration and resource description which pass the planning stage will result in successful resource creation.
Configure a provider with insufficient permissions: the plan will succeed, apply will fail.
E.g. the AWS provider does not verify IAM permissions at plan stage.

Just because the ssh block is configured, that does not mean the ssh connection will succeed and be usable for all configured resources.

Incomplete list of possible failures:

  • wrong ip address
  • wrong username (e.g. the user mistakenly uses their local username, e.g. john, instead of their PAM username used on the Proxmox host, johnsmith)

and even if ssh connection succeeds, that is not enough:

  • selected datastore does not have snippets enabled
  • the PAM username has no access to the snippets directory (on the datastore used by a configured resource)
  • selected datastore does not even exist (maybe a typo in terraform configuration ?)

It looks like using CustomizeDiff for stricter checking of resources has its issues. For interpolated values it can get empty values instead (see here).

There could be cases when a user wants a VM to use the same datastore through an attribute of another VM (e.g. when using full = false cloning of a template) and also create snippets on that datastore.
It is reasonable that the template VM (created by importing an .img file) and the clones are intended to be created during the same terraform run (e.g. download a Debian cloud image, and create a few cloud-init-configured VMs).
Not sure if the template VM's attributes are available to CustomizeDiff in that case, to detect that creating clones with cloud-init files requires ssh access and permission to a specific datastore.

> Another way to counteract this would be to try connect during the plan step (if a resource with SSH configuration is used). This would also satisfy my requirement to fail early and would not be a breaking change, yet still achieve functionally the same.

Opening an SSH connection would slow down plan considerably, especially because Go's ssh client does not seem to support OpenSSH's connection sharing (ControlPath).
Btw. the provider does not seem to connect to the API during the plan stage.

Consider this example.tf file:

terraform {
  required_providers {
    proxmox = {
      source  = "bpg/proxmox"
      version = "0.37.0"
    }
  }
}
provider "proxmox" {
  endpoint  = "https://example.com:8006/"
  api_token = "definitelnynotroot@pam!invalidtoken=withinvalidvalue"
  insecure  = true
  tmp_dir = "/var/tmp"
}

resource "proxmox_virtual_environment_vm" "data_vm" {
  node_name = "node1"

  vm_id   = 1234
  name    = "test--vm1234"
  tags    = ["data-vm"]
  started = false
  on_boot = false

  disk {
    datastore_id = "local-zfs"
    file_format  = "raw"
    interface    = "scsi0"
    size         = 1
  }

  disk {
    datastore_id = "local-zfs"
    file_format  = "raw"
    interface    = "scsi1"
    size         = 4
  }

}

Place it in an empty directory, run terraform init and terraform plan. The plan succeeds.
It does not verify that example.com is a Proxmox host, that the API token is valid, that the local-zfs datastore actually exists, or that the configured user has permissions to allocate disks there.

Btw. copying the "data_vm" resource to "data_vm2" while keeping the same vm_id (i.e. a collision) still passes the plan. And it looks like there is no way for a terraform provider to detect such a collision.

> That is not completely true, by default the provider tries to connect with username and password given for the normal API, which is probably also used by many users for SSH too. This means no additional configuration is needed for most new users.

Thanks! I completely missed that; I'm just used to disabling PasswordAuthentication in sshd_config.

If it works for new users, the provider probably should not break that.
Displaying a security warning when the provider actually uses the SSH client (with username and password instead of the ssh agent) might be a very good idea.

> Until then how about adding a check for the SSH connection in the Read functions for resources that use the SSH connection?

It looks like Read is not called before Create (see hashicorp/terraform#19017), so it probably would not fix failures when creating resources.
Also, it looks like Read does not have access to the intended configuration (there is no way to check whether a vm or file resource even needs SSH access).

  1. If you really want an additional check, try using CustomizeDiff and verify SSH access only if the resources actually need it. That is proxmox_virtual_environment_vm (but only if it has a disk with a non-empty file_id) and proxmox_virtual_environment_file (but only if it has content_type snippets).

This verification must be skipped if the user has configured SSH in any way, i.e. if anything in the ssh block is set, or any PROXMOX_VE_SSH_... environment variable is set. There is no reason to slow down plan for users who know that the provider uses SSH.

Looking around the code a bit more, there might be even better option:

  1. The ssh client is recreated in every ExecuteNodeCommands and NodeUpload. It would be a good idea to keep a lazily-created, one-per-node ssh client instance alive for the whole run (inside proxmox/ssh/client). It looks like it should be possible to cleanly shut down the SSH client(s) at the end of the run, see stackoverflow. That would make the provider usable when using a hardware token for the ssh agent (only one tap of the hardware button per node, instead of one tap for every ssh operation).

     Having an ssh.eager_connection option would then make a lot of sense (connect ssh at provider initialization), likely even as a default. It would help both beginners (the ssh connection is attempted before any resource creation) and long-time users (the ssh agent is unlocked at the beginning of the run, e.g. even before starting a long download of a cloud image from distribution servers).

     The error message for a failed eager ssh connection could give reasonable guidance, something like: "everyone: configure ssh (expert: just configure ssh; if really, really certain no resources need ssh, possibly disable ssh.eager_connection)".

Instead of adding checks causing additional ssh connections and slowing the provider down, make it a feature by reducing the number of ssh connections and make the provider faster.
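If implemented, the proposed option might be configured like this (a hypothetical sketch; ssh.eager_connection does not exist in the provider at the time of writing):

```hcl
provider "proxmox" {
  endpoint = "https://pve.example.com:8006/"

  ssh {
    agent = true

    # Hypothetical argument: open (and cache) one SSH connection per node
    # at provider initialization, so connection or agent failures surface
    # at the very beginning of the run instead of mid-apply.
    eager_connection = true
  }
}
```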

@bpg
Owner

bpg commented Nov 17, 2023

> Btw. the provider does not seem to connect to the API during plan stage.

It does call Read, but only if there is a local state to compare with. Your example may not be representative of the use case, as for a new deployment there is no existing state. So TF does not need to read from PVE to calculate the difference between local and remote states; whatever is in the plan will be "new".

Which means there is, unfortunately, no reliable way for the provider to check communication with PVE before running apply. CustomizeDiff won't work for new resources either, as there is no diff.

In these circumstances, a clear warning at apply time, showing that an ssh connection is being used without a configured ssh agent, with a link to the docs, is probably the best we can do.

I also like the idea of cached ssh connections; it seems totally reasonable (i.e. we're using an http connection pool in the http client, why not in ssh?). But I'm a bit skeptical about the performance benefits it could give, except in some edge cases where someone is trying to deploy tens of files / VMs in one shot. I guess PVE won't be happy in that scenario anyway, most likely hitting IO bottlenecks.

@otopetrik
Contributor

> It does call read, but only if there is a local state to compare with. Your example may not be representative for the use case, as for a new deployment there is no existing state. So TF does not need to read from PVE to calculate the difference between local and remote states, whatever is in the plan that will be "new".

You are right. The API calls are skipped only for "new" resources; any existing resource in the terraform state will cause a read, thereby verifying that the provider is still configured correctly to use the Proxmox API.

> Which means there is no reliable way for provider to check communication with PVE before running apply, unfortunately. CustomizeDiff won't work for new resources either, as there is no diff.

According to the SDK documentation, it should be called for new resources.

It looks like the issue with values from other resources is limited to computed values, so it would likely work even for the "VM template and cloud-init clones on same datastore" example.

The problem is the large number of ssh connections it would cause, up to 5 connections per VM (boot disk file_id, and one for each file resource for network_data_file_id, user_data_file_id, vendor_data_file_id and meta_data_file_id).

> I also like the idea of cached ssh connections, seems totally reasonable (i.e. we're using http connection pool in the http client, why not in ssh?). But I'm a bit skeptical about performance benefits it could give, unless for some edge-cases where someone is trying to deploy tens of files / VMs in one shot. I guess PVE won't be happy in this scenario anyway, having IO bottlenecks most likely.

Correct, the performance improvement compared to the current state would not be noticeable.
The idea is to avoid the decline in performance caused by performing an SSH connection check in the ReadContext of every cloud-init-file resource and every VM-created-from-image resource. The slowdown would depend not on the number of new resources but on the total number of such resources managed by terraform. Waiting a few seconds on every plan for tens of needless ssh connections could get old fast, especially for larger terraform projects in a homelab environment on slow hardware.

CustomizeDiff could probably skip the check if the resource has not changed, making it workable, but using cached ssh connections would have an additional advantage:

the ability to start a long terraform run (with slow downloads) without having to be near the computer to ensure that ssh can connect later.

  • Some Linux distributions add keys to ssh-agent automatically after login (if the ssh key is encrypted using the login password). Others do not, and it would be a quality-of-life improvement if the provider created the ssh connection only once, and early. The ssh-agent would then ask the user to unlock the key only once, at the beginning of the run (not minutes later, after the downloads completed).
  • A user can have ssh-agent running with the -t option; in that case it is possible that the agent has the ssh key ready at the beginning of the terraform run but deletes it during the run, and then new ssh connections would fail.
  • Using ssh-agent with ecdsa-sk or ed25519-sk keys on a USB dongle requires pushing a button every single time a new ssh connection is created.

@DanielHabenicht
Contributor Author

Seems like there is another (consistent) way to check the plan:

https://developer.hashicorp.com/terraform/plugin/framework/resources/plan-modification

@Tash1nka

Tash1nka commented Dec 8, 2023

Hi, sorry to hop on the issue but it felt appropriate.

I simply would like to add that there are two features about the SSH part of the provider that would be nice to have:

  1. Being able to connect through an SSH bastion host
  2. Being able to resolve hostnames for the node addresses
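For illustration, the two requests might translate into provider configuration roughly like this (a hypothetical sketch; the bastion-related argument names are invented for this example and are not existing provider options):

```hcl
provider "proxmox" {
  endpoint = "https://pve.example.com:8006/"

  ssh {
    agent = true

    # (2) A hostname rather than an IP address for the node's SSH address.
    node {
      name    = "pve1"
      address = "pve1.example.internal"
    }

    # (1) Hypothetical bastion settings; argument names are assumptions.
    # bastion_host = "bastion.example.internal"
    # bastion_user = "jump"
  }
}
```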

@bpg
Owner

bpg commented Dec 29, 2023

Hi @Tash1nka!

> 1. Being able to connect through an SSH bastion host

Will be addressed in #852

> 2. Being able to resolve hostnames for the node addresses

Added in #848 and #824

@bpg-autobot
Contributor

bpg-autobot bot commented Jun 27, 2024

Marking this issue as stale due to inactivity in the past 180 days. This helps us focus on the active issues. If this issue is reproducible with the latest version of the provider, please comment. If this issue receives no comments in the next 30 days it will automatically be closed. If this issue was automatically closed and you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thank you!

@bpg-autobot bpg-autobot bot added the stale label Jun 27, 2024
@bpg-autobot bpg-autobot bot closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 28, 2024