Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PAM Auth with interactive input is broken. #49028

Closed
programmerq opened this issue Nov 14, 2024 · 4 comments · Fixed by #49487
Closed

PAM Auth with interactive input is broken. #49028

programmerq opened this issue Nov 14, 2024 · 4 comments · Fixed by #49487
Assignees
Labels
bug c-co Internal Customer Reference PAM Label related to Pluggable Authentication Module (PAM) Submethod. ssh

Comments

@programmerq
Copy link
Contributor

Expected behavior:

On a Teleport node with PAM support and PAM auth enabled, users should be presented with output from configured PAM directives, and should be able to provide auth interactive input (like an PAM module that prompts for an OTP code)

Current behavior:

When attempting to open an SSH session, the session terminates after around 20 seconds. The Teleport node logs "Child process never became ready," and "trace.aggregate timed out waiting for continue signal." The user is presented with a generic "EOF" message.

Bug details:

Recreation steps

To reproduce this error, and demonstrate it was working with 7.x, I did the following.

/etc/pam.d/teleport

#%PAM-1.0
account    required   pam_exec.so stdout /etc/pam.d/otp_banner.sh
auth       required   pam_exec.so expose_authtok /etc/pam.d/otp_check.sh
session    required   pam_motd.so
session    required   pam_permit.so

/etc/pam.d/otp_banner.sh

#!/bin/bash

# Print welcome banner
username=${TELEPORT_USERNAME%@*}
echo ' ********    Welcome '$username'  '

# example/dummy OTP of "YES" for testing
echo -n ' Please input YES to continue: '
exit 0

/etc/pam.d/otp_check.sh

#!/bin/sh

read token

if [ "$token" = "YES" ]; then
 exit 0
fi

exit 1

/etc/teleport.yaml

ssh_service:
  enabled: yes
  pam:
    enabled: yes
    service_name: "teleport"
    use_pam_auth: true

Note: My 16.4.7 lab is a typical node instance that is connected to a cluster.

My Teleport 7.3.26 instance was an auth+proxy+node all in one teleport.yaml file. The ssh_service pam configuration was the same in both clusters. The teleport 7 instance used an alternate data_dir and bind address so both versions could run on the same host at the same time sudo /path/to/7.3.26/bin/teleport start -d -c /etc/teleport7.yaml

All tsh commands were run with a tsh version that matches the cluster version.

Expected behavior (as seen in Teleport 7.3.26):

% tsh ssh jeff@teleport7
 ********    Welcome jeff
 Please input YES to continue:
YES
Linux myhost 6.1.0-26-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.112-1 (2024-09-30) x86_64
...
jeff@myhost:~$ 

Current behavior (as seen in Teleport 16.4.7):

% tsh ssh --no-resume jeff@teleport16
ERROR: EOF

The tsh command hangs for 20 seconds before the EOF message appears.

tsh debug messages
jeff@macbook ~ % date; time tsh ssh -d --no-resume jeff@teleport16; echo; date
Thu Nov 14 16:44:47 EST 2024
2024-11-14T16:44:48-05:00 INFO [CLIENT]    ALPN connection upgrade required for "teleport.example.com:443": false. client/api.go:841
2024-11-14T16:44:48-05:00 INFO [CLIENT]    [KEY AGENT] Connected to the system agent: "/private/tmp/com.apple.launchd.2ZpKoTfyMF/Listeners" client/api.go:4619
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Reading certificates from path "/Users/jeff/.tsh/keys/teleport.example.com/jeff-ssh/teleport.example.com-cert.pub". client/keystore.go:357
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 INFO [KEYAGENT]  Loading SSH key for user "jeff" and cluster "teleport.example.com". client/keyagent.go:198
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [CLIENT]    Using provided host myhost client/api.go:1460
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Reading certificates from path "/Users/jeff/.tsh/keys/teleport.example.com/jeff-ssh/teleport.example.com-cert.pub". client/keystore.go:357
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Reading certificates from path "/Users/jeff/.tsh/keys/teleport.example.com/jeff-kube/teleport.example.com". client/keystore.go:357
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [CLIENT]    Attempting to issue a single-use user certificate with an MFA check. client/cluster_client.go:302
2024-11-14T16:44:48-05:00 DEBU [KEYSTORE]  Teleport TLS certificate valid until "2024-11-15 04:55:57 +0000 UTC". client/client_store.go:118
2024-11-14T16:44:48-05:00 DEBU [KEYAGENT]  "Checking key: ssh-rsa-cert-v01@openssh.com AAAA...\n." client/keyagent.go:370
2024-11-14T16:44:48-05:00 DEBU [KEYAGENT]  Validated host myhost:0@default@teleport.example.com. client/keyagent.go:376
2024-11-14T16:44:48-05:00 DEBU [CLIENT]    No Key Agent selected. client/session.go:284
2024-11-14T16:45:08-05:00 DEBU [CLIENT]    not using loopback pool for remote proxy addr: teleport.example.com:443 client/api.go:4574
2024-11-14T16:45:08-05:00 DEBU  Attempting request to Proxy web api method:GET host:teleport.example.com:443 path:/webapi/ping trace_id:1fddeb9c703dd9b6067c048421200fc1 span_id:07bca1541fd0cbbe webclient/webclient.go:131
2024-11-14T16:45:08-05:00 DEBU  ALPN connection upgrade test complete address:teleport.example.com:443 upgrade_required:false trace_id:1fddeb9c703dd9b6067c048421200fc1 span_id:07bca1541fd0cbbe client/alpn_conn_upgrade.go:96

ERROR REPORT:
Original Error: *errors.errorString EOF
Stack Trace:
	github.com/gravitational/teleport/api@v0.0.0/observability/tracing/ssh/session.go:269 github.com/gravitational/teleport/api/observability/tracing/ssh.(*Session).Shell
	github.com/gravitational/teleport/lib/client/session.go:533 github.com/gravitational/teleport/lib/client.(*NodeClient).RunInteractiveShell.(*NodeSession).runShell.func2
	github.com/gravitational/teleport/lib/client/session.go:318 github.com/gravitational/teleport/lib/client.(*NodeSession).interactiveSession
	github.com/gravitational/teleport/lib/client/session.go:522 github.com/gravitational/teleport/lib/client.(*NodeSession).runShell
	github.com/gravitational/teleport/lib/client/client.go:403 github.com/gravitational/teleport/lib/client.(*NodeClient).RunInteractiveShell
	github.com/gravitational/teleport/lib/client/api.go:2019 github.com/gravitational/teleport/lib/client.(*TeleportClient).runShellOrCommandOnSingleNode
	github.com/gravitational/teleport/lib/client/api.go:1750 github.com/gravitational/teleport/lib/client.(*TeleportClient).SSH
	github.com/gravitational/teleport/tool/tsh/common/tsh.go:3615 github.com/gravitational/teleport/tool/tsh/common.onSSH.func1.1
	github.com/gravitational/teleport/lib/client/api.go:607 github.com/gravitational/teleport/lib/client.RetryWithRelogin
	github.com/gravitational/teleport/tool/tsh/common/tsh.go:3620 github.com/gravitational/teleport/tool/tsh/common.onSSH.func1
	github.com/gravitational/teleport/tool/tsh/common/tsh.go:3421 github.com/gravitational/teleport/tool/tsh/common.retryWithAccessRequest
	github.com/gravitational/teleport/tool/tsh/common/tsh.go:3608 github.com/gravitational/teleport/tool/tsh/common.onSSH
	github.com/gravitational/teleport/tool/tsh/common/tsh.go:1363 github.com/gravitational/teleport/tool/tsh/common.Run
	github.com/gravitational/teleport/tool/tsh/common/tsh.go:614 github.com/gravitational/teleport/tool/tsh/common.Main
	github.com/gravitational/teleport/tool/tsh/main.go:26 main.main
	runtime/proc.go:271 runtime.main
	runtime/asm_amd64.s:1695 runtime.goexit
User Message: EOF
tsh ssh -d --no-resume jeff@teleport16  0.14s user 0.23s system 1% cpu 21.263 total

Thu Nov 14 16:45:08 EST 2024
jeff@macbook ~ %

Here are the debug logs on the 16.4.7 ssh node:

Nov 14 14:44:48 myhost teleport[82227]: 2024-11-14T14:44:48-07:00 INFO [SESSION:N] New party ServerContext(203.0.113.183:64776->192.168.168.191:57822, user=jeff, id=3) party(id=ca8859c3-b1c2-5b73-5d96-d9ba4aeb8f8a) joined the session with participant mode: peer. session_id:1027580d-efd5-2796-e821-85433b475b94 srv/sess.go:1986
Nov 14 14:44:48 myhost teleport[82227]: 2024-11-14T14:44:48-07:00 DEBU [SESSION:N] Launching session session_id:1027580d-efd5-2796-e821-85433b475b94 srv/sess.go:1240
Nov 14 14:44:57 myhost teleport[82227]: 2024-11-14T14:44:57-07:00 DEBU [PROXY:AGE] Received discovery request: [1a80270a-9b74-633c-5c8d-42dda5cac488, 7ecef897-2136-6f33-bab0-1f1a32f60d6b, c28edef9-64dc-8c79-e2a5-1ba6dfa57747] leaseID:1 target:teleport.example.com:443 localCluster: targetCluster:teleport.example.com reversetunnel/agent.go:652
Nov 14 14:44:58 myhost teleport[82227]: 2024-11-14T14:44:58-07:00 DEBU [PROXY:AGE] Received discovery request: [1a80270a-9b74-633c-5c8d-42dda5cac488, c28edef9-64dc-8c79-e2a5-1ba6dfa57747, 7ecef897-2136-6f33-bab0-1f1a32f60d6b] leaseID:2 target:teleport.example.com:443 localCluster: targetCluster:teleport.example.com reversetunnel/agent.go:652
Nov 14 14:45:08 myhost teleport[82227]: 2024-11-14T14:45:08-07:00 ERRO [SESSION:N] Child process never became ready error:[
Nov 14 14:45:08 myhost teleport[82227]: ERROR REPORT:
Nov 14 14:45:08 myhost teleport[82227]: Original Error: trace.aggregate timed out waiting for continue signal
Nov 14 14:45:08 myhost teleport[82227]: Stack Trace:
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/term.go:268 github.com/gravitational/teleport/lib/srv.(*terminal).WaitForChild
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/sess.go:1347 github.com/gravitational/teleport/lib/srv.(*session).startInteractive
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/sess.go:389 github.com/gravitational/teleport/lib/srv.(*SessionRegistry).OpenSession
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/termhandlers.go:126 github.com/gravitational/teleport/lib/srv.(*TermHandlers).HandleShell
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/regular/sshserver.go:1921 github.com/gravitational/teleport/lib/srv/regular.(*Server).dispatch
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/regular/sshserver.go:1804 github.com/gravitational/teleport/lib/srv/regular.(*Server).handleSessionRequests
Nov 14 14:45:08 myhost teleport[82227]:         github.com/gravitational/teleport/lib/srv/regular/sshserver.go:1571 github.com/gravitational/teleport/lib/srv/regular.(*Server).HandleNewChan.func1
Nov 14 14:45:08 myhost teleport[82227]:         runtime/asm_amd64.s:1695 runtime.goexit
Nov 14 14:45:08 myhost teleport[82227]: User Message: timed out waiting for continue signal] session_id:1027580d-efd5-2796-e821-85433b475b94 srv/sess.go:1348
Nov 14 14:45:08 myhost teleport[82227]: 2024-11-14T14:45:08-07:00 INFO [SESSION:N] Closing session session_id:1027580d-efd5-2796-e821-85433b475b94 srv/sess.go:957
Nov 14 14:45:08 myhost teleport[82227]: 2024-11-14T14:45:08-07:00 INFO [SESSION:N] Closing party ca8859c3-b1c2-5b73-5d96-d9ba4aeb8f8a srv/sess.go:2155
Nov 14 14:45:08 myhost teleport[82227]: 2024-11-14T14:45:08-07:00 INFO [SESSION:N] Removing party ServerContext(203.0.113.183:64776->192.168.168.191:57822, user=jeff, id=3) party(id=ca8859c3-b1c2-5b73-5d96-d9ba4aeb8f8a) from session session_id:1027580d-efd5-2796-e821-85433b475b94 srv/sess.go:1579

Strace analysis

When reviewing the behavior of both versions of Teleport via strace, there were a few subtle differences in which calls were made. The pid that actually launches the otp_banner.sh is cloned in both versions.

click to expand
mmap(NULL, 242344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 21, 0) = 0x7fc978625000
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fca01a3fbd0) = 78625
set_robust_list(0x7fca01a3fbe0, 24) = 0
close(16 <unfinished ...>
<... close resumed>) = 0
dup2(20, 1 <unfinished ...>
<... dup2 resumed>) = 1
dup2(1, 2)        = 2
pipe2([16, 21], 0) = 0
close(21)         = 0
dup2(16, 0)       = 0
close(16)         = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=8*1024, rlim_max=8*1024}) = 0

This is an strace snippet from the non-working version. It is nearly identical to the working version:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f8aedb42a10) = 77955
set_robust_list(0x7f8aedb42a20, 24) = 0
close(11 <unfinished ...>
<... close resumed>) = 0
dup2(12, 1)       = 1
dup2(1, 2)        = 2
pipe2([11, 13], 0) = 0
close(13)         = 0
dup2(11, 0)       = 0
close(11)         = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024*1024, rlim_max=1024*1024}) = 0

The new version has that mmap call that seems to be absent in the old version. The clone options are also slightly different.

I think there may be a change in how go and/or cgo handles go threads and loading libraries, which leads to the callback never being processed for this type of PAM message.

@programmerq programmerq added bug c-co Internal Customer Reference PAM Label related to Pluggable Authentication Module (PAM) Submethod. ssh labels Nov 14, 2024
@programmerq programmerq changed the title Some PAM messages are not handled by Teleport PAM Auth with interactive input is broken. Nov 18, 2024
@rosstimothy
Copy link
Contributor

rosstimothy commented Nov 20, 2024

I suspect that #43756 might have improved or possibly resolved this issue. Running through the same reproduction steps as above, but using 17.0.1 instead of 16.4.7 I see the following:

$ tsh ssh tim@vm
 ********    Welcome tim
 Please input YES to continue:
tim@vm:~$ uptime
 11:56:31 up 1 day,  1:59,  2 users,  load average: 0.03, 0.05, 0.00
tim@vm:~$ exit
logout
the connection was closed on the remote side at  20 Nov 24 11:56 EST

Edit: HRmm, maybe not, this only seems to get this far if use_pam_auth: false is set, as soon as you update the config to use_pam_auth: true things go south.

@rosstimothy rosstimothy self-assigned this Nov 20, 2024
@rosstimothy
Copy link
Contributor

I've been able to dig into this a bit further, and it appears that nothing has changed in regards to PAM, or go runtime/CGO behavior. This looks to be a deadlock in handling shell requests.

When an interactive PAM auth is attempted, we execute the auth modules while processing the shell request. When any user interaction is required the target host writes to stdout/stdin and waits for the user to proceed with the ceremony. However, because all of this is happening before replying to the shell request, the stdin/stdout/stderr of the session are not yet set up to be copied back to the user. This results in a 20s deadlock where the target host is waiting for the user to react to the prompt until the child wait deadline is exceeding and terminates the session.

@rosstimothy
Copy link
Contributor

I believe this stopped working as a result of #29279.

@rosstimothy
Copy link
Contributor

This is likely the same problem captured in #12214 and #29628

rosstimothy added a commit that referenced this issue Nov 26, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
rosstimothy added a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
github-actions bot pushed a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
rosstimothy added a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
rosstimothy added a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
github-merge-queue bot pushed a commit that referenced this issue Nov 27, 2024
#29279 caused PAM to
deadlock when performing interactive authentication. To restore
the previous semblance of functional PAM, this reverts waiting
for PAM to be complete if BPF is disabled. #29279 was specifically
added to prevent systemd, which may be invoked via a PAM module,
from moving the exec subprocess to a different cgroup. Since
cgroups are not used outside of Enhanced Session Recording this
is a stop-gap measure that can allow mose users of PAM to get an
immediate restoration of behavior while a more long term and sane
approach to performing PAM during the SSH handshake can be
considered, evaluated, and tested.

Closes #49028.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug c-co Internal Customer Reference PAM Label related to Pluggable Authentication Module (PAM) Submethod. ssh
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants