
lease-shell breaks with "remote server returned 404" once the provider service gets restarted (the .manifest.deployments count breaks as well) #87

Closed
andy108369 opened this issue Apr 11, 2023 · 6 comments
andy108369 commented Apr 11, 2023

lease-shell breaks with "remote server returned 404" once the provider service gets restarted.

The .manifest.deployments count breaks as well.

Internally tracked at https://github.com/ovrclk/engineering/issues/538

This issue appeared in akash 0.16.4 and persists through provider-services 0.2.1.

The issue is resolved if I revert this commit: akash-network/node@1ab8ee6

It looks like the ctx is not getting updated with the active leases upon provider restart, so IsActive cannot work.

This commit might also be related to .manifest.deployments reporting 0 now (or it may be mainnet4 upgrade-related [provider-services 0.1.0]):

$ curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status | jq '.manifest.deployments'
0

$ curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status | jq '.cluster.inventory.active | length'
60

$ curl -sk https://provider.provider-2.prod.ewr1.akash.pub:8443/status | jq '.cluster.leases'
60
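The three checks above can be combined into a single script that flags the mismatch. A minimal sketch, assuming `jq` is installed; the JSON here is a sample payload inlined for illustration — in practice, fetch it with `curl -sk https://<provider>:8443/status`:

```shell
#!/bin/sh
# Compare the manifest deployment count against the lease count from a
# provider /status payload. A sample payload is inlined here; in practice:
#   STATUS=$(curl -sk https://<provider>:8443/status)
STATUS='{"manifest":{"deployments":0},"cluster":{"leases":60}}'

MANIFESTS=$(echo "$STATUS" | jq '.manifest.deployments')
LEASES=$(echo "$STATUS" | jq '.cluster.leases')

if [ "$MANIFESTS" -ne "$LEASES" ]; then
  echo "mismatch: manifest.deployments=$MANIFESTS cluster.leases=$LEASES"
else
  echo "ok: $LEASES leases tracked"
fi
```

On a healthy provider the two numbers should agree; after the restart described above, `manifest.deployments` drops to 0 while `cluster.leases` still reports the live leases.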

Update: 23 Jan 2023

Akash Provider reports:

@andy108369 andy108369 added repo/provider Akash provider-services repo issues P2 labels Apr 11, 2023
@troian troian removed their assignment Apr 11, 2023
@troian troian added P2 and removed P2 labels Apr 11, 2023

andy108369 commented Oct 12, 2023

Workarounds

One can simply add an OpenSSH server and their public SSH key to the deployment to keep permanent SSH access to it.

For an Ubuntu-based image

Make sure to set your public SSH key in SSH_PUBKEY:

    image: ubuntu:22.04
    env:
      - 'SSH_PUBKEY=ssh-rsa AAAAB3NzaC1yc...'
    command:
      - sh
      - -c
      - |
        apt-get update
        apt-get install -y --no-install-recommends -- tini ssh
        # sshd needs its privilege-separation directory to exist
        mkdir -p -m0755 /run/sshd
        # install the public key for root logins
        mkdir -m700 ~/.ssh
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys
        chmod 0600 ~/.ssh/authorized_keys
        # propagate the container environment variables to SSH sessions
        cat /proc/1/environ | xargs -0 -n1 | tee -a /etc/environment
        /usr/sbin/sshd
        # keep PID 1 running (and reaping zombies) via tini
        exec /usr/bin/tini -- tail -f /dev/null
    expose:
      # HTTP/HTTPS port
      - port: 80
        as: 80
        to:
          - global: true
      # SSH port
      - port: 22
        as: 22
        to:
          - global: true

Ollama + SSHD example

https://gist.githubusercontent.com/andy108369/b633153179e08cae4115957a2d294643/raw/888e0b9ccb713d81c3e05d23a1e533323bc2a080/ollama-ssh.yaml

For an Alpine-based image

Make sure to set your public SSH key in SSH_PUBKEY:

    image: alpine:3.18.4
    env:
      - 'SSH_PUBKEY=ssh-rsa AAAAB3NzaC1yc...'
    command:
      - sh
      - -c
      - |
        apk update
        apk add tini openssh-server
        # Alpine's package does not generate host keys; create them here
        ssh-keygen -t rsa -f /etc/ssh/ssh_host_rsa_key -N ""
        ssh-keygen -t ed25519 -f /etc/ssh/ssh_host_ed25519_key -N ""
        # install the public key for root logins
        mkdir -m700 ~/.ssh
        echo "$SSH_PUBKEY" | tee ~/.ssh/authorized_keys
        chmod 0600 ~/.ssh/authorized_keys
        # propagate the container environment variables to SSH sessions
        cat /proc/1/environ | xargs -0 -n1 | tee -a /etc/environment
        /usr/sbin/sshd
        # keep PID 1 running (and reaping zombies) via tini
        exec /sbin/tini -- tail -f /dev/null
    expose:
      # HTTP/HTTPS port
      - port: 80
        as: 80
        to:
          - global: true
      # SSH port
      - port: 22
        as: 22
        to:
          - global: true

And to combine the sshd daemon with running the app(s), one can simply add them one by one before it:

      app1 &
      app2 &
      exec /usr/sbin/sshd -D

To figure out what one has to run (and how) in a specific image:

docker pull <image>
docker image history <image> --no-trunc --format '{{.CreatedBy}}' | grep -E '^WORKDIR|^ENTRYPOINT|^CMD|^USER'
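The grep filter above can be wrapped into a tiny helper. A sketch with hypothetical sample history lines standing in for real `docker image history` output:

```shell
#!/bin/sh
# Hypothetical helper: reduce an image's recorded build instructions to
# the ones needed to launch it by hand (WORKDIR, ENTRYPOINT, CMD, USER).
# Pipe `docker image history <image> --no-trunc --format '{{.CreatedBy}}'`
# into it.
filter_launch_info() {
  grep -E '^WORKDIR|^ENTRYPOINT|^CMD|^USER'
}

# Sample history lines (illustrative, not from a real image):
printf '%s\n' \
  'RUN apt-get update' \
  'WORKDIR /app' \
  'ENTRYPOINT ["/app/server"]' \
  'CMD ["--port=8080"]' \
  | filter_launch_info
```

For the sample input, only the WORKDIR, ENTRYPOINT, and CMD lines survive the filter; RUN layers are dropped.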


SGC41 commented Dec 11, 2023

It would be nice to have a fix for this...
A lot of customers have a bad experience because of it.


anilmurty commented Jan 14, 2024

Added this to the "Up Next" list on the product/eng roadmap: https://github.com/orgs/akash-network/projects/5/views/1


rekpero commented Jan 26, 2024

Hey team, fixing this issue quickly would really help us out at Spheron. We've got a bunch of users struggling to connect shell for their keys or to check status, and it's becoming a bit of a headache. Could we get this sorted out as soon as possible? We're more than happy to give it a test run even before it goes live on the main provider code. Thanks a bunch for jumping on this quickly!

brewsterdrinkwater (Collaborator) commented Apr 2, 2024

  • This will be addressed via the gRPC migration.

troian added a commit to akash-network/provider that referenced this issue May 21, 2024
check service status prior trying shell using cluster API
refs akash-network/support#87

Signed-off-by: Artur Troian <troian.ap@gmail.com>
troian added a commit to akash-network/provider that referenced this issue Aug 17, 2024
fixes issue when provider restarts and tenant attempts
to shell into the deployment as it takes time for provider
to load all leases into deployment manager

refs akash-network/support#87

Signed-off-by: Artur Troian <troian.ap@gmail.com>
troian added a commit to akash-network/provider that referenced this issue Aug 19, 2024
fix(lease/shell): use cluster check for lease shell

fixes issue when provider restarts and tenant attempts
to shell into the deployment as it takes time for provider
to load all leases into deployment manager

refs akash-network/support#87

Signed-off-by: Artur Troian <troian.ap@gmail.com>
@troian troian closed this as completed Aug 19, 2024
@andy108369

Provider 0.6.4 fixed this issue! 🚀
We'll be rolling the update ASAP.

Projects
Status: Released (in Prod)