Culler not working when there are too many users? #88

Open
CarlosDominguezBecerril opened this issue Jan 23, 2025 · 1 comment
Labels
bug Something isn't working

Comments

CarlosDominguezBecerril commented Jan 23, 2025

Bug description

The culler is not culling, even though the time since last activity clearly exceeds the maximum idle time allowed.

Current setup from our k8s logs:
Starting service 'cull-idle': ['python3', '-m', 'jupyterhub_idle_culler', '--url=https://<url>:5081/hub/api', '--timeout=14400', '--cull-every=600', '--concurrency=10']

We observe two things:

  • user is using the default server
"last_activity": "2025-01-23T16:01:47.564284Z",
"started": "2025-01-13T09:57:08.123732Z",
  • user is not using the analytics named server
"last_activity": "2025-01-20T12:06:54.100549Z",
"started": "2025-01-13T13:51:04.802281Z",

In the second case, last_activity was over three days ago, far beyond the timeout of 14400 seconds (4 hours); a rough check of that arithmetic is sketched after the JSON below.

Example obtained from /api/users/<user>

{
    "server": "/user/<user>/",
    "created": "2023-02-06T09:37:18.564165Z",
    "admin": false,
    "name": "<user>",
    "kind": "user",
    "last_activity": "2025-01-23T16:06:51.635146Z",
    "roles": [
        "user"
    ],
    "auth_state": {
        "email": "<email>"
    },
    "pending": null,
    "groups": [
        "<groups>"
    ],
    "servers": {
        "": {
            "name": "",
            "full_name": "<user>/",
            "last_activity": "2025-01-23T16:01:47.564284Z",
            "started": "2025-01-13T09:57:08.123732Z",
            "pending": null,
            "ready": true,
            "stopped": false,
            "url": "/user/<user>/",
            "user_options": {
                "profile": "scipy-spark",
                "server_size": "model_training",
                "service_account": "data-science"
            },
            "progress_url": "/hub/api/users/<user>/server/progress",
            "full_url": null,
            "full_progress_url": null,
            "state": {
                "pod_name": "jupyter-<first_name>-2d<last_name>",
                "namespace": "jupyter-hub",
                "dns_name": "jupyter-<first_name>-2d<last_name>.jupyter-hub.svc.cluster.local"
            }
        },
        "analytics": {
            "name": "analytics",
            "full_name": "<user>/analytics",
            "last_activity": "2025-01-20T12:06:54.100549Z",
            "started": "2025-01-13T13:51:04.802281Z",
            "pending": null,
            "ready": true,
            "stopped": false,
            "url": "/user/<user>/analytics/",
            "user_options": {
                "profile": "python-3-analytics",
                "server_size": "default",
                "service_account": "data-science"
            },
            "progress_url": "/hub/api/users/<user>/servers/analytics/progress",
            "full_url": null,
            "full_progress_url": null,
            "state": {
                "pod_name": "jupyter-<first_name>-2d<last_name>--analytics",
                "namespace": "jupyter-hub",
                "dns_name": "jupyter-<first_name>-2d<last_name>--analytics.jupyter-hub.svc.cluster.local"
            }
        }
    }
}
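
As a rough sanity check of the cutoff arithmetic (a minimal sketch, not the culler's actual code; the reference time is approximated from the log timestamps further down):

from datetime import datetime, timezone

timeout = 14400  # seconds, from --timeout=14400 above

# approximate time of the culler run, taken from the logs below
now = datetime(2025, 1, 23, 16, 41, tzinfo=timezone.utc)

# last_activity of the "analytics" named server (trailing "Z" written as an explicit offset)
last_activity = datetime.fromisoformat("2025-01-20T12:06:54.100549+00:00")

idle = (now - last_activity).total_seconds()
print(idle, idle > timeout)  # ~275,000 seconds (over 3 days) idle -> should have been culled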

How to reproduce

I don't know how to reproduce this. We have JupyterHub deployed in several datacenters, but it only fails in our busiest one, which runs over 150 servers simultaneously (not sure whether that is related).

I suspect the service is blocked or not running as expected because there are too many users.

Expected behaviour

To cull the servers after they time out.

Actual behaviour

It is not culling the servers after the timeout.

Your personal set up

Latest versions of JupyterHub and jupyterhub-idle-culler.

Logs

I can see the following logs from the culler:
⚠ Not sure if it matters, but the "Fetching page 2" log shows the URL without the port.

Jan 23 16:41:12.885
nbt-hub
[I 250123 15:41:11 __init__:156] Fetching page 2 https://<url without port>/hub/api/users?state=ready&offset=50&limit=50

Jan 23 16:31:12.600
nbt-hub
File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 124, in fetch

Jan 23 16:31:12.600
nbt-hub
File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 142, in fetch_paginated

Jan 23 16:31:12.599
nbt-hub
File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 436, in cull_idle

Sometimes I also get

/usr/lib/python3.10/collections/__init__.py:431: RuntimeWarning: coroutine 'cull_idle.<locals>.handle_user' was never awaited
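
For context, that warning is ordinary CPython behaviour: it is emitted whenever coroutine objects are created but the surrounding code raises before awaiting them. A minimal standalone sketch (not the culler's code), which would match cull_idle queuing handle_user coroutines and then failing while fetching the next page:

import asyncio

async def cull_idle():
    async def handle_user():
        pass

    pending = [handle_user() for _ in range(3)]  # coroutine objects created...
    raise RuntimeError("fetch failed")           # ...but an exception fires before they are awaited
    await asyncio.gather(*pending)               # never reached

try:
    asyncio.run(cull_idle())
except RuntimeError:
    pass
# On garbage collection Python prints:
# RuntimeWarning: coroutine 'cull_idle.<locals>.handle_user' was never awaited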

CarlosDominguezBecerril commented Jan 23, 2025

After looking into the issue I found the problem.

During pagination, next_info["url"] doesn't contain the port in the URL:

req.url = next_info["url"]

https://github.com/jupyterhub/jupyterhub-idle-culler/blob/main/jupyterhub_idle_culler/__init__.py#L158

which throws an error during fetching

[I 250123 20:02:44 __init__:156] Fetching page 2 https://<my_domain>/hub/api/users?state=ready&offset=50&limit=50
[E 250123 20:02:44 ioloop:770] Exception in callback functools.partial(<bound method IOLoop._discard_future_result of <tornado.platform.asyncio.AsyncIOMainLoop object at 0xf73e1c04f580>>, <Task finished name='Task-1' coro=<cull_idle() done, defined at /usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py:78> exception=ConnectionRefusedError(111, 'Connection refused')>)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 750, in _run_callback
        ret = callback()
      File "/usr/local/lib/python3.10/dist-packages/tornado/ioloop.py", line 774, in _discard_future_result
        future.result()
      File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 436, in cull_idle
        async for user in fetch_paginated(req):
      File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 142, in fetch_paginated
        response = await resp_future
      File "/usr/local/lib/python3.10/dist-packages/jupyterhub_idle_culler/__init__.py", line 124, in fetch
        return await client.fetch(req)
    ConnectionRefusedError: [Errno 111] Connection refused

A dummy fix for my use case would be:

req.url = next_info["url"].replace(
    "<my_url>",
    "<my_url>:5081"
)

Not ideal, but I was wondering: is this something JupyterHub should return properly, or is the problem in the culler?

I believe it could be JupyterHub, since if we have the hub on

localhost:8000

the first call would look like

localhost:8000/hub/api/users?state=ready&offset=0&limit=50

and the pagination would return as the next URL

localhost/hub/api/users?state=ready&offset=50&limit=50, without the port, which is incorrect.
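
A slightly less brittle workaround on my side (just a sketch, not the project's code; base_url stands for whatever was passed as --url) would be to keep the scheme/host/port from the configured API URL and take only the path and query from the pagination link:

from urllib.parse import urlparse, urlunparse

def rebase_next_url(base_url, next_url):
    # keep scheme://host:port from the configured --url,
    # take path + query from the Hub's pagination "next" link
    base = urlparse(base_url)  # e.g. https://<url>:5081/hub/api
    nxt = urlparse(next_url)   # e.g. https://<url>/hub/api/users?state=ready&offset=50&limit=50
    return urlunparse((base.scheme, base.netloc, nxt.path, nxt.params, nxt.query, nxt.fragment))

# instead of: req.url = next_info["url"]
# something like: req.url = rebase_next_url(<configured --url>, next_info["url"])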
