Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: resolve service name conflict for notebooks and tensorboards #7468

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

mishraprafful
Copy link
Contributor

@mishraprafful mishraprafful commented Feb 20, 2024

Related Issues/PRs

Closes kubeflow/dashboard#50

Changes

Notebook Controller
  • Update names of service, virtualService and statefulset resource to include
    • nb-{notebook_name}- with generateName
  • Added a reconcile check for name length check as we are adding prefixes
  • Added a function that deletes obsolete services, so as to not leave dangling services after folks update.
    • Added a function that deletes obsolete services, so as to not leave dangling services after folks update.
  • Update servicename in virtualService after reconciliation of service
  • Upated code and tests to check owner reference as the source of resource connection
Tensorboard Controller
  • Update name of service and virtualService to tb-{notebook-name}
  • Add label tensorboard-name in pods and for service label selector
  • Add label notebook-name in service selector
  • Added a reconcile check for name length check as we are adding prefixes

Comments

  • This might be a breaking change, and if there are notebook servers with long names, they might fail to reconcile as we are adding name, and namespace prefix.

Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kimwnasptd for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mishraprafful mishraprafful force-pushed the fix/tensorboard-notebook-name-conflict branch 2 times, most recently from 3c9608b to 587555e Compare February 20, 2024 09:13
@google-oss-prow google-oss-prow bot added size/S and removed size/XS labels Feb 20, 2024
@thesuperzapper
Copy link
Member

We need to consider what will happen to existing TensorBoard instances when this update is applied to the cluster.

We should test what happens, it looks like we are setting the ownerReferences of the Service/VirtualService to the TensorBoard CRD instance, so at worst, we can tell people to delete and recreate all their TensorBoard CRD instances:

@mishraprafful
Copy link
Contributor Author

mishraprafful commented Feb 29, 2024

@thesuperzapper Could we re-run the tests it seems like the unit test of notebook-controller fails on some dependency, and is unrelated to the changes in the PR. 🙏🏻

Also, the integration tests and multi-arch build for the same pass. 😄

@thesuperzapper
Copy link
Member

@mishraprafful can you please check what happens to the existing resources when you upgrade from the old version of the controller to the new version (specifically, does it remove the old Service/{notebook_name}, or leave it dangling).

If it leaves it, we should consider having the controller look for Services / VirtualSerivces which have an owner reference of a Notebook, but are not named with the notebook-XXX or tensorboard-XXX prefix.

@rimolive
Copy link
Member

rimolive commented May 3, 2024

@mishraprafful There are failling tests. Can you take a look?

@mishraprafful
Copy link
Contributor Author

Sure @rimolive Looking into it.

@thesuperzapper thesuperzapper added this to the v1.9.0 milestone May 9, 2024
@thesuperzapper
Copy link
Member

Also, there was a historical PR to fix this issue, which also accounted for the case that the name is too long #6701

They put 63 characters, but I am not sure if VirtualServices/Services can actually be longer than that?

@mishraprafful
Copy link
Contributor Author

@rimolive the tests are failing to pull some dependencies, the tests pass when I try them locally. Maybe we could re-trigger the tests as this seems like an intermediate issue.

Feel free to point out if I'm missing something 🙌🏻

@mishraprafful
Copy link
Contributor Author

@thesuperzapper Service name has to be less than 64

The name of a Service object must be a valid RFC 1035 label name.

ref

@mishraprafful
Copy link
Contributor Author

Just an update: now that #7589 is merged, I'm testing a last few things, should add some more commits soon to finish the functionality.

@mishraprafful mishraprafful force-pushed the fix/tensorboard-notebook-name-conflict branch from 966afe2 to 774f001 Compare June 2, 2024 13:58
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Jun 2, 2024
@mishraprafful mishraprafful force-pushed the fix/tensorboard-notebook-name-conflict branch from cd2229d to 4a57f11 Compare June 2, 2024 14:02
@mishraprafful
Copy link
Contributor Author

@thesuperzapper @rimolive @kimwnasptd All issues addressed, tests passing, PR description updated. 🚀

@thesuperzapper thesuperzapper modified the milestones: v1.9.0, v1.9.1 Jun 24, 2024
@thesuperzapper
Copy link
Member

@mishraprafful I am going to push this to 1.9.1, because we came across a better way to do this while building Notebooks 2.0.

The key idea is to use the generateName field of every Kubernetes object, rather than using the name directly, which is how Kubernetes creates Pod names from deployments with those {DEPLOYMENT_NAME}-{POD_HASH}-{RANDOM_SUFFIX} names.

Obviously, this means we need to account for random names (which is also great for backwards compatibly).

Rather than assuming the Service/VirtualService/ReplicaSet has the same name as the Notebook (or notebook-{NOTEBOOK_NAME}), we should instead loop over the corresponding resource type in the target namespace and find the ones which have metadata.ownerReferences set to the Notebook being reconciled.

If there are no such resources, we can create a new one using the generateName set to nb-{NOTEBOOK_NAME}- (with the owner reference already set to avoid problems if the controller crashes).

We will need to be careful about the Service names, because obviously the VirtualService needs to point to the correct Service name, but this can simply be handled by ordering the reconciliation correctly, so we know the Service name before reconciling the VirtualService.

@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Nov 17, 2024
@mishraprafful mishraprafful force-pushed the fix/tensorboard-notebook-name-conflict branch from 616cc4a to 1816a25 Compare November 17, 2024 20:26
@mishraprafful mishraprafful marked this pull request as draft November 17, 2024 20:26
@mishraprafful mishraprafful force-pushed the fix/tensorboard-notebook-name-conflict branch from 1816a25 to 26f7c81 Compare November 17, 2024 20:28
mishraprafful and others added 5 commits November 17, 2024 21:38
Signed-off-by: mishraprafful <mishraprafful@gmail.com>
Signed-off-by: mishraprafful <mishraprafful@gmail.com>
Signed-off-by: mishraprafful <mishraprafful@gmail.com>
Signed-off-by: mishraprafful <mishraprafful@gmail.com>
)

Signed-off-by: mishraprafful <mishraprafful@gmail.com>
@mishraprafful mishraprafful force-pushed the fix/tensorboard-notebook-name-conflict branch from b0af1d6 to a8fe655 Compare November 17, 2024 20:38
@mishraprafful mishraprafful marked this pull request as ready for review November 17, 2024 20:59
@mishraprafful
Copy link
Contributor Author

@thesuperzapper Apologies for the huge delay (life happened 😉 ), but I've updated the PR according to your last comment now.

@jiridanek
Copy link
Member

The key idea is to use the generateName field of every Kubernetes object, rather than using the name directly, which is how Kubernetes creates Pod names from deployments with those {DEPLOYMENT_NAME}-{POD_HASH}-{RANDOM_SUFFIX} names.

Thought that the default advice is to use deterministic names and take advantage of these subsequently, as per https://github.com/kubernetes-sigs/controller-runtime/blob/main/FAQ.md#q-my-cache-might-be-stale-if-i-read-from-a-cache-how-should-i-deal-with-that.

Using generateName means extra work, as per that FAQ

In the few cases when you cannot take advantage of deterministic names (e.g. when using generateName), it may be useful in to track which actions you took, and assume that they need to be repeated if they don't occur after a given time (e.g. using a requeue result). This is what the ReplicaSet controller does.

@jiridanek
Copy link
Member

jiridanek commented Nov 18, 2024

After a quick look, seems to me Indexer is not used, as is in Notebooks 2.0

	// setup field indexers on the manager cache. we use these indexes to efficiently
	// query the cache for things like which Workspaces are using a particular WorkspaceKind
	if err := helper.SetupManagerFieldIndexers(mgr); err != nil {
		setupLog.Error(err, "unable to setup field indexers")
		os.Exit(1)
	}

https://github.com/kubeflow/notebooks/blob/9d02acfdd707282385221590bc953f892099a94c/workspaces/controller/cmd/main.go#L129-L134

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tensorboard fails to connect if there is a notebook with the same name
4 participants