Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

T321886 #225

Merged
merged 9 commits into from
Nov 28, 2022
Merged

T321886 #225

merged 9 commits into from
Nov 28, 2022

Conversation

vivian-rook
Copy link
Collaborator

This would remove the need to mount nfs into worker nodes, and have pods mount the nfs through a hostpath to the worker node. Thus freeing us from needing to have special worker nodes for k8s.

I am aware of one breaking change.
Currently one can access nfs through a series of sim links on the worker node as follows:

rook@paws-k8s-worker-3:~$ ls -l /public/dumps/
total 8
lrwxrwxrwx 1 root root 54 Oct  3 13:05 incr -> /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/incr
lrwxrwxrwx 1 root root 70 Oct  3 13:05 pagecounts-all-sites -> /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pagecounts-all-sites
lrwxrwxrwx 1 root root 64 Oct  3 13:05 pagecounts-raw -> /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pagecounts-raw
lrwxrwxrwx 1 root root 59 Oct  3 13:05 pageviews -> /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pageviews
lrwxrwxrwx 1 root root 44 Oct  3 13:05 public -> /mnt/nfs/dumps-clouddumps1001.wikimedia.org/

It is unclear to me on why these links to other/ exist, as they are all accessible through public. As such anyone accessing data through /public/dumps/<incr/pagecounts-all-sites/pagecounts-raw/pageviews> would experience a file not found kind of error with this change. This seems manageable through an announcement of the change. Though is anyone aware of why these links exist to begin with?

Copy link
Contributor

@david-caro david-caro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good 👍 , have not tested it yet though, will approve when I get to it.

I got some questions also, I'm not very familiar with paws/jupyterhub.

Thanks!

paws/templates/legacy.yaml Outdated Show resolved Hide resolved
- 'persistentVolumeClaim'
allowedHostPaths:
- pathPrefix: '/public/dumps'
- pathPrefix: '/mnt/public/dumps'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, this is needed only for the dev environment right? If so, can we put it behind a flag like the other dev specific bits?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good point. I think we can drop this entirely as psp is not used in a local dev env. Also this would likely vanish with https://phabricator.wikimedia.org/T317787

paws/values.yaml Show resolved Hide resolved
paws/values.yaml Show resolved Hide resolved
Copy link
Member

@supertassu supertassu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The symlinks in /public/dumps still need to be there.

paws/templates/legacy.yaml Outdated Show resolved Hide resolved
@vivian-rook
Copy link
Collaborator Author

The symlinks in /public/dumps still need to be there.

What is their purpose?

@supertassu
Copy link
Member

The symlinks in /public/dumps still need to be there.

What is their purpose?

Those are the main documented path for the dump files, everything else is an implementation detail that people shouldn't need to care about.

@vivian-rook
Copy link
Collaborator Author

The symlinks in /public/dumps still need to be there.

What is their purpose?

Those are the main documented path for the dump files, everything else is an implementation detail that people shouldn't need to care about.

Where are these documented? Beyond the main public directory, which would still exist, I'm finding /public/dumps/pagecounts-raw in wikitech (https://wikitech.wikimedia.org/wiki/Help:Toolforge/Dumps) though not the others. Not sure how much I should trust wikitech search though.

paws/values.yaml Show resolved Hide resolved
paws/values.yaml Show resolved Hide resolved
@david-caro
Copy link
Contributor

I'm having unrelated issues trying to test this locally, for some reason the minikube ingress controller addon refuses to start (it seems it fails to connect to the apiserver)

@david-caro
Copy link
Contributor

david-caro commented Nov 2, 2022

I'm having unrelated issues trying to test this locally, for some reason the minikube ingress controller addon refuses to start (it seems it fails to connect to the apiserver)

Found a workaround kubernetes-sigs/kind#2240 will test soon Called it too soon, not working yet

@vivian-rook
Copy link
Collaborator Author

I'm having unrelated issues trying to test this locally, for some reason the minikube ingress controller addon refuses to start (it seems it fails to connect to the apiserver)

Found a workaround kubernetes-sigs/kind#2240 will test soon Called it too soon, not working yet

You are running minikube addons enable ingress to start the ingress?

@david-caro
Copy link
Contributor

I'm having unrelated issues trying to test this locally, for some reason the minikube ingress controller addon refuses to start (it seems it fails to connect to the apiserver)

Found a workaround kubernetes-sigs/kind#2240 will test soon Called it too soon, not working yet

You are running minikube addons enable ingress to start the ingress?

Yep, upgraded to the latest minikube too, tried changing the default iptables input to accept all (both on my laptop and the minikube VM), tried using the legacy iptables (iptables-legacy), so far no luck.

david-caro
david-caro previously approved these changes Nov 3, 2022
Copy link
Contributor

@david-caro david-caro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finally was able to run it locally, on fedora with podman xd
Debian was unable to start the storage provisioner or the ingress :S

Seems to work ok locally 👍

+1 from me after addressing taavi's comment

@supertassu
Copy link
Member

Where are these documented? Beyond the main public directory, which would still exist, I'm finding /public/dumps/pagecounts-raw in wikitech (wikitech.wikimedia.org/wiki/Help:Toolforge/Dumps) though not the others. Not sure how much I should trust wikitech search though.

https://wikitech.wikimedia.org/wiki/Help:Shared_storage#.2Fpublic.2Fdumps documents the root path. Paths below it aren't documented, but people's scripts depend on them not changing. Also it'd be very confusing if the setups on Toolforge and PAWS were different.

paws/production.yaml Outdated Show resolved Hide resolved
In order to simplify transition to dropping links, updating
dumps path from /public/dumps to /public/dumps/public which
resembles what is currently in place. This should allow for
a seamless transition between the current config and this
config.
@vivian-rook
Copy link
Collaborator Author

@supertassu I apologize, but I don't have a way to address your concerns. That we should not update the paths because the current paths are what people expect, or that the change will cause PAWS to diverge from toolforge are instances of argumentum ad antiquitatem and argumentum ad populum. As such they cannot be addressed.

You are correct, assuming any scripts call these paths, they will break until patched with an updated path. This can be remedied with an email to cloud announce. It does not justify forever having links built into the container with a note announcing that we have no real reason to be doing this, it just has always been done that way. Unnecessary code and configurations need not be preserved when located.

That toolforge is similarly poorly configured in this regard is not an argument that PAWS must be as well. It is rather an argument that toolforge should be updated as well. I leave it to those who manage it to decide if they are up to making that transition.

@Krinkle
Copy link

Krinkle commented Nov 7, 2022

@vivian-rook Please forgive my ignorance of having no familiarity with the PAWS setup.

I believe the objective is to remove a major obstacle that otherwise requires unusual things to be maintained by WMCS, layers of complex mounts or something, that sounds worthwhile and good! It also sounds like maintaining a simple symlink for ecosystem support is deemed infeasible. I don't understand why, but I won't question that.

Instead, as a external bystander familiar with most concepts at play but unfamiliar with WMCS/PAWS internals, I would ask that you explain for future reference what technical specifics in the present make it expensive/difficult/impossible to maintain the stable API through a symlink? It would seem to a novice like me, that one could both rid the complex mounts and also leave a symlink. It seems that way because the migration guide says to reference /A instead of /B from otherwise the same apparent execution context.

I recognise the potential appeal to tradition here, but I also recognise an as of yet unexplained benefit to naming it A instead of B for end-users (or argumentum ad novitatem, if you will). I presume we don't hold a principal belief that symlinks/redirects are inherently not worthwhile in our ecosystem. There's presumably a technical reason that hinders this specific case. Understanding that reason a bit better would help encourage advocates like myself to help fellow community members migrate, fueled by feeling that the change is justified. 🙂

@vivian-rook
Copy link
Collaborator Author

vivian-rook commented Nov 7, 2022

@Krinkle your input is appreciated, bystander or no.

You have given an accurate overview of the current setup and proposed change. The reason that the links came up is how they are implemented would no longer be viable with a move to a more cloud native setup. Which is to say, a setup where we could drop PAWS on any k8s that we are given to drop it on. Currently that cannot happen. This patch would fix one such issue, specifically that PAWS currently relies on its worker nodes to mount NFS, which a PAWS pod would hostPath into the pod. This change would have NFS mounted directly into the pod, cutting out the need for a VM level configuration.

The problem with the links comes from their creation mechanism which would, effectively, cease to exist. As such it would have to be intentionally recreated. And it could be. Something like:

RUN ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/incr /public/dumps/incr
RUN ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pagecounts-all-sites /public/dumps/pagecounts-all-sites
RUN ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pagecounts-raw /public/dumps/pagecounts-raw 
RUN ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/other/pageviews /public/dumps/pageviews
RUN ln -s /mnt/nfs/dumps-clouddumps1001.wikimedia.org/ /public/dumps/public

Technically, easy as can be. We run into a problem when we document it, as good documentation doesn't tell us what we are doing, but why we are doing it:

# The above serves no purpose. It was discussed, and recognized that it serves no purpose, 
# but we don't remove things that could pose a breaking change. So the links shall remain,
# and be rebuilt in the situation that they would no longer work.

Everyone who contributes here, has reasons to. Projects that make it clear that if you find arbitrary complexity, you shouldn't fix it if anyone will be able to notice, are not terribly attractive to work on in my perspective. I want to make it clear that this is a project where if you have something to contribute, and it is better on the other side, we can go through the discomfort of updating it. I do not want to make it clear that we will not do that.

On a technical side, we are not replacing A with B. Rather we are removing B, because A is already inclusive of B

"It seems that way because the migration guide says to reference /A instead of /B from otherwise the same apparent execution context" Could you link to this?

@vivian-rook
Copy link
Collaborator Author

vivian-rook commented Nov 7, 2022

I feel it is important to describe another perspective that doesn't seem as recognized as I feel it should be.

I do not believe that there is any argument that the proposed patch does not reduce complexity of the code. If there is such an argument, please bring it forward. Though it appears that there is uniform agreement that the patch would make the code less complex. The argument against removing the links, is that it is "not important enough" to do. Which is a very fuzzy statement to make. We should not be doing that, because it allows for bias to enter the conversation. It isn't a clear cut "it's more complex" or "it's less complex" conversation. If we accept that we can say something is insufficiently important, yet clearly an improvement, that standard can be used to mask prejudice. We should limit that where we can.

@vivian-rook
Copy link
Collaborator Author

While I don't find it relevant, I suspect it will be deemed compelling. Having checked the home directories of the PAWS instances running today none of them reference the links that will be removed with this pr. Only two reference /public/dumps/public, which will not be removed, one of which is my instance while I was poking around.

@dhinus
Copy link
Member

dhinus commented Nov 14, 2022

My 2c: I vote for removing the links, especially after the last comment from @vivian-rook which implies there will be no impact whatsoever to users.

In general, I think it's not just an argumentum ad novitatem (though I'm conscious I'm personally biased in favour of those kind of arguments 😛) but also a matter of clarity and simplicity for end users. If we have two ways or doing something, I think it's (in general) good to choose one and deprecate the other, giving sufficient advance warning to our existing users. I might be convinced to retain two in case removing one is likely to cause big inconveniences to our users, but that doesn't seem to be the case in this specific instance.

@david-caro
Copy link
Contributor

david-caro commented Nov 14, 2022

Adding my opinion here: I think that the links should be removed, and the new paths documented, but in both toolforge and paws at the same time (or one right after the other, point being after one changing making the other change the next priority), being part of the same ecosystem there's value in consistency and following the principle of least surprise (https://en.wikipedia.org/wiki/Principle_of_least_astonishment).

That makes the change a bit bigger than just this, but makes sure we keep both systems access to dumps consistent.

@nskaggs
Copy link

nskaggs commented Nov 18, 2022

I appreciate everyone's input on this proposed change. Several good points of consideration were raised, including concern over breakage. I believe @vivian-rook has done some due diligence to understand the removal of those specific symlinks shouldn't cause widespread breakage. If they do, we can react, fix, and move on. Either way, I would encourage us to not fear the unknown or what might happen. https://en.wikipedia.org/wiki/Wikipedia:Be_bold. Making changes to improve PAWS will come with some cost. Let's be mindful of those costs as they occur, but nevertheless, let's move forward in making a better tool for everyone. There are other bold ideas for PAWS. Let's pursue them!

On toolforge, it sounds like this may also be a useful change to make. Let's file a phabricator ticket for separate consideration and prioritization in that system. I wouldn't want us to accidentally prevent this change from occurring by linking it unnecessarily to other work.

+1 from me. Please merge.

Copy link
Member

@chicocvenancio chicocvenancio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving away from hostMounts is great. I do think this kind of change warrants wide announcement though, perhaps a link in c.JupyterHub.template_vars['announcement'] as well as an announcement in tech news and cloud-announce.

@vivian-rook
Copy link
Collaborator Author

Moving away from hostMounts is great. I do think this kind of change warrants wide announcement though, perhaps a link in c.JupyterHub.template_vars['announcement'] as well as an announcement in tech news and cloud-announce.

I hadn't considered the banner, or tech news, though I did note the change in cloud-announce a few weeks ago. Also it would appear that while the change seems notable, it won't have an impact, at least the sample of running instances were not using the paths that are known to be removed. Also also, the patch is actually already out, has been since this morning, I was just giving it a few hours to see if anything crashed dramatically.

@chicocvenancio
Copy link
Member

Fair enough, I missed the email to cloud-announce a few weeks ago.

@vivian-rook vivian-rook merged commit ce2b1f5 into main Nov 28, 2022
@vivian-rook vivian-rook deleted the T321886 branch April 24, 2023 13:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
7 participants