Support Windows #1826

Closed
imjasonh opened this issue Jan 8, 2020 · 43 comments
Labels
design This task is about creating and discussing a design Epic Issues that should be considered as Epics (aka multiple sub-tasks, …) kind/design Categorizes issue or PR as related to design. kind/feature Categorizes issue or PR as related to a new feature. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete.

Comments

@imjasonh
Member

imjasonh commented Jan 8, 2020

This issue tracks work needed to make Tekton run on a Kubernetes cluster with Windows nodes. Please contribute your use cases, ideas and experience.

First, some assumptions I've been operating under:

  1. Tekton controller components themselves don't need to run on Windows nodes. The cluster could have N Linux nodes and M Windows nodes, run the controller on one of the Linux nodes, and run requested workloads on Windows nodes. If we decide full-Windows clusters are the goal, then we'll need to expand this work to cover getting controller components built and running for Windows.
  2. The user has to explicitly specify they want a TaskRun to execute on a Windows node, using nodeSelector -- Tekton isn't responsible for somehow detecting that a TaskRun should run on a Windows node. We might change how a TaskRun's Pod is generated based on the user's specification, but ideally we wouldn't have to, for simplicity.

If either of these is contentious, or you have more to add, please comment.

Next, some things I'm fairly sure will break without trying:

  1. Building Images: Entrypoint binary injection is done today by prepending a step containing the entrypoint binary, which copies that binary to /tekton/entrypoint -- to support Windows, that binary will have to be built to run on Windows, and will have to be based on some minimal Windows image (i.e., not distroless as it is today). This is doable with manifest lists, which ko has some support for. Likewise, internal support images used to implement resources (git-init, creds-init, google/cloud-sdk, etc.) will need to be built for Windows.

  2. File Paths: Step ordering works by looking for files to exist at some predetermined path, and those paths are specified using /-separated paths (/tekton/tools, /tekton/downward/ready, etc.) -- this is likewise true for /workspace, /tekton/home, etc. If we decide in the future to use fsnotify to watch files instead of polling, we'll have to make sure that works for Windows too (or fall back to polling).

  3. Script Mode: Script mode is implemented today by writing an executable file to /tekton/scripts/something, then invoking it as the Command in the step container. Writing this file and making it executable, as well as where it exists, will require work to support Windows.
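The file-path concern in item 2 can be sketched roughly as follows. This is a Python illustration only, not Tekton's Go implementation: `wait_for_file` and its `timeout` and `interval` parameters are hypothetical names. It shows a polling wait on a marker file, and how the same /-separated string is interpreted under POSIX and Windows path rules.

```python
import os
import tempfile
import time
from pathlib import PurePosixPath, PureWindowsPath


def wait_for_file(path, timeout=5.0, interval=0.05):
    """Poll until `path` exists; return True if it appeared before the timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Poll for a marker file in a scratch directory (a stand-in for a path
# such as /tekton/downward/ready; the real layout is not reproduced here).
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "ready")
    missing = wait_for_file(marker, timeout=0.2, interval=0.01)
    open(marker, "w").close()
    found = wait_for_file(marker, timeout=0.2, interval=0.01)

# The same "/"-separated string parsed under each path flavour.
posix = PurePosixPath("/tekton/tools/entrypoint")
windows = PureWindowsPath("/tekton/tools/entrypoint")
print(missing, found)
print(posix.is_absolute(), windows.is_absolute())
print(str(windows))
```

On Windows the path has no drive letter, so it is rooted but not absolute, and it is rendered with backslashes. That is the kind of difference the step-ordering code would need to tolerate.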


@imjasonh imjasonh added kind/feature Categorizes issue or PR as related to a new feature. design This task is about creating and discussing a design Epic Issues that should be considered as Epics (aka multiple sub-tasks, …) priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. labels Jan 8, 2020
@vdemeester vdemeester added the kind/design Categorizes issue or PR as related to design. label Jan 8, 2020
@afrittoli
Member

@imjasonh any special reason for this to be on the priority list? I will remove it for now, feel free to bring it back to the API WG if needed

@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot
Collaborator

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 14, 2020
@tekton-robot
Collaborator

@tekton-robot: Closing this issue.


@vdemeester
Member

/remove-lifecycle rotten
/remove-lifecycle stale
/reopen

@tekton-robot tekton-robot reopened this Aug 17, 2020
@tekton-robot
Collaborator

@vdemeester: Reopened this issue.


@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 17, 2020
@tekton-robot
Collaborator

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.

/lifecycle stale

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 15, 2020
@tekton-robot
Collaborator

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten with a justification.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close with a justification.
If this issue should be exempted, mark the issue as frozen with /lifecycle frozen with a justification.

/lifecycle rotten

Send feedback to tektoncd/plumbing.

@tekton-robot tekton-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 15, 2020
@lippertmarkus

/remove-lifecycle rotten

@tekton-robot tekton-robot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Dec 15, 2020
@lukehb

lukehb commented Jan 20, 2021

@imjasonh In the spirit of your "please contribute your use cases, ideas and experience" statement above, I'd like to chime in here. We are building out a CI/CD product on top of Tekton, specifically for building Unreal Engine projects. We are quite close to getting something in the hands of users; however, Windows support is a blocker for adoption of our product so we are highly motivated to see Windows support land in Tekton. I have been following this issue basically since it was opened and was wondering how things are going internally on this front?

If it is a matter of putting in the legwork, we may be able to help by collaborating with the Tekton maintainers to make it happen and get it upstreamed.

@imjasonh
Member Author

If it is a matter of putting in the legwork, we may be able to help by collaborating with the Tekton maintainers to make it happen and get it upstreamed.

Thanks @lukehb that's really helpful. I'm not aware of any active effort to work on this, so your help will be useful. :)

I think the first step would be figuring out what doesn't work currently. What happens when a TaskRun pod executes on a Windows node? What happens when the controller and webhook components run on a Windows node? I suspect basically nothing will work at first, but having an idea exactly what breaks will help us move forward.

The entrypoint binary we inject into each TaskRun pod is not currently built for Windows, but I believe it could be relatively easily. Same for the controller components. I suspect this will end up being the first work, and likely far from the last. :)

@lukehb

lukehb commented Jan 20, 2021

I'm not aware of any active effort to work on this, so your help will be useful. :)

Okay, in that case we will have some internal discussions about prioritising this on our roadmap. We will likely have a lot of questions to make sure we are on the right track. What is the best place to riff back and forth? The Tekton Slack?

@vdemeester
Member

This is slightly related to #856 although more complicated 😅. Following up on @imjasonh's thoughts, we also need to discuss what the minimum viable first iteration of supporting Windows would be. I would guess being able to run a TaskRun on Windows (with the controller & co on Linux) would be a good start, but we can discuss this.

@lukehb there are different channels. For short discussions, Slack or the tekton-dev mailing list would be appropriate. To start designing the feature, dig into why we want this, and detail how we could do it, there is the TEP process. An example related to other architectures is TEP-0019. We can bootstrap this by having a discussion in the working groups (the main WG or the API one).

@aiden-deloryn
Contributor

@imjasonh Is there any way I can help out with the entrypoint issue to keep this moving forward? I'd be happy to make some tweaks and test the options discussed here if needed.

@aiden-deloryn
Contributor

I was able to get ko to build a valid entrypoint image by removing the file extension .exe on: pkg/build/gobuild.go#L440 and pkg/build/gobuild.go#L700 which means that the file extension is not needed.

With this change the entrypoint is working and we don't need to modify the current behaviour of Tekton. I did notice though that I can't pull the image if it includes files from kodata. I haven't figured out why that is yet...

The error I'm getting with kodata included is:

failed to register layer: re-exec error: exit status 1: output: open \\?\C:\ProgramData\docker\tmp\hcs370407523\Files\var\run\ko\refs\heads\controller-windows-images: The system cannot find the path specified.

I've checked the image layer on disk and the files do exist.

@TBBle

TBBle commented Jul 20, 2021

If you run dockerd with the DOCKER_WINDOWSFILTER_NOREEXEC env-var set non-empty, you should get more information about that failure in the logs.

My immediate guess is that the file or one of the elements of the path is actually a link to an earlier layer, but the target is missing for some reason. It also could be a symlink with no target, which should work, but possibly something in the import process is trying to follow the link when it shouldn't be.
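One way to test the "symlink with no target" guess is to scan the extracted layer directory for links whose target is missing. The following is a minimal Python sketch under that assumption; `find_dangling_symlinks` is a hypothetical helper, not part of ko, Docker, or Tekton, and the demo builds its own scratch directory rather than touching a real image layer. The demo needs permission to create symlinks.

```python
import os
import tempfile


def find_dangling_symlinks(root):
    """Return sorted paths under `root` that are symlinks with a missing target."""
    dangling = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists follows links, so it is False for a broken link.
            if os.path.islink(path) and not os.path.exists(path):
                dangling.append(path)
    return sorted(dangling)


# Scratch directory standing in for an unpacked layer.
with tempfile.TemporaryDirectory() as layer:
    target = os.path.join(layer, "real.txt")
    open(target, "w").close()
    os.symlink(target, os.path.join(layer, "good-link"))
    os.symlink(os.path.join(layer, "missing"), os.path.join(layer, "bad-link"))
    broken = [os.path.basename(p) for p in find_dangling_symlinks(layer)]

print(broken)
```

If a scan like this reports entries for the kodata files, that would point towards the link-handling explanation rather than a genuinely missing file.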

@imjasonh
Member Author

imjasonh commented Jul 28, 2021

The ko Windows PR is merged! 🎉

I was able to get ko to build a valid entrypoint image by removing the file extension .exe on: pkg/build/gobuild.go#L440 and pkg/build/gobuild.go#L700 which means that the file extension is not needed.

Oh the .exe suffix isn't needed? That simplifies things, I'll make that change to ko then (ko-build/ko#400) -- thanks!

I've rebuilt the gcr.io/imjasonh/combined base image, and entrypoint with the exe-dropping change, please try them out and see what else breaks 😆

gcr.io/imjasonh/github.com/tektoncd/pipeline/cmd/entrypoint@sha256:7f6d879ca7b359324121e0196c55f5185173bad75ca0b3f0a2e9201a00981c09

edit: and nop image:

gcr.io/imjasonh/github.com/tektoncd/pipeline/cmd/nop@sha256:4a3b8ff0e78d404b842edf732f08788f8970319f1878c2b8de0620025dd60e02

With this change the entrypoint is working and we don't need to modify the current behaviour of Tekton. I did notice though that I can't pull the image if it includes files from kodata. I haven't figured out why that is yet...

Yeah, while I was working on the ko PR I had quite a lot of trouble with the kodata part. The PR as merged should support kodata now, with the exception of files in kodata that are symlinks, which will just be ignored for Windows images. If this is something we think we need we can investigate re-adding it, I don't think it should block entrypoint work at least.

@aiden-deloryn
Contributor

@imjasonh Good job on the ko PR. I've done some testing with the ko built images you provided and it appears everything is working! 😄 🎉

The only outstanding problem for Windows that I am aware of is support for script mode which is being addressed by #4128.

windows-task.yaml
apiVersion: tekton.dev/v1beta1
kind: Task
metadata:
  name: windows-task
spec:
  workspaces:
    - name: output-workspace
      mountPath: c:/output
  steps:
  - name: clone-repo
    image: aidendeloryn/windows-base-example:latest
    command: ["git"]
    args:
      - "clone"
      - "--depth"
      - "1"
      - "--single-branch"
      - "https://github.com/google/ko.git"
      - "C:\\workspace"
  - name: build-binary
    image: aidendeloryn/windows-base-example:latest
    command: ["go"]
    args:
      - "build"
      - "-a"
      - "-v"
      - "-o"
      - "ko.exe"
      - "main.go"
  - name: print-version
    image: aidendeloryn/windows-base-example:latest
    command: ["ko"]
    args:
      - "version"
  - name: print-help
    image: aidendeloryn/windows-base-example:latest
    command: ["ko"]
    args:
      - "--help"
  - name: copy-bin
    image: aidendeloryn/windows-base-example:latest
    command: ["cmd"]
    args:
      - "/c"
      - "copy"
      - "C:\\workspace\\ko.exe"
      - "C:\\output\\ko.exe"
---
apiVersion: tekton.dev/v1beta1
kind: TaskRun
metadata:
  name: windows-taskrun
spec:
  workspaces:
    - name: output-workspace
      persistentVolumeClaim:
        claimName: local-workspace-pvc
  taskRef:
    name: windows-task
  podTemplate:
    nodeSelector:
      kubernetes.io/os: windows

$ tkn taskrun list
NAME              STARTED         DURATION   STATUS
windows-taskrun   8 minutes ago   1 minute   Succeeded

$ kubectl get pod windows-taskrun-pod-tfwvq
NAME                        READY   STATUS      RESTARTS   AGE
windows-taskrun-pod-tfwvq   0/5     Completed   0          2m26s

$ kubectl describe pod windows-taskrun-pod-tfwvq
Name:         windows-taskrun-pod-tfwvq
Namespace:    default
Priority:     0
Node:         k8s-windows/10.115.11.52
Start Time:   Mon, 02 Aug 2021 11:34:24 +1000
Labels:       app.kubernetes.io/managed-by=tekton-pipelines
              tekton.dev/task=windows-task
              tekton.dev/taskRun=windows-taskrun
Annotations:  pipeline.tekton.dev/release: devel
              tekton.dev/ready: READY
Status:       Succeeded
IP:           10.244.1.132
IPs:
  IP:           10.244.1.132
Controlled By:  TaskRun/windows-taskrun
Init Containers:
  place-tools:
    Container ID:  docker://275d296316b81c314e069daad78989dbe243af9777c16e76224560cd55cf84f3
    Image:         gcr.io/imjasonh/github.com/tektoncd/pipeline/cmd/entrypoint@sha256:7f6d879ca7b359324121e0196c55f5185173bad75ca0b3f0a2e9201a00981c09
    Image ID:      docker-pullable://gcr.io/imjasonh/github.com/tektoncd/pipeline/cmd/entrypoint@sha256:7f6d879ca7b359324121e0196c55f5185173bad75ca0b3f0a2e9201a00981c09
    Port:          <none>
    Host Port:     <none>
    Command:
      /ko-app/entrypoint
      cp
      /ko-app/entrypoint
      /tekton/tools/entrypoint
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 02 Aug 2021 11:34:26 +1000
      Finished:     Mon, 02 Aug 2021 11:34:27 +1000
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /tekton/tools from tekton-internal-tools (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-ndjzr (ro)
...

@imjasonh
Member Author

imjasonh commented Aug 2, 2021

@imjasonh Good job on the ko PR. I've done some testing with the ko built images you provided and it appears everything is working! 😄 🎉

The only outstanding problem for Windows that I am aware of is support for script mode which is being addressed by #4128.

This is great news!! 🎉

Before we release and document Windows support, we also need to:

  • bring in the combine script that makes a distroless+windows frankenbase image, and integrate it into the release process
  • write some e2e tests that cover basic Windows support, and script mode support for Windows. This will require some modifications to our dogfooding/CI cluster to add a Windows node pool.

Once we have e2e tests and nightly builds verifying the behavior, we can get this out the door for end users. That shouldn't block our own internal guinea pig testing, though.

@imjasonh
Member Author

With the last few PRs merged, we have a Tekton nightly release that includes an entrypoint image built for Windows! 🎉

Please try it out and report any bugs or issues:

kubectl apply -f https://storage.googleapis.com/tekton-releases-nightly/pipeline/previous/v20211013-405c0093a8/release.yaml

Unless there are any show-stoppers, Windows support should be included in the next release, v0.29, which is scheduled to be released very soon.

@lippertmarkus

I just tried out hybrid workflows:

apiVersion: tekton.dev/v1beta1
kind: Pipeline
metadata:
  name: mypipeline
spec:
  tasks:
    - name: task-win
      taskSpec:
        steps:
          - name: hello-windows
            image: mcr.microsoft.com/windows/nanoserver:1809
            command: ["cmd", "/c"]
            args: ["echo", "Hello from Windows Container!"]
    - name: task-lin
      taskSpec:
        steps:
          - name: hello-linux
            image: alpine
            command: ["echo"]
            args: ["Hello from Linux Container!"]        
---
apiVersion: tekton.dev/v1beta1
kind: PipelineRun
metadata:
  name: mypipelinerun
spec:
  pipelineRef:
    name: mypipeline
  taskRunSpecs:
    - pipelineTaskName: task-win
      taskPodTemplate:
        nodeSelector:
          kubernetes.io/os: windows
        securityContext:
          # without this I get the error: Creating a symlink "/tekton/steps/0": symlink \tekton\steps\step-hello-windows /tekton/steps/0: A required privilege is not held by the client.
          windowsOptions:
            runAsUserName: "ContainerAdministrator"
    - pipelineTaskName: task-lin
      taskPodTemplate:
        nodeSelector:
          kubernetes.io/os: linux

I found that running the Windows container as the ContainerAdministrator user is mandatory (see the comment within the YAML). Is there a way we can work around that? As with Linux containers, running Windows containers with full privileges isn't good practice.

Apart from that it works great, awesome work!
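For pipelines with many tasks, the per-task taskRunSpecs entries in the PipelineRun above could be generated rather than written by hand. This is a rough Python sketch, assuming a simple mapping from pipeline task name to target OS; `task_run_specs` is a hypothetical helper, and the field names are copied from the YAML in this comment.

```python
import json


def task_run_specs(task_os, windows_user="ContainerAdministrator"):
    """Build taskRunSpecs entries that pin each pipeline task to an OS."""
    specs = []
    for task_name, os_name in task_os.items():
        pod_template = {"nodeSelector": {"kubernetes.io/os": os_name}}
        if os_name == "windows":
            # Mirrors the runAsUserName setting reported as necessary above.
            pod_template["securityContext"] = {
                "windowsOptions": {"runAsUserName": windows_user}
            }
        specs.append(
            {"pipelineTaskName": task_name, "taskPodTemplate": pod_template}
        )
    return specs


specs = task_run_specs({"task-win": "windows", "task-lin": "linux"})
print(json.dumps(specs, indent=2))
```

The output is JSON, which is also valid YAML, so it can be pasted under spec.taskRunSpecs.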

@TBBle

TBBle commented Oct 14, 2021

Oh that's tricky. The things that would normally grant symlink-making privileges to non-admin users (Developer Mode, Local Security Policy, Group Policy) are not really container-suitable. I'm kind-of surprised no one's hit this before, but I couldn't see anyone else bouncing off this in Windows containers.

I feel like this is a kubernetes-containerd-hcsshim-Microsoft problem, and would ideally have an option to grant the relevant privilege (SeCreateSymbolicLinkPrivilege) in securityContext.windowsOptions, similar to the existing (Linux) securityContext.capabilities field.

In the meantime, it might be possible for an appropriate RUN command to be added to the Windows container entrypoint, something like

RUN ntrights +r SeCreateSymbolicLinkPrivilege -u "User Manager\ContainerUser"

referencing this and this, and assuming a copy of ntrights.exe is sourced from the Windows 2003 resource kit.

There are probably other ways of doing this; I saw a wide variety of PowerShell scripts to do it too. However, the entrypoint image is nanoserver-based, so PowerShell is not present.

So perhaps someone can knock together a Go program to RUN to grant the privilege, since we have Go in the entrypoint image pipeline already.
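The privilege-granting program suggested above is not attempted here. As a smaller first step, a probe can check whether the current user is allowed to create symlinks at all, which is the operation that fails with "A required privilege is not held by the client". This is a minimal Python sketch; `can_create_symlink` is a hypothetical helper and is not part of the Tekton entrypoint.

```python
import os
import tempfile


def can_create_symlink(directory):
    """Return True if the current user may create a symlink inside `directory`."""
    target = os.path.join(directory, "probe-target")
    link = os.path.join(directory, "probe-link")
    open(target, "w").close()
    try:
        os.symlink(target, link)
    except (OSError, NotImplementedError):
        # On Windows this is where a missing privilege would surface.
        return False
    else:
        os.remove(link)
        return True
    finally:
        # Always clean up the probe target, whatever the outcome.
        os.remove(target)


with tempfile.TemporaryDirectory() as tmp:
    allowed = can_create_symlink(tmp)
    leftovers = os.listdir(tmp)

print("symlink creation allowed:", allowed)
```

Running a probe like this as ContainerUser and as ContainerAdministrator would confirm whether a given workaround actually changed the outcome.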

@lippertmarkus

thanks for the workaround!

@afrittoli
Member

Hello folks, great work on this feature!

I plan to make the Tekton v0.29 release tomorrow.
Questions:

  • Do you think 🪟 Windows support 🪟 is ready to be marked as available in the release notes?
  • In terms of limitations, I understand Windows containers need to run in privileged mode. Anything else I should mention in the release notes?
    Thank you!

@imjasonh
Member Author

Do you think 🪟 Windows support 🪟 is ready to be marked as available in the release notes?

Yes please! I would like to try to use clear wording that we "provide an entrypoint image that can execute TaskRuns on Windows nodes", and not that we "support" it. This behavior is today not covered by e2e tests, and it's very early days. This is similar to conversations we had around including other CPU architectures in Tekton releases -- they're "provided", but might not be guaranteed to be "supported" if you experience bugs. The community will do its best. 😄

But at the same time, I want to get as much use and feedback and bug reports as we can get, since this will be helpful in our path toward "supporting" it in the future. So having a release that people can use to try it out is a very exciting step forward. 🎉

In terms of limitations, I understand Windows containers need to run in privileged mode. Anything else I should mention in the release notes?

Aside from lack of reliable support (😅), nothing comes to mind. If others on this thread are more aware of limitations we're currently imposing, please add them.

@aiden-deloryn
Contributor

In terms of limitations, I understand Windows containers need to run in privileged mode. Anything else I should mention in the release notes?

I have just a few things that might be worth mentioning:

  • PipelineResources have not been tested for Windows tasks and probably won't work.
  • If a Windows task and a Linux task are both sharing a PersistentVolumeClaim workspace, the affinity assistant may need to be disabled (those two tasks cannot run on the same node).
  • There is a Windows page in the docs folder which outlines some important details about running Windows workloads in Tekton.
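The shared-workspace limitation in the second bullet can be checked mechanically before a run. This is a small Python sketch, assuming each task is described by a (task name, OS, PersistentVolumeClaim name) tuple; the helper `claims_shared_across_os` and the sample task and claim names are hypothetical.

```python
from collections import defaultdict


def claims_shared_across_os(tasks):
    """Return claim names used by tasks that target more than one OS.

    `tasks` is an iterable of (task_name, os_name, claim_name) tuples.
    """
    os_by_claim = defaultdict(set)
    for _task_name, os_name, claim_name in tasks:
        os_by_claim[claim_name].add(os_name)
    return sorted(c for c, oses in os_by_claim.items() if len(oses) > 1)


tasks = [
    ("build-win", "windows", "shared-pvc"),
    ("package-lin", "linux", "shared-pvc"),
    ("lint-lin", "linux", "scratch-pvc"),
]
conflicts = claims_shared_across_os(tasks)
print(conflicts)
```

Any claim reported here is shared by Windows and Linux tasks, which cannot be scheduled onto the same node, so the affinity assistant may need to be disabled for that pipeline.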

@afrittoli
Member

Shall we close this one and open separate issues for the remaining limitations, or would you prefer to keep this one open?

@imjasonh
Member Author

I'm fine with closing this one. 👍

10 participants