Support "antctl query endpoint" command out-of-cluster #1137

Closed
antoninbas opened this issue Aug 22, 2020 · 17 comments
Labels
good first issue, kind/design, lifecycle/stale

Comments

@antoninbas
Contributor

antoninbas commented Aug 22, 2020

Describe what you are trying to solve
Currently "antctl query endpoint" can only be run from inside the antrea-controller Pod.

Describe the solution you have in mind
We should be able to run "antctl query endpoint" from out-of-cluster, by providing the appropriate Kubeconfig file.

Describe how your solution impacts user flows
It makes the life of users easier by letting them use the command without exec'ing into the antrea-controller Pod first.

Describe the main design/architecture of your solution

  • The "antctl query endpoint" command becomes a "raw" command in antctl.
  • Using the K8s apiserver aggregation layer to proxy requests from the K8s apiserver to the Antrea Controller would require following the K8s conventions and defining an actual API resource to encode queries and their replies. Defining a CRD for this seems a bit excessive as well. So my plan is to connect directly to the Controller apiserver and keep using a raw HTTP request (/query-endpoint URL).
  • antctl needs to be able to establish a secure TLS connection to the Antrea Controller, so it needs to retrieve the CA certificate from the antrea-ca ConfigMap. This would also be useful for the "antctl supportbundle" command, which currently uses an insecure connection (see Antctl support-bundle command doesn't verify server certificate #758), so we should be able to share code.
  • To connect to the API Server, we can retrieve the ClusterIP for the Antrea Service from the K8s apiserver. "antctl supportbundle" uses the AntreaControllerInfo CRD and the Node IP directly, but I am not sure why it chose this option instead. (A rough sketch of this flow follows this list.)
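For reference, here is a rough client-go sketch of what that out-of-cluster flow could look like (not the actual implementation): read the CA bundle from the antrea-ca ConfigMap, look up the ClusterIP of the antrea Service, then send a raw HTTPS request to /query-endpoint. The ConfigMap data key ("ca.crt"), the port, and the query parameters are assumptions, and authentication (Bearer token) is omitted.

package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"net/http"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the user-provided kubeconfig.
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// CA certificate published by the Controller (the data key name is an assumption).
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "antrea-ca", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	caPool := x509.NewCertPool()
	caPool.AppendCertsFromPEM([]byte(cm.Data["ca.crt"]))

	// ClusterIP of the antrea Service fronting the Controller apiserver.
	svc, err := client.CoreV1().Services("kube-system").Get(context.TODO(), "antrea", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Raw HTTPS request to the Controller's query-endpoint URL (port and query parameters are illustrative).
	httpClient := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{RootCAs: caPool, ServerName: "antrea.kube-system.svc"},
	}}
	resp, err := httpClient.Get(fmt.Sprintf("https://%s:443/query-endpoint?namespace=default&pod=nginx", svc.Spec.ClusterIP))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}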

Alternative solutions that you considered

  • K8s apiserver aggregation
  • CRD

See above for discussion.

Test plan

  • Unit tests for the raw command
  • e2e test to ensure that the command can be run from out-of-cluster
@antoninbas added the kind/design label Aug 22, 2020
@antoninbas
Contributor Author

@tnqn I would appreciate your feedback on this
I started prototyping it, but did not tackle the "big" item (use a secure connection by retrieving the Antrea CA) yet.

@tnqn
Member

tnqn commented Aug 24, 2020

@antoninbas the proposal LGTM overall, just have a question on the API endpoint reachability.
By running "antctl query endpoint" from out-of-cluster, I assume you mean it could be somewhere that is not a K8s Node. The clusterIP may be not reachable, will you consider changing the service type to NodePort too?
Sharing something you may already know, there are 3 ways to access services from a machine outside the cluster:

  1. The first uses a "NodePort" or "LoadBalancer" Service type.
  2. I investigated the second way, "Access services, nodes, or pods using the Proxy Verb". However, Antrea uses the same "Authorization" header to pass the token, and it is erased by the kube-apiserver when it proxies the requests, so we cannot authenticate and authorize the request; this is not feasible for now.
  3. The third way runs a Pod in the cluster, and users' requests are sent from the Pod by using "kubectl exec" (we could also make antctl access the Pod's "exec" subresource without invoking kubectl; see the sketch after this list). Calicoctl works in a similar way.
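For reference, a minimal client-go sketch of that third option, going through the Pod's "exec" subresource without invoking kubectl; the Pod name, container name, and command below are placeholders.

package main

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/tools/remotecommand"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Build the URL of the Pod "exec" subresource (Pod/container names are placeholders).
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("kube-system").
		Name("antrea-controller-xxxxx").
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "antrea-controller",
			Command:   []string{"antctl", "query", "endpoint", "--help"},
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	// Stream the command's output back to the caller, like "kubectl exec" does.
	executor, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		panic(err)
	}
	if err := executor.Stream(remotecommand.StreamOptions{Stdout: os.Stdout, Stderr: os.Stderr}); err != nil {
		panic(err)
	}
}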

@antoninbas
Contributor Author

@tnqn, yes my bad, the last bullet point is incorrect:

To connect to the API Server, we can retrieve the ClusterIP for the Antrea Service from the K8s apiserver. "antctl supportbundle" uses the AntreaControllerInfo CRD and the Node IP directly, but I am not sure why it chose this option instead.

We can use the Node IP to access the Antrea Controller, which is what "antctl supportbundle" is doing now and that's what I tested. However, how can I enable TLS for the connection? The certificate used by the K8s apiserver includes the Node IP in the SAN list, but that's not the case for the Antrea apiserver. We could potentially make it work for the self-signed case, since we regenerate a certificate every time the controller starts, but not for the user-provided case.

I feel like option 3 is best here. I don't think using a NodePort Service would solve anything. What do you think of antctl exec'ing into the antrea-controller Pod directly, instead of using a different Pod? We could use that for "antctl supportbundle" (exec into antrea-controller / exec into antrea-agent) as well, and avoid using insecure connections.

@tnqn
Member

tnqn commented Aug 25, 2020

We can use the Node IP to access the Antrea Controller, which is what "antctl supportbundle" is doing now and that's what I tested. However, how can I enable TLS for the connection? The certificate used by the K8s apiserver includes the Node IP in the SAN list, but that's not the case for the Antrea apiserver. We could potentially make it work for the self-signed case, since we regenerate a certificate every time the controller starts, but not for the user-provided case.

This is resolvable: when authenticating the server certificate, a server name can be provided to check it against. The hostname (Node IP) is only checked if the server name is empty. Even when antrea-agent authenticates antrea-controller, it uses the server name (antrea.kube-system.svc) instead of the Service IP or Node IP.

type TLSClientConfig struct {
	// ServerName is passed to the server for SNI and is used in the client to check server
	// certificates against. If ServerName is empty, the hostname used to contact the
	// server is used.
	ServerName string
	// ...
}
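As a rough sketch (not actual antctl code), a rest.Config could be pointed at a Node IP while validating the server certificate against the Service DNS name; the port below is an assumption, and caBundle is the PEM CA retrieved from the antrea-ca ConfigMap.

package raw

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// controllerConfig builds a rest.Config that dials the Controller by Node IP
// while checking the server certificate against the Service DNS name.
func controllerConfig(kubeconfig, nodeIP string, caBundle []byte) (*rest.Config, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	config.Host = "https://" + nodeIP + ":10349" // Controller apiserver port (assumption)
	config.TLSClientConfig.ServerName = "antrea.kube-system.svc"
	config.TLSClientConfig.CAData = caBundle // CA from the antrea-ca ConfigMap
	return config, nil
}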

I feel like option 3 is best here. I don't think using a NodePort Service would solve anything. What do you think of antctl exec'ing into the antrea-controller Pod directly, instead of using a different Pod? We could use that for "antctl supportbundle" (exec into antrea-controller / exec into antrea-agent) as well, and avoid using insecure connections.

Yes, I agree this will work well and can address the security issue, and it may even let us get rid of APIServices. I don't quite remember why we didn't go with this approach at the very beginning. @jianjuns @weiqiangt do you have opinions on this?

@antoninbas
Contributor Author

Thanks @tnqn for the pointer. The ServerName in TLSClientConfig works for my use case, since I only need to connect to the Controller apiserver and I can configure the client as needed. However, it does not work for Agent connections for "antctl supportbundle", since we do not generate a certificate for Agents. I don't know about the drawbacks of option 3. Maybe one of the reasons for keeping APIServices was to facilitate integration with other software that may want to consume Antrea's internal APIs (which I know sounds a bit strange since these are internal) by letting them connect to the K8s apiserver.

@jianjuns
Contributor

Maybe one of the reasons for keeping APIServices was to facilitate integration with other software that may want to consume Antrea's internal APIs (which I know sounds a bit strange since these are internal) by letting them connect to the K8s apiserver.

Right. I would keep the possibility of exposing a public API which can enable potential integrations, even though right now we do not have many use cases. It is also a more direct and efficient way than Pod exec. And I do not see big issues with APIService.

I do not like to support many ways to access the API, but if you really believe remote execution and Pod exec are required for "antctl query", I would not be against it either. It sounds to me like APIService or CRD are better solutions though, if we want to consider integration with an external manager or UI later.
How can Pod exec implement supportbundle - how can the bundle be copied out of the Pod?

@antoninbas
Contributor Author

@tnqn so I was going to resolve this issue by connecting directly to the Controller and enabling secure mode by using ServerName in TLSClientConfig. But what do you think about upgrade concerns when using a nonResourceURL? In this case the API is only consumed by antctl, but we did talk earlier about providing some backwards-compatibility guarantees even for antctl. The alternative is to use a standard K8s versioned API group, potentially with an APIService. Do you think it's worth it here, and do you think it will work well for "query endpoint"? The only issue is that it seems a new API group may be needed; I cannot use "security", which is used for CRDs...

@tnqn
Member

tnqn commented Sep 16, 2020

@antoninbas I think using a standard K8s versioned API makes more sense now if we want to improve the feature later and keep compatibility. Although it's a "query", it's similar to the TokenReview and SubjectAccessReview APIs in K8s, which use resources to represent actions. These APIs only support the "create" verb, the query goes in the Spec of the Review resource and the response in the Status, and no resources are persisted anywhere. Examples below:
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/authentication/v1/types.go
https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/api/authorization/v1/types.go
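For illustration, a Review-style type for the endpoint query could look roughly like this (type and field names are hypothetical, not the actual Antrea API):

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EndpointQuery is submitted with "create"; the query goes in Spec and the
// apiserver fills in Status in the response. Nothing is persisted.
type EndpointQuery struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   EndpointQuerySpec   `json:"spec"`
	Status EndpointQueryStatus `json:"status,omitempty"`
}

// EndpointQuerySpec identifies the endpoint to query (field names are illustrative).
type EndpointQuerySpec struct {
	Namespace string `json:"namespace"`
	Pod       string `json:"pod"`
}

// EndpointQueryStatus carries the response, e.g. the NetworkPolicies that
// apply to the endpoint.
type EndpointQueryStatus struct {
	Policies []string `json:"policies,omitempty"`
}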

Do you think it could fall into ops, as it's kind of a troubleshooting action as well?

@antoninbas
Contributor Author

That makes sense. I assume that the fact that the Status can be very large here (e.g. it could be the entire set of Pods when the policy applies to all Pods) is not an issue, since we are not using a CRD and this is not stored in etcd. Will take a stab at this and use the "ops" API group.

@antoninbas self-assigned this Sep 18, 2020
@antoninbas
Contributor Author

@tnqn "ops" is also used for CRDs so it doesn't seem that I can use "ops". Do you think I should use "system" (which does not seem like a great fit) or define a new group?

@tnqn
Member

tnqn commented Sep 21, 2020

While I was thinking about a new group vs. existing groups, I realized the operation is essentially a series of queries against controlplane resources, and most of the logic would be the same even if it were executed on the client side, unless it has to be kept on the server side for other API consumers too.
Just like the way kubectl describe is implemented: could we query the appliedToGroups and addressGroups by specifying the Pod first, then query networkPolicies by the returned appliedToGroups and addressGroups, and finally format the output on the client side? We don't support filtering with such conditions in the current APIs, but it's not hard to add them: the networkpolicy store already has the required indices. We don't have a Pod indexer for groups, but a simple O(n) search should be fine for now and no different from the current implementation, and we can still add the indexer when we can leverage it for policy calculation, as discussed before.
The new filter conditions might be leveraged by the existing antctl get networkpolicy/appliedtogroups/addressgroups commands as well.
What do you think? If there are reasons why it should be kept on the server side, I feel none of the current aggregated API groups are a good fit; maybe it needs a new group.
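For illustration, the client-side flow described above could look roughly like this; the two helper functions are hypothetical stand-ins for controlplane API calls with the proposed filter parameters, returning canned data so the sketch is self-contained.

package main

import "fmt"

// Hypothetical stand-ins for controlplane API calls filtered by Pod / by group.
func listGroupsForPod(namespace, pod string) (appliedTo, addr []string, err error) {
	return []string{"appliedToGroup-1"}, []string{"addressGroup-1"}, nil
}

func listPoliciesForGroups(appliedTo, addr []string) ([]string, error) {
	return []string{"default/web-allow"}, nil
}

func main() {
	namespace, pod := "default", "nginx"
	// 1. Find the AppliedToGroups and AddressGroups that select the Pod.
	appliedTo, addr, err := listGroupsForPod(namespace, pod)
	if err != nil {
		panic(err)
	}
	// 2. Find the NetworkPolicies that reference any of those groups.
	policies, err := listPoliciesForGroups(appliedTo, addr)
	if err != nil {
		panic(err)
	}
	// 3. Format everything on the client side, kubectl-describe style.
	fmt.Printf("Pod %s/%s\n  AppliedToGroups: %v\n  AddressGroups: %v\n  NetworkPolicies: %v\n",
		namespace, pod, appliedTo, addr, policies)
}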

@antoninbas
Contributor Author

Thanks for commenting @tnqn. I do agree that handling everything client-side in antctl is an option, but don't you think we need a public API for these queries? I know that right now we are only looking at antctl as a client, but as part of possible integrations, there could be value in having this API be consumed by other tools, e.g. a UI. In that case, it would be better to have a public API so that: 1) clients don't have to implement the functionality locally using multiple dependent queries and 2) clients don't have to consume our internal "controlplane" API.

I am not in a rush to implement this (can be pushed to v0.11), so I'm happy to bring this up at the Community Meeting next week.

@tnqn
Member

tnqn commented Sep 22, 2020

Sure, I agree that if we plan to have other clients use the feature, a server-side implementation is better. We can discuss this in the community meeting.

@antoninbas
Contributor Author

At the September 28th Community Meeting (https://github.com/vmware-tanzu/antrea/wiki/Community-Meetings#september-28-2020), we decided that introducing this new API in the controlplane API group would be a good place to start.

@github-actions
Contributor

github-actions bot commented Apr 5, 2021

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

github-actions bot added the lifecycle/stale label Apr 5, 2021
@antoninbas removed the lifecycle/stale label Apr 5, 2021
@github-actions
Contributor

github-actions bot commented Oct 3, 2021

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

github-actions bot added the lifecycle/stale label Oct 3, 2021
@antoninbas added the good first issue label and removed the lifecycle/stale label Oct 5, 2021
@antoninbas removed their assignment Oct 5, 2021
@github-actions
Contributor

github-actions bot commented Jan 4, 2022

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

github-actions bot added the lifecycle/stale label Jan 4, 2022
github-actions bot closed this as completed Apr 4, 2022