Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fouled Authentication when using Concurrency/Election by multiple clients #17502

Open
4 tasks done
RodionGork opened this issue Feb 27, 2024 · 15 comments
Open
4 tasks done

Comments

@RodionGork
Copy link

RodionGork commented Feb 27, 2024

Bug report criteria

What happened?

Briefly

When using Concurrency / Election functionality by several clients - it may happen that internal GRPC connection uses wrong authentication (wrong user) for request due to auth preserved for the connection (in the "context" of the serverWatchStream).
In our case, with TLS authentication, with different certificates.
Experiencing this on versions from 3.4.19 to 3.5.12 (recent).

UPD - please read my next comment which contains docker image to demonstrate the issue and explains steps. It happens both with authorization by certificates and with basic auth.

Details

Concurrency / Election functionality accepts GRPC requests (like Campaign or Observe) and then does several internal GRPC-requests to KV-storage. With GRPC v3 approach of "auth per connection" (rather than "per request) these internal requests may reuse existing authorization. Particularly this happens with serverWatchStream, e.g. here:

https://github.com/etcd-io/etcd/blob/main/server/etcdserver/api/v3rpc/watch.go#L236

the call boils down to AuthInfoFromCtx (in v3_server.go, not store.go) from where it delegates to AuthInfoFromTLS (in store.go that time). And here peer is fetched from the context, but context is the field of the "watch stream" and may remain from some preceding call by very different client.

https://github.com/etcd-io/etcd/blob/main/server/auth/store.go#L1012

This leads to sudden Permission Denied when app is experiencing load by several clients simultaneously, despite working fine if the interfering clients are shut down.

What did you expect to happen?

Behavior should not differ if some unrelated clients are making unrelated requests simultaneously. There should be no sudden "permission denied" (as no permission changes happened).

How can we reproduce it (as minimally and precisely as possible)?

As it requires running etcd leader election with TLS authentication and several client, reproducing is hardly "minimal", but I hope you recognize the problem from the description owing to your better knowledge of code. If necessary though, please write back, I'll try to create docker file / image with everything necessary.

Anything else we need to know?

No response

Etcd version (please run commands below)

etcd Version: 3.4.19
Git SHA: 4636a5fab
Go Version: go1.21.6
Go OS/Arch: linux/amd64

Etcd configuration (command line flags or environment variables)

(this seems irrelevant)

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

(this seems irrelevant)

Relevant log output

(no specific logs besides manually added for debugging)
@ahrtr
Copy link
Member

ahrtr commented Feb 29, 2024

It isn't clear what exactly issue you were running into.

  • Please clearly clarify what's the issue from user's perspective. It's great if you can provide a detailed steps to reproduce the issue.
  • I assume that you have already enabled auth (either called the AuthEnable or executed command etcdctl auth enable), please double confirm this.
  • Have you configured the flag --client-cert-auth when starting etcd? Please provide the complete parameters.
  • Please also provide complete etcd log.

In our case, with TLS authentication, with different certificates.

What's the CN (Common Name) field in these certificates? FYI. https://etcd.io/docs/v3.5/op-guide/authentication/rbac/#using-tls-common-name

@ahrtr
Copy link
Member

ahrtr commented Mar 1, 2024

Possibly the same issue as #12385

@RodionGork
Copy link
Author

RodionGork commented Mar 5, 2024

Possibly the same issue as #12385

Not exactly, but the cause is somewhat similar

This issue #14850 is probably manifestation of the same problem I'm talking about. I even found it fails with basic auth.

I prepared docker image which demonstrates the problem so you may have ready "stand" for trying and researching it. Files will work outside of docker too I believe (or you can instead run your custom-build etcd binary inside this image to debug etc). It also uses basic auth to demonstrate the issue.

What the image does, briefly (direct instructions to run it are somewhere below):

  • downloads ready binary for etcd v3.4.28 in ubuntu 22.04 (issue exists in most versions 3.4...., and 3.5... I believe)
  • also few tools, particularly grpcurl to make calls.
  • there is setup file to start etcd and create two users with their roles
  • then there is bug demo file - below I describe separately what it does to demonstrate faulty behavior

How the bug happens: If we call elections function Campaign twice, we expect the first call to finish immediately, the second waits until the lease used for the first one expires. It works for single user. It works for two users also, if their calls do not interfere in time. It fails if they do interfere:

  1. Create Lease 10001 and use it to call Campaign with User1 (returns immediately)
  2. Create Lease 10002 and use it to call Campaign with User1 (waits until Lease 10001 expires, then returns)
  3. Create Lease 20001 and use it to call Campaign with User2 (returns immediately)
  4. Create Lease 20002 and use it to call Campaign with User2 (should return after waiting for lease 20001 expiration)

The step 4 fails with "Permission denied" if step 2 is still waiting, which we achieve in launching it into background and adding small sleep before step 3. If sleep is incremented so that step 2 finishes before step 4 - there is no error.

As I understand it happens because steps 2 and 4 call for Watch inside Campaign and it uses Watch-stream inside for which credentials are shared internally since it is still single connection (internally) so on the step-4 Watch tries to use credentials of User-1.

Instructions to run

Download the attached file: etcd-issue-17502.zip - it contains dockerfile to build the image, 3 scripts to build and run everything, proto-files for grpcurl.

Unpack the file and launch test-issue-17502.sh - it will build the image and start the container. After it is ready, you find yourself in the bash shell in the /root folder of the container, with proto files and two remaining shell scripts.

Launch ./etcd-17502-setup.sh here - it will start etcd in background and create user1 and user2 with appropriate roles etc.

Launch ./demo-17502.sh now - it will execute Lease/Campaign steps described above and should end with Permission Denied.

P.S. Thanks in advance for researching this. Files are intentionally kept small so you hopefully can tweak them in any way you like to better understand the issue.

@RodionGork
Copy link
Author

@ahrtr Benjamin, please have a look at it when you have time - I marked it is not "security-related" because it is about auth misused in pretty narrow case - but on the second thought I'm unsure, whether this behavior could be harmfully exploited.

@ahrtr
Copy link
Member

ahrtr commented Mar 5, 2024

Thanks @RodionGork for the detailed steps to reproduce the issue.

Reason

When two sub/logic streams/watchers share the same physical gRPC stream, ServerStream.Context may return unexpected context, accordingly etcd may get the unexpected token.

Specifically, etcd gets the gRPC stream context using gRPC's interface ServerStream.Context, see below,

authInfo, err := sws.ag.AuthInfoFromCtx(sws.gRPCStream.Context())

Afterwards, etcd gets token from metadata included in the context, see below. Based on my test, when two sub/logic streams/watchers (the two wathcers are using two different tokens to communicate with the grpc/etcd server) share the same physical gRPC stream, etcd may get unexpected token. For example, etcdserver may get wather1's token when it's processing wather2's message. @dfawley is this expected behahaviour from gRPC perspective?

etcd/server/auth/store.go

Lines 1059 to 1073 in 9f8756b

md, ok := metadata.FromIncomingContext(ctx)
if !ok {
return nil, nil
}
//TODO(mitake|hexfusion) review unifying key names
ts, ok := md[rpctypes.TokenFieldNameGRPC]
if !ok {
ts, ok = md[rpctypes.TokenFieldNameSwagger]
}
if !ok {
return nil, nil
}
token := ts[0]

Suggestion (best practice): it can resolve the issue

Use etcd client sdk to communicate with etcdserver, and create a separate client for each user, so that the watchers for different users will not share the same gRPC stream. Please refer to #17532.

Actually I don't recommend to use the interface defined in v3election.proto, it just added a layer to make the election easier to use. But if you read #17532, it's still super easy to implement it without using the interface in v3election.proto. When it receives client request, it will re-connect to the etcdserver using another client; and etcd always uses the same client for all requests from all real clients. So obviously it has a little worse performance.

@ahrtr
Copy link
Member

ahrtr commented Mar 5, 2024

The v3election.proto was added in #7634. Although I don't recommend to use it as I mentioned above, it's useful for applications written using a language other than Golang.

@dfawley
Copy link

dfawley commented Mar 6, 2024

gRPC has no concept of stream sharing whatsoever. It's entirely contained within the application. The context used on the client to initiate the stream will have the metadata (set via metadata.NewOutgoingContext) that the server receives (and can discover via metadata.FromIncomingContext(stream.Context())). The metadata in the context must not be mutated, or else the behavior would be undefined.

https://pkg.go.dev/google.golang.org/grpc/metadata#NewOutgoingContext

NewOutgoingContext creates a new context with outgoing md attached. If used in conjunction with AppendToOutgoingContext, NewOutgoingContext will overwrite any previously-appended metadata. md must not be modified after calling this function.

@ahrtr
Copy link
Member

ahrtr commented Mar 6, 2024

Thanks @dfawley for the feedback.

If sleep is incremented so that step 2 finishes before step 4 - there is no error.

@RodionGork The reason is that once 2 finishes before step 4, then etcd will create a new gRPC stream instead of reusing the existing gPRC stream when executing step 4. So it won't reuse the token for user1, accordingly no error.

On top of my previous comment,

  • try not to use (*electionServer) Campaign, instead use the similar approach as Add election test with auth enabled #17532
  • If you have to use it for some reasons (e.g. your application isn't written with Golang, and there isn't a client SDK yet), then try to set a different metadata in the context for each user/token when you call Election.Campaign. Please refer to links below on how etcd uses the metadata.

ctxKey := streamKeyFromCtx(ctx)

etcd/client/v3/watch.go

Lines 1038 to 1043 in 9f8756b

func streamKeyFromCtx(ctx context.Context) string {
if md, ok := metadata.FromOutgoingContext(ctx); ok {
return fmt.Sprintf("%+v", md)
}
return ""
}

etcd/client/v3/watch.go

Lines 335 to 339 in 9f8756b

wgs := w.streams[ctxKey]
if wgs == nil {
wgs = w.newWatcherGrpcStream(ctx)
w.streams[ctxKey] = wgs
}

@ahrtr
Copy link
Member

ahrtr commented Mar 6, 2024

I marked it is not "security-related" because it is about auth misused in pretty narrow case - but on the second thought I'm unsure, whether this behavior could be harmfully exploited.

No, it can't be exploited. I don't see any security issue. If you provide a wrong token, the request will be rejected for sure. The issue is the request may be still be rejected even you provide a valid token just as we discussed above.

Note the issue can only be reproduced when (1) auth is enabled and (2) multiple users call Election.Campaign concurrently. It's a design issue to me.

We have two directions to resolve this issue,

  • fix it. We need to carry over the token received on Election.Campaign to any following internal requests. It will be a big change, and accordingly will have big impact.
  • [preferred] do not fix it and only update doc as I mentioned above. But we can add a long-term item to consider to refactor & resolve.

@ahrtr
Copy link
Member

ahrtr commented Mar 6, 2024

cc @fuweid @jmhbnz @serathius

@ahrtr
Copy link
Member

ahrtr commented Mar 24, 2024

Proposed action for this issue:

  • Update the doc
    • Do not recommend to use the v3election, instead use the clientSDK if present for the programming language.
    • Add a known issue

References:

@jcferretti
Copy link
Contributor

Commenting here to ensure it is not forgotten that is likely v3lock also has the same issue.
#17623 (comment)

@jmhbnz
Copy link
Member

jmhbnz commented Mar 28, 2024

Hey @ahrtr - Discussed this issue during sig-etcd triage meeting. Can you please confirm where in the docs you would like to see this updated so we can assign this issue out to a contributor?

@ahrtr
Copy link
Member

ahrtr commented Mar 28, 2024

I think api_concurrency_reference_v3/ might be the best place?

@jcferretti
Copy link
Contributor

No, it can't be exploited.

If etcd is using the wrong token, it could as well be authenticated when it shouldn't, as it can be not authenticated when it should, no?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

5 participants