Add Security to Riak #355

Closed
Vagabond opened this issue Jul 24, 2013 · 45 comments

Comments

@Vagabond
Contributor

This is a tracking meta-issue for the cross-repo task of adding security to Riak.

Rationale

Riak (not Riak CS, which has its own application layer security model) currently has no authentication or authorization, nor does it provide encryption for the Protocol Buffers API (HTTPS is optional for the REST interface). In most deployments, Riak is deployed on a trusted network and unauthorized access is restricted by firewall/routing rules. This is usually fine, but if unauthorized access is obtained, Riak offers no protection to the data that it stores.

Thus, I propose we add authentication/authorization/TLS and auditing to Riak, to make Riak more resilient to unauthorized access. In general, I took the design cues from PostgreSQL. Another goal was to make this applicable to riak_core, so any reliance on KV primitives or features is intentionally avoided.

Authentication

All authentication should be done over TLS for two reasons: to avoid MITM attacks and to prevent eavesdroppers from sniffing credentials. Self-signed certificates are acceptable if the client checks the server's certificate against a local copy of the CA certificate (and thus we can avoid the complicated 'web of trust' used by regular HTTPS). CRL checks should be performed when a CRL is appropriately configured.
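
A minimal Python sketch of the client-side check described above, not taken from any Riak client library. The function name make_client_context, the ca_file path, and the crl_check switch are hypothetical; only the standard-library ssl calls are real.

```python
import ssl


def make_client_context(ca_file=None, crl_check=False):
    """Build a TLS client context that trusts only one local CA certificate.

    ca_file is a hypothetical path to the local copy of the CA certificate.
    When it is omitted no trust anchors are loaded, so every handshake
    would fail verification.
    """
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # PROTOCOL_TLS_CLIENT already enables both settings; they are repeated
    # here to make the intent explicit.
    context.check_hostname = True
    context.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        # Trust this CA only; the system-wide store is never loaded.
        context.load_verify_locations(cafile=ca_file)
    if crl_check:
        # Require a CRL check on the server's certificate. The CRL itself
        # must also be loaded (for example via load_verify_locations).
        context.verify_flags |= ssl.VERIFY_CRL_CHECK_LEAF
    return context
```

A client would pass the returned context to context.wrap_socket(sock, server_hostname=...) before sending any credentials.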

Once TLS has been negotiated and verified, the client supplies username/password credentials. The password is transmitted in the clear inside the TLS session; this is to facilitate writing pluggable security backends. This is not a major problem because at this point the connection should be proof against eavesdropping.

The pluggable security backends we propose to implement are the following:

  • Trust - User is on a trusted connection, if you can connect from this CIDR range, we trust you. The password provided is ignored.
  • Password - Riak maintains a username/PBKDF2 password hash table and checks the provided password against the stored hash.
  • PAM - Riak submits the username/password to PAM and PAM verifies it against the specific service configured.
  • Certificate - The client provides a certificate, signed by the same CA as the server's certificate. If this certificate is validated and the common-name matches the requested username, the user is authenticated.
  • LDAP - Authenticate against an LDAP/Active Directory server (this is a stretch goal, since you can do this via PAM as well).
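
The common-name check in the Certificate backend could look roughly like the Python sketch below. It assumes the certificate has already been validated during the TLS handshake and is available in the dict form returned by Python's ssl.SSLSocket.getpeercert(); the function names are hypothetical.

```python
def common_name(peer_cert):
    """Return the subject commonName from a decoded certificate dict, or None."""
    # 'subject' is a tuple of RDNs, each a tuple of (attribute, value) pairs.
    for rdn in peer_cert.get("subject", ()):
        for attribute, value in rdn:
            if attribute == "commonName":
                return value
    return None


def certificate_authenticates(peer_cert, requested_username):
    """True only when the certificate's common name equals the requested user."""
    name = common_name(peer_cert)
    return name is not None and name == requested_username


example_cert = {
    "subject": ((("organizationName", "Example"),), (("commonName", "andrew"),)),
}
print(certificate_authenticates(example_cert, "andrew"))  # True
print(certificate_authenticates(example_cert, "sean"))    # False
```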

Postgres auth methods

Authentication information is split into two pieces, users and sources. A user cannot authenticate without a corresponding source that matches username/peer address.

Postgres authentication source configuration

To add a user named andrew and to trust all connections from localhost, you'd do:

riak-admin security add-user andrew
riak-admin security add-source all 127.0.0.1/32 trust

To add a user that you wanted to authenticate against the local password table and be allowed to connect from anywhere:

riak-admin security add-user sean password=justopenasocket
riak-admin security add-source sean 0.0.0.0/0 password

The password provided at user-creation time is hashed via PBKDF2 and stored.
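
As an illustration of the hash-and-verify step, here is a small Python sketch using the standard library's PBKDF2. The hash function (SHA-256), salt length, and iteration count are assumptions chosen for the example; the proposal does not specify them.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor, not specified in the proposal
SALT_BYTES = 16       # assumed salt length


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, iterations, digest) for storage in the password table."""
    if salt is None:
        salt = os.urandom(SALT_BYTES)
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def check_password(password, salt, iterations, expected_digest):
    """Re-derive the digest and compare it in constant time."""
    _, _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected_digest)


salt, iterations, stored = hash_password("justopenasocket")
print(check_password("justopenasocket", salt, iterations, stored))  # True
print(check_password("wrong", salt, iterations, stored))            # False
```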

To trust users on the LAN but to force everyone else to authenticate against PAM:

riak-admin security add-source all 192.168.1.0/24 trust
riak-admin security add-source all 0.0.0.0/0 pam service=riak

The service=riak option tells PAM to submit any provided credentials to that particular PAM service configuration. Sources are compared most to least specific, both by the user match and the CIDR match (specific usernames sort before 'all' and a /24 sorts before a /0). Only the first matching source is tested; if that fails, the authentication fails.
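
The source-selection rule can be sketched in Python as follows. This is an illustration only: add_source and match_source are hypothetical names, and ranking user specificity ahead of CIDR specificity is an assumption, since the text does not say which comparison takes priority.

```python
import ipaddress


def add_source(sources, user, cidr, backend, **options):
    """Append a source entry, mirroring 'riak-admin security add-source'."""
    sources.append({
        "user": user,
        "network": ipaddress.ip_network(cidr),
        "backend": backend,
        "options": options,
    })


def match_source(sources, username, peer_address):
    """Return the single most specific matching source, or None."""
    address = ipaddress.ip_address(peer_address)
    candidates = [
        s for s in sources
        if s["user"] in (username, "all") and address in s["network"]
    ]
    if not candidates:
        return None
    # Assumed ordering: a specific username beats 'all', then the longer
    # CIDR prefix wins (a /24 sorts before a /0).
    candidates.sort(key=lambda s: (s["user"] == "all", -s["network"].prefixlen))
    # Only this first match is tested; if it rejects the credentials,
    # authentication fails without trying the remaining sources.
    return candidates[0]


sources = []
add_source(sources, "all", "192.168.1.0/24", "trust")
add_source(sources, "all", "0.0.0.0/0", "pam", service="riak")
print(match_source(sources, "andrew", "192.168.1.17")["backend"])  # trust
print(match_source(sources, "andrew", "10.0.0.5")["backend"])      # pam
```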

Authorization

Riak currently has a completely permissive approach to data access: if you can connect, you can get/put/delete anything you want. Providing authentication, as in the above section, raises the bar to that kind of access, but it still leaves your data vulnerable to a compromised client, especially if you have something like a lower security reporting application (or even a remotely hosted one with a hole punched in the firewall). This also makes anything like multi-tenancy impossible (think hosting multiple phpBB instances on a single MySQL server).

Thus, in addition to authentication, we also need authorization. This is a major change to Riak's semantics, especially when it comes to creating buckets, which in Riak now is as simple as writing a key to the bucket you want to create. For applications that want to dynamically create buckets, we need to provide some way to give them authorization to do so, without compromising the ability to provide security.

To that end, I propose that authorization be checked on a per-bucket basis. Users are granted granular permissions (registered by the individual riak_core applications):

riak_core:register(riak_kv, [{permissions, [get, put, delete]}, ...])

The permissions are namespaced by the registering application, so the above permissions become riak_kv.get, riak_kv.put, etc. These permissions convey no meaning to riak_core, the application is in charge of indicating what permissions are required for each operation.
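
To make the namespacing concrete, here is a tiny Python sketch (not riak_core's Erlang implementation; the registry dict and the register helper are hypothetical stand-ins):

```python
def register(registry, app, permissions):
    """Record an application's permissions under its own namespace."""
    names = {f"{app}.{permission}" for permission in permissions}
    registry.setdefault(app, set()).update(names)
    return names


registry = {}
print(sorted(register(registry, "riak_kv", ["get", "put", "delete"])))
# ['riak_kv.delete', 'riak_kv.get', 'riak_kv.put']
```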

Examples of granting permissions:

riak-admin security grant riak_kv.get ON mybucket TO andrew
riak-admin security grant riak_kv.get,riak_kv.put ON mybucket TO sean

To preserve the ability to dynamically create buckets whose name is not known beforehand (think buckets per-username or something), I propose the ability to GRANT based on a bucket prefix:

riak-admin security grant riak_kv.put ON myapp_* to andrew

Thus, the application connecting with the 'andrew' credential can create unlimited buckets that begin with 'myapp_', but has no access to buckets outside that prefix space.
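
A rough Python sketch of how a prefix grant might be evaluated. It assumes a trailing '*' marks a prefix and that grants are held as simple (permission, target, user) tuples; neither detail is specified by the proposal.

```python
def bucket_matches(target, bucket):
    """A target ending in '*' is treated as a prefix; otherwise match exactly."""
    if target.endswith("*"):
        return bucket.startswith(target[:-1])
    return bucket == target


def is_allowed(grants, user, permission, bucket):
    """True if any grant gives this user this permission on this bucket."""
    return any(
        granted_user == user
        and granted_permission == permission
        and bucket_matches(target, bucket)
        for granted_permission, target, granted_user in grants
    )


grants = [
    ("riak_kv.get", "mybucket", "andrew"),
    ("riak_kv.put", "myapp_*", "andrew"),
]
print(is_allowed(grants, "andrew", "riak_kv.put", "myapp_users"))     # True
print(is_allowed(grants, "andrew", "riak_kv.put", "otherapp_users"))  # False
```

Note that with this convention a bucket whose real name ends in '*' cannot be granted exactly, which is the naming conflict listed under Risks.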

Additionally, if you want to give a user access to everything, Riak could support the ALL permission and the ANY target:

riak-admin security grant ALL on ANY to andrew

This would effectively provide the old unlimited access that Riak currently has, but still provide some security.

It may also be interesting to wildcard permissions by application, e.g. 'riak_kv.*'.

As the superuser giveth, he may also taketh away:

riak-admin security revoke riak_kv.get ON mybucket FROM andrew

Grants and revokes are currently stored separately. The goal is to make users/sources/grants strongly consistent and revokes eventually consistent. That way, during an outage (possibly caused by a malicious/co-opted user account), you can revoke without requiring complete cluster availability, but you avoid problems with partial grants, etc.

Auditing

Since every operation will now be tied back to a user account, we should be able to audit what user did what and when. To that end I plan to extend lager to support alternate event streams (with a separate gen_event) and use that as an audit logging facility. Pairing that with the syslog backend, you'd be able to ship the logs off the machine and so make them harder to tamper with. This is a stretch goal for this development cycle.
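
The idea of a separate audit event stream can be illustrated by analogy with Python's logging module. This is not lager's API; the logger name "audit" and the helper functions are hypothetical.

```python
import logging


def make_audit_logger(handler, name="audit"):
    """Return a logger whose events go only to the given audit handler."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Keep audit events out of the regular application log stream.
    logger.propagate = False
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
    logger.addHandler(handler)
    return logger


def audit(logger, user, permission, bucket, allowed):
    """Record which user attempted which operation, and the outcome."""
    logger.info(
        "user=%s permission=%s bucket=%s allowed=%s",
        user, permission, bucket, allowed,
    )


audit_log = make_audit_logger(logging.StreamHandler())
audit(audit_log, "andrew", "riak_kv.get", "mybucket", True)
```

Passing a logging.handlers.SysLogHandler instead of the StreamHandler would ship the events off the machine, in the spirit of the syslog backend mentioned above.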

Migration

When this work drops in the next major release, existing deployments will have to migrate. For at least existing deployments, the security stuff should default to off. When the user is ready to turn it on, they'll need to have upgraded their client libraries to support it as well as deployed SSL certificates to all the nodes, signed by the same CA. Until that switch gets flipped, clients will work exactly as they do now.

Open Questions

  • Should user group support be provided?
  • Should users be deleted, deactivated or both?
  • Can we make new deployments 'secure' by default?
  • What should permissions for things like 'list keys' or 2i lookups look like? Should full keyspace list-keys only return results from 'allowed' buckets? Should they even be allowed without special permissions?
  • Are there any important permissions that don't map to a specific bucket?
  • Should we allow authentication without requiring TLS? This seems irresponsible, so it should at least be hard.
  • Should user information be replicated by MDC?

Risks

  • Wildcarding bucket names may conflict with legitimate bucket names.
  • Dependence on TLS-PKI is annoying for users. R16B01 supports TLS-PSK which would be easier to get users started, but we may not be able to ship Riak with R16B01 due to some bugs we've found. It is also unclear if other languages provide working support for TLS-PSK we can use in our supported client libraries.
  • Clients will all need to be updated to support this, hopefully to coincide with the release (or shortly after). Users relying on 3rd party libraries may have to wait even longer.
  • The location where grants/revokes will be stored, known as 'cluster metadata', is still being implemented. The security prototype currently uses ring metadata.
  • CRL support in Erlang is undocumented and at least partially experimental, so implementing and verifying it will be a challenge.

Example client sessions

HTTP: https://gist.github.com/Vagabond/05b7dc8ae6d3ca4af6c2
PBC: https://gist.github.com/Vagabond/6222793a1d352f1ccdd2

Work in Progress

Partial implementations of all of this may be found in the 'adt-security' branch of the following repos:

@coderoshi
Contributor

Concerning user groups, I'd vote yes. If a group of permissions could be bundled (aka roles), then users could be assigned to a group/role rather than granted/revoked permissions individually. This could prove helpful not only in implementing RBAC, but also in reducing the complexity of defining multiple users with similar complex roles.

@glassresistor

+1

@bkerley

bkerley commented Jul 24, 2013

Certificate - The client provides a certificate, signed by the same CA as the server's certificate. If this certificate is validated and the common-name matches the requested username, the user is authenticated.

This should include a configurable Certificate Revocation List; otherwise untrusted clients can't be removed without basically starting the CA from scratch.

@peschkaj

+1

Groups are important, especially in the LDAP/ActiveDirectory world.

@aphyr

aphyr commented Jul 24, 2013

There are fundamentally three modes for TLS' authentication model:

  1. Assume you create a separate client key and certificate for each user. Both client and server verify each other's certificate. These channels are secure against MITM attacks, even where the client's CA chain is tampered with. Completing the TLS handshake proves the client's identity to Riak. A username and password is not required.
  2. Assume clients generate their own certs, and keep the server's CA chain on hand to verify the server's certificate during channel negotiation. These channels are secure against MITM attacks so long as an attacker cannot manipulate the client's CA store. A username and password is still required to authenticate the channel, since the client is anonymous.
  3. Assume totally anonymous mode: neither the client nor the server are authenticated. This mode is trivially vulnerable to MITM attacks. Since usernames and passwords are required in this case, attackers can readily capture credentials and impersonate any user.

I recommend either 1.) fully authenticated or 2.) server-authenticated TLS channels. This mandates the use of a certificate store on each client. While you're going to that trouble, it might make sense to generate and store client certificates as well. Indeed, the proposal requires the secure storage of client keys and certificates, which means you'll need a key distribution scheme for clients.

Given the presence of client secure storage for client keys and certs, you might as well encode the user credentials in the certificate directly. This removes the need for passwords and their secure storage on the server, which reduces the attack profile. It also removes the need for a separate username/password auth channel in the Riak protocol. That'd make it simpler for client maintainers to add auth support to their clients, since they can rely on the TLS protocol to do the work for them. Clients only have to store/configure [key, cert], instead of [key, cert, username, password]. User access can be cancelled via the usual CRL techniques.

@jj1bdx
Contributor

jj1bdx commented Jul 25, 2013

Canola (a PAM authenticator module) at https://github.com/basho/canola should be included in the Work In Progress list.

@Vagabond
Contributor Author

@bkerley Yes, CRL support is something I forgot to cover, will update the text.

@aphyr Yes, I plan to support both 1 and 2 (but not 3). If you want to use the certificate authentication mode (with no additional password), we require you to handshake via method 1. This is already sort of implemented for the PBC protocol and the riak_test shows its use:

https://github.com/basho/riak_test/blob/adt-security/tests/pb_security.erl#L94
https://github.com/basho/riak_test/blob/adt-security/tests/pb_security.erl#L133

@randysecrist

Will the ACL-style permissions preclude, or make it difficult to use, a more capability-driven security model (OAuth scopes) in a future phase? Has OAuth been discussed?

@aggress

aggress commented Jul 25, 2013

How would this work with MDC?

@cdahlqvist

It would be useful to also have authorization for list_keys, list_buckets and secondary index queries as well as for the ability to run mapreduce queries.

@seancribbs
Contributor

@cdahlqvist From my reading of the proposal, all operations will be authorized and audited.

@Vagabond
Contributor Author

@aggress MDC doesn't deal with client authentication at all; it already has support for TLS-based authentication, where both sides verify the other's certificate.

@cdahlqvist Yes, EVERY operation (or almost all of them) will need a permission associated, I just don't know what some of them will look like yet. The only exception to the rule may be the stats and ping endpoints, you may be able to hit them simply by being authenticated, not sure yet.

@aggress

aggress commented Jul 25, 2013

@Vagabond I was thinking more along the lines of: will users/roles created in cluster A be replicated over to cluster B, or will they need to be set up individually? And how might things work with features like cascading writes?

@Vagabond
Contributor Author

@aggress Yes, that is a good question, will add it to the open questions section. My initial feeling is to NOT replicate that information, but we'll see.

@aggress

aggress commented Jul 25, 2013

How might commit hooks be handled? For example, stopping user A from using a commit hook that updates a bucket only user B has access to.

@ghost

ghost commented Jul 25, 2013

Oh yeah, that reminds me: Erlang mapreduce is basically a wide-open door for arbitrary code execution, including, I suppose, modifying the ACLs themselves, so it should only be accessible to the highest privilege levels.

@jrwest
Contributor

jrwest commented Jul 26, 2013

re: replication between clusters, that isn't something we planned to support in cluster metadata (where this info will be stored) -- at least initially.

Either way, I agree w/ @Vagabond's initial feeling. Even if the cluster is accessed by the same logical users I would assume they are typically accessed by different hardware (external LB at the least, probably different application servers).

@tarcieri

Something Riak might consider is a capability-based security model for granting access to buckets. I think capability-based security could fit extremely well with Riak's key/value storage model and have done a bit of work in this space.

Under this model, authentication could be handled using whatever mechanism is desired (e.g. mutual TLS), but to authorize access to a particular bucket, the client would need to present a bucket-specific token, which could actually be a combination of cryptographic keys (known as a crypto-capability model).

I've implemented a generic key/value store encryption system which works with Riak among other key/value stores here, if you're interested in seeing a real-world example of what I'm describing. My scheme encrypts both keys and values, allows data to be accessed using only encrypted keys, and allows clients to decrypt the key names if so desired:

https://github.com/cryptosphere/keyspace

The best part of this approach is that it has minimal impact on Riak. In fact, encryption is orthogonal, and something only clients would have to support. The only thing that would have to be added to Riak itself is a digital signature check (along with a timestamp check to prevent replay attacks) to ensure values being written are authentic.

@randysecrist

@tarcieri I like your work on keyspace, and it pairs with what I was going for when I asked about a capability based model a bit earlier. +1 for this.

@camshaft

+1 capabilities. It's generally easier to understand and more secure. Managing ACLs becomes cumbersome very quickly from my experience.

@Vagabond
Contributor Author

Vagabond commented Aug 8, 2013

So, after a bunch of reading and internal discussion, I think we're going to stick with ACLs, for the following reasons:

  • Users are more familiar with them, other databases are primarily ACL based (when they provide security at all).
  • Clients for Riak don't need to pass credentials around.
  • Credentials are harder to audit, unless you tie them to a per-user certificate you can revoke or something.

However, I think I will add postgres style roles, as a way to implement 'groups'.

@tarcieri

tarcieri commented Aug 8, 2013

Sad to hear that :(

I buy the familiarity argument, but that's really the only thing ACLs have going for them over capabilities. Since capabilities solve the AuthZ problem, you can still use MTLS to solve AuthN and revoke access that way. Audit logging can be used to spot abuse of capabilities.

Relevant: Zed Shaw - The ACL Is Dead

@Vagabond
Contributor Author

Vagabond commented Aug 9, 2013

I am watching that talk, but I'm struggling to extract much of relevance from it. It is sort of like a tech-related street performance that occasionally touches on ACLs en-route to bagels, smoked meat, corporate greed, the incompetence of MBAs, etc.

The three key points he seems to make are:

  • ACLs are not Turing complete
  • They do not have repetition
  • They do not have externally accessible storage

Maybe I'm dense, but I don't understand why those are even a problem. I understand his point about ridiculous business rule requirements about time and situation dependent ACLs, but Riak does not really have that problem.

Right now my takeaway is this: ACLs for people != ACLs for applications. Applications rarely need time-dependent or situation dependent access to data, they have their data and they want to access it whenever they need to, and these access rules change rarely. Riak is not a document management solution, it is a database. It is used by applications, not people.

I'm happy to have a discussion about this, but providing references on things like the 'authz problem' that are not a 1:10 stream-of-consciousness rant about all sorts of unrelated things would help your case a lot more. It is fairly telling that none of the questions at the end were even about ACLs at all, beyond one question about making what Zed did into a product.

@tarcieri

tarcieri commented Aug 9, 2013

Haha, sorry about that. But I hope it drives home that ACLs are in an uncanny valley between a capability based system and Turing-complete code for providing AuthZ.

Waterken Web describes some of the tradeoffs of capabilities vs ACLs:

http://waterken.sourceforge.net/

You might also take a look at how Tahoe-LAFS implements "writecaps" and "readcaps" for its mutable files. You wouldn't need anything so elaborate, just a digital signature:

http://eprint.iacr.org/2012/524.pdf

Tahoe ends up providing something that looks an awful lot like an encrypted version of Riak, sans many of the features that make Riak compelling as a database (read repair, vector clocks, 2I, etc)

@Vagabond
Contributor Author

Vagabond commented Aug 9, 2013

Maybe we can narrow the conversation here. When would the confused deputy problem occur for Riak, along the lines of the compiler example here:

http://waterken.sourceforge.net/aclsdont/current.pdf

@Vagabond
Contributor Author

Vagabond commented Aug 9, 2013

I guess my biggest sources of mystification are the following:

  • What part of the above proposal has the capacity to act as a confused deputy
  • What does issuing a capability to a user look like, and how would it work

@tarcieri

@Vagabond that's not a question I can answer until you have defined a threat model. Only then can you enumerate potential attacks and choose defenses.

I can perhaps enumerate why ACLs don't work in practice with an example threat model:

Threat: We want to give Alice, but not Mallory, AuthZ to X, even though Alice and Mallory can both AuthN to the service providing X and Alice and Mallory are conspirators
Capability attack scenario: Alice gives Mallory the capability to access X. Mallory can then access X. Our audit logs reflect Mallory accessing X
ACL attack scenario: Alice downloads X and gives it to Mallory. Mallory now has X. Our audit logs reflect Alice accessing X, not Mallory. Now we have the problem that Alice is authorized to access X and thus this may appear to be normal behavior, combined with the fact that Mallory gaining access to the content is not reflected in the audit logs.

In the end the result is the same, with some caveats: In the capability scenario, we see Mallory accessing the resource illicitly, but don't learn that Alice is a conspirator. In the ACL scenario, we don't learn about Mallory's involvement at all, as it appears that Alice accessed the resource. In the ACL scenario, Alice's behavior in the audit logs looks "normal", because Alice is authorized to access X. In the capability scenario, we can cross check the audit logs with our records of who should be able to access what, and determine that Mallory accessed X illicitly.

Thus, while capabilities are shareable, it's probably in Mallory's best interest to act as if they weren't and obtain X through a conspirator, lest his actions show up in the audit logs. In other words, while the fact capabilities are shareable appears to be disadvantageous, it's actually in the attacker's best interest not to take advantage of this fact, lest their actions appear in the audit logs. A sophisticated attacker will want to piggyback their attack on normal looking behavior as this will make it harder to detect.
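
The audit cross-check mentioned in the capability scenario can be sketched in a few lines of Python. The data shapes (a list of (user, resource) log entries and a resource-to-users mapping) are assumptions made for illustration.

```python
def illicit_accesses(audit_log, authorized):
    """Return log entries whose user is not on record for that resource."""
    return [
        (user, resource)
        for user, resource in audit_log
        if user not in authorized.get(resource, set())
    ]


authorized = {"X": {"alice"}}

# Capability scenario: Mallory uses Alice's capability, so Mallory is logged.
print(illicit_accesses([("alice", "X"), ("mallory", "X")], authorized))
# [('mallory', 'X')]

# ACL scenario: Alice fetches X on Mallory's behalf, so nothing stands out.
print(illicit_accesses([("alice", "X"), ("alice", "X")], authorized))
# []
```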

What does issuing a capability to a user look like, and how would it work

This is a fairly open-ended question as there are many ways that capabilities can be implemented. I can roughly detail what you could do with the sort of crypto-capabilities model implemented by Tahoe (although in this case I'm only describing how you'd ensure authenticity of data, not confidentiality. Tahoe provides both)

In general capability tokens are considered necessary and sufficient in and of themselves for accessing a particular resource. This doesn't preclude adding an additional mutual TLS layer or what have you to AuthN to the service.

Ideally every part of the system has an associated set of capabilities. All data is individually, uniquely, and securely identifiable. So for starters: every bucket would have separate write/authenticate capabilities, if not every key.

So, at the time you create a bucket, a public and private digital signature key would be generated. The server would store the public key and use it to authenticate writes. The private key would allow new data to be written. The server would mandate that all writes be digitally signed (hopefully with a timestamp to prevent replay attacks)

Requests to write would include some type of request parameter containing a digital signature produced client side by the holder of a private key for a particular bucket or bucket:key combination. The server would authenticate digital signatures before accepting the write.
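
A hedged Python sketch of the server-side check on a signed, timestamped write. Important substitution: Python's standard library has no public-key signatures, so HMAC-SHA256 with a shared key stands in for the digital signature here. In the scheme described above the server would hold only a public key and could not sign writes itself. The function names, the field encoding, and the 300-second window are all assumptions.

```python
import hashlib
import hmac
import time

MAX_SKEW_SECONDS = 300  # assumed acceptance window for timestamps


def _encode(fields):
    """Length-prefix each field so field boundaries cannot be confused."""
    return b"".join(len(field).to_bytes(4, "big") + field for field in fields)


def sign_write(key, bucket, object_key, value, timestamp):
    """Client side: sign bucket, key, value and timestamp together.

    HMAC is a symmetric stand-in for the per-bucket private signing key.
    """
    message = _encode([bucket, object_key, str(timestamp).encode("ascii"), value])
    return hmac.new(key, message, hashlib.sha256).digest()


def accept_write(key, bucket, object_key, value, timestamp, signature, now=None):
    """Server side: reject stale timestamps, then check the signature."""
    now = time.time() if now is None else now
    # A stale or far-future timestamp suggests a replayed or forged request.
    # This narrows the replay window but does not eliminate it.
    if abs(now - timestamp) > MAX_SKEW_SECONDS:
        return False
    expected = sign_write(key, bucket, object_key, value, timestamp)
    return hmac.compare_digest(expected, signature)


key = b"k" * 32
signature = sign_write(key, b"mybucket", b"key1", b"value", 1000)
print(accept_write(key, b"mybucket", b"key1", b"value", 1000, signature, now=1010))  # True
print(accept_write(key, b"mybucket", b"key1", b"value", 1000, signature, now=5000))  # False
```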

@danostrowski

+1 and thanks!

@glagnar

glagnar commented Nov 11, 2013

I am looking into Riak for a project requiring a secure distributed database. I need to make sure that if one node is compromised, i.e. the server has been taken over, it will not be possible to break the entire cluster, for example by preventing a compromised node from altering permissions or changing commit hooks.
Will either be possible with Riak 2.0?

@aphyr

aphyr commented Nov 11, 2013

@glagnar: I doubt you'll satisfy that property in any major distributed database without end-to-end cryptographic verification of writes by both all servers and all clients. As an example, take a look at what's required to build http://www.pmg.csail.mit.edu/bft/castro99correctness-abstract.html

@coderoshi
Contributor

@glagnar This is a different sort of security altogether. If a box itself is compromised, the user can simply give themselves any permissions they want via riak-admin.

@glagnar

glagnar commented Nov 11, 2013

@coderoshi Thanks, I know. That was my exact source of worry. In a situation where the server is compromised, could an 'admin password' not solve this issue? I.e. without password authentication, it should not be allowed to change, for example, permissions within the cluster of nodes?
@aphyr I am not sure my issue has this requirement, as it is not initially the 'data' writes I am worried about.

@tarcieri

@glagnar if you really want a "trust no one" system where the compromise of a single node has zero impact on the rest of the grid, you might look at Tahoe-LAFS. It satisfies those properties (namely end-to-end cryptographic confidentiality and integrity of all content as @aphyr described): http://tahoe-lafs.org

@aphyr

aphyr commented Nov 12, 2013

@glagnar @tarcieri Note that Tahoe-LAFS does not provide robustness to a single compromised gateway or client node; only storage nodes.

@tarcieri

@aphyr well yes, but ideally you separate the Tahoe nodes which provide storage service from the clients which are accessing the content, in which case only the clients see the capabilities/secrets, and the storage nodes are otherwise completely oblivious and see only ciphertexts. In such a deployment, the servers could be compromised without worry

@glagnar

glagnar commented Nov 12, 2013

Is it possible to set up Riak in a unidirectional replication manner? I.e. A is master, and B and C are slaves. This means that it does not matter if B or C are compromised. Then 3 clusters could be set up, where in turn A, B or C is master. A client would then be able to detect if one master had been compromised by looking at the difference between the three clusters.

@coderoshi
Contributor

No this is not possible. All Riak nodes are equivalent.

@sogabe

sogabe commented Dec 17, 2013

I tried the security extensions with user/CIDR authentication. It seems to work fine, but I can't find how to remove sources. Could anyone tell me when most of the functions will be ready to try?

@Vagabond
Contributor Author

See #434 and related PRs, not all of the security code has landed yet.

@sogabe

sogabe commented Dec 17, 2013

Thanks, @Vagabond

@jaredmorrow
Contributor

Closing this as most of what was described here landed in 2.0 pre builds.

@rzezeski rzezeski modified the milestones: 2.0-beta, 2.0 Mar 25, 2014
hmmr pushed a commit that referenced this issue Nov 8, 2016