Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a codec AES_128_GCM_SIV for encrypting columns on disk #19896

Merged
merged 4 commits into from
Jul 30, 2021

Conversation

depressed-pho
Copy link
Contributor

While this is implemented as a compression codec, it does not actually compress data. It instead encrypts data on disk. The key is obtained by executing a user-specified command at the server startup, or if it's not specified the codec refuses to process any data.

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

  • Added a compression codec AES_128_GCM_SIV which encrypts columns instead of compressing them.

Detailed description / Documentation draft:

  • Included in the commit.
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@robot-clickhouse robot-clickhouse added doc-alert pr-feature Pull request with new product feature labels Feb 1, 2021
@alex-zaitsev
Copy link
Contributor

@depressed-pho , this is nice, but can we have it more generic? I.e. instead of AES codec define it as Encrypted with parameters. That will allow to make it more generic, as current 'encrypt' function: https://clickhouse.tech/docs/en/sql-reference/functions/encryption-functions/

For now it will only support AES_128_GCM_SIV, but we can extend later to other options.

@depressed-pho
Copy link
Contributor Author

Like Codec(Default, Encrypted('AES-128-GCM-SIV'))? Okay, I will do it.

@alexey-milovidov
Copy link
Member

It should be conditionally compiled under #if USE_SSL, see examples in code.

@Enmk
Copy link
Contributor

Enmk commented Feb 1, 2021

@depressed-pho thanks, a very nice feature! I noticed that there are no tests, could you please add some? To simplify this, you may want to reuse the unit-test suite from other codes, please see CodecTestCompatibility for example.

@alexey-milovidov
Copy link
Member

Unit tests are optional but functional tests are mandatory.

@depressed-pho
Copy link
Contributor Author

But I'm not sure how to write a functional test for this. Would it be enough to just add something like <encryption><key_command>echo "Some static key in base64"</key_command></encryption> to tests/server-test.xml and a stateless SQL test using the codec?

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Feb 1, 2021

@depressed-pho Yes, you can add a new config with a script for static key, see tests/config/README.md

@vzakaznikov
Copy link
Contributor

If we make it more generic, it would nice for the encryption parameters to the codec to match our current encrypt function (parameter names, order and values).

@depressed-pho
Copy link
Contributor Author

depressed-pho commented Feb 2, 2021

Updated my patch. It's now named Encrypted(cipher) where cipher has to be a literal 'AES-128-GCM-SIV' for now. The code is guarded by #if USE_SSL, and has a stateless SQL test.

@depressed-pho
Copy link
Contributor Author

Amended it again to fix a build failure occuring when !USE_SSL.

@depressed-pho
Copy link
Contributor Author

Fixed !USE_SSL again.

@depressed-pho
Copy link
Contributor Author

Marked the test as being skipped in fasttest, where SSL is disabled.

@depressed-pho
Copy link
Contributor Author

Fixed several build errors found by the build check.

@depressed-pho
Copy link
Contributor Author

And now only the Yandex-exclusive checks are failing. I can't do anything about that.

@depressed-pho
Copy link
Contributor Author

Rebased and resolved a conflict.

@nikitamikhaylov
Copy link
Member

@depressed-pho Please, run a script generate-ya-make from utils folder to solve an issue with Yandex synchronization check. Then commit add all ya.make files to this PR.

@depressed-pho
Copy link
Contributor Author

@depressed-pho Please, run a script generate-ya-make from utils folder to solve an issue with Yandex synchronization check. Then commit add all ya.make files to this PR.

I did, but the script updated no ya.make files except for the one already included in this commit. I rebased the commit anyway.

While this is implemented as a compression codec, it does not actually compress data. It instead encrypts data on disk. The key is obtained by executing a user-specified command at the server startup, or if it's not specified the codec refuses to process any data. For now the only supported cipher is 'AES-128-GCM-SIV'.
@alexey-milovidov
Copy link
Member

alexey-milovidov commented May 19, 2021

(deleted) This was wrong.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Jul 27, 2021

This codec is actually mostly safe to use but some considerations exist:

  • the information that is revealed is when some blocks of data are the same: it makes it unsafe if ClickHouse is used for multiple tenants and some of them may have access to encrypted data on disk - then one tenant can find that other tenant have the same sequences of data;
  • also the size of compressed data is revealed if it is applied after compression - it makes bruteforce attacks slightly easier.

I think we can merge it and highlight these considerations in docs.

PS. I did not know that AES-GCM-SIV is a well known technique, initially I thought it is some dirty hack :)

@alexey-milovidov
Copy link
Member

@alex-zaitsev

For now it will only support AES_128_GCM_SIV, but we can extend later to other options.

For other modes of operation it will be "instant death" if we will not generate and store random IVs. But if we do it, replicas will argue about mismatched checksum of "compressed" data and download data parts from each other excessively.

So, I think we should not allow more modes of operation here.

@FArthur-cmd FArthur-cmd self-assigned this Jul 28, 2021
@FArthur-cmd FArthur-cmd merged commit 6425dd0 into ClickHouse:master Jul 30, 2021
For Linux with systemd:

```xml
<encryption>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This configuration can be confused with the configuration of entrypted disks.

Copy link
Member

@vitlibar vitlibar Jul 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, probably a longer path in xml would be better, for example

<codecs>
    <aes_128_gcm_siv>
        <key_hex>...</key_hex>
    </aes_128_gcm_siv>
</codecs>


// turbob64 doesn't like whitespace characters in input. Strip
// them before decoding.
std::erase_if(b64_key, [](char c)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It has an inconsistency: in this place, the encryption key is in base64 while for encrypted disks it is in hex or binary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think, it's better to read it in hex.

{
#if USE_BASE64 && USE_SSL && USE_INTERNAL_SSL_LIBRARY

auto process = ShellCommand::execute(key_command);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another inconsistency - here the key is loaded by a command placed in config,
while for encrypted disks we don't allow to put a script in config file, we expect the key can be imported from env variable with from_env=... attribute in config.

{
// Fixed nonce. Yes this is unrecommended, but we have to live
// with it.
std::string_view nonce("\0\0\0\0\0\0\0\0\0\0\0\0", 12);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is actually ok for AES_GCM_SIV?

@alexey-milovidov
Copy link
Member

Need help from @vitlibar for further polishing.

traceon added a commit to traceon/ClickHouse that referenced this pull request Jul 31, 2021
* master: (482 commits)
  ru version colon
  en version colon
  Attempt to fix flaky 00705_drop_create_merge_tree
  Help with ClickHouse#26424
  Adjust 00537_quarters to be timezone independent
  Add a codec AES_128_GCM_SIV for encrypting columns on disk (ClickHouse#19896)
  Update BaseDaemon.cpp
  Update PULL_REQUEST_TEMPLATE.md
  edits after review
  small fix in links
  Make the test parallelizable
  ru version edit
  ru version
  en version
  Fix 01600_quota_by_forwarded_ip
  typo
  Remove trailing whitespaces from docs
  Remove trailing whitespaces from docs
  fix system.zookeeper_log initialization
  Move formatBlock to its own file
  ...

# Conflicts:
#	programs/server/Server.cpp
@vitlibar
Copy link
Member

vitlibar commented Aug 5, 2021

My thoughts about that:

  1. About syntax. I believe Codec(AES-128-GCM-SIV) would be better than Codec(Encrypted('AES-128-GCM-SIV')) because the first one is shorter and more consistent with other codecs. I mean for ordinary compression codec we don't write Codec(Compression('ZSTD')), do we? And I think if we later decide to support other encryption algorithm, for example AES-128-GCM with storing random nonce in output stream, we will just create a new codec.

  2. I suppose we should support multiple keys, because sometimes we may wish to switch to a new key but still allow using previous keys for decryption. Of course to support multiple keys we should add a key id into an output stream so the decryptor later will know which key to use in case if in the future the current key is changed.

  3. As already said, the representation of keys in the config file looks confusing because we also have encrypted disks.
    I suggest to set keys in the config file like this:

<encryption_codecs>
    <!-- just key in hex, not some command -->
    <!-- the code should use directly this key and throw an exception if its length is not 16 bytes -->
    <key_hex>...</key_hex>
</encryption_codecs>

or, for multiple keys

<encryption_codecs>
        <!-- just key in hex, not some command -->
        <!-- the code should use this key and throw an exception if its length is not 16 bytes -->
        <key_hex id="0">...</key_hex>
        <key_hex id="1">...</key_hex>
        <current_key_id>1</current_key_id>
</encryption_codecs>

or

<encryption_codecs>
        <!-- just key in hex, not some command -->
        <!-- the code should use this key and throw an exception if its length is not 16 bytes -->
        <key_hex id="0">...</key_hex>
        <key_hex id="1">...</key_hex>
    <aes_128_gcm_siv>
        <current_key_id>1</current_key_id>
    </aes_128_gcm_siv>
</encryption_codecs>
  1. Also we should change the documentation and say there that though encryption keys can be stored in the configuration file that is NOT recommended because it's not secure enough. It's better either to use some environmental variables (see "from_env" tag) or move the keys into a separate config file on a secure disk and put a symlink to that config file to config.d/ folder.

  2. About implementation. It seems to be wrong. It says that it uses aes_128_gcm_siv but actually uses aes_128_gcm. Also I don't think we should use nonces containing only zeroes, a random hardcoded nonce would look better. With a corresponding comment about nonce misuse resistance. And if we require the keys in the config file to be 16 bytes long it seems to be unnecessary to call HKDF.

@alexey-milovidov
Copy link
Member

alexey-milovidov commented Aug 5, 2021

I believe Codec(AES-128-GCM-SIV) would be better than Codec(Encrypted('AES-128-GCM-SIV'))

Yes, 100%. It was just some false advice...

@alexey-milovidov
Copy link
Member

I suppose we should support multiple keys, because sometimes we may wish to switch to a new key but still allow using previous keys for decryption. Of course to support multiple keys we should add a key id into an output stream so the decryptor later will know which key to use in case if in the future the current key is changed.

+1, it will be consistent with full disk encryption that we have implemented.

@alexey-milovidov
Copy link
Member

And if we require the keys in the config file to be 16 bytes long it seems to be unnecessary to call HKDF.

+1.

Using KDF will make it inconsistent with full disk encryption.
Otherwise we can add a separate alternative way to specify key, when KDF will be applied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature Pull request with new product feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants