Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for skipping subtree revisions to increase read performance and reduce disk usage #3201

Merged
merged 5 commits into from
Dec 12, 2023

Conversation

mhutchinson
Copy link
Contributor

@mhutchinson mhutchinson commented Nov 9, 2023

Just to confirm that removing this will actually work in practice.

The tests pass, so now we need to design an upgrade mechanism.
Probably forking this mysql driver / schema to be a mysql2 which is
effectively a different driver.

We'll also need to make some changes to this to support MariaDB.
The 'REPLACE INTO' syntax is, I'm pretty sure, a MySQL specfic
extension. There is a more standard format for doing this, but it
involved passing in more '?' parameters which was already getting
too wild for me with the <placeholder> expansion. Probably a
solveable problem, but one for when I'm a bit less ill :-)

@jsha
Copy link
Contributor

jsha commented Nov 9, 2023

MariaDB also documents the REPLACE INTO syntax: https://mariadb.com/kb/en/replace/

@jsha
Copy link
Contributor

jsha commented Nov 9, 2023

I think there's a way to get the benefits of this branch without the disruption of a table migration. Rather than remove the column, we could pick some very large revision number (larger than any currently existing one), and start writing / replacing all subtrees at that revision number. We could then modify the selectSubtreeSQL query to always request that specific revision (eliminating the subquery).

Doing this on an existing table would not reduce the cardinality of the existing index but would dramatically slow its growth, which could have a performance benefit, depending on whether the issue is the absolute size or the growth rate.

@mhutchinson
Copy link
Contributor Author

mhutchinson commented Nov 10, 2023 via email

@mhutchinson
Copy link
Contributor Author

As a data point, I've run the integration tests in CT-go against this version of Trillian and confirmed that it works: ./trillian/integration/integration_test.sh

@mhutchinson
Copy link
Contributor Author

For anyone following along at home, the commit history got weird here. Somehow (maybe during a rebase?) I ended up setting the branch to an intermediate version of the work that was incomplete. Luckily git reflog spelunking got me back to a version that works. I've force-pushed and so you should only see a single commit, which leaves the tree in a working state.

I've abandoned the mysql2 idea that I was pursuing for the moment. It may be possible to get it to work, but it ended up with a lot of duplication and started sprawling to changes outside of the driver directory (introducing new flags, storage/testdb.go, etc). I'll take a look at some other options.

mhutchinson added a commit to mhutchinson/trillian that referenced this pull request Nov 16, 2023
This is a minor cleanup but it simplifies google#3201 to avoid reading this, and pulling out this change to the existing logic into its own PR simplifies review.
@mhutchinson mhutchinson changed the title A really dirty rip-out of the subtree revisions Support for skipping subtree revisions to increase read performance and disk usage Nov 16, 2023
@mhutchinson
Copy link
Contributor Author

mhutchinson commented Nov 16, 2023

There are a number of approaches that could be used to transition from the current world where subtrees are revisioned, to one where they are unrevisioned. I'll list them here:

The breaking change

This simply changes the schema to drop the revisions column and then updates the read/write queries. This is what the first commit to this PR does.

This is the simplest thing to do, but it has no backwards compatibility for logs running on the older schema. This would need to be a Trillian 2.0.

Different storage layer

Copying the mysql storage layer into a mysql2 storage directory avoids the breaking change for existing deployments. Deploying trillian with --storage_system=mysql2 would change to the revisionless version.

This involves a lot of duplicate code for an indeterminate amount of time. It also means that new flags need to be introduced such as mysql2_uri to avoid conflicts with the existing flags in the mysql storage implementation. This approach also has a limitation that a single deployment of Trillian could only serve with either mysql or mysql2. This means that log operators that have a number of historic logs with revisioned subtrees would need to turn up a new instance of Trillian for any new logs that desired the mysql2 revisionless feature.

Different code paths depending on tree features

Trillian stores tree metadata that is read at the start of each tree operation. In this approach we store some marker in the tree metadata that indicates that this is a revisionless tree, and an absence of this marker means that it will be treated as a historic revisioned subtree log.

This approach allows for backwards compatibility with existing logs, and a single instance of Trillian can support both revisioned and unrevisioned subtrees. Further, having this data in the database means that it's more difficult for a log operator to point the wrong configuration of Trillian at a database and potentially make incompatible writes by mixing revisioned / unrevisioned approaches on the same log.

This approach has a little more complexity in terms of code, but simpler in terms of deployment. For this reason, this is the primary exploration path at this point. The current commit chain demonstrates one approach to doing this, where the marker is the presence of magic bytes '🙃norevs🙃' as the prefix of the tree description field. Other similar approaches involve hijacking the private/public key fields with magic bytes, as these are no longer used.

A more elegant approach would be to add a new column to the Trees table, but this would require a DDL schema update migration for all log operators using MySQL.

mhutchinson added a commit that referenced this pull request Nov 21, 2023
This is a minor cleanup but it simplifies #3201 to avoid reading this, and pulling out this change to the existing logic into its own PR simplifies review.
@pav-kv
Copy link
Contributor

pav-kv commented Nov 23, 2023

Just passing by here with some thoughts dump :)

Would a blatant ALTER TABLE Trees ADD COLUMN IF NOT EXISTS SomeInfo LONGBLOB; do the trick? Containing the norevs and other potential new things (bundled as a proto, for instance). If the binary is rolled back to an older code, would it choke upon seeing an extra field in this table?

To make this column creation seamless, the ALTER TABLE could be run upon the process startup (a bit wasteful, but perhaps fine if it's done only once by each node on each rollout). After a few releases, this step could be removed.

In general, some seamless/automatic/stateful migration mechanism (built into the concrete Trillian storage implementations) would remove a lot of friction here, and would allow Trillian schema/code to iterate. For example, have some migrations versioning (like a simple counter), a table in the schema containing the "min supported version" which is bumped after each migration and ensuring all nodes know this version, and some form of election and progress tracking to facilitate these migrations. All done from within Trillian (without reliance on operators, manual steps, and external tooling).

As an example, CockroachDB seamless migrations mechanism could be of interest, but may be a bit complex.

@n-canter
Copy link

n-canter commented Dec 2, 2023

REPLACE INTO may result in worse performance comparing to INSERT INTO … ON DUPLICATE KEY UPDATE

I created three trees:

  • 1884865777120468264 with revisions
  • 3161284979007895112 without revisions with REPLACE INTO
  • 6090622392197454768 without revisions with INSERT INTO ... ON DUPLICATE KEY UPDATE

Ran small benchmark (mysql Ver 8.0.35 for Linux on x86_64 (Source distribution)):

  • Add 1000000 records to queue.
  • Sequence queued records with -batch_size=100
  • GetInclusionProofByHash() for random records
mysql> SELECT TreeId, COUNT(TreeId) FROM SequencedLeafData GROUP BY TreeId;
+---------------------+---------------+
| TreeId              | COUNT(TreeId) |
+---------------------+---------------+
| 1884865777120468264 |       1000000 |
| 3161284979007895112 |       1000000 |
| 6090622392197454768 |       1000000 |
+---------------------+---------------+
3 rows in set (0.96 sec)

mysql> SELECT TreeId, COUNT(TreeId) FROM Subtree GROUP BY TreeId;
+---------------------+---------------+
| TreeId              | COUNT(TreeId) |
+---------------------+---------------+
| 1884865777120468264 |         17671 |
| 3161284979007895112 |          3924 |
| 6090622392197454768 |          3924 |
+---------------------+---------------+
3 rows in set (0.01 sec)

mysql> SELECT TreeId, COUNT(TreeId) FROM TreeHead GROUP BY TreeId;
+---------------------+---------------+
| TreeId              | COUNT(TreeId) |
+---------------------+---------------+
| 1884865777120468264 |         10001 |
| 3161284979007895112 |         10001 |
| 6090622392197454768 |         10001 |
+---------------------+---------------+
3 rows in set (0.01 sec)

Getting rid of subtree revisions results in ~4.5x less rows in Subtree table.

Benchmark results prove that old version is slower than a new one, while REPLACE INTO and INSERT INTO ... ON DUPLICATE KEY UPDATE show similar results.

goos: linux
goarch: amd64
pkg: github.com/google/trillian/cmd/bench
cpu: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
BenchmarkInclusionProof
BenchmarkInclusionProof/with_revs
BenchmarkInclusionProof/with_revs-8         	     630	   1886578 ns/op
BenchmarkInclusionProof/without_revs
BenchmarkInclusionProof/without_revs-8      	     741	   1555563 ns/op
BenchmarkInclusionProof/without_revs_insert
BenchmarkInclusionProof/without_revs_insert-8         	     738	   1539970 ns/op
PASS
ok  	github.com/google/trillian/cmd/bench	4.006s

@mhutchinson
Copy link
Contributor Author

@pavelkalinnikov Hello, and welcome back :-)

We were at a conference last week, apologies for the slow reply. Replies are here, and I'll put them in line:

Would a blatant ALTER TABLE Trees ADD COLUMN IF NOT EXISTS SomeInfo LONGBLOB; do the trick? Containing the norevs and other potential new things (bundled as a proto, for instance). If the binary is rolled back to an older code, would it choke upon seeing an extra field in this table?

To make this column creation seamless, the ALTER TABLE could be run upon the process startup (a bit wasteful, but perhaps fine if it's done only once by each node on each rollout). After a few releases, this step could be removed.

I considered this but ruled it out because I think we would still need to have manual steps for users that have deployed with a locked down configuration, i.e. where the DB user that Trillian connects as has data read/write permissions, but does not have permissions to modify the schema (i.e. no ALTER TABLE permission). I suspect this concern doesn't affect you as a Cockroach maintainer because your code is the database and thus has all the permissions :-)

In general, some seamless/automatic/stateful migration mechanism (built into the concrete Trillian storage implementations) would remove a lot of friction here, and would allow Trillian schema/code to iterate. For example, have some migrations versioning (like a simple counter), a table in the schema containing the "min supported version" which is bumped after each migration and ensuring all nodes know this version, and some form of election and progress tracking to facilitate these migrations. All done from within Trillian (without reliance on operators, manual steps, and external tooling).

+1 for a metadata table. I wished that we had one when I started this PR :-) Something to add to the roadmap, along with automatic upgrades, and documentation on the permissions that the Trillian user needs. That said, I'm still very nervous about even proposing this feature because this adds complexity that is somewhat hard to test, and if it goes wrong, logs don't support recovery from a backup.

@mhutchinson mhutchinson marked this pull request as ready for review December 4, 2023 15:16
@mhutchinson mhutchinson requested a review from a team as a code owner December 4, 2023 15:16
@phbnf phbnf self-assigned this Dec 5, 2023
@mhutchinson
Copy link
Contributor Author

@n-canter suggested to use INSERT ... ON DUPLICATE KEY instead of REPLACE, which I have included. Thanks for the suggestion! My reply to this suggestion risks getting lost now that the comment is resolved, and I think it could be useful to future maintainers so I'll include it here:

I have a minor concern that the VALUES() syntax is unfavourable in both MySQL[1] and to a lesser degree MariaDB[2] in the newest versions, but for the time being they both support it so let's roll with it.

Tested this with the following docker images:

  • mysql:5
  • mysql:8.0
  • mariadb:10.5

[1] https://dev.mysql.com/doc/refman/8.0/en/insert-on-duplicate.html - "The use of VALUES() to refer to the new row and columns is deprecated beginning with MySQL 8.0.20, and is subject to removal in a future version of MySQL"
[2] https://mariadb.com/kb/en/values-value/ - In MariaDB 10.3.3 this function was renamed to VALUE(), because it's incompatible with the standard Table Value Constructors syntax ... The VALUES() function can still be used even from MariaDB 10.3.3 but only in INSERT ... ON DUPLICATE KEY UPDATE statements; it's a syntax error otherwise.

storage/mysql/tree_storage.go Outdated Show resolved Hide resolved
storage/mysql/tree_storage.go Show resolved Hide resolved
storage/mysql/tree_storage.go Outdated Show resolved Hide resolved
@mhutchinson
Copy link
Contributor Author

I've convinced myself that the current approach is too leaky an abstraction and I'm going to put more work into this. Specifically, changing the description in createTree means that other storage implementations other than mysql will be affected by this change. This is ugly and confusing (especially if these other storage implementations still use subtree revisions), and sets the wrong precedent for isolation of these different storage backends.

I have some ideas about doing this in a cleaner way, and I'll take another run at this tomorrow.

mhutchinson added a commit to mhutchinson/trillian that referenced this pull request Dec 6, 2023
This removes an obstacle to executing google#3201 cleanly. Despite the duplication, it's also logically cleaner. The different backend implementations should be able to evolve independently. Tying them together with common code for reading rows forces the same schema layout across all implementations.
@mhutchinson
Copy link
Contributor Author

govulncheck failures seem completely unrelated to this PR. I'm ignoring them for the moment, but if they need my attention before we try to merge then let me know.

Latest commits land some big changes that rework how the new settings bit is stored. The main deltas:

  • Revisionless is still the default for new trees, but can be overridden by passing in a StorageSettings on the tree. The createtree command doesn't support this, so anyone really wanting to do this will need to do some work. The real motivation for doing this is to allow the tests to set up trees in both configurations so we can test both types.
  • I've laid the groundwork for other types of settings being stored, which is to say that we use a persisted struct of options instead of a string prefix in a description
  • The description is left alone now, and we now use the PublicKey column to store the encoded settings object

It's unfortunate how messy this is, but I'm pointing the finger at protos, especially the any type. It also has an extra complexity that we can't write protos directly to the DB because it's too dangerous. An encoded proto with all default values is an empty message, which appears to be exactly the same as no message when we read it back. In this case, we really care about this distinction. To work around this, a new struct has been introduced and it is gob encoded/decoded into the PublicKey column.

@AlCutter if you can review the TODO on line 166 of tree_storage.go that would be great. The only time this happens is because logtests.go has a test case that looks weird to me. If this reflects an actual invocation in reality then the design needs to change. If it doesn't, we should have a chat about the future of that test.

Copy link
Contributor

@phbnf phbnf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For anyone else reviewing this, I've found it useful to break this review in two halves since some changes were done and undone by some commits: the first commit, and all the other ones together.

storage/mysql/schema/storage.sql Outdated Show resolved Hide resolved
storage/mysql/admin_storage.go Show resolved Hide resolved
storage/mysql/admin_storage.go Show resolved Hide resolved
storage/mysql/sql.go Show resolved Hide resolved
storage/mysql/tree_storage.go Outdated Show resolved Hide resolved
storage/mysql/admin_storage.go Show resolved Hide resolved
Copy link
Contributor Author

@mhutchinson mhutchinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments. Addressed, and still eagerly courting more eyes and opinions.

storage/mysql/admin_storage.go Show resolved Hide resolved
storage/mysql/admin_storage.go Show resolved Hide resolved
storage/mysql/admin_storage.go Show resolved Hide resolved
storage/mysql/sql.go Show resolved Hide resolved
storage/mysql/tree_storage.go Outdated Show resolved Hide resolved
mhutchinson added a commit to mhutchinson/trillian that referenced this pull request Dec 7, 2023
This removes an obstacle to executing google#3201 cleanly. Despite the duplication, it's also logically cleaner. The different backend implementations should be able to evolve independently. Tying them together with common code for reading rows forces the same schema layout across all implementations.
mhutchinson added a commit that referenced this pull request Dec 7, 2023
This removes an obstacle to executing #3201 cleanly. Despite the duplication, it's also logically cleaner. The different backend implementations should be able to evolve independently. Tying them together with common code for reading rows forces the same schema layout across all implementations.
mhutchinson added a commit to mhutchinson/trillian that referenced this pull request Dec 11, 2023
This was previously creating and initializing a tree, and then testing what happened if you created a transaction on a tree ID that definitely didn't exist. What it was trying to test was something different, which is the case where a tree had been created/defined, but was not initialized with an empty log root yet. The test now reflects that. This allows google#3201 to avoid a nil check for something that otherwise will be guaranteed to exist.
mhutchinson added a commit that referenced this pull request Dec 11, 2023
This was previously creating and initializing a tree, and then testing what happened if you created a transaction on a tree ID that definitely didn't exist. What it was trying to test was something different, which is the case where a tree had been created/defined, but was not initialized with an empty log root yet. The test now reflects that. This allows #3201 to avoid a nil check for something that otherwise will be guaranteed to exist.
This confirms that removing this will actually work in practice.
This commit is a breaking change and can't be merged as-is. The
lesson to be learned is that this is the minimum change that would
be required to make this work if there were no existing clients
using Trillian.
The same schema is used for both revisioned and unrevisioned subtrees. The difference is that we always write a revision of 0 in the unrevisioned case, which still means that there will only be a single entry per subtree.

The old PublicKey column has been used to store a settings object for newly created trees.
If this settings object is found in this column, then it will be parsed and checked for a property indicating that this tree is revisionless.
If the property is successfully confirmed to be revisionless then all writes to the subtree table will have a revision of 0, and all reads will skip the nested inner query that was causing slow queries.
If the property cannot be confirmed to be revisionless (no settings persisted, or settings persisted but explictly say to use revisioned), then the functionality will continue in the old way.
This preserves backwards compatibility, but makes it so that new trees will gain these features.

For users with legacy trees that wish to take advantage of the smaller storage costs and faster queries of the new revisionless storage, the proposed migration mechanism is to use migrillian to clone the old tree to a new tree. If anyone is interested in doing this then I recommend speaking to us on Slack (https://join.slack.com/t/transparency-dev/shared_invite/zt-27pkqo21d-okUFhur7YZ0rFoJVIOPznQ).
@mhutchinson
Copy link
Contributor Author

I've squashed the history for this PR into 2 commits. The first of these shows what the real change is to support the new revisionless approach, but does so in a breaking way. The second change makes it so that all new trees get the new revisionless behaviour, but old trees continue to work exactly as they used to.

storage/mysql/sql.go Show resolved Hide resolved
@phbnf phbnf removed their assignment Dec 12, 2023
@mhutchinson
Copy link
Contributor Author

@AlCutter @jsha @n-canter this is looking final now. If you have any comments positive/neutral/negative before this gets merged, now is the time!

It's a big one
CHANGELOG.md Outdated Show resolved Hide resolved
@mhutchinson mhutchinson changed the title Support for skipping subtree revisions to increase read performance and disk usage Support for skipping subtree revisions to increase read performance and reduce disk usage Dec 12, 2023
@mhutchinson mhutchinson merged commit d95458c into google:master Dec 12, 2023
10 checks passed
@mhutchinson mhutchinson deleted the noSubtreeRev branch December 12, 2023 16:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants