Support for skipping subtree revisions to increase read performance and reduce disk usage #3201

mhutchinson · 2023-11-09T12:59:30Z

Just to confirm that removing this will actually work in practice.

The tests pass, so now we need to design an upgrade mechanism.
Probably forking this mysql driver / schema to be a mysql2 which is
effectively a different driver.

We'll also need to make some changes to this to support MariaDB.
The 'REPLACE INTO' syntax is, I'm pretty sure, a MySQL-specific
extension. There is a more standard format for doing this, but it
involved passing in more '?' parameters which was already getting
too wild for me with the <placeholder> expansion. Probably a
solvable problem, but one for when I'm a bit less ill :-)

jsha · 2023-11-09T17:40:50Z

MariaDB also documents the REPLACE INTO syntax: https://mariadb.com/kb/en/replace/

jsha · 2023-11-09T20:40:59Z

I think there's a way to get the benefits of this branch without the disruption of a table migration. Rather than remove the column, we could pick some very large revision number (larger than any currently existing one), and start writing / replacing all subtrees at that revision number. We could then modify the selectSubtreeSQL query to always request that specific revision (eliminating the subquery).

Doing this on an existing table would not reduce the cardinality of the existing index but would dramatically slow its growth, which could have a performance benefit, depending on whether the issue is the absolute size or the growth rate.

mhutchinson · 2023-11-10T09:42:48Z

I have the same intuition and imagined a similar solution. Such an elegant hack wouldn't be without risks, and would require a careful rollout. In particular a staged rollout where some old code and new code were live at the same time could really break things. If you would be happy to work through these risks then I would be happy to think about this approach some more.

mhutchinson · 2023-11-15T15:37:29Z

As a data point, I've run the integration tests in CT-go against this version of Trillian and confirmed that it works: ./trillian/integration/integration_test.sh

mhutchinson · 2023-11-16T10:46:45Z

For anyone following along at home, the commit history got weird here. Somehow (maybe during a rebase?) I ended up setting the branch to an intermediate version of the work that was incomplete. Luckily git reflog spelunking got me back to a version that works. I've force-pushed and so you should only see a single commit, which leaves the tree in a working state.

I've abandoned the mysql2 idea that I was pursuing for the moment. It may be possible to get it to work, but it ended up with a lot of duplication and started sprawling to changes outside of the driver directory (introducing new flags, storage/testdb.go, etc). I'll take a look at some other options.


mhutchinson · 2023-11-16T16:19:54Z

There are a number of approaches that could be used to transition from the current world where subtrees are revisioned, to one where they are unrevisioned. I'll list them here:

The breaking change

This simply changes the schema to drop the revisions column and then updates the read/write queries. This is what the first commit to this PR does.

This is the simplest thing to do, but it has no backwards compatibility for logs running on the older schema. This would need to be a Trillian 2.0.

Different storage layer

Copying the mysql storage layer into a mysql2 storage directory avoids the breaking change for existing deployments. Deploying trillian with --storage_system=mysql2 would change to the revisionless version.

This involves a lot of duplicate code for an indeterminate amount of time. It also means that new flags need to be introduced such as mysql2_uri to avoid conflicts with the existing flags in the mysql storage implementation. This approach also has a limitation that a single deployment of Trillian could only serve with either mysql or mysql2. This means that log operators that have a number of historic logs with revisioned subtrees would need to turn up a new instance of Trillian for any new logs that desired the mysql2 revisionless feature.

Different code paths depending on tree features

Trillian stores tree metadata that is read at the start of each tree operation. In this approach we store some marker in the tree metadata that indicates that this is a revisionless tree, and an absence of this marker means that it will be treated as a historic revisioned subtree log.

This approach allows for backwards compatibility with existing logs, and a single instance of Trillian can support both revisioned and unrevisioned subtrees. Further, having this data in the database means that it's more difficult for a log operator to point the wrong configuration of Trillian at a database and potentially make incompatible writes by mixing revisioned / unrevisioned approaches on the same log.

This approach has a little more complexity in terms of code, but is simpler in terms of deployment. For this reason, this is the primary exploration path at this point. The current commit chain demonstrates one approach to doing this, where the marker is the presence of magic bytes '🙃norevs🙃' as the prefix of the tree description field. Other similar approaches involve hijacking the private/public key fields with magic bytes, as these are no longer used.

A more elegant approach would be to add a new column to the Trees table, but this would require a DDL schema update migration for all log operators using MySQL.
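As a minimal sketch of the description-prefix marker described in this comment: the magic bytes are taken from the text above, while the function names are hypothetical and only illustrate how such a marker could be written and detected.

```go
package main

import (
	"fmt"
	"strings"
)

// noRevsMarker is the magic prefix quoted in the discussion.
const noRevsMarker = "🙃norevs🙃"

// isRevisionless reports whether a tree description carries the marker.
// A description without the marker is treated as a legacy, revisioned tree.
func isRevisionless(description string) bool {
	return strings.HasPrefix(description, noRevsMarker)
}

// markRevisionless prepends the marker unless it is already present.
func markRevisionless(description string) string {
	if isRevisionless(description) {
		return description
	}
	return noRevsMarker + description
}

// userDescription strips the marker so callers see what they stored.
func userDescription(description string) string {
	return strings.TrimPrefix(description, noRevsMarker)
}

func main() {
	stored := markRevisionless("my log")
	fmt.Println(isRevisionless(stored), userDescription(stored))
	fmt.Println(isRevisionless("legacy log"))
}
```

The absence of the marker is the safe default here: anything that cannot be recognised is handled as a revisioned tree.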


pav-kv · 2023-11-23T19:22:39Z

Just passing by here with some thoughts dump :)

Would a blatant ALTER TABLE Trees ADD COLUMN IF NOT EXISTS SomeInfo LONGBLOB; do the trick? Containing the norevs and other potential new things (bundled as a proto, for instance). If the binary is rolled back to an older code, would it choke upon seeing an extra field in this table?

To make this column creation seamless, the ALTER TABLE could be run upon the process startup (a bit wasteful, but perhaps fine if it's done only once by each node on each rollout). After a few releases, this step could be removed.

In general, some seamless/automatic/stateful migration mechanism (built into the concrete Trillian storage implementations) would remove a lot of friction here, and would allow Trillian schema/code to iterate. For example, have some migrations versioning (like a simple counter), a table in the schema containing the "min supported version" which is bumped after each migration and ensuring all nodes know this version, and some form of election and progress tracking to facilitate these migrations. All done from within Trillian (without reliance on operators, manual steps, and external tooling).

As an example, CockroachDB's seamless migration mechanism could be of interest, but may be a bit complex.

n-canter · 2023-12-02T18:25:30Z

REPLACE INTO may result in worse performance compared to INSERT INTO … ON DUPLICATE KEY UPDATE

I created three trees:

1884865777120468264 with revisions
3161284979007895112 without revisions with REPLACE INTO
6090622392197454768 without revisions with INSERT INTO ... ON DUPLICATE KEY UPDATE

Ran a small benchmark (mysql Ver 8.0.35 for Linux on x86_64 (Source distribution)):

Add 1000000 records to queue.
Sequence queued records with -batch_size=100
GetInclusionProofByHash() for random records

mysql> SELECT TreeId, COUNT(TreeId) FROM SequencedLeafData GROUP BY TreeId;
+---------------------+---------------+
| TreeId              | COUNT(TreeId) |
+---------------------+---------------+
| 1884865777120468264 |       1000000 |
| 3161284979007895112 |       1000000 |
| 6090622392197454768 |       1000000 |
+---------------------+---------------+
3 rows in set (0.96 sec)

mysql> SELECT TreeId, COUNT(TreeId) FROM Subtree GROUP BY TreeId;
+---------------------+---------------+
| TreeId              | COUNT(TreeId) |
+---------------------+---------------+
| 1884865777120468264 |         17671 |
| 3161284979007895112 |          3924 |
| 6090622392197454768 |          3924 |
+---------------------+---------------+
3 rows in set (0.01 sec)

mysql> SELECT TreeId, COUNT(TreeId) FROM TreeHead GROUP BY TreeId;
+---------------------+---------------+
| TreeId              | COUNT(TreeId) |
+---------------------+---------------+
| 1884865777120468264 |         10001 |
| 3161284979007895112 |         10001 |
| 6090622392197454768 |         10001 |
+---------------------+---------------+
3 rows in set (0.01 sec)

Getting rid of subtree revisions results in ~4.5x fewer rows in the Subtree table.

The benchmark results show that the old version is slower than the new one, while REPLACE INTO and INSERT INTO ... ON DUPLICATE KEY UPDATE show similar results.

goos: linux
goarch: amd64
pkg: github.com/google/trillian/cmd/bench
cpu: 11th Gen Intel(R) Core(TM) i5-1145G7 @ 2.60GHz
BenchmarkInclusionProof
BenchmarkInclusionProof/with_revs
BenchmarkInclusionProof/with_revs-8         	     630	   1886578 ns/op
BenchmarkInclusionProof/without_revs
BenchmarkInclusionProof/without_revs-8      	     741	   1555563 ns/op
BenchmarkInclusionProof/without_revs_insert
BenchmarkInclusionProof/without_revs_insert-8         	     738	   1539970 ns/op
PASS
ok  	github.com/google/trillian/cmd/bench	4.006s
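For reference, the ratios behind those statements can be recomputed from the figures in the output above. This is only arithmetic on the posted numbers; the helper names are made up for the example.

```go
package main

import "fmt"

// ratio returns how many times larger a is than b.
func ratio(a, b float64) float64 { return a / b }

// reductionPercent returns the relative decrease from before to after.
func reductionPercent(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	// Subtree row counts: with revisions vs. without revisions.
	fmt.Printf("Subtree rows: %.2fx fewer\n", ratio(17671, 3924))
	// ns/op: with_revs vs. without_revs (REPLACE INTO).
	fmt.Printf("REPLACE INTO: %.1f%% less time per op\n", reductionPercent(1886578, 1555563))
	// ns/op: with_revs vs. without_revs_insert (ON DUPLICATE KEY UPDATE).
	fmt.Printf("ON DUPLICATE KEY UPDATE: %.1f%% less time per op\n", reductionPercent(1886578, 1539970))
}
```

This prints roughly 4.50x, 17.5% and 18.4%, which matches the "~4.5x" claim and shows that the two write strategies differ by about one percentage point in this single run.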

mhutchinson · 2023-12-04T15:15:58Z

@pavelkalinnikov Hello, and welcome back :-)

We were at a conference last week, apologies for the slow reply. Replies are here, and I'll put them in line:

> Would a blatant ALTER TABLE Trees ADD COLUMN IF NOT EXISTS SomeInfo LONGBLOB; do the trick? Containing the norevs and other potential new things (bundled as a proto, for instance). If the binary is rolled back to an older code, would it choke upon seeing an extra field in this table?

> To make this column creation seamless, the ALTER TABLE could be run upon the process startup (a bit wasteful, but perhaps fine if it's done only once by each node on each rollout). After a few releases, this step could be removed.

I considered this but ruled it out because I think we would still need to have manual steps for users that have deployed with a locked down configuration, i.e. where the DB user that Trillian connects as has data read/write permissions, but does not have permissions to modify the schema (i.e. no ALTER TABLE permission). I suspect this concern doesn't affect you as a Cockroach maintainer because your code is the database and thus has all the permissions :-)

> In general, some seamless/automatic/stateful migration mechanism (built into the concrete Trillian storage implementations) would remove a lot of friction here, and would allow Trillian schema/code to iterate. For example, have some migrations versioning (like a simple counter), a table in the schema containing the "min supported version" which is bumped after each migration and ensuring all nodes know this version, and some form of election and progress tracking to facilitate these migrations. All done from within Trillian (without reliance on operators, manual steps, and external tooling).

+1 for a metadata table. I wished that we had one when I started this PR :-) Something to add to the roadmap, along with automatic upgrades, and documentation on the permissions that the Trillian user needs. That said, I'm still very nervous about even proposing this feature because this adds complexity that is somewhat hard to test, and if it goes wrong, logs don't support recovery from a backup.


mhutchinson · 2023-12-05T12:20:57Z

@n-canter suggested to use INSERT ... ON DUPLICATE KEY instead of REPLACE, which I have included. Thanks for the suggestion! My reply to this suggestion risks getting lost now that the comment is resolved, and I think it could be useful to future maintainers so I'll include it here:

I have a minor concern that the VALUES() syntax is unfavourable in both MySQL[1] and to a lesser degree MariaDB[2] in the newest versions, but for the time being they both support it so let's roll with it.

Tested this with the following docker images:

mysql:5
mysql:8.0
mariadb:10.5

[1] https://dev.mysql.com/doc/refman/8.0/en/insert-on-duplicate.html - "The use of VALUES() to refer to the new row and columns is deprecated beginning with MySQL 8.0.20, and is subject to removal in a future version of MySQL"
[2] https://mariadb.com/kb/en/values-value/ - In MariaDB 10.3.3 this function was renamed to VALUE(), because it's incompatible with the standard Table Value Constructors syntax ... The VALUES() function can still be used even from MariaDB 10.3.3 but only in INSERT ... ON DUPLICATE KEY UPDATE statements; it's a syntax error otherwise.


mhutchinson · 2023-12-05T17:49:27Z

I've convinced myself that the current approach is too leaky an abstraction and I'm going to put more work into this. Specifically, changing the description in createTree means that storage implementations other than mysql will be affected by this change. This is ugly and confusing (especially if these other storage implementations still use subtree revisions), and sets the wrong precedent for isolation of these different storage backends.

I have some ideas about doing this in a cleaner way, and I'll take another run at this tomorrow.

This removes an obstacle to executing google#3201 cleanly. Despite the duplication, it's also logically cleaner. The different backend implementations should be able to evolve independently. Tying them together with common code for reading rows forces the same schema layout across all implementations.

mhutchinson · 2023-12-06T17:27:02Z

govulncheck failures seem completely unrelated to this PR. I'm ignoring them for the moment, but if they need my attention before we try to merge then let me know.

Latest commits land some big changes that rework how the new settings bit is stored. The main deltas:

Revisionless is still the default for new trees, but can be overridden by passing in a StorageSettings on the tree. The createtree command doesn't support this, so anyone really wanting to do this will need to do some work. The real motivation for doing this is to allow the tests to set up trees in both configurations so we can test both types.
I've laid the groundwork for other types of settings being stored, which is to say that we use a persisted struct of options instead of a string prefix in a description.
The description is left alone now, and we now use the PublicKey column to store the encoded settings object.

It's unfortunate how messy this is, but I'm pointing the finger at protos, especially the any type. It also has an extra complexity that we can't write protos directly to the DB because it's too dangerous. An encoded proto with all default values is an empty message, which appears to be exactly the same as no message when we read it back. In this case, we really care about this distinction. To work around this, a new struct has been introduced and it is gob encoded/decoded into the PublicKey column.
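The distinction between "no settings persisted" and "settings persisted with default values" can be sketched with encoding/gob as follows. The storageSettings struct, its Revisioned field, and the helper names are hypothetical stand-ins; treating an undecodable column as a legacy tree is also an assumption of this sketch.

```go
package main

import (
	"bytes"
	"encoding/gob"
	"fmt"
)

// storageSettings is a hypothetical stand-in for the persisted options struct.
type storageSettings struct {
	Revisioned bool
}

// encodeSettings gob-encodes the settings for storage in a blob column.
func encodeSettings(s storageSettings) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(s); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// isRevisionless returns true only when settings were persisted, decode
// cleanly, and say that the tree is not revisioned.
func isRevisionless(column []byte) bool {
	if len(column) == 0 {
		return false // nothing persisted: legacy, revisioned tree
	}
	var s storageSettings
	if err := gob.NewDecoder(bytes.NewReader(column)).Decode(&s); err != nil {
		return false // assumption: undecodable bytes mean a legacy tree
	}
	return !s.Revisioned
}

func main() {
	// All fields are zero values, yet the encoding is not empty, because
	// gob also writes a description of the type.
	blob, err := encodeSettings(storageSettings{Revisioned: false})
	if err != nil {
		panic(err)
	}
	fmt.Println("encoded bytes > 0:", len(blob) > 0)
	fmt.Println("revisionless (persisted):", isRevisionless(blob))
	fmt.Println("revisionless (empty column):", isRevisionless(nil))
}
```

With this shape, an empty column and a persisted all-defaults struct produce different answers, which is the property an all-default proto message would not give.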

@AlCutter if you can review the TODO on line 166 of tree_storage.go that would be great. The only time this happens is because logtests.go has a test case that looks weird to me. If this reflects an actual invocation in reality then the design needs to change. If it doesn't, we should have a chat about the future of that test.

phbnf

For anyone else reviewing this, I've found it useful to break this review in two halves since some changes were done and undone by some commits: the first commit, and all the other ones together.


mhutchinson

Thanks for the comments. Addressed, and still eagerly courting more eyes and opinions.


This was previously creating and initializing a tree, and then testing what happened if you created a transaction on a tree ID that definitely didn't exist. What it was trying to test was something different, which is the case where a tree had been created/defined, but was not initialized with an empty log root yet. The test now reflects that. This allows google#3201 to avoid a nil check for something that otherwise will be guaranteed to exist.


This confirms that removing this will actually work in practice. This commit is a breaking change and can't be merged as-is. The lesson to be learned is that this is the minimum change that would be required to make this work if there were no existing clients using Trillian.

The same schema is used for both revisioned and unrevisioned subtrees. The difference is that we always write a revision of 0 in the unrevisioned case, which still means that there will only be a single entry per subtree.

The old PublicKey column has been used to store a settings object for newly created trees. If this settings object is found in this column, then it will be parsed and checked for a property indicating that this tree is revisionless. If the property is successfully confirmed to be revisionless then all writes to the subtree table will have a revision of 0, and all reads will skip the nested inner query that was causing slow queries. If the property cannot be confirmed to be revisionless (no settings persisted, or settings persisted but explicitly say to use revisioned), then the functionality will continue in the old way. This preserves backwards compatibility, but makes it so that new trees will gain these features.

For users with legacy trees that wish to take advantage of the smaller storage costs and faster queries of the new revisionless storage, the proposed migration mechanism is to use migrillian to clone the old tree to a new tree. If anyone is interested in doing this then I recommend speaking to us on Slack (https://join.slack.com/t/transparency-dev/shared_invite/zt-27pkqo21d-okUFhur7YZ0rFoJVIOPznQ).

mhutchinson · 2023-12-11T16:10:59Z

I've squashed the history for this PR into 2 commits. The first of these shows what the real change is to support the new revisionless approach, but does so in a breaking way. The second change makes it so that all new trees get the new revisionless behaviour, but old trees continue to work exactly as they used to.


mhutchinson · 2023-12-12T09:38:04Z

@AlCutter @jsha @n-canter this is looking final now. If you have any comments positive/neutral/negative before this gets merged, now is the time!


mhutchinson force-pushed the noSubtreeRev branch from f5b22ff to d4caed0 Compare November 13, 2023 10:36

mhutchinson force-pushed the noSubtreeRev branch from 30db159 to da7cf49 Compare November 16, 2023 10:41

mhutchinson mentioned this pull request Nov 16, 2023

Skip SELECTing revision that isn't used #3207

Merged

mhutchinson changed the title ~~A really dirty rip-out of the subtree revisions~~ Support for skipping subtree revisions to increase read performance and disk usage Nov 16, 2023

mhutchinson added a commit that referenced this pull request Nov 21, 2023

Skip SELECTing revision that isn't used (#3207)

1b80253

This is a minor cleanup but it simplifies #3201 to avoid reading this, and pulling out this change to the existing logic into its own PR simplifies review.

mhutchinson force-pushed the noSubtreeRev branch from 09bc384 to 7b71e91 Compare November 21, 2023 14:39

mhutchinson marked this pull request as ready for review December 4, 2023 15:16

mhutchinson requested a review from a team as a code owner December 4, 2023 15:16

mhutchinson requested a review from AlCutter December 4, 2023 15:16

mhutchinson force-pushed the noSubtreeRev branch from 0fdc0ff to 5696d6d Compare December 4, 2023 16:28

n-canter reviewed Dec 4, 2023


phbnf self-assigned this Dec 5, 2023

roger2hk requested review from roger2hk and phbnf December 5, 2023 12:15

roger2hk assigned mhutchinson Dec 5, 2023

AlCutter reviewed Dec 5, 2023


mhutchinson mentioned this pull request Dec 6, 2023

Inlined storage/sql.go into both implementations that use it #3235

Merged

mhutchinson force-pushed the noSubtreeRev branch from 02e48ff to c5bff06 Compare December 6, 2023 17:15

phbnf reviewed Dec 7, 2023


mhutchinson commented Dec 7, 2023


mhutchinson force-pushed the noSubtreeRev branch from 260ccff to 6b26d4e Compare December 7, 2023 14:35

mhutchinson force-pushed the noSubtreeRev branch from 6b26d4e to a3f3528 Compare December 7, 2023 14:43

mhutchinson mentioned this pull request Dec 11, 2023

Make uninitializedBegin test accurately test its intention #3244

Merged

mhutchinson force-pushed the noSubtreeRev branch from a3f3528 to 0e02c59 Compare December 11, 2023 12:48

mhutchinson added 2 commits December 11, 2023 15:58

mhutchinson force-pushed the noSubtreeRev branch from 0e02c59 to 13d5d31 Compare December 11, 2023 16:07

Added comment in UpdateTree warning future maintainers of a potential…

8027a3d

… pitfall

roger2hk approved these changes Dec 11, 2023


phbnf approved these changes Dec 12, 2023


phbnf removed their assignment Dec 12, 2023

Updated CHANGELOG

16ac323

It's a big one

AlCutter reviewed Dec 12, 2023


mhutchinson changed the title ~~Support for skipping subtree revisions to increase read performance and disk usage~~ Support for skipping subtree revisions to increase read performance and reduce disk usage Dec 12, 2023

clarify reduced disk usage

1ca893f

AlCutter approved these changes Dec 12, 2023


mhutchinson merged commit d95458c into google:master Dec 12, 2023
10 checks passed

mhutchinson deleted the noSubtreeRev branch December 12, 2023 16:35

mhutchinson mentioned this pull request Jan 4, 2024

TX rollback error: sql: transaction has already been committed or rolled back #2998

Open
