Fix array schema when time traveling #4549

davisp · 2023-11-29T20:05:15Z

We were using the latest array schema even while traveling. This changes the latest array schema selection when in ArrayDirectoryMode::READ to use the newest array schema that is less than or equal to the current end timestamp. If no array schemas are newer than the timestamp end value, the earliest schema is used.

TYPE: BUG
DESC: Fix array schema selection when time traveling

shortcut-integration · 2023-11-29T20:05:18Z

This pull request has been linked to Shortcut Story #35424: Schema evolution: time traveling should pick the correct schema..

davisp · 2023-11-29T20:08:24Z

@Shelnutt2 from left field!

eric-hughes-tiledb

The existing code solves the basic problem. It doesn't seem to account for all possible cases, though.

eric-hughes-tiledb · 2023-11-29T21:11:42Z

tiledb/sm/array/array_directory.cc

+      for (auto& uri : array_schema_uris_) {
+        auto name = uri.remove_trailing_slash().last_path_part();
+
+        // Skip the old schema URI name since it doesn't have timestamps
+        if (name == constants::array_schema_filename) {
+          continue;
+        }
+
+        std::pair<uint64_t, uint64_t> ts_range;
+        throw_if_not_ok(utils::parse::get_timestamp_range(uri, &ts_range));
+
+        if (ts_range.second > latest_ts && ts_range.second <= timestamp_end_) {
+          latest_uri = uri;
+          latest_ts = ts_range.second;
+        }
+      }


Could you extract this loop to a function to somewhere in storage_format/? It is, after all, parsing URIs, which is something we want to centralize there regardless.

This particular one caught my eye for another reason. We're going to be adding submillisecond sequence numbers to our basic timestamps. While it won't directly impinge on this particular code, we're going to want to make everything that uses timestamps generic to avoid bleeding internals of the new abstraction into every use of it.

No need to do more than extract it. We can rewrite it later, and if it's already moved, we won't have to do so in order to proceed with the work.

Could you extract this loop to a function to somewhere in storage_format/? It is, after all, parsing URIs, which is something we want to centralize there regardless.

Not sure wha you mean here. I'm obviously calling a function to parse URIs, but I don't think that means this code is doing the parsing or belongs in storage_format. I had contemplated putting this logic in ArrayDirectory::latest_array_schema_uri() but I haven't got any clue how hot that function is. I could go back and make the latest_array_schema_uri_ and optional and then only invoke this logic when that member is nullopt. However, given the logic happens exactly once I wasn't particularly concerned in just putting the logic in the one place its used given I don't see where else we'd apply it.

For the sub-millisecond sequences, I was under the impression that we were going to encode that into the UUID? Is there a design doc on the new approach?

No need to do more than extract it. We can rewrite it later, and if it's already moved, we won't have to do so in order to proceed with the work.

Why would we need to rewrite this? Because of the sequence numbers?

sub-millisecond sequences, I was under the impression that we were going to encode that into the UUID

Yes, but that's not where the story ends. It's where it begins. What we're effectively doing is increasing our temporal resolution by not using just a clock. It's essentially a new temporal type that's going to end up everywhere in the code.

Why would we need to rewrite this? Because of the sequence numbers?

Exactly this.

The larger issue is that storage_format involves any of (1) a file format, (2) a file name, or (3) directory contents.

eric-hughes-tiledb · 2023-11-29T21:16:54Z

tiledb/sm/array/array_directory.cc

+        // Skip the old schema URI name since it doesn't have timestamps
+        if (name == constants::array_schema_filename) {
+          continue;
+        }


I believe there's a defect here that if there's a timestamp requested that precedes any of the files that includes times, it won't find anything. It should probably find this one, but it's unambiguously discarded.

Similarly, I find it odd that this code doesn't throw anywhere. If a timestamp is requested that precedes all available schema, it should fail.

I believe there's a defect here that if there's a timestamp requested that precedes any of the files that includes times, it won't find anything. It should probably find this one, but it's unambiguously discarded.

I'm guessing you missed this line?

https://github.com/TileDB-Inc/TileDB/pull/4549/files#diff-c2be2932fddd2739bad67ac6b693a91fe96d060d291998be7082e78057fc850dR422

Similarly, I find it odd that this code doesn't throw anywhere. If a timestamp is requested that precedes all available schema, it should fail.

Yeah, I started off thinking that as well but then I got to considering all of the different edge cases around our requirements for reading arrays that have a __array_schema.tdb and I realized we can't blindly fail here since those arrays have an un-timestamped schema available.

The other thing to keep in mind is that we can time travel before the first schema was written and write fragments. This is done a lot in the test suite, but more importantly, its possible users have done this. So changing the array schema latest could theoretically break some users existing arrays if they're doing fun things like that.

we can time travel before the first schema was written and write fragments

It's a defect that we allow this. Such data is fundamentally ambiguous, because it could have been written with any schema. Rather than double down on supporting defective behavior, I'd prefer that we fail to read such data with a "contact TileDB" message. We might be able to infer what the data means from the schema history, but that's a manual process.

@eric-hughes-tiledb this is out of scope for 2.19 and we need this PR in for the release per discussion with Isaiah. If we do what you suggest we are going to end up with a lot of support calls from customers as we have a lot of arrays out there that do this. Let’s file a follow up story for this an take more time to figure out how we can correct this moving forward.

We were using the latest array schema even while traveling. This changes the latest array schema selection when in ArrayDirectoryMode::READ to use the newest array schema that is less than or equal to the current end timestamp. If no array schemas are newer than the timestamp end value, the earliest schema is used.

We were using the latest array schema even while traveling. This changes the latest array schema selection when in ArrayDirectoryMode::READ to use the newest array schema that is less than or equal to the current end timestamp. If no array schemas are newer than the timestamp end value, the earliest schema is used. --- TYPE: BUG DESC: Fix array schema selection when time traveling

PR #4549 attempted to fix the latest array schema selection. Unfortunately, it didn't take into account that some array schemas might have the same timestamps. This change uses the fact that array schema URIs are lexicographically sorted (names generated in the same timestamps use extra sorting bits in the UUID) to chose the latest array URI whilst keeping the original intent of #4549 and checking against the end timestamp. --- TYPE: BUG DESC: Fix array latest schema selection for same MS timestamps schemas.

) PR #4549 attempted to fix the latest array schema selection. Unfortunately, it didn't take into account that some array schemas might have the same timestamps. This change uses the fact that array schema URIs are lexicographically sorted (names generated in the same timestamps use extra sorting bits in the UUID) to chose the latest array URI whilst keeping the original intent of #4549 and checking against the end timestamp. [sc-49806] --- TYPE: BUG DESC: Fix array latest schema selection for same MS timestamps schemas.

davisp requested review from ihnorton and KiterLuc and removed request for ihnorton November 29, 2023 20:05

Shelnutt2 approved these changes Nov 29, 2023

View reviewed changes

davisp force-pushed the pd/sc-35424/fix-schema-for-time-traveling branch from a8ca015 to 7566609 Compare November 29, 2023 21:02

eric-hughes-tiledb reviewed Nov 29, 2023

View reviewed changes

davisp force-pushed the pd/sc-35424/fix-schema-for-time-traveling branch from 7566609 to 127493f Compare November 30, 2023 18:56

davisp force-pushed the pd/sc-35424/fix-schema-for-time-traveling branch from 127493f to 8e54edd Compare December 1, 2023 21:07

KiterLuc merged commit 6042a36 into dev Dec 6, 2023
61 checks passed

KiterLuc deleted the pd/sc-35424/fix-schema-for-time-traveling branch December 6, 2023 00:40

teo-tsirpanis mentioned this pull request Dec 9, 2023

Fix schema evolution delete test. TileDB-Inc/TileDB-CSharp#338

Merged

KiterLuc mentioned this pull request Jul 1, 2024

Fix array latest schema selection for same MS timestamps schemas. #5143

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix array schema when time traveling #4549

Fix array schema when time traveling #4549

davisp commented Nov 29, 2023

shortcut-integration bot commented Nov 29, 2023

davisp commented Nov 29, 2023

eric-hughes-tiledb left a comment

eric-hughes-tiledb Nov 29, 2023

davisp Nov 29, 2023

eric-hughes-tiledb Nov 30, 2023

eric-hughes-tiledb Nov 30, 2023

eric-hughes-tiledb Nov 29, 2023

davisp Nov 29, 2023

eric-hughes-tiledb Nov 30, 2023

KiterLuc Dec 1, 2023

Fix array schema when time traveling #4549

Fix array schema when time traveling #4549

Conversation

davisp commented Nov 29, 2023

shortcut-integration bot commented Nov 29, 2023

davisp commented Nov 29, 2023

eric-hughes-tiledb left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment