Spark: Use snapshot schema when reading snapshot #3722
Conversation
Force-pushed from bec6bd5 to 31399e6.
@rdblue @jackye1995 please take a look. This incorporates #3269 (updated). It would be nice if this could make it into 0.13, as using the snapshot schema is already implemented in the Spark 2.4 support.
Yes, agreed. I think we need to include this in 0.13 for a consistent experience between 3.x and 2.4. Please let me know if anyone is against it.
Skimmed through this; it mostly looks good to me, as most of the content was already reviewed in the original PR. I will take another, deeper look in the afternoon. FYI, I am also adding time travel support in Trino (trinodb/trino#10258) and will open another PR there to match this behavior.
@aokolnychyi can you take a look too, in case it conflicts with any changes you're making?
```java
if (requestedSchema != null) {
  // convert the requested schema to throw an exception if any requested fields are unknown
  SparkSchemaUtil.convert(icebergTable.schema(), requestedSchema);
}
```
I pointed this out in #1508 and I'll point it out again here:

I removed `requestedSchema` from `SparkTable` because with #1783, the Spark 3 `IcebergSource` changed to be a `SupportsCatalogOptions`, not just a `TableProvider`. Since `DataFrameReader` does not support specifying a schema when reading from an `IcebergSource`:

```scala
DataSource.lookupDataSourceV2(source, sparkSession.sessionState.conf).map { provider =>
  ...
  val (table, catalog, ident) = provider match {
    case _: SupportsCatalogOptions if userSpecifiedSchema.nonEmpty =>
      throw new IllegalArgumentException(
        s"$source does not support user specified schema. Please don't specify the schema.")
```

(see https://github.com/apache/spark/blob/v3.2.0/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L220-L223)

there is no reason to have a `requestedSchema` field, as we cannot make use of it.
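For illustration, a minimal sketch of the restriction described above; the table identifier and schema are placeholders, and the quoted error message comes from the Spark code linked above:

```java
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UserSchemaRejectedExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[2]").getOrCreate();

    StructType userSchema = new StructType()
        .add("id", DataTypes.LongType)
        .add("data", DataTypes.StringType);

    // Because IcebergSource is a SupportsCatalogOptions provider, DataFrameReader rejects a
    // user-specified schema with an IllegalArgumentException along the lines of
    // "... does not support user specified schema. Please don't specify the schema."
    spark.read()
        .schema(userSchema)   // the user-specified schema that Spark disallows for this source
        .format("iceberg")
        .load("db.table");    // placeholder table identifier
  }
}
```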
This has been implemented for Spark 2. For Spark 3, Ryan Blue proposed a syntax for adding the snapshot id or timestamp to the table identifier in apache#3269. Here we implement the Spark 3 support for using the snapshot schema by using the proposed table identifier syntax, until a new version of Spark 3 is released with support for `AS OF` in Spark SQL. Note: the table identifier syntax is for internal use only (as in this implementation) and is not meant to be exposed as a publicly supported syntax in SQL. However, for testing, we do test its use from SQL.
Add a `Schema` parameter to the `SparkScanBuilder` constructor, so that we can pass the snapshot schema in when constructing it. In `SparkTable#newScanBuilder`, construct `SparkScanBuilder` with the snapshot schema.
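For illustration, a rough sketch of the shape of this change; the field names and the other constructor parameters are assumptions for the sketch, not the actual Iceberg signatures:

```java
import org.apache.iceberg.Schema;
import org.apache.iceberg.Table;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Sketch only: the real SparkScanBuilder implements Spark's ScanBuilder mixins and holds more state.
class SparkScanBuilderSketch {
  private final SparkSession spark;
  private final Table table;
  private final Schema schema;  // schema to read with: the snapshot schema for time travel reads
  private final CaseInsensitiveStringMap options;

  SparkScanBuilderSketch(SparkSession spark, Table table, Schema schema, CaseInsensitiveStringMap options) {
    this.spark = spark;
    this.table = table;
    this.schema = schema;       // previously the builder always used the current table schema
    this.options = options;
  }
}
```

In `SparkTable#newScanBuilder`, the snapshot schema (when a snapshot is selected) is then passed to this constructor in place of the current table schema.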
Force-pushed from ba8c820 to 7ba1eab.
Overall looks good to me; I don't have any further comments around the time travel logic. I think we are missing a few failure test cases; could you add those?
```java
String value = options.get(property);
if (value != null) {
  return Long.parseLong(value);
}
```
nit: newline after `if`
Will add a blank line.
Added a blank line.

What is the rationale for always adding a blank line after an `if`? I fail to see how this makes the code more readable. I can understand breaking a large block of code up with blank lines in general, but this is a very short method.
Yes, agreed. I think it's mostly just general code style rules the community follows; maybe we should just put these into Checkstyle instead of being human linters.
```
@@ -120,4 +120,42 @@ public void testMetadataTables() {
        ImmutableList.of(row(ANY, ANY, null, "append", ANY, ANY)),
        sql("SELECT * FROM %s.snapshots", tableName));
  }

  @Test
```
I think we are missing a few failure test cases:
- Cannot specify both snapshot-id and as-of-timestamp
- Cannot write from table at a specific snapshot
- Cannot delete from table at a specific snapshot
Ack. I agree that it'd be good to have such test cases. I'd point out though that none of the above should be supported even before this change, so if the test cases don't exist, they are existing holes.
Added test cases for reading with both snapshot-id and as-of-timestamp, writing to a table at a specific snapshot, and deleting from a table at a specific snapshot.
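For illustration, a rough sketch of what the first of these failure tests could look like when written directly against the DataFrame API; the option values, table name, and exception type are assumptions, and Iceberg's actual tests are written against its own test base classes:

```java
import org.apache.spark.sql.SparkSession;
import org.junit.Assert;
import org.junit.Test;

public class TestTimeTravelFailureCasesSketch {
  private final SparkSession spark = SparkSession.builder().master("local[2]").getOrCreate();

  @Test
  public void testSnapshotIdAndTimestampAreMutuallyExclusive() {
    try {
      spark.read()
          .format("iceberg")
          .option("snapshot-id", 1234567890L)         // placeholder snapshot id
          .option("as-of-timestamp", 1639000000000L)  // placeholder timestamp in millis
          .load("db.table")                           // placeholder table identifier
          .collectAsList();
      Assert.fail("Expected a failure when both snapshot-id and as-of-timestamp are set");
    } catch (IllegalArgumentException e) {
      // expected: the two time travel options are mutually exclusive
    }
  }
}
```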
These look good to me. Thanks for adding them!
```java
// or timestamp selector, then SparkTable will be constructed with a non-null snapshotId, but
// SparkTable#newScanBuilder will be called without the "snapshot-id" or "as-of-timestamp" option.
// We therefore add a "snapshot-id" option here in this latter case.
CaseInsensitiveStringMap scanOptions =
```
I'm not sure this is worth the complexity. Why not just always add the snapshot ID if `snapshotId` is set? We know that if it is set, the option `snapshot-id` or `as-of-timestamp` should correspond to it. We should just make sure that the given snapshot ID is set in the options and remove `as-of-timestamp` if it is set. That makes this whole block simpler:

```java
CaseInsensitiveStringMap scanOptions = snapshotId != null ? addSnapshotId(options, snapshotId) : options;
```
With the update to `addSnapshotId` below, this worked fine with tests:

```java
CaseInsensitiveStringMap scanOptions = addSnapshotId(options, snapshotId);
```

It didn't need the null check because that's done inside `addSnapshotId`.
Let me try it out.

I had run into a problem with the original `addSnapshotId` function and always calling it. After I analysed what was happening, I wrote that comment to remind myself. I therefore called `addSnapshotId` only when strictly necessary.
OK, I see that you changed

```java
Preconditions.checkArgument(snapshotIdFromOptions == null,
    "Cannot override snapshot ID more than once: %s", snapshotIdFromOptions);
```

to

```java
Preconditions.checkArgument(snapshotIdFromOptions == null || snapshotId.toString().equals(snapshotIdFromOptions),
    "Cannot override snapshot ID more than once: %s", snapshotIdFromOptions);
```

in `addSnapshotId`. With the old version, you should only call `addSnapshotId` if the `options` did not already have `snapshot-id` or `as-of-timestamp`.
```java
Map<String, String> scanOptions = Maps.newHashMap();
scanOptions.putAll(options.asCaseSensitiveMap());
scanOptions.put(SparkReadOptions.SNAPSHOT_ID, String.valueOf(snapshotId));
```
This should also remove `as-of-timestamp` since `snapshot-id` is being set. I think I missed that in my PR.
I updated this to the following and tests work fine:
```java
private static CaseInsensitiveStringMap addSnapshotId(CaseInsensitiveStringMap options, Long snapshotId) {
  if (snapshotId != null) {
    String snapshotIdFromOptions = options.get(SparkReadOptions.SNAPSHOT_ID);
    Preconditions.checkArgument(snapshotIdFromOptions == null || snapshotId.toString().equals(snapshotIdFromOptions),
        "Cannot override snapshot ID more than once: %s", snapshotIdFromOptions);
    Map<String, String> scanOptions = Maps.newHashMap();
    scanOptions.putAll(options.asCaseSensitiveMap());
    scanOptions.put(SparkReadOptions.SNAPSHOT_ID, String.valueOf(snapshotId));
    scanOptions.remove(SparkReadOptions.AS_OF_TIMESTAMP);
    return new CaseInsensitiveStringMap(scanOptions);
  }
  return options;
}
```
Thanks, that makes sense.
```java
Assert.assertEquals("Records should match", originalRecords,
    resultDf.orderBy("id").collectAsList());

Snapshot snapshot1 = table.currentSnapshot();
```
A better name would be `beforeAddColumn`. In general, I think adding numbers to a generic name is not a good practice for readable tests.
"spark_catalog".equals(catalogName)); | ||
|
||
// get a timestamp just after the last write and get the current row set as expected | ||
long timestamp = validationCatalog.loadTable(tableIdent).currentSnapshot().timestampMillis() + 2; |
Can you use `waitUntilAfter` defined in `SparkTestBase` to avoid flaky tests?
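For illustration, a sketch of the suggested pattern, assuming `waitUntilAfter(long)` blocks until the wall clock has passed the given millisecond timestamp (the `validationCatalog` and `tableIdent` helpers are the ones used in the snippet above):

```java
// Wait until the clock has actually moved past the last write instead of adding a fixed offset.
long lastSnapshotTime = validationCatalog.loadTable(tableIdent).currentSnapshot().timestampMillis();
waitUntilAfter(lastSnapshotTime);
long timestamp = System.currentTimeMillis();  // now guaranteed to be after the last write
```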
I added a `waitUntilAfter`.
Overall, this is ready to go in. My only real concern is over timestamps in testing without using `waitUntilAfter`. I think we can also simplify handling in `newScanBuilder`, but that's minor and I think that the current logic is correct.
Thanks, @wypoon! This looks great. I think we can get it in with a couple of minor changes.
Minor tweaks to tests.
@rdblue thanks for all the reviews. I adopted your suggestion around …
@rdblue @jackye1995 if this can be merged, I'll prepare PRs for Spark 3.1 and 3.0 for porting it. I'll be on vacation for the next two weeks.
Thanks, @wypoon! Nice work.
This has been implemented for Spark 2 in #1508. For Spark 3, Ryan Blue proposed a syntax for adding the snapshot id or timestamp to the table identifier in #3269. Here we implement the Spark 3 support for using the snapshot schema by using the proposed table identifier syntax. This is until a new version of Spark 3 is released with support for `AS OF` in Spark SQL.

Note: The table identifier syntax is for internal use only (as in this implementation) and not meant to be exposed as a publicly supported syntax in SQL. However, for testing, we do test its use from SQL.
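As a usage illustration, here is a sketch of a time travel read through the read options exercised in this PR; the snapshot id and table identifier are placeholders. With this change, the returned DataFrame uses the table schema as of that snapshot rather than the current schema:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TimeTravelReadExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().master("local[2]").getOrCreate();

    // Read the table as of a specific snapshot; "as-of-timestamp" (millis) works analogously.
    Dataset<Row> df = spark.read()
        .format("iceberg")
        .option("snapshot-id", 1234567890L)  // placeholder snapshot id
        .load("db.table");                   // placeholder table identifier

    // The schema reflects the table schema at that snapshot, not the current schema.
    df.printSchema();
  }
}
```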