
feat: expose Iceberg features to python users #5590

Merged: 13 commits into deephaven:main on Jun 24, 2024

Conversation

@lbooker42 (Contributor) commented Jun 7, 2024

Exposes Iceberg table support and catalog adapter creation through Python.

Will close #5574

Example Usage:

Local (MinIO + REST Catalog):

from deephaven.experimental import s3, iceberg

# Create a catalog adapter for a local REST catalog with MinIO (S3-compatible) storage.
local_adapter = iceberg.adapter_s3_rest(
        name="minio-iceberg",
        catalog_uri="http://rest:8181",
        warehouse_location="s3a://warehouse/wh",
        region_name="us-east-1",
        access_key_id="admin",
        secret_access_key="password",
        end_point_override="http://minio:9000")

# Query catalog metadata: namespaces, tables in the "sales" namespace, and snapshots of "sales.sales_multi".
t_ns = local_adapter.namespaces()
t_tables = local_adapter.tables("sales")
t_snapshots = local_adapter.snapshots("sales.sales_multi")

#################################################

# S3 instructions describing how to reach the Iceberg data files.
s3_instructions = s3.S3Instructions(
        region_name="us-east-1",
        access_key_id="admin",
        secret_access_key="password",
        endpoint_override="http://minio:9000"
        )

# Iceberg read instructions that carry the S3 data instructions.
iceberg_instructions = iceberg.IcebergInstructions(data_instructions=s3_instructions)

# Read Iceberg tables into Deephaven tables.
data_table = local_adapter.read_table(table_identifier="sample.all_types", instructions=iceberg_instructions)

sales_table = local_adapter.read_table(table_identifier="sales.sales_single", instructions=iceberg_instructions)

sales_restricted = sales_table.select(["Region", "Item_Type", "Unit_Price", "Order_Date"])

sales_pt = local_adapter.read_table(table_identifier="sales.sales_partitioned", instructions=iceberg_instructions)

#################################################

# Rename Iceberg columns on read using column_renames.
custom_instructions = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        column_renames={
                "Region":"Area",
                "Item_Type":"Category"
        })

sales_custom = local_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)

#################################################

from deephaven import dtypes        

# Provide both column renames and an explicit table definition for the resulting columns.
custom_instructions = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        column_renames={
                "Region":"Area",
                "Item_Type":"Category"
        }, table_definition={
                "Area": dtypes.string,
                "Category": dtypes.string,
                "Unit_Price": dtypes.double
        })

sales_custom_td = local_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)
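
#################################################

A hedged sketch (not from the PR description): the table definition can also be supplied as a list of Column objects instead of a dict, as in the test snippet quoted in the review threads below. The import location, column names, and the iceberg_instructions_td / sales_td names are illustrative.

from deephaven.column import Column  # assumed import location

table_def = [
        Column("Area", dtypes.string),
        Column("Category", dtypes.string),
        Column("Unit_Price", dtypes.double),
]

iceberg_instructions_td = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        table_definition=table_def)

sales_td = local_adapter.read_table(table_identifier="sales.sales_single", instructions=iceberg_instructions_td)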

AWS Glue:

NOTE: the region and credentials are specified locally in the ~/.aws/config and ~/.aws/credentials files.
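
For reference, a sketch of the standard AWS shared config/credentials file format (not content from this PR; values are placeholders):

~/.aws/config:

[default]
region = us-east-1

~/.aws/credentials:

[default]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>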

from deephaven.experimental import s3, iceberg

# Create a catalog adapter backed by AWS Glue; the region and credentials come from the local AWS config files.
cloud_adapter = iceberg.adapter_aws_glue(
        name="aws-iceberg",
        catalog_uri="s3://lab-warehouse/sales",
        warehouse_location="s3://lab-warehouse/sales")

t_ns = cloud_adapter.namespaces()
t_tables = cloud_adapter.tables("sales")
t_snapshots = cloud_adapter.snapshots("sales.sales_single")

#################################################

# No explicit data instructions are needed here; credentials are resolved from the AWS config files.
sales_table = cloud_adapter.read_table(table_identifier="sales.sales_single")

#################################################

custom_instructions = iceberg.IcebergInstructions(
        column_renames={
                "region":"Area",
                "item_type":"Category"
        })

sales_custom = cloud_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)

#################################################

from deephaven import dtypes        

custom_instructions = iceberg.IcebergInstructions(
        column_renames={
                "region":"Area",
                "item_type":"Category",
                "unit_price":"Price"
        }, table_definition={
                "Area": dtypes.string,
                "Category": dtypes.string,
                "Price": dtypes.double
        })

sales_custom_td = cloud_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)

Resolved review threads on: py/server/deephaven/experimental/iceberg.py, py/server/deephaven/jcompat.py, py/server/tests/test_iceberg.py

Review thread on the following code context:
Column("z", dtypes.double),
]

iceberg_instructions = iceberg.IcebergInstructions(table_definition=table_def)
Member:

Missing tests:

  • IcebergCatalogAdapter
  • s3_rest_adapter
  • aws_glue_adapter

Contributor Author:

These objects aren't instantiable without a working S3 or AWS backend. I have tested them manually, but I'm not sure how, or whether, we can test them in CI.

Contributor Author:

IcebergCatalogAdapter is tested as a Java object in CI (IcebergToolsTest.java)

Member:

  1. Testing the Java is not the same as making sure the Python actually works. Python tests don't need the same level of detail, but they should confirm that the code runs.
  2. In Python, I think we test against Kafka, SQL, etc. Is there a reason we shouldn't actually be testing against some S3 here? Is it significantly harder to set up than these other cases?
  3. There is no way to run the tests in a repeatable way when S3 is available if the tests don't exist.
  4. I have low confidence that this (or any other code) will keep working without some kind of test.

Contributor Author:

After discussing with @devinrsmith and @malhotrashivam, it is significantly harder to test the Python wrappers directly. But these wrappers are very light wrappers over better-tested Java objects.

Member:

Regardless, the Python isn't tested.

Member:

We're quite confident that we don't want to build additional dockerized integration tests. They tend to be flaky, and the work involved greatly exceeds the scope of this ticket. We'll file a ticket to expand testcontainers-based testing for Glue, but per @devinrsmith it appears to be a paid feature, so we may not want to add it.

I think manual testing is enough for such a thin wrapper, and I don't intend to sponsor docker-based automation at this time.

Contributor Author:

Created the following ticket to address missing CI testing: #5656

@lbooker42 requested a review from chipkent on June 13, 2024 at 16:33
@rcaudy (Member) left a comment:

Java-only light review.

@rcaudy (Member) left a comment:

Content with the Java changes.

@rcaudy (Member) left a comment:

.

@devinrsmith (Member):

Note that this PR fixes #5642

@devinrsmith linked an issue on Jun 20, 2024 that may be closed by this pull request
@rcaudy added the "release blocker" label (a bug/behavior that puts us below the "good enough" threshold to release) on Jun 20, 2024
@lbooker42 merged commit a13a0dc into deephaven:main on Jun 24, 2024
15 checks passed
The github-actions bot locked and limited conversation to collaborators on Jun 24, 2024
@deephaven-internal (Contributor):

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#241

@lbooker42 deleted the lab-iceberg-py branch on June 26, 2024 at 19:56
Labels: DocumentationNeeded, python, python-server-side, query engine, release blocker (a bug/behavior that puts us below the "good enough" threshold to release), ReleaseNotesNeeded (release notes are needed)

Projects: None yet

Development: Successfully merging this pull request may close these issues:

  • org.slf4j:slf4j-api 1.x on classpath breaks logging
  • Add python wrapper for Iceberg tables

6 participants