
feat: expose Iceberg features to python users #5590

Merged: 13 commits into deephaven:main on Jun 24, 2024

Conversation

@lbooker42 (Contributor) commented Jun 7, 2024

Exposes Iceberg table support and catalog adapter creation through Python.

Will close #5574

Example Usage:

Local (MinIO + REST Catalog):

from deephaven.experimental import s3, iceberg

# Create a catalog adapter for a local REST catalog with MinIO (S3-compatible) storage.
local_adapter = iceberg.adapter_s3_rest(
        name="minio-iceberg",
        catalog_uri="http://rest:8181",
        warehouse_location="s3a://warehouse/wh",
        region_name="us-east-1",
        access_key_id="admin",
        secret_access_key="password",
        end_point_override="http://minio:9000")

# Query catalog metadata: namespaces, tables in the "sales" namespace, and snapshots of "sales.sales_multi".
t_ns = local_adapter.namespaces()
t_tables = local_adapter.tables("sales")
t_snapshots = local_adapter.snapshots("sales.sales_multi")

#################################################

# S3 instructions describing how to reach the Iceberg data files.
s3_instructions = s3.S3Instructions(
        region_name="us-east-1",
        access_key_id="admin",
        secret_access_key="password",
        endpoint_override="http://minio:9000"
        )

# Iceberg read instructions that carry the S3 data instructions.
iceberg_instructions = iceberg.IcebergInstructions(data_instructions=s3_instructions)

# Read Iceberg tables into Deephaven tables.
data_table = local_adapter.read_table(table_identifier="sample.all_types", instructions=iceberg_instructions)

sales_table = local_adapter.read_table(table_identifier="sales.sales_single", instructions=iceberg_instructions)

sales_restricted = sales_table.select(["Region", "Item_Type", "Unit_Price", "Order_Date"])

sales_pt = local_adapter.read_table(table_identifier="sales.sales_partitioned", instructions=iceberg_instructions)

#################################################

# Rename Iceberg columns on read using column_renames.
custom_instructions = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        column_renames={
                "Region":"Area",
                "Item_Type":"Category"
        })

sales_custom = local_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)

#################################################

from deephaven import dtypes        

# Provide both column renames and an explicit table definition for the resulting columns.
custom_instructions = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        column_renames={
                "Region":"Area",
                "Item_Type":"Category"
        }, table_definition={
                "Area": dtypes.string,
                "Category": dtypes.string,
                "Unit_Price": dtypes.double
        })

sales_custom_td = local_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)
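
#################################################

A hedged sketch (not from the PR description): the table definition can also be supplied as a list of Column objects instead of a dict, as in the test snippet quoted in the review threads below. The import location, column names, and the iceberg_instructions_td / sales_td names are illustrative.

from deephaven.column import Column  # assumed import location

table_def = [
        Column("Area", dtypes.string),
        Column("Category", dtypes.string),
        Column("Unit_Price", dtypes.double),
]

iceberg_instructions_td = iceberg.IcebergInstructions(
        data_instructions=s3_instructions,
        table_definition=table_def)

sales_td = local_adapter.read_table(table_identifier="sales.sales_single", instructions=iceberg_instructions_td)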

AWS Glue:

NOTE: the region and credentials are specified locally in the ~/.aws/config and ~/.aws/credentials files.
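
For reference, a sketch of the standard AWS shared config/credentials file format (not content from this PR; values are placeholders):

~/.aws/config:

[default]
region = us-east-1

~/.aws/credentials:

[default]
aws_access_key_id = <access-key-id>
aws_secret_access_key = <secret-access-key>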

from deephaven.experimental import s3, iceberg

# Create a catalog adapter backed by AWS Glue; the region and credentials come from the local AWS config files.
cloud_adapter = iceberg.adapter_aws_glue(
        name="aws-iceberg",
        catalog_uri="s3://lab-warehouse/sales",
        warehouse_location="s3://lab-warehouse/sales")

t_ns = cloud_adapter.namespaces()
t_tables = cloud_adapter.tables("sales")
t_snapshots = cloud_adapter.snapshots("sales.sales_single")

#################################################

# No explicit data instructions are needed here; credentials are resolved from the AWS config files.
sales_table = cloud_adapter.read_table(table_identifier="sales.sales_single")

#################################################

custom_instructions = iceberg.IcebergInstructions(
        column_renames={
                "region":"Area",
                "item_type":"Category"
        })

sales_custom = cloud_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)

#################################################

from deephaven import dtypes        

custom_instructions = iceberg.IcebergInstructions(
        column_renames={
                "region":"Area",
                "item_type":"Category",
                "unit_price":"Price"
        }, table_definition={
                "Area": dtypes.string,
                "Category": dtypes.string,
                "Price": dtypes.double
        })

sales_custom_td = cloud_adapter.read_table(table_identifier="sales.sales_single", instructions=custom_instructions)

Resolved review threads on: py/server/deephaven/experimental/iceberg.py, py/server/deephaven/jcompat.py, py/server/tests/test_iceberg.py

Review thread on the following code context:
Column("z", dtypes.double),
]

iceberg_instructions = iceberg.IcebergInstructions(table_definition=table_def)
Member:

Missing tests:

  • IcebergCatalogAdapter
  • s3_rest_adapter
  • aws_glue_adapter

Contributor Author:

These objects aren't instantiable without a working S3 or AWS backend. I have tested them manually, but I'm not sure how, or whether, we can test them in CI.

Contributor Author:

IcebergCatalogAdapter is tested as a Java object in CI (IcebergToolsTest.java)

Member:

  1. Testing the Java is not the same as making sure the Python actually works. Python tests don't need the same level of detail, but they should confirm that the code runs.
  2. In Python, I think we test against Kafka, SQL, etc. Is there a reason we shouldn't actually be testing against some S3 here? Is it significantly harder to set up than these other cases?
  3. There is no way to run the tests in a repeatable way when S3 is available if the tests don't exist.
  4. I have low confidence that this (or any other code) will keep working without some kind of test.

Contributor Author:

After discussing with @devinrsmith and @malhotrashivam, it is significantly harder to test the Python wrappers directly. But these wrappers are very light wrappers over better-tested Java objects.

Member:

Regardless, the Python isn't tested.

Member:

We're quite confident that we don't want to build additional dockerized integration tests. They tend to be flaky, and the work involved greatly exceeds the scope of this ticket. We'll file a ticket to expand testcontainers-based testing for Glue, but per @devinrsmith it appears to be a paid feature, so we may not want to add it.

I think manual testing is enough for such a thin wrapper, and I don't intend to sponsor docker-based automation at this time.

Contributor Author:

Created the following ticket to address missing CI testing: #5656

@lbooker42 requested a review from chipkent on June 13, 2024 at 16:33
@rcaudy (Member) left a comment:

Java-only light review.

@rcaudy (Member) left a comment:

Content with the Java changes.

@rcaudy (Member) left a comment:

.

@devinrsmith (Member):

Note that this PR fixes #5642

@devinrsmith linked an issue on Jun 20, 2024 that may be closed by this pull request
@rcaudy added the "release blocker" label (a bug/behavior that puts us below the "good enough" threshold to release) on Jun 20, 2024
@lbooker42 merged commit a13a0dc into deephaven:main on Jun 24, 2024
15 checks passed
The github-actions bot locked and limited conversation to collaborators on Jun 24, 2024
@deephaven-internal (Contributor):

Labels indicate documentation is required. Issues for documentation have been opened:

Community: deephaven/deephaven-docs-community#241

@lbooker42 deleted the lab-iceberg-py branch on June 26, 2024 at 19:56
Labels: DocumentationNeeded, python, python-server-side, query engine, release blocker (a bug/behavior that puts us below the "good enough" threshold to release), ReleaseNotesNeeded (release notes are needed)

Projects: None yet

Development: Successfully merging this pull request may close these issues:

  • org.slf4j:slf4j-api 1.x on classpath breaks logging
  • Add python wrapper for Iceberg tables

6 participants