Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add flags for Iceberg and Lake Formation and Security Lake as a data source type. #2858

Merged
merged 2 commits into from
Aug 13, 2024

Conversation

asuresh8
Copy link
Contributor

Description

Previously, Iceberg catalog was set as the default catalog. This poses problems as the behavior to fall back to default Spark catalog is only correct in some versions of Iceberg. Rather than always opt into Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR Serverless job, but this did not work as expected. Setting Lake Formation enabled for the entire EMR Serverless job result in EMR Serverless system space being used. This system space does not work with Flint. The full limitations are documented in https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html. Instead Lake Formation can be enabled using the Iceberg catalog implementation.

Testing

Built and deployed to cluster and then queried Iceberg tables to verify functionality still works as well as testing the new functionality.

Setup

Created 2 data sources:

curl \
--request POST \
--url http://localhost:9200/_plugins/_query/_datasources \
--header 'content-type: application/x-ndjson' \
--data '{"name": "gdc2","description": "","connector": "S3GLUE","allowedRoles": [],
"properties": {"glue.auth.type": "iam_role","glue.auth.role_arn": "arn:aws:iam::476834799096:role/DirectQueryWithPublicLakeFormation","glue.indexstore.opensearch.uri": "http://ip-10-1-41-51.us-west-2.compute.internal:9200","glue.indexstore.
opensearch.auth": "noauth", "glue.iceberg.enabled": "true"}}'
curl \
--request POST \
--url http://localhost:9200/_plugins/_query/_datasources \
--header 'content-type: application/x-ndjson' \
--data '{"name": "gdc3","description": "","connector": "S3GLUE","allowedRoles": [],"properties": {"glue.auth.type": "iam_role","glue.auth.role_arn": "arn:aws:iam::476834799096:role/DirectQueryWithPublicLakeFormation","glue.indexstore.opensearch.uri": "http://ip-10-1-41-51.us-west-2.compute.internal:9200","glue.indexstore.opensearch.auth": "noauth", "glue.iceberg.enabled": "true", "glue.lakeformation.enabled": "true", "glue.lakeformation.session_tag": "directquery"}}'

Scenario 1 (S3 IAM permissions, Yes Iceberg and No Lake Formation flags set) Iceberg table should work

curl --request  POST --url http://localhost:9200/_plugins/_async_query --header 'content-type: application/x-ndjson' --data '{"datasource": "gdc2","lang": "sql","query": "SELECT * FROM gdc2.amazon_security_lake_glue_db_us_west_2.amazon_security_lake_table_us_west_2_vpc_flow_2_0 LIMIT 1;"}'
{
  "queryId": "aWt3aDQyM0JxaGdkYzI=",
  ...
}

curl --request GET --url http://localhost:9200/_plugins/_async_query/aWt3aDQyM0JxaGdkYzI=

Scenario 2 (LF permissions, Yes Iceberg and Yes Lake Formation flags set) Iceberg table should work

curl --request  POST --url http://localhost:9200/_plugins/_async_query --header 'content-type: application/x-ndjson' --data '{"datasource": "gdc3","lang": "sql","query": "SELECT * FROM gdc3.amazon_security_lake_glue_db_us_west_2.amazon_security_lake_table_us_west_2_vpc_flow_2_0 LIMIT 1;"}'
{
  "queryId": "T3A3RWlTSGVRSWdkYzM=",
  ...
}

curl --request GET --url http://localhost:9200/_plugins/_async_query/T3A3RWlTSGVRSWdkYzM=
{
  "status": "SUCCESS",
...
}

Check List

  • New functionality includes testing.
    • All tests pass, including unit test, integration test and doctest
  • New functionality has been documented.
    • New functionality has javadoc added
    • New functionality has user manual doc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* ``glue.lakeformation.enabled`` determines whether to enable lakeformation for queries. Default value is ``"false"`` if not specified
* ``glue.iceberg.enabled`` determines whether to enable Iceberg for the session. Default value is ``"false"`` if not specified.
* ``glue.lakeformation.enabled`` determines whether to enable Lake Formation for queries when Iceberg is also enabled. If Iceberg is not enabled, then this property has no effect. Default value is ``"false"`` if not specified.
* ``glue.lakeformation.session_tag`` what session tag to use when assuming the data source role. This property is required when both Iceberg and Lake Formation are enabled.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we introduced validations on these conditions while creating Glue datasource?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this! I've added them now.

@asuresh8 asuresh8 force-pushed the iceberg_lf branch 2 times, most recently from 75683cc to 726c24f Compare July 30, 2024 21:30
@asuresh8 asuresh8 force-pushed the iceberg_lf branch 2 times, most recently from 79b8f9a to af13394 Compare August 7, 2024 18:56
@asuresh8 asuresh8 changed the title Add flag for iceberg and correct flag for Lake Formation. Add flags for Iceberg and Lake Formation and Security Lake as a data source type. Aug 7, 2024
@asuresh8 asuresh8 force-pushed the iceberg_lf branch 3 times, most recently from 2a2843d to d5ae63a Compare August 7, 2024 21:55
@penghuo
Copy link
Collaborator

penghuo commented Aug 9, 2024

@ykmr1224 could u take a look.
@asuresh8 could u check Java CI failure.

@asuresh8
Copy link
Contributor Author

asuresh8 commented Aug 9, 2024

@asuresh8 could u check Java CI failure.

Failure does not look related to this change.

docs/user/ppl/admin/connectors/s3glue_connector.rst Outdated Show resolved Hide resolved
"glue.auth.role_arn": "role_arn",
"glue.indexstore.opensearch.uri": "http://adsasdf.amazonopensearch.com:9200",
"glue.indexstore.opensearch.auth" :"awssigv4",
"glue.indexstore.opensearch.auth.region" :"awssigv4",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. The value should be region name.

docs/user/ppl/admin/connectors/security_lake_connector.rst Outdated Show resolved Hide resolved
public static final String SPARK_CATALOG_CATALOG_IMPL =
"spark.sql.catalog.spark_catalog.catalog-impl";
public static final String ICEBERG_SPARK_JARS =
"org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.5.0,software.amazon.awssdk:bundle:2.26.30";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it safe to specify the specific version? When do we notice the issue if we have version inconsistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hard to say. Some things I observed here is that using the Iceberg version in EMR was causing issues in EMR versions prior to 7.2 (Spark 3.5.1). So specifying iceberg from Maven central is more stable than that. On the AWS version, I'm not sure. The AWS sdk v2 is only used with Iceberg in the EMR 6.x versions, and this version doesn't conflict. That's not to say it couldn't be an issue in EMR 7.x.

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
Copy link
Collaborator

@penghuo penghuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BWC Test failed, it is not related to this PR.
Track it seperatelly.

@penghuo penghuo merged commit 05c961e into opensearch-project:main Aug 13, 2024
13 of 15 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Aug 13, 2024
…source type. (#2858)

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.
---------

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
(cherry picked from commit 05c961e)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
manasvinibs pushed a commit to manasvinibs/sql that referenced this pull request Aug 14, 2024
…source type. (opensearch-project#2858)

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.
---------

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
jzonthemtn pushed a commit to jzonthemtn/sql that referenced this pull request Aug 28, 2024
…source type. (opensearch-project#2858)

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.
---------

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 4, 2024
…source type. (#2858)

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.
---------

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
(cherry picked from commit 05c961e)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ykmr1224 pushed a commit that referenced this pull request Sep 4, 2024
…source type. (#2858) (#2978)

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.
---------


(cherry picked from commit 05c961e)

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
ykmr1224 pushed a commit that referenced this pull request Sep 4, 2024
…source type. (#2858) (#2934)

Previously, Iceberg catalog was set as the default catalog. This poses
problems as the behavior to fall back to default Spark catalog is only
correct in some versions of Iceberg. Rather than always opt into
Iceberg, Iceberg should be an option.

Additionally, the Lake Formation flag enabled Lake Formation for the EMR
job. This did not work as expected because EMR system space does not
work with Flint. Instead Lake Formation can be enabled using the Iceberg
catalog implementation.

This changes adds Security Lake as a data source type. Security Lake as
a data source is simply specific options set on top of the base S3Glue
data source.
---------


(cherry picked from commit 05c961e)

Signed-off-by: Adi Suresh <adsuresh@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants