-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(glue): support partition index on tables #17998
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good. A handful of questions and changes, plus a few typos.
this.tableArn, | ||
this.database.catalogArn, | ||
this.database.databaseArn, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a source for these permissions? What does glue:UpdateTable
mean for a table vs catalog vs database?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is mostly from my own investigations (nothing documented that I could find).
Here's what I get with no permissions:
Received response status [FAILED] from custom resource. Message returned: User:
arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0
da5b7982bd-1GZIV4NJQYJJ1/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-89Z
BzvCbg7Oi is not authorized to perform: glue:UpdateTable on resource: arn:aws:g
lue:us-east-1:489318732371:catalog (RequestId: ab6c2467-9c78-4d23-be59-953e4ab2
3144)
Here's what I get after I add glue:UpdateTable
with permissions to the catalog
:
Received response status [FAILED] from custom resource. Message returned: User:
arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0
da5b7982bd-7U8A8EMIKP4E/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-nlw0
ulljEHZq is not authorized to perform: glue:UpdateTable on resource: arn:aws:gl
ue:us-east-1:489318732371:database/my_database (RequestId: d4caefa8-30d2-4b02-9
9e4-6b1305ab5aea)
So I then add the same permission to the database
and I get:
Received response status [FAILED] from custom resource. Message returned: User:
arn:aws:sts::489318732371:assumed-role/GluePartitionStack-AWS679f53fac002430cb0
da5b7982bd-1N8TEEWC845G7/GluePartitionStack-AWS679f53fac002430cb0da5b7982bd-z2X
SZ4Zjykso is not authorized to perform: glue:UpdateTable on resource: arn:aws:g
lue:us-east-1:489318732371:table/my_database/json_table (RequestId: aa73a7ff-0f
f8-4c20-8338-b0f587298d6e)
The only valid permission is for the catalog, database, and table.
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
Thank you for contributing! Your pull request will be updated from master and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork). |
This PR adds support for creating partition indexes on tables via custom resources. It offers two different ways to create indexes: ```ts // via table definition const table = new glue.Table(this, 'Table', { database, bucket, tableName: 'table', columns, partitionKeys, partitionIndexes: [{ indexName: 'my-index', keyNames: ['month'], }], dataFormat: glue.DataFormat.CSV, }); ``` ```ts // or as a function table.AddPartitionIndex([{ indexName: 'my-other-index', keyNames: ['month', 'year'], }); ``` I also refactored the format of some tests, which is what accounts for the large diff in `test.table.ts`. Motivation: Creating partition indexes on a table is something you can do via the console, but is not an exposed property in cloudformation. In this case, I think it makes sense to support this feature via custom resources as it will significantly reduce the customer pain of either provisioning a custom resource with correct permissions or manually going into the console after resource creation. Supporting this feature allows for synth-time checks and dependency chaining for multiple indexes (reason detailed in the FAQ) which removes a rather sharp edge for users provisioning custom resource indexes themselves. FAQ: Why do we need to chain dependencies between different Partition Index Custom Resources? - Because Glue only allows 1 index to be created or deleted simultaneously per table. Without dependencies the resources will try to create partition indexes simultaneously and the second sdk call with be dropped. Why is it called `partitionIndexes`? Is that really how you pluralize index? - [Yesish](https://www.nasdaq.com/articles/indexes-or-indices-whats-the-deal-2016-05-12). If you hate it it can be `partitionIndices`. Why is `keyNames` of type `string[]` and not `Column[]`? `PartitionKey` is of type `Column[]` and partition indexes must be a subset of partition keys... - This could be a debate. But my argument is that the pattern I see for defining a Table is to define partition keys inline and not declare them each as variables. It would be pretty clunky from a UX perspective: ```ts const key1 = { name: 'mykey', type: glue.Schema.STRING }; const key2 = { name: 'mykey2', type: glue.Schema.STRING }; const key3 = { name: 'mykey3', type: glue.Schema.STRING }; new glue.Table(this, 'table', { database, bucket, tableName: 'table', columns, partitionKeys: [key1, key2, key3], partitionIndexes: [key1, key2], dataFormat: glue.DataFormat.CSV, }); ``` Why are there 2 different checks for having > 3 partition indexes? - It's possible someone decides to define 3 indexes in the definition and then try to add another with `table.addPartitionIndex()`. This would be a nasty deploy time error, its better if it is synth time. It's also possible someone decides to define 4 indexes in the definition. It's better to fast-fail here before we create 3 custom resources. What if I deploy a table, manually add 3 partition indexes, and then try to call `table.addPartitionIndex()` and update the stack? Will that still be a synth time failure? - Sorry, no. Why do we need to generate names? - We don't. I just thought it would be helpful. Why is `grantToUnderlyingResources` public? - I thought it would be helpful. Some permissions need to be added to the table, the database, and the catalog. Closes aws#17589. ---- *By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
This PR adds support for creating partition indexes on tables via custom resources.
It offers two different ways to create indexes:
I also refactored the format of some tests, which is what accounts for the large diff in
test.table.ts
.Motivation:
Creating partition indexes on a table is something you can do via the console, but is not an exposed property in cloudformation. In this case, I think it makes sense to support this feature via custom resources as it will significantly reduce the customer pain of either provisioning a custom resource with correct permissions or manually going into the console after resource creation. Supporting this feature allows for synth-time checks and dependency chaining for multiple indexes (reason detailed in the FAQ) which removes a rather sharp edge for users provisioning custom resource indexes themselves.
FAQ:
Why do we need to chain dependencies between different Partition Index Custom Resources?
Why is it called
partitionIndexes
? Is that really how you pluralize index?partitionIndices
.Why is
keyNames
of typestring[]
and notColumn[]
?PartitionKey
is of typeColumn[]
and partition indexes must be a subset of partition keys...Why are there 2 different checks for having > 3 partition indexes?
table.addPartitionIndex()
. This would be a nasty deploy time error, its better if it is synth time. It's also possible someone decides to define 4 indexes in the definition. It's better to fast-fail here before we create 3 custom resources.What if I deploy a table, manually add 3 partition indexes, and then try to call
table.addPartitionIndex()
and update the stack? Will that still be a synth time failure?Why do we need to generate names?
Why is
grantToUnderlyingResources
public?Closes #17589.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license