feat(glue): support partition index on tables (#17998)
This PR adds support for creating partition indexes on tables via custom resources.
It offers two different ways to create indexes:

```ts
// via table definition
const table = new glue.Table(this, 'Table', {
  database,
  bucket,
  tableName: 'table',
  columns,
  partitionKeys,
  partitionIndexes: [{
    indexName: 'my-index',
    keyNames: ['month'],
  }],
  dataFormat: glue.DataFormat.CSV,
});
```

```ts
// or as a function
table.addPartitionIndex({
  indexName: 'my-other-index',
  keyNames: ['month', 'year'],
});
```

I also refactored the format of some tests, which is what accounts for the large diff in `test.table.ts`. 

Motivation: 
Creating partition indexes on a table is something you can do via the console, but it is not an exposed property in CloudFormation. In this case, I think it makes sense to support this feature via custom resources, as it will significantly reduce the customer pain of either provisioning a custom resource with the correct permissions or manually going into the console after resource creation. Supporting this feature also allows for synth-time checks and dependency chaining for multiple indexes (the reason is detailed in the FAQ), which removes a rather sharp edge for users provisioning custom resource indexes themselves.

FAQ:

Why do we need to chain dependencies between different Partition Index Custom Resources? 
  - Because Glue only allows one index to be created or deleted at a time per table. Without dependencies, the resources will try to create partition indexes simultaneously and the second SDK call will be dropped.
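
To illustrate the chaining idea, here is a minimal sketch in plain TypeScript. `FakeResource` and `IndexChain` are hypothetical stand-ins invented for this example, not CDK classes; the sketch only shows how linking each new resource to the one before it leaves no two resources free to be created in parallel. Here each new resource depends on the previous one, but any consistent direction serializes creation.

```typescript
// Hypothetical stand-in for a deployable resource; NOT a CDK class.
class FakeResource {
  readonly dependsOn: FakeResource[] = [];

  constructor(readonly id: string) {}

  addDependency(other: FakeResource): void {
    this.dependsOn.push(other);
  }
}

// Hypothetical helper that links every new resource to the previous one.
class IndexChain {
  private readonly resources: FakeResource[] = [];

  add(id: string): FakeResource {
    const resource = new FakeResource(id);
    const previous = this.resources[this.resources.length - 1];
    if (previous !== undefined) {
      // The new resource waits for the one registered just before it.
      resource.addDependency(previous);
    }
    this.resources.push(resource);
    return resource;
  }
}

const chain = new IndexChain();
const first = chain.add('index-a');
const second = chain.add('index-b');
const third = chain.add('index-c');
```

After these calls, `second` depends on `first` and `third` depends on `second`, so an engine that honors the dependencies can only process one index at a time.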

Why is it called `partitionIndexes`? Is that really how you pluralize index?
  - [Yesish](https://www.nasdaq.com/articles/indexes-or-indices-whats-the-deal-2016-05-12). If you hate it, it can be `partitionIndices`.

Why is `keyNames` of type `string[]` and not `Column[]`? `PartitionKey` is of type `Column[]` and partition indexes must be a subset of partition keys...
  - This could be a debate. But my argument is that the pattern I see for defining a Table is to define partition keys inline and not declare them each as variables. It would be pretty clunky from a UX perspective:
    ```ts
    const key1 = { name: 'mykey', type: glue.Schema.STRING };
    const key2 = { name: 'mykey2', type: glue.Schema.STRING };
    const key3 = { name: 'mykey3', type: glue.Schema.STRING };
    new glue.Table(this, 'table', {
      database,
      bucket,
      tableName: 'table',
      columns,
      partitionKeys: [key1, key2, key3],
      partitionIndexes: [key1, key2],
      dataFormat: glue.DataFormat.CSV,
    });
    ```
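
With `keyNames` typed as `string[]`, the subset requirement can still be checked by name. The sketch below is plain TypeScript and assumes a hypothetical `ColumnLike` shape and a hypothetical helper `findUnknownKeys`; neither is part of the library.

```typescript
// Hypothetical minimal column shape for this example only.
interface ColumnLike {
  name: string;
  type: string;
}

// Returns the index key names that do not match any partition key name.
// An empty result means the index keys are a subset of the partition keys.
function findUnknownKeys(keyNames: string[], partitionKeys: ColumnLike[]): string[] {
  const known = partitionKeys.map((pk) => pk.name);
  return keyNames.filter((k) => known.indexOf(k) === -1);
}

const partitionKeys: ColumnLike[] = [
  { name: 'year', type: 'smallint' },
  { name: 'month', type: 'smallint' },
];

const validIndex = findUnknownKeys(['year'], partitionKeys);
const invalidIndex = findUnknownKeys(['year', 'day'], partitionKeys);
```

Here `validIndex` is empty, while `invalidIndex` contains `'day'`, which a caller could report in an error message at synth time.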

Why are there 2 different checks for having > 3 partition indexes?
  - It's possible someone defines 3 indexes in the definition and then tries to add another with `table.addPartitionIndex()`. This would be a nasty deploy-time error; it's better if it fails at synth time. It's also possible someone defines 4 indexes in the definition. It's better to fail fast here, before we create 3 custom resources.
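
The two checks can be sketched in plain TypeScript as follows. `TableModel` and `PartitionIndexSpec` are hypothetical names used only for illustration; this is a simplified model of the idea, not the code in `table.ts`.

```typescript
const MAX_PARTITION_INDEXES = 3;

// Hypothetical shape mirroring the index properties described above.
interface PartitionIndexSpec {
  indexName?: string;
  keyNames: string[];
}

// Hypothetical model of a table that tracks its partition indexes.
class TableModel {
  private readonly indexes: PartitionIndexSpec[] = [];

  constructor(initial: PartitionIndexSpec[] = []) {
    // Check 1: reject an oversized list before registering anything.
    if (initial.length > MAX_PARTITION_INDEXES) {
      throw new Error(`Maximum number of partition indexes allowed is ${MAX_PARTITION_INDEXES}`);
    }
    initial.forEach((index) => this.addPartitionIndex(index));
  }

  addPartitionIndex(index: PartitionIndexSpec): void {
    // Check 2: reject indexes added later that would exceed the limit.
    if (this.indexes.length >= MAX_PARTITION_INDEXES) {
      throw new Error(`Maximum number of partition indexes allowed is ${MAX_PARTITION_INDEXES}`);
    }
    this.indexes.push(index);
  }

  count(): number {
    return this.indexes.length;
  }
}

const table = new TableModel([{ keyNames: ['year'] }, { keyNames: ['month'] }]);
table.addPartitionIndex({ keyNames: ['year', 'month'] });
```

In this model, a fourth call to `addPartitionIndex()` throws, and so does constructing a `TableModel` with four indexes, in which case no index is registered at all.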

What if I deploy a table, manually add 3 partition indexes, and then try to call `table.addPartitionIndex()` and update the stack? Will that still be a synth time failure?
  - Sorry, no. 

Why do we need to generate names?
  - We don't. I just thought it would be helpful.
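
For reference, the naming scheme in this change joins the key names into a prefix and appends the tail of a unique ID, trimmed to fit a length cap. The standalone sketch below follows that logic but takes the unique ID and the cap as parameters, which is an assumption made here so the example can run without the CDK; the ID strings are made up.

```typescript
// Standalone sketch: uniqueId and maxLength are passed in explicitly
// (an assumption for this example; they are not parameters in the change itself).
function generateIndexName(keys: string[], uniqueId: string, maxLength = 80): string {
  const prefix = keys.join('-') + '-';
  // Keep only as many trailing characters of the ID as fit after the prefix.
  const startIndex = Math.max(0, uniqueId.length - (maxLength - prefix.length));
  return prefix + uniqueId.substring(startIndex);
}

// Short ID: nothing is trimmed.
const fullName = generateIndexName(['year', 'month'], 'StackTableABC123');
// fullName === 'year-month-StackTableABC123'

// Small cap for demonstration: only the last 7 characters of the ID are kept.
const trimmedName = generateIndexName(['year'], 'MyStackMyTable1234', 12);
// trimmedName === 'year-ble1234'
```

Keeping the tail of the ID rather than the head is a reasonable choice when the distinguishing part of the ID is at the end. Note that if the prefix alone exceeds the cap, the result is just the prefix and is longer than the cap.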

Why is `grantToUnderlyingResources` public?
  - I thought it would be helpful. Some permissions need to be added to the table, the database, and the catalog.

Closes #17589.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
kaizencc committed Dec 29, 2021
1 parent aa51b6c commit c071367
Showing 7 changed files with 1,379 additions and 420 deletions.
48 changes: 47 additions & 1 deletion packages/@aws-cdk/aws-glue/README.md
@@ -194,7 +194,7 @@ new glue.Table(this, 'MyTable', {

By default, an S3 bucket will be created to store the table's data and stored in the bucket root. You can also manually pass the `bucket` and `s3Prefix`:

-### Partitions
+### Partition Keys

To improve query performance, a table can specify `partitionKeys` on which data is stored and queried separately. For example, you might partition a table by `year` and `month` to optimize queries based on a time window:

@@ -218,6 +218,52 @@ new glue.Table(this, 'MyTable', {
});
```

### Partition Indexes

Another way to improve query performance is to specify partition indexes. If no partition indexes are
present on the table, AWS Glue loads all partitions of the table and filters the loaded partitions using
the query expression. The query takes more time to run as the number of partitions increases. With an
index, the query will try to fetch a subset of the partitions instead of loading all partitions of the
table.

The keys of a partition index must be a subset of the partition keys of the table. You can have a
maximum of 3 partition indexes per table. To specify a partition index, you can use the `partitionIndexes`
property:

```ts
declare const myDatabase: glue.Database;
new glue.Table(this, 'MyTable', {
  database: myDatabase,
  tableName: 'my_table',
  columns: [{
    name: 'col1',
    type: glue.Schema.STRING,
  }],
  partitionKeys: [{
    name: 'year',
    type: glue.Schema.SMALL_INT,
  }, {
    name: 'month',
    type: glue.Schema.SMALL_INT,
  }],
  partitionIndexes: [{
    indexName: 'my-index', // optional
    keyNames: ['year'],
  }], // supply up to 3 indexes
  dataFormat: glue.DataFormat.JSON,
});
```

Alternatively, you can call the `addPartitionIndex()` function on a table:

```ts
declare const myTable: glue.Table;
myTable.addPartitionIndex({
  indexName: 'my-index',
  keyNames: ['year'],
});
```

## [Encryption](https://docs.aws.amazon.com/athena/latest/ug/encryption.html)

You can enable encryption on a Table's data:
132 changes: 130 additions & 2 deletions packages/@aws-cdk/aws-glue/lib/table.ts
@@ -1,13 +1,33 @@
import * as iam from '@aws-cdk/aws-iam';
import * as kms from '@aws-cdk/aws-kms';
import * as s3 from '@aws-cdk/aws-s3';
-import { ArnFormat, Fn, IResource, Resource, Stack } from '@aws-cdk/core';
+import { ArnFormat, Fn, IResource, Names, Resource, Stack } from '@aws-cdk/core';
import * as cr from '@aws-cdk/custom-resources';
import { AwsCustomResource } from '@aws-cdk/custom-resources';
import { Construct } from 'constructs';
import { DataFormat } from './data-format';
import { IDatabase } from './database';
import { CfnTable } from './glue.generated';
import { Column } from './schema';

/**
 * Properties of a Partition Index.
 */
export interface PartitionIndex {
  /**
   * The name of the partition index.
   *
   * @default - a name will be generated for you.
   */
  readonly indexName?: string;

  /**
   * The partition key names that comprise the partition
   * index. The names must correspond to a name in the
   * table's partition keys.
   */
  readonly keyNames: string[];
}
export interface ITable extends IResource {
/**
* @attribute
@@ -102,7 +122,16 @@ export interface TableProps {
*
* @default table is not partitioned
*/
-  readonly partitionKeys?: Column[]
+  readonly partitionKeys?: Column[];

  /**
   * Partition indexes on the table. A maximum of 3 indexes
   * are allowed on a table. Keys in the index must be part
   * of the table's partition keys.
   *
   * @default table has no partition indexes
   */
  readonly partitionIndexes?: PartitionIndex[];

/**
* Storage type of the table's data.
@@ -230,6 +259,18 @@ export class Table extends Resource implements ITable {
*/
public readonly partitionKeys?: Column[];

  /**
   * This table's partition indexes.
   */
  public readonly partitionIndexes?: PartitionIndex[];

  /**
   * Partition indexes must be created one at a time. To avoid
   * race conditions, we store the resource and add dependencies
   * each time a new partition index is created.
   */
  private partitionIndexCustomResources: AwsCustomResource[] = [];

constructor(scope: Construct, id: string, props: TableProps) {
super(scope, id, {
physicalName: props.tableName,
@@ -287,6 +328,77 @@ export class Table extends Resource implements ITable {
resourceName: `${this.database.databaseName}/${this.tableName}`,
});
this.node.defaultChild = tableResource;

    // Partition index creation relies on created table.
    if (props.partitionIndexes) {
      this.partitionIndexes = props.partitionIndexes;
      this.partitionIndexes.forEach((index) => this.addPartitionIndex(index));
    }
  }

  /**
   * Add a partition index to the table. You can have a maximum of 3 partition
   * indexes to a table. Partition index keys must be a subset of the table's
   * partition keys.
   *
   * @see https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html
   */
  public addPartitionIndex(index: PartitionIndex) {
    const numPartitions = this.partitionIndexCustomResources.length;
    if (numPartitions >= 3) {
      throw new Error('Maximum number of partition indexes allowed is 3');
    }
    this.validatePartitionIndex(index);

    const indexName = index.indexName ?? this.generateIndexName(index.keyNames);
    const partitionIndexCustomResource = new cr.AwsCustomResource(this, `partition-index-${indexName}`, {
      onCreate: {
        service: 'Glue',
        action: 'createPartitionIndex',
        parameters: {
          DatabaseName: this.database.databaseName,
          TableName: this.tableName,
          PartitionIndex: {
            IndexName: indexName,
            Keys: index.keyNames,
          },
        },
        physicalResourceId: cr.PhysicalResourceId.of(
          indexName,
        ),
      },
      policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
        resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE,
      }),
    });
    this.grantToUnderlyingResources(partitionIndexCustomResource, ['glue:UpdateTable']);

    // Depend on previous partition index if possible, to avoid race condition
    if (numPartitions > 0) {
      this.partitionIndexCustomResources[numPartitions-1].node.addDependency(partitionIndexCustomResource);
    }
    this.partitionIndexCustomResources.push(partitionIndexCustomResource);
  }

  private generateIndexName(keys: string[]): string {
    const prefix = keys.join('-') + '-';
    const uniqueId = Names.uniqueId(this);
    const maxIndexLength = 80; // arbitrarily specified
    const startIndex = Math.max(0, uniqueId.length - (maxIndexLength - prefix.length));
    return prefix + uniqueId.substring(startIndex);
  }

  private validatePartitionIndex(index: PartitionIndex) {
    if (index.indexName !== undefined && (index.indexName.length < 1 || index.indexName.length > 255)) {
      throw new Error(`Index name must be between 1 and 255 characters, but got ${index.indexName.length}`);
    }
    if (!this.partitionKeys || this.partitionKeys.length === 0) {
      throw new Error('The table must have partition keys to create a partition index');
    }
    const keyNames = this.partitionKeys.map(pk => pk.name);
    if (!index.keyNames.every(k => keyNames.includes(k))) {
      throw new Error(`All index keys must also be partition keys. Got ${index.keyNames} but partition key names are ${keyNames}`);
    }
  }

/**
@@ -336,6 +448,22 @@ export class Table extends Resource implements ITable {
});
}

  /**
   * Grant the given identity custom permissions to ALL underlying resources of the table.
   * Permissions will be granted to the catalog, the database, and the table.
   */
  public grantToUnderlyingResources(grantee: iam.IGrantable, actions: string[]) {
    return iam.Grant.addToPrincipal({
      grantee,
      resourceArns: [
        this.tableArn,
        this.database.catalogArn,
        this.database.databaseArn,
      ],
      actions,
    });
  }

private getS3PrefixForGrant() {
return this.s3Prefix + '*';
}
2 changes: 2 additions & 0 deletions packages/@aws-cdk/aws-glue/package.json
@@ -99,6 +99,7 @@
"@aws-cdk/aws-s3": "0.0.0",
"@aws-cdk/aws-s3-assets": "0.0.0",
"@aws-cdk/core": "0.0.0",
"@aws-cdk/custom-resources": "0.0.0",
"constructs": "^3.3.69"
},
"homepage": "https://github.com/aws/aws-cdk",
@@ -113,6 +114,7 @@
"@aws-cdk/aws-s3": "0.0.0",
"@aws-cdk/aws-s3-assets": "0.0.0",
"@aws-cdk/core": "0.0.0",
"@aws-cdk/custom-resources": "0.0.0",
"constructs": "^3.3.69"
},
"engines": {