Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(glue-alpha): glue L2 alpha construct #30833

Open
wants to merge 71 commits into
base: main
Choose a base branch
from

Conversation

natalie-white-aws
Copy link

@natalie-white-aws natalie-white-aws commented Jul 11, 2024

Issue # (if applicable)

RFC 0497-glue-l2-construct
Link: https://github.com/aws/aws-cdk-rfcs/blob/main/text/0497-glue-l2-construct.md

Reason for this change

Refactor existing Glue L2 alpha construct to adhere to CDK best practices, enforce Job types and parameters via interface constructs, and improve the developer experience pushing validations left and disallowing deprecated Glue and language versions.

Description of changes

The existing Job and Job Executable monoliths have been decomposed into Job Type and Language specific classes that implement and extend an abstract Job parent class. Developers will be able to see mandatory and optional parameters that apply just to their selected job type and language, rather than having to reference documentation and examples or find out during synth or deploy time that they've selected the wrong configuration.

BREAKING CHANGE: Developers must refactor their existing Job instantiation method calls to choose the right job type and language, and use the new constants static values to define the associated Job configuration settings. See the RFC and/or new README for examples.

Description of how you validated changes

We've added unit and integration tests for each valid job type and language combination and validated the creation of new Glue Jobs adhere to opinionated best practices such as continuous logging and runtime versions.

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

natalie-white-aws and others added 30 commits February 23, 2024 06:04
…-glue-l2 to mjanardhan/aws-cdk-glue-l2 to resolve non-trivial merge situation
@aws-cdk-automation aws-cdk-automation removed the pr-linter/exemption-requested The contributor has requested an exemption to the PR Linter feedback. label Jul 12, 2024
@natalie-white-aws
Copy link
Author

natalie-white-aws commented Jul 12, 2024

Ah, the test coverage. Y'all care about that? :D We'll get on it.

@aws-cdk/aws-glue-alpha: =============================== Coverage summary ===============================
@aws-cdk/aws-glue-alpha: Statements : 85.06% ( 769/904 )
@aws-cdk/aws-glue-alpha: Branches : 65.89% ( 284/431 )
@aws-cdk/aws-glue-alpha: Functions : 69.5% ( 139/200 )
@aws-cdk/aws-glue-alpha: Lines : 86.1% ( 762/885 )
@aws-cdk/aws-glue-alpha: ================================================================================
@aws-cdk/aws-glue-alpha: Jest: "global" coverage threshold for branches (80%) not met: 65.89%
@aws-cdk/aws-glue-alpha: Test Suites: 20 passed, 20 total
@aws-cdk/aws-glue-alpha: Tests: 242 passed, 242 total
@aws-cdk/aws-glue-alpha: Snapshots: 0 total
@aws-cdk/aws-glue-alpha: Time: 42.296 s
@aws-cdk/aws-glue-alpha: Ran all test suites.

@TheRealAmazonKendra
Copy link
Contributor

Y'all care about that?

😐

@aws-cdk-automation aws-cdk-automation added the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Aug 6, 2024
@aws-cdk-automation
Copy link
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 1c31609
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Comment on lines -200 to -213
## Database

A `Database` is a logical grouping of `Tables` in the Glue Catalog.

```ts
new glue.Database(this, 'MyDatabase', {
databaseName: 'my_database',
description: 'my_database_description',
});
```

## Table

A Glue table describes a table of data in S3: its structure (column names and types), location of data (S3 objects with a common prefix in a S3 bucket), and format for the files (Json, Avro, Parquet, etc.):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come the documentation for Database, Table and many others are being deleted here? The code files are not being modified, is it still WIP or will they be removed or something?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @natalie-white-aws and the removal of these docs are unintended and will be updated to add them back.

Copy link
Contributor

@GavinZZ GavinZZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this huge effort on re-writing L2 glue. Left some feedback on it. Still working my way to finish the review.

A general feedback on the README is to add more examples with usage.

from multiple sources for analytics, machine learning (ML), and application
development.

Wihout an L2 construct, developers define Glue data sources, connections,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Wihout an L2 construct, developers define Glue data sources, connections,
Without an L2 construct, developers define Glue data sources, connections,


There are 3 types of jobs supported by AWS Glue: Spark ETL, Spark Streaming, and Python Shell jobs.
1. **ETL Jobs**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give an example of ETL job?

Comment on lines +298 to +303
/**
* The IAM role assumed by Glue to run this job.
*
* @default - undefined
*/
readonly role?: iam.IRole;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary for the import attribute?

*
* @see https://docs.aws.amazon.com/glue/latest/dg/getting-started-access.html
**/
readonly role: iam.IRole;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Slight improvement: Instead of making it required, a common practice L2 does is to auto-generate a new role if users don't supply one. Can we do the same here? How hard would it be?

* @param id The construct's id.
* @param attrs Attributes for the Glue Job we want to import
*/
public static fromJobAttributes(scope: constructs.Construct, id: string, attrs: JobImportAttributes): IJob {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is fromXXX function even needed on an abstract class where users won't be able to instantiate anyway? Should we consider adding it to the concrete class?

}

private setupSparkUI(role: iam.IRole, sparkUiProps: SparkUIProps) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

Comment on lines +11 to +12
* Spark ETL Jobs class
* ETL jobs support pySpark and Scala languages, for which there are separate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit misleading as this file is dedicated for PySpark. I see there's another one for Scala. Can you update the docstrings?

Comment on lines +89 to +91
this.role = props.role, {
assumedBy: new iam.ServicePrincipal('glue.amazonaws.com'),
managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AWSGlueServiceRole')],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see you're using a comma operator to assign the values to this.role. Then why not make role optional?


// Enable CloudWatch metrics and continuous logging by default as a best practice
const continuousLoggingArgs = this.setupContinuousLogging(this.role, props.continuousLogging);
const profilingMetricsArgs = { '--enable-metrics': '' };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question as pyspark-etl one.

Comment on lines +124 to +145
const jobResource = new CfnJob(this, 'Resource', {
name: props.jobName,
description: props.description,
role: this.role.roleArn,
command: {
name: JobType.ETL,
scriptLocation: this.codeS3ObjectUrl(props.script),
pythonVersion: PythonVersion.THREE,
},
glueVersion: props.glueVersion ? props.glueVersion : GlueVersion.V3_0,
workerType: props.workerType ? props.workerType : WorkerType.G_1X,
numberOfWorkers: props.numberOfWorkers ? props.numberOfWorkers : 10,
maxRetries: props.maxRetries,
executionProperty: props.maxConcurrentRuns ? { maxConcurrentRuns: props.maxConcurrentRuns } : undefined,
notificationProperty: props.notifyDelayAfter ? { notifyDelayAfter: props.notifyDelayAfter.toMinutes() } : undefined,
timeout: props.timeout?.toMinutes(),
connections: props.connections ? { connections: props.connections.map((connection) => connection.connectionName) } : undefined,
securityConfiguration: props.securityConfiguration?.securityConfigurationName,
tags: props.tags,
executionClass: ExecutionClass.FLEX,
defaultArguments,
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A general feedback, the standard and the flex constructs look very similar and a lot of codes are duplicated. It would make future maintaining hard where the same changes need to apply to two places. Do we really need two separate files and constructs for it? Can we make it the same construct?

@aws-cdk-automation aws-cdk-automation removed the pr/needs-community-review This PR needs a review from a Trusted Community Member or Core Team Member. label Sep 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
beginning-contributor [Pilot] contributed between 0-2 PRs to the CDK p2
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants