In this lab, we will explore using AWS Glue for serverless ETL. We will take the data previously stored in S3 (YOUR_USERNAME-datalake-demo-bucket) and make it queryable using standard SQL.
First, we will create an IAM Role for the Glue service to access our source data in S3. Glue will also require permissions to create additional entities and datasets in its catalog.
- Go to the IAM console: https://console.aws.amazon.com/iam/home?region=us-east-1#/roles. On the navigation bar on the left, select Roles.
- Click - Create role
- Select type of trusted entity: AWS service, and choose Glue as the service that will use the role
- Click - Next: Permissions
- Check two (2) policies: AmazonS3FullAccess and AWSGlueServiceRole
- Click - Next: Tags
- Leave as default, no changes here
- Click - Next: Review
- Role name: AWSGlueServiceRoleLab
- Make sure there are two policies attached to this role (AmazonS3FullAccess, AWSGlueServiceRole)
- Click - Create role
You will later use this role, AWSGlueServiceRoleLab, for running Glue tasks.
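If you prefer to script these console steps, a minimal boto3 sketch of the same role setup could look like this (the role name and the two managed policies come from the steps above; treat the rest as a sketch, not the lab's official method):

```python
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing the Glue service to assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

iam.create_role(
    RoleName="AWSGlueServiceRoleLab",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the same two managed policies checked in the console.
for arn in (
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
    "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
):
    iam.attach_role_policy(RoleName="AWSGlueServiceRoleLab", PolicyArn=arn)
```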
Glue Crawlers (not web crawlers) are a way to populate Glue Data Catalog tables from files and database tables.
In this step, we will create Crawlers to discover the schema and structure of data stored in our S3 bucket.
- Access the Glue console: https://console.aws.amazon.com/glue/home?region=us-east-1
- On the left navigation panel, click on Crawlers, and then Add crawler to create a crawler.
- Crawler info
- Crawler name: CrawlDataFromKDG
- Click - Next
- Crawler source type: Data Stores.
- Next
- Choose a data store
- Data stores: S3
- Crawl data in: Specified path in my account
- Include path: s3://YOUR_USERNAME-datalake-demo-bucket/data/
- Click - Next
- Add another data store : No
- Click - Next
- Choose an IAM Role
- Choose: Choose an existing IAM Role
- Role Name: AWSGlueServiceRoleLab
- Click - Next
- Schedule
- Frequency: Run on demand
- Click - Next
- Configure the crawler's output
- Click - Add database
- Database name: awslabs_db
- Click - Create
- Click - Next
- Review all steps
- Review the configuration and make sure it is as described above
- Click - Finish
- You should see a message: "Crawler CrawlDataFromKDG was created to run on demand. Run it now?"
- Click on Run it now? to run the crawler immediately.
- Wait a few minutes while crawler CrawlDataFromKDG runs.
- Continue when the crawler Status has changed to Stopping and back to Ready.
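For reference, here is a hedged boto3 sketch of the same crawler setup (the crawler, role, database, and path names are the lab's; the polling loop is an illustrative addition):

```python
import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Create the catalog database the crawler will write tables into.
glue.create_database(DatabaseInput={"Name": "awslabs_db"})

# Same settings as the console wizard; no Schedule means run on demand.
glue.create_crawler(
    Name="CrawlDataFromKDG",
    Role="AWSGlueServiceRoleLab",
    DatabaseName="awslabs_db",
    Targets={"S3Targets": [{"Path": "s3://YOUR_USERNAME-datalake-demo-bucket/data/"}]},
)

glue.start_crawler(Name="CrawlDataFromKDG")

# Poll until the crawler has finished and returned to READY.
while glue.get_crawler(Name="CrawlDataFromKDG")["Crawler"]["State"] != "READY":
    time.sleep(30)
```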
Let's view the data we crawled with Glue. The data is available as a Glue Database and Table.
- On the navigation panel on the left, select Databases
- Click on awslabs_db
- Let's view the tables and their contents in this database.
- Click on Tables in awslabs_db. There is a table named data.
- View the data table. Explore the table schema and confirm that it matches the expected schema and size of your dataset.
- Inspect the values for averageRecordSize, recordCount, and compressionType.
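The same schema and statistics can also be read programmatically; a short boto3 sketch (the property names match what the console displays):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")
table = glue.get_table(DatabaseName="awslabs_db", Name="data")["Table"]

# Columns discovered by the crawler.
for col in table["StorageDescriptor"]["Columns"]:
    print(col["Name"], col["Type"])

# Crawler-populated statistics live in the table parameters.
for key in ("averageRecordSize", "recordCount", "compressionType"):
    print(key, table["Parameters"].get(key))
```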
Your JSON data in S3 is now a relational dataset in the form of a Glue Database and Table. How can we query the data table?
Let's query the Glue tables using Athena.
- Open the Athena console: https://us-east-1.console.aws.amazon.com/athena/home?region=us-east-1#query
- You may first have to configure an S3 output location for Athena query results in the Athena settings.
- On the left panel, in the Database drop-down, select awslabs_db, then select the table data.
- View the SQL statement printed in the query editor on the right:
SELECT * FROM "awslabs_db"."data" limit 10;
- What does this query do?
- Click on Run Query to execute the query. What do you see printed in the Results section?
- Paste the following SQL query in the editor:
SELECT activity_type, count(activity_type) FROM data GROUP BY activity_type ORDER BY activity_type
- How many counts do you have for the following activities: Traveling and Running? Share this with your instructor.
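Athena queries can also be submitted with boto3; here is a sketch running the same aggregation (the results bucket name is a placeholder - use the query result location you configured in the Athena settings):

```python
import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString=(
        "SELECT activity_type, count(activity_type) "
        "FROM data GROUP BY activity_type ORDER BY activity_type"
    ),
    QueryExecutionContext={"Database": "awslabs_db"},
    # Placeholder: your Athena query result location.
    ResultConfiguration={"OutputLocation": "s3://YOUR_USERNAME-athena-results/"},
)
print("QueryExecutionId:", response["QueryExecutionId"])
```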
Configure an RDS connection for Glue
- Create a VPC. You can use the provided CloudFormation template or use the following steps:
- Create a set of private subnets
- Create a NAT Gateway
- Create a route table with 0.0.0.0/0 pointing to the NAT Gateway. Associate the route table with the private subnets.
- Create a VPC endpoint for S3. Associate the earlier route table with this endpoint (a boto3 sketch of this step appears after these steps).
- Create an RDS cluster. Avoid using Easy Create; use Standard Create to create an initial database glue_db.
- After the RDS writer is created, go to the Glue console and Add connection
- Enter the relevant values. Check Require SSL if needed.
- Fill in the relevant DB auth values, using the earlier database name glue_db
- Complete the form and Test Connection
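First, the promised sketch of the S3 gateway endpoint step with boto3 (all resource IDs are placeholders for the VPC and route table you created above):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Gateway endpoint so resources in the private subnets can reach S3.
ec2.create_vpc_endpoint(
    VpcId="vpc-0123456789abcdef0",            # placeholder VPC id
    ServiceName="com.amazonaws.us-east-1.s3",
    VpcEndpointType="Gateway",
    RouteTableIds=["rtb-0123456789abcdef0"],  # the private route table above
)
```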
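The Add connection form maps onto glue.create_connection; a sketch with placeholder endpoint, credentials, and network settings (the connection name aurora-glue-db is hypothetical - reuse whatever name you chose in the console):

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_connection(
    ConnectionInput={
        "Name": "aurora-glue-db",  # hypothetical connection name
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            # Placeholders: your Aurora writer endpoint and credentials.
            "JDBC_CONNECTION_URL": "jdbc:mysql://YOUR_CLUSTER_ENDPOINT:3306/glue_db",
            "USERNAME": "admin",
            "PASSWORD": "YOUR_PASSWORD",
        },
        "PhysicalConnectionRequirements": {
            # Placeholders: one of the private subnets and its security group.
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```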
- Create a Glue job to import reference_data in S3 to a SQL table in RDS Aurora. Finish Extra Materials #2 first.
- In the Glue console left sidebar, click on Jobs.
- Click Add job.
- Name: Import reference_data to RDS
- IAM role: AWSGlueServiceRoleLab
- Type and Glue version: Leave as default
- This job runs: A proposed script generated by AWS Glue. Click Next
- Choose a data source: reference_data
- Choose a transform type: Change schema
- Choose a data target: Create tables in your data target, Connection: the Aurora connection created earlier, Database name: glue_db
- Map the schema, then Save job and edit script
- In the Edit Script view, click Run job (a sketch of the proposed script's shape follows)
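The proposed script has roughly the following shape. This is a sketch, not the exact code Glue generates; the catalog table name reference_data and the connection name aurora-glue-db are assumptions based on the steps above:

```python
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled reference_data table from the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="awslabs_db", table_name="reference_data"
)

# Write the rows to RDS through the Glue connection created earlier.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=source,
    catalog_connection="aurora-glue-db",  # your connection name
    connection_options={"dbtable": "reference_data", "database": "glue_db"},
)

job.commit()
```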
- Test the configuration, e.g. from a bastion host. You should see the database glue_db with a table for reference_data.
Analyze S3 data directly in Athena without using Glue tables.
- Amazon provides publicly accessible large datasets via registry.opendata.aws. These datasets include census data, satellite data, healthcare, social platforms, etc.
- Try importing a dataset of customer reviews from amazon.com using the amazon_reviews_parquet S3 bucket.
- Run the following command in Athena:
CREATE EXTERNAL TABLE `amazon_reviews_parquet`(
  `marketplace` string,
  `customer_id` string,
  `review_id` string,
  `product_id` string,
  `product_parent` string,
  `product_title` string,
  `star_rating` int,
  `helpful_votes` int,
  `total_votes` int,
  `vine` string,
  `verified_purchase` string,
  `review_headline` string,
  `review_body` string,
  `review_date` bigint,
  `year` int)
PARTITIONED BY (`product_category` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 's3://amazon-reviews-pds/parquet'
TBLPROPERTIES ('transient_lastDdlTime'='1570770352')
- Run the repair command to load the table's partitions into the metastore:
MSCK REPAIR TABLE amazon_reviews_parquet;
- Create a view for Books from the data:
CREATE OR REPLACE VIEW "Books" AS SELECT product_title, star_rating FROM amazon_reviews_parquet WHERE product_category = 'Books'
- You can now preview a list of reviews for Books sold on amazon.com.
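To pull that preview programmatically rather than in the console, a boto3 sketch (the database name default and the results bucket are assumptions - adjust them to your setup):

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query_id = athena.start_query_execution(
    QueryString='SELECT * FROM "Books" LIMIT 10',
    QueryExecutionContext={"Database": "default"},  # assumed database
    ResultConfiguration={"OutputLocation": "s3://YOUR_USERNAME-athena-results/"},
)["QueryExecutionId"]

# Wait for the query to finish before fetching rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)[
        "QueryExecution"]["Status"]["State"]
    if state not in ("QUEUED", "RUNNING"):
        break
    time.sleep(2)

# First row is the header; the rest are Books reviews.
for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
    print([field.get("VarCharValue") for field in row["Data"]])
```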