This challenge revolves around the basics of browser automation. The tool we are focusing on is Puppeteer, a browser automation tool akin to Selenium. Browser automation tools are useful in software quality assurance and for scraping sites that have little to no public API.
The primary focus of this challenge is building a functional browser automation tool while keeping the infrastructure serverless with the help of AWS Lambda functions.
The browser automation infrastructure for this challenge will be built using Chrome, Puppeteer, Node, .NET Core, AWS Lambda, and AWS S3.
Chrome and Puppeteer are managed by the chrome-aws-lambda NPM package. This package is kept up to date with the latest versions of Chrome and Puppeteer, but more importantly, it compresses the executables into a deployment package that fits under the 50 MB limit imposed by AWS Lambda.
The Chrome browser is created by the chrome-aws-lambda package when the C# Lambda function calls the JS Lambda Layer. Once the browser is created, a WebSocket address is exposed for the C# Lambda function to connect to. From there, the C# Lambda controls Chrome using Puppeteer-Sharp.
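A minimal sketch of that connection step, assuming the WebSocket address has already been retrieved from the JS layer (the endpoint string below is a placeholder):

```csharp
using System.Threading.Tasks;
using PuppeteerSharp;

public static class BrowserConnector
{
    // Attach Puppeteer-Sharp to the Chrome instance that chrome-aws-lambda
    // launched. The endpoint value is whatever address the JS Lambda Layer
    // hands back; the hard-coded string here is only a placeholder.
    public static async Task<Page> OpenPageAsync(string browserWsEndpoint)
    {
        var browser = await Puppeteer.ConnectAsync(new ConnectOptions
        {
            BrowserWSEndpoint = browserWsEndpoint
        });
        return await browser.NewPageAsync();
    }
}
```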
Lambda Layers can be thought of as file folders that are available to your Lambda function at runtime. In this challenge, we use two Lambda Layers. The first layer is a JavaScript file that calls the chrome-aws-lambda NPM package. The second layer is a Linux executable of Node (the C# Lambda runtime does not include Node natively). These Layers are compressed and uploaded to AWS. The compressed versions that get uploaded can be found in assets. The uncompressed versions of these folders were added for reference and should not be altered for the purposes of this challenge.
A Master Lambda function exists to process the initial input file, which contains directions for which sites to scrape and how to scrape them. The Master Lambda will interpret the JSON file and invoke the appropriate scraper Lambda function.
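A rough sketch of that dispatch using the AWS SDK for .NET; the `ScrapeStep` shape here is hypothetical and should match whatever schema your steps.json actually uses:

```csharp
using System.Threading.Tasks;
using Amazon.Lambda;
using Amazon.Lambda.Model;
using Newtonsoft.Json;

public class ScrapeStep
{
    // Hypothetical fields; align these with your own steps.json schema.
    public string Scraper { get; set; }
    public string Url { get; set; }
}

public class MasterDispatcher
{
    private readonly IAmazonLambda _lambdaClient = new AmazonLambdaClient();

    // Interpret one step and asynchronously invoke the matching scraper Lambda.
    public async Task DispatchAsync(ScrapeStep step)
    {
        await _lambdaClient.InvokeAsync(new InvokeRequest
        {
            FunctionName = step.Scraper,            // e.g. a scraper function name
            InvocationType = InvocationType.Event,  // fire-and-forget
            Payload = JsonConvert.SerializeObject(step)
        });
    }
}
```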
- Create AWS Account
- Install AWS CLI
- Install .NET Core 2.1
- Install LambdaSharp
- Update the LambdaSharp tool:
dotnet tool update -g LambdaSharp.Tool
Using AWS CloudFormation under the covers, the LambdaSharp tool reads the Module.yml file to deploy all the defined AWS infrastructure. We will modify this file anytime we want to change the AWS infrastructure.
Clone Repository
git clone git@github.com:LambdaSharp/SharpPuppets.git
Deploy Infrastructure
cd SharpPuppets
lash init --quick-start --tier puppet
lash deploy --tier puppet
Note: You can run `export LAMBDASHARP_TIER='puppet'` in your terminal or add it to your bash profile in order to skip adding `--tier puppet` to every `lash` command.
Upload steps.json into the Scrape Bucket. The Master Lambda function is configured to be triggered anytime a JSON file is uploaded into the bucket. The Master Lambda function will then trigger the appropriate Scraper Lambda function. Check the CloudWatch logs for each respective Lambda function to validate that they were called appropriately. Any console messages (LogInfo) will appear in CloudWatch.
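For reference, a plain AWS SDK sketch of what that trigger path can look like; LambdaSharp's base classes wrap some of this plumbing, so treat the handler shape here as illustrative rather than the module's exact code:

```csharp
using System.IO;
using System.Threading.Tasks;
using Amazon.Lambda.Core;
using Amazon.Lambda.S3Events;
using Amazon.S3;

public class MasterFunction
{
    private readonly IAmazonS3 _s3Client = new AmazonS3Client();

    // Fires when a .json file lands in the Scrape Bucket: read the uploaded
    // steps file so it can be interpreted and dispatched to scrapers.
    public async Task HandleAsync(S3Event s3Event, ILambdaContext context)
    {
        foreach (var record in s3Event.Records)
        {
            var bucket = record.S3.Bucket.Name;
            var key = record.S3.Object.Key;
            using (var response = await _s3Client.GetObjectAsync(bucket, key))
            using (var reader = new StreamReader(response.ResponseStream))
            {
                var json = await reader.ReadToEndAsync();
                context.Logger.LogLine($"Received steps file {key}: {json}");
                // ... deserialize the steps and invoke the appropriate scraper ...
            }
        }
    }
}
```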
Developing in the cloud using headless Chrome sucks... to ease the pain, let's implement screenshots!
- Take a screenshot using Puppeteer's `ScreenshotAsync`
- Pass `/tmp/screenshot.png` to `ScreenshotAsync`; this saves the screenshot to the Lambda's temporary storage, which has a 512 MB limit
- Upload the screenshot to S3 using `S3Client.PutObjectAsync()` (a sketch combining these steps follows)
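A minimal sketch tying those three steps together, assuming an already-connected `page` and a `SCRAPE_BUCKET` environment variable (both stand-ins for however your function is actually wired up):

```csharp
using System;
using System.Threading.Tasks;
using Amazon.S3;
using Amazon.S3.Model;
using PuppeteerSharp;

public static class ScreenshotUploader
{
    // Capture the current page to Lambda's /tmp storage (512 MB limit),
    // then push the file into the Scrape Bucket.
    public static async Task CaptureAndUploadAsync(Page page, IAmazonS3 s3Client)
    {
        const string localPath = "/tmp/screenshot.png";
        await page.ScreenshotAsync(localPath);

        await s3Client.PutObjectAsync(new PutObjectRequest
        {
            BucketName = Environment.GetEnvironmentVariable("SCRAPE_BUCKET"),
            Key = $"screenshots/{DateTime.UtcNow:yyyyMMddHHmmss}.png",
            FilePath = localPath
        });
    }
}
```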
Trigger your Lambda function, then check the Scrape S3 bucket for a screenshot of the page you navigated to.
Puppeteer provides some basic actions, such as filling in an input field or clicking on a button. Conduct some actions on the site of your choosing!
Verify your screenshot after you conduct your actions to ensure the clicks and typing went through.
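For example, typing into a field and clicking a button might look like the sketch below; the CSS selectors are made up and will differ on whatever site you choose:

```csharp
using System.Threading.Tasks;
using PuppeteerSharp;

public static class PageActions
{
    // Fill an input field, click a button, and wait for the resulting
    // navigation. The selectors are placeholders for a real site.
    public static async Task SearchAsync(Page page)
    {
        await page.TypeAsync("input#search", "sharp puppets");
        await Task.WhenAll(
            page.ClickAsync("button[type='submit']"),
            page.WaitForNavigationAsync()
        );
    }
}
```

Clicking and waiting for navigation are kicked off together with `Task.WhenAll`, since awaiting the click alone can miss a navigation that starts immediately.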
Lambda allows you to kick off 1,000 copies of itself... so why not? We will create a new Lambda function to serve the "Hub" functionality in a Hub-Node relationship.
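A sketch of such a fan-out from the Hub, reusing the fire-and-forget invocation pattern above (`ScraperNodeFunction` is a placeholder name):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Amazon.Lambda;
using Amazon.Lambda.Model;
using Newtonsoft.Json;

public class HubFunction
{
    private readonly IAmazonLambda _lambdaClient = new AmazonLambdaClient();

    // Fan one unit of work out per URL; each Event invocation runs on a
    // separate Node Lambda instance, so the scrapes proceed in parallel.
    public Task FanOutAsync(IEnumerable<string> urls)
    {
        var invocations = urls.Select(url => _lambdaClient.InvokeAsync(new InvokeRequest
        {
            FunctionName = "ScraperNodeFunction",   // placeholder name
            InvocationType = InvocationType.Event,
            Payload = JsonConvert.SerializeObject(new { Url = url })
        }));
        return Task.WhenAll(invocations);
    }
}
```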
- (optional) Add another lambda function to scrape a new site.
- (optional) Expand on your [steps.json](./steps.json).
- Scrape something meaningful, such as images, songs, tweets, etc.
Get a cease and desist letter or get your IP address banned from the site of your choosing.
The winner of this challenge will be determined by the following formula:
[ (Usefulness rating) * (amount scraped) / 1 minute ] = score
** Usefulness rating is up to the discretion of the judges.