SharpPuppets

About this Challenge

This challenge covers the basics of browser automation. The tool we are focusing on is Puppeteer, a browser automation tool akin to Selenium. Browser automation tools are useful in Software Quality Assurance and for scraping sites that offer little to no public API.

The primary focus of this challenge is building a functional browser automation tool while keeping the infrastructure serverless with the help of AWS Lambda functions.

The browser automation infrastructure for this challenge will be built using Chrome, Puppeteer, Node, .NET Core, AWS Lambda, and AWS S3.

Challenge Infrastructure

Infrastructure Diagram

(infrastructure diagram image)

Chrome and Puppeteer are managed by the chrome-aws-lambda NPM package. This package tracks recent versions of Chrome and Puppeteer and, more importantly, compresses the executables into a deployment package under the 50 MB limit imposed by AWS Lambda.

The Chrome browser is created by the chrome-aws-lambda package when the C# Lambda function calls the JS Lambda Layer. Once the browser is created, it exposes a websocket address for the C# Lambda function to connect to. From there, Chrome is controlled from the C# Lambda using Puppeteer-Sharp.
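The handoff above can be sketched from the JS layer's side. This is a minimal illustration only; the function name and wiring are assumptions, not the repo's actual layer code, though the chrome-aws-lambda properties used (args, executablePath, headless) are the package's documented API.

```javascript
// Illustrative sketch (not the repo's actual code): launch headless Chrome
// through chrome-aws-lambda and return its websocket address so the C# side
// can attach to it with Puppeteer-Sharp.
async function exposeBrowserEndpoint(chromium) {
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,                           // Lambda-safe Chrome flags
    executablePath: await chromium.executablePath, // unpacked Chrome binary
    headless: chromium.headless,
  });
  // The C# Lambda connects to this address via Puppeteer-Sharp.
  return browser.wsEndpoint();
}
```

Passing the `chromium` object in as a parameter keeps the sketch testable without the real package; in the layer it would come from `require('chrome-aws-lambda')`.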

Lambda Layers can be thought of as file folders that are available to your Lambda function at runtime. In this challenge, we use two Lambda Layers. The first layer is a JavaScript file that calls the chrome-aws-lambda NPM package. The second layer is a Linux executable of Node (the C# Lambda runtime does not have Node natively available). These layers are compressed and uploaded to AWS. The compressed versions that get uploaded can be found in assets. The uncompressed versions of these folders were added for reference and should not be altered for the purposes of this challenge.

A Master Lambda Function exists to process the initial input file, which contains directions for which sites to scrape and how to scrape them. The Master Lambda interprets the JSON file and invokes the appropriate Lambda function.
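The routing decision the Master Lambda makes can be sketched as a pure function. The field names ("Site") and the function-naming convention shown here are assumptions for illustration, not the repo's actual schema:

```javascript
// Illustrative sketch: parse the uploaded steps file and work out which
// Scraper Lambda to invoke for each entry.
function resolveScraperInvocations(stepsFileContents) {
  const steps = JSON.parse(stepsFileContents);
  return steps.map(step => ({
    functionName: `Scraper-${step.Site}`, // one scraper function per site
    payload: JSON.stringify(step),        // forward the step as the event
  }));
}
```

In the real function, each entry would then be passed to an AWS Lambda Invoke call against the resolved function name.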

Level 1: Setup

Dependencies

Deploy

Using AWS CloudFormation under the covers, the LambdaSharp tool reads the Module.yml file to deploy all of the defined AWS infrastructure. We will modify this file anytime we want to change the AWS infrastructure.

Clone Repository

git clone git@github.com:LambdaSharp/SharpPuppets.git

Deploy Infrastructure

cd SharpPuppets
lash init --quick-start --tier puppet
lash deploy --tier puppet

Note: You can run export LAMBDASHARP_TIER='puppet' in your terminal or add it to your bash profile in order to skip adding --tier puppet to every lash command.

Validation (Level 1: Setup)

Upload steps.json into the Scrape Bucket. The Master Lambda function is configured to be triggered anytime a JSON file is uploaded into the bucket, and it will then trigger the appropriate Scraper Lambda function. Check the CloudWatch logs for each respective Lambda function to validate that it was called appropriately. Any console messages (LogInfo) will appear in CloudWatch.
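The actual schema is defined by the steps.json in the repository; purely as an illustration (every field name here is invented), a steps file might look like:

```json
[
  {
    "Site": "Example",
    "Url": "https://example.com",
    "Actions": [
      { "Type": "Click", "Selector": "#submit" }
    ]
  }
]
```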

Level 2: Screenshot

Developing in the cloud using headless Chrome sucks... to ease the pain, let's implement screenshots!

How it works
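A minimal sketch of the screenshot step, using Puppeteer's standard page.screenshot API (the helper name and options are illustrative assumptions):

```javascript
// Capture the current page as PNG bytes. With no `path` option,
// page.screenshot resolves to a Buffer, which is a suitable Body
// for an S3 PutObject call into the Scrape bucket.
async function captureScreenshot(page) {
  return page.screenshot({ type: 'png', fullPage: true });
}
```

The equivalent call exists in Puppeteer-Sharp as well, so the same idea applies on the C# side.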

Validation (Level 2: Screenshot)

Trigger your Lambda function, then check the Scrape S3 bucket for a screenshot of the page you navigated to.

Level 3: Scrape

Puppeteer provides some basic actions, such as filling in an input field or clicking a button. Conduct some actions on the site of your choosing!

How it works
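The basic actions can be sketched with Puppeteer's standard page API (page.type, page.click); the selectors and search text below are made up for illustration:

```javascript
// Fill an input field and click a button, as Level 3 asks.
async function performActions(page) {
  await page.type('#search-input', 'lambda sharp'); // type into an input field
  await page.click('#search-button');               // click a button
}
```

Puppeteer-Sharp mirrors these calls on the C# side, so the same sequence of actions carries over to the Lambda function.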

Validation (Level 3: Scrape)

Verify your screenshot after you conduct your actions to ensure the clicks and typing went through.

Boss: Scrape the Internet

Lambda allows you to kick off 1,000 copies of itself.. so why not? We will create a new Lambda function to serve the "Hub" role in a Hub-Node relationship.

How it works

  • (optional) Add another lambda function to scrape a new site.
  • (optional) Expand on your [steps.json](./steps.json).
  • Scrape something meaningful, such as images, songs, tweets, etc.

Validation (Boss: Scrape the Internet)

Get a cease and desist letter or get your IP address banned from the site of your choosing.

The winner of this challenge will be determined by the following formula:

[ (Usefulness rating) * (amount scraped) / 1 minute ] = score

** Usefulness rating is at the discretion of the judges.
