This is a web scraper built with Go and Colly that extracts product data from e-commerce websites. It accepts an Excel file containing the scraping schema and writes the results back to Excel.
go run main.go -e <excel_file>
The Excel file should contain the schema with one sheet per retailer. Each sheet has columns for category, URL, product page element, product attributes, etc. See the example file in the schemas directory.
The schema is a list of columns, each representing either a location field or an HTML element to be scraped. The header row holds the field or attribute names. Every schema must contain at least these four columns:
- Category
- URL
- Product Page
- Next Catalog
- main.go - Program entrypoint
- controller.go - Handles app logic and flow
- model.go - Data structures
- scrapper.go - Web scraping logic with Colly
- siteschema.go - Parses the schema from Excel
- desktop.go - GUI implementation
- cmdline.go - Command line implementation
- Accepts an Excel file as the input scraping schema
- Scrapes category pages to find products
- Follows links to scrape individual product pages
- Extracts attributes such as title, price, and images, as defined in the Excel mapping file
- Supports multiple retailers in one scraping run
- Respects robots.txt and noindex meta directives
- Outputs scraped data back to Excel
Run unit tests:
go test ./...
Pull requests are welcome! Please follow conventions in the existing code.
MIT