Exotic Amazon README

Exotic Amazon is a complete solution to crawl the entire site of amazon.com.

Scraping other e-commerce platforms works in much the same way. Only a few business logic modifications are needed, because the infrastructure already solves all the other problems.

Thanks to the Web data management infrastructure provided by PulsarRPA, the entire solution consists of no more than 3,500 lines of Kotlin code, plus fewer than 700 lines of X-SQL that extract more than 650 fields.

Data

Best Sellers - updated every day, about 32,000 categories, about 4,000,000 product records
Most Wished For - updated every day, about 25,000 categories, about 3,500,000 product records
New Releases - updated every day, about 25,000 categories, about 3,000,000 product records
Movers and Shakers - updated every hour, about 20 categories
Products - updated every month, about 20,000,000 products
- 100+ fields
- Title, price, stock, image, description, specification, shop, and more
- Sponsored products, similar products, related products, and more
- Read reviews
- Top reviews
Reviews - updated every day

Get Started

If your Maven version is 3.8.1 or above, add the following to .m2/settings.xml:

<mirrors>
    <mirror>
        <id>maven-default-http-blocker</id>
        <mirrorOf>dummy</mirrorOf>
        <name>Dummy mirror to override default blocking mirror that blocks http</name>
        <url>http://0.0.0.0/</url>
    </mirror>
</mirrors>

If this file doesn't exist, you can copy settings.xml to the .m2 directory. A basic introduction to .m2/settings.xml can be found in the Q & A section.
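
To confirm whether the rule above applies to your machine, a quick check along the following lines may help. This is a minimal sketch and not part of the project: the helper names `version_ge` and `maven_version` are made up for illustration, and it assumes a `sort` that supports `-V` (version sort) and that `mvn -v` prints a line of the form `Apache Maven 3.8.6 (...)`.

```shell
#!/bin/sh
# Succeeds when dotted version $1 is greater than or equal to $2.
# Assumes `sort -V` (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the version reported by `mvn -v`, or nothing if mvn is not on the PATH.
maven_version() {
  command -v mvn >/dev/null 2>&1 || return 0
  mvn -v 2>/dev/null | awk '/Apache Maven/ { print $3; exit }'
}

current="$(maven_version)"
if [ -z "$current" ]; then
  echo "mvn not found on the PATH"
elif version_ge "$current" 3.8.1; then
  echo "Maven $current: the mirror entry shown above is needed"
  [ -f "$HOME/.m2/settings.xml" ] || echo "$HOME/.m2/settings.xml does not exist yet"
else
  echo "Maven $current: no change to settings.xml is needed"
fi
```

The comparison relies on version sort rather than plain string comparison, so that a version such as 3.10.0 is correctly treated as newer than 3.8.1.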

For developers in China, we strongly suggest following this instruction to speed up the build.

Now we can start building:

# build
git clone https://github.com/platonai/exotic-amazon.git
cd exotic-amazon && mvn -DskipTests=true

####################
# On Linux:

# run in development mode
java -jar target/exotic-amazon*.jar
# run in production mode
ENV=prod java -jar target/exotic-amazon*.jar

####################
# On Windows:

# run in development mode
java -jar target/exotic-amazon-{the-actual-version}.jar
# run in production mode
set ENV=prod
java -jar target/exotic-amazon-{the-actual-version}.jar

Open System Glances to see the system status at a glance.

Results

Extract rules

All extract rules are written in X-SQL. Data type conversion and data cleansing are also handled inline by X-SQL, which is part of the reason why we need it.

A good X-SQL example is x-asin.sql, which extracts 70+ fields from each product page.

Save extract results in the local file system

By default, the results are written as JSON to the local file system:

Linux:

cd /tmp/pulsar-$USER/cache/web/export/amazon/json
ls

Windows:

echo %TMP%
echo %username%
cd %TMP%\pulsar-%username%\cache\web\export\amazon\json
dir

Mac:

echo $TMPDIR
echo $USER
cd "$TMPDIR/pulsar-$USER/cache/web/export/amazon/json"
ls
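
The Linux and Mac paths above differ only in the temp directory, so the lookup can be wrapped in one small script. The following is a sketch under stated assumptions: `default_tmp_base` and `export_dir` are hypothetical helper names, and the script simply counts regular files in the directory without assuming any particular file naming.

```shell
#!/bin/sh
# Prints the default temp base: $TMPDIR on macOS, /tmp elsewhere
# (mirroring the Linux and Mac paths listed above).
default_tmp_base() {
  case "$(uname -s)" in
    Darwin) printf '%s\n' "${TMPDIR:-/tmp}" ;;
    *)      printf '%s\n' "/tmp" ;;
  esac
}

# Prints the export directory for temp base $1 and user $2.
# Both arguments are optional; a trailing slash on the base is dropped.
export_dir() {
  base="${1:-$(default_tmp_base)}"
  user="${2:-${USER:-$(id -un)}}"
  printf '%s/pulsar-%s/cache/web/export/amazon/json\n' "${base%/}" "$user"
}

dir="$(export_dir)"
if [ -d "$dir" ]; then
  # Count regular files without assuming a particular file extension.
  echo "$dir: $(find "$dir" -type f | wc -l | tr -d ' ') file(s)"
else
  echo "$dir does not exist yet; has the crawler exported anything?"
fi
```

Passing the base and user explicitly, for example `export_dir /tmp alice`, makes it possible to inspect another user's export directory.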

Save extract results into a database

There are several methods to persist the results into a database:

Serialize the results as key-value pairs and save them as a field of WebPage, which is the core data structure across the whole system
Write the results to a JDBC-compatible database, such as MySQL, PostgreSQL, MS SQL Server, Oracle, etc.
Save the results to any destination you wish by writing a few lines of additional code yourself

Save as WebPage.pageModel

By default, the extracted fields are also saved as key-value pairs to WebPage.pageModel.

Save to a JDBC compatible database

Set AmazonJdbcSinkSQLExtractor.jdbcCommitter correctly.
Database schema: schema
Page model and database schema mapping: extract-config.json
Page model and extract rules: X-SQLs

Save to a custom destination

You can write a few lines of additional code to save the results to any destination you wish. Check AmazonJdbcSinkSQLExtractor.onAfterExtract() to learn how to write your own persistence layer.

Technical Features

X-SQL: extended SQL to manage web data: Web crawling, scraping, Web content mining, Web BI
Bot stealth: web driver stealth, IP rotation, privacy context rotation, never get banned
High performance: highly optimized, rendering hundreds of pages in parallel on a single machine without being blocked
Low cost: scraping 100,000 browser-rendered e-commerce web pages, or n * 10,000,000 data points, each day requires only an 8-core CPU and 32G of memory
Data quantity assurance: smart retry, accurate scheduling, web data lifecycle management
Large scale: fully distributed, designed for large-scale crawling
Big data: supports various backend storage: Local File/MongoDB/HBase/Gora
Logs & metrics: the system is monitored closely and every event is recorded

Requirements

Memory: 4G minimum, 8G recommended for test environments, 32G recommended for production environments
The latest version of the Java 11 JDK
Java and jar on the PATH
Google Chrome 90+
MongoDB started
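
Part of this list can be verified from a terminal before the first run. The sketch below covers only the Java items (`java` and `jar` on the PATH, and the Java major version); checking Chrome and MongoDB is left out because their locations vary by platform. The helper name `java_major` is hypothetical.

```shell
#!/bin/sh
# Prints the major version for a Java version string:
# "11.0.20" -> 11, "1.8.0_292" -> 8, "17" -> 17.
java_major() {
  v="$1"
  case "$v" in
    1.*) v="${v#1.}" ;;   # legacy scheme used up to Java 8
  esac
  printf '%s\n' "${v%%[._-]*}"
}

# Report each required tool that is missing from the PATH.
for tool in java jar; do
  command -v "$tool" >/dev/null 2>&1 || echo "missing on PATH: $tool"
done

if command -v java >/dev/null 2>&1; then
  # `java -version` writes to stderr; the version is the first quoted token.
  raw="$(java -version 2>&1 | awk -F '"' '/version/ { print $2; exit }')"
  echo "Java major version: $(java_major "$raw") (this project expects 11)"
fi
```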

Logs & Metrics

PulsarRPA has carefully designed the logging and metrics subsystem to record every event that occurs in the system.

PulsarRPA logs the status of every load execution, so it is easy to see what is happening in the system and to answer questions such as: is the system running healthily, how many pages were fetched successfully, how many pages were retried, and how many proxy IPs were used.

Typical page loading logs look like the following. Check log-format to learn how to read them and grasp the state of the whole system at a glance.

2022-09-24 11:46:26.045  INFO [-worker-14] a.p.p.c.c.L.Task - 3313. 💯 ⚡ U for N got 200 580.92 KiB in 1m14.277s, fc:1 | 75/284/96/277/6554 | 106.32.12.75 | 3xBpaR2 | https://www.walmart.com/ip/Restored-iPhone-7-32GB-Black-T-Mobile-Refurbished/329207863 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:46:09.190  INFO [-worker-32] a.p.p.c.c.L.Task - 3738. 💯 💿 U  got 200 452.91 KiB in 55.286s, last fetched 9h32m50s ago, fc:1 | 49/171/82/238/6172 | 121.205.220.179 | https://www.walmart.com/ip/Boost-Mobile-Apple-iPhone-SE-2-Cell-Phone-Black-64GB-Prepaid-Smartphone/490934488 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:46:28.567  INFO [-worker-17] a.p.p.c.c.L.Task - 2269. 💯 🔃 U for SC got 200 565.07 KiB <- 543.41 KiB in 1m22.767s, last fetched 16m58s ago, fc:6 | 58/230/98/295/6272 | 27.158.125.76 | 9uwu602 | https://www.walmart.com/ip/Straight-Talk-Apple-iPhone-11-64GB-Purple-Prepaid-Smartphone/356345388?variantFieldId=actual_color -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:47:18.390  INFO [r-worker-8] a.p.p.c.c.L.Task - 3732. 💔 ⚡ U for N got 1601 0 <- 0 in 32.201s, fc:1/1 Retry(1601) rsp: CRAWL, rrs: EMPTY_0B | 2zYxg52 | https://www.walmart.com/ip/Apple-iPhone-7-256GB-Jet-Black-AT-T-Locked-Smartphone-Grade-B-Used/182353175?variantFieldId=actual_color -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
2022-09-24 11:47:13.860  INFO [-worker-60] a.p.p.c.c.L.Task - 2828. 🗙 🗙 U for SC got 200 0 <- 348.31 KiB <- 684.75 KiB in 0s, last fetched 18m55s ago, fc:2 | 34/130/52/181/5747 | 60.184.124.232 | 11zTa0r2 | https://www.walmart.com/ip/Walmart-Family-Mobile-Apple-iPhone-11-64GB-Black-Prepaid-Smartphone/209201965?athbdg=L1200 -expires PT24H -ignoreFailure -itemExpires PT1M -outLinkSelector a[href~=/ip/] -parse -requireSize 300000
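
For a rough summary of such lines, the status code that follows the word `got` can be tallied with standard tools. This is only a sketch: `status_histogram` is a hypothetical helper, the two input lines are abridged from the samples above, and the meaning of each code and symbol is defined by log-format, not by this script.

```shell
#!/bin/sh
# Reads log lines on stdin and prints "<status code> <count>" pairs,
# based on the "got <code>" token that appears in each Task line.
status_histogram() {
  awk '
    {
      for (i = 1; i < NF; i++) {
        if ($i == "got" && $(i + 1) ~ /^[0-9]+$/) {
          count[$(i + 1)]++
          break
        }
      }
    }
    END {
      for (code in count) printf "%s %d\n", code, count[code]
    }
  ' | sort -n
}

# Demo on two abridged sample lines; prints "200 1" and then "1601 1".
printf '%s\n' \
  'Task - 3313. U for N got 200 580.92 KiB in 1m14.277s' \
  'Task - 3732. U for N got 1601 0 <- 0 in 32.201s' |
  status_histogram
```

To summarize a real log, redirect the file that contains the Task lines into the function, for example `status_histogram < your-log-file`.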

There are three ways to view metrics:

Check logs/pulsar.m.log
Open System Glances which is a Web UI to show the most important metrics
Install graphite on the same machine, and open http://127.0.0.1/ to view the graphical report

Q & A

Q: How to use proxies?

A: Follow this guide for proxy rotation.

Q: What is `.m2/settings.xml`?

A: It is a configuration file for Maven. It contains settings such as the local repository location, remote repository servers, and authentication information. It generally exists in two locations:

Global settings:

${maven.home}/conf/settings.xml

User settings:

${user.home}/.m2/settings.xml

If this file doesn't exist, you can copy settings.xml to the .m2 directory.

Q: Where is the code that first scrapes the detail page and then scrapes the comment pages linked from it?

A: See the logic in the following call chain:

AmazonJdbcSinkSQLExtractor.collectHyperlinks ->
 amazonLinkCollector.collectReviewLinksFromProductPage,
 amazonLinkCollector.collectSecondaryReviewLinks,
 amazonLinkCollector.collectSecondaryReviewLinksFromPagination

Q: How to set the start time, end time and period of tasks?

A:

Read LoadOptions which describes what to do with a task
Refer to PredefinedTask, which defines Amazon specific tasks. The settings of PredefinedTask will eventually be converted to LoadOptions
Scheduled tasks are defined in CrawlScheduler

Q: How to store the scraping results?

A: Refer to the section "Save extract results into a database".

Contact the author

WeChat: galaxyeye
Weibo: galaxyeye
Email: galaxyeye@live.cn
Twitter: galaxyeye8
