A crawler that collects reviews and product information from Amazon.com and saves them to SQLite databases.
To get all reviews for a product, first find the product's Amazon Standard Identification Number (ASIN). It is the 10-character alphanumeric ID that follows /product/ in the URL. Then add the following code to the main function:
Item samsungTab3 = new Item("B00D02AGU4");
samsungTab3.fetchReview();
samsungTab3.writeReviewsToDatabase("reviewtest.db", false);
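If you have many product URLs, the ASIN lookup described above can be automated. The following is a minimal sketch, not part of the crawler: the class `AsinExtractor`, its method `extractAsin`, and the example URL are hypothetical, and it assumes the ASIN is always the 10 alphanumeric characters directly after /product/.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the crawler): pulls the ASIN out of a product URL.
class AsinExtractor {
    // Assumption: the ASIN is exactly 10 alphanumeric characters right after "/product/".
    private static final Pattern ASIN_IN_URL =
            Pattern.compile("/product/([A-Za-z0-9]{10})(?![A-Za-z0-9])");

    // Returns the ASIN if the URL contains one in the expected position, otherwise empty.
    public static Optional<String> extractAsin(String url) {
        Matcher m = ASIN_IN_URL.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Illustrative URL only; real product URLs may carry extra path segments or parameters.
        String url = "https://www.amazon.com/gp/product/B00D02AGU4";
        System.out.println(extractAsin(url).orElse("no ASIN found"));
    }
}
```

The extracted string can then be passed to `new Item(...)` as in the snippet above.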
Data in the created SQLite database can be managed or exported using tools such as SQLiteStudio or RSQLite.
To get information about reviewers, add the following code to the main function:
GetReviewerInfo reviewer_crawler = new GetReviewerInfo("reviewer_ids.txt","reviewer_test.db");
reviewer_crawler.crawl();
where reviewer_ids.txt is a file that lists all reviewer IDs, and reviewer_test.db is a SQLite database.
To get all reviews from a product category, find the node ID for the category in the Amazon.com URL (&node=). The node ID is the first argument of the GetASINbyNode() constructor. Then estimate roughly how many products the category contains and divide that number by 5; the result is the third argument of GetASINbyNode(). For example:
GetASINbyNode getIDs = new GetASINbyNode("541966%2C1232597011", 1, 10);
getIDs.getIDList();
getIDs.writeIDsToCSV("idlist.txt");
ItemList thelist = new ItemList("idlist.txt");
thelist.writeReviewsToDatabase("reviews.db", false);
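The two inputs described above (the node ID from the URL and the product estimate divided by 5) can be computed with a small helper. This is only a sketch: the class `NodeArgs`, its methods, and the example URL are hypothetical, and rounding the division up is an assumption, since the text only says to divide by 5. The node ID is returned exactly as it appears in the URL, including any percent-encoding such as %2C.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the crawler) for preparing GetASINbyNode() arguments.
class NodeArgs {
    // Matches the value of the "node" query parameter, up to the next '&' or '#'.
    private static final Pattern NODE_PARAM = Pattern.compile("[?&]node=([^&#]+)");

    // Returns the raw (still percent-encoded) node ID, or empty if the URL has none.
    public static Optional<String> extractNodeId(String url) {
        Matcher m = NODE_PARAM.matcher(url);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Estimated product count divided by 5. Rounding up is an assumption made here
    // so that a partial group of products is not skipped.
    public static int thirdArgument(int estimatedProducts) {
        if (estimatedProducts < 0) {
            throw new IllegalArgumentException("estimatedProducts must not be negative");
        }
        return (estimatedProducts + 4) / 5;
    }

    public static void main(String[] args) {
        // Illustrative URL only; the real category URL will contain other parameters.
        String url = "https://www.amazon.com/s?rh=x&node=541966%2C1232597011&page=1";
        System.out.println(extractNodeId(url).orElse("no node ID found"));
        System.out.println(thirdArgument(50));
    }
}
```

With an estimate of about 50 products, `thirdArgument(50)` gives 10, which matches the third argument in the example above.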
To get pricing information, you need to register as an Amazon Associate (https://affiliate-program.amazon.com). Then add your Associate tag in the signInput() function in Item.java:
variablemap.put("AssociateTag", "your_tag_here");
You also need your Product Advertising API Key and Secret Key. Add them to SignedRequestsHelper.java:
private String awsAccessKeyId = "your_api_key";
private String awsSecretKey = "your_secret_key";
Once the tag and both keys are set, change the second argument of writeReviewsToDatabase() to true; pricing information will then be saved in the same database in XML format.
To test your keys, try:
Item testItem = new Item("B00D02AGU4");
System.out.println(testItem.getXMLLargeResponse());
- java.io.IOException means that the item no longer exists on Amazon.com. You do not have to do anything with that item.
- java.net.SocketTimeoutException means that the connection to the website is taking too long. Rerun the crawler on the items with this exception.
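Rerunning after a timeout can be automated with a small retry wrapper. The sketch below is not part of the crawler: the class `TimeoutRetry`, the method `callWithRetry`, and its parameters are hypothetical. It retries only on java.net.SocketTimeoutException and lets every other exception propagate. Because SocketTimeoutException is a subclass of java.io.IOException, the timeout has to be caught specifically, otherwise it would be treated like an item that no longer exists.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper (not part of the crawler): retries a task when the connection times out.
class TimeoutRetry {
    // Runs the task up to maxAttempts times, pausing pauseMillis between attempts.
    // Only SocketTimeoutException triggers a retry; any other exception is rethrown at once.
    public static <T> T callWithRetry(Callable<T> task, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis); // wait before the next attempt
                }
            }
        }
        throw last; // every attempt timed out
    }
}
```

A call could look like `TimeoutRetry.callWithRetry(() -> { item.fetchReview(); return null; }, 3, 5000L);`, assuming fetchReview() lets the timeout exception propagate to the caller rather than handling it internally.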
2015-11-14 updated to reflect changes on Amazon website.
2016-06-25 Amazon is no longer displaying total number of helpful votes.
The code is released into the public domain. If you find the code useful in your research work, I would appreciate it if you cite "Market Dynamics and User-Generated Content about Tablet Computers" by Xin (Shane) Wang, Feng Mai and Roger H.L. Chiang, Marketing Science 33.3 (2014): 449-458.