Being able to compare products online revolutionized the experience of shoppers around the world. However, due to ever-increasing scale of online marketplaces such as Amazon.com, Alibaba.com, and eBay, it has become impossible for consumers and businesses alike to evaluate features for every product in the desired search space.
In this project we find product feature summaries based only on natural text reviews. We use review data from Amazon.com from the excellent Amazon product reviews dataset provided by Julian McAuley. The work was done for the seminar Mining Massive Datasets at Hasso Plattner Institute in Potsdam, Germany.
The pipeline—as described below—is implemented as Apache Spark jobs in Java and Scala. It makes use of Spark's MLlib and GraphX toolkits as well as Twitters DIMSUM algorithm. It can be run locally and on Spark clusters.
We employ a three-step pipeline: Feature Extraction, Feature Clustering, and Modifier Weighting which are explained below.
A Feature in our sense is a word (-group) that stands for a product feature, such as display
, battery
, or color temperature
.
A Modifier is a word (-group) that describes a feature, such as better
, large
, or the coolest
.
We use filters to narrow down reviews to products of specific categories, brands, and price ranges. There are four filters currently implemented:
Filter Class | Input | Output |
---|---|---|
SampleFilter | List of Products, Sample Fraction (0..1), Seed | Sampled list of products |
BrandFilter | List of Products, Brand Name | Only Products of the given brand |
CategoryFilter | List of Products, Category Name | Only products of the given (sub-) category |
PriceFilter | List of Products, Minimum Price, Maximum Price | Only products of the given price range |
In the consecutive steps, only reviews with matching product ids will be evaluated.
We POS-tag the remaining reviews using the Stanford Core NLP library. Over each review, we slide a window of a fixed length to find ngrams which represent the features of the product and their modifiers. For this, we implemented templates - Finite State Machines that only accept an ngram if it matches the template. Having the knowledge about the structure of the template allows us to make assumptions on the position of the feature(s) and modifier(s) within the ngram. One of the templates we use is "[comparing adjective/adverb] + [noun]" with the former being the modifier and the latter being the feature.
This method clusters ngrams by their feature. If two ngrams contain the same feature (i.e. the same string), they are grouped together, so that we can have multiple modifiers for a feature.
DIMSUM (Dimensionality independent matrix similarity using map-reduce) is an algorithm proposed by Twitter in order to cut the number of comparisons between matrix columns with the use of sampling. We use the SPARK-implementation of this algorithm as we hoped that it would be faster as our Aggregation method. We compute the similarity between vectors of our Features given by the Word2Vec model previously computed. We build a coordinate-matrix out of these vectors, transpose it by swapping the cells indices and are then able to transform it into a rowmatrix. After computing the similarities of the columns, we build a graph (each Node is a Feature, each Edge a similarity over a given threshold) in order to find all connected components. Each connected component then contains words with similar vectors, which we interprete as words describing the same feature. In contrast to the original use-case of DIMSUM, our Matrix is dense, has many columns and few rows. As such the advantages of DIMSUM do not come into play, however, the quality of the algorithm is comparable to the aggregation.
After deduplicating the features, we build a linear model for each feature over the reviews where the rating of the review is the dependent variable and the Modifiers are the independent variables. For this task we use Spark's MLlib. For each Modifier we get a coefficient which indicates how much this Modifier influences the rating of the review. Finally we sort the Modifiers over all Models in order to see which combinations of Feature and Modifier are the most positive or most negative.
The following experiments were done using a cluster with these specifications:
- Master: Dell PowerEdge R310 (4x2.66 GHz, 8 GB RAM)
- 10 Slave Nodes: Dell OptiPlex 780 (2x2.6 GHz, 8 GB RAM)
- Shared HDFS
We use Amazon.com product reviews as provided in the Amazon product reviews. The program will only accept reviews and metadata in JSON format from the local file system or the Hadoop filesystem (HDFS).
mvn clean install
java -jar target/mmds-[version].jar [parameters, see below]
mvn clean install
mvn package
spark-submit --deploy-mode client --jars $(echo target/lib/*.jar | tr ' ' ',') target/mmds-[version].jar [parameters, see below]
Use spark-submit
parameters like --executor-memory 4G --driver-memory 8G --total-executor-cores 20
to scale.
The program accepts the following parameters:
.../mmds-[version].jar [ReviewFilePath] [NumberOfPartitions] [ClusteringAlgorithm] [MetadataFilePath]
Parameter | Expected Value | Example |
---|---|---|
ReviewFilePath | A valid file path on the local or Hadoop file system. The file must contain review JSON objects. | resources/samples/musical_instruments_top100.json |
NumberOfPartitions | Integer. Number of partitions used to parallelize Spark jobs. | 4 |
ClusteringAlgorithm | String. One of [TreeAggregate, DIMSUM, ExactMatch]. Default is ExactMatch . |
DIMSUM |
MetadataFile | A valid file path on the local or Hadoop file system. The file must contain metadata JSON objects. | resources/samples/musical_instruments_metadata_top100.json |
Maximilian Grundke, Axel Kroschk, Jaspar Mang, 2016.