
[SPARK-4001][MLlib] adding parallel FP-Growth algorithm for frequent pattern mining in MLlib #2847

Closed
wants to merge 12 commits

Conversation

jackylk
Contributor

@jackylk jackylk commented Oct 19, 2014

Apriori is the classic algorithm for frequent itemset mining in a transactional data set. It would be useful to add the Apriori algorithm to MLlib in Spark, and this PR adds an implementation of it.
There is one point I am not sure is most efficient: in order to filter out the eligible frequent itemsets, I am currently using a cartesian operation on two RDDs to calculate the degree of support of each itemset. I am not sure whether it would be better to use a broadcast variable to achieve the same.

I will add an example of using this algorithm if required.
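The broadcast alternative asked about above can be sketched with plain Scala collections standing in for the RDDs (the transactions and candidate itemsets below are made-up toy data, not from the PR):

```scala
// Toy stand-in: instead of a cartesian product of (candidates x transactions),
// ship the small candidate list to every partition (sc.broadcast in Spark)
// and count support locally.
val transactions = Seq(Set("a", "b", "c"), Set("a", "b"), Set("b", "c"))
val candidates = Seq(Set("a", "b"), Set("b", "c")) // hypothetical candidate itemsets

// With Spark this body would run inside mapPartitions over the broadcast value.
val support = candidates.map { c =>
  (c, transactions.count(t => c.subsetOf(t)))
}
// each candidate here is contained in 2 of the 3 transactions
```

This avoids materializing every (candidate, transaction) pair, at the cost of requiring the candidate set to fit in memory on each worker.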

@AmplabJenkins

Can one of the admins verify this patch?


@mengxr
Contributor

mengxr commented Nov 7, 2014

Had an offline discussion with @jackylk . We plan to implement a more scalable version of Apriori, as described in PFP: Parallel FP-Growth for Query Recommendation (http://dl.acm.org/citation.cfm?id=1454027)

@varadharajan
Contributor

As mentioned in one of the comments on SPARK-2432, I was wondering how the PFP version compares with YAFIM (http://pasa-bigdata.nju.edu.cn/people/ronggu/pub/YAFIM_ParLearning.pdf). Probably I will do a bit more reading on this.

@jackylk jackylk changed the title [SPARK-4001][MLlib] adding apriori algorithm for frequent item set mining in Spark [SPARK-4001][MLlib] adding apriori algorithm for frequent item set mining in Spark (WIP) Nov 26, 2014
@jackylk jackylk changed the title [SPARK-4001][MLlib] adding apriori algorithm for frequent item set mining in Spark (WIP) [SPARK-4001][MLlib] adding apriori and fp-growth algorithm for frequent itemset mining in Spark (WIP) Nov 26, 2014
@denmoroz

denmoroz commented Dec 8, 2014

Maybe it is better to use RDD[BitSet] as the transactions RDD? Then you could add a preprocessor trait and apply any transformation of the source RDD into an RDD of BitSets, for example a transformation of RDD[Array[String]] into RDD[BitSet].
It seems to me that BitSet is a much better representation of transactions than Array[String], Array[Int], or anything else.

An even better idea would be to make a Transaction entity, which would contain its BitSet representation and all the necessary convenience methods. Then anyone could make a preprocessor from RDD[...Any Type...] to RDD[Transaction].
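The suggested preprocessing can be sketched as follows (plain Scala Seq in place of an RDD; the item-to-bit dictionary is an assumption of this sketch, since a BitSet needs integer indices):

```scala
import scala.collection.immutable.BitSet

// Map each distinct item to a bit index, then encode every transaction
// as a BitSet; with Spark the same map would run over RDD[Array[String]].
val transactions = Seq(Array("a", "b"), Array("b", "c"))
val itemIndex: Map[String, Int] =
  transactions.flatten.distinct.zipWithIndex.toMap // a -> 0, b -> 1, c -> 2
val asBitSets: Seq[BitSet] =
  transactions.map(t => BitSet(t.map(itemIndex): _*))
// support counting (set intersection) now becomes a cheap bitwise AND
```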

@erikerlandson
Contributor

As long as itemset mining is under consideration, has anybody tried a Spark implementation of "Logical Itemset Mining":
http://cvit.iiit.ac.in/papers/Chandrashekar2012Logical.pdf

@denmoroz

denmoroz commented Dec 8, 2014

Do you use the SON algorithm for the parallel Apriori implementation?
(http://importantfish.com/limited-pass-algorithms/)

@mengxr
Contributor

mengxr commented Jan 15, 2015

Had an offline discussion with @jackylk and here is the summary:

  1. Keep only the parallel FP-Growth implementation, because it is generally more efficient than Apriori, especially on medium/large datasets. @jackylk can share some performance testing results.
  2. Rename the package "fim" (frequent itemset mining) to "fpm" (frequent pattern mining). There is no standard acronym for this family of mining algorithms, and frequent pattern mining is a broader term than frequent itemset mining. This package name is also used in Mahout.
  3. Include links to the original FP-Growth paper and the PFP paper in the doc.
  4. Have FPGrowth take minSupport as a parameter and implement run(RDD...): FPGrowthModel, where FPGrowthModel holds an RDD of frequent itemsets and counts.
  5. Hide methods used internally.
  6. Update code style: a) remove extra empty lines; b) fix indentation; c) change variable names; d) line width; etc.
  7. Check whether we can use generic type for items (for Java API).
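Item 4 can be mocked up as follows. This is a plain-Scala sketch with Seq standing in for RDD; the names and the single-item-only "mining" are placeholder assumptions of this sketch, not the final API:

```scala
// Hypothetical mock of the proposed API shape: minSupport as a parameter,
// run(...) returning a model that holds the frequent itemsets and counts.
class FPGrowthModel(val freqItemsets: Seq[(Array[String], Long)])

class FPGrowth(val minSupport: Double) {
  def run(data: Seq[Array[String]]): FPGrowthModel = {
    val minCount = math.ceil(minSupport * data.size).toLong
    // Placeholder mining: single-item frequent sets only, to show the shape.
    val freq = data.flatten.groupBy(identity).collect {
      case (item, occurrences) if occurrences.size >= minCount =>
        (Array(item), occurrences.size.toLong)
    }.toSeq
    new FPGrowthModel(freq)
  }
}

val model = new FPGrowth(0.5).run(Seq(Array("a", "b"), Array("a"), Array("b"), Array("a")))
// "a" appears 3 times and "b" twice; both meet minCount = 2
```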

@mengxr
Contributor

mengxr commented Jan 15, 2015

add to whitelist

@SparkQA

SparkQA commented Jan 15, 2015

Test build #25596 has started for PR 2847 at commit 7b77ad7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 15, 2015

Test build #25596 has finished for PR 2847 at commit 7b77ad7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25596/
Test FAILed.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25742 has started for PR 2847 at commit eb3e4ca.

  • This patch merges cleanly.

@jackylk
Contributor Author

jackylk commented Jan 19, 2015

Yes, I have tested the parallel FP-Growth algorithm using an open data set from http://fimi.ua.ac.be/data/; the performance test results can be found at https://issues.apache.org/jira/browse/SPARK-4001

All modifications are done except for the 7th (generic types); please review the code for now.
I am still considering whether it is worthwhile to implement generic types, since they add more complexity to the code.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25742 has finished for PR 2847 at commit eb3e4ca.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25742/
Test FAILed.

@jackylk jackylk changed the title [SPARK-4001][MLlib] adding apriori and fp-growth algorithm for frequent itemset mining in Spark (WIP) [SPARK-4001][MLlib] adding parallel FP-Growth algorithm for frequent pattern mining in MLlib Jan 19, 2015
@jackylk
Contributor Author

jackylk commented Jan 19, 2015

Please test again

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25752 has started for PR 2847 at commit d110ab2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 19, 2015

Test build #25752 has finished for PR 2847 at commit d110ab2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

// Sort it and create the item combinations
val sortedItems = items.sortWith(_._1 > _._1).sortWith(_._2 > _._2).toArray
Contributor

Why sorting twice? The second will overwrite the first. Besides, using sortBy(-_._2) would be better.
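The suggested fix can be illustrated in isolation (toy data; sortBy is stable, and a single descending sort on the count replaces the two chained sortWith calls):

```scala
val items = Seq(("b", 5), ("c", 3), ("a", 6))
// The second of two chained sortWith calls overwrites the first ordering;
// a single sortBy on the negated count expresses the intent directly.
val sortedItems = items.sortBy(-_._2).toArray
// descending by count: ("a", 6), ("b", 5), ("c", 3)
```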

@mengxr
Contributor

mengxr commented Jan 21, 2015

@jackylk I made a brief scan of the implementation. Besides inline comments, I have some high-level suggestions:

  1. It would be good if we could make the code easy to follow for someone with the paper in hand. I don't think the notation in the PFP paper is good, but when we create a variable, we should put a comment mentioning the corresponding variable name in the paper.
  2. Maybe because of 1), I cannot find an exact match between the implementation and the algorithm described in the paper. It seems that you implemented Figure 1 of the paper, but that is not the PFP algorithm. Could you double-check?
  3. The tree-building code needs some unit tests, so it is easy to convince reviewers that the implementation is correct. And if using trees can compress the data, we should use aggregateByKey instead of groupByKey.

@zhangyouhua2014

@mengxr I am working with Jacky to develop and test this algorithm, and I can answer this question:
We follow the PFP paper, but we cut out the tree-building process; the time saved can be used for other work. The specific steps are as follows:
1. The transaction database DB is distributed across multiple worker nodes; after two scans of the database we obtain the conditional pattern sequences.
   1.1. The first scan of DB produces the frequent 1-itemsets L1. For example: (a, 6), (b, 5), (c, 3).
   1.2. Using the L1 obtained in 1.1, DB is scanned again to filter out the non-frequent items, yielding the conditional pattern sequences (conditionSEQ). For example: (c, (a, b)), (b, (a)).
   After these two scans of DB, conditionSEQ carries much less information than DB.
2. A reduce operation using the groupByKey operator brings the conditionSEQ entries with the same key from every worker onto the same machine and merges them. The subsequent frequent itemset mining is based on these merged conditionSEQ sets.
3. On each worker, frequent itemsets are mined from the conditionSEQ sets using the Apriori principle.
4. Finally, a collect operator aggregates the results.
The algorithm spreads DB across multiple worker nodes and only needs to scan it twice to obtain the small conditional pattern sequences, and the frequent itemset mining over conditionSEQ needs only one reduce, so network interaction is small and the algorithm is fast.
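The two scans described in steps 1.1 and 1.2 can be sketched with plain Scala collections (a toy DB and a made-up minSupport; in the PR these would be RDD operations, and the frequency-descending reordering of items within a transaction is omitted here):

```scala
// Toy transaction database and support threshold (made-up values).
val db = Seq(Seq("a", "b", "c"), Seq("a", "b"), Seq("a", "d"))
val minSupport = 2

// Scan 1: frequent single items L1 with their counts.
val l1: Map[String, Int] =
  db.flatten.groupBy(identity).map { case (i, occ) => (i, occ.size) }
    .filter { case (_, n) => n >= minSupport }
// l1: a -> 3, b -> 2 ("c" and "d" are infrequent)

// Scan 2: drop infrequent items, then emit one conditional pattern
// sequence per suffix item, e.g. the transaction (a, b) yields (b, (a)).
val conditionSeq = db
  .map(_.filter(l1.contains))
  .flatMap(t => t.indices.drop(1).map(i => (t(i), t.take(i))))
```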

@mengxr
Contributor

mengxr commented Jan 26, 2015

@zhangyouhua2014

We follow the PFP paper, but we cut out the tree-building process; the time saved can be used for other work.

By "reduce", did you mean skipping the process of growing trees? The FP-Growth algorithm reduces the memory requirement using the tree representation of candidate sets. If we skip this step, it is hard to call it FP-Growth. Did you do any performance comparison between your version and the PFP implementation?

2. A reduce operation using the groupByKey operator brings the conditionSEQ entries with the same key from every worker onto the same machine and merges them. The subsequent frequent itemset mining is based on these merged conditionSEQ sets.

It is important to grow the tree on the mapper side to save communication cost. groupByKey doesn't do that. I was suggesting using aggregateByKey. For each key, we start with an empty tree, with seqOp growing the tree and combOp merging two trees. Besides, the partition key is the hash value of the last item of the sequence. We should be able to reduce communication cost (see my inline comments at L135).
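The seqOp/combOp shape described above can be sketched with a plain Map standing in for the per-group FP-tree (hypothetical toy data; in Spark this is rdd.aggregateByKey(zeroValue)(seqOp, combOp)):

```scala
// A Map[List[String], Long] stands in for the per-group FP-tree; seqOp adds
// one transaction on the mapper side, combOp merges two partial "trees".
type Tree = Map[List[String], Long]
val zero: Tree = Map.empty

def seqOp(tree: Tree, transaction: List[String]): Tree =
  tree.updated(transaction, tree.getOrElse(transaction, 0L) + 1L)

def combOp(a: Tree, b: Tree): Tree =
  b.foldLeft(a) { case (acc, (path, n)) =>
    acc.updated(path, acc.getOrElse(path, 0L) + n)
  }

// Two "partitions" holding transactions for the same key.
val partition1 = Seq(List("a", "b"), List("a"))
val partition2 = Seq(List("a", "b"))
val merged = combOp(partition1.foldLeft(zero)(seqOp),
                    partition2.foldLeft(zero)(seqOp))
// merged: List(a, b) -> 2, List(a) -> 1
```

Because each partition pre-aggregates with seqOp before shuffling, only the compact partial trees cross the network, which is the communication saving over groupByKey.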

@zhangyouhua2014

@mengxr
1. I mean that step 1 (which is equivalent to creating the FP-tree and the conditional FP-trees) already reduces the data size and creates conditional FP-trees that contain only frequent items, not the transaction data, so mining frequent itemsets from them involves a small candidate set.
2. I have tested it and compared it with Mahout's PFP; the performance is good, about 10 times faster.
3. After using groupByKey, frequent itemsets are mined on each node holding the specified key, so there is no network communication overhead.
4. Is there an aggregateByKey operator in the new Spark version?

@mengxr
Contributor

mengxr commented Jan 26, 2015

1. I mean that step 1 (which is equivalent to creating the FP-tree and the conditional FP-trees) already reduces the data size and creates conditional FP-trees that contain only frequent items, not the transaction data, so mining frequent itemsets from them involves a small candidate set.

The advantage of FP-Growth over Apriori is the tree structure used to represent the candidate sets. Both algorithms take advantage of the fact that the candidate set is small. I'm asking whether the current implementation uses the tree structure to save communication.

2. I have tested it and compared it with Mahout's PFP; the performance is good, about 10 times faster.

I'm not surprised by the 10x speed-up, but that by itself does not mean the current implementation is correct and high-performance. I believe we can be much faster.

3. After using groupByKey, frequent itemsets are mined on each node holding the specified key, so there is no network communication overhead.

groupByKey collects everything to the reducers; aggregateByKey does part of the aggregation on the mappers. There is definitely room for improvement.

4. Is there an aggregateByKey operator in the new Spark version?

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions

@mengxr
Contributor

mengxr commented Jan 28, 2015

Had an offline discussion with @jackylk and @zhangyouhua2014 . We plan to add a utility class named FPTree with the following (the exact names are TBD):

def add(transaction: Array[String]): this.type

def merge(tree: FPTree): this.type

def extract(threshold: Int, validateSuffix: String => Boolean): Iterator[Array[String]]

and then use aggregateByKey to grow the trees in parallel.
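A minimal plain-Scala sketch of such a utility follows. The names match the comment above, but all signatures are simplified assumptions (the exact names were TBD), and extract here only enumerates frequent prefix paths rather than performing the full conditional-tree mining of FP-Growth:

```scala
import scala.collection.mutable

// A tree node shared by all FPTree instances so that merge can
// traverse another tree's nodes.
class Node(var count: Long = 0L,
           val children: mutable.Map[String, Node] = mutable.Map.empty)

class FPTree extends Serializable {
  val root = new Node()

  /** Add one transaction (with a repeat count) along a root-to-leaf path. */
  def add(transaction: Seq[String], count: Long = 1L): this.type = {
    var curr = root
    transaction.foreach { item =>
      val child = curr.children.getOrElseUpdate(item, new Node())
      child.count += count
      curr = child
    }
    this
  }

  /** Merge another tree into this one node by node. */
  def merge(other: FPTree): this.type = {
    def mergeInto(dst: Node, src: Node): Unit =
      src.children.foreach { case (item, srcChild) =>
        val dstChild = dst.children.getOrElseUpdate(item, new Node())
        dstChild.count += srcChild.count
        mergeInto(dstChild, srcChild)
      }
    mergeInto(root, other.root)
    this
  }

  /** Enumerate prefix paths whose count meets the threshold
    * (a simplification of the real FP-Growth extraction). */
  def extract(threshold: Long): Iterator[(List[String], Long)] = {
    def walk(node: Node, prefix: List[String]): Iterator[(List[String], Long)] =
      node.children.iterator.flatMap { case (item, child) =>
        if (child.count >= threshold)
          Iterator((prefix :+ item, child.count)) ++ walk(child, prefix :+ item)
        else Iterator.empty
      }
    walk(root, Nil)
  }
}

val tree = new FPTree().add(Seq("a", "b")).add(Seq("a", "c")).add(Seq("a", "b"))
tree.merge(new FPTree().add(Seq("a", "b")))
// path counts after the merge: a -> 4, a/b -> 3, a/c -> 1
```

With this shape, seqOp for aggregateByKey is `(tree, transaction) => tree.add(transaction)` and combOp is `(a, b) => a.merge(b)`.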

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26406 has started for PR 2847 at commit 93f3280.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26406 has finished for PR 2847 at commit 93f3280.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FPGrowthModel (val frequentPattern: Array[(Array[String], Long)]) extends Serializable
    • class FPTree extends Serializable
    • class FPTreeNode(val item: String, var count: Int) extends Serializable

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26406/
Test FAILed.

@jackylk
Contributor Author

jackylk commented Jan 30, 2015

@mengxr
I have made the modifications according to the comments; please review.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26407 has started for PR 2847 at commit ec21f7d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 30, 2015

Test build #26407 has finished for PR 2847 at commit ec21f7d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FPGrowthModel (val frequentPattern: Array[(Array[String], Long)]) extends Serializable
    • class FPTree extends Serializable
    • class FPTreeNode(val item: String, var count: Int) extends Serializable

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26407/
Test PASSed.

@mengxr
Contributor

mengxr commented Jan 30, 2015

@jackylk Thanks for the update! Did you see any performance improvement on your dataset with aggregateByKey? I'm quite interested in how much shuffle we can save (hopefully) on a real dataset.

@jackylk
Contributor Author

jackylk commented Jan 31, 2015

I have not tested performance yet. I will test it over the weekend.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26486 has started for PR 2847 at commit bee3093.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26486 has finished for PR 2847 at commit bee3093.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FPGrowthModel(val freqItemsets: RDD[(Array[String], Long)]) extends Serializable
    • class Node[T](val parent: Node[T]) extends Serializable

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26486/
Test FAILed.

@mengxr
Contributor

mengxr commented Feb 2, 2015

LGTM. Merged into master. Thanks!! (The failed test is a known flaky test. All relevant tests passed.)

@asfgit asfgit closed this in 859f724 Feb 2, 2015