
[SPARK-8998][MLlib] Collect enough frequent prefixes before local processing in PrefixSpan (new) #7412

Closed · wants to merge 28 commits

Conversation

zhangjiajin (Contributor)

Collect enough frequent prefixes before projection in PrefixSpan

@zhangjiajin (Contributor, Author)

@mengxr This is the new PR; please review it. Thanks.

mengxr (Contributor) commented Jul 15, 2015

cc @feynmanliang

SparkQA commented Jul 15, 2015

Test build #37309 has finished for PR 7412 at commit 6560c69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
allPatterns
var patternsCount = lengthOnePatternsAndCounts.length
var allPatternAndCounts = sequences.sparkContext.parallelize(
```
Contributor:

No need to parallelize if you remove the collect on L88 (will still need the collect on L90 for now); see the sketch after this thread.

Contributor (Author):

OK
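
For context, the round trip being removed looks roughly like this. A minimal sketch only; the names and element types are assumptions for illustration, not the PR's actual signatures:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the PR's (pattern, count) data.
def withoutRoundTrip(patternAndCounts: RDD[(List[Int], Long)]): RDD[(List[Int], Long)] = {
  // Anti-pattern: collect() pulls every record to the driver, only for
  // parallelize() to ship the same records straight back to the executors.
  //   val local = patternAndCounts.collect()
  //   val roundTripped = sc.parallelize(local)

  // Keeping the RDD as-is avoids both the driver memory pressure and the
  // cost of re-shipping the data.
  patternAndCounts
}
```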

```scala
getPatternCountsAndPrefixSuffixPairs(minCount, largePrefixSuffixPairs)
patternsCount = nextPatternAndCounts.count()
largePrefixSuffixPairs.unpersist()
val splitedPrefixSuffixPairs = splitPrefixSuffixPairs(nextPrefixSuffixPairs)
```
Contributor:
nit: `splited` should be `splitted`

Contributor:

Actually, instead of `._1` and `._2` below, why not just destructure the assignment, like on L94, here as well? (See the sketch after this thread.)

Contributor (Author):

OK
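
The suggestion is Scala's tuple destructuring. A minimal sketch with hypothetical values and types, not the PR's actual variables:

```scala
// Hypothetical stand-in for the result of splitPrefixSuffixPairs; the
// element types are assumptions for illustration.
val split: (Seq[String], Seq[(String, Int)]) = (Seq("ab"), Seq(("ab", 3)))

// Positional field access, as in the code being reviewed:
val small = split._1
val large = split._2

// Destructuring assignment names both halves in a single line:
val (smallPairs, largePairs) = split
```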

@feynmanliang (Contributor)

Made some suggestions; see how performance changes after them. Unfortunately, scanning the dataset to ensure suffixes are bounded will introduce a performance hit. I still think it's worth it, though, since it's certainly better than just failing.

It may be worthwhile to test that these changes prevent executor failure due to overload. One way to do that would be to use a large enough dataset and set spark.akka.frameSize small enough that the first method fails but the latter method passes.
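
A minimal sketch of such a test setup, under the assumption that a local-mode context with a deliberately small frame size (the value is in MB) reproduces the failure:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Deliberately small Akka frame size so that oversized messages fail fast,
// exposing code paths that ship too much data at once.
val conf = new SparkConf()
  .setAppName("PrefixSpanFrameSizeTest")
  .setMaster("local[4]")
  .set("spark.akka.frameSize", "1")

val sc = new SparkContext(conf)
// ... run PrefixSpan on a large generated dataset here; the run should fail
// with the collect-everything approach but pass once projected databases
// are kept bounded.
```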

@zhangjiajin (Contributor, Author)

@feynmanliang Regarding splitPrefixSuffixPairs, I compared the two methods and found that your method's running time is longer than mine, and its result is not correct. I don't know why; please check it, thank you.

@zhangjiajin (Contributor, Author)

@feynmanliang You are right, it is worth preventing executor failure; I very much agree. I tested it according to your suggestion.
The results of the performance test are not stable, perhaps due to the environment. I am trying to find the cause and will post the results after solving this problem.

SparkQA commented Jul 27, 2015

Test build #38502 has finished for PR 7412 at commit 64271b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 29, 2015

Test build #38791 has finished for PR 7412 at commit ad23aa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
/**
 * The maximum number of items allowed in a projected database before local processing. If a
 * projected database exceeds this size, another iteration of distributed PrefixSpan is run.
 */
private val maxLocalProjDBSize: Long = 10000
```
Contributor:

Please leave a TODO to make it configurable, with a better default value; 10000 may be too small. (See the sketch after this thread.)

Contributor (Author):

OK
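
One way the TODO might be resolved, as a hedged sketch: the builder-style setter below is an assumption modeled on other MLlib estimators, not code from this PR.

```scala
class PrefixSpan private (
    private var minSupport: Double,
    private var maxPatternLength: Int,
    private var maxLocalProjDBSize: Long) {

  // TODO: pick a better default; 10000 may be too small.
  def this() = this(0.1, 10, 10000L)

  /** Sets the projected-database size threshold that triggers local processing. */
  def setMaxLocalProjDBSize(size: Long): this.type = {
    require(size > 0, "maxLocalProjDBSize must be positive.")
    this.maxLocalProjDBSize = size
    this
  }
}
```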

mengxr (Contributor) commented Jul 29, 2015

@zhangjiajin @feynmanliang I made some minor comments. We can address the default value of maxLocalProjDBSize in a follow-up PR. To generate the final RDD, I would recommend using sc.parallelize(localSmall, 1) ++ getPatternsInLocal instead of chaining multiple RDDs. Let's fix those and merge this PR, so we can unblock related PRs. Thanks!
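
A minimal sketch of the recommended union: the names `localSmall` and `getPatternsInLocal` follow the comment, but their types here are assumptions for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins: patterns already counted on the driver, and an RDD
// of patterns mined from the remaining large projected databases.
def finalPatterns(
    sc: SparkContext,
    localSmall: Array[(List[Int], Long)],
    getPatternsInLocal: RDD[(List[Int], Long)]): RDD[(List[Int], Long)] = {
  // One small single-partition RDD unioned with one distributed RDD, rather
  // than chaining many intermediate RDDs.
  sc.parallelize(localSmall, 1) ++ getPatternsInLocal
}
```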
