
[SPARK-8998][MLlib] Collect enough frequent prefixes before local processing in PrefixSpan (new) #7412

Closed · wants to merge 28 commits

Conversation

zhangjiajin (Contributor)

Collect enough frequent prefixes before projection in PrefixSpan

@zhangjiajin (Contributor, Author)

@mengxr This is the new PR; please review it. Thanks.

mengxr (Contributor) commented Jul 15, 2015

cc @feynmanliang

SparkQA commented Jul 15, 2015

Test build #37309 has finished for PR 7412 at commit 6560c69.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
allPatterns
var patternsCount = lengthOnePatternsAndCounts.length
var allPatternAndCounts = sequences.sparkContext.parallelize(
```
Contributor:

No need to parallelize if you remove the collect on L88 (will still need the collect on L90 for now); see the sketch after this thread.

Contributor (Author):

OK
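
For context, the round trip being removed looks roughly like this. A minimal sketch only; the names and element types are assumptions for illustration, not the PR's actual signatures:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the PR's (pattern, count) data.
def withoutRoundTrip(patternAndCounts: RDD[(List[Int], Long)]): RDD[(List[Int], Long)] = {
  // Anti-pattern: collect() pulls every record to the driver, only for
  // parallelize() to ship the same records straight back to the executors.
  //   val local = patternAndCounts.collect()
  //   val roundTripped = sc.parallelize(local)

  // Keeping the RDD as-is avoids both the driver memory pressure and the
  // cost of re-shipping the data.
  patternAndCounts
}
```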

```scala
getPatternCountsAndPrefixSuffixPairs(minCount, largePrefixSuffixPairs)
patternsCount = nextPatternAndCounts.count()
largePrefixSuffixPairs.unpersist()
val splitedPrefixSuffixPairs = splitPrefixSuffixPairs(nextPrefixSuffixPairs)
```
Contributor:
nit: `splited` should be `splitted`

Contributor:

Actually, instead of `._1` and `._2` below, why not just destructure the assignment, like on L94, here as well? (See the sketch after this thread.)

Contributor (Author):

OK
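
The suggestion is Scala's tuple destructuring. A minimal sketch with hypothetical values and types, not the PR's actual variables:

```scala
// Hypothetical stand-in for the result of splitPrefixSuffixPairs; the
// element types are assumptions for illustration.
val split: (Seq[String], Seq[(String, Int)]) = (Seq("ab"), Seq(("ab", 3)))

// Positional field access, as in the code being reviewed:
val small = split._1
val large = split._2

// Destructuring assignment names both halves in a single line:
val (smallPairs, largePairs) = split
```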

@feynmanliang (Contributor)

Made some suggestions; see how performance changes after them. Unfortunately, scanning the dataset to ensure suffixes are bounded will introduce a performance hit. I still think it's worth it, though, since it's certainly better than just failing.

It may be worthwhile to test that these changes prevent executor failure due to overload. One way to do that would be to use a large enough dataset and set spark.akka.frameSize small enough that the first method fails but the latter method passes.
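
A minimal sketch of such a test setup, under the assumption that a local-mode context with a deliberately small frame size (the value is in MB) reproduces the failure:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Deliberately small Akka frame size so that oversized messages fail fast,
// exposing code paths that ship too much data at once.
val conf = new SparkConf()
  .setAppName("PrefixSpanFrameSizeTest")
  .setMaster("local[4]")
  .set("spark.akka.frameSize", "1")

val sc = new SparkContext(conf)
// ... run PrefixSpan on a large generated dataset here; the run should fail
// with the collect-everything approach but pass once projected databases
// are kept bounded.
```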

@zhangjiajin (Contributor, Author)

@feynmanliang Regarding splitPrefixSuffixPairs, I compared the two methods and found that your method's running time is longer than mine, and its result is not correct. I don't know why; please check it, thank you.

@zhangjiajin (Contributor, Author)

@feynmanliang You are right, it is worth preventing executor failure; I very much agree. I tested it according to your suggestion.
The results of the performance test are not stable, perhaps due to the environment. I am trying to find the cause and will post the results after solving this problem.

SparkQA commented Jul 27, 2015

Test build #38502 has finished for PR 7412 at commit 64271b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 29, 2015

Test build #38791 has finished for PR 7412 at commit ad23aa9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

```scala
/**
 * The maximum number of items allowed in a projected database before local processing. If a
 * projected database exceeds this size, another iteration of distributed PrefixSpan is run.
 */
private val maxLocalProjDBSize: Long = 10000
```
Contributor:

Please leave a TODO to make it configurable, with a better default value; 10000 may be too small. (See the sketch after this thread.)

Contributor (Author):

OK
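
One way the TODO might be resolved, as a hedged sketch: the builder-style setter below is an assumption modeled on other MLlib estimators, not code from this PR.

```scala
class PrefixSpan private (
    private var minSupport: Double,
    private var maxPatternLength: Int,
    private var maxLocalProjDBSize: Long) {

  // TODO: pick a better default; 10000 may be too small.
  def this() = this(0.1, 10, 10000L)

  /** Sets the projected-database size threshold that triggers local processing. */
  def setMaxLocalProjDBSize(size: Long): this.type = {
    require(size > 0, "maxLocalProjDBSize must be positive.")
    this.maxLocalProjDBSize = size
    this
  }
}
```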

mengxr (Contributor) commented Jul 29, 2015

@zhangjiajin @feynmanliang I made some minor comments. We can address the default value of maxLocalProjDBSize in a follow-up PR. To generate the final RDD, I would recommend using sc.parallelize(localSmall, 1) ++ getPatternsInLocal instead of chaining multiple RDDs. Let's fix those and merge this PR, so we can unblock related PRs. Thanks!
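
A minimal sketch of the recommended union: the names `localSmall` and `getPatternsInLocal` follow the comment, but their types here are assumptions for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical stand-ins: patterns already counted on the driver, and an RDD
// of patterns mined from the remaining large projected databases.
def finalPatterns(
    sc: SparkContext,
    localSmall: Array[(List[Int], Long)],
    getPatternsInLocal: RDD[(List[Int], Long)]): RDD[(List[Int], Long)] = {
  // One small single-partition RDD unioned with one distributed RDD, rather
  // than chaining many intermediate RDDs.
  sc.parallelize(localSmall, 1) ++ getPatternsInLocal
}
```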
