
[SPARK-7090][MLlib] Introduce LDAOptimizer to LDA to further improve extensibility #5661

Closed · wants to merge 6 commits into apache:master from hhbyyh:ldaRefactor

Conversation

@hhbyyh (Contributor) commented Apr 23, 2015

jira: https://issues.apache.org/jira/browse/SPARK-7090

LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs sampling we are collecting more detailed requirements from the different algorithms.
As Joseph Bradley (@jkbradley) proposed in #4807, and after some further discussion, we'd like to adjust the code structure a little so that the common interface and extension point are presented clearly.
The LDA class becomes the common entry point for LDA computation, and each LDA object refers to an LDAOptimizer for the concrete algorithm implementation. Users can configure an LDAOptimizer with algorithm-specific parameters and assign it to the LDA instance.

Concrete changes:

  1. Add a trait LDAOptimizer, which defines the common interface for concrete implementations. Each subclass wraps a specific LDA algorithm (a rough sketch of this structure follows the "Further work" note below).
  2. Move EMOptimizer to the LDAOptimizer file, have it extend LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes along in the future).
    - Adjust the constructor of EMLDAOptimizer, since all parameters should be passed in through the initialState method; this avoids confusion and accidental overwrites.
    - Move the code from LDA.initialState into EMLDAOptimizer.initialState.
  3. Add an ldaOptimizer property to LDA with a getter/setter; EMLDAOptimizer is the default optimizer.
  4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.

Further work:
add OnlineLDAOptimizer and other optimizers once they are ready.
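
For reference, here is a minimal, self-contained sketch of the structure described above. It is illustrative only: the method names (initialState, next, getLDAModel, setOptimizer), the simplified in-memory corpus type, and the fixed iteration loop are assumptions made for this sketch, not the exact MLlib signatures.

```scala
// Illustrative sketch only -- names and signatures are simplified assumptions,
// not the exact MLlib API.

trait LDAModel                                // stand-in for mllib's LDAModel
class DistributedLDAModel extends LDAModel    // what the EM path produces

// Common interface: each concrete optimizer wraps one LDA algorithm and
// holds its algorithm-specific parameters.
trait LDAOptimizer {
  def initialState(corpus: Seq[(Long, Array[Double])], k: Int): LDAOptimizer
  def next(): LDAOptimizer                    // run one iteration of the algorithm
  def getLDAModel(): LDAModel                 // extract the trained model
}

// The former EMOptimizer, renamed (the real one maintains GraphX-based state).
class EMLDAOptimizer extends LDAOptimizer {
  def initialState(corpus: Seq[(Long, Array[Double])], k: Int): LDAOptimizer = this
  def next(): LDAOptimizer = this             // an EM step would go here
  def getLDAModel(): LDAModel = new DistributedLDAModel
}

// LDA is the common entry point; the optimizer is a settable property,
// with EMLDAOptimizer as the default.
class LDA(private var k: Int = 10, private var maxIterations: Int = 20) {
  private var ldaOptimizer: LDAOptimizer = new EMLDAOptimizer

  def setOptimizer(optimizer: LDAOptimizer): this.type = {
    ldaOptimizer = optimizer
    this
  }
  def getOptimizer: LDAOptimizer = ldaOptimizer

  // run now returns the general LDAModel type rather than DistributedLDAModel.
  def run(corpus: Seq[(Long, Array[Double])]): LDAModel = {
    var state = ldaOptimizer.initialState(corpus, k)
    for (_ <- 0 until maxIterations) state = state.next()
    state.getLDAModel()
  }
}
```

With this change, callers that rely on the distributed representation produced by the default EM path would match on or cast the returned LDAModel to DistributedLDAModel.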

@srowen (Member) commented Apr 23, 2015

Since SPARK-7090 was a duplicate, I closed it. Retag this for SPARK-7089?

@hhbyyh (Contributor, Author) commented Apr 23, 2015

@srowen Oh, thanks. I closed SPARK-7089 just now... Can I just use SPARK-7090?

@SparkQA commented Apr 23, 2015

Test build #30836 has finished for PR 5661 at commit 0bb8400.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

@SparkQA commented Apr 23, 2015

Test build #30843 has finished for PR 5661 at commit e756ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

@jkbradley (Member) commented:
Sorry for the delay! I'll review the PR now

@jkbradley (Member) commented on the following lines in the diff:

* according to the Asuncion et al. (2009) paper referenced below.
*
* References:
* - Original LDA paper (journal version):

This 1 reference should stay here.

@jkbradley (Member) commented:
@hhbyyh Thanks for the PR! It looks good, except for one item that I think we weren't clear about before:

I meant for us to separate the Optimizer and LearningState concepts.

  • Optimizer should be a class which stores parameters and not much else. Optimizer.initialState should return an instance of a LearningState class.
  • LearningState should have the next() and getModel() methods.

Could you please refactor according to that? It should only require moving some code around, but I think it will help clarify the distinction between the parameters and the learning state (sketched below).
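
For clarity, a minimal sketch of the split being proposed in this comment; the names and signatures are illustrative assumptions, not an agreed API.

```scala
// Illustrative only: names and signatures are assumptions.

trait LDAModel  // stand-in for mllib's LDAModel

// Mutable per-run state: advances the algorithm and produces the model.
trait LearningState {
  def next(): LearningState
  def getModel(): LDAModel
}

// Parameter holder only: builds the initial learning state for a corpus.
trait LDAOptimizer {
  def initialState(corpus: Seq[(Long, Array[Double])], k: Int): LearningState
}
```

As the next comments show, this split was ultimately not adopted and both roles stayed on LDAOptimizer.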

@hhbyyh (Contributor, Author) commented Apr 27, 2015

@jkbradley Thanks for the review. I'll send an update according to the suggestions soon.

@jkbradley (Member) commented:
@hhbyyh Thanks for reminding me of the discussion in the other PR. I guess it's hard to say what's better given that I've contradicted myself now about whether to split the Optimizer and LearningState concepts. I think it's fine if you keep them both under the Optimizer concept. Thanks!

@hhbyyh (Contributor, Author) commented Apr 27, 2015

Thanks @jkbradley. I think the Optimizer approach is simpler and provides sufficient flexibility for now. I made some changes according to the other comments.
P.S. Not sure why Jenkins hasn't picked up the new changes...

@jkbradley (Member) commented on the following lines in the diff:

* hold optimizer-specific parameters for users to set.
*/
@Experimental
trait LDAOptimizer{

still need space: `LDAOptimizer {`

@jkbradley (Member) commented:
@hhbyyh Thanks for the updates! I made a few small comments, but can you please fix them in your next PR which adds OnlineLDA? (That way, we can go ahead and merge this one.)

LGTM pending tests

@jkbradley (Member) commented:
Btw, there have been some issues with Jenkins recently (not starting tests or posting results automatically)

@SparkQA commented Apr 27, 2015

Test build #720 has finished for PR 5661 at commit 0e2e006.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

@SparkQA commented Apr 27, 2015

Test build #30985 has started for PR 5661 at commit 0e2e006.

@hhbyyh (Contributor, Author) commented Apr 27, 2015

@jkbradley Thanks, I think it's fine to merge the current version.
And if the pending API name change is a concern, I can do a quick update (need to wait for the test).

@jkbradley (Member) commented:
No, I'd just wait for the test. I think that previous test was cancelled, so I'll start a new one.

@SparkQA commented Apr 27, 2015

Test build #724 has started for PR 5661 at commit 0e2e006.

@hhbyyh (Contributor, Author) commented Apr 28, 2015

The test has finished, but the result isn't posting to GitHub.

@jkbradley (Member) commented:
OK, I'll merge it into master. Thanks!

@asfgit closed this in 4d9e560 on Apr 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#5661 from hhbyyh/ldaRefactor and squashes the following commits:

0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015