
[SPARK-7090][MLlib] Introduce LDAOptimizer to LDA to further improve extensibility #5661

Closed · wants to merge 6 commits into apache:master from hhbyyh:ldaRefactor

Conversation

@hhbyyh (Contributor) commented Apr 23, 2015

jira: https://issues.apache.org/jira/browse/SPARK-7090

LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs sampling we are collecting more detailed requirements from the different algorithms.
As Joseph Bradley (@jkbradley) proposed in #4807, and after some further discussion, we'd like to adjust the code structure a little so that the common interface and extension point are presented clearly.
The LDA class becomes the common entry point for LDA computation, and each LDA object refers to an LDAOptimizer for the concrete algorithm implementation. Users can configure an LDAOptimizer with algorithm-specific parameters and assign it to the LDA instance.

Concrete changes:

  1. Add a trait LDAOptimizer, which defines the common interface for concrete implementations. Each subclass wraps a specific LDA algorithm (a rough sketch of this structure follows the "Further work" note below).
  2. Move EMOptimizer to the LDAOptimizer file, have it extend LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes along in the future).
    - Adjust the constructor of EMLDAOptimizer, since all parameters should be passed in through the initialState method; this avoids confusion and accidental overwrites.
    - Move the code from LDA.initialState into EMLDAOptimizer.initialState.
  3. Add an ldaOptimizer property to LDA with a getter/setter; EMLDAOptimizer is the default optimizer.
  4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.

Further work:
add OnlineLDAOptimizer and other optimizers once they are ready.
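
For reference, here is a minimal, self-contained sketch of the structure described above. It is illustrative only: the method names (initialState, next, getLDAModel, setOptimizer), the simplified in-memory corpus type, and the fixed iteration loop are assumptions made for this sketch, not the exact MLlib signatures.

```scala
// Illustrative sketch only -- names and signatures are simplified assumptions,
// not the exact MLlib API.

trait LDAModel                                // stand-in for mllib's LDAModel
class DistributedLDAModel extends LDAModel    // what the EM path produces

// Common interface: each concrete optimizer wraps one LDA algorithm and
// holds its algorithm-specific parameters.
trait LDAOptimizer {
  def initialState(corpus: Seq[(Long, Array[Double])], k: Int): LDAOptimizer
  def next(): LDAOptimizer                    // run one iteration of the algorithm
  def getLDAModel(): LDAModel                 // extract the trained model
}

// The former EMOptimizer, renamed (the real one maintains GraphX-based state).
class EMLDAOptimizer extends LDAOptimizer {
  def initialState(corpus: Seq[(Long, Array[Double])], k: Int): LDAOptimizer = this
  def next(): LDAOptimizer = this             // an EM step would go here
  def getLDAModel(): LDAModel = new DistributedLDAModel
}

// LDA is the common entry point; the optimizer is a settable property,
// with EMLDAOptimizer as the default.
class LDA(private var k: Int = 10, private var maxIterations: Int = 20) {
  private var ldaOptimizer: LDAOptimizer = new EMLDAOptimizer

  def setOptimizer(optimizer: LDAOptimizer): this.type = {
    ldaOptimizer = optimizer
    this
  }
  def getOptimizer: LDAOptimizer = ldaOptimizer

  // run now returns the general LDAModel type rather than DistributedLDAModel.
  def run(corpus: Seq[(Long, Array[Double])]): LDAModel = {
    var state = ldaOptimizer.initialState(corpus, k)
    for (_ <- 0 until maxIterations) state = state.next()
    state.getLDAModel()
  }
}
```

With this change, callers that rely on the distributed representation produced by the default EM path would match on or cast the returned LDAModel to DistributedLDAModel.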

@srowen (Member) commented Apr 23, 2015

Since SPARK-7090 was a duplicate, I closed it. Retag this for SPARK-7089?

@hhbyyh (Contributor, Author) commented Apr 23, 2015

@srowen Oh, thanks. I closed SPARK-7089 just now... Can I just use SPARK-7090?

@SparkQA commented Apr 23, 2015

Test build #30836 has finished for PR 5661 at commit 0bb8400.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

@SparkQA commented Apr 23, 2015

Test build #30843 has finished for PR 5661 at commit e756ce4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

@jkbradley (Member) commented:
Sorry for the delay! I'll review the PR now

@jkbradley (Member) commented on the following lines in the diff:

* according to the Asuncion et al. (2009) paper referenced below.
*
* References:
* - Original LDA paper (journal version):

This 1 reference should stay here.

@jkbradley (Member) commented:
@hhbyyh Thanks for the PR! It looks good, except for one item that I think we weren't clear about before:

I meant for us to separate the Optimizer and LearningState concepts.

  • Optimizer should be a class which stores parameters and not much else. Optimizer.initialState should return an instance of a LearningState class.
  • LearningState should have the next() and getModel() methods.

Could you please refactor according to that? It should only require moving some code around, but I think it will help clarify the distinction between the parameters and the learning state (sketched below).
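
For clarity, a minimal sketch of the split being proposed in this comment; the names and signatures are illustrative assumptions, not an agreed API.

```scala
// Illustrative only: names and signatures are assumptions.

trait LDAModel  // stand-in for mllib's LDAModel

// Mutable per-run state: advances the algorithm and produces the model.
trait LearningState {
  def next(): LearningState
  def getModel(): LDAModel
}

// Parameter holder only: builds the initial learning state for a corpus.
trait LDAOptimizer {
  def initialState(corpus: Seq[(Long, Array[Double])], k: Int): LearningState
}
```

As the next comments show, this split was ultimately not adopted and both roles stayed on LDAOptimizer.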

@hhbyyh (Contributor, Author) commented Apr 27, 2015

@jkbradley Thanks for the review. I'll send an update according to the suggestions soon.

@jkbradley (Member) commented:
@hhbyyh Thanks for reminding me of the discussion in the other PR. I guess it's hard to say what's better given that I've contradicted myself now about whether to split the Optimizer and LearningState concepts. I think it's fine if you keep them both under the Optimizer concept. Thanks!

@hhbyyh (Contributor, Author) commented Apr 27, 2015

Thanks @jkbradley. I think the Optimizer approach is simpler and provides sufficient flexibility for now. I made some changes according to the other comments.
P.S. Not sure why Jenkins hasn't picked up the new changes...

@jkbradley (Member) commented on the following lines in the diff:

* hold optimizer-specific parameters for users to set.
*/
@Experimental
trait LDAOptimizer{

still need space: `LDAOptimizer {`

@jkbradley (Member) commented:
@hhbyyh Thanks for the updates! I made a few small comments, but can you please fix them in your next PR which adds OnlineLDA? (That way, we can go ahead and merge this one.)

LGTM pending tests

@jkbradley (Member) commented:
Btw, there have been some issues with Jenkins recently (not starting tests or posting results automatically)

@SparkQA commented Apr 27, 2015

Test build #720 has finished for PR 5661 at commit 0e2e006.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LDAOptimizer
    • class EMLDAOptimizer extends LDAOptimizer
  • This patch does not change any dependencies.

@SparkQA commented Apr 27, 2015

Test build #30985 has started for PR 5661 at commit 0e2e006.

@hhbyyh (Contributor, Author) commented Apr 27, 2015

@jkbradley Thanks, I think it's fine to merge the current version.
And if the pending API name change is a concern, I can do a quick update (need to wait for the test).

@jkbradley (Member) commented:
No, I'd just wait for the test. I think that previous test was cancelled, so I'll start a new one.

@SparkQA commented Apr 27, 2015

Test build #724 has started for PR 5661 at commit 0e2e006.

@hhbyyh (Contributor, Author) commented Apr 28, 2015

The test has finished, but the result isn't posting to GitHub.

@jkbradley (Member) commented:
OK, I'll merge it into master. Thanks!

@asfgit closed this in 4d9e560 on Apr 28, 2015
jeanlyn pushed a commit to jeanlyn/spark that referenced this pull request May 14, 2015

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#5661 from hhbyyh/ldaRefactor and squashes the following commits:

0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
nemccarthy pushed a commit to nemccarthy/spark that referenced this pull request Jun 19, 2015