[SPARK-7090][MLlib] Introduce LDAOptimizer to LDA to further improve extensibility #5661
Conversation
Since SPARK-7090 was a duplicate, I closed it. Retag this for SPARK-7089?
@srowen Oh, thanks. I closed SPARK-7089 just now... Can I just use SPARK-7090?
Test build #30836 has finished for PR 5661 at commit
Test build #30843 has finished for PR 5661 at commit
Sorry for the delay! I'll review the PR now.
* according to the Asuncion et al. (2009) paper referenced below.
*
* References:
*  - Original LDA paper (journal version):
This 1 reference should stay here.
@hhbyyh Thanks for the PR! It looks good, except for one item on which I think we weren't clear before: I meant for us to separate the Optimizer and LearningState concepts.
Could you please refactor accordingly? It should only require moving some code, but I think it will help clarify the distinction between the parameters and the learning state.
@jkbradley Thanks for the review. I'll send an update according to the suggestions soon.
@hhbyyh Thanks for reminding me of the discussion in the other PR. I guess it's hard to say what's better, given that I've now contradicted myself about whether to split the Optimizer and LearningState concepts. I think it's fine if you keep them both under the Optimizer concept. Thanks!
Thanks @jkbradley. I think Optimizer is simpler and provides sufficient flexibility for now. I made some changes according to the other comments.
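To make the resolved design concrete, here is a minimal, hypothetical sketch of the single-Optimizer concept agreed on above: one object carries both the user-set parameters and the per-run learning state. The trait and member names are illustrative only, not the PR's actual signatures.

```scala
// Hypothetical sketch only: illustrates keeping optimizer parameters and
// learning state under one Optimizer concept, rather than splitting them
// into separate Optimizer and LearningState types.
trait OptimizerSketch {
  // --- optimizer-specific parameter, set by the user before LDA.run() ---
  def setMaxInnerIterations(n: Int): this.type

  // --- learning state, created when LDA.run() starts ---
  // initialState receives the corpus and the LDA-level parameters, so the
  // constructor stays parameter-free and nothing can be overwritten later.
  def initialState(docs: Seq[(Long, Array[Double])], k: Int): this.type

  // perform one learning iteration over the current state
  def next(): this.type

  // extract the fitted model once iterations are done
  def getModel(): AnyRef
}
```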
* hold optimizer-specific parameters for users to set.
*/
@Experimental
trait LDAOptimizer{
still need space: LDAOptimizer {
@hhbyyh Thanks for the updates! I made a few small comments, but can you please fix them in your next PR, which adds OnlineLDA? (That way, we can go ahead and merge this one.) LGTM pending tests.
Btw, there have been some issues with Jenkins recently (not starting tests or posting results automatically).
Test build #720 has finished for PR 5661 at commit
Test build #30985 has started for PR 5661 at commit
@jkbradley Thanks, I think it's fine to merge the current version.
No, I'd just wait for the test. I think that previous test was cancelled, so I'll start a new one.
Test build #724 has started for PR 5661 at commit
The test has finished, yet it's not posting to GitHub.
OK, I'll merge it into master. Thanks!
[SPARK-7090][MLlib] Introduce LDAOptimizer to LDA to further improve extensibility

jira: https://issues.apache.org/jira/browse/SPARK-7090

LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling we are collecting more detailed requirements from different algorithms. As Joseph Bradley jkbradley proposed in apache#4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.

Basically, class LDA would be a common entrance for LDA computing, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize an LDAOptimizer with specific parameters and assign it to LDA.

Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
2. Move EMOptimizer to file LDAOptimizer, inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
   - Adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
   - Move the code from LDA.initialState to initialState of EMLDAOptimizer.
3. Add property ldaOptimizer to LDA with its getter/setter; EMLDAOptimizer is the default optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.

Further work: add OnlineLDAOptimizer and other possible optimizers once ready.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#5661 from hhbyyh/ldaRefactor and squashes the following commits:
0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
jira: https://issues.apache.org/jira/browse/SPARK-7090
LDA was implemented with extensibility in mind, and with the development of OnlineLDA and Gibbs Sampling we are collecting more detailed requirements from different algorithms.
As Joseph Bradley @jkbradley proposed in #4807, and after some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.
Basically, class LDA would be a common entrance for LDA computing, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize an LDAOptimizer with specific parameters and assign it to LDA (a usage sketch follows below).
Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
2. Move EMOptimizer to file LDAOptimizer, inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
   - Adjust the constructor of EMOptimizer, since all the parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
   - Move the code from LDA.initialState to initialState of EMLDAOptimizer.
3. Add property ldaOptimizer to LDA with its getter/setter; EMLDAOptimizer is the default optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.
Further work:
add OnlineLDAOptimizer and other possible optimizers once ready.
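To make the new entry point concrete, here is a hedged usage sketch of the workflow described above. The toy corpus and local Spark setup are illustrative only, and the setter name follows the "getter/setter for ldaOptimizer" wording above (it may differ slightly in the merged code).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vectors

// Toy corpus of (document id, term-count vector) pairs. The data and the
// SparkContext setup are illustrative; the LDA/optimizer wiring is the point.
val sc = new SparkContext(
  new SparkConf().setMaster("local").setAppName("lda-optimizer-sketch"))
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0))
))

// Configure the optimizer separately, then assign it to LDA.
// EMLDAOptimizer is also what LDA falls back to when none is set.
val lda = new LDA()
  .setK(2)
  .setMaxIterations(10)
  .setOptimizer(new EMLDAOptimizer)

// Per change 4 above, run() now returns the abstract LDAModel type,
// not DistributedLDAModel.
val model: LDAModel = lda.run(corpus)
```

Returning the base LDAModel keeps the entry point uniform: callers that need the distributed representation can still pattern-match on the concrete type, while future optimizers remain free to produce other model subclasses.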