[SPARK-5556][MLLib][WIP] Gibbs LDA, Refactor LDA for multiple LDA algorithms (EM+Gibbs) #4807
Conversation
add to whitelist
ok to test
Test build #28051 has finished for PR 4807 at commit
…should be public but is not showing as public but should be
Test build #28052 has finished for PR 4807 at commit
As is, it can be merged (as far as work on refactoring goes). I had
```scala
private[clustering] trait LearningState {
  def next(): LearningState
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
```
This is not necessary, right? It should be removed from the `LearningState` class.
Why is it not necessary? The LDASuite, which contains the distributed/local models, calls it. How they are created is up to the specific implementation of LDA. Could you be more specific about why it's not necessary?
Different implementations can be hidden behind `topicsMatrix`, right?
The reason for a separate method is twofold. First, although you could calculate it from `topicsMatrix` in theory, `topicsMatrix` could be very large (too large to fit in driver memory, as the docs warn). Second, `describeTopics` is intended to provide an interface for the implementation to extract a topics matrix bounded to only the top `maxTermsPerTopic` terms per topic; that makes it less likely to run the driver out of memory, and it keeps computation of the top n terms distributed.
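To make that concrete, the top-`maxTermsPerTopic` extraction could stay distributed along roughly these lines. This is only a sketch for discussion, not this PR's code: the `termTopicCounts` RDD and the helper are assumptions, and `BoundedPriorityQueue` is Spark's internal utility (visible inside the Spark packages, where this trait lives).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.util.BoundedPriorityQueue  // Spark-internal utility class

// termTopicCounts: one row per term -> (termIndex, weight of that term in each of k topics)
def describeTopics(termTopicCounts: RDD[(Int, Array[Double])],
                   k: Int,
                   maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])] = {
  // Re-key by topic and keep only the heaviest terms per topic on the executors,
  // so only k * maxTermsPerTopic entries ever reach the driver.
  val topPerTopic = termTopicCounts
    .flatMap { case (term, weights) =>
      weights.zipWithIndex.map { case (weight, topic) => (topic, (weight, term)) }
    }
    .aggregateByKey(new BoundedPriorityQueue[(Double, Int)](maxTermsPerTopic))(
      seqOp = (queue, entry) => queue += entry,
      combOp = (q1, q2) => q1 ++= q2
    )
    .collectAsMap()
  Array.tabulate(k) { topic =>
    val sorted = topPerTopic(topic).toArray.sortBy(-_._1)  // heaviest terms first
    (sorted.map(_._2), sorted.map(_._1))                   // (termIndices, weights)
  }
}
```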
How about `def topicsMatrix: Matrix` => `def termTopicDistributions: RDD[(Long, Vector)]`, and `def topicDistributions: RDD[(Long, Vector)]` => `def docTopicDistributions: RDD[(Long, Vector)]`?
Probably @jkbradley can weigh in here. I think both changes seem reasonable; we could then have the Matrix computed from the RDD. If there is agreement, I can make the change on the PR.
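For instance, deriving the driver-side Matrix from the distributed representation might look roughly like this (a hypothetical helper for illustration, not code in this PR):

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix, Vector}
import org.apache.spark.rdd.RDD

// Build the (vocabSize x k) topics Matrix on the driver from the term-topic RDD.
// Collecting is only safe when vocabSize * k fits in driver memory.
def topicsMatrixFrom(termTopicDistributions: RDD[(Long, Vector)],
                     vocabSize: Int,
                     k: Int): Matrix = {
  val values = new Array[Double](vocabSize * k)        // column-major, one column per topic
  termTopicDistributions.collect().foreach { case (term, dist) =>
    var topic = 0
    while (topic < k) {
      values(topic * vocabSize + term.toInt) = dist(topic)
      topic += 1
    }
  }
  Matrices.dense(vocabSize, k, values)
}
```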
@jkbradley @mengxr Do you have time to take a look here?
I apologize for the lack of response! I'm going to try to make a pass ASAP.
Sounds good. I think it's reasonable that this PR only includes refactoring, not Gibbs. Then we can evaluate LightLDA vs. FastLDA and choose which one makes sense. If the changes look good, I will go ahead and make the changes proposed by @witgo.
@EntilZha I made a quick pass to get a sense of the structure. My main comment is that these changes seem to mix the concepts of algorithm and model together. I think that's why @witgo was confused about putting describeTopics within the learning state; that confused me a lot too. You are right that we need some changes to the code to support other algorithms, but I'd like to keep the separation between algorithms and models. Here's what I'd propose: Models:
For algorithms, I like keeping everything in class LDA. That can make calls out to Optimizers (one for each algorithm), and the Optimizers can return an instance of LDAModel. I'd eliminate LearningState and instead put everything in Optimizer, including initialization. I don't see a need for multiple classes since they belong under the same concept. I'm sorry I took so long to respond, but I'll keep up to date from here on out. I hadn't realized quite so much was blocking on this PR. Also, I know I'm suggesting significant changes to the PR, but it should actually require fewer changes to master and still allow multiple algorithms. CC: @hhbyyh since you have investment in this via [https://github.com//pull/4419]. I believe OnlineLDA could fit under these abstractions.
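Roughly, the split could look like the following. This is only a sketch to make the proposal concrete; the names and signatures are for discussion, not a final API.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector}
import org.apache.spark.rdd.RDD

// Shared model surface; local and distributed models would extend this.
abstract class LDAModel {
  def k: Int
  def topicsMatrix: Matrix
  def describeTopics(maxTermsPerTopic: Int): Array[(Array[Int], Array[Double])]
}

// One implementation per algorithm (EM, Gibbs, ...), owning its own state,
// initialization, and iteration logic.
trait LDAOptimizer {
  def initialize(docs: RDD[(Long, Vector)], lda: LDA): this.type
  def next(): this.type                      // run one iteration
  def getLDAModel(): LDAModel                // materialize the fitted model
}

// Single public entry point; delegates all algorithm-specific work to the optimizer.
class LDA(private var optimizer: LDAOptimizer, private var maxIterations: Int) {
  def run(docs: RDD[(Long, Vector)]): LDAModel = {
    val state = optimizer.initialize(docs, this)
    for (_ <- 0 until maxIterations) state.next()
    state.getLDAModel()
  }
}
```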
@jkbradley Thanks for the proposal; it looks reasonable. Just two minor things are not clear.
I'm good with the other parts. Thanks.
Just double-checking: your suggestion would be to revert to master, implement those general changes, then commit/push the modified branch? The primary reason I did it this way was to refactor/abstract along the method boundaries that exist right now, but as you noted it does mix model and algorithm. I like your approach of extending the abstract class with a trait. I haven't taken much time to work on it, but I could do that over the next couple of days. I also plan on being at Databricks on Wednesday for the training if you want to chat then.
Here's a proposal. Let me know what you think!
Multiple run methods do make that separation clearer, but they also force beginner users (who don't know what these algorithms are) to choose an algorithm before they can try LDA. I'd prefer to keep a single run() method and specify the algorithm as a String parameter. One con of a single run() method is that users will get back an LDAModel which they will need to cast to a more specific type (if they want to use the specialization's extra functionality). I think we could eliminate this issue later on by opening up each algorithm as its own Estimator (so that LDA would become a meta-Estimator, if you will).
That is an issue, for sure. I'd propose:
For users, Optimizer classes simply store algorithm-specific parameters. Users can use the default Optimizer, or they can specify the optimizer via a String (with default algorithm parameters) or via an Optimizer instance (with configured algorithm parameters). @EntilZha It might be easiest to revert to master (to make diffs easier), but you can decide. It would be great if you have time to work on it in the next couple of days, thanks. I'll be out of town (but online) Wednesday unfortunately, but I hope it goes well!
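In other words, usage would look something like this. A sketch only: `GibbsLDAOptimizer` is a placeholder class name, and the setter shapes follow this proposal rather than any merged code.

```scala
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val corpus: RDD[(Long, Vector)] = ???            // (docId, termCountVector)

// Beginners: a single run() with a sensible default algorithm.
val model: LDAModel = new LDA().setK(20).run(corpus)

// Pick an algorithm by name, keeping that algorithm's default parameters.
val emModel = new LDA().setK(20).setOptimizer("em").run(corpus)

// Or configure the algorithm explicitly, then cast for algorithm-specific output.
val gibbsModel = new LDA()
  .setK(20)
  .setOptimizer(new GibbsLDAOptimizer())         // placeholder optimizer class
  .run(corpus)
  .asInstanceOf[DistributedLDAModel]
```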
Thanks for the reply. The ideas look good to me. I'll go ahead with the correctness verification.
Hi @jkbradley. I tried to refactor LDA as proposed and found it will be a big change (unit tests and examples included).
A few questions:
I think it's better to have only one place that determines which algorithm to use, since otherwise it can be confusing and conflicting. Another question is about the existing parameters in LDA: I find that LDA and OnlineLDA actually share very few things, and it's kind of difficult to merge them together. Maybe separating OnlineLDA into another file is a better choice. (Later I'll provide an interface / example for streaming.)
@hhbyyh I agree with points 1-4. One clarification:
I did not intend for us to pass extra parameters to the `run()` method. I'll respond to the Online LDA-specific items in the other PR: [https://github.com//pull/4419]
… extensibility

jira: https://issues.apache.org/jira/browse/SPARK-7090

LDA was implemented with extensibility in mind. With the development of OnlineLDA and Gibbs Sampling, we are collecting more detailed requirements from different algorithms. As Joseph Bradley (jkbradley) proposed in #4807, and with some further discussion, we'd like to adjust the code structure a little to present the common interface and extension point clearly.

Basically, class LDA would be a common entrance for LDA computing, and each LDA object will refer to an LDAOptimizer for the concrete algorithm implementation. Users can customize an LDAOptimizer with specific parameters and assign it to the LDA.

Concrete changes:
1. Add a trait `LDAOptimizer`, which defines the common interface for concrete implementations. Each subclass is a wrapper for a specific LDA algorithm.
2. Move EMOptimizer to the file LDAOptimizer, have it inherit from LDAOptimizer, and rename it to EMLDAOptimizer (in case a more generic EMOptimizer comes in the future).
   - Adjust the constructor of EMOptimizer, since all parameters should be passed in through the initialState method. This avoids unwanted confusion or overwrites.
   - Move the code from LDA.initialState to the initialState of EMLDAOptimizer.
3. Add a property ldaOptimizer to LDA with its getter/setter; EMLDAOptimizer is the default Optimizer.
4. Change the return type of LDA.run from DistributedLDAModel to LDAModel.

Further work: add OnlineLDAOptimizer and other possible Optimizers once ready.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #5661 from hhbyyh/ldaRefactor and squashes the following commits:

0e2e006 [Yuhao Yang] respond to review comments
08a45da [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
e756ce4 [Yuhao Yang] solve mima exception
d74fd8f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaRefactor
0bb8400 [Yuhao Yang] refactor LDA with Optimizer
ec2f857 [Yuhao Yang] protoptype for discussion
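With that refactor in place, basic usage looks roughly like the following. This is a sketch based on the change list above; the exact setter names may differ from the merged code.

```scala
import org.apache.spark.mllib.clustering.{EMLDAOptimizer, LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

val corpus: RDD[(Long, Vector)] = ???            // (docId, termCounts)

val model: LDAModel = new LDA()
  .setK(10)
  .setOptimizer(new EMLDAOptimizer)              // the default optimizer, set explicitly here
  .run(corpus)                                   // now typed as LDAModel (change 4 above)
```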
@EntilZha I think this PR doesn't need to be updated now that [https://github.com//pull/5661] has been merged (for JIRA [https://issues.apache.org/jira/browse/SPARK-7090]). Thank you though for this initial PR and discussion! Could you please close this PR? It would still be great if we can get Gibbs sampling in for the next release cycle.
Commenting here and then on the ticket. If there is interest in using the Gibbs implementation I wrote for the next release, using the interface/refactor from that PR, I am open to that.
There is definitely interest; let's coordinate on the JIRA and a new PR, especially with @witgo.
JIRA: https://issues.apache.org/jira/browse/SPARK-5556
As discussed in that issue, it would be great to have multiple LDA algorithm options, principally EM (implemented already in #4047) and Gibbs.
Goals of PR:
At the moment, I am looking for feedback on the refactoring while working on putting the Gibbs code in.
Summary of Changes:
These traits were created with the purpose of encapsulating everything about the implementation, while interfacing with the entry points `LDA.run` and `DistributedLDAModel`. The entirety of an LDA implementation can be captured by an object and a class which extend these traits. Specifically, the `LearningStateInitializer` provides the method for returning the `LearningState`, which maintains state. Lastly, the algorithm can be set via an enum which is pattern matched to create the correct implementation (a rough sketch follows below). My thought is that the default algorithm should be whichever performs better.
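For illustration, the algorithm selection could be wired up roughly like this. The `EMLearningStateInitializer` and `GibbsLearningStateInitializer` names are placeholders for this sketch, not code in this PR.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Available algorithms; pattern matched inside LDA.run to pick an implementation.
object LDAAlgorithm extends Enumeration {
  val EM, Gibbs = Value
}

// Everything implementation-specific hangs off LearningStateInitializer / LearningState.
private[clustering] trait LearningStateInitializer {
  def initialState(docs: RDD[(Long, Vector)], k: Int): LearningState
}

// Inside LDA.run (sketch):
//   val initializer = algorithm match {
//     case LDAAlgorithm.EM    => EMLearningStateInitializer     // placeholder object
//     case LDAAlgorithm.Gibbs => GibbsLearningStateInitializer  // placeholder object
//   }
//   var state = initializer.initialState(docs, k)
//   for (_ <- 0 until maxIterations) state = state.next()
```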
Gibbs Implementation
Old design doc is here:
Primary Gibbs algorithm from here (mostly notation/math, GraphX based, not table based): http://www.cs.ucsb.edu/~mingjia/cs240/doc/273811.pdf
Implements FastLDA from here: http://www.ics.uci.edu/~newman/pubs/fastlda.pdf
Specific Points for Feedback