[SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM #4047
Conversation
Test build #25558 has started for PR 4047 at commit
Test build #25558 has finished for PR 4047 at commit
Test FAILed.
```scala
/**
 * An example Latent Dirichlet Allocation (LDA) app. Run with
 * {{{
 * ./bin/run-example mllib.DenseKMeans [options] <input>
```
(rename: this doc comment still references DenseKMeans)
Thanks!
Force-pushed from c6e4308 to 0472632.
Test build #25560 has started for PR 4047 at commit
Test FAILed.
Test build #25560 has finished for PR 4047 at commit
Test PASSed.
By the way, I'm running larger-scale tests, and I'll post results once they are ready!
Mind posting the dataset size (vocab, docs, etc.) and the type of cluster? I'm about to start some performance tests, and it would be cool to hit both the Chinese dataset size and what you are testing.
There are two questions.
Here is a relevant industrial practice:
@EntilZha Here’s a sketch of my plan. Datasets:
Data preparation: (a rough sketch follows after this list)
Scaling tests: (doing these first)
Accuracy tests: (doing these second)
These tests will run on a 16-node EC2 cluster of r3.2xlarge instances.
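For concreteness, here is a rough sketch of the kind of data preparation involved (an invented helper, not the actual test harness): tokenize the text files, drop stopwords, keep the most frequent terms as the vocabulary, and emit (docId, termCountVector) pairs as LDA input.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical preprocessing sketch: build an LDA input corpus from raw text.
def prepareCorpus(
    sc: SparkContext,
    path: String,
    stopwords: Set[String],
    vocabSize: Int): (RDD[(Long, Vector)], Array[String]) = {
  // Tokenize, lowercase, and drop stopwords and very short tokens.
  val docs = sc.textFile(path)
    .map(_.toLowerCase.split("""\W+""").filter(t => t.length > 3 && !stopwords.contains(t)))
    .cache()
  // Keep the vocabSize most frequent terms as the vocabulary.
  val vocab: Map[String, Int] = docs.flatMap(_.map(t => (t, 1L)))
    .reduceByKey(_ + _)
    .top(vocabSize)(Ordering.by[(String, Long), Long](_._2))
    .map(_._1).zipWithIndex.toMap
  // Represent each document as a sparse vector of term counts.
  val corpus = docs.zipWithIndex().map { case (tokens, docId) =>
    val counts = tokens.filter(vocab.contains).groupBy(identity)
      .map { case (term, occurrences) => (vocab(term), occurrences.length.toDouble) }
      .toSeq
    (docId, Vectors.sparse(vocab.size, counts))
  }
  (corpus, vocab.toArray.sortBy(_._2).map(_._1))
}
```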
@witgo I agree that there are 2 different use regimes for LDA: interpretable topics and featurization. The current implementation follows pretty much every other graph-based implementation I've seen:
I have not heard of methods which can avoid this amount of communication for LDA. I'm sure the implementation can be optimized, so please make comments here or JIRAs afterwards about that. For modified models, it might be possible to communicate less: sparsity-inducing priors, hierarchical models, etc.
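For reference, the per-iteration communication in this kind of graph-based EM looks roughly like the following sketch (invented names, simplified from the actual implementation): each edge of the bipartite document-term graph sends a k-dimensional topic-count message to both of its endpoints.

```scala
import breeze.linalg.{sum, DenseVector => BDV}
import org.apache.spark.graphx.{Graph, VertexRDD}

// Simplified sketch of one EM step. Vertex attribute: the vertex's current
// topic-count vector (length k). Edge attribute: the token count for that
// (term, document) pair.
def emStep(graph: Graph[BDV[Double], Double]): VertexRDD[BDV[Double]] = {
  graph.aggregateMessages[BDV[Double]](
    ctx => {
      // E-step: soft-assign this edge's tokens to topics, proportionally to
      // the product of the endpoints' current topic counts (unnormalized).
      val phi = ctx.srcAttr :* ctx.dstAttr
      val msg = phi * (ctx.attr / sum(phi))
      // M-step: both endpoints accumulate the expected topic counts. Sending
      // a k-dimensional message per edge is where the communication cost lies.
      ctx.sendToSrc(msg)
      ctx.sendToDst(msg)
    },
    (a, b) => a + b
  )
}
```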
Test build #25689 has started for PR 4047 at commit
Test build #25689 has finished for PR 4047 at commit
Test PASSed.
Here are some initial test results. There are 2 sets, since I ran some tests before the updates from @mengxr and other tests after the updates. Summary: Iterations keep getting longer. We will need to work on scalability, but it at least runs on medium-sized datasets. The updates from @mengxr improve scalability. Still need to test large numbers of topics.

How tests were run

I ran using this branch: I used the collection of stopwords from @dlwh here: https://github.com/dlwh/spark/blob/feature/lda/examples/src/main/scala/org/apache/spark/examples/mllib/SimpleLatentDirichletAllocation.scala

I ran using a (partial) dump of Wikipedia consisting of about 4.7GB of gzipped text files. My goal is to do a few sets of tests, scaling:
I ran:
These used a 16-node EC2 cluster of r3.2xlarge machines.

Results (before recent GC-related updates)

Take-aways: Iterations keep getting longer. We will need to work on scalability, but it at least runs on medium-sized datasets.

Scaling corpus size
Note that iteration times keep getting longer.
Scaling k
(larger tests died)

Results (after recent GC-related updates)

Main take-away: The updates from @mengxr improve scaling. (Note that the later tests no longer die.)

Scaling corpus size
Is there a plan to include inference for new (unseen) documents based on the generated distributions? Thanks
@hhbyyh Yes, please review the design doc linked from the JIRA. There is quite a bit of functionality which will not be in this initial PR.
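As background on what such inference usually looks like: the learned topics are held fixed, and the new document is "folded in." A minimal sketch (invented helper, not this PR's API):

```scala
import breeze.linalg.{sum, DenseMatrix => BDM, DenseVector => BDV}

// Fold-in sketch: estimate a new document's topic mixture given fixed topics.
// topics: vocabSize x k matrix whose columns are p(term | topic).
// termCounts: (termIndex, count) pairs for the new document.
def foldIn(topics: BDM[Double], termCounts: Seq[(Int, Double)]): BDV[Double] = {
  val k = topics.cols
  var theta = BDV.fill(k)(1.0 / k)  // start from a uniform topic mixture
  for (_ <- 0 until 50) {           // a fixed number of fixed-point iterations
    val next = BDV.zeros[Double](k)
    for ((term, count) <- termCounts) {
      // p(topic | term, doc) is proportional to p(term | topic) * theta(topic)
      val p = topics(term, ::).t :* theta
      next += p * (count / sum(p))
    }
    theta = next / sum(next)
  }
  theta
}
```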
I've had a good chance to look at the PR while making changes to my own code. I really liked the Graph initialization code (especially the partition strategy); I was able to copy that and get a 2x boost almost across the board in computation phases.

Questions in PR:

Naming (LDA vs LatentDirichletAllocation)

I think that Linear Discriminant Analysis and Latent Dirichlet Allocation are different enough that it doesn't warrant making the class names significantly longer, so LDA is probably good.

Package Name

I think a separate topicmodeling package is fitting; it makes it very clear that LDA is for topic modeling. The second motivation for a topicmodeling package is that it looks like there will be this EM implementation and soon the Gibbs version I am using, so both could fit in this namespace well. Along those same lines, perhaps keep LDAModel for the abstract class and, for implementations, use some way to denote whether each is Gibbs-based (LDAGibbs, GibbsLDA, LDAWithGibbs) or EM-based (LDAEM, EMLDA, LDAWithEM). Naming is hard...

Unit Test

On tests, I am for including a small training example, similar to here:

Lastly, it looks like the abstract classes/traits haven't made it in yet to let multiple implementations satisfy a common LDA API. I have taken a careful look at the design doc and your code, and am fairly confident that I have a way to do this with minimal code changes. I will post something once I refactor what I have to satisfy it, and post the proposed changes that would need to be made. The general sketch is to make LDA a trait, LearningState a trait, and have implementations have a signature something like object LDAGibbs extends LDA and provide a LearningState implementation. More soon.
@EntilZha Thanks for taking a look! W.r.t. the class names, I really hope we can keep a single LDA class and have an optimizer parameter which lets users specify the optimization/inference method. We are trying to move away from the AlgWithOptimizer naming conventions. That will end up affecting the abstractions you were discussing too (though maybe those abstractions will exist but be private/protected).
What might be the best way to have the EM and Gibbs LDA implementations play well with each other? If the aim is not to have separate LDA classes, then on first thought LearningState seems like the place to put the work. That would mean creating a LearningState trait, and then, within LDA, defining multiple classes that satisfy LearningState. So you might have GibbsLearningState and EMLearningState (which solves the naming problem easily). It seems like most of the implementation would live within LearningState anyway, not per se within LDA. This has the advantage you named: in the LDA constructor, we could set the default LearningState type to either EM or Gibbs (probably whichever performs better), while allowing users to specify which algorithm they would like.
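To make the shape concrete, here is a hypothetical sketch of that design (all names invented here, not the final API):

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait LDAModel  // stands in for the model hierarchy from the design doc

// One LearningState implementation per algorithm (EM, Gibbs, ...).
trait LearningState {
  def next(): LearningState  // run one iteration
  def toModel: LDAModel      // extract the trained model
}

// A single public LDA class that dispatches on an algorithm parameter.
class LDA(private var k: Int, private var algorithm: String = "em") {
  def setAlgorithm(algorithm: String): this.type = {
    this.algorithm = algorithm
    this
  }

  def run(docs: RDD[(Long, Vector)], maxIterations: Int): LDAModel = {
    var state: LearningState = algorithm match {
      case "em"    => initialEMState(docs, k)
      case "gibbs" => initialGibbsState(docs, k)
    }
    for (_ <- 0 until maxIterations) state = state.next()
    state.toModel
  }

  // Placeholder initializers for EMLearningState and GibbsLearningState.
  private def initialEMState(docs: RDD[(Long, Vector)], k: Int): LearningState = ???
  private def initialGibbsState(docs: RDD[(Long, Vector)], k: Int): LearningState = ???
}
```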
By the way, the 2 sets of timing results above are not directly comparable, since I realized that I had limited the vocab size in the later tests. I'm re-running tests to see if the updates from @mengxr helped.
@EntilZha I agree with your sketch: abstracting LearningState, and having different versions for each algorithm.
Nice, I am not too far from completing my refactoring. What would be the best way to share it: open a different PR, or link to it here? Eventually, will it be two PRs? Not sure of the best way to do this.
Sounds good! I'd recommend linking to it from here for now and creating a PR later.
Just finished refactoring; here is the combined API/LDAModel code plus the Gibbs version implementing it. It should give a pretty good idea of what I was thinking about using LearningState. Note though: (1) although it compiles, I haven't run it, since there are some methods I still need to implement; (2) I need to update the docs after I finish implementing the missing methods; (3) I also need to go through and give objects/classes/methods/vars the correct access levels (public, private, etc.).
```scala
  this.topicSmoothing
} else {
  (50.0 / k) + 1.0
}
```
Minor thing. How about moving the if-else into getTermSmoothing? For logic separation, and also so users have an interface to read the parameter.
Good point, will do.
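That is, something along these lines (a sketch, assuming a non-positive value signals "use the default"; the same pattern would apply to the term smoothing parameter):

```scala
// Sketch of the suggested refactor: the default value lives in the getter,
// so callers also have an interface for reading the effective parameter.
def getTopicSmoothing: Double = {
  if (this.topicSmoothing > 0) {
    this.topicSmoothing
  } else {
    (50.0 / k) + 1.0  // default of 50/k, shifted by +1 (values must be > 1)
  }
}
```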
```scala
 *
 * Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions),
 * but values in (0,1) are not yet supported.
 */
```
Same here. I feel the doc should be in the setters.
@mengxr Almost done with updating per your feedback, but one question: do you recommend doc only in the setter methods, or in both the getters and setters?
…ab computation. Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
@mengxr Oops, that commit hid some of my responses to your feedback above. The 2 issues remaining are: (1) who is responsible for materializing graphs? and (2) where should the param docs appear (getters and/or setters)? (My last commit moved the docs to the setters.)
Test FAILed.
Test failure was with the system... will test again.
Test build #574 has started for PR 4047 at commit
@jkbradley Both the setter and the getter are public methods, and hence both of them should have JavaDoc. I prefer to document the default value in the setter, because that is what a user reads when they want to change the parameter.
@mengxr OK, I'll copy the doc to both, but only put the default value in the setter.
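The resulting doc pattern would look roughly like this (illustrative sketch; the wording is a placeholder):

```scala
/**
 * Topic smoothing parameter: the Dirichlet prior placed on documents'
 * distributions over topics (often called "alpha").
 */
def getTopicSmoothing: Double =
  if (topicSmoothing > 0) topicSmoothing else (50.0 / k) + 1.0

/**
 * Topic smoothing parameter: the Dirichlet prior placed on documents'
 * distributions over topics (often called "alpha").
 *
 * Default: (50 / k) + 1, consistent with the "> 1.0" restriction noted above.
 */
def setTopicSmoothing(topicSmoothing: Double): this.type = {
  this.topicSmoothing = topicSmoothing
  this
}
```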
Test build #574 has finished for PR 4047 at commit
@mengxr I believe that's it. Thanks for the feedback!
Ok... now I believe that's it!
Test build #26624 has started for PR 4047 at commit
Test build #26624 has finished for PR 4047 at commit
Test PASSed.
LGTM. Merged into master. Thanks everyone for collaborating on LDA! @jkbradley Please create follow-up JIRAs and see who is interested in working on LDA features.
This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).
The design doc for this PR has been updated since I initially posted it. In particular, see the API and Planning for the Future sections.
Goals
Sketch of contents of this PR
Dependency: This makes MLlib depend on GraphX.
Files and classes:
Data/model representation and algorithm:
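As background on the representation, graph-based implementations like this one typically encode the corpus as a bipartite term-document graph, with topic counts stored on vertices and token counts on edges. A schematic sketch with invented helper names (simplified; the PR's actual construction differs in details):

```scala
import breeze.linalg.{DenseVector => BDV}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Schematic sketch: documents and terms are vertices of a bipartite GraphX
// graph; an edge (term -> doc) carries the token count; each vertex stores
// its current topic-count vector (length k). Term vertices get negative IDs
// so they cannot collide with document IDs.
def buildCorpusGraph(docs: RDD[(Long, Vector)], k: Int): Graph[BDV[Double], Double] = {
  val edges: RDD[Edge[Double]] = docs.flatMap { case (docId, termCounts) =>
    termCounts.toArray.zipWithIndex.collect {
      case (count, term) if count > 0.0 => Edge(-(term + 1).toLong, docId, count)
    }
  }
  // Initialize every vertex with a random soft assignment over the k topics.
  Graph.fromEdges(edges, defaultValue = BDV.zeros[Double](k))
    .mapVertices((id, attr) => BDV.rand(k))
}
```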
Design notes
Please refer to the JIRA for more discussion + the design doc for this PR
Here, I list the main changes AFTER the design doc was posted.
Design decisions:
Items planned for future PRs:
Questions for reviewers
Other notes
This has not been tested much for scaling. I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics. Running it for 500 iterations made it fail because of GC problems. I'm running larger-scale tests and will post results here, but future PRs may need to improve the scaling.
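For reference, basic usage of the new API looks roughly like this (a sketch based on the design doc; it assumes sc is an existing SparkContext, and parameter names may differ slightly):

```scala
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Toy corpus: (document ID, term-count vector) pairs over a 4-term vocabulary.
val corpus: RDD[(Long, Vector)] = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0, 0.0))
))
val model = new LDA()
  .setK(2)
  .setMaxIterations(20)
  .run(corpus)
// topicsMatrix: vocabSize x k matrix of term weights for each topic.
println(model.topicsMatrix)
```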
Thanks to…
CC: @mengxr