Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

Closed
wants to merge 6 commits into from

Conversation

chouqin
Copy link
Contributor

@chouqin chouqin commented Aug 14, 2014

findbinsForLevel is applied to every LabeledPoint to find bins for all nodes at a given level. Given a specific LabeledPoint and a specific feature, the bin to put this labeled point should always be same.But in current implementation, findBin on a (labeledpoint, feature) pair is called for all nodes and all levels, which is a waste of computation.

In my implementation, findBin for each (labeledpoint, feature) pair is executed only once before the start of level-wise training of decision tree. Then, at each level, this feature2bin array can be reused.

What's more, findbinsForLevel now return a array of smaller size, all the nodes on which this labeledPoint is valid share the same feature2bin array, instead of each node having a copy of it.

CC: @mengxr @manishamde @jkbradley, Please have a look at this, thanks.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@manishamde
Copy link
Contributor

Thanks for the PR @chouqin. The redundant findBin calculation should definitely be performed once and it will definitely speed up the computation. A couple of thoughts after looking at your implementation:

  1. Do you need the store the original features along with the bins?
  2. You are using Int for bin ids. You can pack them tightly as Byte if the number of bins are less than 256. One can use multiple Byte format for other bins sizes.

I have an implementation similar to yours
https://github.com/manishamde/spark/compare/ent

A slight difference is that I am creating an internal TreePoint class that can store the bin mapping and this class is extended while performing Random Forest computation. Finally, I think @jkbradley is working on more optimizations on top these changes. I will let him elaborate on that.

@emef
Copy link
Contributor

emef commented Aug 14, 2014

@manishamde I've been experimenting with a gradient boosting implementation that would definitely benefit from having the labeled point conversion done once.

@manishamde
Copy link
Contributor

@emef Yup. GBT, AdaBoost and RF implementations will also benefits from this LabeledPointConversion. Each will extend the TreePoint class in different ways: 1) GBT will add pseudo-residuals, 2) AdaBoost will add sample weights, and 3) RF will add poisson resampled weights for trees.

@mengxr
Copy link
Contributor

mengxr commented Aug 14, 2014

@chouqin Thanks for optimizing decision tree! As @manishamde mentioned, @jkbradley has been working on decision tree optimization and bug fixes, including this one and several others. Considering his following PRs will based on his version #1950, do you mind helping review his code?

Btw, I'm fully responsible for duplicated efforts in MLlib. The correct procedure of open-source contribution should be: 1) create a JIRA and describe and discuss to work to be done, 2) get assigned for the work, 3) submit a PR. However, few of us follows this procedure closely. Usually a JIRA is created just before submitting the PR, this caused duplicated efforts. I will try to do a better job at this.

@jkbradley
Copy link
Member

@chouqin My apologies as well. But I hope you find the soon-to-follow PRs useful, with additional optimizations.

@chouqin
Copy link
Contributor Author

chouqin commented Aug 15, 2014

@mengxr @jkbradley never mind, I will help you review @1950 :)

@chouqin
Copy link
Contributor Author

chouqin commented Aug 15, 2014

I close this PR now and focus on #1950

@chouqin chouqin closed this Aug 15, 2014
szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants