[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941
Can one of the admins verify this patch? |
Thanks for the PR @chouqin. The findBin calculation should definitely be performed only once, and doing so will speed up the computation. A couple of thoughts after looking at your implementation:
I have an implementation similar to yours. A slight difference is that I am creating an internal TreePoint class that can store the bin mapping, and this class is extended while performing Random Forest computation. Finally, I think @jkbradley is working on more optimizations on top of these changes. I will let him elaborate on that. |
@manishamde I've been experimenting with a gradient boosting implementation that would definitely benefit from having the labeled point conversion done once. |
@emef Yup. GBT, AdaBoost and RF implementations will also benefit from this LabeledPointConversion. Each will extend the TreePoint class in different ways: 1) GBT will add pseudo-residuals, 2) AdaBoost will add sample weights, and 3) RF will add Poisson-resampled weights for trees. |
@chouqin Thanks for optimizing decision tree! As @manishamde mentioned, @jkbradley has been working on decision tree optimization and bug fixes, including this one and several others. Considering that his following PRs will be based on his version #1950, do you mind helping review his code? Btw, I'm fully responsible for duplicated efforts in MLlib. The correct procedure for open-source contribution should be: 1) create a JIRA to describe and discuss the work to be done, 2) get assigned the work, 3) submit a PR. However, few of us follow this procedure closely. Usually a JIRA is created just before submitting the PR, and this has caused duplicated efforts. I will try to do a better job at this. |
@chouqin My apologies as well. But I hope you find the soon-to-follow PRs useful, with additional optimizations. |
@mengxr @jkbradley never mind, I will help you review #1950 :)
I'm closing this PR now and will focus on #1950.
findBinsForLevel is applied to every LabeledPoint to find bins for all nodes at a given level. Given a specific LabeledPoint and a specific feature, the bin that the labeled point falls into should always be the same. But in the current implementation, findBin on a (labeledPoint, feature) pair is called for all nodes and all levels, which is a waste of computation.

In my implementation, findBin for each (labeledPoint, feature) pair is executed only once, before the start of level-wise training of the decision tree. Then, at each level, this feature2bin array can be reused.

What's more, findBinsForLevel now returns a smaller array: all the nodes on which this labeledPoint is valid share the same feature2bin array, instead of each node having a copy of it.

CC: @mengxr @manishamde @jkbradley, please have a look at this, thanks.
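
For reviewers who want the idea in isolation, here is a minimal standalone sketch of the precomputation described above. It is not the code in this PR and does not use MLlib classes: the LabeledPoint case class, the BinPrecompute object, the featureToBin helper, and the representation of splits as sorted threshold arrays for continuous features are all assumptions made for illustration. Only the names findBin and feature2bin come from the description.

```scala
// Hypothetical stand-in for MLlib's LabeledPoint (assumption, not the real class).
final case class LabeledPoint(label: Double, features: Array[Double])

object BinPrecompute {
  // Returns the index of the bin containing `value`.
  // Assumes `thresholds` is sorted ascending; bin i holds values <= thresholds(i)
  // that are greater than thresholds(i - 1), and the last bin has no upper bound.
  def findBin(value: Double, thresholds: Array[Double]): Int = {
    var lo = 0
    var hi = thresholds.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (value <= thresholds(mid)) hi = mid else lo = mid + 1
    }
    lo
  }

  // Computes the feature2bin array for one point: one findBin call per feature.
  // Intended to run once per point, before level-wise training starts.
  def featureToBin(point: LabeledPoint, splits: Array[Array[Double]]): Array[Int] =
    Array.tabulate(point.features.length)(f => findBin(point.features(f), splits(f)))

  def main(args: Array[String]): Unit = {
    // Illustrative thresholds: feature 0 has three bins, feature 1 has two.
    val splits = Array(Array(1.0, 2.0), Array(10.0))
    val points = Seq(
      LabeledPoint(0.0, Array(0.5, 20.0)),
      LabeledPoint(1.0, Array(1.5, 5.0))
    )

    // Precompute once; every level and every node reuses the same arrays.
    val binned = points.map(p => (p.label, featureToBin(p, splits)))

    for (level <- 0 until 3; (label, feature2bin) <- binned) {
      // A real implementation would update per-node statistics here;
      // the point is that no findBin call happens inside this loop.
      println(s"level=$level label=$label bins=${feature2bin.mkString(",")}")
    }
  }
}
```

On an RDD, the same step would presumably be a single map over the input followed by cache(), so that the binned representation is computed once and reused at each level.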