[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

chouqin · 2014-08-14T08:12:38Z

findbinsForLevel is applied to every LabeledPoint to find bins for all nodes at a given level. Given a specific LabeledPoint and a specific feature, the bin to put this labeled point should always be same.But in current implementation, findBin on a (labeledpoint, feature) pair is called for all nodes and all levels, which is a waste of computation.

In my implementation, findBin for each (labeledpoint, feature) pair is executed only once before the start of level-wise training of decision tree. Then, at each level, this feature2bin array can be reused.

What's more, findbinsForLevel now return a array of smaller size, all the nodes on which this labeledPoint is valid share the same feature2bin array, instead of each node having a copy of it.

CC: @mengxr @manishamde @jkbradley, Please have a look at this, thanks.

AmplabJenkins · 2014-08-14T08:13:18Z

Can one of the admins verify this patch?

manishamde · 2014-08-14T14:15:08Z

Thanks for the PR @chouqin. The redundant findBin calculation should definitely be performed once and it will definitely speed up the computation. A couple of thoughts after looking at your implementation:

Do you need the store the original features along with the bins?
You are using Int for bin ids. You can pack them tightly as Byte if the number of bins are less than 256. One can use multiple Byte format for other bins sizes.

I have an implementation similar to yours
https://github.com/manishamde/spark/compare/ent

A slight difference is that I am creating an internal TreePoint class that can store the bin mapping and this class is extended while performing Random Forest computation. Finally, I think @jkbradley is working on more optimizations on top these changes. I will let him elaborate on that.

emef · 2014-08-14T17:49:55Z

@manishamde I've been experimenting with a gradient boosting implementation that would definitely benefit from having the labeled point conversion done once.

manishamde · 2014-08-14T19:26:51Z

@emef Yup. GBT, AdaBoost and RF implementations will also benefits from this LabeledPointConversion. Each will extend the TreePoint class in different ways: 1) GBT will add pseudo-residuals, 2) AdaBoost will add sample weights, and 3) RF will add poisson resampled weights for trees.

mengxr · 2014-08-14T21:02:44Z

@chouqin Thanks for optimizing decision tree! As @manishamde mentioned, @jkbradley has been working on decision tree optimization and bug fixes, including this one and several others. Considering his following PRs will based on his version #1950, do you mind helping review his code?

Btw, I'm fully responsible for duplicated efforts in MLlib. The correct procedure of open-source contribution should be: 1) create a JIRA and describe and discuss to work to be done, 2) get assigned for the work, 3) submit a PR. However, few of us follows this procedure closely. Usually a JIRA is created just before submitting the PR, this caused duplicated efforts. I will try to do a better job at this.

jkbradley · 2014-08-14T21:34:21Z

@chouqin My apologies as well. But I hope you find the soon-to-follow PRs useful, with additional optimizations.

chouqin · 2014-08-15T06:24:39Z

@mengxr @jkbradley never mind, I will help you review @1950 :)

chouqin · 2014-08-15T07:01:40Z

I close this PR now and focus on #1950

These changes to testing were included in apache/datafusion-comet#213

qiping.lqp added 6 commits August 14, 2014 14:06

improve decision tree: findbins called only once for each feature

065a42c

fix unit test for decision tree

45549f7

fix style: line length doesn't exceed 100.

4a7bc0b

add comments

59b6781

fix indentation

4ef9e7f

fix indentation too

f1a20e3

chouqin mentioned this pull request Aug 15, 2014

[SPARK-3022] [SPARK-3041] [mllib] Call findBins once per level + unordered feature bug fix #1950

Closed

chouqin closed this Aug 15, 2014

szehon-ho pushed a commit to szehon-ho/spark that referenced this pull request Aug 7, 2024

[MINOR] feat: Enable Comet broadcast by default (apache#1941)

972bf81

These changes to testing were included in apache/datafusion-comet#213

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

chouqin commented Aug 14, 2014

AmplabJenkins commented Aug 14, 2014

manishamde commented Aug 14, 2014

emef commented Aug 14, 2014

manishamde commented Aug 14, 2014

mengxr commented Aug 14, 2014

jkbradley commented Aug 14, 2014

chouqin commented Aug 15, 2014

chouqin commented Aug 15, 2014

[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

[SPARK-3022] [mllib] FindBinsForLevel in decision tree should call findBin only once for each feature #1941

Conversation

chouqin commented Aug 14, 2014

AmplabJenkins commented Aug 14, 2014

manishamde commented Aug 14, 2014

emef commented Aug 14, 2014

manishamde commented Aug 14, 2014

mengxr commented Aug 14, 2014

jkbradley commented Aug 14, 2014

chouqin commented Aug 15, 2014

chouqin commented Aug 15, 2014