Optimization of row-wise histogram construction #3522

Merged (61 commits) on Nov 13, 2020

shiyu1994
Collaborator

This PR optimizes row-wise histogram construction. For the dense multi-value bin, the original bins are stored without offsets added, which saves memory when the maximum number of bins per feature is small. For both dense and sparse multi-value bins, we try to use single-precision floating point in the histogram buffers and use 256-bit AVX instructions to speed up histogram construction.
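
To illustrate the dense case described above, here is a minimal scalar sketch of row-wise histogram construction in which bins are stored without offsets and the per-feature offset is added only when accumulating. The function name `BuildRowWiseHistogram`, the row-major layout, and the interleaved gradient/hessian buffer are assumptions made for this example and are not taken from the PR's code; the AVX path is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch, not LightGBM's actual implementation.
// Each row stores one raw bin index per feature (no offset added), so a
// uint8_t is enough when every feature has at most 256 bins.
// The returned histogram holds single-precision (gradient sum, hessian sum)
// pairs, one pair per global bin.
std::vector<float> BuildRowWiseHistogram(
    const std::vector<uint8_t>& bins,      // num_rows * num_features, row-major
    const std::vector<uint32_t>& offsets,  // first global bin of each feature
    uint32_t total_bins,
    const std::vector<float>& gradients,
    const std::vector<float>& hessians) {
  const size_t num_features = offsets.size();
  const size_t num_rows = gradients.size();
  assert(hessians.size() == num_rows);
  assert(bins.size() == num_rows * num_features);

  std::vector<float> hist(2 * static_cast<size_t>(total_bins), 0.0f);
  for (size_t i = 0; i < num_rows; ++i) {
    const uint8_t* row = bins.data() + i * num_features;
    for (size_t j = 0; j < num_features; ++j) {
      // The offset is applied here, at accumulation time, instead of being
      // baked into the stored bin value.
      const size_t idx = 2 * (static_cast<size_t>(offsets[j]) + row[j]);
      hist[idx] += gradients[i];
      hist[idx + 1] += hessians[i];
    }
  }
  return hist;
}
```

With stored bins that fit in one byte, the bin matrix takes a quarter of the space of a 32-bit layout, which is the memory saving the description refers to when the per-feature bin count is small.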

src/io/dataset.cpp
}
CHECK(local_offsets.size() == offsets.size());
for (size_t i = 0; i < local_offsets.size(); ++i) {
CHECK(local_offsets[i] == offsets[i]);
Collaborator

What is local_offsets used for? Only for checking?

Collaborator

Are the passed-in offsets already the offsets of the dense features? Why do we need to recompute local_offsets?

Collaborator Author

This is just for checking.
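
For readers following the thread, a consistency check of this kind can be sketched as below. It assumes the offsets are a running sum of per-feature bin counts, which is an assumption for illustration only; the helper names `ComputeOffsets`, `OffsetsMatch`, and `VerifyOffsets` are hypothetical and do not appear in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: recompute offsets as a prefix sum of per-feature bin
// counts. The result has one more entry than the input; the last entry is
// the total number of bins.
std::vector<uint32_t> ComputeOffsets(
    const std::vector<uint32_t>& num_bins_per_feature) {
  std::vector<uint32_t> offsets;
  offsets.reserve(num_bins_per_feature.size() + 1);
  uint32_t running = 0;
  offsets.push_back(running);
  for (uint32_t n : num_bins_per_feature) {
    running += n;
    offsets.push_back(running);
  }
  return offsets;
}

// Same comparison as the two CHECK calls in the reviewed snippet:
// equal sizes first, then element-wise equality.
bool OffsetsMatch(const std::vector<uint32_t>& local_offsets,
                  const std::vector<uint32_t>& offsets) {
  if (local_offsets.size() != offsets.size()) return false;
  for (size_t i = 0; i < local_offsets.size(); ++i) {
    if (local_offsets[i] != offsets[i]) return false;
  }
  return true;
}

// Debug-only guard: aborts on mismatch, compiled out when NDEBUG is defined.
void VerifyOffsets(const std::vector<uint32_t>& local_offsets,
                   const std::vector<uint32_t>& offsets) {
  assert(OffsetsMatch(local_offsets, offsets));
}
```

Since the recomputed values serve only as a guard, a check like this can be dropped once the passed-in offsets are trusted, which is the direction the review questions point in.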

sum_dense_ratio /= static_cast<double>(most_freq_bins.size());
CHECK(local_offsets.size() == offsets.size());
for (size_t i = 0; i < local_offsets.size(); ++i) {
CHECK(local_offsets[i] == offsets[i]);
Collaborator

The same as above.

@shiyu1994
Collaborator Author

Updated row-wise and sep-row-wise timings. The master branch uses row-wise. N/A in the sep-row-wise columns means that sep-row-wise degenerates to row-wise for that dataset.

| Dataset | Row-wise time | Sep-row-wise time | Master row-wise time | Row-wise speedup | Sep-row-wise speedup |
|---|---|---|---|---|---|
| higgs | 137.19±1.81 | N/A | 155.33±2.60 | 1.13 | N/A |
| yahoo | 116.79±0.61 | 108.46±3.87 | 146.13±1.99 | 1.25 | 1.35 |
| msltr | 149.59±0.94 | 169.43±2.30 | 163.52±1.29 | 1.09 | 0.97 |
| dataexpo_onehot | 70.97±0.60 | N/A | 78.62±1.47 | 1.11 | N/A |
| allstate | 174.22±2.97 | 185.05±2.81 | 182.61±2.44 | 1.05 | 0.99 |
| adult | 0.99±0.02 | 1.33±0.06 | 1.17±0.04 | 1.18 | 0.88 |
| amazon | 0.70±0.01 | N/A | 0.71±0.04 | 1.02 | N/A |
| appetency | 1.65±0.08 | 1.94±0.11 | 1.62±0.11 | 0.98 | 0.84 |
| click | 18.74±0.51 | N/A | 19.24±0.68 | 1.03 | N/A |
| internet | 0.33±0.01 | 0.36±0.03 | 0.43±0.02 | 1.28 | 1.20 |
| kick | 3.68±0.10 | N/A | 3.78±0.09 | 1.03 | N/A |
| upselling | 3.23±0.12 | 3.59±0.15 | 3.11±0.20 | 0.96 | 0.87 |
| nips_b | 58.80±0.99 | N/A | 62.93±0.66 | 1.07 | N/A |
| nips_c | 8.41±0.24 | 9.71±0.44 | 8.71±0.30 | 1.04 | 0.90 |
| year | 39.27±0.61 | N/A | 46.94±0.61 | 1.20 | N/A |
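
The speedup columns above appear to be the master time divided by the new time, so values above 1 mean the new code is faster than master. This reading is inferred from the reported numbers rather than stated in the comment, and the function name `Speedup` below is hypothetical.

```cpp
#include <cassert>

// Assumed definition: speedup = master time / candidate time.
// A result above 1.0 means the candidate is faster than master.
double Speedup(double master_time, double candidate_time) {
  assert(candidate_time > 0.0);
  return master_time / candidate_time;
}
```

For example, the yahoo row gives `Speedup(146.13, 116.79)` for row-wise and `Speedup(146.13, 108.46)` for sep-row-wise, which round to the 1.25 and 1.35 shown in the table.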

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023