Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] Fix bug in data distributed learning with local empty leaf #4185

Merged
merged 1 commit into from
Apr 22, 2021

Conversation

shiyu1994
Copy link
Collaborator

In data distributed learning (including both data and voting modes), when the value distribution of a feature differs too much across machines (for example, two machines have totally nonoverlapped ranges of the same feature), there can be a leaf which contains 0 local data on a machine during training.

(Note that this happens more easily when pre_partition=true. With pre_partition=false, the data partition will be random and feature value distributions across machines are likely to be similar.)

In that case, the histogram buffers of that leaf won't be cleared because the ConstructHistograms method in Dataset will simply exit.

inline void ConstructHistograms(
const std::vector<int8_t>& is_feature_used,
const data_size_t* data_indices, data_size_t num_data,
const score_t* gradients, const score_t* hessians,
score_t* ordered_gradients, score_t* ordered_hessians,
TrainingShareStates* share_state, hist_t* hist_data) const {
if (num_data <= 0) {
return;
}

Thus the histogram clear code in this method is not executed.

LightGBM/src/io/dataset.cpp

Lines 1192 to 1193 in 6ad3e6e

std::memset(reinterpret_cast<void*>(data_ptr), 0,
num_bin * kHistEntrySize);

But still, with data distributed learning, the histograms will be transferred to other machines to get the global histograms, for example, in data mode, we have
#pragma omp parallel for schedule(static)
for (int feature_index = 0; feature_index < this->num_features_; ++feature_index) {
if (this->col_sampler_.is_feature_used_bytree()[feature_index] == false)
continue;
// copy to buffer
std::memcpy(input_buffer_.data() + buffer_write_start_pos_[feature_index],
this->smaller_leaf_histogram_array_[feature_index].RawData(),
this->smaller_leaf_histogram_array_[feature_index].SizeOfHistgram());
}
// Reduce scatter for histogram
Network::ReduceScatter(input_buffer_.data(), reduce_scatter_size_, sizeof(hist_t), block_start_.data(),
block_len_.data(), output_buffer_.data(), static_cast<comm_size_t>(output_buffer_.size()), &HistogramSumReducer);

Since the histogram buffer is not cleared, content from previous iteration remains in the buffer, which results in a wrong global histogram after synchronization.

This PR is to guarantee that under data distributed learning, when an empty local leaf appears, its histogram content is cleared before synchronization. And it should fix issue #4026 and potentially issue #4178.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explanation and changes make sense to me, thank you very much for the thorough explanation!

@freetz-tiplu
Copy link

This fix does fix the case for data-parallel training but there are still some errors in the voting-parallel case as seen in #4414

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants