Issue with call to c_api Predict functions in multi threaded way from golang #3751

Closed
tarunreddy1018 opened this issue Jan 12, 2021 · 58 comments

tarunreddy1018 commented Jan 12, 2021

How we are using LightGBM:

We use the LightGBM c_api in our service, but the c_api is called from golang: we have a golang wrapper around the c_api and invoke its functions via cgo.

We take the “lib_lightgbm.so” library file from the GitHub releases section and use it.

Version of LightGBM being used:

3.1.1 (Observed with all versions >= 3.0.0)

LightGBM component:

C++ Library

Environment info:

Operating System: Observed on both Linux and MacOS

Architecture: x86_64

CPU model: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz

C++ compiler version: gcc version 7.2.0 (Debian 7.2.0-1)

CMake version: 3.9.0

GLIBC version: ldd (Debian GLIBC 2.28-10) 2.28

Context:

Starting with LightGBM 3.0.0, the Predict functions were updated so that only the parts that actually need exclusive access take a "unique_lock", while the purely computational parts take a "shared_lock".

This is unlike earlier versions such as 2.3.1, where a "lock_guard" was taken at the beginning of the predict function itself, so only one thread could execute the predict function at a time. That did not scale well in production environments where we run many predictions in parallel.

So we decided to upgrade to the newer LightGBM 3.1.1, which really improved performance, since multiple threads can now execute the predict function in parallel thanks to the "shared_lock".

Issue:

However, we now saw that the predictions were not consistent. That is, with many threads invoking the predict function in parallel at scale, giving the service the same input multiple times produced different predictions on different calls.

This was not an issue with the earlier LightGBM version we used, 2.3.1; all predictions were consistent with that version, probably because of the lock it had in the predict function, which would have masked the problem.

To test this quickly, we put a lock around the call to the predict function, and all predictions were then consistent and as expected.
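For illustration, a minimal sketch of that quick workaround (our own stop-gap, not LightGBM code): every call into the C API predict function is funnelled through one mutex, so predictions are fully serialized. In our service the equivalent lock lives in the golang wrapper; a C++ version would look roughly like this:

#include <functional>
#include <mutex>

// Stop-gap: serialize every prediction with a single process-wide mutex.
// Correctness is restored at the cost of running one prediction at a time.
static std::mutex predict_mutex;

void PredictSerialized(const std::function<void()>& do_predict) {
  std::lock_guard<std::mutex> guard(predict_mutex);
  do_predict();  // e.g. a call to LGBM_BoosterPredictForMatSingleRow(...)
}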

So we thought there could be some race condition in the code, or that LightGBM was not meant to run predictions in parallel. To investigate further, we started looking into the c_api predict function and found that it has a vector of vectors from which each thread picks its own vector using the thread id (tid), copies the input data into it, and passes it to the predict function. Once the predict call completes, it clears the vector.

Definition of Vector of Vectors:
std::vector<std::vector<double, Common::AlignmentAllocator<double, kAlignedSize>>> predict_buf_

Code where Vector of Vectors is used:

predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,
                                             double* output)  {
          int tid = omp_get_thread_num();
          if (num_feature_ > kFeatureThreshold &&
              features.size() < KSparseThreshold) {
            auto buf = CopyToPredictMap(features);
            boosting_->PredictByMap(buf, output, &early_stop_);
          } else {
            CopyToPredictBuffer(predict_buf_[tid].data(), features);
            boosting_->Predict(predict_buf_[tid].data(), output, &early_stop_);
            ClearPredictBuffer(predict_buf_[tid].data(),
                               predict_buf_[tid].size(), features);
          }
        };

So we suspected that the problem was arising from this vector of vectors, tried logging the tid to a file, and surprisingly saw that all of the calls were logging "tid" as "0".

This means that all the threads were using the same vector, the one at "tid" "0", which led to a race condition.
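For context on that observation (this is our assumption at this point): omp_get_thread_num() returns a non-zero id only when called from inside an active OpenMP parallel region; called from an ordinary application thread (for example a goroutine entering the library through cgo), or with num_threads=1, it returns 0, so every caller would land on predict_buf_[0]. A minimal illustration of that OpenMP behaviour:

#include <cstdio>
#include <omp.h>

int main() {
  // Outside any parallel region the calling thread always reports id 0.
  std::printf("outside parallel region: tid = %d\n", omp_get_thread_num());

  // Inside a parallel region each OpenMP worker reports its own id 0..N-1.
  #pragma omp parallel
  {
    std::printf("inside parallel region: tid = %d\n", omp_get_thread_num());
  }
  return 0;
}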

Could you please let us know whether the latest update was not meant to support predictions in parallel, or whether this could be an issue with our golang code calling the c_api (probably with OpenMP)?

StrikerRUS (Collaborator)

Linking #2760.

cc @JoanFM

StrikerRUS added the bug label Jan 12, 2021
JoanFM (Contributor) commented Jan 12, 2021

I will try to take a look when I find some time, but I am not sure I will be able to soon.

JoanFM (Contributor) commented Jan 12, 2021

From the description it seems the thread_id is the source of the problem then? So could it be that this is more of a preexisting issue that #2760 just exposed?

JoanFM (Contributor) commented Jan 12, 2021

Could this be related to https://forum.openmp.org//viewtopic.php?f=3&t=653&start=0 ?

tarunreddy1018 (Author)

@JoanFM Let me check with some sample C++ code whether my issue is the same as the one described in the link. I will get back to you on this soon.

tarunreddy1018 (Author)

Hi @AlbertoEAF, can you check this thread? That other one was an older thread which I had posted before I made these observations.

JoanFM (Contributor) commented Jan 13, 2021

Did you manage to reproduce the issue as in the thread, @tarunreddy1018?

tarunreddy1018 (Author) commented Jan 13, 2021

Hi @JoanFM, I have not yet tried reproducing the issue; I will first try running a few C++ samples as mentioned in the link. That will help give a clearer picture. If the thread_id (tid) still comes out as 0, then the issue is something related to our OpenMP library linking and not really with the c_api. I will check that and get back to you on this thread.

Meanwhile, can you let me know: starting from v3.0.0 there were changes in the locking strategy, right, with "shared_locks" and "unique_locks"? Seeing "shared_locks" in the predict function call made me believe that we can call the predict functions in parallel, which should give a performance improvement. So, were the predict functions written with the intent that they be called in parallel?

I mean, was any test done that calls the predict function in a multi-threaded way and ensures thread safety? Has anyone else faced this issue, or is it just me? There are two potential issues here:

  1. An issue with OpenMP linking and its usage from the golang code calling the c_api.
  2. An issue in the c_api itself (a thread-safety issue).

JoanFM (Contributor) commented Jan 13, 2021

Hi @JoanFM, I have not yet tried reproducing the issue; I will first try running a few C++ samples as mentioned in the link. That will help give a clearer picture. If the thread_id (tid) still comes out as 0, then the issue is something related to our OpenMP library linking and not really with the c_api. I will check that and get back to you on this thread.

Meanwhile, can you let me know: starting from v3.0.0 there were changes in the locking strategy, right, with "shared_locks" and "unique_locks"? Seeing "shared_locks" in the predict function call made me believe that we can call the predict functions in parallel, which should give a performance improvement. So, were the predict functions written with the intent that they be called in parallel?

Hey @tarunreddy1018 ,

I added the shared and unique locks under the assumption that const functions carry no risk of changing the object, while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run in a multithreaded way; however, I am not so sure about the design and the robustness that was put into it from the beginning.

When I opened the PR I intended to use this in a project, but I never got to the point of using it in a multithreaded way, so I did not fully test the behavior.

AlbertoEAF (Contributor) commented Jan 13, 2021

Thanks for the very detailed report @tarunreddy1018 ;)

Linking #2760.

cc @JoanFM

Great catch @StrikerRUS! I was looking for that

From the description it seems the thread_id is the source of the problem then? So could it be that this is more of a preexisting issue that #2760 just exposed?

I think @JoanFM might be dead right: in my experience with lightgbmlib 2.3.150 (which corresponds to some commit between v2.3.1 and v2.3.2), these random scores already existed when using Java. So much so that we had to use synchronization in our Java provider code.

Here is the proof: we had to wrap our provider's Predict calls (line 147) in a single-access region (no parallelism):
https://github.com/feedzai/feedzai-openml-java/blob/24a62dca18ddd5cf538256f1acc621cb3b092220/openml-lightgbm/src/main/java/com/feedzai/openml/provider/lightgbm/LightGBMSWIG.java#L138

and we used at the time lightgbmlib 2.3.150 (https://github.com/feedzai/feedzai-openml-java/blob/24a62dca18ddd5cf538256f1acc621cb3b092220/openml-lightgbm/pom.xml#L26)

If we didn't do the synchronization (no parallelism) we would get random scores.

I think @tarunreddy1018's observations regarding those thread ids are dead right too and could be a good lead for investigation.
I just don't understand why he didn't have such issues with the old code when I did, but clearly the issues were already there.


I never looked into LightGBM's multi-threading implementation, so here are my 2 cents - take it with a grain of salt:

In the end it means that, as a stop-gap until the issue is properly fixed, we should probably go back to unique locks (if that fixes it) and use shared locks only on regions that truly make read-only calls. That stop-gap might be useless, though, if there are pre-existing issues, as 2.3.150 seemed to show.

This thread gives more info that further proves the issue with predictions and multi-threading: #3675 (comment).

AlbertoEAF (Contributor) commented Jan 13, 2021

I added the shared and unique locks under the assumption that const functions carry no risk of changing the object, while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run in a multithreaded way; however, I am not so sure about the design and the robustness that was put into it from the beginning.

A const method in C++ can still manipulate a lot of data. On top of that, in LightGBM we do a lot of pointer manipulation, so any guarantees from constness go out the door.
We probably need to do a much more thorough review of the code.
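A tiny illustration of the point (not LightGBM code, just the language rule): a const member function may still freely write through a pointer member, so const alone says nothing about thread safety.

struct Example {
  double* scratch;             // points at storage shared with other threads
  void Fill(double v) const {  // compiles fine: only the pointer itself is const,
    scratch[0] = v;            // the data it points to can still be mutated
  }
};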

AlbertoEAF (Contributor)

@tarunreddy1018 can you just reformat your original question to have the code block encased by a line starting with 3 backquotes and cpp (i.e. "```cpp"), followed by the code block and then terminated with a line with 3 backquotes: "```"?

Thanks :)

JoanFM (Contributor) commented Jan 13, 2021

I added the shared and unique locks under the assumption that const functions carry no risk of changing the object, while keeping unique locks for non-const functions. So from this point of view I do not see why it would be unsafe to run in a multithreaded way; however, I am not so sure about the design and the robustness that was put into it from the beginning.

I wonder whether, with so many macros, pointer manipulations and shared variables, a method being const is any guarantee of constness at all. Most probably it isn't, and the code needs to be reviewed more carefully.

I remember looking at the code and it seemed safe, but sure, I may have missed something.

tarunreddy1018 (Author)

@AlbertoEAF reformatted

AlbertoEAF (Contributor)

I remember looking at the code and it seemed safe, but sure, I may have missed something.

Yeah, and from my point of view the problems start even before that. I don't have time to look into this, but at least @tarunreddy1018's thread id observations might be a hint for someone to explore.

In the meantime, @tarunreddy1018, I'd suggest putting a lock in your code at least for now if this is urgent, as I don't think this can easily be fixed soon. It's still faster to use v3.1 than v2.3.1 due to other behind-the-scenes improvements and the new APIs such as the *Fast() prediction methods.

tarunreddy1018 (Author)

@AlbertoEAF So you would suggest that I change the c_api myself to add a lock in the predict call, build the library from that, and use it, right? That should still give some performance improvement compared to 2.3.1.

tarunreddy1018 (Author) commented Jan 13, 2021

@AlbertoEAF Sure, I will check the thread id thing as soon as I get some time.

AlbertoEAF (Contributor)

@AlbertoEAF So you would suggest that I change the c_api myself to add a lock in the predict call, build the library from that, and use it, right? That should still give some performance improvement compared to 2.3.1.

@tarunreddy1018 I was suggesting doing the lock on the golang side around predicts, much easier :)

Of course you can always modify the code and re-compile LightGBM, but I don't think that will get you more performance than the solution above. Keeping the lock in golang also lets you try out different library versions with and without it, meaning that when a fixed version comes out you only need to remove your lock on the golang side ;)

tarunreddy1018 (Author)

Yes, thanks. I will keep this thread posted as soon as I have any observations. That should help @AlbertoEAF :)

JoanFM (Contributor) commented Jan 13, 2021

I had a slightly closer look at the code, and it seems that this predict_buf_ is resized to OMP_NUM_THREADS(); it would be interesting to know which value you get there.

My question is how this OMP multithreading was being used at all if there was a lock guard protecting the c_api.

I understand you are calling this function, am I right?

  void Predict(int start_iteration, int num_iteration, int predict_type, int nrow, int ncol,
               std::function<std::vector<std::pair<int, double>>(int row_idx)> get_row_fun,
               const Config& config,
               double* out_result, int64_t* out_len) const {
    SHARED_LOCK(mutex_);
    auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);
    bool is_predict_leaf = false;
    bool predict_contrib = false;
    if (predict_type == C_API_PREDICT_LEAF_INDEX) {
      is_predict_leaf = true;
    } else if (predict_type == C_API_PREDICT_CONTRIB) {
      predict_contrib = true;
    }
    int64_t num_pred_in_one_row = boosting_->NumPredictOneRow(start_iteration, num_iteration, is_predict_leaf, predict_contrib);
    auto pred_fun = predictor.GetPredictFunction();
    OMP_INIT_EX();
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrow; ++i) {
      OMP_LOOP_EX_BEGIN();
      auto one_row = get_row_fun(i);
      auto pred_wrt_ptr = out_result + static_cast<size_t>(num_pred_in_one_row) * i;
      pred_fun(one_row, pred_wrt_ptr);
      OMP_LOOP_EX_END();
    }
    OMP_THROW_EX();
    *out_len = num_pred_in_one_row * nrow;
  }

And the code snippet you shared comes from the predictor. Since this predictor seems to live on the stack of every thread, I do not see how the predict_buf_ can be affected by different threads.

I am sure I must be missing something ...

Could it be that somehow 'regular' multithreading and 'OpenMP' multithreading do not play well together?

I am sorry @tarunreddy1018, but in the end I did not implement and apply this feature in production after all.

JoanFM (Contributor) commented Jan 14, 2021

Another observation I found:

It feels like what OpenMP is trying to do and what the unique/shared locks are trying to do are two different parallelization models.

OpenMP tries to parallelize one single prediction as much as possible across multiple trees, thus optimizing the latency of a single prediction, while the shared/unique locks try to make it possible to use the same predictor from multiple threads, which does not optimize a single prediction but could improve throughput.

Could that be a hint at the possible problem?

tarunreddy1018 (Author) commented Jan 14, 2021

Hi @JoanFM, the methods we have been trying are LGBM_BoosterPredictForMatSingleRow and LGBM_BoosterPredictForMatSingleRowFast, and it looks like these two methods do not create a new Predictor object for each thread. Instead they take a "unique_lock", let only one thread create a new Predictor object, and all the other threads then use that Predictor once it has been created. This in turn means the predict_buf_ vector living inside this Predictor object is used by all the other threads, which makes it unsafe. Moreover, the size of this predict_buf_ is set to OMP_NUM_THREADS(), which I have checked is equal to the num_threads parameter value that we pass in the config when making the predict call. Since the size of predict_buf_ equals OMP_NUM_THREADS(), that suggests it is meant to parallelize a single row prediction, not to be shared across multiple predictions, right?

Based on your observation it seems correct that each call creates a new Predictor object, as seen from the code CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);, and as you said that predictor lives on the stack of every thread. But the Predict function you posted above is not the one used by the two predict methods I mentioned. This makes me believe that somehow the Predictor object is not unique to each call but shared, and that this is what leads to the thread-unsafety.

This is the call chain through which the Predictor object is created by the two methods mentioned above:

// Called inside "LGBM_BoosterPredictForMatSingleRow":
fastConfig_ptr->booster->SetSingleRowPredictor(start_iteration, num_iteration, predict_type, fastConfig_ptr->config);

// Called inside "SetSingleRowPredictor":
single_row_predictor_[predict_type].reset(new SingleRowPredictor(predict_type, boosting_.get(),
                                                                 config, start_iteration, num_iteration));

// Called inside the constructor of "SingleRowPredictor":
predictor_.reset(new Predictor(boosting, start_iter, iter_, is_raw_score, is_predict_leaf, predict_contrib,
                               early_stop_, early_stop_freq_, early_stop_margin_));

void SetSingleRowPredictor(int start_iteration, int num_iteration, int predict_type, const Config& config) {
      UNIQUE_LOCK(mutex_)
      if (single_row_predictor_[predict_type].get() == nullptr ||
          !single_row_predictor_[predict_type]->IsPredictorEqual(config, num_iteration, boosting_.get())) {
        single_row_predictor_[predict_type].reset(new SingleRowPredictor(predict_type, boosting_.get(),
                                                                         config, start_iteration, num_iteration));
      }
  }

This clearly shows that the SingleRowPredictor above is created only once, because the code only creates it if it has not been created before.

This makes all threads across different prediction calls use the same Predictor object, which in turn makes predict_buf_ unsafe.

JoanFM (Contributor) commented Jan 14, 2021

So maybe the problem is that some of the predict functions can be thread safe while others can't.

I do think the OMP num_threads is supposed to apply within a single prediction, but I don't have much expertise on that.

tarunreddy1018 (Author)

I was thinking: can we use the same logic as in the other predict functions, which are thread-safe? That is, can the two single-row predict functions also create a new Predictor object for each call, as the other predict functions do? I believe that would make those two thread safe as well. Right now those two predict functions can only be called one at a time. I am not completely sure whether this is a bug or was the intent from the beginning. If the other predict functions create a new Predictor object for each call, why shouldn't these two do the same?

JoanFM (Contributor) commented Jan 14, 2021

I would agree, but I am not sure whether allocating the buffer on each call would degrade single-threaded performance too much. In any case, it would benefit from more consistency.

But like you, I am not sure whether there were specific intentions behind it.

AlbertoEAF (Contributor)

If the other predict functions create a new Predictor object for each call, why shouldn't these two do the same?

Which ones create a Predictor every time? The versions that are not "SingleRow"?
If so, that makes sense, as creating a Predictor per single prediction can add a large overhead. Remember that in low-latency scenarios memory allocation is one of the biggest costs.

I don't mean it can't be tried, but it would have to be benchmarked, including the single-thread scenario. Maybe it's really the case that the SingleRow versions are not meant to be thread-safe, as that would require too many memory allocations per unit of work.

@guolinke can we have your thoughts on the matter, or those of someone who has dealt with this in the past?

tarunreddy1018 (Author)

@AlbertoEAF Yes; for example LGBM_BoosterPredictForMats, LGBM_BoosterPredictForCSR, and LGBM_BoosterPredictForMat all make a call to this line:

auto predictor = CreatePredictor(start_iteration, num_iteration, predict_type, ncol, config);

This makes me suspect that all these predict functions create a new Predictor object per call. This Predictor object has the predict_buf_ vector of vectors as one of its fields, and each call to these predict functions parallelizes the work among multiple OMP threads, with the number of threads given by the num_threads parameter that we pass when making the predict call.

What I believe is that these methods are meant for larger datasets with many rows of input. Say we specify 8 threads: the call creates 8 OMP threads, and the rows are divided across those 8 threads to speed up the computation.

So the performance gain we get by dividing a large set of rows across a limited number of threads is greater than calling LGBM_BoosterPredictForMatSingleRow for each row individually and creating a Predictor object for each call. I therefore believe the single-row methods were optimized with the assumption that rows are predicted independently one after another, rather than being called in a multi-threaded way.

Also, I think that if we wanted LGBM_BoosterPredictForMatSingleRow to create a Predictor object for each call in order to make it thread safe, we might as well use LGBM_BoosterPredictForMat for single-row prediction with nrow = 1. In fact the single-row predict functions are just a special case of the Mat predict function with one row, but optimized to do only single-row predictions.
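For illustration, a rough sketch of that alternative (untested; the argument list follows the 3.x c_api.h as we read it, so please double-check against your header, and error handling is omitted). Each call goes through the batch entry point with nrow = 1, which creates its own Predictor and can therefore run from several threads at once:

#include <cstdint>
#include <vector>
#include <LightGBM/c_api.h>

// Rough sketch: single-row prediction through the batch entry point.
// Assumes a single-output model; for multi-class models the output buffer
// must hold one value per class (check out_len).
std::vector<double> PredictOneRow(BoosterHandle booster, const std::vector<double>& row) {
  std::vector<double> result(1);
  int64_t out_len = 0;
  LGBM_BoosterPredictForMat(booster,
                            row.data(), C_API_DTYPE_FLOAT64,
                            /*nrow=*/1, /*ncol=*/static_cast<int32_t>(row.size()),
                            /*is_row_major=*/1, C_API_PREDICT_NORMAL,
                            /*start_iteration=*/0, /*num_iteration=*/-1,
                            "num_threads=1",  // one OpenMP thread per call; the callers provide the parallelism
                            &out_len, result.data());
  return result;
}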

tarunreddy1018 (Author)

Hi, I have just verified that the predictions are consistent and as expected in the multi-threaded case with the predict functions other than LGBM_BoosterPredictForMatSingleRow and LGBM_BoosterPredictForMatSingleRowFast. So my guess was right: these single-row predict functions are not thread safe, since all calls share the same Predictor object.

AlbertoEAF (Contributor)

Yes exactly @tarunreddy1018 :)

Also, I think that if we wanted LGBM_BoosterPredictForMatSingleRow to create a Predictor object for each call in order to make it thread safe, we might as well use LGBM_BoosterPredictForMat for single-row prediction with nrow = 1. In fact the single-row predict functions are just a special case of the Mat predict function with one row, but optimized to do only single-row predictions.

True, you can always use the "batch" prediction methods with a single instance if you want to do parallel calls ;). And yes, the single-instance versions are optimized variants that minimize allocation overheads.

I think it would be nice to have confirmation from @guolinke, @StrikerRUS, or another of the main devs that this is not a bug. If it's really the case that they were never meant to be thread-safe, we should add that to the C API documentation as you're not the first person having this issue ;)

tarunreddy1018 (Author) commented Jan 16, 2021

Hi @AlbertoEAF, are all your benchmarks based on putting a unique_lock in the predict function?

I have two suggestions we could try:

  1. Have a unique_lock in the predict function (giving up some parallelism, but good in terms of allocation overhead).
  2. Keep the shared_lock but have each call use its own vector, thus avoiding race conditions (better parallelism, but a slight memory overhead).

I believe the second approach is good because, as per the c_api code, RowFunctionFromDenseMatric_helper already creates a new vector for each call, and RowPairFunctionFromDenseMatric then builds a vector of pairs from it, which is what gets passed to the Predictor object. Instead of taking a vector from predict_buf_ inside the Predictor object and copying our feature values into that buffer, I believe we can afford to create a new vector. (OR) Can we skip the intermediate step of building a vector of pairs in RowPairFunctionFromDenseMatric and directly use the vector returned by RowFunctionFromDenseMatric_helper? I could not find any real use of that step, so I believe we can do without it.

These are just my thoughts; the performance boost we get from keeping a shared lock and making it thread safe looks more promising.

std::function<std::vector<double>(int row_idx)>
RowFunctionFromDenseMatric_helper(const void* data, int num_row, int num_col, int is_row_major) {
  const T* data_ptr = reinterpret_cast<const T*>(data);
  if (is_row_major) {
    return [=] (int row_idx) {
      std::vector<double> ret(num_col);
      auto tmp_ptr = data_ptr + static_cast<size_t>(num_col) * row_idx;
      for (int i = 0; i < num_col; ++i) {
        ret[i] = static_cast<double>(*(tmp_ptr + i));
      }
      return ret;
    };
  } else {
    return [=] (int row_idx) {
      std::vector<double> ret(num_col);
      for (int i = 0; i < num_col; ++i) {
        ret[i] = static_cast<double>(*(data_ptr + static_cast<size_t>(num_row) * i + row_idx));
      }
      return ret;
    };
  }
}
std::function<std::vector<std::pair<int, double>>(int row_idx)>
RowPairFunctionFromDenseMatric(const void* data, int num_row, int num_col, int data_type, int is_row_major) {
  auto inner_function = RowFunctionFromDenseMatric(data, num_row, num_col, data_type, is_row_major);
  if (inner_function != nullptr) {
    return [inner_function] (int row_idx) {
      auto raw_values = inner_function(row_idx);
      std::vector<std::pair<int, double>> ret;
      ret.reserve(raw_values.size());
      for (int i = 0; i < static_cast<int>(raw_values.size()); ++i) {
        if (std::fabs(raw_values[i]) > kZeroThreshold || std::isnan(raw_values[i])) {
          ret.emplace_back(i, raw_values[i]);
        }
      }
      return ret;
    };
  }
  return nullptr;
}

AlbertoEAF added a commit to feedzai/LightGBM that referenced this issue Jan 16, 2021
By using a unique lock instead of the shared lock the timings are very similar,
but predictions are correct.

Even so, in a small C++ benchmark with a very simple LGBM model, using more threads is slower than
the single-thread case. This is probably because, with very small work units,
the lock contention overhead increases.

We should in the future benchmark with more complex models to see if supporting
threading on these calls is worth it in performance gains.

If not, then we could choose not to provide thread-safety and remove
the locks altogether for maximal throughput.

See microsoft#3751 for timings.

See gist for benchmark code:

  https://gist.github.com/AlbertoEAF/5972db15a27c294bab65b97e1bc4c315
JoanFM (Contributor) commented Jan 16, 2021

Maybe the best way is to add a unique lock for now as a quick patch, and then maybe we can add a set of functions for multithreaded usage? Or would that be too much maintenance overhead?

tarunreddy1018 (Author)

@JoanFM Yes, agreed. We could keep the shared_lock or maybe, as you said, have a separate function for multi-threaded usage, if that turns out to give good performance later. For now I will try the shared-lock approach with a new vector for each call and test it for my use case. If it looks good I will report back later, so that some thought can be given to that direction.

AlbertoEAF (Contributor) commented Jan 16, 2021

Hi @tarunreddy1018 @JoanFM :) As you can see from the PR #3771 description and the benchmark code, the change to a unique lock at the PredictSingleRow level is enough to make it fully thread-safe.

Notice that neither RowFunctionFromDenseMatric_helper nor RowPairFunctionFromDenseMatric require locks as they allocate new memory without sharing internal variables.

The lock is only at the PredictSingleRow call, and for the Fast case I even tested putting the lock only before the line
https://github.com/microsoft/LightGBM/blob/master/src/c_api.cpp#L398 as it is guaranteed that the code up to there calls a different function (get_row_fun), but the timing was the same.

So a shared_lock wouldn't help us here unless we go down one extra level of lock granularity, but that is already pretty tricky, as we have virtual functions etc., and it would be pretty hard to maintain.

Does it make sense? Or did I miss something in your points?


Regarding @tarunreddy1018's suggestion of combining RowFunctionFromDenseMatric_helper + RowPairFunctionFromDenseMatric: it could make sense if we really have a useless intermediate representation (I haven't looked at it yet).

tarunreddy1018 (Author)

So a shared_lock wouldn't help us here unless we go down one extra level of lock granularity, but that is already pretty tricky, as we have virtual functions etc., and it would be pretty hard to maintain.

@AlbertoEAF In the above statement, did you mean that the performance you got with the shared lock and the unique lock was almost the same, so a shared lock might not help here?

JoanFM (Contributor) commented Jan 16, 2021

I think the measure that matters is the number of predictions we can do per unit of time, not the speed of an individual prediction, right?

AlbertoEAF (Contributor)

@AlbertoEAF In the above statement, did you mean that the performance you got with the shared lock and the unique lock was almost the same, so a shared lock might not help here?

I meant locking only inside the predictor, exactly where needed in the inner calls of SingleRowPredict, so every statement that is parallelizable could be parallelized.
But yes, I also got the same execution time with unique and shared locks in that PR benchmark (though the shared locks obviously produced wrong scores).

I think the measure that matters is the number of predictions we can do per unit of time, not the speed of an individual prediction, right?

Exactly. Since I do 3M predictions, higher throughput would make the program finish sooner, so measuring total wall time is what we want here.

JoanFM (Contributor) commented Jan 16, 2021

Ok I see @AlbertoEAF thanks!

tarunreddy1018 (Author)

Thanks @AlbertoEAF @JoanFM

AlbertoEAF (Contributor)

Hi again @tarunreddy1018 :)

I tried your idea of merging those two functions. Approach 1 was the laziest: turning the ..._helper() into a function that returns a function (basically computing the offset from the data pointer), which allows us to have a single representation (pairs of (int, value)): 706f2af...feedzai:single-row-predict-fix-lock-faster-RowPairFunctionFromDenseMatric2

You can check the code above.

I got exactly the same timings as with the two intermediate vectors and converting from the first vector to the pairs vector.

I don't think there's much room for improvement here anymore; the bottleneck is somewhere else.

In fact, when I was designing the *Fast() API, initially you could only use a fixed *data pointer for prediction, set at the ...FastInit() call, and you replaced the *data values in place. Even then, that alternative, which didn't allocate memory per call, ran with very little performance difference compared to the current version, which allows using a different data address per prediction (< 5% execution time savings).


True @tarunreddy1018, we could still try the other way around, getting rid of the pairs by changing code inside the Predictor, but I'm wary of that because it changes a lot of code and I don't want to go into all that hassle :) But if you're up for it I can help you get the benchmark going and you can try changing it yourself ;)

AlbertoEAF (Contributor) commented Jan 17, 2021

OK, I just profiled the benchmark, and it's normal that those changes have little to no effect: all the time is already spent in internal calls. Taking the part of the code with the C API predict call LGBM_BoosterPredictForMatSingleRowFast as 100% of execution time, the Predict method accounts for >97% of it.

I used valgrind, so I am not accounting for the mutex contention time, which we've seen is not negligible as we increase the number of threads.

AlbertoEAF (Contributor)

OK, after reading the code over and over I can say two things:

  1. The back-end code is highly coupled between Booster & Predictor. There's no sane way of changing that. So if you want full multi-threaded performance you should instead REMOVE the unique lock from the single-row predictor (thread-unsafe) and build on your side a thread pool with N threads, and 1 Booster and FastConfigHandle PER thread (see the sketch after this list). That guarantees you 100% parallelism :) The biggest overhead is the memory of the duplicated Booster, which honestly is not critical on modern machines, as it's usually < 10 MB.
  2. If your features are not sparse, we're paying 3 data copies per prediction in MatSingleRow: 1 -> from the input dataset "row" to a vector, 2 -> to pairs of (int, value), 3 -> back into the buffer/vector inside the Predictor (the Predictor's buffer). Indeed, it could be optimized.
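A rough sketch of the layout suggested in point 1 (illustrative only; error handling omitted): each worker thread loads its own Booster from the same model file, so predictions never touch shared mutable state and no lock is needed. A FastConfigHandle per worker can then be created with LGBM_BoosterPredictForMatSingleRowFastInit if you use the *Fast() path.

#include <vector>
#include <LightGBM/c_api.h>

struct Worker {
  BoosterHandle booster = nullptr;  // each thread predicts only on its own booster
};

// Build one Booster per worker thread from the same model file.
std::vector<Worker> MakeWorkers(const char* model_path, int n_threads) {
  std::vector<Worker> workers(n_threads);
  for (auto& w : workers) {
    int num_iterations = 0;
    LGBM_BoosterCreateFromModelfile(model_path, &num_iterations, &w.booster);
  }
  return workers;
}
// Thread i then calls the single-row predict functions only on workers[i].booster.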

tarunreddy1018 (Author)

@AlbertoEAF Right, I agree with that. I will try it out as you suggested. Thanks for the information; I guess this is fine for now.

tarunreddy1018 (Author) commented Jan 18, 2021

@AlbertoEAF Hi, I have tried LGBM_BoosterPredictForMatSingleRowFast with a separate vector for each call, instead of using the same predict_buf_ vector for all calls. This made sure that my predictions were consistent and correct.

With this change the performance is really great for us, and creating one extra vector per call did not add much memory overhead either. This was the only change I had to make; everything else (the SingleRowPredictor, Predictor and FastConfigHandle objects) remained the same, with just one object shared by all the calls. It's just that all the parallel calls to LGBM_BoosterPredictForMatSingleRowFast use their own separate vector for their processing.

This is the code I tweaked in predictor.hpp.

Before Change:

predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,
                           double* output) {
          int tid = omp_get_thread_num();
          if (num_feature_ > kFeatureThreshold &&
              features.size() < KSparseThreshold) {
            auto buf = CopyToPredictMap(features);
            boosting_->PredictByMap(buf, output, &early_stop_);
          } else {
            CopyToPredictBuffer(predict_buf_[tid].data(), features);
            boosting_->Predict(predict_buf_[tid].data(), output, &early_stop_);
            ClearPredictBuffer(predict_buf_[tid].data(),
                               predict_buf_[tid].size(), features);
          }
        };

After Change:

predict_fun_ = [=](const std::vector<std::pair<int, double>>& features,
                           double* output) {
          if (num_feature_ > kFeatureThreshold &&
              features.size() < KSparseThreshold) {
            auto buf = CopyToPredictMap(features);
            boosting_->PredictByMap(buf, output, &early_stop_);
          } else {
            std::vector<double> predict_buf_;
            predict_buf_.resize(num_feature_, 0.0f);
            CopyToPredictBuffer(predict_buf_.data(), features);
            boosting_->Predict(predict_buf_.data(), output, &early_stop_);
            //Removes all elements in vector
            predict_buf_.clear();
            //Frees the memory which is not used by the vector
            predict_buf_.shrink_to_fit();
          }
        };

Perhaps we can try this approach: it does not alter other parts of the code, is a minimal change, and gives good performance without much memory overhead.

Each of these parallel calls uses num_threads=1.

AlbertoEAF (Contributor)

Hello @tarunreddy1018,

First of all, it's interesting that you only need a new predict_buf_ and can share everything else; that might be useful :)

But did you try using the regular implementation with the lock and without setting num_threads=1?

I ask this because we might have different "optimal" implementations and use cases. For instance, if the model is not trivially simple, then unless I'm mistaken a prediction is already parallelized internally, so we are comparing several predictions in parallel with 1 thread each versus 1 prediction at a time with multiple threads each. I'm not sure which is better, and it might vary according to the number of trees, average tree depth and number of features.

In this implementation you're removing that opportunity for parallelism inside the predict call, right?

Also take into account that LightGBM is actually optimized for batch prediction, which also relies on this code, so this might degrade its performance and probably would not be acceptable to the core devs.

By the way, due to C++'s RAII, predict_buf_ will be deleted automatically after the closing brace, so no need to call clear & shrink_to_fit. ;)

JoanFM (Contributor) commented Jan 18, 2021

Hello @tarunreddy1018,

First of all, it's interesting that you only need a new predict_buf_ and can share everything else; that might be useful :)

But did you try using the regular implementation with the lock and without setting num_threads=1?

I ask this because we might have different "optimal" implementations and use cases. For instance, if the model is not trivially simple, then unless I'm mistaken a prediction is already parallelized internally, so we are comparing several predictions in parallel with 1 thread each versus 1 prediction at a time with multiple threads each. I'm not sure which is better, and it might vary according to the number of trees, average tree depth and number of features.

In this implementation you're removing that opportunity for parallelism inside the predict call, right?

Also take into account that LightGBM is actually optimized for batch prediction, which also relies on this code, so this might degrade its performance and probably would not be acceptable to the core devs.

By the way, due to C++'s RAII, predict_buf_ will be deleted automatically after the closing brace, so no need to call clear & shrink_to_fit. ;)

I agree with @AlbertoEAF. I think these new predict functions are suited to running multiple predictions on different threads, while OpenMP tries to run one prediction across multiple threads.

Maybe these two modes could be distinguished when choosing which predict function to use, but the maintenance overhead may be a little too much.

tarunreddy1018 (Author)

@AlbertoEAF Yes, I agree with that. In my use case having num_threads=1 worked well, and my change is tailored to that: it removes the possibility of parallelizing individual predictions with OpenMP, which might not be ideal for other scenarios. So the change is not general and is biased towards a specific case.

StrikerRUS pushed a commit that referenced this issue Jan 21, 2021
By using a unique lock instead of the shared lock the timings are very similar,
but predictions are correct.

Even so, in a small C++ benchmark with a very simple LGBM model, using more threads is slower than
the single-thread case. This is probably because, with very small work units,
the lock contention overhead increases.

We should in the future benchmark with more complex models to see if supporting
threading on these calls is worth it in performance gains.

If not, then we could choose not to provide thread-safety and remove
the locks altogether for maximal throughput.

See #3751 for timings.

See gist for benchmark code:

  https://gist.github.com/AlbertoEAF/5972db15a27c294bab65b97e1bc4c315
AlbertoEAF (Contributor)

@tarunreddy1018 can we close this issue?

If you want to see improvements regarding threading, maybe open a new issue as a feature request for threading support without locking, and mention this issue and #3675 so we keep all the details we discussed.

tarunreddy1018 (Author) commented Jan 27, 2021

@AlbertoEAF Sure, you can close this issue for now. Thanks

AlbertoEAF (Contributor)

@tarunreddy1018 I don't have permissions; it must be you who closes it :)

StrikerRUS (Collaborator)

Fixed via #3771.

github-actions (bot)

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023