-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature request: timeout argument for client.sample
#18
Comments
We have worked on this a couple of times and we want to resolve it. We do not have an ETA. |
The Reverb datasets do allow a |
I had to move away from reverb for now since I had trouble working with states with variable shape (I'm not sure if that is supported but it seemed possible to me at time) but I'll can try to recall as best I can the context for this. If you're referring to That being said, I want to acknowledge that it's very possible that I was trying to use reverb to do things it wasn't meant to support! I wouldn't be offended if that was the gist of your reply ;) |
The new Trajectory{Writer,Dataset} do provide more flexibility; it's also possible to have variable shaped tensors as long as you're not concatenating them into time trajectories. That said, the python client sampler lacks a timeout argument - only the Datasets allow it. So I'll leave this open until we propagate the |
Is this still pending? Can I look into this? |
The FR is still open so feel free to take a look! Just a warning though that the implementation is a bit involved so it isn't quite as just "propagating an argument". |
Hi @acassirer, I sat last night and tried to understand that the problem encountered in the #4 is a common one in distributed systems or client-server architectures: how to handle timeouts gracefully when the server goes offline or becomes unresponsive. When using the Reverb library, as the example shows, the client can indeed get blocked indefinitely if the server crashes or is terminated. Our current workaround was using the The provided suggestion from the supposed response regarding using The updated In this FR do, we have to modify the https://github.com/deepmind/reverb/blob/58f5f018082860caa4057d24d75d725709dcd2bb/reverb/client.py#L345 def sample(
timeout: Optional[float] = None, somewhat like this and then maybe we have to modify the RPC call related to this method? |
Hey, The sampler in the c++ layer have to be updated and the pybind layer have to be modified accordingly. The RPC call might have to change in two ways:
|
std::pair<int64_t, absl::Status> FetchSamples(
internal::Queue<std::unique_ptr<Sample>>* queue, int64_t num_samples,
absl::Duration rate_limiter_timeout) override {
std::unique_ptr<grpc::ClientReaderWriterInterface<SampleStreamRequest,
SampleStreamResponse>>
the timeout is already being set - request.mutable_rate_limiter_timeout()->set_milliseconds(
NonnegativeDurationToInt64Millis(rate_limiter_timeout)); This is what I modified in the sampler.cc std::pair<int64_t, absl::Status> FetchSamples(
internal::Queue<std::unique_ptr<Sample>>* queue, int64_t num_samples,
absl::Duration rate_limiter_timeout) override {
std::unique_ptr<grpc::ClientReaderWriterInterface<SampleStreamRequest,
SampleStreamResponse>>
stream;
{
absl::MutexLock lock(&mu_);
if (closed_) {
return {0, absl::CancelledError("`Close` called on Sampler.")};
}
context_ = std::make_unique<grpc::ClientContext>();
context_->set_wait_for_ready(false);
// Setting the deadline for the gRPC context
context_->set_deadline(absl::ToChronoTime(absl::Now() + rate_limiter_timeout));
stream = stub_->SampleStream(context_.get());
}
|
That does indeed look sensible with the exception that the time it takes to connect is not zero so there is a potential issue when you establish a connection only to have the gRPC deadline expire before the rate limiter returns a valid sample. This would result in the data being lost as it is successfully sampled from the table but never returned to the caller. |
So should I do something like this? // Buffer time to account for connection overhead
constexpr auto CONNECTION_BUFFER_TIME = std::chrono::milliseconds(50); // Or some other suitable value
// Setting the deadline for the gRPC context
context_->set_deadline(absl::ToChronoTime(absl::Now() + rate_limiter_timeout + CONNECTION_BUFFER_TIME));
|
Yes I think something like that would be reasonable. The important thing will be to add test coverage in the C++ layer and Python layer. |
So, beside the test coverage, are there any other changes, do I have to make changes? I am also opening the PR so that you can guide me better. =) |
You would have to change the pybind layer as well of course in order to expose it to Python. Then there is the story with the datasets in which this MUST NOT be enabled. Then the test coverage will show more of whether this solution is working or not. |
Reading about pybind11 https://buildmedia.readthedocs.org/media/pdf/pybind11/stable/pybind11.pdf |
Similarly to #4, it would be useful to be able to back out of sampling without needing to wrap things in a thread or use an executor. I agree that in many cases, you'd want to sample asynchronously to maximize throughput, but there are cases where the predictability and simplicity are preferable even if comes at the expense of efficiency (e.g., in research). A timeout argument would simplify the synchronous setting without sacrificing safeguards from indefinite waiting.
The text was updated successfully, but these errors were encountered: