feat: Improve `rg.log` function #2640

Conversation
- Accept `num_threads` to log batches concurrently
- All batches will be processed even if errors were found
- For those with errors, a more descriptive error will be raised
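As a rough illustration of that behavior, here is a minimal sketch of concurrent batch logging with deferred error reporting (`log_in_batches`, `log_batch`, and `BatchLogError` are hypothetical names, not Argilla internals):

```python
# Hypothetical sketch: log every batch, collect failures, raise one
# descriptive error at the end. Not Argilla's actual implementation.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Callable, Dict, List


class BatchLogError(Exception):
    """Raised after every batch has run, describing the ones that failed."""


def log_in_batches(
    batches: List[List[Any]],
    log_batch: Callable[[List[Any]], None],
    num_threads: int = 1,
) -> None:
    errors: Dict[int, Exception] = {}
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        futures = {executor.submit(log_batch, batch): i for i, batch in enumerate(batches)}
        for future in as_completed(futures):
            try:
                future.result()
            except Exception as error:  # record the failure, keep processing the rest
                errors[futures[future]] = error
    if errors:
        detail = "; ".join(f"batch {i}: {err!r}" for i, err in sorted(errors.items()))
        raise BatchLogError(f"{len(errors)}/{len(batches)} batches failed: {detail}")
```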
Codecov Report

Patch coverage: …

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           develop    #2640      +/-   ##
===========================================
- Coverage    94.11%   93.87%   -0.24%
===========================================
  Files          170      170
  Lines         8732     8722      -10
===========================================
- Hits          8218     8188      -30
- Misses         514      534      +20
===========================================
```

… and 3 files with indirect coverage changes.
- Increase default timeout
- Retry when an `httpx.TransportError` occurs
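A minimal sketch of what retrying on `httpx.TransportError` could look like (the helper name, defaults, and timeout value are assumptions, not the PR's actual code):

```python
# Hedged sketch of retry-on-transport-error with a simple backoff.
import time

import httpx


def post_with_retries(
    client: httpx.Client,
    url: str,
    json: dict,
    max_retries: int = 3,
    backoff: float = 2.0,
) -> httpx.Response:
    for attempt in range(max_retries + 1):
        try:
            # Transport errors (timeouts, connection resets) trigger a retry;
            # HTTP status errors are raised immediately.
            response = client.post(url, json=json, timeout=httpx.Timeout(60.0))
            response.raise_for_status()
            return response
        except httpx.TransportError:
            if attempt == max_retries:
                raise
            time.sleep(backoff * (attempt + 1))  # grow the wait on each retry
```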
Works for me locally. I'm glad to see that the hacky … Lastly, I think we're still missing tests for …
Thanks for your feedback @tomaarsen!!
Since the … I think the approach of launching a separate one when …
Yes, I need to add some tests to check this flow.
Indeed, …
```diff
-                name=name_of_copy,
-                target_workspace=workspace,
-            ),
+            json_body=CopyDatasetRequest(name=name_of_copy, target_workspace=workspace),
         )

     def delete(self, name: str, workspace: Optional[str] = None):
```
I would also like to see the `backoff` variables here.
We should apply changes step by step. We can consider adding a backoff mechanism to another method in a separate PR. Otherwise, a lot of changes will be included in the same PR, which can be a great bug farm. :-)
I added it to `rg.load`, which seems the most relevant to me.
For `rg.load`, things could be a bit different. For instance, we should decrease the batch size, or we should prefetch some data before splitting and parallelizing the data loading. But yes, we can have a similar approach to improve that method as well.
Decreasing the batch size on failure seems very smart for `rg.load` in particular; see the sketch below.
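For illustration, a sketch of that idea, assuming a hypothetical `fetch_records` call standing in for the requests `rg.load` makes (nothing here is Argilla's actual implementation):

```python
# Halve the batch size whenever a fetch fails, down to a floor.
from typing import Any, Callable, List


def load_all(
    fetch_records: Callable[..., List[Any]],
    total: int,
    batch_size: int = 500,
    min_batch_size: int = 10,
) -> List[Any]:
    records: List[Any] = []
    offset = 0
    while offset < total:
        try:
            records.extend(fetch_records(offset=offset, limit=batch_size))
            offset += batch_size
        except Exception:
            if batch_size <= min_batch_size:
                raise  # already at the floor; give up
            batch_size = max(min_batch_size, batch_size // 2)  # retry smaller
    return records
```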
src/argilla/client/client.py (Outdated)

```python
batch_size: int = 100,
verbose: bool = True,
chunk_size: Optional[int] = None,
num_threads: int = 1,
```
I would also like to see `max_retries` and the `backoff` variables here.
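For illustration, the signature above might grow the suggested knobs roughly like this (names follow the review comment; defaults are assumptions, not the final API):

```python
# Illustrative only: how max_retries/backoff could sit alongside num_threads.
def log(
    records,
    name: str,
    batch_size: int = 100,
    verbose: bool = True,
    num_threads: int = 1,
    max_retries: int = 3,   # attempts to retry a batch on transient transport errors
    backoff: float = 2.0,   # seconds to wait between retry attempts
):
    ...
```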
Hi @frascuchon, looks good. I would like to see `max_retries` and `backoff` in the other log functions, but ideally they should also be added to `rg.load`. As in my previous comment, if it works for you, we can focus on the …
Some small nitpicks. Looks good otherwise; I'm always glad to see a PR that removes more code than it adds.
- Use the Deprecated section for CHANGELOG
- Include new default batch size for `rg.log`
Co-authored-by: Tom Aarsen <37621491+tomaarsen@users.noreply.github.com>
…ubrix into feat/improve-rg.log-functions
Looks great, I think this is all set
Description

Allow logging data batches concurrently:

- Accept `num_threads` to log batches concurrently
- Retry when an `httpx.TransportError` occurs

Partially closes #2533
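A hypothetical usage example of the improved function (dataset name and values are illustrative):

```python
# Log 1,000 records in batches of 100, using 4 threads to send batches concurrently.
import argilla as rg

records = [rg.TextClassificationRecord(text=f"sample {i}") for i in range(1_000)]
rg.log(records, name="my-dataset", batch_size=100, num_threads=4)
```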
Type of change
(Please delete options that are not relevant. Remember to title the PR according to the type of change)
How Has This Been Tested
(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

TBD
Checklist