
feat: Improve rg.log function #2640

Merged
merged 21 commits into develop from feat/improve-rg.log-functions on Apr 20, 2023

Conversation

frascuchon
Member

@frascuchon frascuchon commented Apr 3, 2023

Description

Allow logging data batches concurrently

  • Accept num_threads to log batches concurrently
  • Add retries when an httpx.TransportError occurs

Partially closes #2533
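The behavior described above can be sketched roughly as follows. This is a hypothetical, dependency-free sketch, not the actual argilla API: `log_batches` and `send_batch` are illustrative names. Every batch is sent, per-batch errors are collected instead of aborting on the first failure, and a single, more descriptive error is raised at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def log_batches(batches, send_batch, num_threads=1):
    """Send every batch, collecting per-batch errors instead of
    aborting on the first failure (hypothetical sketch, not the
    real argilla API)."""
    errors = []

    def _safe_send(batch):
        try:
            send_batch(batch)
        except Exception as exc:  # collect the error, keep processing
            errors.append((batch, exc))

    if num_threads > 1:
        # fan batches out over a pool of worker threads
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            list(pool.map(_safe_send, batches))
    else:
        for batch in batches:
            _safe_send(batch)

    if errors:
        # raise one descriptive error summarizing every failed batch
        raise RuntimeError(
            f"{len(errors)} of {len(batches)} batches failed: {errors!r}"
        )
    return len(batches)
```

Note that `list.append` is thread-safe in CPython, so the shared `errors` list needs no extra locking here.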

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

TBD

Checklist

  • I have merged the original branch into my forked branch
  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

- Accept `num_threads` to log batches concurrently
- All batches will be processed even if errors were found
- For those with errors, a more descriptive error will be raised
@codecov

codecov bot commented Apr 3, 2023

Codecov Report

Patch coverage: 88.46% and project coverage change: -0.24 ⚠️

Comparison is base (18fd8de) 94.11% compared to head (520bcf8) 93.87%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2640      +/-   ##
===========================================
- Coverage    94.11%   93.87%   -0.24%     
===========================================
  Files          170      170              
  Lines         8732     8722      -10     
===========================================
- Hits          8218     8188      -30     
- Misses         514      534      +20     
Flag Coverage Δ
pytest 93.87% <88.46%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/argilla/client/api.py 91.48% <40.00%> (-3.86%) ⬇️
src/argilla/client/client.py 93.24% <100.00%> (+0.02%) ⬆️
src/argilla/client/sdk/commons/api.py 49.01% <100.00%> (-5.89%) ⬇️

... and 3 files with indirect coverage changes


- Increase default timeout
- retry when a `httpx.TransportError` error occurs
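The retry behavior described in these commits can be sketched like this. It is a minimal, dependency-free illustration: the PR retries on `httpx.TransportError`, but here the built-in `ConnectionError` stands in so the example runs without httpx, and `send_with_retries` is a hypothetical name.

```python
import random
import time

def send_with_retries(send, payload, max_retries=3, base_delay=0.5,
                      retry_on=(ConnectionError,)):
    """Retry `send` with exponential backoff plus jitter when a
    transient transport error occurs (sketch; the PR retries on
    httpx.TransportError, ConnectionError stands in here)."""
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except retry_on:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the error
            # exponential backoff: base_delay, 2x, 4x, ... plus jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter term spreads out retries from concurrent workers so they do not all hit the server at the same instant.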
@frascuchon frascuchon marked this pull request as ready for review April 3, 2023 17:13
@tomaarsen
Contributor

Works for me locally. I'm glad to see that the hacky _ArgillaLogAgent could be removed.
The default of num_threads: int = 1 differs slightly from how I usually see threading, where 0 implies no threading and 1 implies one thread created. My understanding is that in this PR, 1 implies "no new thread created", is that right? I must say that I think I prefer this PR's approach, although it does differ from what I tend to see (e.g. in torch DataLoader num_workers).

Lastly, I think we're still missing tests for num_threads > 1.

@frascuchon
Member Author

Thanks for your feedback @tomaarsen!!

Works for me locally. I'm glad to see that the hacky _ArgillaLogAgent could be removed. The default of num_threads: int = 1 differs slightly from how I usually see threading, where 0 implies no threading and 1 implies one thread created. My understanding is that in this PR, 1 implies "no new thread created", is that right? I must say that I think I prefer this PR's approach, although it does differ from what I tend to see (e.g. in torch DataLoader num_workers).

Since rg.log is a blocking function when background=False, using a separate thread instead of the main one to process batches brings no real improvement. That said, it's true that it can confuse users about the actual number of separate threads used for computation.

I think the approach of launching a separate thread when num_threads=1 and using the main one when num_threads=0 better matches the parameter's purpose. So, I will change it to that behavior.
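The semantics agreed on here can be illustrated with a small dispatch helper (`run_batches` is a hypothetical name, not part of argilla): num_threads=0 processes batches in the calling (main) thread, while num_threads >= 1 spawns that many worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batches(batches, fn, num_threads=0):
    """Dispatch following the semantics discussed in this thread:
    num_threads=0 runs in the calling thread, num_threads>=1 spawns
    that many worker threads (hypothetical sketch)."""
    if num_threads == 0:
        # no threading at all: process batches in the main thread
        return [fn(batch) for batch in batches]
    # num_threads workers; map preserves input order in the results
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fn, batches))
```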

Lastly, I think we're still missing tests for num_threads > 1.

Yes, I need to add some tests to check this flow.

@tomaarsen
Contributor

Indeed, num_threads=1 (or num_workers=1, etc.) tends to decrease performance relative to using the main thread only (via 0). It's still usually offered, though.

json_body=CopyDatasetRequest(name=name_of_copy, target_workspace=workspace),
)

def delete(self, name: str, workspace: Optional[str] = None):


I would also like to see the backoff variables here.

Member Author


We should apply changes step by step. We can consider adding a backoff mechanism to another method in a separate PR. Otherwise, a lot of changes will be included in the same PR, which can be a great bug farm. :-)


I added it to rg.load, which seems the most relevant to me.

Member Author


For rg.load, things could be a bit different. For instance, we should decrease the batch size, or prefetch some data before splitting and parallelizing the data loading. But yes, we can take a similar approach to improve that method as well.

Contributor


Decreasing batch size on failure seems very smart for rg.load in particular.
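The idea of shrinking the batch size on failure could look roughly like this (`fetch_all` and `fetch_range` are illustrative names, not part of the argilla SDK): each failed fetch halves the batch size until a floor is reached.

```python
def fetch_all(fetch_range, total, batch_size=100, min_batch_size=10):
    """Fetch `total` records in batches, halving the batch size when
    a fetch fails, as suggested for rg.load (hypothetical sketch;
    `fetch_range(start, size)` stands in for the real SDK call)."""
    records, start = [], 0
    while start < total:
        size = min(batch_size, total - start)
        try:
            records.extend(fetch_range(start, size))
            start += size
        except Exception:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further, give up
            # retry the same range with a smaller batch size
            batch_size = max(min_batch_size, batch_size // 2)
    return records
```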

batch_size: int = 100,
verbose: bool = True,
chunk_size: Optional[int] = None,
num_threads: int = 1,


I would also like to see max_retries and the backoff variables here.

Member

@davidberenstein1957 davidberenstein1957 left a comment


Hi @frascuchon, looks good. I would like to see max_retries and backoff in the other log functions, but ideally they should also be added to rg.load.

@frascuchon
Member Author

Hi @frascuchon, looks good. I would like to see max_retries and backoff in the other log functions, but ideally they should also be added to rg.load.

As mentioned in my previous comment, if it works for you, we can focus on the rg.log improvement in this PR and tackle rg.load in a separate one.

Contributor

@tomaarsen tomaarsen left a comment


Some small nitpicks. Looks good otherwise, I'm always glad to see a PR that removes more code than it adds.

- Use the Deprecated section for CHANGELOG
- Include new default batch size for rg.log
Contributor

@tomaarsen tomaarsen left a comment


Looks great, I think this is all set

@frascuchon frascuchon merged commit 33c0fab into develop Apr 20, 2023
@frascuchon frascuchon deleted the feat/improve-rg.log-functions branch April 20, 2023 04:21