
feat: Improve rg.log function #2640

Merged
merged 21 commits into develop from feat/improve-rg.log-functions on Apr 20, 2023

Conversation

frascuchon
Member

@frascuchon frascuchon commented Apr 3, 2023

Description

Allow logging data batches concurrently

  • Accept num_threads to log batches concurrently
  • Add retries when an httpx.TransportError occurs

Partially closes #2533
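The behavior described above can be sketched roughly as follows. This is a hypothetical, dependency-free sketch, not the actual argilla API: `log_batches` and `send_batch` are illustrative names. Every batch is sent, per-batch errors are collected instead of aborting on the first failure, and a single, more descriptive error is raised at the end.

```python
from concurrent.futures import ThreadPoolExecutor

def log_batches(batches, send_batch, num_threads=1):
    """Send every batch, collecting per-batch errors instead of
    aborting on the first failure (hypothetical sketch, not the
    real argilla API)."""
    errors = []

    def _safe_send(batch):
        try:
            send_batch(batch)
        except Exception as exc:  # collect the error, keep processing
            errors.append((batch, exc))

    if num_threads > 1:
        # fan batches out over a pool of worker threads
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            list(pool.map(_safe_send, batches))
    else:
        for batch in batches:
            _safe_send(batch)

    if errors:
        # raise one descriptive error summarizing every failed batch
        raise RuntimeError(
            f"{len(errors)} of {len(batches)} batches failed: {errors!r}"
        )
    return len(batches)
```

Note that `list.append` is thread-safe in CPython, so the shared `errors` list needs no extra locking here.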

Type of change

(Please delete options that are not relevant. Remember to title the PR according to the type of change)

  • Improvement (change adding some improvement to an existing functionality)

How Has This Been Tested

(Please describe the tests that you ran to verify your changes. And ideally, reference tests)

TBD

Checklist

  • I have merged the original branch into my forked branch
  • I added relevant documentation
  • My code follows the style guidelines of this project
  • I did a self-review of my code
  • I made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

- Accept `num_threads` to log batches concurrently
- All batches will be processed even if errors were found
- For those with errors, a more descriptive error will be raised
@codecov

codecov bot commented Apr 3, 2023

Codecov Report

Patch coverage: 88.46% and project coverage change: -0.24 ⚠️

Comparison is base (18fd8de) 94.11% compared to head (520bcf8) 93.87%.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #2640      +/-   ##
===========================================
- Coverage    94.11%   93.87%   -0.24%     
===========================================
  Files          170      170              
  Lines         8732     8722      -10     
===========================================
- Hits          8218     8188      -30     
- Misses         514      534      +20     
Flag Coverage Δ
pytest 93.87% <88.46%> (-0.24%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/argilla/client/api.py 91.48% <40.00%> (-3.86%) ⬇️
src/argilla/client/client.py 93.24% <100.00%> (+0.02%) ⬆️
src/argilla/client/sdk/commons/api.py 49.01% <100.00%> (-5.89%) ⬇️

... and 3 files with indirect coverage changes


- Increase default timeout
- retry when a `httpx.TransportError` error occurs
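The retry behavior described in these commits can be sketched like this. It is a minimal, dependency-free illustration: the PR retries on `httpx.TransportError`, but here the built-in `ConnectionError` stands in so the example runs without httpx, and `send_with_retries` is a hypothetical name.

```python
import random
import time

def send_with_retries(send, payload, max_retries=3, base_delay=0.5,
                      retry_on=(ConnectionError,)):
    """Retry `send` with exponential backoff plus jitter when a
    transient transport error occurs (sketch; the PR retries on
    httpx.TransportError, ConnectionError stands in here)."""
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except retry_on:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the error
            # exponential backoff: base_delay, 2x, 4x, ... plus jitter
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

The jitter term spreads out retries from concurrent workers so they do not all hit the server at the same instant.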
@frascuchon frascuchon marked this pull request as ready for review April 3, 2023 17:13
@tomaarsen
Contributor

Works for me locally. I'm glad to see that the hacky _ArgillaLogAgent could be removed.
The default of num_threads: int = 1 differs slightly from how I usually see threading, where 0 implies no threading and 1 implies one thread created. My understanding is that in this PR, 1 implies "no new thread created", is that right? I must say that I think I prefer this PR's approach, although it does differ from what I tend to see (e.g. in torch DataLoader num_workers).

Lastly, I think we're still missing tests for num_threads > 1.

@frascuchon
Member Author

Thanks for your feedback @tomaarsen!!

Works for me locally. I'm glad to see that the hacky _ArgillaLogAgent could be removed. The default of num_threads: int = 1 differs slightly from how I usually see threading, where 0 implies no threading and 1 implies one thread created. My understanding is that in this PR, 1 implies "no new thread created", is that right? I must say that I think I prefer this PR's approach, although it does differ from what I tend to see (e.g. in torch DataLoader num_workers).

Since rg.log is a blocking function when background=False, using a separate thread instead of the main one to process batches brings no real improvement. That said, it's true that it can confuse users about the actual number of separate threads used for computation.

I think the approach of launching a separate thread when num_threads=1 and using the main one when num_threads=0 better matches the parameter's purpose. So, I will change it to that behavior.
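The semantics agreed on here can be illustrated with a small dispatch helper (`run_batches` is a hypothetical name, not part of argilla): num_threads=0 processes batches in the calling (main) thread, while num_threads >= 1 spawns that many worker threads.

```python
from concurrent.futures import ThreadPoolExecutor

def run_batches(batches, fn, num_threads=0):
    """Dispatch following the semantics discussed in this thread:
    num_threads=0 runs in the calling thread, num_threads>=1 spawns
    that many worker threads (hypothetical sketch)."""
    if num_threads == 0:
        # no threading at all: process batches in the main thread
        return [fn(batch) for batch in batches]
    # num_threads workers; map preserves input order in the results
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fn, batches))
```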

Lastly, I think we're still missing tests for num_threads > 1.

Yes, I need to add some tests to check this flow.

@tomaarsen
Contributor

Indeed, num_threads=1 (or num_workers=1, etc.) tends to decrease performance relative to using the main thread only (via 0). It's still usually offered, though.

json_body=CopyDatasetRequest(name=name_of_copy, target_workspace=workspace),
)

def delete(self, name: str, workspace: Optional[str] = None):


I would also like to see the backoff variables here.

Member Author


We should apply changes step by step. We can consider adding a backoff mechanism to another method in a separate PR. Otherwise, a lot of changes will be included in the same PR, which can be a great bug farm. :-)


I added it to rg.load, which seems the most relevant to me.

Member Author


For rg.load, things could be a bit different. For instance, we should decrease the batch size, or prefetch some data before splitting and parallelizing the data loading. But yes, we can take a similar approach to improve that method as well.

Contributor


Decreasing batch size on failure seems very smart for rg.load in particular.
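The idea of shrinking the batch size on failure could look roughly like this (`fetch_all` and `fetch_range` are illustrative names, not part of the argilla SDK): each failed fetch halves the batch size until a floor is reached.

```python
def fetch_all(fetch_range, total, batch_size=100, min_batch_size=10):
    """Fetch `total` records in batches, halving the batch size when
    a fetch fails, as suggested for rg.load (hypothetical sketch;
    `fetch_range(start, size)` stands in for the real SDK call)."""
    records, start = [], 0
    while start < total:
        size = min(batch_size, total - start)
        try:
            records.extend(fetch_range(start, size))
            start += size
        except Exception:
            if batch_size <= min_batch_size:
                raise  # cannot shrink further, give up
            # retry the same range with a smaller batch size
            batch_size = max(min_batch_size, batch_size // 2)
    return records
```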

batch_size: int = 100,
verbose: bool = True,
chunk_size: Optional[int] = None,
num_threads: int = 1,


I would also like to see max_retries and the backoff variables here.

Member

@davidberenstein1957 davidberenstein1957 left a comment


Hi @frascuchon, looks good. I would like to see max_retries and backoff in the other log functions, but ideally they should also be added to rg.load.

@frascuchon
Member Author

Hi @frascuchon, looks good. I would like to see max_retries and backoff in the other log functions, but ideally they should also be added to rg.load.

As mentioned in my previous comment, if it works for you, we can focus on the rg.log improvement in this PR and tackle rg.load in a separate one.

Contributor

@tomaarsen tomaarsen left a comment


Some small nitpicks. Looks good otherwise, I'm always glad to see a PR that removes more code than it adds.

- Use the Deprecated section for CHANGELOG
- Include new default batch size for rg.log
Contributor

@tomaarsen tomaarsen left a comment


Looks great, I think this is all set

@frascuchon frascuchon merged commit 33c0fab into develop Apr 20, 2023
@frascuchon frascuchon deleted the feat/improve-rg.log-functions branch April 20, 2023 04:21