
performance-test: suggest a more comparable loading time for non-eager and eager execution #1319

Open
wants to merge 2 commits into
base: main
Conversation

@dafnapension dafnapension commented Nov 2, 2024

Currently, the Load Time reported in the performance test summary is the cumulative time spent in loader.load_data(). This works for all types of loaders.
However, still within Loader.load_data(), immediately after the dataset is downloaded (from HF, for example, in the case of LoadHF), comes MultiStream.from_iterables, the first introduction of the downloaded data into unitxt's recipe. At this point, eager execution differs from non-eager execution:
In eager mode, unitxt loops over each and every instance to include it in the ListStreams constituting the MultiStream being generated. In non-eager mode, unitxt merely creates generators for the GeneratorStreams constituting the MultiStream, and returns from load_data().
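The difference described above can be sketched as follows. This is a hypothetical simplification for illustration only: make_multistream_eager and make_multistream_lazy are stand-in names, not unitxt's actual classes or functions.

```python
# Hypothetical sketch of eager vs. non-eager MultiStream construction.
# These helpers are simplified stand-ins, not unitxt's implementation.

def make_multistream_eager(iterables):
    # Eager mode: loop over each and every instance right now,
    # materializing every split into a list (analogous to ListStreams).
    return {split: list(instances) for split, instances in iterables.items()}

def make_multistream_lazy(iterables):
    # Non-eager mode: only wrap each split in a generator factory
    # (analogous to GeneratorStreams); no instance is touched until
    # the stream is actually consumed.
    return {split: (lambda it=instances: (x for x in it))
            for split, instances in iterables.items()}

iterables = {"train": range(3)}
eager = make_multistream_eager(iterables)   # per-instance cost paid here
lazy = make_multistream_lazy(iterables)     # nothing iterated yet
assert eager["train"] == [0, 1, 2]
assert list(lazy["train"]()) == [0, 1, 2]   # cost paid only on consumption
```

The point of the sketch is that the eager variant pays a per-instance cost inside the load call itself, while the lazy variant defers that cost to whoever consumes the stream.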

Now, with load_iterables recently introduced into loaders.py by @elronbandel, we can cleanly measure the net load time of a dataset, excluding the introduction of each and every instance into unitxt, which happens only in eager mode.
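A minimal sketch of how the net load time could be separated out, assuming the two-phase structure described above. ToyLoader and its methods are illustrative stand-ins, not unitxt's actual loader or the performance test's actual timing code.

```python
import time

# Illustrative toy loader with the two-phase structure described above:
# load_iterables fetches the raw data; load_data additionally introduces
# instances into streams when running in eager mode.
class ToyLoader:
    def load_iterables(self):
        # Stands in for the raw dataset download (e.g., from HF).
        return {"train": range(100_000)}

    def load_data(self, eager):
        iterables = self.load_iterables()
        if eager:
            # Eager: materialize every instance now.
            return {s: list(it) for s, it in iterables.items()}
        # Non-eager: just hand back generator factories.
        return {s: (lambda i=it: iter(i)) for s, it in iterables.items()}

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

loader = ToyLoader()
net_load = timed(loader.load_iterables)                  # net load time only
eager_total = timed(lambda: loader.load_data(eager=True))
lazy_total = timed(lambda: loader.load_data(eager=False))
# eager_total includes the per-instance loop; lazy_total stays close to net_load.
```

Reporting net_load instead of the full load_data time is what makes the eager and non-eager numbers comparable.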

Here are snakeviz profiles demonstrating this.
For eager mode, more time is spent in Loader.load_data() than in LoadHF.load_iterables:
[snakeviz profile image]

Whereas for non-eager mode the two are the same (the time to create the generators is negligible):
[snakeviz profile image]

@dafnapension commented Nov 2, 2024

Hi @elronbandel ,
I am not sure how important the issue raised here is. We usually compare the performance of a PR against main with both runs eager or both non-eager, so why bother? On the other hand, why not be accurate, and have each mode 'pay' for its pass over the fresh instances in its reported Net Time? What is your view on this?

@dafnapension dafnapension force-pushed the pure_loading branch 6 times, most recently from 3983657 to b070ca1 Compare November 7, 2024 15:49
… execution

Signed-off-by: dafnapension <dafnashein@yahoo.com>
@dafnapension dafnapension force-pushed the pure_loading branch 2 times, most recently from b0e7350 to 5c95302 Compare November 7, 2024 18:29
…e of dataset, excluding the introduction of each and every instance into unitxt that is only done in eager mode

Signed-off-by: dafnapension <dafnashein@yahoo.com>