About time in results dataframe #647

Closed
toncho11 opened this issue Sep 23, 2024 · 4 comments

Comments

@toncho11
Contributor

toncho11 commented Sep 23, 2024

I think the documentation of MOABB is not very clear on the "time" column. Please point me to a source if I am wrong.

I have 2 pipelines that take 53 minutes to run on many datasets and subjects. I use "WithinSession" evaluation, so each row in the results is a session. The "time" column comprises both training and classification (which is not obvious).

However, the time is not per session but per fold (thanks @gcattan). This is the relevant line:

"time": duration / 5.0, # 5 fold CV
So if I sum the "time" column in the results, I get about 5 minutes. I then need to multiply this by 5 to get the total time spent on training and classification, i.e. 25 minutes. That leaves 53 minutes of total run time - 25 minutes = 28 minutes.

  • Is the above reasoning correct? Does multiplying by 5 really give us the total time spent on training and classification?
  • So what are these 28 minutes actually spent on? Is this time spent on I/O (loading data), filtering (for the paradigm), and maybe other pre-processing steps?
  • So we usually get the mean time of a fold with:
    print(results.groupby("pipeline").mean("score")[["score", "time"]])
    because the mean of the "time" column is a better estimate than the total time?
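
For readers following along, one way to read the quoted `duration / 5.0` line is as an average over folds: the whole cross-validation is timed once, and the elapsed time is divided by the number of folds. The sketch below only illustrates that idea. It is not MOABB's code, and `average_fold_time` and `fit_and_score` are hypothetical names introduced here.

```python
import time


def average_fold_time(fit_and_score, n_folds=5):
    """Time n_folds calls of a hypothetical fit-and-score callable.

    Returns the mean wall-clock time per fold, in seconds.
    """
    start = time.perf_counter()
    for fold in range(n_folds):
        fit_and_score(fold)  # stands in for training + testing on one fold
    duration = time.perf_counter() - start
    return duration / n_folds  # same shape as "duration / 5.0" for 5 folds


# Dummy workload in place of a real pipeline (assumption for illustration).
per_fold_s = average_fold_time(lambda fold: sum(i * i for i in range(10_000)))
print(f"mean time per fold: {per_fold_s:.6f} s")
```

Under this reading, multiplying a row's "time" value by the number of folds recovers the total cross-validation time for that row.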
@PierreGtch
Collaborator

Hi @toncho11,

The “time” corresponds to the average time it takes to train and test the pipeline on one CV fold.
Indeed, loading the data is NOT counted in this time column, so the remaining 28 minutes are for loading and pre-processing the data. Note that loading and pre-processing the data is done only once for all the pipelines.
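
The accounting in this thread can be written out as a short calculation. This is a minimal sketch using the round numbers from the question (53 minutes total, "time" column summing to about 5 minutes). It assumes every row was produced with 5-fold CV, and `split_run_time` is a hypothetical helper, not part of MOABB.

```python
def split_run_time(per_fold_times_s, total_run_s, n_folds=5):
    """Split a run into CV time and everything else, both in seconds.

    per_fold_times_s: values of the "time" column (average seconds per fold).
    total_run_s: wall-clock duration of the whole run, measured by the user.
    """
    cv_s = sum(per_fold_times_s) * n_folds  # undo the per-fold averaging
    overhead_s = total_run_s - cv_s         # loading, filtering, pre-processing
    return cv_s, overhead_s


# Made-up column whose sum is 300 s (5 minutes), as in the question.
times_s = [60.0] * 5
cv_s, overhead_s = split_run_time(times_s, total_run_s=53 * 60)
print(f"CV: {cv_s / 60:.0f} min, other: {overhead_s / 60:.0f} min")
```

With a real results dataframe, `results["time"]` could be passed as `per_fold_times_s`.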

This “time” column lets you compare the different pipelines with each other; it is not meant for planning how long an experiment will take.

@gcattan
Contributor

gcattan commented Sep 25, 2024

Hi @PierreGtch. Thanks for your answer. It would be great if this were documented in the evaluation classes!

@toncho11
Contributor Author

Thank you @PierreGtch! Confirming all this was very important!

I just wanted to correct my previous query. It should be:

print(results.groupby("pipeline").mean()[["score", "time"]])

@toncho11
Contributor Author

Also, time is reported in seconds. For example, 0.18 in the time column means 180 milliseconds (average time per fold).
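
The per-pipeline summary and the seconds-to-milliseconds reading can be combined in one small example. This sketch runs on a toy dataframe with made-up values, not real MOABB output, and the `time_ms` column is an illustrative addition. It selects the numeric columns before averaging, because depending on the pandas version, calling `.mean()` on a grouped frame that still contains string columns (dataset, session, and so on) may raise an error.

```python
import pandas as pd

# Toy stand-in for a results dataframe; all values are invented.
results = pd.DataFrame(
    {
        "pipeline": ["A", "A", "B", "B"],
        "dataset": ["d1", "d1", "d1", "d1"],
        "session": ["0", "1", "0", "1"],
        "score": [0.80, 0.90, 0.70, 0.60],
        "time": [0.18, 0.22, 0.50, 0.30],  # seconds, averaged per fold
    }
)

# Select numeric columns first, then average per pipeline.
summary = results.groupby("pipeline")[["score", "time"]].mean()
summary["time_ms"] = summary["time"] * 1000.0  # seconds -> milliseconds
print(summary)
```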
