Reliable ETA and progress percentage. #1963

Merged
merged 1 commit into from
Mar 16, 2024


LebedevRI
Contributor

@LebedevRI LebedevRI commented Apr 26, 2021

Reliable ETA and progress percentage.

This has been bugging me for years. :)
Count of finished edges isn't a great statistic, it isn't
really obvious if LLVM will take 8 minutes to build, or 10 minutes.

But, it's actually pretty straight-forward to get some
more useful information. We already know how much time each edge
has taken, so we could just do the dumb thing, and assume that
every edge in the plan takes the same amount of time.

Or, we can do better. .ninja_log already contains
the historical data on how long each edge took to produce its outputs,
so we simply need to ensure that we populate edges with that info,
and then we can greatly improve our predictions.
The math is pretty simple i think.
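
A minimal sketch of the estimate described above, not the code from this PR. It assumes a simplified, hypothetical `EdgeInfo` record rather than ninja's real `Edge` class, uses the time recorded in the log for each unfinished edge when one exists, and falls back to the average of the edges finished so far otherwise. It sums serial edge time and ignores parallelism.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for a build edge (not ninja's Edge).
struct EdgeInfo {
  int64_t prev_elapsed_ms;  // duration from the previous build's log, -1 if unknown
  bool finished;            // has this edge completed in the current build?
  int64_t elapsed_ms;       // measured duration this build, valid only if finished
};

// Estimates the remaining work as a sum of per-edge durations in milliseconds.
int64_t EstimateRemainingMs(const std::vector<EdgeInfo>& edges) {
  int64_t finished_ms = 0;
  int64_t finished_count = 0;
  for (const EdgeInfo& e : edges) {
    if (e.finished) {
      assert(e.elapsed_ms >= 0);
      finished_ms += e.elapsed_ms;
      ++finished_count;
    }
  }
  // Fallback for edges without history: assume the average observed so far.
  const int64_t avg_ms = finished_count > 0 ? finished_ms / finished_count : 0;

  int64_t remaining_ms = 0;
  for (const EdgeInfo& e : edges) {
    if (e.finished) continue;
    remaining_ms += e.prev_elapsed_ms >= 0 ? e.prev_elapsed_ms : avg_ms;
  }
  return remaining_ms;
}
```

With two finished edges (1200 ms and 800 ms) and two pending ones, one logged at 3000 ms and one unknown, the estimate is 3000 ms plus the 1000 ms average.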

This is largely a port of a similar change i did to LLVM LIT:
https://reviews.llvm.org/D99073

With this, i get something quite lovely:

llvm-project/build-Clang12$ NINJA_STATUS="[%f/%t %p %P][%w + %W] " /repositories/ninja/build-Clang-debug/ninja opt
[288/2527  11%   4%][00:27 + 08:52] Building CXX object lib/DebugInfo/CodeView/CMakeFiles/LLVMDebugInfoCodeView.dir/AppendingTypeTableBuilder.cpp.o
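
The numbers in that status line can be reproduced by hand. The sketch below assumes, as the sample output suggests, that `%w` is elapsed time and `%W` is the estimated remaining time, both as mm:ss, and that the two percentages are truncated integers. The helper names are made up for illustration and are not ninja functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Formats whole seconds as mm:ss (hours, if any, are folded into minutes here).
std::string FormatMinSec(int64_t seconds) {
  assert(seconds >= 0);
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%02lld:%02lld",
                static_cast<long long>(seconds / 60),
                static_cast<long long>(seconds % 60));
  return std::string(buf);
}

// Truncated integer percentage of part out of whole; 0 if whole is 0.
int Percent(int64_t part, int64_t whole) {
  return whole > 0 ? static_cast<int>(part * 100 / whole) : 0;
}
```

For the line above: 288 of 2527 edges gives 11%, while 27 s elapsed out of a predicted total of 27 s + 532 s gives 4%. The gap between the two is the point of the change: edge count and time are different measures of progress.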

I hope people will find this useful, and it could be merged.
CC @nico

Please let me know which kinds of test coverage this needs?

Fix #115.

@LebedevRI LebedevRI force-pushed the reliable-eta branch 2 times, most recently from 3b578c8 to 556bc11 Compare April 26, 2021 22:41
@LebedevRI
Contributor Author

So far, i'm aware of two caveats:

  1. All times are in milliseconds. That may or may not affect the quality of prediction. Honestly, i'm not sure why they are in milliseconds. Even if they were in nanoseconds, int64_t would be enough for ~3 centuries. .ninja_log file size conservation?
  2. The prediction becomes bogus if the last rebuild was recovered from ccache.
    I guess we need to stop using previous times if the actual rate is very different from the predicted one.
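
The "~3 centuries" figure in the first caveat can be checked with a few lines of arithmetic. This is only a worked check of the range of a signed 64-bit nanosecond counter, assuming 365.25-day years; the names are illustrative.

```cpp
#include <cstdint>
#include <limits>

constexpr int64_t kNsPerSecond = 1000000000;
constexpr int64_t kSecondsPerYear = 31557600;  // 365.25 * 24 * 60 * 60

// Whole years representable as a signed 64-bit count of nanoseconds.
constexpr int64_t MaxYearsInInt64Ns() {
  return std::numeric_limits<int64_t>::max() / kNsPerSecond / kSecondsPerYear;
}

// Checked at compile time: about 292 years, i.e. roughly three centuries.
static_assert(MaxYearsInInt64Ns() == 292, "int64_t nanoseconds cover ~292 years");
```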

@jhasse
Collaborator

jhasse commented Apr 29, 2021

Can you add "Fix #115." to the commit message?

@LebedevRI
Contributor Author

So far, i'm aware of two caveats:

  1. All times are in milliseconds. That may or may not affect the quality of prediction. Honestly, i'm not sure why they are in milliseconds. Even if they were in nanoseconds, int64_t would be enough for ~3 centuries. .ninja_log file size conservation?
  2. The prediction becomes bogus if the last rebuild was recovered from ccache.
    I guess we need to stop using previous times if the actual rate is very different from the predicted one.

FWIW i've fixed the second point (by hardcoding some magic numbers).
About the first one i'm not so sure.

Will look into failing build.

Contributor

@mathstuf mathstuf left a comment

Overall the code looks OK to me. One question I have is how this interacts with restat = 1 removing edges and dyndeps adding and removing edges from the plan during the build. Can tests be added?

doc/manual.asciidoc (outdated, resolved)
@LebedevRI
Contributor Author

Overall the code looks OK to me. One question I have is how this interacts with restat = 1 removing edges and dyndeps adding and removing edges from the plan during the build.

I have no idea what restat = 1 is, but this does gracefully handle adding/removing edges
(that haven't started yet i presume?) during the build. That was indeed important to handle.
See builder_->status_->EdgeAddedToPlan(edge); in Plan::EdgeWanted() e.g.
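
To illustrate what handling plan changes means for the estimate, here is a rough sketch of the bookkeeping involved. It is not ninja's `Status` implementation: the `PlanTotals` class and its methods are hypothetical, and only the idea of a hook called when an edge enters the plan comes from the `EdgeAddedToPlan` call mentioned above. The running totals are adjusted whenever an edge that has not started yet is added to or dropped from the plan.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical running totals for edges currently in the plan.
class PlanTotals {
 public:
  // prev_elapsed_ms is -1 when the log has no entry for the edge.
  void EdgeAdded(int64_t prev_elapsed_ms) {
    ++total_edges_;
    if (prev_elapsed_ms >= 0) {
      ++edges_with_history_;
      predicted_total_ms_ += prev_elapsed_ms;
    }
  }

  // Must be called with the same value that was passed to EdgeAdded.
  void EdgeRemoved(int64_t prev_elapsed_ms) {
    assert(total_edges_ > 0);
    --total_edges_;
    if (prev_elapsed_ms >= 0) {
      assert(edges_with_history_ > 0);
      --edges_with_history_;
      predicted_total_ms_ -= prev_elapsed_ms;
    }
  }

  int total_edges() const { return total_edges_; }
  int edges_with_history() const { return edges_with_history_; }
  int64_t predicted_total_ms() const { return predicted_total_ms_; }

 private:
  int total_edges_ = 0;
  int edges_with_history_ = 0;
  int64_t predicted_total_ms_ = 0;
};
```

Keeping the totals incremental means the status line never needs to rescan the whole plan when dyndeps or restat change it mid-build.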

Can tests be added?

I haven't looked into that yet.

@LebedevRI LebedevRI force-pushed the reliable-eta branch 2 times, most recently from f34e70d to 5c3ebdf Compare May 1, 2021 13:42
@LebedevRI
Contributor Author

LebedevRI commented May 1, 2021

Did some analysis. With a clean build:
[image]
With a rebuild from ccache:
[image]

For ccache, the largest avg dt is 0.2s at 100% completion,
while for the clean build the minimal avg dt is always larger than ~2.0s after ~5% completion.
So it looks like we can reuse the previous timings if the curr/prev averages differ by less than ~10x.
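
A sketch of that check, under stated assumptions: the function name is made up, the 10x threshold is taken from the observation above rather than from the merged code, and "no data yet" is treated as "do not trust the history".

```cpp
#include <cassert>

// Assumed threshold, based on the "~10x" observation; not confirmed from the patch.
constexpr double kMaxAverageRatio = 10.0;

// Returns true if the average edge time measured so far in this build is within
// kMaxAverageRatio (in either direction) of the average taken from the log.
bool ShouldTrustHistory(double curr_avg_ms, double prev_avg_ms) {
  assert(curr_avg_ms >= 0 && prev_avg_ms >= 0);
  if (curr_avg_ms == 0 || prev_avg_ms == 0) return false;  // nothing to compare
  const double ratio = curr_avg_ms > prev_avg_ms ? curr_avg_ms / prev_avg_ms
                                                 : prev_avg_ms / curr_avg_ms;
  return ratio < kMaxAverageRatio;
}
```

The check is symmetric on purpose: a cached rebuild after a clean build and a clean build after a cached one both invalidate the logged timings.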

ninja_log.clean-sheet.csv
ninja_log.ccache.csv

@LebedevRI
Contributor Author

(please don't merge. this is still broken when e.g. ccache is used)

@LebedevRI
Contributor Author

Hm, and i just realized this is still here.

Essentially, this functionality requires that the last execution time, and the time it will take this time,
are similar. If they are significantly different (e.g. if on the second run it's being restored from ccache,
or it was restored from ccache last time, but this time it didn't hit), prediction will just be wrong.

As far as i can tell, the best thing we can do, is to heuristically stop using last execution time
when the new time is significantly different. It kinda works, but not really.

Can codeowners here please comment if the functionality this PR provides is acceptable given that drawback?

@jonesmz

This comment was marked as abuse.

@LebedevRI
Contributor Author

So i take it that, in other words, this would be fine.
Let me rebase this...

@LebedevRI
Contributor Author

Okay, rebased.
I've tested it somewhat, and it does appear to be still working after the rebase.
What is needed to get this over the finish line this time? :)

@LebedevRI LebedevRI force-pushed the reliable-eta branch 3 times, most recently from 5f04500 to c0af2fc Compare August 24, 2022 20:33
@jonesmz

This comment was marked as abuse.

src/graph.h Outdated

// Historical info: how long did this edge take last time,
// as per .ninja_log, if known? Defaults to -1 if unknown.
int64_t prev_elapsed_time;

This comment was marked as abuse.

This comment was marked as abuse.

Contributor Author

Yep, naked unitless data is a problem, i agree. I would think it would make sense to change
all relevant occurrences at once, so i'm not sure a one-off change is a win?

This comment was marked as abuse.

Contributor Author

My main problem with that is that it (and some other changes) goes directly against the current code style,
and having uniform code is generally better, and can be uplifted all at once,
as opposed to having a weird mismatch of different coding styles.

But if the maintainers (@jhasse ? @nico ?) can chime in and review the change as a whole,
i can sure make those changes if they are deemed better by maintainers.

This comment was marked as abuse.

src/ninja.cc (outdated, resolved)
src/status.cc (resolved)
@LebedevRI
Contributor Author

Re build failure:
i wonder if instead of curl -L -O https://github.com/Kitware/CMake/releases/download/v3.16.4/cmake-3.16.4-Linux-x86_64.sh
it would be more stable to use https://apt.kitware.com/

doc/manual.asciidoc (outdated, resolved)
doc/manual.asciidoc (outdated, resolved)
@LebedevRI LebedevRI force-pushed the reliable-eta branch 2 times, most recently from 044dc50 to 966de6a Compare October 24, 2022 20:49
src/status.cc (outdated, resolved)
doc/manual.asciidoc (outdated, resolved)
src/status.cc (outdated, resolved)
@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@jhasse

@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@LebedevRI
Contributor Author

ping

@mathstuf
Contributor

Please do not repeatedly "poke" issues/PRs with content-less comments such as this. Note that there has been activity lately (mostly merging bug fixes and some CI updates). Hopefully there will be more time in the future for such feature PRs to get the attention they need and deserve, but "pings" really do not help.

However, given the CI changes lately, a rebase of this PR to adapt to the new configuration is likely useful.

@LebedevRI
Contributor Author

ping.

@jhasse jhasse added this to the 1.12.0 milestone Mar 16, 2024
@jhasse jhasse merged commit 8d47b88 into ninja-build:master Mar 16, 2024
10 checks passed
@jhasse
Collaborator

jhasse commented Mar 16, 2024

Thank you for your hard work and your patience! A great feature :)

The pings resulted in me unsubscribing and not being interested in having a look at this PR. Please refrain from doing that next time.

@LebedevRI
Contributor Author

LebedevRI commented Mar 16, 2024

Wow thank you @jhasse, i had almost given up on this thing :)
If there are issues with it, just let me know.

The pings resulted in me unsubscribing and not being interested in having a look at this PR. Please refrain from doing that next time.

Yeah, believe me, i very VERY VERY much know the feeling :(
Sorry about that, i didn't quite know what the preferred etiquette was here, and thought i was helping.

@LebedevRI LebedevRI deleted the reliable-eta branch March 16, 2024 12:45
Successfully merging this pull request may close these issues.

Estimate time left on build
4 participants