Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Use parseable buildinfo #12741

Closed
wants to merge 9 commits into from
Closed

ENH: Use parseable buildinfo #12741

wants to merge 9 commits into from

Conversation

larsoner
Copy link
Contributor

@larsoner larsoner commented Aug 6, 2024

Subject: Help people figure out why rebuilds happen by:

  1. Making .buildinfo["config"] parseable rather than a hash
  2. Logging more verbosely by default reasons why rebuilds are happening.
  3. Adding info about what template is problematic when rebuilds happen.

I was going to open this as a feature request but then got started on a proof-of-concept so figured I'd open it as a proposal.

Feature or Bugfix

  • Feature

Purpose

When Sphinx encounters a difference in the build information, the current level of information provided is just:

building [html]: targets for 969 source files that are out of date

If you use -vv and look closely (there are a ton of lines printed!) there is this bit of extra information:

[build target] did not match: build_info

This PR changes it so that it has this extra line even in non-verbose-verbose mode by logger.infoing this:

[build target] did not match: build_info, copying .buildinfo to .buildinfo.old

But then also making the .buildinfo no longer just store a hash, but rather a json.dumps of the representation of the config. This makes it possible to much more easily diagnose the problem. In my case, here's what I could see looking at a diff between the .buildinfo and .buildinfo_old:

image

And if this isn't immediately readable -- i.e., not obvious that exclude_explicit_doc_regex was what changed -- in principle you could json.loads(...) and re-dump with nicer spacing and then look at the diff, etc. In this sense it's much more informative than a hash mismatch. For my build, this file was only ~23kB so not too big/painful I think.

In my particular case, this allowed me to figure out that the rebuild is being caused by sphinx-gallery doing an re.compile on a str created by joining elements of a set, whose order will not be stable build-to-build. But I think it's generally more broadly useful in cases where the buildinfo changes happen.

Then I also hit another problem, namely that we were generating a template on the fly which also was causing everything to be out of date. So I added a logging message for that:

[build target] template avatars.html is newer (2024-08-06 20:36:50.84) than previous build info (2024-08-06 20:17:22.360), all docs will be out of date

allowing me to easily see that the avatars.html that we generate at the start of each doc build should more carefully be updated (or perhaps not be a template file actually!).

In any case, if this PR's functionality is desired, probably lots of text and tests etc. need to be updated but I wanted to see if people agree this is useful first!

@AA-Turner
Copy link
Member

Thank you!

Immediate thoughts:

  • We would either need to bump the .buildinfo version to 2 or add a new file enumerating the configuration.
  • Should this output be verbose-by-default?
  • Is it worth storing this information in the file in a readable way, or should we just compute the differences and show the reason to the user (similar to pytest's assertion failure diffs)

Then I also hit another problem, namely that we were generating a template on the fly which also was causing everything to be out of date. So I added a logging message for that

Please could this be a different PR? That will be quicker to merge!

A

@larsoner
Copy link
Contributor Author

larsoner commented Aug 8, 2024

We would either need to bump the .buildinfo version to 2 or add a new file enumerating the configuration.
...
Please could this be a different PR? That will be quicker to merge!

Bumped to 2 and split off #12746 !

Should this output be verbose-by-default?

So far I have a logger.info message that the .buildinfo differs and say that it's copied. This alerts people to the problem without being overly verbose, and lets them do a diff themselves if they want. So I think what's here is already a good start toward 1) showing people there is a problem and 2) giving them a way to investigate and solve it.

Is it worth storing this information in the file in a readable way, or should we just compute the differences and show the reason to the user (similar to pytest's assertion failure diffs)

Regarding how we store the info, note that the info has to be stored somehow in a more complete way than it is currently (i.e., as a hash) -- the diff in build 2 will be between the buildinfo from build 2 in memory and the buildinfo from build 1 that gets loaded from disk. So if you want a meaningful diff in build 2, the buildinfo from build 1 on disk has to store some representation of underlying human readable information, and the diff will be done on that. I guess in principle you could maybe use a binary format to save a bit of file storage, but given that the file even for a complex build like MNE's is <30kB, I think json stringification is a good way to go. It has all the info and stays human readable in the file itself. I could just open it in a diff view in VSCode and in 10 seconds see the problem. If it was stored it as a binary blob this wouldn't be possible -- I'd have to figure out how sphinx binarizes the data, load it myself probably in Python, then do a diff probably in Python etc.

But beyond storage, we could compute differences in Python after loading the buildinfo. However, we'd have to make decisions like what to do in cases where the diff is quite large. We probably wouldn't want to show the whole thing by default at logger.info level. And if we don't want to show the whole thing, then it would be nice if the user could still get the whole diff if they wanted. We could consider trying to do clever things like show the whole diff in logger.info level if it's fewer than 10 lines, and if more then truncate and add ..., and regardless logger.debug the whole diff. I could take a stab at this or something similar if you want!

@AA-Turner
Copy link
Member

Perhaps we could just print out which keys have changed? That might be more manageable.

A

@larsoner
Copy link
Contributor Author

larsoner commented Aug 13, 2024

Perhaps we could just print out which keys have changed? That might be more manageable.

Okay I implemented that and now we see this when I force a config key to change:

building [html]: all docs marked outdated due to build_info mismatch in 1 config key(s): ['sphinx_gallery_conf'], copying .buildinfo to .buildinfo.bak

In my case that wouldn't have been enough to fully diagnose the problem, though, so I think it's still worth keeping the copyfile in there as well so that people can diff / json.loads or whatever they want to dig deeper.

@AA-Turner
Copy link
Member

I agree on keeping the old version of the file to allow for manual inspection.

I did have a thought -- is there a risk of leaking secrets here? As .buildinfo is default-public (e.g. http://docs.python.org/3/.buildinfo), and we now serialise the loaded configuration.

I might be over-worrying, but I wanted to raise the concern and make sure we are comfortable here.

A

@larsoner
Copy link
Contributor Author

Argh yes I think you're right. Would it make sense to have this be opt in with a config var saying whether to use hash or this new string representation?

@AA-Turner
Copy link
Member

Argh yes I think you're right. Would it make sense to have this be opt in with a config var saying whether to use hash or this new string representation?

Agreed on opt-in, but I don't think this can be part of buildinfo (directly) now. The setting should enable writing configuration-<ISO DATE>.json to disk (in a pretty-printed manner), and if the buildinfo changes, it could maybe load the newest file from disk and reuse the diff mechanism already in this PR.

Would that work for you?

A

@electric-coder
Copy link

electric-coder commented Aug 18, 2024

Do we get a convenient command line option to specify the output file (or print to the terminal) if we're just looking to check what files triggered a rebuild?

For my build, this file was only ~23kB so not too big/painful I think.

I've been having a problem where changes to external SVGs don't trigger rebuilds... Thus I'm forced to do a complete rebuild that takes minutes (for lack of an easier/evident sphinx-build solution). It'd be desirable to not make this a default overhead in regular builds as users will only want the info when they need to diagnose something.

@larsoner
Copy link
Contributor Author

The setting should enable writing configuration-.json to disk (in a pretty-printed manner), and if the buildinfo changes, it could maybe load the newest file from disk and reuse the diff mechanism already in this PR.

@AA-Turner yes that sounds reasonable. One problem with that approach is that such files will pile up, but that's probably okay if you're just generally trying to diagnose a problem. Any thoughts on what the config value would be called? build_info_verbosity or build_info = "hash" (default) | "json" or something? No real opinion from me.

Do we get a convenient command line option to specify the output file (or print to the terminal) if we're just looking to check what files triggered a rebuild?

Under @AA-Turner 's idea you could do -Dbuild_info=json and then just know it'll use the date format. So no specifying the filename but using the standard scheme it's easy enough to figure out I think.

@AA-Turner
Copy link
Member

Any thoughts on what the config value would be called?

I would suggest something like dump_configuration, save_configuration, serialise_configuration, etc.

A

@larsoner
Copy link
Contributor Author

larsoner commented Oct 8, 2024

Closing for #12949

@larsoner larsoner closed this Oct 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants