Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Configuration hash in BuildInfo may be unstable #11777

Closed
theOehrly opened this issue Nov 29, 2023 · 4 comments
Closed

Configuration hash in BuildInfo may be unstable #11777

theOehrly opened this issue Nov 29, 2023 · 4 comments
Labels

Comments

@theOehrly
Copy link

theOehrly commented Nov 29, 2023

Describe the bug

The Sphinx HTML builder verifies that the configuration is unchanged by hashing all configuration values and storing this hash in a .buildinfo file. On the next run the current and the previous hash are compared to determine whether the configuration has changed. If a change is detected, all source files are marked as out of date and rebuilt.

Given that the configuration is a Python file, a reference to a function can be used as a configuration value. Some Sphinx extensions make use of this and let the user define custom functions for specific tasks. One example of this is the Sphinx-Gallery extension.

Quoting one example from the Sphinx-Gallery documentation:

sphinx_gallery_conf = {
   ...
   'reset_modules': ('matplotlib', 'seaborn'),
}

Currently, Sphinx-Gallery natively supports resetting matplotlib and seaborn. However, you can also add your own custom function to this tuple

The Sphinx HTML builder hashes the string representation the configuration value object:

return hashlib.md5(str(obj).encode(), usedforsecurity=False).hexdigest()

The string representation of a function is of course something like <function test at 0x0000020C4644D6C0> where the memory address is changing for every run.

This means, that when using any extension that uses a reference to a function as a configuration value, the documentation will be completely rebuilt on every run.

Issues caused by this:

  • the build time increases considerably on large projects (in the order of Minutes with Matplotlib, for example)
  • Breaks reproducibility in html builders #4240 is likely caused by this as well, as Scikit-Learn is using Sphinx-Gallery as well (but I haven't verified that theory)

It is of course debatable, whether this is a bug or not. But given that the configuration file is a Python file, it is not really unexpected that people want to use the possibility of just passing a function object as a configuration value.

It is not reliably possible to actually hash the code within such a function, therefore it is not possible for Sphinx to determine if the function itself has changed.

I see three potential options here:

  1. Exclude values that are a reference to a function when generating the hash
  2. Only hash the function name to get a stable hash
  3. Decide that this is unintended functionality and that modules like Sphinx-Gallery should adapt their approach instead

In my opinion, a user would not expect Sphinx to automatically recognize a change in a user provided function, therefore option 1 or 2 would be acceptable.

How to Reproduce

pip install sphinx sphinx-gallery
sphinx-quickstart --sep --project test --author me -l en -r 0.0.1
mkdir examples
echo "" > examples/readme.txt

Add the following content to the end of the default conf.py in the source directory:

extensions = [
    'sphinx_gallery.gen_gallery',
]


def example_func():
    return


sphinx_gallery_conf = {
   'reset_modules': (example_func, ),
}

Run make html and notice that all HTML files will always be rebuilt without making any changes between runs. If you run with the -vv option (or higher) you will find [build target] did not match: build_info in the output.

Environment Information

Platform:              win32; (Windows-10-10.0.19044-SP0)
Python version:        3.11.3 (tags/v3.11.3:f3909b8, Apr  4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)])
Python implementation: CPython
Sphinx version:        7.2.6
Docutils version:      0.20.1
Jinja2 version:        3.1.2
Pygments version:      2.16.1

Sphinx extensions

extensions = [
    'sphinx_gallery.gen_gallery',
]

Additional context

No response

@picnixz
Copy link
Member

picnixz commented Dec 24, 2023

This means, that when using any extension that uses a reference to a function as a configuration value, the documentation will be completely rebuilt on every run.

I think, such things should be referred to using their fully-qualified name. Then, the extension is responsible for importing it.

We need to enforce on our side only JSON-like objects for configuration values. And not allowing arbitrary references. Alternatively, we could change how we serialize functions by hashing their corresponding dissassembly operations (which should technically be unique if it doesn't change, unless they have assertions and one build is with -OO and the next one without but these are two distinct builds for me).

I prefer the second approach where we special-case the hashing of objects.

Note: the function name is not necessarily unique (even its fully-qualified name may be distinct). However its disassembly should be unique I think.

@theOehrly
Copy link
Author

theOehrly commented Dec 25, 2023

We need to enforce on our side only JSON-like objects for configuration values. And not allowing arbitrary references.

I think that using a Python file for the configuration but then imposing limitations on the allowed types for configuration values might lead to confusion. Apart from that, changing this now would be a breaking change and require a deprecation first, or not?

Alternatively, we could change how we serialize functions by hashing their corresponding dissassembly operations (which should technically be unique if it doesn't change, unless they have assertions and one build is with -OO and the next one without but these are two distinct builds for me).

Do you mean using the dis module? I didn't know about this option before. I did some tests with it and see one potential issue. Assume the following example function:

def some_function():
    import mymodule
    # ... do some stuff ...
    mymodule.documentation_helper()
    # ... do some more stuff ...

When the code in this function changes, this should be reliably detectable by looking at the output of the disassembly. But when the implementation of mymodule.documentation_helper() changes, this will not be detectable, because the dis module does not follow function calls like that. (I guess there would be a whole new problem if it did)
That means, it is possible to detect if some_function() changes. But it is not possible to guarantee that it actually does the same thing.
This might convey a false sense of being able to detect changes to the behavior of such a function, when in fact, only some changes can be detected.

I prefer the second approach where we special-case the hashing of objects.

Yes, I think that would be the best way as well.

Requiring JSON-like values and using qualified names as option values is equivalent to hashing the qualified name of a function now. Both options are unable to detect whether the behavior of a function changes. They are only able to detect if the qualified function name is the same.

My suggestion is to add a special case for hashing functions using their qualified name as a bug fix now.

Optionally, a deprecation warning could be added to inform users/maintainers of extensions, that this behavior will not be supported in the future. But I do not see a real benefit in doing that. The unstable hash will already be fixed and I don't know of any other problems that are caused by using arbitrary references here.

If we agree on what the best way to fix this is, I can create a pull request to implement the necessary changes.

@picnixz
Copy link
Member

picnixz commented Dec 25, 2023

Ah yes concerning the dissasembly, I missed that point. Also it could have been a huge dump so not worth it I think. Technically it's possible to have a deterministic hash by unrolling everything but that's the worst idea in practice.

I think we could live with just storing the name + module name. But we should probably update the doc for extensions developers and for people interested in reproducible builds.

Before implementing anything I would like some comments from other members (probably after New Year's Eve). Maybe they (or people worried about reproducibility) have other ideas.

@AA-Turner
Copy link
Member

Resolved (hopefully) in 959f73f.

A

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 8, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

3 participants