Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: clarify option usage #2835

Closed
wants to merge 1 commit into from
Closed

docs: clarify option usage #2835

wants to merge 1 commit into from

Conversation

layday
Copy link
Member

@layday layday commented Oct 27, 2021

Summary of changes

This (a) attempts to frame setuptools data inclusion keywords in terms
of wheel and source distributions and (b) to warn users about common
pitfalls.

Closes —

Pull Request Checklist

@benoit-pierre
Copy link
Member

The part about package_data is not quite right: this can be used to force the inclusion of some data at install or in the wheel (but not the source distribution).

So, here is how I use all 3 options in a project:

  • set include_package_data=True
  • add all files that should be included in the source distribution to MANIFEST.in
  • use options.package_data to include some files during install or in the wheel (but not the source distribution), like generated build artifacts
  • use exclude_package_data to prevent some data included in the source distribution from being installed or included in the wheel (like translation source catalogs, .pot and .po files)

So it's better to retain the original documentation for package_data.

@layday
Copy link
Member Author

layday commented Oct 27, 2021

I'm confused, are you saying package_data doesn't work for sdists? I'm not willing to make any reference to "installs" - that mode of usage is deprecated.

@layday
Copy link
Member Author

layday commented Oct 27, 2021

use options.package_data to include some files during install or in the wheel (but not the source distribution), like generated build artifacts

If you use include_package_data=True then package_data has no effect. See #1461.

@@ -0,0 +1 @@
Attempted to bring clarity to data packaging keywords -- by :user:`layday`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Attempted to bring clarity to data packaging keywords -- by :user:`layday`
Attempted to bring clarity to package data parameters -- by :user:`layday`

@abravalheri
Copy link
Contributor

abravalheri commented Oct 28, 2021

use options.package_data to include some files during install or in the wheel (but not the source distribution), like generated build artifacts

If you use include_package_data=True then package_data has no effect. See #1461.

Hi @layday, I did a small experiment and I reached the following conclusion:

  • In wheels, "data files" will be included in the distribution if:

    • package_data includes them OR
    • include_package_data=True AND MANIFEST.in includes them.
  • In sdists, "data files" will be included in the distribution if:

    • MANIFEST.in includes them OR
    • include_package_data=False (or not present) AND package_data includes them.

Therefore we can say that package_data is not ignored.

The reason why I tested the absence of MANIFEST.in is that the Python Packaging User Guide seems to indicate it is not always necessary.

You can check the code and the results for the experiment in: https://github.com/abravalheri/experiment-setuptools-package-data

(I might have got something incorrect here, so any review is appreciated).


It seems to me that a good rule of thumb is to go with package_data for wheels and MANIFEST.in for sdists...

@layday
Copy link
Member Author

layday commented Oct 29, 2021

So what you're saying is that package_data is respected even when include_package_data=True but only for wheels. That leaves one with two semi-sane options. Either to use MANIFEST.in for non-package data files in conjunction with package_data and include_package_data=False, or to use MANIFEST.in only with include_package_data=True, forgoing package_data.

distribution  `include_package_data`  `package_data`  package data from `MANIFEST.in`
------------  ---------------------   -------------   -------------------------------
sdist         True                    ignored         included
sdist         False                   included        included
wheel         True                    included        included
wheel         False                   included        ignored

@layday
Copy link
Member Author

layday commented Oct 29, 2021

Of course, if one goes with the latter option, exclude_package_data only takes effect for wheels. What a mess...

@abravalheri
Copy link
Contributor

I updated the experiment to consider exclude_package_data.

According to the results:

  • In wheels:

    • exclude_package_data will ALWAYS prevent "data files" from being included in the distribution;

    • after considering that, "data files" will be included in the distribution if:

      • package_data lists them OR
      • include_package_data=True AND MANIFEST.in includes them.
  • In sdists, "data files" will be included in the distribution if:

    • MANIFEST.in includes them OR
    • include_package_data=False (or not present)
      AND package_data lists them
      AND exclude_package_data does not list them

Please notice this considers the extreme case, when the data files are placed inside directories that are not valid Python packages (e.g. missing __init__.py files or whose names are not valid python identifiers)

@abravalheri
Copy link
Contributor

abravalheri commented Oct 29, 2021

Of course, if one goes with the latter option, exclude_package_data only takes effect for wheels. What a mess...

If MANIFEST.in is being used, then I suppose that can be done there with a global-exclude or similar directive.

That leaves one with two semi-sane options. Either to use MANIFEST.in for non-package data files in conjunction with package_data and include_package_data=False, or to use MANIFEST.in only with include_package_data=True, forgoing package_data.

OR removing MANIFEST.in, include_package_data and exclude_package_data and always including package_data explicitly.

This (a) attempts to frame setuptools data inclusion keywords in terms
of wheel and source distributions and (b) to warn users about common
pitfalls.
@layday
Copy link
Member Author

layday commented Oct 29, 2021

OR removing MANIFEST.in, include_package_data and exclude_package_data and always including package_data explicitly.

MANIFEST.in is needed to include non-package data files in sdists. You could keep package and non-package data files separate, like in the first option I presented. Either way, I'm not updating the data file guide in this PR, just the keyword documentation.

@layday
Copy link
Member Author

layday commented Oct 29, 2021

P.S. Thanks for testing the various combinations, it was very helpful.

@abravalheri
Copy link
Contributor

Just for clarifications, I have the impression that non-package data files are deprecated, since they don't work in wheels, right?

@layday
Copy link
Member Author

layday commented Oct 29, 2021

That's a different kind of "data file" - it refers to data files installed outside of the Python prefix. When I say non-package data files, I mean files which you want to include in the sdist which are not contained in a package and which will not be installed, e.g. your licence or docs.

@jaraco
Copy link
Member

jaraco commented Nov 1, 2021

I would recommend not to think about sdist and wheel as comparable formats, but instead to think about them as two sides of the build/install step.

sdist is simply a collection of sources as a tarball, comparable to getting the sources from github. The MANIFEST is one way of specifying which sources to include in the source dist. Others include simply relying on the defaults, using a file-finder plugin (like setuptools_scm), or possibly relying on files implied by other settings.

When building a wheel, on the other hand, you're specifying the files that, given the sources (either from a checkout or sdist or simple files), will be built and installed.

There's a loose relationship between the two formats, but they're largely unrelated. The main thing that's important is that any files that are needed to build the wheel should also be made present in the sdist (otherwise when building from sdist, the package will produce a different install than when building from repo source).

I hope that helps explain why these seemingly similar formats actually have very different semantics about how to specify the inputs.

@layday
Copy link
Member Author

layday commented Nov 1, 2021

It doesn't really; the semantics are muddied. The options do not behave in the way they do because the formats are not 'comparable'. They've simply evolved over the course of some fifteen odd years along with the packaging ecosystem as a whole. A lot of what might've seemed good in principle did not pan out in practice. As a result data file inclusion in setuptools is a major pain point for newcomers just as much as it is for seasoned developers. The nomenclature is problematic and the interaction between the options both surprising and surprisingly complex. The least we can do is spell it all out in the documentation, so when things don't work in the way people might expect, they might have some indication as to why.

@abravalheri
Copy link
Contributor

abravalheri commented Nov 1, 2021

Thanks @jaraco for the information. Indeed, if we consider that in order to be included to the wheel the data file should first be "added to the sdist", the same logical expressions for the inclusion of files in the distribution can be derived from the experiment.

So I understand that the desired behaviour is:

  1. default setuptools algorithm/file-finder plugins/MANIFEST.in list files to be included in the sdist
  2. include_package_data=True allow all the files inside the package directories in the sdist to be added to the wheel.
  3. package_data SHOULD add extra files (e.g. automatically generated files) to both sdist and wheel.
  4. exclude_package_data refines the behaviour of include_package_data and package_data.

However item 3. is not working correctly due to the bug pointed out by @jaraco in #1461, and is the source of all the confusion.

abravalheri added a commit to abravalheri/setuptools that referenced this pull request Nov 1, 2021
The inconsistency for the `package_data` configuration in sdists
when `include_package_data=True` in pypa#1461 have been causing some
problems for the community for a while, as also shown in pypa#2835.

As pointed out by
[@jaraco](pypa#1461 (comment)),
this was being caused by a mechanism to break the recursion between the
`egg_info` and `sdist` commands.

In summary the loop is caused by the following behaviour:

- the `egg_info` command uses a subclass of `sdist` (`manifest_maker`)
  to calculate the MANIFEST,
- the `sdist` class needs to know the MANIFEST to calculate the data files when
  `include_package_data=True`

Previously, the mechanism to break this loop was to simply ignore
the data files in `sdist` when `include_package_data=True`.

The approach implemented in this change was to replace this mechanism,
by allowing `manifest_maker` to override the `_safe_data_files` method
from `sdist`.

---

Please notice [an extensive experiment]
(https://github.com/abravalheri/experiment-setuptools-package-data)
was carried out to investigate the previous confusing behaviour.

There is also [a simplified theoretical analysis]
(pyscaffold/pyscaffold#535 (comment))
comparing the observed behavior in the experiment and the expected
one. This analysis point out to the same offender indicated by
[@jaraco](pypa#1461 (comment))
(which is being replaced in this change).
abravalheri added a commit to abravalheri/setuptools that referenced this pull request Nov 1, 2021
The inconsistency for the `package_data` configuration in sdists
when `include_package_data=True` in pypa#1461 have been causing some
problems for the community for a while, as also shown in pypa#2835.

As pointed out by
[@jaraco](pypa#1461 (comment)),
this was being caused by a mechanism to break the recursion between the
`egg_info` and `sdist` commands.

In summary the loop is caused by the following behaviour:

- the `egg_info` command uses a subclass of `sdist` (`manifest_maker`)
  to calculate the MANIFEST,
- the `sdist` class needs to know the MANIFEST to calculate the data files when
  `include_package_data=True`

Previously, the mechanism to break this loop was to simply ignore
the data files in `sdist` when `include_package_data=True`.

The approach implemented in this change was to replace this mechanism,
by allowing `manifest_maker` to override the `_safe_data_files` method
from `sdist`.

---

Please notice [an extensive experiment]
(https://github.com/abravalheri/experiment-setuptools-package-data)
was carried out to investigate the previous confusing behaviour.

There is also [a simplified theoretical analysis]
(pyscaffold/pyscaffold#535 (comment))
comparing the observed behavior in the experiment and the expected
one. This analysis point out to the same offender indicated by
[@jaraco](pypa#1461 (comment))
(which is being replaced in this change).
@layday
Copy link
Member Author

layday commented Nov 3, 2021

Closing as #2844 will fix some of this behaviour and there's no agreement on framing data packaging in terms of sdists and wheels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants