[python-package] remove unnecessary files to reduce sdist size #3639

Merged (4 commits) on Dec 11, 2020


jameslamb
Collaborator

Similar to #3579, this PR proposes making python-package/MANIFEST.in stricter to prevent unnecessary files from being bundled in the source distribution of the Python package.

#3405 introduced two new submodules (fmt and fast_double_parser), and right now all of their contents are being bundled in the source distribution of lightgbm. That includes a lot of files that are unnecessary for LightGBM, like tests and documentation.

This PR removes them. See #3579 for why this is worth caring about.

| | 3.1.0 | 3.1.1 | master | this PR |
|---|---|---|---|---|
| sdist (compressed) | 728K | 572K | 1.9M | 712K |
| sdist (uncompressed) | 13M | 8.2M | 16M | 8.8M |
| wheel (compressed) | 1.6M | 1.6M | 1.6M | 1.6M |
| wheel (uncompressed) | 4.6M | 4.6M | 4.7M | 4.7M |
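
To put the numbers above in relative terms, here is a small sketch that converts `du -h`-style sizes into a percentage reduction. The helper names `to_kib` and `percent_reduction` are hypothetical (they are not part of LightGBM), and the sketch assumes the `K` / `M` suffixes are powers of 1024, as GNU `du -h` reports them.

```shell
# Hypothetical helper: convert a du -h style size (e.g. "712K", "1.9M")
# to kibibytes. Assumes suffixes are powers of 1024.
to_kib() {
    echo "$1" | awk '{
        value = $1 + 0                      # numeric prefix, e.g. 1.9
        unit = substr($1, length($1))       # last character, e.g. "M"
        if (unit == "M") value *= 1024
        else if (unit == "G") value *= 1024 * 1024
        print value
    }'
}

# Hypothetical helper: percentage reduction going from size $1 to size $2.
percent_reduction() {
    awk -v before="$(to_kib "$1")" -v after="$(to_kib "$2")" \
        'BEGIN { printf "%.1f\n", 100 * (before - after) / before }'
}

# compressed sdist: master vs. this PR
percent_reduction 1.9M 712K
```

Under those assumptions, the compressed sdist shrinks by roughly 63% relative to master. The inputs are already rounded by `du -h`, so treat the result as approximate.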

checking the size of the python package

You can run the script below, ./check-sizes.sh, to calculate the size of the Python package.

check-sizes.sh
pushd "$(pwd)/python-package"

    # clean up files from previous builds
    rm -rf build_cpp
    rm -rf build
    rm -rf compile
    rm -rf dist
    rm -rf lightgbm.egg-info

    echo ""
    echo "building source distribution"
    echo ""
    python setup.py sdist > ~/lgb-tmp.log
    cp lightgbm.egg-info/SOURCES.txt ~/LIGHTGBM-SOURCES.txt
    pushd dist/
        echo ""
        echo "sdist compressed size"
        echo ""
        du -a -h .
        tar -xf lightgbm*.tar.gz
        rm lightgbm*.tar.gz
        ls .
        echo ""
        echo "sdist uncompressed size"
        echo ""
        du -sh .
    popd

    sleep 10

    echo ""
    echo "building wheel"
    echo ""
    rm -rf build_cpp
    rm -rf build
    rm -rf compile
    rm -rf lightgbm.egg-info
    rm -rf dist/
    python setup.py bdist_wheel --universal >> ~/lgb-tmp.log
    pushd dist/
        echo ""
        echo "wheel compressed size"
        echo ""
        du -a -h .
        unzip lightgbm*.whl
        rm *.whl
        echo ""
        echo "wheel uncompressed size"
        echo ""
        du -sh .
    popd

popd

Note that the script copies lightgbm.egg-info/SOURCES.txt to ~/LIGHTGBM-SOURCES.txt. Inspect that file to see the full list of everything included in the sdist. This is how I figured out what changes to make in MANIFEST.in. For example, it showed that fast_double_parser's test data is stored in a .txt file, so a rule matching *.txt was including it.
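
One way to do that kind of inspection is to group the listed files by extension for a single submodule. The sketch below is illustrative only: `count_by_extension` is a hypothetical helper, and it assumes each path in the SOURCES.txt copy contains the submodule's directory name as a path component (for example `.../fmt/...`).

```shell
# Hypothetical helper: for paths in a SOURCES.txt-style file ($1) that pass
# through a directory named $2, print "<count> <extension>" per extension,
# most common first.
count_by_extension() {
    grep -F "/$2/" "$1" \
        | awk -F/ '{
            n = split($NF, parts, ".")               # split the file name on dots
            ext = (n > 1) ? parts[n] : "(none)"      # last piece is the extension
            counts[ext]++
          }
          END { for (ext in counts) print counts[ext], ext }' \
        | sort -k1,1nr -k2,2
}

# Example (only runs if the file written by check-sizes.sh exists):
if [ -f ~/LIGHTGBM-SOURCES.txt ]; then
    count_by_extension ~/LIGHTGBM-SOURCES.txt fast_double_parser
fi
```

An unexpected extension in the output, such as `txt` or `cc`, suggests which MANIFEST.in rule is pulling in extra files.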

how LightGBM uses fmt and fast_double_parser

LightGBM only uses header files from these two libraries. Including each project's CMakeLists.txt and license files, the sdist only needs these files:

fast_double_parser/CMakeLists.txt
fast_double_parser/LICENSE
fast_double_parser/LICENSE.BSL
fast_double_parser/include/fast_double_parser.h
fmt/CMakeLists.txt
fmt/LICENSE.rst
fmt/include/fmt/chrono.h
fmt/include/fmt/color.h
fmt/include/fmt/compile.h
fmt/include/fmt/core.h
fmt/include/fmt/format-inl.h
fmt/include/fmt/format.h
fmt/include/fmt/locale.h
fmt/include/fmt/os.h
fmt/include/fmt/ostream.h
fmt/include/fmt/posix.h
fmt/include/fmt/printf.h
fmt/include/fmt/ranges.h

Notes for reviewers

  • A similar step isn't necessary for the R package sent to CRAN because, instead of bundling the entire submodules, it copies only the necessary files:

        cp \
            external_libs/fast_double_parser/include/fast_double_parser.h \
            ${TEMP_R_DIR}/src/include/LightGBM
        mkdir -p ${TEMP_R_DIR}/src/include/LightGBM/fmt
        cp \
            external_libs/fmt/include/fmt/*.h \
            ${TEMP_R_DIR}/src/include/LightGBM/fmt/

Collaborator

StrikerRUS left a comment

Nice size reduction! Thank you!
