Hello from Chicago 👋
Recently in LightGBM, I've been working on reducing the size of our Python package's source distribution. I found that many extra files from git submodules were being bundled in the package. This can be a problem in storage-sensitive environments. For example, the first time I tried to use lightgbm + pandas + scikit-learn together on AWS Lambda, I had to do some surgery to trim out unnecessary files, to avoid hitting the 250 MB limit for extra packages (see the description of microsoft/LightGBM#3579 if you're curious).
Cutting the package size could also help PyPI's data transfer costs a little bit 😀
I cut the size of lightgbm's sdist package by making the rules in MANIFEST.in more specific, to target only the files that were needed. You can see the diffs for the PRs below:
I can see that there are some files in xgboost that are not necessary. For example, all of the dmlc-core unit test code and even dmlc-core's .git/ directory are currently bundled in the package produced by python setup.py sdist.
How I'm checking the contents of the package:
# with a clone of the repo
git submodule update --init --recursive
cd python-package
python setup.py sdist
open xgboost.egg-info/SOURCES.txt
# or from PyPI
wget https://files.pythonhosted.org/packages/8e/cd/c1c48514cdd03d735d38d2de471474eb7adc53fc5278cb4a877a25a29976/xgboost-1.3.1.tar.gz -O xgboost.tar.gz
tar -xvf xgboost.tar.gz
open xgboost-1.3.1/xgboost.egg-info/SOURCES.txt
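Reading SOURCES.txt shows which files are bundled, but not how much space each group takes up. Here is a rough sketch of how the same check could be scripted against the tarball itself, using only the standard library. The function name `summarize_sdist`, the `UNWANTED_PARTS` set, and the grouping depth are all my own assumptions for illustration, not anything from xgboost or LightGBM.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical markers for paths that probably do not belong in an sdist.
UNWANTED_PARTS = {".git", "test", "tests"}


def summarize_sdist(sdist_path, depth=3):
    """Return (bytes per path prefix, list of suspicious member paths).

    Files are grouped by their first `depth` path components, for example
    "xgboost-1.3.1/xgboost/dmlc-core" when depth is 3.
    """
    sizes = Counter()
    suspicious = []
    with tarfile.open(sdist_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, symlinks, and so on
            parts = PurePosixPath(member.name).parts
            sizes["/".join(parts[:depth])] += member.size
            # Flag the file if any path component is in the unwanted set.
            if UNWANTED_PARTS.intersection(parts):
                suspicious.append(member.name)
    return sizes, suspicious
```

Calling `summarize_sdist("xgboost.tar.gz")` and then looping over `sizes.most_common(10)` would list the ten largest groups of files, and `suspicious` would hold any path with a `.git` or test directory in it. Matching on whole path components is deliberately crude, so it would miss test code kept under other directory names.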
I'd be happy to do this same work for the xgboost Python package, making the MANIFEST.in rules more specific to trim out unnecessary files. Would you consider a PR that did something similar?
Thanks for your time and consideration.