Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix model locale issue and improve model R/W performance. #3405

Merged
merged 33 commits into from
Dec 8, 2020

Conversation

AlbertoEAF
Copy link
Contributor

@AlbertoEAF AlbertoEAF commented Sep 23, 2020

When Java is used, the default C++ locale is broken. This is true for
Java providers that use the C API or even Python models that require JEP.

This patch solves that issue making the model reads/writes insensitive
to such settings.
To achieve it, within the model read/write codebase:

  • C++ streams are imbued with the classic locale
  • Calls to functions that are dependent on the locale are replaced
  • The default locale is not changed!

This approach means:

Changes:

  • Performance improvements: More than 35% faster model read/writes. Use fast libraries for locale-agnostic conversion:
  • Add CommonC namespace which provides faster locale-independent versions of Common's methods as if using the "C" locale
  • Model code makes conversions through CommonC
  • Cleanup unused Common methods

Bugfixes:

@AlbertoEAF AlbertoEAF changed the title Fix LightGBM models locale sensitivity and improve R/W performance. Fix locale sensitivity and improve R/W performance. Sep 23, 2020
@AlbertoEAF AlbertoEAF changed the title Fix locale sensitivity and improve R/W performance. Fix model locale sensitivity and improve R/W performance. Sep 23, 2020
@AlbertoEAF AlbertoEAF changed the title Fix model locale sensitivity and improve R/W performance. Fix model locale issue and improve R/W performance. Sep 23, 2020
@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 23, 2020

Hello @guolinke @StrikerRUS @henry0312 @imatiach-msft I think finally the fix is in #3267! :)

Running the integration tests

We are using a Java provider for LightGBM focused on low-latency: https://github.com/feedzai/feedzai-openml-java/tree/master/openml-lightgbm

The Java provider PR where the new tests and LightGBM patch live can be found here: feedzai/feedzai-openml-java#53.

To run the tests just clone it and do mvn clean install -DskipTests once on the full repo, and then you can just run mvn test inside the openml-lightgbm folder to test.

This will fetch the requested LightGBM codebase, build it in a docker container and then start building the Java provider which uses LightGBM's .so's and .jar. Finally it will run the tests. As I am using a tool to compare the output model files you need python >=3.6 and termcolor installed for the time being.

Integration tests and results

To prove the patch works, there are many tests. The main ones for this issue are 2:

  1. Compare scores to reference scores and ensure they remain the same as in v3.0.0.
  2. Compare round-trip reference model file read followed by a write to a file. The output model should be the same with the v3.0.0 code as with the patch.

When locale is forced to be "C" both v3.0.0 and this custom version work, but when another locale is used and C is not enforced (pt_UTF-8 in my case)
image
(openml-lightgbm/lightgbm-provider/pom.xml)

the standard v3.0.0 starts failing the tests:
image
(openml-lightgbm/lightgbm-builder/pom.xml)

with differences in the output round-trip model read/write with the C locale:
image
and failed tests including wrong scores:
image

With the patched version however, all outputs match the behaviour of the v3.0.0 codebase with the C locale even when using other locales:
image
image

@AlbertoEAF
Copy link
Contributor Author

Any tips on what's causing so many failed builds? In my system it works.
I added 2 new libraries, but they should be compatible with gcc 4.8.4.

@guolinke
Copy link
Collaborator

@AlbertoEAF did you test the model loading speed of This PR?
I remember these deleted codes are integrated for (large) model loading performance.

@henry0312
Copy link
Contributor

Hello, Alberto.
I'll also check this PR.

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 24, 2020

@AlbertoEAF did you test the model loading speed of This PR?
I remember these deleted codes are integrated for (large) model loading performance.

Hello @guolinke no I didn't. Any hints on what's a large model?
I guess I can try manually modifying one of my LGBM model .txt files to have 5000 trees or something by copying+pasting.

Also I have another question, I guess we should have tests for this right? I'm not sure how we should proceed though. Could we have a regression test with a reference model file in the repo so we could read and re-write it to disk and compare, just like we do on the Java provider? (~10MB model.txt preferrably in git-lfs to avoid weighing down on the repo).

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 24, 2020

All failing R-package tests are not cloning the new git submodules in external_libs/ which results in things like:

CMake Error at CMakeLists.txt:448 (add_subdirectory):
  add_subdirectory given source "external_libs/fmt" which is not an existing
  directory.

As for Python I didn't find the cause but could it be the same thing? For instance the compute submodule is explicitly copied in setup.py. Do the two new submodules under external_libs need the same handling?

Not used to Travis nor this build system, would appreciate help to fix the failing builds.

@StrikerRUS
Copy link
Collaborator

@AlbertoEAF

All failing R-package tests ...

Please don't care about them. There are a lot of maintenance burden to get new files worked with R CRAN installation process. We will help with it later.

As for Python ... Do the two new submodules under external_libs need the same handling?

Yes, I guess two new folders should be copied unconditionally just like src and include:

copy_files_helper('include')
copy_files_helper('src')

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 24, 2020

@guolinke Regarding model read/write speed, I ran first all the tests I had before with both versions.

Regular tests. 3 runs per LightGBM version. Both display similar performance even though the patched version seems slightly faster (not statistically significant with only these runs):

v3.0.0: {3.8s, 5.2s, 4.4s}
patch: {3.4s, 2.9s, 3.5s}

Then I took a model file with 200 trees and 8.8MB, wrote a script to sample more trees from it and add it to the file and generated a model with 50200 trees and 2.2GB. I ran all the tests as before as well as read/re-write that model to disk with both versions of LightGBM:

v3.0.0: {1m 41s, 1m 48s, 1m 46s}
patch: {1m 09s, 1m 07s, 1m 09s}

The new version is significantly faster, by around 35% on round-trip model read+writes.

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 24, 2020

Ok, added that to the Python setup @StrikerRUS but some of the builds still fail.

For instance, this one fails building fast_double_parser: https://travis-ci.org/github/microsoft/LightGBM/jobs/730024056 because it lacks the non-standard strtod_l for instance (check line 704, and 708 - which misses the locale_t type).

I'm not sure I understand all this environments, what do they differ by? I guess it's related to the compiler. With the pretty old Ubuntu 14.04 and gcc 4.8.4 this compiles though.

@guolinke
Copy link
Collaborator

guolinke commented Sep 25, 2020

@AlbertoEAF great! happy to see the performance gain!

@lemire
Copy link

lemire commented Sep 25, 2020

For instance, this one fails building fast_double_parser: https://travis-ci.org/github/microsoft/LightGBM/jobs/730024056 because it lacks the non-standard strtod_l for instance (check line 704, and 708 - which misses the locale_t type).

Let me comment here as well as in the related issue you just opened. I do not think that the support is missing. The issue, I believe, is that the xlocale.h file does not get included whereas it should get included. It is almost surely my fault.

See my comment. I think we can fix this easily with a few lines of code.

@jameslamb
Copy link
Collaborator

Please let me know when this is ready for help on the R side, and I'll make a PR into this one.

CRAN (the official package manager for R) has very strict requirements for portability of compiled code, so if those checks turn up anything that would make the R package not work on CRAN's set of environments (https://cran.r-project.org/web/checks/check_results_lightgbm.html), we'll have to talk about how to make this change in a way that doesn't violate their requirements.

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 29, 2020

Hello,

Seeing some Python builds still where I suspect the setup.py is not copying the external_libs as it still complains of not finding all the dependencies. That was the error I had in every single Python build before adding

copy_files_helper('external_libs')

I still get that error in some Python builds though. Example: https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=7444&view=logs&j=02a2c3ba-81f8-54e3-0767-5d5adbb0daa9&t=720ee3fa-96d4-5b47-dbf4-01607b74ade2&l=1269).

Any tips? Should I declare that copy outside the "if"? As this issue seems to happen for "dist" "sdist" "bdist" builds.

However that doesn't make sense either, as looking at the full log the compute folder is being copied but external_libs are not. However, copying compute is conditional on passing through those copies first, as well as src and include folders. So I wonder what's going on 🤔 . Is it fetching a different setup.py from somewhere else?

I ask that as in the full log I see:

...
copying compile/compute/...
...
copying compile/src/...
...

but I don't see copying external_libs or include. Also, looking at setup.py I was expecting to see copies of include, src, external_libs and optionally compute, in this order:

if not os.path.isfile(os.path.join(CURRENT_DIR, '_IS_SOURCE_PACKAGE.txt')):
copy_files_helper('include')
copy_files_helper('src')
copy_files_helper('external_libs')
if not os.path.exists(os.path.join(CURRENT_DIR, "compile", "windows")):
os.makedirs(os.path.join(CURRENT_DIR, "compile", "windows"))
copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "windows", "LightGBM.sln"),
os.path.join(CURRENT_DIR, "compile", "windows", "LightGBM.sln"),
verbose=0)
copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "windows", "LightGBM.vcxproj"),
os.path.join(CURRENT_DIR, "compile", "windows", "LightGBM.vcxproj"),
verbose=0)
copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "LICENSE"),
os.path.join(CURRENT_DIR, "LICENSE"),
verbose=0)
copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "CMakeLists.txt"),
os.path.join(CURRENT_DIR, "compile", "CMakeLists.txt"),
verbose=0)
if integrated_opencl:
copy_file(os.path.join(CURRENT_DIR, os.path.pardir, "CMakeIntegratedOpenCL.cmake"),
os.path.join(CURRENT_DIR, "compile", "CMakeIntegratedOpenCL.cmake"),
verbose=0)
if use_gpu:
copy_files_helper('compute')

Any ideas @StrikerRUS ?

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Sep 29, 2020

I took a look at all failing Travis builds. I listed what I believe are 4 types of build errors which might require changes:

  1. Disable pylint over the external_libs:
  1. R builds failing compilation without -fPIC (@jameslamb). Any reason not to compile shared libs with -fPIC? Did I inadvertently mess something in the build? ^UPDATE: I misread the log, thanks @jameslamb. In C++ with USE_MPI=ON, fmt is being compiled without -fPIC, don't know how to force it:
  1. Some Python builds where I believe setup.py is not adding external_libs and I don't know why:
  1. My noobiness with CMakeLists. Somehow TARGET_LINK_LIBRARIES is not being properly used by me?

I'd appreciate input over any of the issues @StrikerRUS @guolinke @jameslamb :) Thank you!

@jameslamb
Copy link
Collaborator

jameslamb commented Sep 29, 2020

R builds failing compilation without -fPIC (@jameslamb). Any reason not to compile shared libs with -fPIC? Did I inadvertently mess something in the build?
https://travis-ci.org/github/microsoft/LightGBM/jobs/731344754

This is not an R build. All of the R builds are on GitHub Actions.

This is a C++ test, checking code compiled with USE_MPI=ON:

image

Sorry, I don't really understand why it's failing.

Does add_subdirectory(external_libs/fmt) mean "also compile fmt, using fmt/CMakeLists.txt"?

If so, I believe that opens us up to a lot of weird issues, where values like CXXFLAGS set in LightGBM's CMakeLists.txt could conflict with those in fmt/CMakeLists.txt. Maybe that is why you're seeing these issues.

@AlbertoEAF
Copy link
Contributor Author

@jameslamb by rebasing against the latest master now there are 2 failing jobs in Travis: python 2 & mpi build

I thought python2 support was being dropped

The other error was "connection reset by peer".

Any ideas?

@jameslamb
Copy link
Collaborator

there are 2 failing jobs in Travis: python 2 & mpi build

The two test failures look unrelated to your code. I'll just restart them.

by rebasing against the latest master now

By the way, we squash commits on merge in this project, so you don't have to rebase + force push every time. You can just git merge master and push merge commits.

I like doing that because then git will raise an error if your branch isn't up to date with the remote, which usually means you accepted suggestions in the GitHub UI but then forgot to pull those changes to your local branch. If you rebase + force push, you can end up accidentally force-pushing over those changes, which results in more review churn.

Not a big deal though, totally up to your preference.

I thought python2 support was being dropped

Officially, 3.1.x is the final minor release to support Python 2. 3.2.x will not support it. But we haven't removed the support yet, because if we decide to release a bugfix-only release next (3.1.1), we have to keep that support.

You can follow #3581 for the current status of Python 2 in this project.

@AlbertoEAF
Copy link
Contributor Author

AlbertoEAF commented Nov 25, 2020

@jameslamb @StrikerRUS finally green tests! Anything left to merge? :D

Thank you so much for all the help ;)

@StrikerRUS
Copy link
Collaborator

@guolinke Could you please review this?

@guolinke
Copy link
Collaborator

I will review it today

.gitmodules Show resolved Hide resolved
@jameslamb
Copy link
Collaborator

Now that we have all these approvals (!!!), I think we need to wait to merge until we decide if the next release will be 3.1.1 or 3.2.0.

@StrikerRUS
Copy link
Collaborator

Yeah, I agree that this "sensitive" PR should go into 3.2.0.

@guolinke
Copy link
Collaborator

okay, let us wait for 3.1.1 release first

@jameslamb jameslamb mentioned this pull request Nov 29, 2020
16 tasks
@AlbertoEAF
Copy link
Contributor Author

I saw 3.1.1 is out! 🚀

@guolinke do you think we can merge now?

@guolinke
Copy link
Collaborator

guolinke commented Dec 8, 2020

@AlbertoEAF yeah, this should be included in v3.2.0

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 24, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants