Computational performance - optimize for speed #107

CBongiova opened this issue Jan 17, 2022 · 3 comments

@CBongiova

Hi,

I am using the ScikitLearn.jl library to train Random Forest classifiers. After training, I notice that re-applying the trained models to new datapoints takes about 0.2 seconds. After some tests, it seems that this amount of time is unrelated to the number of trees and features. Instead, it seems to be latency.

I had a look at the scikit-learn webpage here: https://scikit-learn.org/0.15/modules/computational_performance.html
There they mention that the computational performance of scikit-learn heavily relies on NumPy/SciPy and linear algebra, and that it makes sense to take care of these libraries. So they propose checking that NumPy is built against an optimized BLAS/LAPACK library, as follows:

from numpy.distutils.system_info import get_info
print(get_info('blas_opt'))
print(get_info('lapack_opt'))

Any idea of how I can check for this in Julia?
Otherwise, do you have any suggestions to speed up the ScikitLearn.jl predictions?

@cstjean
Owner

cstjean commented Jan 17, 2022

Any idea of how I can check for this in Julia?

I can't help directly, but ScikitLearn.jl is built on PyCall.jl, so you can run the same check through it. Something like:

using PyCall
sys_info = pyimport("numpy.distutils.system_info")
sys_info.get_info("blas_opt")  # shows which BLAS NumPy was built against

Otherwise, do you have any suggestions to speed up the ScikitLearn.jl predictions?

Are you making one call with a big n_samples × n_features matrix to get your predictions?
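
For instance, here is a minimal sketch of the difference (the RandomForestClassifier and the toy data are just stand-ins for your setup):

using ScikitLearn
@sk_import ensemble: RandomForestClassifier

# toy data: 1000 samples with 45 features each (placeholder for your real data)
X, y = rand(1000, 45), rand(0:1, 1000)
model = fit!(RandomForestClassifier(n_estimators=100), X, y)

# one call per sample: pays the Python call overhead 1000 times
@time for i in 1:size(X, 1)
    predict(model, X[i:i, :])
end

# a single batched call: pays the overhead once
@time predict(model, X)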

Apart from that, it all depends on the Python code, so there's not much I can do there. DecisionTree.jl might provide a faster, pure-Julia alternative.

@CBongiova
Author

Hi @cstjean,

Thanks for your reply!

Are you making one call with a big n_samples × n_features matrix to get your predictions?

No, I actually use the trained random forest classifiers (100 trees) to make single-sample predictions online. That is, each time I only have one datapoint with about 45 features. Extracting the features is almost instantaneous, whereas making the prediction takes about 0.1 seconds.

I have actually found this discussion on Stack Overflow: https://stackoverflow.com/questions/50676717/why-sklearn-random-forest-takes-the-same-time-to-predict-one-sample-than-n-sampl
The 0.1 seconds seems to be per-call latency that is unavoidable with scikit-learn... maybe other libraries or ML approaches are more appropriate for real-time applications.

@cstjean
Owner

cstjean commented Jan 18, 2022

DecisionTree.jl supports the ScikitLearn interface, so it shouldn't be too hard to give it a try!
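
For reference, a rough sketch of what that could look like with the DecisionTree.jl package's ScikitLearn-style API (the toy data is a placeholder, and keyword names such as n_trees may vary between versions):

using DecisionTree

# toy data standing in for the real 45-feature problem above
X, y = rand(1000, 45), rand(0:1, 1000)

# n_trees plays the role of scikit-learn's n_estimators
model = RandomForestClassifier(n_trees=100)
fit!(model, X, y)

# single-sample prediction stays in Julia, so there is no Python call overhead
predict(model, rand(1, 45))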
