
OutOfMemory Exception in F64Matrix constructor - maximum array bounds exceeded #59

Open
fstandhartinger opened this issue Feb 26, 2018 · 11 comments


@fstandhartinger

Hi there!

First: Thanks for the great work, excellent design you have there!

I am experiencing an OutOfMemoryException in the constructor of F64Matrix that does not really come from a memory shortage, but from the fact that F64Matrix internally uses a single one-dimensional double array, which can quite easily exceed .NET's maximum array size.

In my case I tried to create an F64Matrix with 10 million rows and 55 columns.
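For scale, a quick back-of-the-envelope calculation (my own arithmetic, just to illustrate why the allocation fails):

```csharp
using System;

class F64MatrixSizeCheck
{
    static void Main()
    {
        // 10 million rows * 55 columns, 8 bytes per double
        long elements = 10_000_000L * 55;        // 550,000,000 elements
        long bytes = elements * sizeof(double);  // 4,400,000,000 bytes, ~4.1 GiB
        Console.WriteLine($"{elements:N0} elements = {bytes:N0} bytes");
        // A single double[] of this size is well above the CLR's default
        // 2 GB per-object limit, hence the OutOfMemoryException.
    }
}
```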

My preferred suggestion would be to abstract the matrix behind an IF64Matrix interface that probably only needs to consist of the At() method overloads. This would enable users to provide a custom implementation capable of handling larger amounts of data, if needed even one that swaps data to and from disk.
Another solution could be to change the internal implementation of F64Matrix to an array of double arrays (a jagged array), which I believe would also help.
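To make the jagged-array idea concrete, here is a rough sketch (the interface and class names are just illustrative, not existing SharpLearning types):

```csharp
// Illustrative only: IF64Matrix and JaggedF64Matrix are not existing SharpLearning types.
public interface IF64Matrix
{
    int RowCount { get; }
    int ColumnCount { get; }
    double At(int row, int col);
    void At(int row, int col, double value);
}

// Jagged-array-backed implementation: each row is a separate object,
// so no single allocation has to hold all rows * cols elements.
public sealed class JaggedF64Matrix : IF64Matrix
{
    private readonly double[][] m_rows;

    public JaggedF64Matrix(int rows, int cols)
    {
        m_rows = new double[rows][];
        for (var i = 0; i < rows; i++)
            m_rows[i] = new double[cols];
        ColumnCount = cols;
    }

    public int RowCount => m_rows.Length;
    public int ColumnCount { get; }

    public double At(int row, int col) => m_rows[row][col];
    public void At(int row, int col, double value) => m_rows[row][col] = value;
}
```

Since each row is a separate object, only a single row (55 doubles in my case) has to fit under the per-object limit.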

Thanks for your help and keep up the excellent work!

Best regards

Florian

@mdabros
Owner

mdabros commented Feb 26, 2018

Hi @fstandhartinger,

Thanks for your kind words and thanks for reporting the issue!

There are a few other issues also related to the matrix implementation:
#6, #20

The current implementation is something that will need to change in the future, partly because of issues like this, but also to support something more general like Tensors for neural networks and other machine learning algorithms.
Microsoft is introducing a tensor type, which I am currently considering using as the base container for SharpLearning in the future. This will be able to support a very large number of elements. There are still many things to consider, and the decision is definitely not final yet, so suggestions like yours are very welcome to help guide the decision process.
Currently, most of my time goes into the backend project (#35) for adding better neural network support via TensorFlowSharp and CNTK. So the matrix change will probably not happen in the very near future.

Regarding the specific issue, you should be able to create an array with up to System.Int32.MaxValue elements, which is around 2 billion, using .NET: What is the Maximum Size that an Array can hold?

As the Stack Overflow posts also mention, there is (or at least was) a default maximum size of 2 gigabytes per object, but on 64-bit, using the gcAllowVeryLargeObjects setting, this limit can be switched off. I think this setting might be on by default for .NET Core applications, but if you are implementing a .NET Framework application, it should be useful.
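For reference, the App.config entry for this looks as follows (the process must also run as 64-bit):

```xml
<configuration>
  <runtime>
    <!-- Lift the default 2 GB per-object limit on 64-bit platforms -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>
```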

Hope this helps, both short term and long term.

Best regards
Mads

@fstandhartinger
Author

Thanks @mdabros,

Good to know you are planning to switch to a more flexible approach for storing observations and targets.
So far I didn't get the gcAllowVeryLargeObjects approach to work; I'll have to dig into it more deeply at some point, but thanks for pointing me there.
In the meantime I will just train on a smaller number of records; a couple of million observations should be enough anyhow.

Regarding backends: What do you think about adding XGBoost as a backend for gradient boosting at some point? I think it's a shame that today there is no proper way to use one of the leading machine learning libs out there with .NET (except an outdated .NET wrapper). I don't get why Microsoft isn't jumping on the train here and providing a proper interface to strengthen its .NET developer community. Big pluses of XGBoost would be GPU acceleration and quite optimized memory usage, IMHO.

Best regards
Florian

@mdabros
Owner

mdabros commented Feb 28, 2018

@fstandhartinger Adding a backend or interface to one of the leading gradient boosting libraries is definitely something I want to do. My initial choice would be LightGBM from Microsoft, and I have opened an issue about this on their GitHub page: microsoft/LightGBM#763. LightGBM also supports GPU, and should be very fast and memory efficient.

As the issue describes, I was hoping Microsoft and the authors would create a simple .NET wrapper and a NuGet package I could use from SharpLearning. However, it does not seem to be a priority for them at the moment. Like you, I also don't understand why Microsoft isn't providing proper interfacing for machine learning in .NET. I hope that they will eventually jump on the train, and that it won't be too late when they do.

Once the work on the neural net backend is completed, I will start to look at gradient boost again. I think gradient boost is still a very valid choice for many ML problems, even in this age of "deep learning".

Best regards
Mads

@fstandhartinger
Author

fstandhartinger commented Feb 28, 2018 via email

@mdabros
Owner

mdabros commented Mar 2, 2018

Hi Florian,

I think it's fine to apply some moderate pressure to Microsoft; in the end they did listen to the community regarding C# support for CNTK, so hopefully the same will happen with LightGBM.

I completely agree with you regarding gradient boosted trees; it is still a very important and useful technique for many problem types.

Looking forward to hearing some good news about the LightGBM C# wrapper :-)

Best regards
Mads

@fstandhartinger
Author

fstandhartinger commented Mar 4, 2018

Hi Mads,

I created a .NET Wrapper for LightGBM, based on the executables.

=> Here it is: LightGbmDotNet

I tried to hide/encapsulate the fact that it is based on the LightGBM executable as well as I could; for most use cases it shouldn't be a problem for the end user.
Actually it even has some advantages, e.g. no additional memory consumption in the calling process.

The executable is started invisibly, all meta output from LightGBM is handled (errors are thrown as exceptions, verbose info output is exposed as log text), and temp files are automatically removed after use.
It's pretty much thread safe, and multiple instances can run simultaneously, optionally sharing data sets for training/prediction (helpful for hyperparameter tuning runs).
IMHO performance and memory consumption are actually quite good.

There is no need to install, download or compile anything from LightGBM, because the native DLLs and executables are embedded in the LightGbmDotNet library.
Just download the project, build it, reference LightGbmDotNet.dll and instantiate the LightGbm class.

It's still missing a NuGet package, a unit test and an example, but downloading the project and getting it to run should be pretty straightforward.

Best regards

Florian

PS.: I designed the methods to take an IEnumerable<IEnumerable<double>> as training/prediction sets, because that enables the user to provide huge amounts of data in a memory-saving way (for example using yield return methods). To use these methods easily for training/prediction with SharpLearning (the F64Matrix and ObservationTargetSet classes), use the following extensions:


    public static class Extensions
    {
        public static IEnumerable<IEnumerable<double>> GetLightGbmTrainingRows(this ObservationTargetSet set)
        {
            for (var rowIndex = 0; rowIndex < set.Observations.RowCount; rowIndex++)
                yield return GetRow(set, rowIndex);
        }

        private static IEnumerable<double> GetRow(ObservationTargetSet set, int rowIndex)
        {
            // LightGBM training rows: target (label) first, then the feature values
            yield return set.Targets[rowIndex];
            for (var colIndex = 0; colIndex < set.Observations.ColumnCount; colIndex++)
                yield return set.Observations[rowIndex, colIndex];
        }

        public static IEnumerable<IEnumerable<double>> GetLightGbmPredictionRows(this F64Matrix matrix)
        {
            for (var rowIndex = 0; rowIndex < matrix.RowCount; rowIndex++)
                yield return GetRow(matrix, rowIndex);
        }

        private static IEnumerable<double> GetRow(F64Matrix matrix, int rowIndex)
        {
            for (var colIndex = 0; colIndex < matrix.ColumnCount; colIndex++)
                yield return matrix.At(rowIndex, colIndex);
        }
    }

@mdabros
Owner

mdabros commented Mar 5, 2018

Hi @fstandhartinger

Thanks for contributing with the LightGbmDotNet wrapper. I will definitely try it out, and make some experiments using it. I could imagine using it in conjunction with some hyper parameter tuning for my next Kaggle competition on structured data.

As I mention in microsoft/LightGBM#763, I would prefer a direct wrapping of the native dll using pinvoke or similar when LightGBM is to be included in SharpLearning. This would make for a more complete solution in my opinion.

But I think it is a step in the right direction, and for many developers, I think the solution you have made with the executable will be a great help to access LightGBM from .net.

Best regards
Mads

@fstandhartinger
Author

Hi @mdabros,

thank you!

Yes, I totally agree, a P/Invoke-based wrapper would be a lot better. And I also agree that the issue at MS should be left open; I feel the natural solution would be for them to offer bindings for .NET.

I'm considering upgrading the LightGbmDotNet project to call directly into the DLL functions at some point, maybe just changing the code under the hood and leaving the interface stable. But for now the executable-based version fulfills my needs, so I'll leave it as it is.

Actually I am using it for hyperparameter tuning with your super easy to use Optimizer classes.
In case you want to give it a try, the code looks like this:


var parameters = new ParameterBounds[]
{
	new ParameterBounds(min: 80, max: 500, transform: Transform.Linear), // iterations
	new ParameterBounds(min: 0.02, max: 0.2, transform: Transform.Logarithmic), // learning rate
	new ParameterBounds(min: 2, max: 15, transform: Transform.Linear), // maximumTreeDepth
	new ParameterBounds(min: 0.2, max: 0.9, transform: Transform.Linear), // featureFraction
	new ParameterBounds(min: 5, max: 1000, transform: Transform.Linear), // minDataInLeaf
};

// Define optimizer objective (function to minimize)
var englishCulture = CultureInfo.GetCultureInfo("en-US"); // LightGBM expects "." as decimal separator
Func<double[], OptimizerResult> minimize = p =>
{
	using (var lightGbm = new LightGbm(true)) // true => use GPU acceleration
	{
		var lightGbmParams = Parameters.DefaultForBinaryClassification;
		lightGbmParams.AddOrReplace(new Parameter("num_trees", ((int)p[0]).ToString()));
		lightGbmParams.AddOrReplace(new Parameter("learning_rate", p[1].ToString(englishCulture)));
		lightGbmParams.AddOrReplace(new Parameter("num_leaves", ((int)p[2]).ToString()));
		lightGbmParams.AddOrReplace(new Parameter("feature_fraction", p[3].ToString(englishCulture)));
		lightGbmParams.AddOrReplace(new Parameter("min_data_in_leaf", ((int)p[4]).ToString()));

		lightGbm.Train(trainingDataSet, lightGbmParams);

		// Predict on multiple disjoint test sets
		var candidateError = testRecordDataSets.Result.Sum(ds =>
		{
			var predictions = lightGbm.Predict(ds.DataSet);
			return CalculateError(ds.Records, predictions);
		});
		return new OptimizerResult(p, candidateError);
	}
};

// create random search optimizer
var optimizer = new RandomSearchOptimizer(parameters, iterations: totalVariantsToTest, runParallel: true);

// find best hyperparameters
var result = optimizer.OptimizeBest(minimize);
var best = result.ParameterSet;

Best regards

Florian

@mdabros
Owner

mdabros commented May 20, 2018

@fstandhartinger Just to let you know, I have added support for XGBoost via SharpLearning.XGBoost. It is an x64-only project and supports both CPU and GPU learning. It should be available via NuGet in the next few days. You can find a simple training time comparison in the pull request: #68

The implementation was possible using the package XGBoost.net, since this had relatively good bindings to the native xgboost lib.

At some point I hope a similar package for LightGBM becomes available, then I will add support for that also.

Best regards
Mads

@fstandhartinger
Author

fstandhartinger commented Jun 2, 2018

Hey Mads,

Sorry for replying so late; I have been on holiday and have just returned.

This is excellent news, and it looks like you have once again done a great job!

I will check it out once I find time for it. If the bindings allow transferring data in memory, this should greatly improve training/prediction performance compared to my somewhat clumsy file-based LightGbm wrapper.
I am now linking to your XGBoost implementation from my own LightGbm wrapper's GitHub page and recommending it on Quora.

Regarding LightGbm: I haven't checked the quality of the bindings here, but this project might be a good starting point for a LightGbm wrapper for SharpLearning: LightGBMSharp

Also I am wondering how well CatBoost from Yandex performs in comparison to XGBoost and LightGBM; they claim to be the leading gradient boosted decision tree library out there, but I haven't seen many third-party sources that say the same.

Best regards

Florian

@mdabros
Owner

mdabros commented Jun 6, 2018

Hi Florian,

No worries, hope you had a good holiday :-)

Thanks for linking to the implementation! There are definitely still some things that could be improved; a few of them are listed in the future work section of the Using SharpLearning.XGBoost wiki page.

I have also looked at the LightGBMSharp project, and I think it could be used as a relatively good interface for a SharpLearning.LightGBM project. The largest problem, as I see it, is the current NuGet package, which does not include the native binaries for LightGBM. But this should be relatively easy to add.

CatBoost also looks very interesting; I had a quick talk with one of the developers at last year's NIPS conference, and they definitely seem to have some nice "tricks" for speeding up the learning part of the algorithm. But as always, there is currently no C# interface for the library :-)

Best regards
Mads

2 participants