Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrating friedrich into linfa #1

Open
LukeMathWalker opened this issue Dec 12, 2019 · 19 comments
Open

Integrating friedrich into linfa #1

LukeMathWalker opened this issue Dec 12, 2019 · 19 comments

Comments

@LukeMathWalker
Copy link

I have started to review friedrich with the aim of integrating it into linfa.

I imagine the integration using a "shim" crate in the linfa repository, named linfa-gaussian-processes, which re-exports all or part of friedrich's public API. This ensures that linfa is in control of naming conventions, interfaces and whatever is required to give the overall linfa ecosystem an homogeneous feeling as well as allowing you to retain ownership of friedrich and evolving it as you see fit.

The alternative would be to move friedrich inside linfa as linfa-guassian-processes, but that's a much more substantial step hence I don't think that's what we want to go for (at least for now).

In terms of requirements, apart from some minor improvements here and there that I should be able to submit as PRs against friedrich over the Christmas holidays, I'd like to understand the following:

  • What was the rationale behind the choice of nalgebra over ndarray?
    I'd like to standardize on ndarray, at least in the initial phase, to avoid having overly generic interfaces (e.g. inputs and outputs generic parameters).
    Is it because you needed ndarray-linalg and that requires a non-Rust dependency?
  • I have found that there are few tests across the repository, either testing units of functionality or the overall algorithm. What are your thoughts on this?
  • Benchmarks: have you compared how your implementation fares against another reference implementation, both in terms of performance and correctness?
@nestordemeure
Copy link
Owner

I agree with the shim proposition (as it would keep friedrich small for people not wanting to depend on the full linfa and make your interface homogenous).
If you provide me with a clear description of the naming / interface convension or with a mock where all methods points to unimplemented!(), I can start working on it this Christmas.

I have actually tried both nalgebra and ndarray for my first prototypes but I did not found some operations I needed in ndarray-linalg (I cant remember which ones however).
I believe that nalgebra is a better fit for my needs (and I have contributed some methods to nalgebra, that should be available in the next version, and make noticeable performance improvements available) but it is fairly easy to take ndarray types as inputs (as the nalgebra types are mostly an implementation detail that should be transparent to the user).
I have not done it yet but I plan to add the trait implementation to use ndarray types as inputs/ouputs behind a feature flag (I can make it a priority as it would be a blocker for integration in linfa).

Both kind of benchmarks are high on my todo list, I will start that once I am fully happy with the internals and API (I am still wondering whether the current kernel trait and builder pattern can be improved). At the moment I have confirmed that it works as expected but not that it reproduce results from reference implementations.

I do think tests are needed but they are low priority for the moment (I want to confirm API stability before that).

(I should probably have a clean full and official TODO list somewhere)

@LukeMathWalker
Copy link
Author

I am actually happy to work on the shim myself, to get familiar with the crate. I'll spend some time on it over the Christmas break and I'll pull you in for a review when I have something decent ready 👍

My main issue with nalgebra at the moment would be having to list both nalgebra and ndarray as dependencies in linfa - I don't consider it a deal breaker though, certainly not at this stage.

@nestordemeure
Copy link
Owner

Great!

I am currently working on adding ndarray's ArrayBase<f64, Ix2> as an optional input type which should give you all the building blocks needed (currently fighting with the Data trait...).
By the way, is it preferable to use ArrayBase<f64, Ix2> or Array2<f64> (much simpler but less general).

I can understand that, nalgebra is a dependency that you clearly feel at compile time. Once I am fully happy with friedrich and if it is integrated into linfa I might retry ndarray-linalg but it is clearly low on my priority list.

@LukeMathWalker
Copy link
Author

I am currently working on adding ndarray's ArrayBase<f64, Ix2> as an optional input type which should give you all the building blocks needed (currently fighting with the Data trait...).
By the way, is it preferable to use ArrayBase<f64, Ix2> or Array2<f64> (much simpler but less general).

The first provides much better ergonomics to the user of the crate (they can pass in views, mutable views, owned arrays, etc.).

I can understand that, nalgebra is a dependency that you clearly feel at compile time. Once I am fully happy with friedrich and if it is integrated into linfa I might retry ndarray-linalg but it is clearly low on my priority list.

We are aligned on this, not top of the list for me either.

@nestordemeure
Copy link
Owner

I have added (but not tested) support for ndarray.

The next step on my list (which should be done within a week) is to reproduce the classical Mauna Loa example. As it has been implemented in sci-kit learn, it might give you a reference point to compare both implementations.

@LukeMathWalker
Copy link
Author

Ok, I have started to write the shim, which means I looked closer at the whole crate and I am struggling to understand some design choices.

Let's start from my biggest confusion source right now - what do you exactly mean as Prior?
Let's restrict the discussion to regression, for simplicity.
In all libraries dealing with GP and most theoretical treatments, the term prior is used for the kernel function itself. The kernel function you choose embeds the assumptions that you have made on the function family we believe the unknown function to belong to.
What do you mean by Prior in friedrich? It's also confusing that you can fit a prior.
Why did you feel the need to separate Prior and Kernel?

@LukeMathWalker
Copy link
Author

It's there any reference (book/article) I can look at for the training algorithm you are using?

@nestordemeure
Copy link
Owner

nestordemeure commented Jan 1, 2020

This article is quitte good at explaining what a gaussian process is and where does the prior come into play : gaussian processes are not so fancy (see the The Bayesian prior away from data section).

In short the kernel function gives a similarity metric between elements in the input space while the prior gives a default value that will be returned in the absence of information (in particular stationary kernels returns zero when the similarity to the training samples fall to 0 which is probably not what the user would expect).

The prior can be any regression model and, while having a good prior (close to the target function) makes the learning faster (it is useful for things such as transfer learning), I have seen some paper omiting it from the definition of the gaussian process (since it is trivial to add it on top of the equation).

Something like a linear prior can be included into the kernel but (to the best of my knowledge) it is not true for all type of priors while it is easy to include an explicit prior term in the equation used.

(don't hesitate to tell me if I am not clear)

@nestordemeure
Copy link
Owner

nestordemeure commented Jan 1, 2020

It's there any reference (book/article) I can look at for the training algorithm you are using?

The training of the prior or of the kernel ?

For the prior the training is up to each particular prior implementation (linear regression for a linear prior, etc).

For the kernel I believe I put some links in the sources. In short I use the ADAM optimizer (a form of gradient descent) to maximize the log-likelihood of the model given the training data (which has some nice regularising properties).
If the kernel function has a scaling parameter, I use ideas from a paper (I don't have the link on hand but it is in the source, I can put it here if needed) to find the optimal scaling and perform the gradient descent on the corresponding problem.

Does it answer your question ?

@nestordemeure
Copy link
Owner

Reading the documentation for scikit learn's implementation, it seem that they set the prior to either 0 or the mean of the data but do not allow for alternatives :

The prior mean is assumed to be constant and zero (for normalize_y=False) or the training data’s mean (for normalize_y=True). The prior’s covariance is specified by passing a kernel object.

(note that when I say prior, I always mean prior mean)

@LukeMathWalker
Copy link
Author

Ok, let me rephrase then.
You are saying that most projects work under the assumption that m(x) in

GP

is 0, simplifying the GP definition to

SimpleGP

while friedrich lets you specify m(x).
Now, both m(x) and k(x, x) are actually m(x, a) and k(x, x, b) where a and b are their respective hyper-parameters.
friedrich lets you choose if you want to fit those hyper-parameters on the training data or no.
Have I got it right?

@nestordemeure
Copy link
Owner

if p(x) is the prior and not the final distribution outputed by the trained model then yes, its a perfect description.

@LukeMathWalker
Copy link
Author

Perfect. I have started to draft the shim - you can follow my progresses here: https://github.com/LukeMathWalker/linfa/blob/gp/linfa-gaussian-processes/src/hyperparameters.rs

@nestordemeure
Copy link
Owner

nestordemeure commented Jan 2, 2020

Great! I gave a quick look to your code, here are two remarks :

I see that you put a ??? next to convergence_fraction (I should probably find a better name for this parameter).
It corresponds to the current stopping criteria for the gradient descent : given a variable x and its gradient gx, the algorithm stops if, for all variables, we have |gx| < convergence_fraction*|x|.

You set noise to 0.1 by default (with a comment saying that it correspond to 10% of the std).
However friedrich takes absolute values for the noise (and not relative values such that 0.1 would be converted to 10% of the std).
The source of the confusion might be that, if the user does not give a value for the noise, it will default to 10% of the std of the ouputs.

(on my side, I did not have as much time as expected but the example should be coming shortly)

@LukeMathWalker
Copy link
Author

I see that you put a ??? next to convergence_fraction (I should probably find a better name for this parameter).
It corresponds to the current stopping criteria for the gradient descent : given a variable x and its gradient gx, the algorithm stops if, for all variables, we have |gx| < convergence_fraction*|x|.

I figured that out later when doing the setters, but I didn't go back to edit the comment - I'll fix it up 👍

You set noise to 0.1 by default (with a comment saying that it correspond to 10% of the std).
However friedrich takes absolute values for the noise (and not relative values such that 0.1 would be converted to 10% of the std).
The source of the confusion might be that, if the user does not give a value for the noise, it will default to 10% of the std of the ouputs.

I did it on purpose and I know it's broken right now 😁
I don't like the way the default works because it's not tunable by the user - e.g. it's cumbersome for them to say that noise should be 20% of the std, because the value they pass in is used as an absolute value.
I have to figure out a way to make it work as percentage of std by default, still working on it - a PR against friedrich might come out of it.

@nestordemeure
Copy link
Owner

nestordemeure commented Jan 3, 2020

Making it a % of the std is not too hard (computing the std is cheap compared to the training) but I believe that using the std is only good to get an idea of the magnitude of the noise while, when you have an exact number given by an expert, its usually an absolute value (but it might be a moot point as the fit will change the amplitude of the noise anyway).

I am currious about your PR nevertheless

@LukeMathWalker
Copy link
Author

LukeMathWalker commented Jan 3, 2020

Yeah, when I say make it work I mean from an API point of view 👍

@wegreenall
Copy link

Have been following this with great interest. What is the current state of the integration of GPs into linfa?
I was thinking of building a Gaussian process library in Rust but would rather contribute to something bigger, if this is still happening.

@nestordemeure
Copy link
Owner

Hi @wegreenall ! Currently there is a shim in the Linfa repository but the work has not been integrated in the main branch and, to my knowledge, might be incomplete.

Your best bet is to check with @LukeMathWalker to determine what is missing for integration in Linfa. Don't hesitate to ask me any question to help with the integration work.

Longer term, Friedrich uses nalgebra while linfa focuses on ndarray, both of which are large dependencies, so it might be interesting to start working on a new version of Friedrich that would not be dependent on nalgebra (however the integration is quitte deep so that would reprsent a lot of work).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants