Hi @nepluno,

First, thanks a lot for sharing your code with the world!
I would like to use your GPU implementation of L-BFGS-B for my master thesis project. One of the applications will be a large-scale (L2-)norm approximation with non-negativity constraints on the optimization variables, i.e.
$$
\min_{x} \ \frac{1}{m} \|A x - b\|_2^2 \quad \text{subject to } x \geq 0
$$
where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$ and $x \in \mathbb{R}^n$.
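For reference, here is a minimal CPU-side sketch of the objective and gradient such an example would evaluate; the gradient is $\nabla f(x) = \frac{2}{m} A^\top (Ax - b)$. Eigen is used purely for illustration and is not part of this repository, and the actual example in the fork may be structured differently.

```cpp
#include <Eigen/Dense>

// Objective f(x) = (1/m) * ||A x - b||_2^2 and its gradient
// grad f(x) = (2/m) * A^T (A x - b).
// Illustrative sketch only; not the repository's callback interface.
double objective_and_gradient(const Eigen::MatrixXd& A,
                              const Eigen::VectorXd& b,
                              const Eigen::VectorXd& x,
                              Eigen::VectorXd& grad) {
  const double m = static_cast<double>(A.rows());
  const Eigen::VectorXd r = A * x - b;      // residual A x - b
  grad = (2.0 / m) * (A.transpose() * r);   // gradient
  return r.squaredNorm() / m;               // objective value
}
```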
To get familiar with the code I created an example for such a problem with $m=n=1000$. While everything works very well with the CPU version, the GPU version often jumps to very large or NaN values and terminates after only a few iterations without having reached a value close to a minimum. Removing the box constraints leads to both versions doing the job very well.
I observed similar results (very fast termination of the GPU version without getting close to the minimum) for other problems of this type.
My question is: Do you think this is a limitation of your GPU algorithm? (I thought this could be possible because I observed the problem only when box bounds are enabled, so maybe your adaptations for finding the GCP do not work for this problem.) Is it a known issue?
If not: do you have an idea what to change in my example to make things work? (See the examples folder in my fork.)
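One way to quantify "not close to a minimum" under box constraints is the projected-gradient residual $\|P(x - \nabla f(x)) - x\|_\infty$, where $P$ clips to the feasible box. The sketch below (again plain C++/Eigen, assumed names, not the solver's own convergence test) shows how that check could be applied to the point the GPU solver returns:

```cpp
#include <Eigen/Dense>

// Projected-gradient residual || P(x - grad) - x ||_inf for the box l <= x <= u.
// At an exact constrained minimizer this is zero; a large value means the
// returned point is far from satisfying the first-order optimality conditions.
// Illustrative check only, not the termination criterion used by the solver.
double projected_gradient_residual(const Eigen::VectorXd& x,
                                   const Eigen::VectorXd& grad,
                                   const Eigen::VectorXd& l,
                                   const Eigen::VectorXd& u) {
  const Eigen::VectorXd step = x - grad;
  const Eigen::VectorXd projected = step.cwiseMax(l).cwiseMin(u);  // clip to the box
  return (projected - x).lpNorm<Eigen::Infinity>();
}
```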
My CPU is an Intel(R) Core(TM) i7-7820X, and the GPU is a GeForce GTX 1080 Ti with CUDA 11.7. I get the following output:
```
problem size: 1000, constrained, CPU, single precision
Summary:
Total time: 981.328 ms
Iteration time: 7.85062 ms / iteration
residual f: 4.29088e-09
residual g: 0.000298678
residual x: 1.18977e-05
iterations: 125
info: 1
CPU result 8.92826e-07

problem size: 1000, constrained, GPU, single precision
Summary:
Total time: 17.6567 ms
Iteration time: 1.35821 ms / iteration
residual f: nan
residual g: 0
residual x: nan
iterations: 13
info: 0
GPU result 0.639691

problem size: 1000, unconstrained, CPU, single precision
Summary:
Total time: 1507.06 ms
Iteration time: 7.49779 ms / iteration
residual f: 5.06273e-09
residual g: 9.37757e-05
residual x: 0.00013773
iterations: 201
info: 1
CPU result 2.74595e-07

problem size: 1000, unconstrained, GPU, single precision
Summary:
Total time: 138.902 ms
Iteration time: 0.775991 ms / iteration
residual f: 7.42523e-09
residual g: 0.000114142
residual x: 0.00048579
iterations: 179
info: 0
GPU result 1.28257e-06

Process finished with exit code 0
```
It would be really nice if you could take a look at this. Thanks!
Indeed, we haven't tested this on many optimization problems besides the ones in the paper, so it's very possible that the simplified version of the GCP causes divergence for the solver. Why and where exactly it produces a NaN may need more investigation, though.
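For context, the exact generalized Cauchy point (GCP) search of the reference L-BFGS-B algorithm (Byrd, Lu, Nocedal, Zhu) walks through the breakpoints of the projected-gradient path in increasing order. The sketch below shows only the breakpoint computation, to clarify what a simplified GPU variant would be approximating; it is an illustrative sketch under assumed names, not the repository's implementation.

```cpp
#include <Eigen/Dense>
#include <limits>

// Breakpoints of the projected-gradient path used by the exact GCP search:
// variable i hits its bound at step length t_i along the direction d = -g.
// The full algorithm then examines the resulting path segments in increasing
// order of t_i. Illustrative sketch only.
Eigen::VectorXd gcp_breakpoints(const Eigen::VectorXd& x,
                                const Eigen::VectorXd& g,
                                const Eigen::VectorXd& l,
                                const Eigen::VectorXd& u) {
  const double inf = std::numeric_limits<double>::infinity();
  Eigen::VectorXd t(x.size());
  for (Eigen::Index i = 0; i < x.size(); ++i) {
    if (g(i) < 0.0 && u(i) < inf)
      t(i) = (x(i) - u(i)) / g(i);   // moving up along -g, hits the upper bound
    else if (g(i) > 0.0 && l(i) > -inf)
      t(i) = (x(i) - l(i)) / g(i);   // moving down along -g, hits the lower bound
    else
      t(i) = inf;                    // never hits a bound along -g
  }
  return t;
}
```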