Segmentation fault #68

ahabedsoltan · 2023-12-24T01:45:46Z

Hello,

I've been using FALKON, and it functions well on a GPU with a small number of centers. However, when I increase the number of centers to around 64,000, I encounter a "Segmentation fault" error, causing the program to terminate.

Here's the sequence of events during the run:

falkon starts...
MainProcess.MainThread::[Calcuating Preconditioner of size 128000]
Preconditioner will run on 1 GPUs
--MainProcess.MainThread::[Kernel]
--MainProcess.MainThread::[Kernel] complete in 80.324s
--MainProcess.MainThread::[Cholesky 1]
Using parallel POTRF
--MainProcess.MainThread::[Cholesky 1] complete in 47.152s
--MainProcess.MainThread::[Copy triangular]
--MainProcess.MainThread::[Copy triangular] complete in 17.284s
--MainProcess.MainThread::[LAUUM(CUDA)]
--MainProcess.MainThread::[LAUUM(CUDA)] complete in 56.486s
--MainProcess.MainThread::[Cholesky 2]
Segmentation fault

I'm curious about why this issue arises with a large number of centers. Previously, I've successfully used FALKON with up to 256,000 centers. It seems that with the current updated version, there are issues at this scale. Your assistance in resolving this matter would be greatly appreciated.
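
For a sense of scale, here is a back-of-the-envelope sketch of how large a dense M x M matrix (the shape of the preconditioner) gets at the center counts mentioned in this report. It is plain arithmetic, not a diagnosis of the crash, and the helper name `dense_matrix_gib` is hypothetical, not part of falkon.

```python
def dense_matrix_gib(m: int, bytes_per_element: int = 4) -> float:
    """Return the size in GiB of a dense m x m matrix.

    bytes_per_element is 4 for float32 and 8 for float64.
    """
    return m * m * bytes_per_element / 2**30


# Center counts taken from the report and the log above.
for m in (64_000, 128_000, 256_000):
    print(
        f"M={m:>7}: "
        f"float32 {dense_matrix_gib(m, 4):8.1f} GiB, "
        f"float64 {dense_matrix_gib(m, 8):8.1f} GiB"
    )
```

At these sizes the matrix may not fit on a single GPU, which would be consistent with the out-of-core code paths being involved.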

parthe · 2023-12-28T06:59:39Z

We have tried setting the following options, yet the seg-fault persists.
never_store_kernel=True
chol_force_ooc=True
no_single_kernel=False

parthe · 2023-12-29T06:38:52Z

Here is a minimal working example that reproduces the error reported by @ahabedsoltan:

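import falkon
import torch

# n: training points, N: test points, M: centers, d: dimension, bw: kernel bandwidth
n, N, M, d, bw = 200_000, 1000, 64_000, 1, 1.

# Classification accuracy (%) between true labels yt and predictions yh
accufun = lambda yt, yh: 100 * (yt.argmax(dim=1) == yh.argmax(dim=1)).sum() / yh.shape[0]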

options = falkon.FalkonOptions(debug=True,
    never_store_kernel=True,
    chol_force_ooc=True,
    no_single_kernel=False)
kernel_fn_flk = falkon.kernels.LaplacianKernel(sigma=bw, opt=options)
model = falkon.Falkon(kernel=kernel_fn_flk, penalty=1e-6, M=M, options=options,
                      error_every=1, error_fn=accufun, maxiter=1)

X = torch.randn(n, d)
Y = torch.randn(n, d)
x = torch.randn(N, d)
y = torch.randn(N, d)
model.fit(X, Y, Xts=x, Yts=y)
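
A segmentation fault kills the interpreter without a Python traceback. One way to see which Python call was active at the time is the standard-library `faulthandler` module. The sketch below assumes the reproduction above has been saved to a file (the name `repro.py` is hypothetical) and runs it in a child interpreter with `faulthandler` enabled; the helper `run_with_faulthandler` is illustrative, not part of falkon.

```python
import subprocess
import sys


def run_with_faulthandler(script_path: str) -> subprocess.CompletedProcess:
    """Run a script in a child interpreter with faulthandler enabled.

    With "-X faulthandler", Python dumps the active Python stack to stderr
    when the process receives a fatal signal such as SIGSEGV.
    """
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", script_path],
        capture_output=True,
        text=True,
    )


# Example usage (hypothetical file name):
#   result = run_with_faulthandler("repro.py")
#   print(result.returncode)  # on POSIX, a negative value -N means "killed by signal N"
#   print(result.stderr)      # contains the faulthandler stack dump, if any
```

The dump only shows Python frames, so it narrows the crash down to a falkon call rather than to a line of native code.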

Giodiro · 2024-01-01T15:13:31Z

Hi! I think it was a bug in a small helper function; it should be fixed on master! Are you comfortable trying it out from master, or would you prefer that I release a new version?

ahabedsoltan · 2024-01-01T20:45:33Z

Thank you. Could you please create a pre-built wheel for it? Each time I try to install it using the command 'pip install git+https://github.com/falkonml/falkon.git', the installation fails.

parthe · 2024-01-02T07:13:19Z

Reinstalling falkon as follows solved the issue. @Giodiro Thanks for the quick bug-fix!

pip uninstall falkon
pip install --no-build-isolation git+https://github.com/FalkonML/falkon.git
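
After reinstalling, it can help to confirm what is actually installed in the active environment. The sketch below uses the standard-library `importlib.metadata`; the helper name `installed_version` is hypothetical. It also prints the torch version, on the assumption that it is relevant here because `--no-build-isolation` builds against the packages already present in the environment.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as used by pip in the commands above.
print("falkon:", installed_version("falkon"))
print("torch:", installed_version("torch"))
```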

ahabedsoltan · 2024-01-08T22:43:38Z

Thank you, it resolved the issue.
