Training on GPU fails (OSError: exception: access violation) #1717

Mtale · 2018-09-30T15:43:36Z

I have been trying to run LightGBM GPU for some time without success. The software works well on CPU.

I've compiled LightGBM both with MinGW, following the instructions here, and with MSVC, as instructed here. I used Visual Studio 2017 to compile.

Regardless of how I compile it, when I try to train a model from Python in Jupyter I get the same error message:

OSError: exception: access violation reading 0x0000000000000020

More details on the error are below. The traceback shown is for the sklearn API, but the error stays the same if I use the lightgbm.cv API.

When I try to run the CLI example from the MinGW compilation instructions, the program fails silently. I have the MSVC build installed right now and can't reproduce it, but if you refer to the image in the instructions, the silent failure occurs after the line Total bins 6143.

output of CLI example

I've run TensorFlow on this GPU before, so the GPU itself does work. However, GPU Caps Viewer fails silently while starting. This is probably related, but I wasn't able to find anything about that problem online.

I've tried suggestions in the following issues:

#836
#1028

Environment info

Operating System: Windows 10 Home

CPU Model: i7 7700
GPU model: Geforce GTX 1060 6Gb
CUDA: 9.0.176.2
OpenCL: 1.2

C++/Python/R version: Python 3.6

Error message

in model(features, test_features, encoding, n_folds)
125 eval_set = [(valid_features, valid_labels), (train_features, train_labels)],
126 eval_names = ['valid', 'train'], categorical_feature = cat_indices,
--> 127 early_stopping_rounds = 100, verbose = 200)
128
129 # Record the best iteration

C:\Anaconda3\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
697 verbose=verbose, feature_name=feature_name,
698 categorical_feature=categorical_feature,
--> 699 callbacks=callbacks)
700 return self
701

C:\Anaconda3\lib\site-packages\lightgbm\sklearn.py in fit(self, X, y, sample_weight, init_score, group, eval_set, eval_names, eval_sample_weight, eval_class_weight, eval_init_score, eval_group, eval_metric, early_stopping_rounds, verbose, feature_name, categorical_feature, callbacks)
500 verbose_eval=verbose, feature_name=feature_name,
501 categorical_feature=categorical_feature,
--> 502 callbacks=callbacks)
503
504 if evals_result:

C:\Anaconda3\lib\site-packages\lightgbm\engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
188 # construct booster
189 try:
--> 190 booster = Booster(params=params, train_set=train_set)
191 if is_valid_contain_train:
192 booster.set_train_data_name(train_data_name)

C:\Anaconda3\lib\site-packages\lightgbm\basic.py in __init__(self, params, train_set, model_file, silent)
1474 train_set.construct().handle,
1475 c_str(params_str),
-> 1476 ctypes.byref(self.handle)))
1477 # save reference to data
1478 self.train_set = train_set

OSError: exception: access violation reading 0x0000000000000020
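
For anyone trying to reproduce this outside a larger notebook, a minimal script along the following lines may help isolate the failure. It is only a sketch: the toy data and the helper names (`make_toy_data`, `gpu_params`, `run_repro`) are made up for illustration and are not from the original report, while `device` is LightGBM's documented parameter for selecting the GPU learner. The traceback above fails while the Booster is being constructed, so a few rounds on toy data should reach the same code path, assuming the crash does not depend on the data.

```python
import random


def make_toy_data(n_rows=200, n_cols=5, seed=0):
    """Build a small synthetic binary classification set (illustrative only)."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]
    # Label is 1 when the first two features sum to more than 1.0.
    y = [int(row[0] + row[1] > 1.0) for row in X]
    return X, y


def gpu_params():
    """Parameters asking LightGBM for its GPU (OpenCL) tree learner."""
    return {"objective": "binary", "device": "gpu", "verbose": 1}


def run_repro(num_boost_round=10):
    """Train on the toy data; requires numpy and a GPU-enabled lightgbm build."""
    # Imported here so the helpers above stay usable without lightgbm.
    import numpy as np
    import lightgbm as lgb

    X, y = make_toy_data()
    train_set = lgb.Dataset(np.asarray(X), label=np.asarray(y))
    # In the report above, the access violation is raised while the
    # Booster is constructed inside this call.
    return lgb.train(gpu_params(), train_set, num_boost_round=num_boost_round)
```

Calling `run_repro()` from a plain `python` process rather than from Jupyter removes the notebook as a variable.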

funkindy · 2018-10-11T13:02:01Z

Exactly the same problem here with GPU version built with MinGW:

Windows 10,
CMake 3.8,
Boost 1.63.0
CUDA: 9.2

The build goes fine, and the CLI also works okay on GPU with the test examples, but the Python wrapper throws the same kind of error (OSError: exception: access violation writing 0xFFFFFFFF95A80000) on Booster init.

I used this command to install the Python wrapper: python setup.py install --precompile

LinYungLun · 2018-10-15T06:54:09Z

I got a similar error on my Windows 10 machine too, although the GPU works okay with the test examples.

Windows 10,
CMake 3.11,
Boost 1.63.0
CUDA: 9.2

Traceback (most recent call last):
File "", line 95, in
verbose_eval=300
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\engine.py", line 192, in train
booster = Booster(params=params, train_set=train_set)
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 1487, in __init__
train_set.construct().handle,
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 985, in construct
categorical_feature=self.categorical_feature, params=self.params)
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 771, in _lazy_init
self.__init_from_np2d(data, params_str, ref_dataset)
File "C://Users//melo1//Anaconda3//envs//GPU//Lib//site-packages\lightgbm\basic.py", line 835, in __init_from_np2d
ctypes.byref(self.handle)))
OSError: exception: access violation writing 0xFFFFFFFF94A00000

guolinke · 2018-10-15T06:58:01Z

ping @huanzhang12

huanzhang12 · 2018-10-15T07:05:23Z

@funkindy @marcualin7412 Could you check whether GPU Caps Viewer works on your system?
You can download it here: http://www.ozone3d.net/gpu_caps_viewer/
See if you can view OpenCL devices using it.

For debugging this kind of issue I suggest using the CLI version of LightGBM instead of Python. Could you please run LightGBM using the CLI (command line interface) and get a full output log? This will be really helpful for me to investigate this issue.

LinYungLun · 2018-10-15T07:13:01Z

Thanks for your quick response.
This is my OpenCL page:

LinYungLun · 2018-10-15T07:22:43Z

@huanzhang12
Using the CLI works fine, so it seems there is something wrong in the Spyder IDE. Thanks for your time.

funkindy · 2018-10-15T07:23:21Z

@huanzhang12 attached is the output of this command: "../../lightgbm.exe" config=train.conf data=binary.train valid=binary.test objective=binary device=gpu

The OpenCL page of GPU Caps Viewer looks okay, like @marcualin7412's.

GPU_testing.txt

The log looks good, so the issue may be specific to the Python interface.
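
To capture a full CLI log like the one attached here, including the exit code of a run that dies silently, the command can be wrapped in a few lines of Python. This is a rough sketch: `build_cli_command` and `run_and_log` are hypothetical helper names, and the executable path and key=value arguments are simply the ones from the command quoted above.

```python
import subprocess
from pathlib import Path


def build_cli_command(exe, config="train.conf", **overrides):
    """Assemble a LightGBM CLI call as a list of key=value arguments."""
    cmd = [exe, f"config={config}"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


def run_and_log(cmd, log_path):
    """Run cmd, write combined stdout/stderr to log_path, return the exit code."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep errors in the same log as normal output
        text=True,
    )
    Path(log_path).write_text(result.stdout)
    return result.returncode


# The command from this comment, expressed with the helper above.
cmd = build_cli_command(
    "../../lightgbm.exe",
    data="binary.train",
    valid="binary.test",
    objective="binary",
    device="gpu",
)
```

Running `run_and_log(cmd, "GPU_testing.txt")` from the example directory then leaves both the log and a return code to report; a nonzero code with a truncated log would point at a crash inside the executable rather than in the Python wrapper.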

StrikerRUS · 2018-11-30T15:07:53Z

ping @huanzhang12

StrikerRUS · 2019-01-16T19:40:46Z

gently ping @huanzhang12

Mtale · 2019-02-09T13:54:26Z

Any information on this issue yet?

xins-yao · 2019-03-02T12:41:43Z

@Mtale
I had the same issue as you before. After I disabled my other Intel GPU, it works fine now. Maybe you could try it.
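
If disabling the Intel GPU is not an option, it may be worth trying to point LightGBM at the NVIDIA OpenCL platform explicitly through its `gpu_platform_id` and `gpu_device_id` parameters. Nobody in this thread has confirmed that this avoids the access violation, so treat the following as an untested sketch. The helper names and the example platform names are assumptions for illustration, and it also assumes that the platform index LightGBM uses matches the order in which your system lists its OpenCL platforms.

```python
def pick_platform_id(platform_names, wanted="nvidia"):
    """Return the index of the first platform whose name contains `wanted`."""
    for index, name in enumerate(platform_names):
        if wanted.lower() in name.lower():
            return index
    raise ValueError(f"no OpenCL platform matching {wanted!r}")


def gpu_params_for(platform_id, device_id=0):
    """LightGBM parameters that pin the OpenCL platform and device."""
    return {
        "device": "gpu",
        "gpu_platform_id": platform_id,
        "gpu_device_id": device_id,
    }


# Example platform list (made up; read the real one from your own system,
# for instance from the OpenCL page of GPU Caps Viewer).
platforms = ["Intel(R) OpenCL", "NVIDIA CUDA"]
params = gpu_params_for(pick_platform_id(platforms))
print(params)
```

The resulting dictionary can be merged into the parameters passed to `lightgbm.train` or to the sklearn wrapper.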

pipidog · 2019-03-29T18:32:52Z

Any solution on this issue?

I compiled LightGBM (using VS 2017) on my Windows 10 machine with a 1060 6GB GPU. It runs well on CPU but always gives this error message:

OSError: exception: access violation reading 0x0000000000000020

when using GPU. I checked all the discussion regarding this issue but found no useful information so far. Any ideas?

Crazy-LittleBoy · 2019-08-02T14:22:16Z

I have a similar problem:
OSError: exception: access violation reading 0x000001E028ADE000

chenjunboBUPT · 2019-11-08T01:34:50Z

When I use it with the basic env, it works well.
But when I want to use it with the pytorch env, I encounter this error:
OSError: exception: access violation reading 0x000001E028ADE000

guolinke · 2020-08-06T00:18:57Z

It seems the problem mainly happens on Windows, and one comment says that disabling the Intel GPU can help:

I had the same issue as you before. After I disabled my other Intel GPU, it works fine now. Maybe you could try it.

You can try this solution. Gentle ping @huanzhang12 for a better workaround.

We have a new CUDA implementation (#3160), which does not depend on OpenCL, and it should fix this.
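
For readers on a build that includes the CUDA implementation mentioned above, switching learners is a one-parameter change (`device_type` set to `cuda`). The sketch below tries several device types in order and reports which one trained. It assumes a LightGBM build compiled with CUDA support, which may not be available for the Windows setups in this thread; `train_with_fallback` and the `train_fn` callback are hypothetical names for illustration, not LightGBM API.

```python
def train_with_fallback(train_fn, base_params, device_order=("cuda", "gpu", "cpu")):
    """Call train_fn(params) once per device type until one succeeds.

    train_fn is supplied by the caller, for example
    lambda p: lightgbm.train(p, train_set, num_boost_round=100).
    Returns (device_type, model) for the first device that works.
    """
    errors = {}
    for device in device_order:
        # Copy the base parameters and override only the device type.
        params = dict(base_params, device_type=device)
        try:
            return device, train_fn(params)
        except Exception as exc:  # e.g. OSError (access violation) or a LightGBM error
            errors[device] = exc
    raise RuntimeError(f"training failed on every device: {errors}")
```

Continuing in the same process after an access violation is not guaranteed to be safe, so in practice each attempt might be better run in a fresh process.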

Numan100 · 2021-06-23T03:35:25Z

Any solution on this issue?

I compiled LightGBM (using VS 2017) on my windows 10 machine with 1060 6GB GPU. It runs well in CPU and always got error message:

OSError: exception: access violation reading 0x0000000000000020

when using GPU. Checked all discussion regarding this issue but no useful information so far. Any idea?

I have a similar OSError when reading 0x0000000000000038.
Have you found any solution to this bug? @pipidog @guolinke

StrikerRUS assigned huanzhang12 Sep 30, 2018

guolinke added the bug label Aug 1, 2019

StrikerRUS mentioned this issue May 11, 2020

v3.0.0rc1 #3071

Merged

guolinke mentioned this issue Aug 10, 2020

[WIP] next release (3.0.0) #3293

Closed
