segment fault #3

Open
fengjiaxin opened this issue Feb 18, 2020 · 12 comments

@fengjiaxin

Hi, excuse me.
While training the model I ran into a new problem: a segmentation fault (core dumped).
Could you update the code? I have no idea how to solve this.

One more thing:
I think GLN/gln/mods/mol_gnn/gnn_family/utils.py could be updated by replacing cuda() with to(DEVICE).
Thanks a lot.
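
A minimal sketch of the device-agnostic pattern suggested above, in plain PyTorch. DEVICE is a hypothetical module-level constant used only for illustration; it is not taken from the GLN code base.

```python
import torch

# Hypothetical constant: pick the first visible GPU if there is one, else the CPU.
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# x.cuda() raises on a machine without CUDA; x.to(DEVICE) works in both cases.
x = torch.zeros(4, 8).to(DEVICE)

# Modules are moved the same way as tensors.
layer = torch.nn.Linear(8, 2).to(DEVICE)

out = layer(x)
print(out.shape, out.device)
```

With this pattern the same file runs on CPU-only machines, which also makes it easier to check whether a crash is GPU-related.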

@Hanjun-Dai
Owner

Could you please provide more details about the segfault?

@fengjiaxin
Author

./run_mf.sh: line 60: 9301 Segmentation fault (core dumped) python ../main.py -gm $gm -fp_degree 2 -neg_sample $neg_sample -att_type $att_type -gnn_out $gnn_out -tpl_enc $tpl_enc -subg_enc $subg_enc -latent_dim $msg_dim -bn $bn -gen_method $gen -retro_during_train $retro -neg_num $neg_size -embed_dim $embed_dim -readout_agg_type $graph_agg -act_func $act -act_last True -max_lv $lv -dropbox $dropbox -data_name $data_name -save_dir $save_dir -tpl_name $tpl_name -f_atoms $dropbox/cooked_$data_name/atom_list.txt -iters_per_val 3000 -gpu 1 -topk 50 -beam_size 50 -num_parts 1

Nothing else is printed. I don't think it is an environment issue.
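
When a segfault prints nothing else, Python's standard faulthandler module can usually dump the Python-level stack at the moment of the crash. The sketch below is an assumption about how one might gather more detail, not something taken from the GLN scripts: it crashes a child process on purpose with ctypes.string_at(0) as a stand-in for the real fault and captures what faulthandler writes to stderr.

```python
import subprocess
import sys

# Child program: enable faulthandler first, then crash deliberately.
# Reading a C string from address 0 stands in for the real segfault.
CRASHING_PROGRAM = """
import faulthandler
faulthandler.enable()
import ctypes
ctypes.string_at(0)
"""


def run_and_capture(program):
    """Run a Python snippet in a child process and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", program],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )


result = run_and_capture(CRASHING_PROGRAM)
print("exit code:", result.returncode)
print(result.stderr)  # contains the Python stack at the point of the crash
```

For the training run itself, calling faulthandler.enable() at the top of main.py, or launching it as python -X faulthandler ../main.py ..., should produce the same kind of stack dump.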

@Hanjun-Dai
Owner

Are you able to run the test with the existing model dumps?

@Hanjun-Dai
Owner

Also, did you modify the script?

I use -gpu 0 in the script. Please try with the vanilla code and see if that works.

@fengjiaxin
Author

I now get a different problem: a GPU CUDA error.
Were the ckpt files saved from a GPU?

@fengjiaxin
Author

I use -gpu 1. Did you save the model on GPU 0? Running the test script fails with the following error:

Traceback (most recent call last):
  File "main_test.py", line 139, in <module>
    model = RetroGLN(cmd_args.dropbox, local_args.model_for_test)
  File "/home/fengjiaxin/GLN/gln/test/model_inference.py", line 43, in __init__
    self.gln.load_state_dict(torch.load(model_file))
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 426, in load
    return _load(f, map_location, pickle_module, **pickle_load_args)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 613, in _load
    result = unpickler.load()
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 576, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 155, in default_restore_location
    result = fn(storage, location)
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/serialization.py", line 135, in _cuda_deserialize
    return storage_type(obj.size())
  File "/home/fengjiaxin/.conda/envs/my-rdkit-env/lib/python3.6/site-packages/torch/cuda/__init__.py", line 634, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

@Hanjun-Dai
Owner

Yes, it uses the GPU by default. Please always use -gpu 0 in your script.
If you want to run on a different GPU, please set CUDA_VISIBLE_DEVICES instead.
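
A sketch of what this suggestion amounts to. CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, so with CUDA_VISIBLE_DEVICES=1 the script can keep -gpu 0 while actually running on the second physical card. The helper name env_for_physical_gpu and the stand-in child command are hypothetical; in real use the child would be the python ../main.py ... -gpu 0 command from run_mf.sh.

```python
import os
import subprocess
import sys


def env_for_physical_gpu(gpu_index):
    """Copy the current environment and expose only one physical GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


# Stand-in child process: it only reports the variable it inherited.
child_cmd = [
    sys.executable,
    "-c",
    "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])",
]

seen = subprocess.run(
    child_cmd,
    env=env_for_physical_gpu(1),
    stdout=subprocess.PIPE,
    universal_newlines=True,
).stdout.strip()

print(seen)
```

The variable has to be set before the process initializes CUDA, which is why it is passed through the child's environment here rather than changed midway through a running script.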

@fengjiaxin
Author

Hi, I debugged the code. The error occurs at GLN/gln/graph_logic/soft_logic.py, line 29, in jagged_forward:
graph_embed = graph_enc(list)
There is no other information.
Could you briefly walk me through your code?
I cannot find the error.
Thanks.

@fengjiaxin
Author

Could you provide a Docker image? I think it would be useful.

@Hanjun-Dai
Owner

graph_enc is from another sub-package in this repo.

Can you first try without the GPU? Please take a look at this thread to see how to load a GPU dump onto the CPU:
https://discuss.pytorch.org/t/on-a-cpu-device-how-to-load-checkpoint-saved-on-gpu-device/349
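
A minimal sketch of loading a dump onto the CPU, assuming a reasonably recent PyTorch. torch.load accepts a map_location argument that remaps stored tensors to the given device. The small dictionary and the file name model.dump are stand-ins for a real GLN checkpoint; in the traceback above, the corresponding call is torch.load(model_file) in model_inference.py.

```python
import os
import tempfile

import torch

# Stand-in for a real checkpoint: a tiny state dict saved to a temp file.
state = {"weight": torch.ones(2, 3)}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.dump")
    torch.save(state, path)

    # map_location sends every stored tensor to the CPU,
    # whichever device it was saved from.
    loaded = torch.load(path, map_location=torch.device("cpu"))

print(loaded["weight"].device)
```

If the run works on CPU this way, the remaining problem is more likely tied to the GPU setup than to the model code.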

@fengjiaxin
Author

Hi, I debugged both the training file and the test file
and got the same error in each; it is not a CUDA error.
Could you briefly walk me through your code? Thanks.

@Hanjun-Dai
Owner

If the error is happening on that line, you may want to double-check
https://github.com/Hanjun-Dai/GLN/blob/master/gln/mods/mol_gnn/gnn_family/utils.py#L64

Note that different graph NN implementations override this function.
