Process is eventually killed #12

Open
muneebaadil opened this issue Aug 30, 2017 · 1 comment

Comments

@muneebaadil

Iterations gradually start taking more time, and eventually the process is killed automatically. Any idea why this might be happening? The process usually terminates somewhere around the 11th epoch. The output of the program is included below:

Creating data loader...	
loading data...	
	Initializing data loader for train set...	
	Initializing data loader for val set...	
Train start	
[Iter: 0.1k / lr: 1.00e-4] 	Time: 59.98 (Data: 49.06) 	Err: 33.426155	
[Iter: 0.2k / lr: 1.00e-4] 	Time: 58.74 (Data: 50.61) 	Err: 17.742222	
[Iter: 0.3k / lr: 1.00e-4] 	Time: 57.86 (Data: 49.84) 	Err: 15.524932	
[Iter: 0.4k / lr: 1.00e-4] 	Time: 58.06 (Data: 49.76) 	Err: 13.206391	
[Iter: 0.5k / lr: 1.00e-4] 	Time: 59.22 (Data: 50.93) 	Err: 11.778802	
[Iter: 0.6k / lr: 1.00e-4] 	Time: 58.67 (Data: 49.99) 	Err: 11.223848	
[Iter: 0.7k / lr: 1.00e-4] 	Time: 57.71 (Data: 49.69) 	Err: 10.476344	
[Iter: 0.8k / lr: 1.00e-4] 	Time: 58.65 (Data: 50.37) 	Err: 10.352445	
[Iter: 0.9k / lr: 1.00e-4] 	Time: 58.19 (Data: 49.89) 	Err: 9.883691	
[Iter: 1.0k / lr: 1.00e-4] 	Time: 58.66 (Data: 50.56) 	Err: 9.862550	
[Epoch 1 (iter/epoch: 1000)] Test time: 23.26	
(scale 4) Average PSNR: 27.0086 (Highest ever: 27.0086 at epoch = 1)	
	
[Iter: 1.1k / lr: 1.00e-4] 	Time: 59.21 (Data: 50.59) 	Err: 9.435181	
[Iter: 1.2k / lr: 1.00e-4] 	Time: 57.73 (Data: 49.44) 	Err: 9.251034	
[Iter: 1.3k / lr: 1.00e-4] 	Time: 58.37 (Data: 50.18) 	Err: 9.082382	
[Iter: 1.4k / lr: 1.00e-4] 	Time: 58.64 (Data: 50.27) 	Err: 8.877250	
[Iter: 1.5k / lr: 1.00e-4] 	Time: 57.33 (Data: 49.34) 	Err: 8.786548	
[Iter: 1.6k / lr: 1.00e-4] 	Time: 58.76 (Data: 50.61) 	Err: 8.352576	
[Iter: 1.7k / lr: 1.00e-4] 	Time: 57.96 (Data: 49.72) 	Err: 8.513297	
[Iter: 1.8k / lr: 1.00e-4] 	Time: 57.93 (Data: 49.91) 	Err: 8.242201	
[Iter: 1.9k / lr: 1.00e-4] 	Time: 57.72 (Data: 49.67) 	Err: 8.255659	
[Iter: 2.0k / lr: 1.00e-4] 	Time: 58.26 (Data: 50.18) 	Err: 8.005295	
[Epoch 2 (iter/epoch: 1000)] Test time: 23.43	
(scale 4) Average PSNR: 27.5600 (Highest ever: 27.5600 at epoch = 2)	
	
[Iter: 2.1k / lr: 1.00e-4] 	Time: 59.21 (Data: 50.80) 	Err: 8.167938	
[Iter: 2.2k / lr: 1.00e-4] 	Time: 58.26 (Data: 49.71) 	Err: 8.067993	
[Iter: 2.3k / lr: 1.00e-4] 	Time: 58.07 (Data: 49.75) 	Err: 8.067032	
[Iter: 2.4k / lr: 1.00e-4] 	Time: 57.78 (Data: 49.53) 	Err: 8.029886	
[Iter: 2.5k / lr: 1.00e-4] 	Time: 58.21 (Data: 50.22) 	Err: 8.389706	
[Iter: 2.6k / lr: 1.00e-4] 	Time: 58.52 (Data: 50.37) 	Err: 8.040120	
[Iter: 2.7k / lr: 1.00e-4] 	Time: 57.95 (Data: 49.76) 	Err: 7.712390	
[Iter: 2.8k / lr: 1.00e-4] 	Time: 58.30 (Data: 50.30) 	Err: 8.082315	
[Iter: 2.9k / lr: 1.00e-4] 	Time: 58.21 (Data: 50.17) 	Err: 7.580434	
[Iter: 3.0k / lr: 1.00e-4] 	Time: 58.14 (Data: 49.73) 	Err: 7.440087	
[Epoch 3 (iter/epoch: 1000)] Test time: 23.53	
(scale 4) Average PSNR: 27.8171 (Highest ever: 27.8171 at epoch = 3)	
	
[Iter: 3.1k / lr: 1.00e-4] 	Time: 59.66 (Data: 50.99) 	Err: 7.984920	
[Iter: 3.2k / lr: 1.00e-4] 	Time: 57.77 (Data: 49.08) 	Err: 7.743653	
[Iter: 3.3k / lr: 1.00e-4] 	Time: 58.84 (Data: 50.73) 	Err: 7.766374	
[Iter: 3.4k / lr: 1.00e-4] 	Time: 58.35 (Data: 49.99) 	Err: 7.688996	
[Iter: 3.5k / lr: 1.00e-4] 	Time: 59.49 (Data: 50.64) 	Err: 7.697878	
[Iter: 3.6k / lr: 1.00e-4] 	Time: 58.05 (Data: 49.94) 	Err: 7.589858	
[Iter: 3.7k / lr: 1.00e-4] 	Time: 57.57 (Data: 49.55) 	Err: 7.795816	
[Iter: 3.8k / lr: 1.00e-4] 	Time: 57.88 (Data: 49.92) 	Err: 7.386197	
[Iter: 3.9k / lr: 1.00e-4] 	Time: 58.64 (Data: 50.19) 	Err: 7.491277	
[Iter: 4.0k / lr: 1.00e-4] 	Time: 57.93 (Data: 49.67) 	Err: 7.493202	
[Epoch 4 (iter/epoch: 1000)] Test time: 23.77	
(scale 4) Average PSNR: 27.9856 (Highest ever: 27.9856 at epoch = 4)	
	
[Iter: 4.1k / lr: 1.00e-4] 	Time: 58.35 (Data: 50.14) 	Err: 7.274099	
[Iter: 4.2k / lr: 1.00e-4] 	Time: 59.01 (Data: 50.77) 	Err: 7.444119	
[Iter: 4.3k / lr: 1.00e-4] 	Time: 57.99 (Data: 49.68) 	Err: 7.388358	
[Iter: 4.4k / lr: 1.00e-4] 	Time: 58.33 (Data: 50.28) 	Err: 7.441720	
[Iter: 4.5k / lr: 1.00e-4] 	Time: 58.04 (Data: 49.80) 	Err: 7.664649	
[Iter: 4.6k / lr: 1.00e-4] 	Time: 57.90 (Data: 49.77) 	Err: 7.053351	
[Iter: 4.7k / lr: 1.00e-4] 	Time: 58.90 (Data: 50.54) 	Err: 7.477736	
[Iter: 4.8k / lr: 1.00e-4] 	Time: 57.74 (Data: 49.68) 	Err: 7.428355	
[Iter: 4.9k / lr: 1.00e-4] 	Time: 58.28 (Data: 50.14) 	Err: 7.505870	
[Iter: 5.0k / lr: 1.00e-4] 	Time: 58.14 (Data: 49.78) 	Err: 7.511303	
[Epoch 5 (iter/epoch: 1000)] Test time: 23.88	
(scale 4) Average PSNR: 28.1898 (Highest ever: 28.1898 at epoch = 5)	
	
[Iter: 5.1k / lr: 1.00e-4] 	Time: 59.04 (Data: 50.48) 	Err: 7.640118	
[Iter: 5.2k / lr: 1.00e-4] 	Time: 57.66 (Data: 49.61) 	Err: 7.343382	
[Iter: 5.3k / lr: 1.00e-4] 	Time: 57.88 (Data: 49.75) 	Err: 7.346845	
[Iter: 5.4k / lr: 1.00e-4] 	Time: 57.71 (Data: 49.41) 	Err: 7.537878	
[Iter: 5.5k / lr: 1.00e-4] 	Time: 57.82 (Data: 49.71) 	Err: 7.401683	
[Iter: 5.6k / lr: 1.00e-4] 	Time: 58.17 (Data: 50.21) 	Err: 7.207794	
[Iter: 5.7k / lr: 1.00e-4] 	Time: 57.57 (Data: 49.46) 	Err: 7.378472	
[Iter: 5.8k / lr: 1.00e-4] 	Time: 57.74 (Data: 49.43) 	Err: 7.457488	
[Iter: 5.9k / lr: 1.00e-4] 	Time: 58.11 (Data: 50.07) 	Err: 7.313913	
[Iter: 6.0k / lr: 1.00e-4] 	Time: 58.28 (Data: 50.04) 	Err: 7.198203	
[Epoch 6 (iter/epoch: 1000)] Test time: 23.48	
(scale 4) Average PSNR: 28.2877 (Highest ever: 28.2877 at epoch = 6)	
	
[Iter: 6.1k / lr: 1.00e-4] 	Time: 59.09 (Data: 50.69) 	Err: 7.453156	
[Iter: 6.2k / lr: 1.00e-4] 	Time: 57.47 (Data: 49.34) 	Err: 7.176994	
[Iter: 6.3k / lr: 1.00e-4] 	Time: 58.58 (Data: 50.55) 	Err: 7.083494	
[Iter: 6.4k / lr: 1.00e-4] 	Time: 58.84 (Data: 50.34) 	Err: 7.062294	
[Iter: 6.5k / lr: 1.00e-4] 	Time: 57.90 (Data: 49.71) 	Err: 7.171743	
[Iter: 6.6k / lr: 1.00e-4] 	Time: 57.92 (Data: 49.89) 	Err: 7.110619	
[Iter: 6.7k / lr: 1.00e-4] 	Time: 57.22 (Data: 49.05) 	Err: 7.182309	
[Iter: 6.8k / lr: 1.00e-4] 	Time: 58.61 (Data: 50.15) 	Err: 7.099355	
[Iter: 6.9k / lr: 1.00e-4] 	Time: 58.09 (Data: 49.89) 	Err: 7.436261	
[Iter: 7.0k / lr: 1.00e-4] 	Time: 58.57 (Data: 50.47) 	Err: 7.007415	
[Epoch 7 (iter/epoch: 1000)] Test time: 23.65	
(scale 4) Average PSNR: 28.3610 (Highest ever: 28.3610 at epoch = 7)	
	
[Iter: 7.1k / lr: 1.00e-4] 	Time: 59.49 (Data: 50.68) 	Err: 7.013370	
[Iter: 7.2k / lr: 1.00e-4] 	Time: 58.05 (Data: 49.62) 	Err: 7.256025	
[Iter: 7.3k / lr: 1.00e-4] 	Time: 57.96 (Data: 49.75) 	Err: 7.065247	
[Iter: 7.4k / lr: 1.00e-4] 	Time: 58.43 (Data: 50.19) 	Err: 7.236178	
[Iter: 7.5k / lr: 1.00e-4] 	Time: 57.86 (Data: 49.75) 	Err: 7.288727	
[Iter: 7.6k / lr: 1.00e-4] 	Time: 58.21 (Data: 50.14) 	Err: 6.995226	
[Iter: 7.7k / lr: 1.00e-4] 	Time: 58.45 (Data: 50.10) 	Err: 7.003108	
[Iter: 7.8k / lr: 1.00e-4] 	Time: 58.34 (Data: 50.10) 	Err: 6.968335	
[Iter: 7.9k / lr: 1.00e-4] 	Time: 58.06 (Data: 49.99) 	Err: 6.904951	
[Iter: 8.0k / lr: 1.00e-4] 	Time: 58.26 (Data: 49.67) 	Err: 7.103047	
[Epoch 8 (iter/epoch: 1000)] Test time: 23.62	
(scale 4) Average PSNR: 28.4066 (Highest ever: 28.4066 at epoch = 8)	
	
[Iter: 8.1k / lr: 1.00e-4] 	Time: 59.04 (Data: 50.64) 	Err: 7.247079	
[Iter: 8.2k / lr: 1.00e-4] 	Time: 58.20 (Data: 49.88) 	Err: 7.180099	
[Iter: 8.3k / lr: 1.00e-4] 	Time: 58.24 (Data: 49.67) 	Err: 6.996955	
[Iter: 8.4k / lr: 1.00e-4] 	Time: 58.86 (Data: 50.49) 	Err: 7.076885	
[Iter: 8.5k / lr: 1.00e-4] 	Time: 58.23 (Data: 50.02) 	Err: 7.028236	
[Iter: 8.6k / lr: 1.00e-4] 	Time: 57.68 (Data: 49.56) 	Err: 7.116949	
[Iter: 8.7k / lr: 1.00e-4] 	Time: 57.59 (Data: 49.54) 	Err: 7.182252	
[Iter: 8.8k / lr: 1.00e-4] 	Time: 57.99 (Data: 49.85) 	Err: 7.182454	
[Iter: 8.9k / lr: 1.00e-4] 	Time: 57.79 (Data: 49.45) 	Err: 7.205822	
[Iter: 9.0k / lr: 1.00e-4] 	Time: 57.88 (Data: 49.67) 	Err: 7.369923	
[Epoch 9 (iter/epoch: 1000)] Test time: 23.59	
(scale 4) Average PSNR: 28.5147 (Highest ever: 28.5147 at epoch = 9)	
	
[Iter: 9.1k / lr: 1.00e-4] 	Time: 58.90 (Data: 50.38) 	Err: 7.108896	
[Iter: 9.2k / lr: 1.00e-4] 	Time: 58.35 (Data: 50.30) 	Err: 7.120156	
[Iter: 9.3k / lr: 1.00e-4] 	Time: 57.81 (Data: 49.63) 	Err: 6.699856	
[Iter: 9.4k / lr: 1.00e-4] 	Time: 57.45 (Data: 49.53) 	Err: 7.055263	
[Iter: 9.5k / lr: 1.00e-4] 	Time: 58.29 (Data: 50.09) 	Err: 7.094778	
[Iter: 9.6k / lr: 1.00e-4] 	Time: 58.64 (Data: 50.43) 	Err: 7.121905	
[Iter: 9.7k / lr: 1.00e-4] 	Time: 59.03 (Data: 50.64) 	Err: 6.954122	
[Iter: 9.8k / lr: 1.00e-4] 	Time: 64.40 (Data: 56.23) 	Err: 7.131248	
Warning: Error is too large! Skip this batch. (Err: 15.775752)	
[Iter: 9.9k / lr: 1.00e-4] 	Time: 69.52 (Data: 61.19) 	Err: 7.178918	
[Iter: 10.0k / lr: 1.00e-4] 	Time: 72.00 (Data: 63.93) 	Err: 6.885784	
[Epoch 10 (iter/epoch: 1000)] Test time: 23.70	
(scale 4) Average PSNR: 28.5899 (Highest ever: 28.5899 at epoch = 10)	
	
Warning: Error is too large! Skip this batch. (Err: 14.525774)	
[Iter: 10.1k / lr: 1.00e-4] 	Time: 69.08 (Data: 60.68) 	Err: 6.986968	
[Iter: 10.2k / lr: 1.00e-4] 	Time: 72.27 (Data: 64.09) 	Err: 7.137931	
[Iter: 10.3k / lr: 1.00e-4] 	Time: 72.05 (Data: 63.93) 	Err: 7.256474	
[Iter: 10.4k / lr: 1.00e-4] 	Time: 74.87 (Data: 66.77) 	Err: 6.899007	
[Iter: 10.5k / lr: 1.00e-4] 	Time: 73.01 (Data: 64.82) 	Err: 6.969793	
[Iter: 10.6k / lr: 1.00e-4] 	Time: 75.41 (Data: 67.26) 	Err: 6.985296	
[Iter: 10.7k / lr: 1.00e-4] 	Time: 99.23 (Data: 91.19) 	Err: 7.097432	
[Iter: 10.8k / lr: 1.00e-4] 	Time: 121.62 (Data: 113.34) 	Err: 6.992342	
[Iter: 10.9k / lr: 1.00e-4] 	Time: 142.16 (Data: 133.66) 	Err: 7.034576	
[Iter: 11.0k / lr: 1.00e-4] 	Time: 169.72 (Data: 161.17) 	Err: 7.181707	
[Epoch 11 (iter/epoch: 1000)] Test time: 24.06	
(scale 4) Average PSNR: 28.6195 (Highest ever: 28.6195 at epoch = 11)	
	
[Iter: 11.1k / lr: 1.00e-4] 	Time: 219.76 (Data: 211.10) 	Err: 7.121071	
[Iter: 11.2k / lr: 1.00e-4] 	Time: 231.93 (Data: 223.22) 	Err: 6.955821	
[Iter: 11.3k / lr: 1.00e-4] 	Time: 233.64 (Data: 224.90) 	Err: 6.994724	
Killed
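
In the log above, the per-100-iteration time stays near 58 s through iteration 9.7k and then climbs past 230 s, with almost all of the increase in the `Data` column. A bare `Killed` line is what a shell typically prints when a process receives SIGKILL, for example from the Linux out-of-memory killer; that cause is an assumption here and is not confirmed anywhere in this thread. To pinpoint where the slowdown begins, the log can be parsed with a small script. The following is a minimal sketch in Python, not part of this project: the names `parse_log` and `first_slowdown` and the 10% threshold are hypothetical choices made for illustration.

```python
import re
from statistics import median

# Matches training lines such as:
# [Iter: 9.8k / lr: 1.00e-4]  Time: 64.40 (Data: 56.23)  Err: 7.131248
LINE_RE = re.compile(
    r"\[Iter:\s*([\d.]+)k\s*/[^\]]*\]\s*Time:\s*([\d.]+)\s*\(Data:\s*([\d.]+)\)"
)

def parse_log(text):
    """Return a list of (iter_in_thousands, total_time, data_time) tuples."""
    rows = []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:  # warning and epoch-summary lines do not match and are skipped
            rows.append(tuple(float(g) for g in m.groups()))
    return rows

def first_slowdown(rows, baseline_n=10, factor=1.1):
    """Return the first row whose total time exceeds `factor` times the
    median total time of the first `baseline_n` rows, or None."""
    if len(rows) < baseline_n:
        return None
    baseline = median(total for _, total, _ in rows[:baseline_n])
    for row in rows[baseline_n:]:
        if row[1] > factor * baseline:
            return row
    return None

# A short excerpt copied from the log above.
sample = """\
[Iter: 9.5k / lr: 1.00e-4]  Time: 58.29 (Data: 50.09)  Err: 7.094778
[Iter: 9.6k / lr: 1.00e-4]  Time: 58.64 (Data: 50.43)  Err: 7.121905
[Iter: 9.7k / lr: 1.00e-4]  Time: 59.03 (Data: 50.64)  Err: 6.954122
[Iter: 9.8k / lr: 1.00e-4]  Time: 64.40 (Data: 56.23)  Err: 7.131248
Warning: Error is too large! Skip this batch. (Err: 15.775752)
[Iter: 9.9k / lr: 1.00e-4]  Time: 69.52 (Data: 61.19)  Err: 7.178918
"""

rows = parse_log(sample)
hit = first_slowdown(rows, baseline_n=3)
if hit is not None:
    it, total, data = hit
    print(f"slowdown at iter {it}k: total {total}s, data {data}s "
          f"({100 * data / total:.0f}% of the time is data loading)")
```

Running the same functions over the full log, while watching the process's resident memory with a tool such as `top`, would show whether the growth in data-loading time coincides with growing memory use.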
@limbee
Owner

limbee commented Aug 31, 2017

I'm sorry, but I couldn't find the cause of this error. We have never seen it in our numerous experiments.
If you find a solution, please let us know. Thank you.
