
train_loss variable is None and val_loss variable in log not showing up #5

Open
rraju1 opened this issue Feb 11, 2022 · 5 comments

Comments

@rraju1

rraju1 commented Feb 11, 2022

Hi,

Thank you for the awesome project and the training script. I was able to replicate the result for resnet18 over 16 epochs (as per the resnet18 dataset settings), and it came out to roughly the same accuracy. My question concerns the train_loss variable coming out as None, while the validation loss isn't recorded in the log at all. Based on the top-1/top-5 accuracies it looks like training is working, but it would still be nice to have both losses logged. Can you confirm whether this is the expected behavior?

=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.06347999721765518, 'top_5': 0.18308000266551971, 'val_time': 9.070343971252441, 'train_loss': None, 'epoch': 0}

Thanks!

@lengstrom
Contributor

lengstrom commented Feb 13, 2022

What log_level are you running the script with? Can you please post the config box that gets printed out when you run the script?

@rraju1
Author

rraju1 commented Feb 14, 2022

I ran the script with log_level = 1. The exact command I used was:

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    --data.train_dataset=/staging/groups/lipasti_group/train_400_0.10_90.ffcv \
    --data.val_dataset=/staging/groups/lipasti_group/val_400_0.10_90.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder="."

I don't know if it's relevant, but I used ./write_imagenet.sh 400 0.10 90 to create the ffcv files.

┌─Arguments defined────────┬──────────────────────────────────────────────────────┐
│ Parameter                │ Value                                                │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ model.arch               │ resnet18                                             │
│ model.pretrained         │ 0                                                    │
│ resolution.min_res       │ 160                                                  │
│ resolution.max_res       │ 192                                                  │
│ resolution.end_ramp      │ 13                                                   │
│ resolution.start_ramp    │ 11                                                   │
│ data.train_dataset       │ /staging/groups/lipasti_group/train_400_0.10_90.ffcv │
│ data.val_dataset         │ /staging/groups/lipasti_group/val_400_0.10_90.ffcv   │
│ data.num_workers         │ 12                                                   │
│ data.in_memory           │ 1                                                    │
│ lr.step_ratio            │ 0.1                                                  │
│ lr.step_length           │ 30                                                   │
│ lr.lr_schedule_type      │ cyclic                                               │
│ lr.lr                    │ 0.5                                                  │
│ lr.lr_peak_epoch         │ 2                                                    │
│ logging.folder           │ .                                                    │
│ logging.log_level        │ 1                                                    │
│ validation.batch_size    │ 512                                                  │
│ validation.resolution    │ 256                                                  │
│ validation.lr_tta        │ 1                                                    │
│ training.eval_only       │ 0                                                    │
│ training.batch_size      │ 1024                                                 │
│ training.optimizer       │ sgd                                                  │
│ training.momentum        │ 0.9                                                  │
│ training.weight_decay    │ 5e-05                                                │
│ training.epochs          │ 16                                                   │
│ training.label_smoothing │ 0.1                                                  │
│ training.distributed     │ 0                                                    │
│ training.use_blurpool    │ 1                                                    │
│ dist.world_size          │ 1                                                    │
│ dist.address             │ localhost                                            │
│ dist.port                │ 12355                                                │
└──────────────────────────┴──────────────────────────────────────────────────────┘

@lengstrom
Contributor

What output do you get in ./log?

@rraju1
Author

rraju1 commented Feb 16, 2022

Below is the output from ./log:

Running job
=> Logging in /var/lib/condor/execute/slot1/dir_27650/2dbc9849-6bef-4d25-8dff-d3f250cc9d78
=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.09058000147342682, 'top_5': 0.2343199998140335, 'val_time': 9.118002891540527, 'train_loss': None, 'epoch': 0}
=> Log: {'current_lr': 0.49980017985611513, 'top_1': 0.1768600046634674, 'top_5': 0.38374000787734985, 'val_time': 7.200252532958984, 'train_loss': None, 'epoch': 1}
=> Log: {'current_lr': 0.464314262875414, 'top_1': 0.25446000695228577, 'top_5': 0.5013999938964844, 'val_time': 6.987864017486572, 'train_loss': None, 'epoch': 2}
=> Log: {'current_lr': 0.4285999771611283, 'top_1': 0.2940399944782257, 'top_5': 0.5546000003814697, 'val_time': 6.982339382171631, 'train_loss': None, 'epoch': 3}
=> Log: {'current_lr': 0.3928856914468425, 'top_1': 0.3174000084400177, 'top_5': 0.5855200290679932, 'val_time': 7.011723279953003, 'train_loss': None, 'epoch': 4}
=> Log: {'current_lr': 0.3571714057325568, 'top_1': 0.355459988117218, 'top_5': 0.6302800178527832, 'val_time': 7.0547261238098145, 'train_loss': None, 'epoch': 5}
=> Log: {'current_lr': 0.3214571200182711, 'top_1': 0.4002799987792969, 'top_5': 0.6700000166893005, 'val_time': 7.045661926269531, 'train_loss': None, 'epoch': 6}
=> Log: {'current_lr': 0.2857428343039854, 'top_1': 0.4112600088119507, 'top_5': 0.6878399848937988, 'val_time': 7.064192056655884, 'train_loss': None, 'epoch': 7}
=> Log: {'current_lr': 0.2500285485896997, 'top_1': 0.4200800061225891, 'top_5': 0.6910600066184998, 'val_time': 6.991353750228882, 'train_loss': None, 'epoch': 8}
=> Log: {'current_lr': 0.21431426287541397, 'top_1': 0.45730000734329224, 'top_5': 0.7239199876785278, 'val_time': 7.05776834487915, 'train_loss': None, 'epoch': 9}
=> Log: {'current_lr': 0.17859997716112827, 'top_1': 0.48787999153137207, 'top_5': 0.7565799951553345, 'val_time': 6.991325855255127, 'train_loss': None, 'epoch': 10}
=> Log: {'current_lr': 0.14288569144684257, 'top_1': 0.5102400183677673, 'top_5': 0.7673799991607666, 'val_time': 6.966805934906006, 'train_loss': None, 'epoch': 11}
=> Log: {'current_lr': 0.10717140573255682, 'top_1': 0.5567200183868408, 'top_5': 0.8069599866867065, 'val_time': 7.033263683319092, 'train_loss': None, 'epoch': 12}
=> Log: {'current_lr': 0.07145712001827112, 'top_1': 0.5875599980354309, 'top_5': 0.8281400203704834, 'val_time': 7.092468738555908, 'train_loss': None, 'epoch': 13}
=> Log: {'current_lr': 0.03574283430398542, 'top_1': 0.6301400065422058, 'top_5': 0.8547599911689758, 'val_time': 7.117802619934082, 'train_loss': None, 'epoch': 14}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.025460720062256, 'train_loss': None, 'epoch': 15}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.002474308013916, 'epoch': 15, 'total time': 2081.8419053554535}
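Incidentally, these log lines are Python dict literals rather than JSON (note the unquoted None and the single quotes), so anyone wanting to plot the metrics can parse them with ast.literal_eval. A minimal sketch using one line copied from the log above:

```python
import ast

# One line copied verbatim from the log; the "=> Log: " prefix must be
# stripped before parsing the remaining dict literal.
line = ("=> Log: {'current_lr': 2.8548589699667337e-05, "
        "'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, "
        "'val_time': 7.025460720062256, 'train_loss': None, 'epoch': 15}")

# ast.literal_eval accepts None and single quotes, which json.loads would not.
record = ast.literal_eval(line.removeprefix("=> Log: "))
print(record["epoch"], record["train_loss"])  # prints: 15 None
```

(str.removeprefix requires Python 3.9+; on older versions slice off the prefix manually.)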

@GeekAlexis

I have the same issue. It looks like train_loop doesn't return the loss at all.
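If that is the cause, one workaround is to accumulate the per-batch loss inside the training loop and return its mean, so the epoch log gets a number instead of None. A minimal framework-agnostic sketch (the names batches and compute_loss are hypothetical stand-ins, not the actual signature of train_loop in train_imagenet.py):

```python
def train_loop(batches, compute_loss):
    """Run one epoch and return the mean training loss (hypothetical sketch)."""
    losses = []
    for batch in batches:
        loss = compute_loss(batch)
        # ... backward pass and optimizer step would go here ...
        losses.append(float(loss))
    # Returning the mean instead of falling through fixes 'train_loss': None.
    return sum(losses) / len(losses) if losses else None

# Toy usage: three fake batches whose "loss" is just the batch value.
train_loss = train_loop([1.0, 2.0, 3.0], lambda b: b)
print({'train_loss': train_loss})  # prints: {'train_loss': 2.0}
```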
