
train_loss variable is None and val_loss variable in log not showing up #5

Open
rraju1 opened this issue Feb 11, 2022 · 5 comments

Comments

@rraju1

rraju1 commented Feb 11, 2022

Hi,

Thank you for the awesome project and the training script. I was able to replicate the result for resnet18 over 16 epochs (as per the resnet18 dataset settings), and it came out to roughly the same accuracy. My question concerns the train_loss variable coming out as None, while the validation loss isn't recorded in the log at all. Based on the top-1/top-5 accuracies it looks like training is working, but it would still be nice to have both losses logged. Can you confirm whether this is the expected behavior?

=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.06347999721765518, 'top_5': 0.18308000266551971, 'val_time': 9.070343971252441, 'train_loss': None, 'epoch': 0}

Thanks!

@lengstrom
Contributor

lengstrom commented Feb 13, 2022

What log_level are you running the script with? Can you please post the config box that gets printed out when you run the script?

@rraju1
Author

rraju1 commented Feb 14, 2022

I ran the script with log_level = 1. The exact command I used was:

python train_imagenet.py --config-file rn18_configs/rn18_16_epochs.yaml \
    --data.train_dataset=/staging/groups/lipasti_group/train_400_0.10_90.ffcv \
    --data.val_dataset=/staging/groups/lipasti_group/val_400_0.10_90.ffcv \
    --data.num_workers=12 --data.in_memory=1 \
    --logging.folder="."

I don't know if it's relevant, but I used ./write_imagenet.sh 400 0.10 90 to create the ffcv files.

┌─Arguments defined────────┬──────────────────────────────────────────────────────┐
│ Parameter                │ Value                                                │
├──────────────────────────┼──────────────────────────────────────────────────────┤
│ model.arch               │ resnet18                                             │
│ model.pretrained         │ 0                                                    │
│ resolution.min_res       │ 160                                                  │
│ resolution.max_res       │ 192                                                  │
│ resolution.end_ramp      │ 13                                                   │
│ resolution.start_ramp    │ 11                                                   │
│ data.train_dataset       │ /staging/groups/lipasti_group/train_400_0.10_90.ffcv │
│ data.val_dataset         │ /staging/groups/lipasti_group/val_400_0.10_90.ffcv   │
│ data.num_workers         │ 12                                                   │
│ data.in_memory           │ 1                                                    │
│ lr.step_ratio            │ 0.1                                                  │
│ lr.step_length           │ 30                                                   │
│ lr.lr_schedule_type      │ cyclic                                               │
│ lr.lr                    │ 0.5                                                  │
│ lr.lr_peak_epoch         │ 2                                                    │
│ logging.folder           │ .                                                    │
│ logging.log_level        │ 1                                                    │
│ validation.batch_size    │ 512                                                  │
│ validation.resolution    │ 256                                                  │
│ validation.lr_tta        │ 1                                                    │
│ training.eval_only       │ 0                                                    │
│ training.batch_size      │ 1024                                                 │
│ training.optimizer       │ sgd                                                  │
│ training.momentum        │ 0.9                                                  │
│ training.weight_decay    │ 5e-05                                                │
│ training.epochs          │ 16                                                   │
│ training.label_smoothing │ 0.1                                                  │
│ training.distributed     │ 0                                                    │
│ training.use_blurpool    │ 1                                                    │
│ dist.world_size          │ 1                                                    │
│ dist.address             │ localhost                                            │
│ dist.port                │ 12355                                                │
└──────────────────────────┴──────────────────────────────────────────────────────┘

@lengstrom
Contributor

What output do you get in ./log?

@rraju1
Author

rraju1 commented Feb 16, 2022

Below is the output from ./log:

Running job
=> Logging in /var/lib/condor/execute/slot1/dir_27650/2dbc9849-6bef-4d25-8dff-d3f250cc9d78
=> Log: {'current_lr': 0.2498251798561151, 'top_1': 0.09058000147342682, 'top_5': 0.2343199998140335, 'val_time': 9.118002891540527, 'train_loss': None, 'epoch': 0}
=> Log: {'current_lr': 0.49980017985611513, 'top_1': 0.1768600046634674, 'top_5': 0.38374000787734985, 'val_time': 7.200252532958984, 'train_loss': None, 'epoch': 1}
=> Log: {'current_lr': 0.464314262875414, 'top_1': 0.25446000695228577, 'top_5': 0.5013999938964844, 'val_time': 6.987864017486572, 'train_loss': None, 'epoch': 2}
=> Log: {'current_lr': 0.4285999771611283, 'top_1': 0.2940399944782257, 'top_5': 0.5546000003814697, 'val_time': 6.982339382171631, 'train_loss': None, 'epoch': 3}
=> Log: {'current_lr': 0.3928856914468425, 'top_1': 0.3174000084400177, 'top_5': 0.5855200290679932, 'val_time': 7.011723279953003, 'train_loss': None, 'epoch': 4}
=> Log: {'current_lr': 0.3571714057325568, 'top_1': 0.355459988117218, 'top_5': 0.6302800178527832, 'val_time': 7.0547261238098145, 'train_loss': None, 'epoch': 5}
=> Log: {'current_lr': 0.3214571200182711, 'top_1': 0.4002799987792969, 'top_5': 0.6700000166893005, 'val_time': 7.045661926269531, 'train_loss': None, 'epoch': 6}
=> Log: {'current_lr': 0.2857428343039854, 'top_1': 0.4112600088119507, 'top_5': 0.6878399848937988, 'val_time': 7.064192056655884, 'train_loss': None, 'epoch': 7}
=> Log: {'current_lr': 0.2500285485896997, 'top_1': 0.4200800061225891, 'top_5': 0.6910600066184998, 'val_time': 6.991353750228882, 'train_loss': None, 'epoch': 8}
=> Log: {'current_lr': 0.21431426287541397, 'top_1': 0.45730000734329224, 'top_5': 0.7239199876785278, 'val_time': 7.05776834487915, 'train_loss': None, 'epoch': 9}
=> Log: {'current_lr': 0.17859997716112827, 'top_1': 0.48787999153137207, 'top_5': 0.7565799951553345, 'val_time': 6.991325855255127, 'train_loss': None, 'epoch': 10}
=> Log: {'current_lr': 0.14288569144684257, 'top_1': 0.5102400183677673, 'top_5': 0.7673799991607666, 'val_time': 6.966805934906006, 'train_loss': None, 'epoch': 11}
=> Log: {'current_lr': 0.10717140573255682, 'top_1': 0.5567200183868408, 'top_5': 0.8069599866867065, 'val_time': 7.033263683319092, 'train_loss': None, 'epoch': 12}
=> Log: {'current_lr': 0.07145712001827112, 'top_1': 0.5875599980354309, 'top_5': 0.8281400203704834, 'val_time': 7.092468738555908, 'train_loss': None, 'epoch': 13}
=> Log: {'current_lr': 0.03574283430398542, 'top_1': 0.6301400065422058, 'top_5': 0.8547599911689758, 'val_time': 7.117802619934082, 'train_loss': None, 'epoch': 14}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.025460720062256, 'train_loss': None, 'epoch': 15}
=> Log: {'current_lr': 2.8548589699667337e-05, 'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, 'val_time': 7.002474308013916, 'epoch': 15, 'total time': 2081.8419053554535}
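Incidentally, these log lines are Python dict literals rather than JSON (note the unquoted None and the single quotes), so anyone wanting to plot the metrics can parse them with ast.literal_eval. A minimal sketch using one line copied from the log above:

```python
import ast

# One line copied verbatim from the log; the "=> Log: " prefix must be
# stripped before parsing the remaining dict literal.
line = ("=> Log: {'current_lr': 2.8548589699667337e-05, "
        "'top_1': 0.6671800017356873, 'top_5': 0.8743799924850464, "
        "'val_time': 7.025460720062256, 'train_loss': None, 'epoch': 15}")

# ast.literal_eval accepts None and single quotes, which json.loads would not.
record = ast.literal_eval(line.removeprefix("=> Log: "))
print(record["epoch"], record["train_loss"])  # prints: 15 None
```

(str.removeprefix requires Python 3.9+; on older versions slice off the prefix manually.)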

@GeekAlexis

I have the same issue. It looks like train_loop doesn't return the loss at all.
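If that is the cause, one workaround is to accumulate the per-batch loss inside the training loop and return its mean, so the epoch log gets a number instead of None. A minimal framework-agnostic sketch (the names batches and compute_loss are hypothetical stand-ins, not the actual signature of train_loop in train_imagenet.py):

```python
def train_loop(batches, compute_loss):
    """Run one epoch and return the mean training loss (hypothetical sketch)."""
    losses = []
    for batch in batches:
        loss = compute_loss(batch)
        # ... backward pass and optimizer step would go here ...
        losses.append(float(loss))
    # Returning the mean instead of falling through fixes 'train_loss': None.
    return sum(losses) / len(losses) if losses else None

# Toy usage: three fake batches whose "loss" is just the batch value.
train_loss = train_loop([1.0, 2.0, 3.0], lambda b: b)
print({'train_loss': train_loss})  # prints: {'train_loss': 2.0}
```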
