Missing batch statistics #946

Open
alvations opened this issue Jul 4, 2022 · 2 comments

alvations commented Jul 4, 2022

Bug description

At a certain point in the data, the batch statistics somehow "disappear" or go missing. I recall there was an issue about writing the batch out to a temp file, but I've been using --shuffle-in-ram, so I'm not sure where to find the bad batch. Maybe it's #480?

I will try re-running without --shuffle-in-ram to see whether it gets past that same batch. Meanwhile, is it possible to skip a batch and move on to the next one if that batch's statistics are missing?

[2022-07-03 20:27:36] Saving Adam parameters
[2022-07-03 20:27:37] [training] Saving training checkpoint to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz and /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.optimizer.npz
[2022-07-03 20:47:31] Saving model weights and runtime parameters to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.best-ce-mean-words.npz
[2022-07-03 20:47:38] [valid] Ep. 1 : Up. 13000 : ce-mean-words : 0.225207 : new best
[2022-07-03 21:07:02] Saving model weights and runtime parameters to /home/ubuntu/stash/transliterate/cjk-transliterate-6+6-8-1024-4096-12000-0.0001-0.1/cjk-zz-r42-controllable/model.npz.best-perplexity.npz
[2022-07-03 21:07:08] [valid] Ep. 1 : Up. 13000 : perplexity : 1.25258 : new best
[2022-07-03 21:07:17] Error: Missing batch statistics
[2022-07-03 21:07:17] Error: Aborted from size_t marian::data::BatchStats::findBatchSize(const std::vector<long unsigned int>&, marian::data::BatchStats::const_iterator&) const in /home/ubuntu/marian/src/data/batch_stats.h:38

[CALL STACK]
[0x5654ccc315a7]    marian::data::BatchStats::  findBatchSize  (std::vector<unsigned long,std::allocator<unsigned long>> const&,  std::_Rb_tree_const_iterator<std::pair<std::vector<unsigned long,std::allocator<unsigned long>> const,unsigned long>>&) const + 0x277
[0x5654ccc8800c]    marian::data::BatchGenerator<marian::data::CorpusBase>::  fetchBatches  () + 0x181c
[0x5654ccc889b3]    marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1}::  operator()  () const + 0x33
[0x5654ccc89573]    std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> (),std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>,std::__future_base::_Result_base::_Deleter>,std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>> ()>::_M_run()::{lambda()#1},std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr<marian::data::CorpusBatch>>>>>::  _M_invoke  (std::_Any_data const&) + 0x53
[0x5654ccbb63ad]    std::__future_base::_State_baseV2::  _M_do_set  (std::function<std::unique_ptr<std::__future_base::_Result_base,std::__future_base::_Result_base::_Deleter> ()>*,  bool*) + 0x2d
[0x7fa69415747f]                                                       + 0x1147f
[0x5654ccbc1710]    std::__future_base::_Task_state<marian::ThreadPool::enqueue<marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}>(std::result_of&&,(marian::data::BatchGenerator<marian::data::CorpusBase>::fetchBatchesAsync()::{lambda()#1}&&)...)::{lambda()#1},std::allocator<int>,std::deque<std::shared_ptr<marian::data::CorpusBatch>,std::allocator<std::shared_ptr>> ()>::  _M_run  () + 0x120
[0x5654ccbb7a30]    std::thread::_State_impl<std::thread::_Invoker<std::tuple<marian::ThreadPool::reserve(unsigned long)::{lambda()#1}>>>::  _M_run  () + 0x180
[0x5654ceebd5b4]                                                       + 0x34745b4
[0x7fa69414e609]                                                       + 0x8609
[0x7fa693f24163]    clone                                              + 0x43

How to reproduce

See next comment.

Context

  • Marian version: v1.11.0 f00d0621 2022-02-08 08:39:24 -0800
alvations added the bug label Jul 4, 2022

alvations commented Jul 4, 2022

Data

https://drive.google.com/file/d/1hr-RzBz-5zCMhwbi4ogzPulVB1ZLCDUI/view?usp=sharing

Marian version

v1.11.0 f00d0621 2022-02-08 08:39:24 -0800

Command

#!/bin/bash

SRC=fr           # en
TRG=xx           # ja
RANDSEED=42      # 42
ELAYERS=6       # 6
DLAYERS=6
HEADS=8         # 8
DIMEMB=1024        # 1024
DIMTRA=4096      # 4096
VOCABSIZE=8000     # 32000
LR=0.0001         # 0.0001
DROPOUT=0.1    # 0.1

MODELDIR=$HOME/stash/fdi-$ELAYERS+$DLAYERS-$HEADS-$DIMEMB-$DIMTRA-$VOCABSIZE-$LR-$DROPOUT/$SRC-$TRG-r$RANDSEED/

mkdir -p $MODELDIR

DATADIR=$HOME/stash/fdi-data

TRAIN_SRC=$DATADIR/train.$SRC-$TRG.$SRC
TRAIN_TRG=$DATADIR/train.$SRC-$TRG.$TRG
VALID_SRC=$DATADIR/valid.$SRC-$TRG.$SRC
VALID_TRG=$DATADIR/valid.$SRC-$TRG.$TRG
TRAINLOG=$MODELDIR/train.log
VALIDLOG=$MODELDIR/valid.log

GPUS="0"
WORKSPACE=10185  # Assumes 11GB RAM on GPU

MARIAN=$HOME/marian/build/marian

$MARIAN --model $MODELDIR/model.npz --type transformer \
--train-sets $TRAIN_SRC $TRAIN_TRG \
--vocabs $MODELDIR/vocab.src.spm $MODELDIR/vocab.src.spm \
--dim-vocabs $VOCABSIZE $VOCABSIZE \
--valid-freq 500 --save-freq 500 --disp-freq 00 \
--valid-metrics ce-mean-words perplexity  \
--valid-sets $VALID_SRC $VALID_TRG \
--quiet-translation \
--beam-size 12 --normalize=0.6 \
--valid-mini-batch 16 \
--early-stopping 5 --cost-type=ce-mean-words \
--log $TRAINLOG --valid-log $VALIDLOG \
--enc-depth $ELAYERS --dec-depth $DLAYERS \
--transformer-preprocess n --transformer-postprocess da \
--tied-embeddings-all --dim-emb $DIMEMB --transformer-dim-ffn $DIMTRA \
--transformer-dropout $DROPOUT --transformer-dropout-attention $DROPOUT \
--transformer-dropout-ffn $DROPOUT --label-smoothing $DROPOUT \
--learn-rate $LR \
--lr-warmup 8000 --lr-decay-inv-sqrt 8000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--devices $GPUS --workspace $WORKSPACE  --optimizer-delay 1 --sync-sgd --seed $RANDSEED \
--exponential-smoothing \
--keep-best \
--max-length 5000 --valid-max-length 5000 --max-length-crop \
--shuffle-in-ram --mini-batch-fit \
--sentencepiece-options "--character_coverage=1.0 --user_defined_symbols=BE,CA,CH,FR"

Logfile

https://gist.github.com/alvations/9da72d5458c409e8971ee3c65d550a85

Hardware

$ nvidia-smi 
Mon Jul  4 22:06:50 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    On   | 00000000:06:00.0 Off |                  Off |
| 30%   36C    P8    18W / 300W |      1MiB / 48685MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

alvations commented Jul 4, 2022

This is interesting: the failure depends on the maximum length setting.

  • Broke batching: --max-length 5000 --valid-max-length 5000

  • Broke batching: --max-length 2000 --valid-max-length 2000

  • Seems to work: --max-length 1000 --valid-max-length 1000
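
Since the failure appears to depend on --max-length, one way to narrow down the offending batch might be to list the sentence pairs that are longer than the threshold that still works. The following is a minimal sketch, not a Marian tool: `find_long_pairs` is a hypothetical helper name, it assumes the corpus lines contain no tab characters, and it measures length with awk's `length()` (characters or bytes, depending on the awk implementation and locale), which is only a rough proxy for the subword count that --max-length is compared against.

```shell
#!/bin/bash
# Hypothetical helper (not part of Marian): print
#   line_number<TAB>source_length<TAB>target_length
# for every sentence pair where either side is longer than a threshold.
# Assumptions: both files have the same number of lines and no line
# contains a tab. Lengths come from awk's length(), so they are a proxy
# for token counts, not the values Marian itself computes.
find_long_pairs() {
  local src_file=$1 trg_file=$2 threshold=$3
  paste "$src_file" "$trg_file" |
    awk -F'\t' -v max="$threshold" '
      {
        s = length($1)   # length of the source side
        t = length($2)   # length of the target side
        if (s > max || t > max) printf "%d\t%d\t%d\n", NR, s, t
      }'
}

# Example usage with the variables from the training script:
# find_long_pairs "$TRAIN_SRC" "$TRAIN_TRG" 1000 | head
```

If the listed line numbers cluster around the point where training aborts, those pairs would be natural candidates to inspect or filter out before re-running.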
