
MASS-summarization how to generate unmasked output? #129

Open
SONG1178 opened this issue Mar 15, 2020 · 7 comments

@SONG1178

I am using the WikiHow dataset to test the performance of MASS. The output I get from fairseq-generate is:

S-5945  [UNK] now ready for [UNK] and placing on [UNK]
T-5945  <[UNK]>
H-5945  -1.701836347579956  remove the [UNK] from the oven and place it in the [UNK] [UNK] and allow it to cool for a few minutes before it is ready to dry for [UNK] to cool and then [UNK] it on the oven for the [UNK] to cook for about 15 minutes before [UNK]
P-5945  -2.3607 -1.3653 -0.8389 -1.6368 -0.2720 -1.8948 -0.3451 -2.1066 -0.8099 -1.2254 -0.6529 -0.5549 -2.3445 -4.6659 -3.0849 -0.8033 -0.0442 -0.7458 -0.7507 -1.7043 -0.4074 -1.3986 -0.4831 -3.4457 -1.3255 -2.6874 -0.1853 -3.0822 -1.1561 -0.5247 -4.2249 -2.3640 -1.9811 -2.1335 -0.6717 -4.4360 -1.9174 -0.8858 -3.1737 -1.4359 -3.6464 -0.5433 -3.6553 -4.3460 -1.1267 -2.0195 -2.6686 -1.1864 -0.7702 -0.5627 -0.1414

I believe the [UNK] is the mask. The question is: how do I get an unmasked output?

@StillKeepTry
Contributor

Can you show me your input for generation?

@SONG1178
Author

Here is what I wrote to run the generation:
fairseq-generate processed --path checkpoints/checkpoint_best.pt \
    --user-dir mass --task translation_mass \
    --batch-size 64 --beam 5 --min-len 50 --no-repeat-ngram-size 3 \
    --lenpen 1.0

where processed contains the previously binarised dataset

@StillKeepTry
Contributor

Is the model the pre-trained model, or has it been fine-tuned on your dataset? If it has been fine-tuned on your dataset, can you show me part of your dataset or the log from binarizing it?

In my setting, [UNK] just represents unknown (out-of-vocabulary) tokens.
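
A quick way to check whether this is a vocabulary-coverage problem is to count how many token types in your raw source file are missing from the dictionary. A minimal shell sketch (dict.txt and data/train.src here are assumed names taken from the commands in this thread; adjust the paths to your setup):

# Sketch only: dict.txt is assumed to have one "<token> <count>" entry per line,
# and data/train.src is assumed to be the raw training source.
# Every token type absent from the vocabulary is mapped to [UNK] at binarization.
cut -d' ' -f1 dict.txt | sort > vocab.sorted
tr ' ' '\n' < data/train.src | sort -u | comm -23 - vocab.sorted | wc -l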

@SONG1178
Author

I'm sorry for the misunderstanding.
The dataset looks like:
@summary keep track of your record. @article Have a piece of paper or an emblem that has your Win/Loss Ratio; it's no fun to play competitively if you don't have a way of knowing how good you are doing.
and some tokens become [UNK] in the training log.
The command I run for preprocessing is as follows:
fairseq-preprocess \
    --user-dir mass --task masked_s2s \
    --source-lang src --target-lang tgt \
    --trainpref data/train --validpref data/val --testpref data/test \
    --destdir processed --srcdict dict.txt --tgtdict dict.txt \
    --workers 20
and the log is:
Namespace(alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='processed', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='src', srcdict='dict.txt', target_lang='tgt', task='masked_s2s', tbmf_wrapper=False, tensorboard_logdir='', testpref='data/test', tgtdict='dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='data/train', user_dir='mass', validpref='data/val', workers=20)
| [src] Dictionary: 30521 types
| [src] data/train.src: 168028 sents, 11526616 tokens, 21.7% replaced by [UNK]
| [src] Dictionary: 30521 types
| [src] data/val.src: 5997 sents, 411015 tokens, 21.9% replaced by [UNK]
| [src] Dictionary: 30521 types
| [src] data/test.src: 5996 sents, 408924 tokens, 22.0% replaced by [UNK]
| [tgt] Dictionary: 30521 types
| [tgt] data/train.tgt: 168028 sents, 1265227 tokens, 19.6% replaced by [UNK]
| [tgt] Dictionary: 30521 types
| [tgt] data/val.tgt: 5997 sents, 45240 tokens, 19.7% replaced by [UNK]
| [tgt] Dictionary: 30521 types
| [tgt] data/test.tgt: 5996 sents, 44796 tokens, 19.6% replaced by [UNK]
| Wrote preprocessed data to processed

@StillKeepTry
Contributor

StillKeepTry commented Mar 18, 2020

According to the log from your preprocessing stage, nearly 20% of the tokens have been replaced by [UNK]. I am sure your dataset has not been tokenized into sub-words. You first need to tokenize your dataset at the word-piece level. We provide a script to tokenize a dataset into word-pieces. A demo looks like:

mkdir -p mono
for SPLIT in train valid test; do
    python encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs mono/${SPLIT}.txt \
        --workers 60
done
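
For the summarization data itself, the same idea applies. A rough sketch, assuming the file names from the commands earlier in this thread (data/{train,val,test}.{src,tgt} and dict.txt), would be to run encode.py over each split and then re-binarize, after which the "% replaced by [UNK]" figure in the preprocessing log should drop sharply:

# Sketch only: paths follow the commands earlier in this thread; adjust as needed.
mkdir -p tokenized
for SPLIT in train val test; do
    for LANG in src tgt; do
        python encode.py \
            --inputs data/${SPLIT}.${LANG} \
            --outputs tokenized/${SPLIT}.${LANG} \
            --workers 20
    done
done

# Re-binarize the word-piece data with the same dictionary.
fairseq-preprocess \
    --user-dir mass --task masked_s2s \
    --source-lang src --target-lang tgt \
    --trainpref tokenized/train --validpref tokenized/val --testpref tokenized/test \
    --destdir processed --srcdict dict.txt --tgtdict dict.txt \
    --workers 20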

@SONG1178
Author

Thank you so much for your advice. I have tokenised the dataset using the code you suggested. However, it appears to me that the model does not really 'summarise' the text. The output generated by fairseq-generate is shown below:

S-3006  your vet will decide the cause of your dog ’ s ear infection , such as bacteria , yeast , or something else . this will determine how your dog is treated . your vet will clean your dog ’ s ear . most of the time , the vet will start with a topical ear drops in the office , which contain medications against yeast , bacteria , or ear mit ##es . then , your vet will give you an anti ##biotic , anti ##fu ##nga ##l , or other medicine to administer at home . if the dog has regular ear infections , or the infection is slow to respond , the vet may perform further tests . these include examining discharge sm ##ears under a microscope or sending a sw ##ab away for culture .
T-3006  get the proper treatment for your dog .
H-3006  -1.0089088678359985  have your vet examine your dog ’ s ear infections at the vet ##erina ##rian ’ s office . if your dog has regular ear infections , the vet will give you an anti ##fu ##nga ##l , antibiotics , and antibiotics to administer at home for a period of time .
P-3006  -1.7238 -0.4101 -0.5660 -1.6493 -0.3975 -0.0812 -0.1971 -0.1003 -0.2629 -2.8643 -3.5335 -1.6000 -0.6871 -1.6712 -0.0722 -2.4702 -0.1171 -0.1275 -0.1614 -7.4418 -0.6900 -0.1267 -0.6967 -1.4147 -0.1394 -0.1095 -0.3544 -1.3551 -0.2830 -0.6664 -2.3485 -0.6951 -1.5013 -0.4509 -0.1020 -0.0235 -0.0174 -2.0451 -0.5637 -0.3491 -1.2786 -0.7637 -1.3381 -0.7185 -0.6126 -0.7385 -3.3992 -1.6051 -2.5224 -0.0767 -0.0889 -0.1518 -0.1112

Do you have any suggestions about the cause?

@StillKeepTry
Contributor

Have you tried fine-tuning the model on your data, or are you just testing with the pre-trained model?
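
For reference, a fine-tuning run would look roughly like the sketch below. This is only an outline: the architecture name (transformer_mass_base), the pre-trained checkpoint file name, and the hyperparameters are assumptions and should be checked against the MASS-summarization README.

# Sketch only: the arch name, checkpoint name, and hyperparameters below are
# assumptions; verify the exact flags against the MASS-summarization README.
fairseq-train processed \
    --user-dir mass --task translation_mass --arch transformer_mass_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --max-epoch 25 \
    --load-from-pretrained-model mass-base-uncased.pt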
