
MASS-summarization how to generate unmasked output? #129

Open
SONG1178 opened this issue Mar 15, 2020 · 7 comments

@SONG1178

I am using the WikiHow dataset to test the performance of MASS. The output I get from fairseq-generate is:

S-5945  [UNK] now ready for [UNK] and placing on [UNK]
T-5945  <[UNK]>
H-5945  -1.701836347579956  remove the [UNK] from the oven and place it in the [UNK] [UNK] and allow it to cool for a few minutes before it is ready to dry for [UNK] to cool and then [UNK] it on the oven for the [UNK] to cook for about 15 minutes before [UNK]
P-5945  -2.3607 -1.3653 -0.8389 -1.6368 -0.2720 -1.8948 -0.3451 -2.1066 -0.8099 -1.2254 -0.6529 -0.5549 -2.3445 -4.6659 -3.0849 -0.8033 -0.0442 -0.7458 -0.7507 -1.7043 -0.4074 -1.3986 -0.4831 -3.4457 -1.3255 -2.6874 -0.1853 -3.0822 -1.1561 -0.5247 -4.2249 -2.3640 -1.9811 -2.1335 -0.6717 -4.4360 -1.9174 -0.8858 -3.1737 -1.4359 -3.6464 -0.5433 -3.6553 -4.3460 -1.1267 -2.0195 -2.6686 -1.1864 -0.7702 -0.5627 -0.1414

I believe the [UNK] is the mask. The question is: how do I get an unmasked output?

@StillKeepTry
Contributor

Can you show me your input for generation?

@SONG1178
Author

Here is what I wrote to run the generation:
fairseq-generate processed --path checkpoints/checkpoint_best.pt \
    --user-dir mass --task translation_mass \
    --batch-size 64 --beam 5 --min-len 50 --no-repeat-ngram-size 3 \
    --lenpen 1.0

where processed contains the previously binarised dataset

@StillKeepTry
Contributor

Is the model the pre-trained model, or has it been fine-tuned on your dataset? If it has been fine-tuned on your dataset, can you show me part of your dataset or the log from binarizing it?

In my setting, [UNK] just represents unknown (out-of-vocabulary) tokens.
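
A quick way to check whether this is a vocabulary-coverage problem is to count how many token types in your raw source file are missing from the dictionary. A minimal shell sketch (dict.txt and data/train.src here are assumed names taken from the commands in this thread; adjust the paths to your setup):

# Sketch only: dict.txt is assumed to have one "<token> <count>" entry per line,
# and data/train.src is assumed to be the raw training source.
# Every token type absent from the vocabulary is mapped to [UNK] at binarization.
cut -d' ' -f1 dict.txt | sort > vocab.sorted
tr ' ' '\n' < data/train.src | sort -u | comm -23 - vocab.sorted | wc -l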

@SONG1178
Author

I'm sorry for the misunderstanding.
The dataset looks like:
@summary keep track of your record. @article Have a piece of paper or an emblem that has your Win/Loss Ratio; it's no fun to play competitively if you don't have a way of knowing how good you are doing.
and some tokens become [UNK] in the training log.
The command I run for preprocessing is as follows:
fairseq-preprocess \
    --user-dir mass --task masked_s2s \
    --source-lang src --target-lang tgt \
    --trainpref data/train --validpref data/val --testpref data/test \
    --destdir processed --srcdict dict.txt --tgtdict dict.txt \
    --workers 20
and the log is:
Namespace(alignfile=None, bpe=None, cpu=False, criterion='cross_entropy', dataset_impl='mmap', destdir='processed', fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, joined_dictionary=False, log_format=None, log_interval=1000, lr_scheduler='fixed', memory_efficient_fp16=False, min_loss_scale=0.0001, no_progress_bar=False, nwordssrc=-1, nwordstgt=-1, only_source=False, optimizer='nag', padding_factor=8, seed=1, source_lang='src', srcdict='dict.txt', target_lang='tgt', task='masked_s2s', tbmf_wrapper=False, tensorboard_logdir='', testpref='data/test', tgtdict='dict.txt', threshold_loss_scale=None, thresholdsrc=0, thresholdtgt=0, tokenizer=None, trainpref='data/train', user_dir='mass', validpref='data/val', workers=20)
| [src] Dictionary: 30521 types
| [src] data/train.src: 168028 sents, 11526616 tokens, 21.7% replaced by [UNK]
| [src] Dictionary: 30521 types
| [src] data/val.src: 5997 sents, 411015 tokens, 21.9% replaced by [UNK]
| [src] Dictionary: 30521 types
| [src] data/test.src: 5996 sents, 408924 tokens, 22.0% replaced by [UNK]
| [tgt] Dictionary: 30521 types
| [tgt] data/train.tgt: 168028 sents, 1265227 tokens, 19.6% replaced by [UNK]
| [tgt] Dictionary: 30521 types
| [tgt] data/val.tgt: 5997 sents, 45240 tokens, 19.7% replaced by [UNK]
| [tgt] Dictionary: 30521 types
| [tgt] data/test.tgt: 5996 sents, 44796 tokens, 19.6% replaced by [UNK]
| Wrote preprocessed data to processed

@StillKeepTry
Contributor

StillKeepTry commented Mar 18, 2020

According to the log from your preprocessing stage, nearly 20% of the tokens have been replaced by [UNK]. I am sure your dataset has not been tokenized into sub-words. You first need to tokenize your dataset at the word-piece level. We provide a script to tokenize a dataset into word-pieces. A demo looks like:

mkdir -p mono
for SPLIT in train valid test; do
    python encode.py \
        --inputs wikitext-103-raw/wiki.${SPLIT}.raw \
        --outputs mono/${SPLIT}.txt \
        --workers 60
done
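
For the summarization data itself, the same idea applies. A rough sketch, assuming the file names from the commands earlier in this thread (data/{train,val,test}.{src,tgt} and dict.txt), would be to run encode.py over each split and then re-binarize, after which the "% replaced by [UNK]" figure in the preprocessing log should drop sharply:

# Sketch only: paths follow the commands earlier in this thread; adjust as needed.
mkdir -p tokenized
for SPLIT in train val test; do
    for LANG in src tgt; do
        python encode.py \
            --inputs data/${SPLIT}.${LANG} \
            --outputs tokenized/${SPLIT}.${LANG} \
            --workers 20
    done
done

# Re-binarize the word-piece data with the same dictionary.
fairseq-preprocess \
    --user-dir mass --task masked_s2s \
    --source-lang src --target-lang tgt \
    --trainpref tokenized/train --validpref tokenized/val --testpref tokenized/test \
    --destdir processed --srcdict dict.txt --tgtdict dict.txt \
    --workers 20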

@SONG1178
Author

Thank you so much for your advice. I have tokenised the dataset using the code you suggested. However, it appears to me that the model does not really 'summarise' the text. The output generated by fairseq-generate is shown below:

S-3006  your vet will decide the cause of your dog ’ s ear infection , such as bacteria , yeast , or something else . this will determine how your dog is treated . your vet will clean your dog ’ s ear . most of the time , the vet will start with a topical ear drops in the office , which contain medications against yeast , bacteria , or ear mit ##es . then , your vet will give you an anti ##biotic , anti ##fu ##nga ##l , or other medicine to administer at home . if the dog has regular ear infections , or the infection is slow to respond , the vet may perform further tests . these include examining discharge sm ##ears under a microscope or sending a sw ##ab away for culture .
T-3006  get the proper treatment for your dog .
H-3006  -1.0089088678359985  have your vet examine your dog ’ s ear infections at the vet ##erina ##rian ’ s office . if your dog has regular ear infections , the vet will give you an anti ##fu ##nga ##l , antibiotics , and antibiotics to administer at home for a period of time .
P-3006  -1.7238 -0.4101 -0.5660 -1.6493 -0.3975 -0.0812 -0.1971 -0.1003 -0.2629 -2.8643 -3.5335 -1.6000 -0.6871 -1.6712 -0.0722 -2.4702 -0.1171 -0.1275 -0.1614 -7.4418 -0.6900 -0.1267 -0.6967 -1.4147 -0.1394 -0.1095 -0.3544 -1.3551 -0.2830 -0.6664 -2.3485 -0.6951 -1.5013 -0.4509 -0.1020 -0.0235 -0.0174 -2.0451 -0.5637 -0.3491 -1.2786 -0.7637 -1.3381 -0.7185 -0.6126 -0.7385 -3.3992 -1.6051 -2.5224 -0.0767 -0.0889 -0.1518 -0.1112

Do you have any suggestions about the cause?

@StillKeepTry
Contributor

Have you tried fine-tuning the model on your data, or are you just testing with the pre-trained model?
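
For reference, a fine-tuning run would look roughly like the sketch below. This is only an outline: the architecture name (transformer_mass_base), the pre-trained checkpoint file name, and the hyperparameters are assumptions and should be checked against the MASS-summarization README.

# Sketch only: the arch name, checkpoint name, and hyperparameters below are
# assumptions; verify the exact flags against the MASS-summarization README.
fairseq-train processed \
    --user-dir mass --task translation_mass --arch transformer_mass_base \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 --max-epoch 25 \
    --load-from-pretrained-model mass-base-uncased.pt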
