Evaluating BART on CNN/DM : How to process dataset #1391

astariul · 2019-11-19T07:11:59Z

From the README of BART for reproducing CNN/DM results :

Follow instructions here to download and process into data-files such that test.source and test.target has one line for each non-tokenized sample.

After following instructions, I don't have files like test.source and test.target...

Instead, I have test.bin, and chunked version of this file
(chunked/test_000.bin ~ chunked/test_011.bin).

How can I process test.bin into test.source and test.target ?

@ngoyal2707 @yinhanliu

yinhanliu · 2019-11-19T07:39:44Z

Thanks for the interest. You need to remove https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L235
and comment out all the tf code in the function write_to_bin.
You need to keep the raw data (no tokenization) as the input to the GPT-2 BPE.

astariul · 2019-11-20T05:11:04Z

Note
I also had to modify this line :

https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L145

In order to remove <s> and </s> from the target file.

astariul · 2019-11-20T08:44:16Z

Note 2

To get better results, I also had to keep text cased. In order to do this, I removed this line :

https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L122
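
Notes 1 and 2 amount to joining the raw sentences without sentence tags and without lowercasing. A minimal sketch of that idea, assuming the article lines and the highlights have already been read into lists of strings; `build_pair` is a hypothetical helper for illustration, not the function in make_datafiles.py:

```python
def build_pair(article_lines, highlights):
    """Return (article, abstract), each as a single line of text.

    Hypothetical helper: case is preserved and no <s>/</s> sentence
    markers are added around the highlights.
    """
    # Drop empty lines and join the rest with single spaces
    article = " ".join(line.strip() for line in article_lines if line.strip())
    abstract = " ".join(h.strip() for h in highlights if h.strip())
    return article, abstract
```

Each returned string can then be written as one line of the .source or .target file.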

isabelcachola · 2019-12-05T01:35:37Z

I followed these instructions but I'm getting .bin files instead of .source and .target files. Am I missing something? I'm also trying to reproduce these results.

isabelcachola · 2019-12-06T18:59:03Z

I modified the write_to_bin function to the following. Is this the correct data format?

def write_to_bin(url_file, out_file, makevocab=False):
  """Reads the tokenized .story files corresponding to the urls listed in the url_file and writes one article per line to <out_file>.source and one abstract per line to <out_file>.target."""
  print("Making source/target files for URLs listed in %s..." % url_file)
  url_list = read_text_file(url_file)
  url_hashes = get_url_hashes(url_list)
  story_fnames = [s+".story" for s in url_hashes]
  num_stories = len(story_fnames)

  if makevocab:
    vocab_counter = collections.Counter()

  # Text mode, so the str values below can be written on both Python 2 and 3
  with open('%s.target' % out_file, 'w') as target_file, open('%s.source' % out_file, 'w') as source_file:
    for idx, s in enumerate(story_fnames):
      if idx % 1000 == 0:
        print("Writing story %i of %i; %.2f percent done" % (idx, num_stories, float(idx)*100.0/float(num_stories)))

      # Look in the tokenized story dirs to find the .story file corresponding to this url
      if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)):
        story_file = os.path.join(cnn_tokenized_stories_dir, s)
      elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)):
        story_file = os.path.join(dm_tokenized_stories_dir, s)
      else:
        print("Error: Couldn't find tokenized story file %s in either tokenized story directories %s and %s. Was there an error during tokenization?" % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir))
        # Check again if tokenized stories directories contain correct number of files
        print("Checking that the tokenized stories directories %s and %s contain correct number of files..." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir))
        check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories)
        check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories)
        raise Exception("Tokenized stories directories %s and %s contain correct number of files but story file %s found in neither." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s))

      # Get the strings to write to the .source and .target files
      article, abstract = get_art_abs(story_file)

      # One sample per line in each file
      target_file.write(abstract + '\n')
      source_file.write(article + '\n')
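
Since the README asks for one line per sample in both files, a quick sanity check is to confirm that the two files have the same number of lines. A minimal Python 3 sketch; `count_lines` and `check_parallel` are hypothetical helper names, and the file prefix passed in is whatever was used as out_file:

```python
def count_lines(path):
    """Count "\n"-terminated lines in a UTF-8 text file."""
    # newline="\n" means only "\n" ends a line; a stray "\r" is not counted as a break
    with open(path, "r", encoding="utf-8", newline="\n") as f:
        return sum(1 for _ in f)


def check_parallel(prefix):
    """Raise ValueError if <prefix>.source and <prefix>.target differ in length."""
    n_source = count_lines(prefix + ".source")
    n_target = count_lines(prefix + ".target")
    if n_source != n_target:
        raise ValueError(
            "%s.source has %d lines but %s.target has %d"
            % (prefix, n_source, prefix, n_target)
        )
    return n_source
```

A mismatch usually means that some article or abstract contained an embedded line break.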

zhaoguangxiang · 2019-12-06T19:11:44Z

There are many details; here is my code.

I fixed the excess length of train.bpe.source, caused by the ASCII 0x0D character (carriage return, "\r") in articles, with a split and join.

I summarize several notes here:

remove the " " before "."
keep the text cased: remove the line that lowercases it
"\r" in the original articles leads to errors in BPE preprocessing
remove "(CNN)"
BPE encoding

code : https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
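
For readers who only want the gist of the text-cleaning notes, here is a minimal sketch. It is not the code from the linked gist; `clean_text` is a hypothetical helper, and the blanket replacement of " ." is a simplifying assumption that may be too aggressive for some inputs:

```python
def clean_text(text):
    """Apply the cleaning notes to one article or abstract (case is kept)."""
    # Remove the "(CNN)" marker
    text = text.replace("(CNN)", "")
    # Split and join: collapses "\r", "\n", tabs and repeated spaces into single spaces
    text = " ".join(text.split())
    # Remove the space before "."
    text = text.replace(" .", ".")
    return text
```

The BPE encoding step is separate and runs on the cleaned files afterwards.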

isabelcachola · 2019-12-06T19:33:22Z

@zhaoguangxiang Thank you!

artmatsak · 2020-01-27T13:45:06Z

Here's a version for Python 3 if anyone is interested:

https://github.com/artmatsak/cnn-dailymail

Summary: The first step in the CNN/DM fine-tuning instructions for BART is misleading (see #1391). This PR fixes the README and adds links to #1391 as well as to a repository with CNN/DM processing code adjusted for BART. Pull Request resolved: #1650 Differential Revision: D19606689 fbshipit-source-id: 4f1771f47d3650035a911ab393ab6df2193c1bf9

Ricardokevins · 2021-12-17T13:22:31Z

@zhaoguangxiang
Hi, thank you for providing the preprocessing code.
I am looking for the right preprocessing scripts. Can this script reproduce the results reported in the paper?
I ran the code on Windows and encountered many encoding problems. After fixing them, I found that the dataset format is abnormal,
e.g. a strange blank at the head of every line, and large gaps between sentences (for example in line 5).
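
One possible source of such problems on Windows (not confirmed in this thread) is relying on the platform's default encoding and newline translation. A minimal Python 3 sketch that writes files with an explicit encoding and "\n" line endings, and normalizes whitespace in each sample; `write_lines` is a hypothetical helper:

```python
def write_lines(path, lines):
    """Write one sample per line as UTF-8 with "\n" line endings on any platform."""
    # newline="\n" disables the "\n" -> "\r\n" translation that Windows applies by default
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            # Collapse runs of whitespace and strip leading/trailing blanks
            f.write(" ".join(line.split()) + "\n")
```

Reading the story files with `encoding="utf-8"` in the same way avoids depending on the system locale.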

zhaoguangxiang · 2021-12-17T15:49:57Z

I forgot my reproduction result. I will reply to you after trying again.

Ricardokevins · 2021-12-17T15:53:46Z

Thank you very much~~ It will help a lot

BaohaoLiao · 2023-03-10T19:15:13Z

If anyone still has problems with:

downloading and preprocessing CNN/DM
evaluating fine-tuned BART on CNN/DM

you might want to check my reproduction repository https://github.com/BaohaoLiao/NLP-reproduction

zhaoguangxiang · 2023-03-11T01:07:25Z

I forgot my reproduction experience.

astariul closed this as completed Nov 20, 2019

astariul mentioned this issue Nov 20, 2019

Difficulties to reproduce CNN/DM results with BART #1401

Closed

zhaoguangxiang mentioned this issue Dec 6, 2019

[BART] issues on BPE preprocess (examples.roberta.multiprocessing_bpe_encoder) #1423

Closed

artmatsak mentioned this issue Jan 27, 2020

Fix BART CNN/DM fine-tuning instructions #1650

Closed

loganlebanoff mentioned this issue Feb 11, 2020

How to use Bert or Bart or Roberta or GPT for translation #1599

Closed

astariul mentioned this issue Oct 19, 2020

temp JJJJane/DFA-template#1

Open

xssstory mentioned this issue Jul 19, 2022

the inputs xssstory/SeqCo#2

Open

thaokimctu mentioned this issue Sep 26, 2022

Create raw_data structure in example folder from input yixinL7/BRIO#20

Open
