
Evaluating BART on CNN/DM: How to process dataset #1391

Closed

astariul opened this issue Nov 19, 2019 · 13 comments

Comments

@astariul
Contributor

astariul commented Nov 19, 2019

From the README of BART for reproducing CNN/DM results:

"Follow instructions here to download and process into data-files such that test.source and test.target has one line for each non-tokenized sample."

After following the instructions, I don't have files like test.source and test.target...

Instead, I have test.bin and chunked versions of this file (chunked/test_000.bin ~ chunked/test_011.bin).

How can I process test.bin into test.source and test.target?

@ngoyal2707 @yinhanliu

@yinhanliu

Thanks for the interest. You need to remove https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L235
and comment out all the TensorFlow code in the write_to_bin function.
You need to keep the raw data (no tokenization) and feed it to the GPT-2 BPE.
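
For anyone unsure what the last step looks like, here is a minimal sketch of BPE-encoding the resulting raw files, assuming fairseq's get_encoder helper from fairseq.data.encoders.gpt2_bpe and locally downloaded encoder.json / vocab.bpe files (the repo's examples/roberta/multiprocessing_bpe_encoder.py does the same thing with multiprocessing):

# Sketch: GPT-2 BPE-encode the raw, untokenized split files.
# Assumes encoder.json and vocab.bpe are downloaded locally, and that
# test.source / test.target hold one raw sample per line.
from fairseq.data.encoders.gpt2_bpe import get_encoder

bpe = get_encoder("encoder.json", "vocab.bpe")

for lang in ("source", "target"):
    with open("test.%s" % lang) as fin, open("test.bpe.%s" % lang, "w") as fout:
        for line in fin:
            ids = bpe.encode(line.strip())  # list of GPT-2 BPE token ids
            fout.write(" ".join(map(str, ids)) + "\n")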

@astariul
Contributor Author

astariul commented Nov 20, 2019

Note
I also had to modify this line:

https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L145

in order to remove <s> and </s> from the target file.
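
For concreteness, a small sketch of what that change does (the upstream line contents are quoted from memory of the abisee script and may differ slightly):

# make_datafiles.py wraps every abstract sentence in <s> ... </s> tags
# before joining; dropping the tags gives a plain-text .target line.
SENTENCE_START = '<s>'
SENTENCE_END = '</s>'
highlights = ["first highlight sentence .", "second highlight sentence ."]

# Upstream behaviour (roughly line 145):
tagged = ' '.join(["%s %s %s" % (SENTENCE_START, s, SENTENCE_END) for s in highlights])
# -> "<s> first highlight sentence . </s> <s> second highlight sentence . </s>"

# Modified behaviour, no tags in the target file:
abstract = ' '.join(highlights)
# -> "first highlight sentence . second highlight sentence ."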

@astariul
Contributor Author

Note 2

To get better results, I also had to keep the text cased. To do this, I removed this line:

https://github.com/abisee/cnn-dailymail/blob/b15ad0a2db0d407a84b8ca9b5731e1f1c4bd24b9/make_datafiles.py#L122
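
To the best of my knowledge that line is the lowercasing step in get_art_abs; a tiny sketch of the effect of removing it (the exact upstream line is an assumption):

lines = ["LONDON, England (Reuters) -- Harry Potter star Daniel Radcliffe gains access to a reported fortune ."]

# Upstream behaviour (roughly line 122): lowercase every line.
lowered = [line.lower() for line in lines]

# Keeping the text cased simply means skipping that step, so the
# .source/.target files preserve the original capitalization.
cased = lines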

@isabelcachola

I followed these instructions but I'm getting .bin files instead of .source and .target files. Am I missing something? I'm also trying to reproduce these results.

@isabelcachola

I modified the write_to_bin function to the following. Is this the correct data format?

def write_to_bin(url_file, out_file, makevocab=False):
  """Reads the tokenized .story files corresponding to the urls listed in url_file and writes them to out_file.source and out_file.target, one sample per line."""
  print "Making .source/.target files for URLs listed in %s..." % url_file
  url_list = read_text_file(url_file)
  url_hashes = get_url_hashes(url_list)
  story_fnames = [s + ".story" for s in url_hashes]
  num_stories = len(story_fnames)

  if makevocab:
    vocab_counter = collections.Counter()  # vocab counting from the original script; unused here

  with open('%s.target' % out_file, 'wb') as target_file, open('%s.source' % out_file, 'wb') as source_file:
    for idx, s in enumerate(story_fnames):
      if idx % 1000 == 0:
        print "Writing story %i of %i; %.2f percent done" % (idx, num_stories, float(idx)*100.0/float(num_stories))

      # Look in the tokenized story dirs to find the .story file corresponding to this url
      if os.path.isfile(os.path.join(cnn_tokenized_stories_dir, s)):
        story_file = os.path.join(cnn_tokenized_stories_dir, s)
      elif os.path.isfile(os.path.join(dm_tokenized_stories_dir, s)):
        story_file = os.path.join(dm_tokenized_stories_dir, s)
      else:
        print "Error: Couldn't find tokenized story file %s in either tokenized story directory %s or %s. Was there an error during tokenization?" % (s, cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
        # Check again if tokenized stories directories contain correct number of files
        print "Checking that the tokenized stories directories %s and %s contain the correct number of files..." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir)
        check_num_stories(cnn_tokenized_stories_dir, num_expected_cnn_stories)
        check_num_stories(dm_tokenized_stories_dir, num_expected_dm_stories)
        raise Exception("Tokenized stories directories %s and %s contain the correct number of files but story file %s was found in neither." % (cnn_tokenized_stories_dir, dm_tokenized_stories_dir, s))

      # Get the strings to write to the .source/.target files
      article, abstract = get_art_abs(story_file)

      # One sample per line: article -> .source, abstract -> .target
      target_file.write(abstract + '\n')
      source_file.write(article + '\n')

@zhaoguangxiang

zhaoguangxiang commented Dec 6, 2019

There are many details; here is my code.

I fixed the over-long lines in train.bpe.source, caused by ASCII '0D' (carriage return) characters inside articles, by splitting and re-joining the text.

I summarize the main steps here (a sketch of steps 1-4 follows the link below):

  1. remove the stray " " before "."
  2. keep the text cased, i.e. remove the lowercasing line
  3. strip the "\r" characters in the original articles, which otherwise break the BPE preprocessing
  4. remove the leading "(CNN)" marker
  5. BPE encoding

code: https://gist.github.com/zhaoguangxiang/45bf39c528cf7fb7853bffba7fe57c7e
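
A minimal sketch of cleanup steps 1-4 as I understand them (my own approximation; the gist is authoritative, and the regex here is an assumption). The BPE step is the same as the encoding sketch earlier in the thread:

import re

def clean_line(line):
    """Approximate cleanup steps 1-4 from the list above."""
    # 3. Collapse all whitespace, which strips '\r' (ASCII 0x0D) characters
    #    that would otherwise break one sample across several lines.
    line = " ".join(line.split())
    # 4. Drop a leading "(CNN)" marker, with or without the " -- " suffix.
    line = re.sub(r"^\(CNN\)(\s*--)?\s*", "", line)
    # 1. Remove the stray space that PTB-style tokenization leaves before periods.
    line = line.replace(" .", ".")
    # 2. Keeping the text cased just means never calling line.lower().
    return line

print(clean_line("(CNN) -- The quick\r brown fox jumps ."))
# -> "The quick brown fox jumps."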

@isabelcachola

@zhaoguangxiang Thank you!

@artmatsak
Contributor

Here's a version for Python 3 if anyone is interested:

https://github.com/artmatsak/cnn-dailymail

facebook-github-bot pushed a commit that referenced this issue Jan 29, 2020
Summary:
The first step in the CNN/DM fine-tuning instructions for BART is misleading (see #1391). This PR fixes the README and adds links to #1391 as well as to a repository with CNN/DM processing code adjusted for BART.
Pull Request resolved: #1650

Differential Revision: D19606689

fbshipit-source-id: 4f1771f47d3650035a911ab393ab6df2193c1bf9
moussaKam pushed a commit to moussaKam/language-adaptive-pretraining that referenced this issue Sep 29, 2020

yzpang pushed a commit to yzpang/gold-off-policy-text-gen-iclr21 that referenced this issue Feb 19, 2021
@Ricardokevins

Ricardokevins commented Dec 17, 2021

@zhaoguangxiang
Hi, thank you for providing the preprocessing code.
I am looking for the right preprocessing scripts. Can this script reproduce the results reported in the paper?
I ran the code on Windows and hit many encoding problems. After fixing those, I found the dataset format was abnormal, e.g. stray whitespace at the start of every line and large gaps between sentences (for example in line 5).

[screenshot omitted]

@zhaoguangxiang

zhaoguangxiang commented Dec 17, 2021

@Ricardokevins I forgot my reproduction result. I will reply to you after trying again.

@Ricardokevins

@zhaoguangxiang Thank you very much! It will help a lot.

@BaohaoLiao

If anyone still has problems with:

  1. downloading and preprocessing CNN/DM
  2. evaluating fine-tuned BART on CNN/DM

you might want to check my reproduction repository: https://github.com/BaohaoLiao/NLP-reproduction
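
And for completeness, a minimal sketch of the evaluation step, adapted from the generation settings in the fairseq BART README (the checkpoint and data paths here are placeholders):

import torch
from fairseq.models.bart import BARTModel

# Placeholder paths: a fine-tuned checkpoint and the binarized CNN/DM data.
bart = BARTModel.from_pretrained(
    'checkpoints/',
    checkpoint_file='checkpoint_best.pt',
    data_name_or_path='cnn_dm-bin',
)
bart.cuda()
bart.eval()

batch, bsz = [], 32
with open('cnn_dm/test.source') as source, open('test.hypo', 'w') as fout:
    for line in source:
        batch.append(line.strip())
        if len(batch) == bsz:
            with torch.no_grad():
                # Beam settings from the BART CNN/DM README.
                hypos = bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                                    min_len=55, no_repeat_ngram_size=3)
            fout.write('\n'.join(hypos) + '\n')
            batch = []
    if batch:
        with torch.no_grad():
            hypos = bart.sample(batch, beam=4, lenpen=2.0, max_len_b=140,
                                min_len=55, no_repeat_ngram_size=3)
        fout.write('\n'.join(hypos) + '\n')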

@zhaoguangxiang

I'm afraid I have forgotten the details of my reproduction.
