# Data processing for NeuSum
This repo contains the code to generate the training data (CNN / Daily Mail) needed by NeuSum.
- Preprocess the CNN/DM dataset using abisee's scripts: https://github.com/abisee/cnn-dailymail
- Convert its output to the format shown in the `sample_data` folder. The file format:
  - `train.txt.src` contains the input documents. Each line is one document: several tokenized sentences delimited by `##SENT##`.
  - `train.txt.tgt` contains the summaries. Each line holds the tokenized summary sentences of the corresponding document, also delimited by `##SENT##`.
- Use `find_oracle.py` to search for the best sentences to extract. The arguments of its `main` function are `document_file`, `summary_file`, and `output_path`.
- Next, build the ROUGE score gain file using `get_mmr_regression_gain.py`. The usage is shown in the code entry.
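The oracle search described above can be sketched as follows. This is a hypothetical illustration, not the repo's actual code: it splits a document line on the `##SENT##` delimiter and brute-forces the best sentence subset, using a simple unigram-F1 scorer as a stand-in for the ROUGE metric that `find_oracle.py` uses.

```python
from collections import Counter
from itertools import combinations

def split_sentences(line):
    # One line of train.txt.src / train.txt.tgt holds a whole document
    # (or summary), with sentences joined by the ##SENT## delimiter.
    return [s.strip() for s in line.strip().split("##SENT##") if s.strip()]

def unigram_f1(cand_tokens, ref_tokens):
    # Toy stand-in for ROUGE: F1 over unigram overlap.
    cand, ref = Counter(cand_tokens), Counter(ref_tokens)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return 2 * p * r / (p + r)

def find_oracle(doc_line, summary_line, max_sents=3):
    # Brute-force search over all subsets of up to max_sents sentences,
    # keeping the subset whose concatenation scores highest vs. the summary.
    doc = split_sentences(doc_line)
    ref = " ".join(split_sentences(summary_line)).split()
    best, best_score = (), 0.0
    for k in range(1, max_sents + 1):
        for combo in combinations(range(len(doc)), k):
            cand = " ".join(doc[i] for i in combo).split()
            score = unigram_f1(cand, ref)
            if score > best_score:
                best, best_score = combo, score
    return best, best_score

doc = "the cat sat on the mat . ##SENT## stocks fell sharply . ##SENT## the cat was happy ."
summary = "the cat sat on the mat . ##SENT## the cat was happy ."
print(find_oracle(doc, summary))  # → ((0, 2), 1.0)
```

The subset enumeration is what makes the real search expensive: the number of candidate subsets grows combinatorially with document length.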
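The gain file records, for each document, how much each remaining sentence would improve the current extract. A hypothetical sketch of that marginal-gain computation (the function names and the unigram-recall scorer are illustrative stand-ins for the real ROUGE-based code in `get_mmr_regression_gain.py`):

```python
def unigram_recall(cand_tokens, ref_tokens):
    # Toy stand-in for ROUGE recall: fraction of reference token types covered.
    ref = set(ref_tokens)
    return len(ref & set(cand_tokens)) / max(len(ref), 1)

def sentence_gains(doc_sents, ref_tokens, selected):
    # For every not-yet-selected sentence, the gain is the score of the
    # current extract plus that sentence, minus the current extract's score.
    base_tokens = " ".join(doc_sents[i] for i in selected).split()
    base = unigram_recall(base_tokens, ref_tokens) if selected else 0.0
    gains = {}
    for i, sent in enumerate(doc_sents):
        if i in selected:
            continue
        gains[i] = unigram_recall(base_tokens + sent.split(), ref_tokens) - base
    return gains

doc_sents = ["the cat sat .", "stocks fell .", "the cat was happy ."]
ref = "the cat sat . the cat was happy .".split()
print(sentence_gains(doc_sents, ref, selected=[0]))
```

With sentence 0 already selected, sentence 1 adds no reference coverage (gain 0), while sentence 2 covers the rest of the summary and gets a positive gain.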
The algorithm is a brute-force search, which can be slow on some documents, so running it in parallel is recommended (this is what I did in my experiments). `find_oracle.py` has recently been modified to use multiprocessing, which makes parallel runs easier; see `find_oracle_para.py`.
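One common way to parallelize this kind of per-document job (illustrative only; the actual interface of `find_oracle_para.py` may differ) is to shard the input lines and map a worker over the shards with `multiprocessing.Pool`:

```python
from multiprocessing import Pool

def shard(items, n):
    # Split items into n roughly equal contiguous chunks.
    k, m = divmod(len(items), n)
    return [items[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]

def process_shard(pairs):
    # Placeholder worker: the real script would run the oracle search for
    # each (document, summary) pair; here we just count sentences per doc.
    return [len(doc.split("##SENT##")) for doc, _ in pairs]

if __name__ == "__main__":
    pairs = [
        ("a . ##SENT## b .", "a ."),
        ("c .", "c ."),
        ("d . ##SENT## e . ##SENT## f .", "f ."),
    ]
    with Pool(2) as pool:
        chunks = pool.map(process_shard, shard(pairs, 2))
    results = [r for chunk in chunks for r in chunk]
    print(results)  # per-document results, back in input order
```

Because each document is independent, sharding preserves the output order and the speedup is close to linear in the number of worker processes.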