This project implements the chain-of-density text summarization approach from the paper "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting" by researchers at Salesforce, MIT, Columbia, and others.
Chain-of-density summarization is a new technique that creates highly condensed yet information-rich summaries from long-form text. It works by iteratively extracting essential entities from the source text and rewriting the summary to incorporate more entities each time (without losing previous entities), resulting in a "chain" of increasingly dense summaries.
This implementation takes a long text input (e.g. articles, blogs, whitepapers, documents) and runs it through multiple cycles of entity extraction and summary rewriting to produce a final dense summary containing only the critical information from the source.
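The cycle of entity extraction and rewriting can be sketched in a few lines. This is a minimal illustration, not the repository's code: `chain_of_density` and `make_toy_densify` are hypothetical names, and the toy `densify` step simply appends the next unused entity from a fixed list, whereas the real rewriting is done by the language model and keeps the summary length roughly constant.

```python
from typing import Callable, List


def chain_of_density(text: str,
                     densify: Callable[[str, str], str],
                     steps: int = 5) -> List[str]:
    """Build a chain of summaries, each rewritten from the previous one."""
    chain: List[str] = []
    previous = ""
    for _ in range(steps):
        # Each step sees the source text and the previous summary.
        previous = densify(text, previous)
        chain.append(previous)
    return chain


def make_toy_densify(entities: List[str]) -> Callable[[str, str], str]:
    """Hypothetical stand-in for the model: adds one missing entity per step."""
    def densify(text: str, previous: str) -> str:
        for entity in entities:
            if entity in text and entity not in previous:
                return (previous + " " + entity).strip()
        return previous  # nothing left to add
    return densify


source = "Salesforce, MIT and Columbia researchers studied GPT-4 summaries."
toy = make_toy_densify(["Salesforce", "MIT", "Columbia"])
chain = chain_of_density(source, toy, steps=3)
```

Each element of `chain` contains every entity from the element before it plus one new one, which is the property the technique relies on.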
Key benefits of the chain-of-density approach include:
- Producing highly compressed yet faithful summaries
- Capturing key details and concepts from complex, long-form text
- Iteratively increasing information density
- Leveraging large language model capabilities for summarization
This repository provides code to apply chain-of-density summarization to arbitrary text inputs using the OpenAI API. It builds a chain-of-density prompt that asks the model to extract entities and rewrite the summary, queries the API, and outputs the condensed summaries.
To run the summarizer:
- Install dependencies:
poetry install
- Create a .env file and set your OpenAI API key:
OPENAI_API_KEY=<your-key>
- Update config.ini with the input text file path and output location.
- Run the summarizer:
poetry run cod
This will load the input text, run the chain-of-density summarization, and save the output to the configured file.
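For readers curious how the key in the .env file reaches the program, the sketch below parses `KEY=VALUE` lines using only the standard library. It is an assumption-laden illustration: `parse_env` is a hypothetical helper, and the project may well rely on a package such as python-dotenv instead.

```python
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Example content of a .env file (the key shown is a placeholder).
env = parse_env("# local secrets\nOPENAI_API_KEY=sk-example\n")
api_key = env.get("OPENAI_API_KEY")
```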
The main logic is in main.py. It:
- Loads the input text
- Gets the OpenAI API key from the .env file
- Sends a prompt to the OpenAI API with the text
- Gets back a chain of 5 increasingly dense summaries
- Exports the result to the configured .txt output file
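The steps above can be sketched as a single function. This is an illustration rather than the contents of main.py: `run_pipeline` and the `complete` callable (a stand-in for the OpenAI API call) are hypothetical names, and the prompt line is only a placeholder.

```python
import tempfile
from pathlib import Path
from typing import Callable


def run_pipeline(input_path: Path, output_path: Path,
                 complete: Callable[[str], str]) -> str:
    """Load the text, ask the model for summaries, and write the result."""
    article = input_path.read_text(encoding="utf-8")
    prompt = f"Article: {article}\n\nWrite 5 increasingly dense summaries."
    result = complete(prompt)  # hypothetical: would wrap the OpenAI API call
    output_path.write_text(result, encoding="utf-8")
    return result


# Demonstration with a fake model so the sketch runs without network access.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "input.txt"
    dst = Path(tmp) / "summary.txt"
    src.write_text("Some long article.", encoding="utf-8")
    run_pipeline(src, dst, complete=lambda prompt: "summary 1\nsummary 2")
    saved = dst.read_text(encoding="utf-8")
```

Passing the model call in as a parameter keeps file handling testable without an API key.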
The prompt largely follows the methodology outlined in the paper aside from minor adjustments.
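To give a feel for the shape of such a prompt, here is a paraphrased template. It is not the repository's prompt and not the paper's verbatim wording: `build_prompt`, its parameters, and the exact phrasing are assumptions, and the JSON key names are the ones the paper's prompt requests.

```python
def build_prompt(article: str, steps: int = 5, max_entities: int = 3) -> str:
    """Assemble a chain-of-density style prompt (paraphrased, not verbatim)."""
    instructions = [
        f"Write {steps} increasingly dense summaries of the article above.",
        "Repeat these two steps for each summary:",
        f"1. Identify 1-{max_entities} informative entities from the article "
        "that are missing from the previous summary.",
        "2. Rewrite the summary at the same length so that it keeps every "
        "earlier entity and adds the missing ones.",
        f"Answer in JSON: a list of {steps} objects with the keys "
        '"Missing_Entities" and "Denser_Summary".',
    ]
    return f"Article: {article}\n\n" + "\n".join(instructions)


prompt = build_prompt("Some long article text.")
```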
Config options like input/output paths are stored in config.ini.
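Reading such a file with the standard library's `configparser` might look like the sketch below. The section name `paths` and the keys `input_file` and `output_file` are hypothetical; check config.ini in the repository for the actual names.

```python
import configparser

# Hypothetical config.ini content; the real section and key names may differ.
SAMPLE = """
[paths]
input_file = data/article.txt
output_file = output/summary.txt
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE)  # use config.read("config.ini") for a real file

input_file = config.get("paths", "input_file")
output_file = config["paths"]["output_file"]
```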
- Parse output as JSON
- Collate the list of entities and additional missing entities
- Allow for the sequential merging and summarisation of multiple inputs
- Add a critique of the Chain-of-Density approach to summarisation (pros and cons)
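As a possible starting point for the JSON parsing and entity collation items, the sketch below assumes the model answers with a JSON list of objects keyed `Missing_Entities` (a `;`-delimited string) and `Denser_Summary`, as the paper's prompt requests. `collate_entities` is a hypothetical helper and the sample response is made up for illustration.

```python
import json
from typing import Dict, List


def collate_entities(raw: str) -> List[str]:
    """Collect the entities added across all steps, in order, without repeats."""
    steps: List[Dict[str, str]] = json.loads(raw)
    seen: List[str] = []
    for step in steps:
        for entity in step.get("Missing_Entities", "").split(";"):
            entity = entity.strip()
            if entity and entity not in seen:
                seen.append(entity)
    return seen


# Made-up response in the assumed format.
raw = json.dumps([
    {"Missing_Entities": "Salesforce; MIT", "Denser_Summary": "First pass."},
    {"Missing_Entities": "MIT; Columbia", "Denser_Summary": "Second pass."},
])
entities = collate_entities(raw)
final_summary = json.loads(raw)[-1]["Denser_Summary"]
```

The last element of the list holds the densest summary, so `final_summary` is the one to export.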
- "From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting"
- "Annotated + Unannotated CoD Summaries on Hugging Face"
- Generating requirements.txt
poetry export --without-hashes -f requirements.txt --output requirements.txt