This project evaluates the summarization capabilities of three Large Language Models (T5-Small, T5-Large, and GPT2).
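As a rough illustration (not the exact notebook code), a T5 checkpoint can be loaded and asked for a summary as sketched below. The model identifiers are the standard Hugging Face Hub names; the input text and generation parameters are placeholder assumptions. GPT2, being a decoder-only model, would be loaded via AutoModelForCausalLM instead.

```python
# Minimal sketch, assuming the standard Hugging Face Hub checkpoints
# ("t5-small", "t5-large"); generation settings here are illustrative only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # or "t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "The quick brown fox jumps over the lazy dog. " * 20  # placeholder article
inputs = tokenizer("summarize: " + article, return_tensors="pt",
                   truncation=True, max_length=512)
summary_ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```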
Dataset: CNN/DailyMail [https://huggingface.co/datasets/cnn_dailymail]
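A minimal sketch of loading the dataset with the Hugging Face datasets library; the "3.0.0" configuration and the test split are assumptions for illustration:

```python
# Minimal sketch: loading CNN/DailyMail from the Hugging Face Hub.
# The "3.0.0" config and the test split are illustrative assumptions.
from datasets import load_dataset

dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
print(dataset[0]["article"][:300])   # source news article
print(dataset[0]["highlights"])      # human-written reference summary
```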
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of evaluation metrics designed for comparing summaries (Lin, 2004). See Wikipedia for more info. Here, we use the Hugging Face Evaluator wrapper to call into the rouge_score package. This package provides 4 scores (see the usage sketch after the list):
- rouge1: ROUGE computed over unigrams (single words or tokens)
- rouge2: ROUGE computed over bigrams (pairs of consecutive words or tokens)
- rougeL: ROUGE based on the longest common subsequence shared by the summaries being compared
- rougeLsum: like rougeL, but at "summary level," i.e., ignoring sentence breaks (newlines)
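A minimal sketch of how these scores can be computed with the Evaluator wrapper; the prediction/reference strings below are placeholders, not project data:

```python
# Minimal sketch: computing ROUGE via the Hugging Face Evaluator wrapper,
# which calls into the rouge_score package under the hood.
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["the fox jumped over the dog"],                   # generated summary (placeholder)
    references=["the quick brown fox jumped over the lazy dog"],   # reference summary (placeholder)
)
print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum
```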
- Hardware: a single GPU from the free tier of Google Colab.
After each model's evaluation, the model is deleted and GPU memory is cleared so that the GPU does not run out of memory:
import gc
import torch
del model                 # drop the reference to the evaluated model
torch.cuda.empty_cache()  # release cached GPU memory
gc.collect()              # run the Python garbage collector
Model-specific results and the generated summaries are available in the attached notebook.