LLM Summarization Capability Evaluation

This project evaluates the summarization capabilities of three Large Language Models: T5-Small, T5-Large, and GPT-2.

Dataset

CNN/DailyMail dataset [https://huggingface.co/datasets/cnn_dailymail]
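
A minimal sketch of loading the dataset with the Hugging Face datasets library (the "3.0.0" config and the test split shown here are assumptions for illustration, not taken from the repository):

from datasets import load_dataset

# "3.0.0" is the standard config of the CNN/DailyMail dataset on the Hugging Face Hub
dataset = load_dataset("cnn_dailymail", "3.0.0", split="test")
print(dataset[0]["article"][:200])   # source news article
print(dataset[0]["highlights"])      # reference summary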

Evaluation Metric

ROUGE: Recall-Oriented Understudy for Gisting Evaluation (ROUGE) is a set of metrics for comparing generated summaries against reference summaries, introduced by Lin (2004); see Wikipedia for more background. Here, we use the Hugging Face evaluate wrapper to call into the rouge_score package. This package provides four scores (a minimal usage sketch follows the list):

  • rouge1: ROUGE computed over unigrams (single words or tokens)
  • rouge2: ROUGE computed over bigrams (pairs of consecutive words or tokens)
  • rougeL: ROUGE based on the longest common subsequence shared by the summaries being compared
  • rougeLsum: like rougeL, but at "summary level," i.e., ignoring sentence breaks (newlines)
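
A minimal usage sketch, assuming the standard evaluate API; the example strings are illustrative only:

import evaluate

rouge = evaluate.load("rouge")  # wraps the rouge_score package
scores = rouge.compute(
    predictions=["the cat sat on the mat"],         # model-generated summary
    references=["the cat was sitting on the mat"],  # reference summary
)
print(scores)  # dict with rouge1, rouge2, rougeL, rougeLsum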

Compute Resources

  • A single GPU from the free tier of Google Colab

How did I fit three models into the Google Colab free tier?

I deleted the model and freed its GPU memory after every evaluation, so GPU memory never filled up:

import gc
import torch

del model                   # drop the reference to the finished model
gc.collect()                # run Python's garbage collector
torch.cuda.empty_cache()    # release cached GPU memory back to the driver
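
As a sketch of how this cleanup might fit into the overall run (the model identifiers are the standard Hugging Face Hub names; the summary-generation and scoring step is elided, so treat this as illustrative rather than the repository's exact code):

import gc
import torch
from transformers import AutoModelForCausalLM, AutoModelForSeq2SeqLM, AutoTokenizer

for name in ["t5-small", "t5-large", "gpt2"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model_cls = AutoModelForCausalLM if name == "gpt2" else AutoModelForSeq2SeqLM
    model = model_cls.from_pretrained(name).to("cuda")

    # ... generate summaries on CNN/DailyMail and compute ROUGE here ...

    del model, tokenizer        # free the finished model before loading the next one
    gc.collect()
    torch.cuda.empty_cache()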

Results

[Results figure: ROUGE scores for the three models]

Model-specific results and the generated summaries are available in the attached notebook.
