An AI-based web application that provides concise summaries of articles using advanced natural language processing (NLP) techniques.
- Introduction
- Data Exploration
- Model Selection
- Model Fine-Tuning
- Extractive Summarization
- Web Application Development
- Installation
- Usage
Article-Summarizer-Using-AI is a web application designed to summarize lengthy articles using NLP. The application allows users to upload their own articles or use sample data to generate summaries in various styles, utilizing a generative AI model.
The dataset used for training and evaluation is the PubMed Summarization dataset. It includes articles from PubMed with corresponding abstracts used as summaries.
- Loading the Dataset:

    from datasets import load_dataset

    # Load the first 1,000 training examples
    pubmed_data = load_dataset("ccdv/pubmed-summarization", split='train[:1000]')
- Initial Data Cleaning:
  - Remove rows with missing values to ensure data quality.

    pubmed_data = pubmed_data.filter(lambda x: x['article'] is not None and x['abstract'] is not None)
- Exploratory Data Analysis:
  - Examine the distribution of article lengths and summary lengths.
  - Identify common topics and terminology within the dataset.

    print(pubmed_data[0])  # View the first data entry
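
The length distributions mentioned above can be summarized with a few descriptive statistics. The following is a minimal sketch, assuming each record is a dict-like object with `article` and `abstract` string fields (as in the dataset above) and that whitespace-separated word counts are an acceptable proxy for length. The `length_stats` helper and the sample records are hypothetical names introduced only for illustration.

```python
import statistics


def length_stats(records, field):
    """Return basic word-count statistics for one text field."""
    # Whitespace splitting is a rough proxy for token counts
    lengths = [len(record[field].split()) for record in records]
    return {
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }


# Hypothetical stand-in records; in the application these would come from pubmed_data
sample_records = [
    {"article": "one two three four five six", "abstract": "one two"},
    {"article": "one two three four", "abstract": "one two three"},
    {"article": "one two", "abstract": "one"},
]

article_stats = length_stats(sample_records, "article")
abstract_stats = length_stats(sample_records, "abstract")
print(article_stats)
print(abstract_stats)
```

Comparing the two sets of statistics gives a quick sense of how much compression the abstracts represent relative to the articles.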
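- Text Tokenization:
  - Split text into sentences and words for detailed analysis.

    from nltk.tokenize import sent_tokenize, word_tokenize

    article_text = pubmed_data[0]['article']
    sentences = sent_tokenize(article_text)
    # Tokenize every sentence into words and flatten the result into one list
    words = [word for sentence in sentences for word in word_tokenize(sentence)]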
- Stop Words Removal:
  - Remove common English words that do not contribute to the summary.

    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]
- Lemmatization:
  - Convert words to their base forms.

    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    words = [lemmatizer.lemmatize(word.lower()) for word in words]
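
The three preprocessing steps above (tokenization, stop-word removal, lemmatization) can be chained into a single helper. The sketch below is one possible arrangement, not the project's actual code: `preprocess`, `simple_tokenize`, and `SMALL_STOP_WORDS` are hypothetical names, and the stand-ins at the bottom replace the NLTK components only so the example runs without downloading NLTK data. It also drops punctuation-only tokens, which the steps above do not mention.

```python
import re


def preprocess(text, tokenize, stop_words, lemmatize):
    """Tokenize, drop stop words and punctuation, then lemmatize each remaining word."""
    cleaned = []
    for token in tokenize(text):
        lowered = token.lower()
        # Skip stop words and tokens without any letters or digits
        if lowered in stop_words or not any(ch.isalnum() for ch in lowered):
            continue
        cleaned.append(lemmatize(lowered))
    return cleaned


# Simplified stand-ins so the sketch runs without NLTK data.
# With NLTK, pass word_tokenize, set(stopwords.words('english')),
# and WordNetLemmatizer().lemmatize instead.
def simple_tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)


SMALL_STOP_WORDS = {"the", "a", "of", "in", "is", "and"}

tokens = preprocess(
    "The role of insulin in the cell is complex.",
    tokenize=simple_tokenize,
    stop_words=SMALL_STOP_WORDS,
    lemmatize=lambda word: word,  # identity: no lemmatization in this stand-in
)
print(tokens)
```

Passing the tokenizer, stop-word set, and lemmatizer as arguments keeps the pipeline easy to test and lets the NLTK components be swapped in without changing the function.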
- API Configuration:
  - Use the google.generativeai library for model generation.

    import os

    import google.generativeai as genai

    # Read the API key from the environment variable named 'your_api_key'
    api_key = os.environ.get('your_api_key')
    genai.configure(api_key=api_key)
- Model Initialization:
  - Set up the generative AI model.

    model = genai.GenerativeModel()
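
The introduction mentions summaries in various styles, but the source does not show how a style reaches the model. One possible approach is to map each style to an instruction and prepend it to the article text. The sketch below is an assumption throughout: `build_prompt`, `STYLE_INSTRUCTIONS`, the style names other than `brief`, the instruction wording, and the `max_chars` limit are all hypothetical.

```python
# Hypothetical style-to-instruction mapping; only 'brief' appears in the source
STYLE_INSTRUCTIONS = {
    "brief": "Summarize the article in two or three sentences.",
    "detailed": "Summarize the article in one detailed paragraph.",
    "bullet": "Summarize the article as a short bulleted list of key points.",
}


def build_prompt(article_text, style="brief", max_chars=20000):
    """Build a summarization prompt for the requested style."""
    # Fall back to the brief style when the requested one is unknown
    instruction = STYLE_INSTRUCTIONS.get(style, STYLE_INSTRUCTIONS["brief"])
    # Truncate very long articles; max_chars is an arbitrary illustrative limit
    body = article_text.strip()[:max_chars]
    return f"{instruction}\n\nArticle:\n{body}"


prompt = build_prompt("Insulin regulates glucose uptake in cells.", style="bullet")
print(prompt)
# With a configured model: summary = model.generate_content(prompt).text
```

Keeping the prompt construction in a plain function makes it testable without an API key or network access.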
- Fine-tune the model with the PubMed dataset to improve summary quality.

    # Example pseudo-code for fine-tuning (illustrative only)
    model.train(dataset=pubmed_data, epochs=10, learning_rate=0.001)
For extractive summarization, the application uses traditional NLP techniques to identify key sentences from the article without relying on a generative model.
- Extractive Summary Script:
  Rename the provided extractive_summary.py to app.py and move it to the project root:

    mv /mnt/data/extractive_summary.py app.py
- Core Logic:
  - The extractive summarization script uses statistical and heuristic methods to identify the most important sentences in the text.

    from nltk.tokenize import sent_tokenize

    # Example of extractive summarization
    def extractive_summary(text):
        # Split the text into sentences
        sentences = sent_tokenize(text)
        # Rank and select key sentences (pseudo-code)
        summary = ' '.join(sentences[:3])  # Example: select the first 3 sentences
        return summary
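
The example above simply keeps the first three sentences. One common statistical heuristic is to score each sentence by how frequent its words are across the whole document and keep the top-scoring sentences in their original order. The sketch below illustrates that idea and is not necessarily what `extractive_summary.py` does. `frequency_summary` and `STOP_WORDS` are hypothetical names, and a regular expression replaces `sent_tokenize` so the example runs without NLTK data.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; a fuller list would be used in practice
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "it"}


def content_words(text):
    """Lowercase alphanumeric tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def frequency_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average word frequency, so long sentences are not favored automatically
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, then restore document order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return " ".join(sentences[i] for i in chosen)


text = (
    "Insulin controls glucose uptake. "
    "The weather was pleasant. "
    "Glucose uptake depends on insulin signaling."
)
summary = frequency_summary(text, num_sentences=2)
print(summary)
```

In this small example the off-topic middle sentence scores lowest, so the summary keeps the first and third sentences.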
- Integration:
  - Integrate the extractive summarization logic with the Flask application to allow users to choose between generative and extractive summaries.

    from flask import render_template, request

    @app.route('/summarize', methods=['POST'])
    def summarize():
        # Use the uploaded file if one was provided; otherwise fall back to a sample
        if 'file' in request.files and request.files['file'].filename != '':
            file = request.files['file']
            article_text = file.read().decode("utf-8")
        else:
            sample_index = int(request.form['sample'])
            article_text = pubmed_data[sample_index]['article']

        style = request.form.get('style', 'brief')
        summary_method = request.form.get('method', 'generative')

        # Dispatch to the generative or extractive summarizer
        if summary_method == 'generative':
            summary_text = preprocess_and_summarize(article_text, style)
        else:
            summary_text = extractive_summary(article_text)

        return render_template('result.html', original=article_text, summary=summary_text)
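
The route above converts the `sample` form value with `int()` directly, so a non-numeric value raises `ValueError` and an out-of-range index fails at lookup time. A small validation helper can turn both cases into one predictable error. This is a minimal sketch; `parse_sample_index` is a hypothetical helper that is not part of the source, and how the error is reported to the user (for example as a 400 response) is left open.

```python
def parse_sample_index(raw_value, dataset_size):
    """Convert a form value to a valid dataset index or raise ValueError."""
    if raw_value is None:
        raise ValueError("No sample was selected.")
    try:
        index = int(raw_value)
    except (TypeError, ValueError):
        raise ValueError(f"Sample index must be an integer, got {raw_value!r}.") from None
    if not 0 <= index < dataset_size:
        raise ValueError(f"Sample index must be between 0 and {dataset_size - 1}.")
    return index


print(parse_sample_index("3", dataset_size=1000))
```

In the route, the call could look like `parse_sample_index(request.form.get('sample'), len(pubmed_data))`, wrapped in a `try`/`except ValueError` block.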
- Evaluate the model's performance using metrics such as ROUGE or BLEU.

    from nltk.translate.bleu_score import sentence_bleu

    # sentence_bleu expects a list of tokenized references and one tokenized candidate
    reference = [reference_summary.split()]
    candidate = generated_summary.split()
    score = sentence_bleu(reference, candidate)
    print(f'BLEU Score: {score}')
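
The text mentions ROUGE alongside BLEU, but only BLEU is shown. As a rough illustration, ROUGE-1 measures unigram overlap between a reference and a candidate summary. The sketch below is a simplified version that assumes lowercase whitespace tokenization and no stemming; `rouge_1` is a hypothetical helper, and a dedicated ROUGE package would be the better choice for reported results.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Compute ROUGE-1 precision, recall, and F1 from unigram overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Here five of the six unigrams match, so precision, recall, and F1 all equal 5/6.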
- Flask Setup:
  - Initialize the Flask app and configure the login manager.

    from flask import Flask
    from flask_login import LoginManager

    app = Flask(__name__)
    app.secret_key = 'your_secret_key'  # Placeholder; replace with your own secret
    login_manager = LoginManager(app)
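
The setup above hard-codes `app.secret_key`. A common alternative is to read the key from the environment so it stays out of source control. The sketch below assumes an environment variable named `SECRET_KEY`; that name and the `load_secret_key` helper are hypothetical. The random fallback is suitable for local development only, since it invalidates existing sessions on every restart.

```python
import os
import secrets


def load_secret_key(env_var="SECRET_KEY"):
    """Return the secret key from the environment, or a random development key."""
    key = os.environ.get(env_var)
    if key:
        return key
    # Development fallback: a fresh random key on every start
    return secrets.token_hex(32)


secret_key = load_secret_key()
# In the application: app.secret_key = secret_key
```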
- Routes and Authentication:
  - Implement routes for login, registration, summarization, and logout.

    @app.route('/login', methods=['GET', 'POST'])
    def login():
        # login logic here
        return render_template('login.html')
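
The login logic is left as a placeholder above, and the source does not say how credentials are stored. One piece that registration and login typically need is a way to store and check passwords without keeping them in plain text. The sketch below uses the standard library's PBKDF2 implementation with a per-user salt. `hash_password`, `verify_password`, and the iteration count are hypothetical choices made for illustration.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # Illustrative value; tune for your hardware


def hash_password(password):
    """Return a random salt and the PBKDF2 digest of the password."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare it in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return hmac.compare_digest(digest, expected_digest)


salt, digest = hash_password("correct horse")
print(verify_password("correct horse", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))    # False
```

At registration the salt and digest would be stored with the user record; at login the submitted password is checked with `verify_password` before the user is logged in.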
- Templates:
  - Create HTML templates for the user interface.

    <!-- templates/index.html -->
    <form action="{{ url_for('summarize') }}" method="post" enctype="multipart/form-data">
      <input type="file" name="file">
      <button type="submit">Summarize</button>
    </form>
- User Experience:
  - Ensure a user-friendly interface with clear instructions and feedback.
- Python 3.7+
- Flask
- NLTK
- Generative AI Library (e.g., google.generativeai)
- An API key for generative AI
- Clone the Repository:

    git clone https://github.com/yourusername/Article-Summarizer-Using-AI.git

- Navigate to the Project Directory:

    cd Article-Summarizer-Using-AI

- Create a Virtual Environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`

- Install Dependencies:

    pip install -r requirements.txt
- Set Environment Variables:
  - Create a .env file with your API key.

    your_api_key=<YOUR_GENERATIVE_AI_API_KEY>
- Download NLTK Data:
  - The script handles downloading the necessary NLTK data.
- Run the Application:

    flask run --port=5001
- Access the App:
  - Visit http://127.0.0.1:5001 in your browser.
- Login/Register:
  - Register a new account or log in with existing credentials.
- Summarize Articles:
  - Upload a text file or choose a sample to summarize.
- View Summary:
  - The summarized text is displayed on the results page.
Thank you for using Article-Summarizer-Using-AI! We hope you find it useful for your summarization needs.