
Text Normalization prepares text for analysis using NLP techniques. It ensures consistency by converting to lowercase, removing punctuation, and filtering common words. Text is parsed into words, sentences, and paragraphs for better understanding.


hariharasudan3/Text-Normalization-NLP


Text Normalization and Tokenization

This tool simplifies text preprocessing using Natural Language Processing (NLP) techniques. It helps standardize and break down text for easier analysis.

Features

  • Text Normalization: Convert text to lowercase and remove punctuation for consistency.
  • Remove Stopwords: Filter out common words (e.g., "and", "the", "is") to focus on meaningful content.
  • Tokenize into Words: Split text into individual words for detailed analysis.
  • Tokenize into Sentences: Divide text into sentences to understand its structure.
  • Tokenize into Paragraphs: Separate text into paragraphs for deeper document analysis.
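The first two features can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the function names `normalize` and `remove_stopwords` and the tiny `SAMPLE_STOPWORDS` set are assumptions made for the example. With NLTK installed, the set would more likely come from `nltk.corpus.stopwords.words('english')`.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list, which is much longer.
SAMPLE_STOPWORDS = {"a", "an", "and", "the", "is", "it", "of", "to", "in"}


def normalize(text: str) -> str:
    """Lowercase the text and delete ASCII punctuation characters."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text: str, stopwords=SAMPLE_STOPWORDS) -> list[str]:
    """Split on whitespace and drop words found in the stopword set."""
    return [word for word in text.split() if word.lower() not in stopwords]


cleaned = normalize("Tokenization is an important step.")
print(cleaned)                    # tokenization is an important step
print(remove_stopwords(cleaned))  # ['tokenization', 'important', 'step']
```

Note that `string.punctuation` covers ASCII punctuation only, so curly quotes and similar characters would pass through unchanged.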

Usage

  1. Text Normalization: Converts text to lowercase and removes punctuation marks.
  2. Remove Stopwords: Filters out common words to highlight significant content.
  3. Tokenize into Words
    • Input: "Tokenization is an important step."
    • Output: ["Tokenization", "is", "an", "important", "step", "."]
  4. Tokenize into Sentences
    • Input: "Tokenization is important. It breaks down text."
    • Output: ["Tokenization is important.", "It breaks down text."]
  5. Tokenize into Paragraphs
    • Input: "Tokenization is important. It involves breaking down text into units.\n\nAfter tokenization, further analysis is possible."
    • Output: ["Tokenization is important. It involves breaking down text into units.", "After tokenization, further analysis is possible."]
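The outputs in steps 3 to 5 can be approximated with regular expressions, as sketched below. This is an assumption-laden stand-in rather than the repository's implementation, which likely relies on NLTK's `word_tokenize` and `sent_tokenize`. The function names are hypothetical, and the regex rules are simpler than NLTK's Punkt model: abbreviations such as "Dr." and contractions such as "don't" are not handled the way NLTK handles them.

```python
import re


def tokenize_words(text: str) -> list[str]:
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty paragraphs."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


print(tokenize_words("Tokenization is an important step."))
# ['Tokenization', 'is', 'an', 'important', 'step', '.']
print(tokenize_sentences("Tokenization is important. It breaks down text."))
# ['Tokenization is important.', 'It breaks down text.']
```

Paragraphs are detected by blank lines, which matches the `\n\n` separator used in the step 5 input.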

Required Modules

  • NLTK: A toolkit for NLP tasks such as tokenization and stopword removal.
    pip install nltk
  • Download the NLTK resources (run in Python):
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')
  • Flask: A web framework for creating Python web applications.
    pip install Flask

Requires Python 3.10 or 3.11.

Install the modules with:
pip install -r requirements.txt

Run the application with:
python app.py

Web Page

[Screenshots: web page, Text Normalization, Remove Stopwords, Tokenize into Words, Tokenize into Sentences, Tokenize into Paragraphs]
