Text Normalization and Tokenization

This tool preprocesses text using Natural Language Processing (NLP) techniques. It normalizes text, removes stopwords, and splits text into words, sentences, or paragraphs so that it is easier to analyze.

Features

  • Text Normalization: Convert text to lowercase and remove punctuation for consistency.
  • Remove Stopwords: Filter out common words (e.g., "and", "the", "is") to focus on meaningful content.
  • Tokenize into Words: Split text into individual words for detailed analysis.
  • Tokenize into Sentences: Divide text into sentences to understand its structure.
  • Tokenize into Paragraphs: Separate text into paragraphs for deeper document analysis.
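
The normalization and stopword-removal features above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's own code: the function names `normalize` and `remove_stopwords` and the tiny `STOPWORDS` set are hypothetical, chosen only for this example. The project itself relies on NLTK, whose stopword corpus is much larger.

```python
import string

# Hypothetical, deliberately tiny stopword list for illustration only.
# NLTK's English stopword corpus contains many more entries.
STOPWORDS = frozenset({"a", "an", "and", "the", "is", "it", "of", "to", "in"})


def normalize(text: str) -> str:
    """Lowercase the text and strip ASCII punctuation characters."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def remove_stopwords(text: str, stopwords: frozenset[str] = STOPWORDS) -> str:
    """Drop stopwords (compared case-insensitively) and rejoin with spaces."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


if __name__ == "__main__":
    cleaned = normalize("Tokenization is an important step.")
    print(cleaned)                    # tokenization is an important step
    print(remove_stopwords(cleaned))  # tokenization important step
```

Normalizing before removing stopwords means the stopword lookup does not have to deal with trailing punctuation such as "is," or "the.".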

Usage

  1. Text Normalization: Converts text to lowercase and removes punctuation marks.
  2. Remove Stopwords: Filters out common words to highlight significant content.
  3. Tokenize into Words
    • Input: "Tokenization is an important step."
    • Output: ["Tokenization", "is", "an", "important", "step", "."]
  4. Tokenize into Sentences
    • Input: "Tokenization is important. It breaks down text."
    • Output: ["Tokenization is important.", "It breaks down text."]
  5. Tokenize into Paragraphs
    • Input: "Tokenization is important. It involves breaking down text into units.\n\nAfter tokenization, further analysis is possible."
    • Output: ["Tokenization is important. It involves breaking down text into units.", "After tokenization, further analysis is possible."]
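
The three tokenization examples above can be approximated with regular expressions from the standard library. This is a dependency-free sketch, not the project's implementation: the function names are hypothetical, and the patterns are simplified stand-ins for NLTK's `word_tokenize` and `sent_tokenize`, which handle abbreviations, contractions, and other edge cases far better.

```python
import re


def tokenize_words(text: str) -> list[str]:
    """Return runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_sentences(text: str) -> list[str]:
    """Split after '.', '!' or '?' when whitespace follows (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize_paragraphs(text: str) -> list[str]:
    """Split on blank lines and trim surrounding whitespace."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


if __name__ == "__main__":
    print(tokenize_words("Tokenization is an important step."))
    # ['Tokenization', 'is', 'an', 'important', 'step', '.']
    print(tokenize_sentences("Tokenization is important. It breaks down text."))
    # ['Tokenization is important.', 'It breaks down text.']
    print(tokenize_paragraphs("First paragraph.\n\nSecond paragraph."))
    # ['First paragraph.', 'Second paragraph.']
```

The naive sentence rule would wrongly split after abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer like the one NLTK provides.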

Required Modules

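  • NLTK: A toolkit for NLP tasks such as tokenization and stopword removal.
    pip install nltk
  • Download the NLTK resources (run once in Python; newer NLTK releases may additionally need nltk.download('punkt_tab')):
    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')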
  • Flask: A web framework for creating Python web applications.
    pip install Flask

Python version: 3.10 to 3.11

Install the modules:
pip install -r requirements.txt

Run the application:
python app.py

Web Page

[Screenshots: main page, Text Normalization, Remove Stopwords, Tokenize into Words, Tokenize into Sentences, Tokenize into Paragraphs]