Welcome to Day 27 of the 30 Days of Data Science series! Today, we dive into the fascinating world of Natural Language Processing (NLP). NLP bridges the gap between human language and computers, allowing machines to understand, process, and generate text. By the end of this lesson, you will have a solid understanding of the following key topics:
- NLTK
- spaCy
- Hugging Face
- Topic Modeling with Gensim
- Text Summarization
- Word Embeddings with Word2Vec and GloVe
Natural Language Processing (NLP) is a field within Artificial Intelligence that focuses on enabling machines to understand and interact with human language. It has wide applications, including:
- Text Classification: Spam detection, sentiment analysis.
- Machine Translation: Translating text between languages (e.g., Google Translate).
- Named Entity Recognition (NER): Identifying entities like names, dates, and locations in text.
- Question Answering: Building systems like ChatGPT.
NLTK (the Natural Language Toolkit) is a powerful Python library for working with text data. It provides tools for tokenization, stemming, lemmatization, and more. Let's explore some common functionalities.
Tokenization is the process of breaking text into smaller components, such as words or sentences.
```python
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')

text = "Natural Language Processing is fascinating. Let's learn more!"

# Word Tokenization
words = word_tokenize(text)
print("Word Tokens:", words)

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokens:", sentences)
```
Stopwords are common words (e.g., "is", "the") that are often removed in text preprocessing.
```python
from nltk.corpus import stopwords

nltk.download('stopwords')

# Filter the word tokens from the previous example
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
```
- Stemming reduces words to their root form (e.g., "running" -> "run").
- Lemmatization maps words to their base dictionary form (e.g., "better" -> "good"); a quick check of this follows the code below.
```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

word = "running"
print("Stemmed:", stemmer.stem(word))                      # run
print("Lemmatized:", lemmatizer.lemmatize(word, pos='v'))  # run
spaCy is an efficient library designed for large-scale NLP tasks. It supports features like Named Entity Recognition (NER), Part-of-Speech tagging, and dependency parsing.
NER identifies entities such as names, dates, and locations in text.
```python
import spacy

# The small English model must be installed first:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Apple was founded by Steve Jobs in Cupertino, California."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)
```
POS tagging assigns grammatical tags (e.g., noun, verb) to words in a sentence.
```python
for token in doc:
    print(token.text, token.pos_)
```
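Dependency parsing, the third spaCy feature mentioned above, is available on the same `doc` object; a minimal sketch:

```python
# Each token records its syntactic relation (dep_) and the token it attaches to (head)
for token in doc:
    print(token.text, token.dep_, token.head.text)
```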
Hugging Face provides state-of-the-art NLP models, including BERT and GPT, through the `transformers` library.
Use a pre-trained model to classify the sentiment of a given text.
```python
from transformers import pipeline

# Downloads a default pre-trained model on first use
classifier = pipeline("sentiment-analysis")

result = classifier("I love learning NLP!")
print(result)  # e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```
Generate text using a language model like GPT-2.
```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

result = generator("Natural Language Processing is", max_length=30, num_return_sequences=1)
print(result[0]['generated_text'])
```
Topic modeling is the task of identifying abstract topics within a collection of documents. The `gensim` library provides tools for Latent Dirichlet Allocation (LDA), a popular topic modeling technique.
```python
from gensim import corpora
from gensim.models.ldamodel import LdaModel

# Sample data
documents = ["I love data science", "Data science is the future", "NLP is fascinating"]

# Preprocessing: lowercase and tokenize, then build the dictionary and bag-of-words corpus
tokenized_docs = [doc.lower().split() for doc in documents]
dictionary = corpora.Dictionary(tokenized_docs)
corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]

# LDA Model
lda_model = LdaModel(corpus, num_topics=2, id2word=dictionary)

for idx, topic in lda_model.print_topics(-1):
    print(f"Topic {idx}: {topic}")
Text summarization condenses a long text into a shorter version while retaining the main points. Hugging Face's summarization pipeline performs abstractive summarization, generating new sentences rather than extracting existing ones.
```python
from transformers import pipeline

summarizer = pipeline("summarization")

text = "Natural Language Processing has a variety of applications, including text summarization. Summarization aims to condense long texts."
summary = summarizer(text, max_length=50, min_length=25, do_sample=False)
print(summary[0]['summary_text'])
```
Word embeddings are dense vector representations of words that place similar words near each other in vector space. Libraries like `gensim` support training Word2Vec models, while pre-trained GloVe embeddings are available for direct use.
```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [["I", "love", "NLP"], ["Word2Vec", "is", "useful"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Get the 100-dimensional vector for a word
vector = model.wv["NLP"]
print(vector)
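```

For GloVe, one convenient route is gensim's downloader API, sketched below. This assumes internet access on the first run; the `glove-wiki-gigaword-100` vectors are roughly 130 MB:

```python
import gensim.downloader as api

# Load 100-dimensional GloVe vectors pre-trained on Wikipedia + Gigaword
glove = api.load("glove-wiki-gigaword-100")

print(glove["king"][:5])                   # first five components of the vector
print(glove.most_similar("king", topn=3))  # nearest neighbors in embedding space
```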
Practice what you learned today with these exercises:

- Tokenize the following text using NLTK, count the tokens, and remove stopwords:
  "The quick brown fox jumps over the lazy dog."
- Use spaCy to extract entities from:
  "Tesla's stock price soared after Elon Musk's announcement in 2023."
- Use Hugging Face's sentiment analysis pipeline to analyze:
  "The movie was a masterpiece, but the ending was disappointing."
- Generate a short text completion starting with:
  "Data Science is the future of"
- Use Gensim's LDA model to find topics in the following documents:
  ["Artificial intelligence is transforming industries", "Machine learning is a subset of AI", "NLP is a key AI application"]
- Summarize the following text:
  "Machine learning is a branch of artificial intelligence that focuses on building systems capable of learning and improving from experience without being explicitly programmed."
Today, you learned about the foundational tools and techniques for NLP:
- NLTK: Preprocessing text with tokenization, stopword removal, stemming, and lemmatization.
- spaCy: Performing advanced tasks like NER and POS tagging.
- Hugging Face: Leveraging pre-trained models for sentiment analysis and text generation.
- Gensim: Topic modeling with LDA.
- Summarization: Condensing text into shorter forms.
- Word Embeddings: Representing words as dense vectors with Word2Vec and GloVe.
NLP is a powerful field with applications in numerous domains. Keep practicing and explore these libraries further to master them!