In the world of Natural Language Processing (NLP), machines learn to understand and generate text from vast amounts of data. Word embeddings, which represent words as numerical vectors, are essential for this purpose. You've probably heard of Word2Vec and count vectorization, two popular techniques for converting words into vectors. This project provides a basic introduction to Word2Vec and one-hot encoding.
The aim of this project is to understand the continuous skip-gram algorithm and build a model for generating word embeddings from a set of documents.
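To make the algorithm concrete: skip-gram trains a model to predict the words surrounding each center word within a fixed window. The minimal sketch below shows how (center, context) training pairs are generated; the toy sentence and window size are illustrative assumptions, not values from this project.

```python
# Minimal illustration of skip-gram training-pair generation.
# The sentence and window size are toy assumptions for illustration only.
sentence = "the customer filed a complaint about the credit card".split()
window_size = 2  # words within 2 positions of the center count as context

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window_size), min(len(sentence), i + window_size + 1)):
        if j != i:  # every neighbor except the center word itself
            pairs.append((center, sentence[j]))

print(pairs[:4])
# [('the', 'customer'), ('the', 'filed'), ('customer', 'the'), ('customer', 'filed')]
```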
The dataset used for this project contains more than two million customer complaints about consumer financial products. Among its columns, one contains the actual text of the complaint and another indicates the product the complaint concerns.
Note: The project uses the first 100 complaints for testing purposes, but you can train the model on the entire dataset.
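Loading that subset could look like the sketch below. The column name `complaint_text` is a placeholder, since the dataset's actual column names aren't specified here.

```python
import pandas as pd

# Read the complaints file (stored under Input/ in this project's layout).
df = pd.read_excel("Input/complaints.xlsx")

# Placeholder column name: substitute the column that actually holds
# the complaint text in your copy of the data.
texts = df["complaint_text"].head(100)  # first 100 complaints, for testing
```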
- Language: Python
- Libraries: pandas, torch, collections, nltk, numpy, pickle, os, re
- Introduction to the continuous skip-gram algorithm
- Data Description
- Data Preprocessing (see the preprocessing sketch after this list):
  - Handling missing values
  - Conversion to lowercase
  - Punctuation removal
  - Digits removal
  - Removing extra spaces
  - Tokenization
- Building a Data Loader
- Creating the Skip-Gram Model using the PyTorch framework
- Model training
- Generating Word Embeddings (the last four steps are sketched in the PyTorch example after this list)
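The preprocessing steps listed above could be implemented along the following lines. This is a minimal sketch using `re` and `nltk`, with a toy stand-in for the complaint column rather than the project's actual code.

```python
import re
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

# nltk.download("punkt")  # one-time download of the tokenizer models, if missing

def clean_text(text: str) -> str:
    """Apply the cleaning steps listed above to a single complaint."""
    text = text.lower()                       # conversion to lowercase
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation removal
    text = re.sub(r"\d+", " ", text)          # digits removal
    return re.sub(r"\s+", " ", text).strip()  # removing extra spaces

# Toy stand-in for the complaint-text column; real data comes from complaints.xlsx.
texts = pd.Series(["I was charged $35 twice!", None, "Card  declined,  again..."])
texts = texts.dropna()                        # handling missing values
tokens = [word_tokenize(clean_text(t)) for t in texts]
print(tokens)  # [['i', 'was', 'charged', 'twice'], ['card', 'declined', 'again']]
```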
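For the data loader, model, training, and embedding-generation steps, the sketch below shows one way to wire them together in PyTorch. It illustrates the technique rather than reproducing the project's `data.py`, `model.py`, or `Engine.py`, and all hyperparameters (window size, embedding dimension, learning rate, epochs) are assumed values.

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

class SkipGramDataset(Dataset):
    """Yields (center, context) word-index pairs from tokenized sentences."""
    def __init__(self, token_ids, window_size=2):
        self.pairs = []
        for sent in token_ids:
            for i, center in enumerate(sent):
                lo = max(0, i - window_size)
                hi = min(len(sent), i + window_size + 1)
                self.pairs.extend((center, sent[j]) for j in range(lo, hi) if j != i)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        center, context = self.pairs[idx]
        return torch.tensor(center), torch.tensor(context)

class SkipGram(nn.Module):
    """Embedding layer plus a linear projection that predicts context words."""
    def __init__(self, vocab_size, embedding_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.output = nn.Linear(embedding_dim, vocab_size)

    def forward(self, center_ids):
        return self.output(self.embeddings(center_ids))  # logits over the vocabulary

# Toy corpus of word-index sentences; real input comes from the tokenized complaints.
token_ids = [[0, 1, 2, 3, 4], [2, 3, 4, 0, 1]]
vocab_size = 5

loader = DataLoader(SkipGramDataset(token_ids), batch_size=8, shuffle=True)
model = SkipGram(vocab_size, embedding_dim=16)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(3):  # a handful of epochs just to show the loop
    total = 0.0
    for center, context in loader:
        optimizer.zero_grad()
        loss = criterion(model(center), context)  # predict the context word
        loss.backward()
        optimizer.step()
        total += loss.item()
    print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")

# The trained word embeddings are the rows of the embedding matrix.
embeddings = model.embeddings.weight.detach().numpy()  # (vocab_size, embedding_dim)
```

Training on the real corpus amounts to replacing the toy `token_ids` with the tokenized complaints mapped through a word-to-index vocabulary.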
- Input: Contains the data for analysis, in this case, `complaints.xlsx`.
- Output: Contains pre-trained models and vectorizers for future use.
- Source: Contains modularized code for various project steps, including `model.py`, `data.py`, and `utils.py`. These Python files contain useful functions used in `Engine.py`.
- `config.py`: Contains project configurations.
- `Engine.py`: The main file to run the entire project, including model training and saving.
- `skip_gram.ipynb`: The original Jupyter notebook.
- `processing.py`: Used for data processing.
- `README.md`: Contains comprehensive instructions and information on running specific files.
- `requirements.txt`: Lists the required libraries with their respective versions. Install them using `pip install -r requirements.txt`.