
dejanu/chatbot


General

  • NLP - Natural Language Processing: text classification, sentiment analysis, chatbots, language processing.
  • Types of Conversational Agents:
    • Rule Based (follow a specific set of rules; not flexible, but they give fairly accurate responses)
    • Learning Based:
      • Generative - generate responses on the fly
      • Retrieval-based - retrieve the most suitable response from a corpus

Terms

  • Corpus - a collection of text documents; a body of written or spoken material upon which a linguistic analysis is based (the main input for NLP tasks)
  • Dataset - collection of data
  • Tokenization - process of breaking a sentence into words
  • Stemming - process of reducing a word to its root by removing suffixes
  • Lemmatization - process of finding the base (dictionary) form of a word, known as the lemma, using a vocabulary and morphological analysis of words; it normally aims to remove inflectional endings only
  • Lemmatization takes more time than stemming: lemmatizing the word caring returns care, while stemming may return car. Lemmatization is computationally more expensive since it involves look-up tables. Lemmatization and stemming are special cases of normalization: both identify a canonical representative for a set of related word forms (see the sketch after this list).
  • Bag of Words - a vector representation of a document in which each word is mapped to a number: the count of how many times that word appears in the document
  • Term Frequency (Tf) - number of times a word appears in a document; commonly used for keyword extraction
  • Inverse Document Frequency (Idf) - measures how rare a word is across the corpus, computed as log(total number of sentences / number of sentences containing term t); e.g. a term that appears in 2 of 10 sentences has idf = log(10/2)
  • Fallback Rate (FBR) - how often the bot is unable to answer a question and falls back to a default response such as "Sorry, I don't understand"
  • Intent - a goal or purpose of a conversation
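
A minimal sketch of tokenization, stemming, and lemmatization with NLTK (assuming the punkt, wordnet, and omw-1.4 resources are already downloaded); the exact stems depend on which stemmer is used:

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

sentence = "She was caring for the stray cats"

# Tokenization: break the sentence into words (tokens)
tokens = nltk.word_tokenize(sentence)

# Stemming: strip suffixes to reach a crude root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]

# Lemmatization: look up the dictionary form (lemma), here treating words as verbs
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(t.lower(), pos="v") for t in tokens]

print(tokens)   # ['She', 'was', 'caring', ...]
print(stems)    # stems such as 'care' or 'car', depending on the stemmer
print(lemmas)   # 'caring' -> 'care' when treated as a verb
```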

Implementation

  • TfidfVectorizer - Convert a collection of raw documents to a matrix of TF-IDF features.
  • Term Frequency–Inverse Document Frequency (TF-IDF) is a text vectorization technique used in information retrieval to represent how important a specific word or phrase is to a given document.
  • Tfidf is a BoW (bag of words) algorithm that can help you identify the intent, but not the relations between those intents.
  • The matrix obtained from the tfidf vectorization, together with the labels, only tells the machine that if a similar matrix is obtained for some text, this is its label. That is handy for classification, but not for generating chatbot responses.
  • Flow to get a response from the chatbot (see the sketch after this list):
    • Get the user input
    • Vectorize the user input
    • Get the similarity between the user input and the corpus
    • Get the index of the most similar text from the corpus
    • Get the response from the corpus at the index
    • Return the response
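
A hedged sketch of that flow using scikit-learn's TfidfVectorizer and cosine similarity; the corpus below is a made-up placeholder, not the repository's actual dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder corpus: each sentence acts as a candidate response
corpus = [
    "A chatbot is a program that simulates human conversation.",
    "TF-IDF represents how important a word is to a document.",
    "NLTK is a popular Python library for natural language processing.",
]

def get_response(user_input: str) -> str:
    # Vectorize the corpus plus the user input (the last row of the matrix)
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(corpus + [user_input])

    # Similarity between the user input and every corpus sentence
    similarities = cosine_similarity(tfidf[-1], tfidf[:-1])

    # Index of the most similar sentence; fall back when nothing matches
    best = similarities.argmax()
    if similarities[0, best] == 0:
        return "Sorry, I don't understand"
    return corpus[best]

print(get_response("what is tf-idf?"))
```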

Keyword Extraction

* BoWC (Bag of Weighted Concepts) - creates concepts by clustering word vectors (embeddings)
* keyword extraction methods: Rake (`pip install rake-nltk`), Yake, Keybert, and Textrank - methods to extract keywords from a single text (a short Rake sketch follows below)
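
A minimal Rake sketch using rake-nltk (assuming NLTK's stopwords and punkt resources are available); the sample text is only an illustration:

```python
from rake_nltk import Rake

text = (
    "Natural language processing lets chatbots classify intents "
    "and retrieve the most relevant response from a corpus."
)

rake = Rake()  # uses NLTK English stop words and punctuation by default
rake.extract_keywords_from_text(text)

# Ranked keyword phrases, best first
for phrase in rake.get_ranked_phrases()[:5]:
    print(phrase)
```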

Issues

  • NLTK resources not found:

    ```
    LookupError:
    **********************************************************************
      Resource omw-1.4 not found.
      Please use the NLTK Downloader to obtain the resource:

      >>> import nltk
      >>> nltk.download('omw-1.4')
    ```

    • Solution: `python3 -c "import nltk; nltk.download('all')"`
  • Weak corpus:

    ```
    UserWarning: Your stop_words may be inconsistent with your preprocessing. Tokenizing the stop words generated tokens ['ha', 'le', 'u', 'wa'] not in stop_words.
    ```

    • Solution: the warning appears because a custom tokenizer is used together with the default stop_words='english'; while extracting features, scikit-learn checks whether the stop words and the tokenizer are consistent (see the sketch after this list). link
  • Which is the best stemmer:

    • Solution: the Snowball stemmer (also known as Porter2) is generally the best choice for English; it is based on the Snowball stemming algorithm.
    • doc
    • link
  • Dockerfile build:

    • Installing numpy raises RuntimeError("Broken toolchain: cannot link a simple C program"); solved by changing the base image to FROM python:3.8-slim
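
A hedged sketch of one way to resolve the stop-words warning above: run the built-in English stop word list through the same (hypothetical) lemmatizing tokenizer before handing both to TfidfVectorizer.

```python
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction import text as sk_text
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def lem_tokenize(doc: str):
    # Hypothetical custom tokenizer: lowercase, split on whitespace, lemmatize
    return [lemmatizer.lemmatize(tok) for tok in doc.lower().split()]

# Pre-process the built-in stop word list with the same tokenizer so that
# tokenized stop words are still recognised as stop words
stop_words = {t for w in sk_text.ENGLISH_STOP_WORDS for t in lem_tokenize(w)}

vectorizer = TfidfVectorizer(tokenizer=lem_tokenize, stop_words=list(stop_words))
```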

Data:

Useful links:

* https://www.youtube.com/watch?v=bxF2LWZcUR8&t=92s
* https://docs.microsoft.com/en-us/power-virtual-agents/teams/fundamentals-get-started-teams
* https://www.kaggle.com/code/alvations/basic-nlp-with-nltk

Chatbot image:

* Docker registry [here](https://hub.docker.com/r/dejanualex/chatbot)
* K8s embedded corpus: `docker run -itu 0 dejanualex/chatbot:1.0 bash`
* Plug and play corpus: `docker run -v $PWD/datasets:/app/datasets -itu 0 dejanualex/chatbot:2.0 bash`

About

chatbot boilerplate
