Based on two concepts TFIDF and Bag-of-words.
TFIDF - In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. Tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields, including text summarization and classification.
Read more about TFIDF on Wikipedia.
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision. The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.
Read more about Bag-of-words on Wikipedia.
Add this line to your application's Gemfile:
gem 'bow_tfidf'
And then execute:
$ bundle
Or install it yourself as:
$ gem install bow_tfidf
First of all bag of words with computed tfidf for each word should be created. For this add labeled words as a hash to bag of words:
bow = BowTfidf::BagOfWords.new
bow.add_labeled_data!({
category1: ['word', 'word1'],
category2: ['word', 'word2'],
category3: ['word', 'word2', 'word3']
})
Instance of BowTfidf::BagOfWords
responds to words
and categories
methods:
bow.words
#{
# 'word1' => {
# categories: {
# 1 => {
# tf: 0.3010299956639812,
# tfidf: 0.14362780923945326
# }
# },
# idf: 0.47712125471966244
# },
# ...
#}
bow.categories
#{
# category1: {
# id: 1,
# key: :category1,
# words: Set['word', 'word1']
# },
# ...
#}
To identify category of text pass array of words as argument to category classifier:
classifier = BowTfidf::Classifier.new(bow)
classifier.call(['word2', 'word3'])
# {
# category_key: :category3,
# score: {
# category3: 0.27185717486836963,
# category2: 0.09061905828945654
# }
# }
:category_key
- assumption about category of text by given words. Is based on :score
. The highest score wins.
BowTfidf::Classifier
takes numerical interpretation of relation beetwen word and category, sums it up for each word and returns score.
-
all given words are not in the BOW.
- Solution: update BOW with new words.
-
each of given words belongs to all categories
- In current implementation TFIDF tool ignores such words and does not add it to BOW. It is done with assumption that less frequent words should exists.
To improve performance and memmory usage create dump of built BOW with light data structure(without unnecessary for classifier attributes) and custom classifier which can work with the dump.
# TODO
BowTfidf::Tokenizer.new.call('word word2, some! text')
# <Set: {"word", "word2", "some", "text"}>
After checking out the repo, run bin/setup
to install dependencies. You can also run bin/console
for an interactive prompt that will allow you to experiment.
To install this gem onto your local machine, run bundle exec rake install
. To release a new version, update the version number in version.rb
, and then run bundle exec rake release
, which will create a git tag for the version, push git commits and tags, and push the .gem
file to rubygems.org.
Bug reports and pull requests are welcome on GitHub at https://github.com/isidzukuri/bow_tfidf.