Term Frequency–Inverse Document Frequency (tf-idf) measures how important a word (or words) is to a document relative to a corpus. The following example adds four documents to a corpus and determines the weight of the word "crystal", then the weight of the word "ruby", in each document.
- Add the dependency to your `shard.yml`:

  ```yaml
  dependencies:
    cadmium_tfidf:
      github: cadmiumcr/tfidf
  ```

- Run `shards install`

```crystal
require "cadmium_tfidf"

tfidf = Cadmium.tf_idf.new

# Build a corpus of four documents.
tfidf.add_document("this document is about crystal.")
tfidf.add_document("this document is about ruby.")
tfidf.add_document("this document is about ruby and crystal.")
tfidf.add_document("this document is about crystal. it has crystal examples")

# Print the weight of "crystal" in each document.
puts "crystal --------------------------------"
tfidf.tfidfs("crystal") do |i, measure, key|
  puts "document ##{i} is #{measure}"
end

# Print the weight of "ruby" in each document.
puts "ruby --------------------------------"
tfidf.tfidfs("ruby") do |i, measure, key|
  puts "document ##{i} is #{measure}"
end
```

Output:

```
crystal --------------------------------
document #0 is 1
document #1 is 0
document #2 is 1
document #3 is 2
ruby --------------------------------
document #0 is 0
document #1 is 1.2876820724517808
document #2 is 1.2876820724517808
document #3 is 0
```
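
The weights above are consistent with a raw term count multiplied by an inverse document frequency of `1 + ln(N / (1 + df))`, where `N` is the number of documents and `df` is the number of documents containing the term. That formula is inferred from the printed numbers, not taken from the library's source, so the following is only a minimal sketch under that assumption. `DOCS`, `tokenize`, `idf`, and `tfidf` are hypothetical names used for illustration and are not part of `cadmium_tfidf`.

```crystal
# Hypothetical helpers for illustration only; not part of cadmium_tfidf.
# Assumed weighting (inferred from the example output, not from the library):
#   tfidf = tf * (1 + ln(N / (1 + df)))

DOCS = [
  "this document is about crystal.",
  "this document is about ruby.",
  "this document is about ruby and crystal.",
  "this document is about crystal. it has crystal examples",
]

# Lowercase the text and split it into alphanumeric word tokens.
def tokenize(text : String) : Array(String)
  text.downcase.scan(/\w+/).map { |m| m[0] }
end

# Assumed smoothed inverse document frequency of a term over the corpus.
def idf(term : String, docs : Array(String)) : Float64
  df = docs.count { |doc| tokenize(doc).includes?(term) }
  1.0 + Math.log(docs.size / (1.0 + df))
end

# Raw count of the term in one document, scaled by the assumed idf.
def tfidf(term : String, doc : String, docs : Array(String)) : Float64
  tokenize(doc).count(term) * idf(term, docs)
end

["crystal", "ruby"].each do |term|
  DOCS.each_with_index do |doc, i|
    puts "#{term}: document ##{i} is #{tfidf(term, doc, DOCS)}"
  end
end
```

Under this assumption, "crystal" appears in 3 of the 4 documents, so its idf is `1 + ln(4 / 4) = 1` and its weight equals its raw count in each document (1, 0, 1, 2). "ruby" appears in 2 documents, so its idf is `1 + ln(4 / 3)`, roughly 1.2877, which is the weight reported for the two documents that mention it once.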

- Fork it (https://github.com/cadmiumcr/tfidf/fork)
- Create your feature branch (`git checkout -b my-new-feature`)
- Commit your changes (`git commit -am 'Add some feature'`)
- Push to the branch (`git push origin my-new-feature`)
- Create a new Pull Request

- Chris Watson - creator and maintainer