imports pretrained keras tokenizer and transforms text with the tf-idf strategy

Martinay/tf-idf

term frequency–inverse document frequency (tf-idf) in c#

This library imports pretrained Keras tokenizers that were exported in JSON format. An imported tokenizer can then transform text into tf-idf vectors. Use WordStrategy for per-word tokenization and CharacterStrategy for per-character tokenization; the choice corresponds to the char_level property of the Keras tokenizer.
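To make the transformation concrete, here is a minimal pure-Python sketch of tf-idf weighting with a word-level or character-level split. It follows the weighting that Keras applies in `texts_to_matrix(..., mode="tfidf")` as far as I know it; this README does not state the exact formula the C# library uses, so treat the formula as an assumption. The helper names `tokenize`, `fit`, and `tfidf_vector` are hypothetical and are not part of this library.

```python
import math
from collections import Counter


def tokenize(text, char_level=False):
    """Split text into characters, or into words separated by single spaces."""
    return list(text) if char_level else text.split(" ")


def fit(texts, char_level=False):
    """Collect vocabulary statistics: (word_index, doc_freq, document_count)."""
    counts = Counter()    # total occurrences of each token
    doc_freq = Counter()  # number of texts that contain each token
    for text in texts:
        tokens = [t for t in tokenize(text, char_level) if t]
        counts.update(tokens)
        doc_freq.update(set(tokens))
    # Indices start at 1; the most frequent token gets the lowest index.
    word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}
    return word_index, doc_freq, len(texts)


def tfidf_vector(text, word_index, doc_freq, document_count, char_level=False):
    """Return one tf-idf row for text; slot 0 is unused, unknown tokens are skipped."""
    vec = [0.0] * (len(word_index) + 1)
    token_counts = Counter(
        t for t in tokenize(text, char_level) if t in word_index
    )
    for token, c in token_counts.items():
        # Assumed weighting (Keras-style): damped term frequency times smoothed idf.
        tf = 1 + math.log(c)
        idf = math.log(1 + document_count / (1 + doc_freq[token]))
        vec[word_index[token]] = tf * idf
    return vec
```

Calling `fit(["abc def", "abc"])` and then `tfidf_vector("abc abc def", ...)` yields a vector with one weight per known word. Passing `char_level=True` to both functions gives one weight per character instead.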

how-to

train and export tokenizer

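from keras.preprocessing.text import Tokenizer

tokenizer_word_count = 1000  # example value: maximum number of words to keep
output_directory = "./"      # example value: directory for tokenizer.json

tokenizer = Tokenizer(num_words=tokenizer_word_count, char_level=False, filters='', split=' ')
all_words = "abc def"
# fit_on_texts expects a list of texts; a bare string would be iterated character by character
tokenizer.fit_on_texts([all_words])

tokenizer_json = tokenizer.to_json()
with open(output_directory + "tokenizer.json", "w", encoding="utf-8") as text_file:
    print(tokenizer_json, file=text_file)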
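Before importing the file in C#, it can help to check what was exported. The sketch below assumes the layout that Keras `to_json()` produces as far as I know: a top-level object with a `config` entry, in which fields such as `word_index` are themselves stored as JSON-encoded strings. The function name `read_exported_tokenizer` is hypothetical, and the field list is an assumption that should be verified against an actual export.

```python
import json


def read_exported_tokenizer(json_text):
    """Parse an exported tokenizer and return its config with nested fields decoded."""
    exported = json.loads(json_text)
    config = dict(exported["config"])
    # Assumption: these fields are JSON strings nested inside the config object.
    for key in ("word_counts", "word_docs", "index_docs", "index_word", "word_index"):
        value = config.get(key)
        if isinstance(value, str):
            config[key] = json.loads(value)
    return config
```

With the file written above, `read_exported_tokenizer(open("tokenizer.json", encoding="utf-8").read())["word_index"]` should show the word-to-index mapping the C# side will load.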

import tokenizer in c#

// requires: using System.IO; using System.Text;
var loadedJson = File.ReadAllText(@"tokenizer.json", Encoding.UTF8);
var tokenizer = Tokenizer.FromJson(loadedJson);

For more examples, see the unit tests.
