capricorn

capricorn is a lightweight library that helps build a vocabulary from a corpus and prepare word embeddings ready for use by learning models. It does two things:

  1. build a vocabulary from a corpus
  2. load only the word embeddings needed, with word indices consistent with the vocabulary's word-to-index mapping

Getting started

```sh
pip install capricorn
```

```python
import capricorn
import os

# Specify file paths
vocab_path = "vocab_processor"
embedding_vector_path = "path/to/embedding"

# Load the vocabulary if it was saved previously
if os.path.isfile(vocab_path):
  print("Loading Vocabulary ...")
  vocab_processor = capricorn.VocabularyProcessor.restore(vocab_path)

else:  # build vocab
  print("Building Vocabulary ...")

  x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
            "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
            "Stansted Could Double Passengers on Deregulation, Times Reports."]

  # Vocabulary parameters
  max_document_length = 11
  min_freq_filter = 2

  vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length,
                                                  min_frequency=min_freq_filter)
  # Option 1: only fit the vocabulary
  # vocab_processor.fit(x_text)
  # Option 2: fit and get the transformed corpus in one pass
  x_text_transformed = vocab_processor.fit_transform(x_text)
  vocab_processor.save(vocab_path)
  print("vocab_processor saved at:", vocab_path)

# Build an embedding matrix whose row indices match the vocabulary's word-to-index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix_with_dim(embedding_vector_path, 300)
```
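
The resulting matrix can then seed an embedding layer. Below is a minimal sketch assuming TensorFlow/Keras and that the matrix is a NumPy array; the layer setup is illustrative, not a capricorn API:

```python
import tensorflow as tf

# Row i of embedding_matrix is the vector for word index i in the vocabulary
vocab_size, embedding_dim = embedding_matrix.shape

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pretrained vectors frozen
)
```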

User input

The library defaults to the special tokens __UNK__ and __PAD__. If an input sequence is shorter than the max_document_length set when the VocabularyProcessor is initialized, it is automatically padded with __PAD__.
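
For instance, assuming the processor exposes a transform method analogous to fit_transform (an assumption, not confirmed by this README), a short sentence comes back as a fixed-length index sequence:

```python
# Sketch only: assumes VocabularyProcessor.transform exists, mirroring fit_transform.
ids = next(iter(vocab_processor.transform(["We like it very much"])))
# The sequence is padded with the __PAD__ index up to max_document_length.
print(len(ids))  # 11
```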

If the user has defined additional special tokens when initializing the VocabularyProcessor, the sequences must be pre-processed accordingly, i.e., the user needs to insert the self-defined special tokens into each input sequence. For example, if the user defined __START__ and __END__ as additional special tokens and max_document_length=11, the original sentence has to be processed from:

"We like it very much"

to:

"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"
