capricorn

capricorn is a lightweight library that helps build a vocabulary from a corpus and prepare word embeddings ready for use by learning models. It does two things:

  1. build a vocabulary from a corpus
  2. load only the word embeddings needed, with word indices consistent with the vocabulary's word-to-index mapping

Getting started

```sh
pip install capricorn
```

```python
import capricorn
import os

# Specify file paths
vocab_path = "vocab_processor"
embedding_vector_path = "path/to/embedding"

# Load the vocabulary if it was saved previously
if os.path.isfile(vocab_path):
  print("Loading Vocabulary ...")
  vocab_processor = capricorn.VocabularyProcessor.restore(vocab_path)

else:  # build vocab
  print("Building Vocabulary ...")

  x_text = ["Saudi Arabia Equity Movers: Almarai, Jarir Marketing and Spimaco.",
            "Orange, Thales to Get French Cloud Computing Funds, Figaro Says.",
            "Stansted Could Double Passengers on Deregulation, Times Reports."]

  # Vocabulary parameters
  max_document_length = 11
  min_freq_filter = 2

  vocab_processor = capricorn.VocabularyProcessor(max_document_length=max_document_length,
                                                  min_frequency=min_freq_filter)
  # Option 1: only fit the vocabulary
  # vocab_processor.fit(x_text)
  # Option 2: fit and get the transformed corpus in one pass
  x_text_transformed = vocab_processor.fit_transform(x_text)
  vocab_processor.save(vocab_path)
  print("vocab_processor saved at:", vocab_path)

# Build an embedding matrix whose row indices match the vocabulary's word-to-index mapping
embedding_matrix = vocab_processor.prepare_embedding_matrix_with_dim(embedding_vector_path, 300)
```
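
The resulting matrix can then seed an embedding layer. Below is a minimal sketch assuming TensorFlow/Keras and that the matrix is a NumPy array; the layer setup is illustrative, not a capricorn API:

```python
import tensorflow as tf

# Row i of embedding_matrix is the vector for word index i in the vocabulary
vocab_size, embedding_dim = embedding_matrix.shape

embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,
    output_dim=embedding_dim,
    embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
    trainable=False,  # keep the pretrained vectors frozen
)
```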

User input

The library defaults to the special tokens __UNK__ and __PAD__. If an input sequence is shorter than the max_document_length set when the VocabularyProcessor is initialized, it is automatically padded with __PAD__.
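
For instance, assuming the processor exposes a transform method analogous to fit_transform (an assumption, not confirmed by this README), a short sentence comes back as a fixed-length index sequence:

```python
# Sketch only: assumes VocabularyProcessor.transform exists, mirroring fit_transform.
ids = next(iter(vocab_processor.transform(["We like it very much"])))
# The sequence is padded with the __PAD__ index up to max_document_length.
print(len(ids))  # 11
```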

If the user has defined additional special tokens when initializing the VocabularyProcessor, the sequences must be pre-processed accordingly, i.e., the user needs to insert the self-defined special tokens into each input sequence. For example, if the user defined __START__ and __END__ as additional special tokens and max_document_length=11, the original sentence has to be processed from:

"We like it very much"

to:

"__START__ __PAD__ __PAD__ We like it very much __PAD__ __PAD__ __END__"
