EmbeddingPaw

EmbeddingPaw is a Python library for playing with text embeddings. It provides a simple and intuitive interface for creating, manipulating, and visualizing embeddings using an OpenAI-like API.

Installation

To install EmbeddingPaw, clone the repository, build it, and install it from source with pip:

git clone https://github.com/Kira-Pgr/EmbeddingPaw.git
cd EmbeddingPaw
python setup.py sdist bdist_wheel
pip install .

Getting Started

To get started with EmbeddingPaw, you need to create an instance of the EmbeddingPaw class with your API configuration:

from embeddingpaw import EmbeddingPaw

config = EmbeddingPaw(
    base_url="http://localhost:1234/v1",
    api_key="sk-xxxx",
    embedding_db_path="embeddings_db.pkl"
)

You can use LM Studio to run your own local embedding server, or use the OpenAI API by providing your API key.
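
For orientation, here is a rough sketch of what a request to an OpenAI-style embeddings endpoint looks like. It is not EmbeddingPaw's internal code: the helper names `build_embedding_request` and `parse_embedding_response` and the model name `my-embedding-model` are placeholders invented for this example. The request is only built, never sent, so the snippet runs without a server.

```python
import json
import urllib.request


def build_embedding_request(base_url, api_key, text, model):
    """Build (but do not send) a POST request for an OpenAI-style /embeddings endpoint."""
    payload = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=payload,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def parse_embedding_response(body):
    """Extract the first embedding vector from an OpenAI-style JSON response body."""
    return json.loads(body)["data"][0]["embedding"]


# "my-embedding-model" is a hypothetical name; use whatever your server exposes.
req = build_embedding_request(
    "http://localhost:1234/v1", "sk-xxxx", "Hello, world!", model="my-embedding-model"
)
print(req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it; the response body can then be handed to `parse_embedding_response`.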

Creating Tokens

You can create a Token object by providing the text you want to embed:

from embeddingpaw import Token

token = Token("Hello, world!")

The Token class automatically retrieves the embedding for the given text using the configured API.

Token Operations

EmbeddingPaw provides various operations that you can perform on Token objects:

  • get_similarity(token): Calculate the cosine similarity between two tokens.
  • get_closest_token(num=1): Find the closest token(s) in the embedding database.
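
To make the two methods above concrete, the following sketch computes cosine similarity and a nearest-neighbor lookup on plain Python lists. It illustrates the math only, under the assumption that the database maps text to a vector; the function names and the toy three-dimensional vectors are made up for this example and are not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest(query, database, num=1):
    """Return the `num` texts whose vectors are most similar to `query`."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:num]]


# Toy vectors standing in for real embeddings (hypothetical values).
toy_db = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.9, 0.2, 0.1],
    "car": [0.0, 0.1, 1.0],
}
query = [1.0, 0.0, 0.0]
print(closest(query, toy_db, num=2))  # ['cat', 'dog']
```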

You can also perform arithmetic operations on token embeddings using the following operators:

  • Addition (+): Add the embeddings of two tokens.
  • Subtraction (-): Subtract the embeddings of two tokens.
  • Multiplication (*): Multiply the embeddings of two tokens.
  • Division (/): Divide the embeddings of two tokens.
  • Matrix Multiplication (@): Perform matrix multiplication on the embeddings of two tokens.
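
As an illustration of how such operators can be wired up in Python, here is a small stand-in class. It assumes that `+`, `-`, `*`, and `/` act element-wise and that `@` on two vectors yields their dot product (as NumPy does for 1-D arrays); whether EmbeddingPaw behaves exactly this way is not stated above. `Vec` is a hypothetical class for this sketch, not the library's `Token`.

```python
class Vec:
    """Hypothetical stand-in for a token's embedding; not EmbeddingPaw's Token."""

    def __init__(self, values):
        self.values = list(values)

    def _elementwise(self, other, op):
        # Apply a binary operation to matching components of both vectors.
        return Vec(op(x, y) for x, y in zip(self.values, other.values))

    def __add__(self, other):
        return self._elementwise(other, lambda x, y: x + y)

    def __sub__(self, other):
        return self._elementwise(other, lambda x, y: x - y)

    def __mul__(self, other):
        return self._elementwise(other, lambda x, y: x * y)

    def __truediv__(self, other):
        return self._elementwise(other, lambda x, y: x / y)

    def __matmul__(self, other):
        # For two 1-D vectors, matrix multiplication reduces to the dot product.
        return sum(x * y for x, y in zip(self.values, other.values))


a = Vec([1.0, 2.0, 4.0])
b = Vec([0.5, 2.0, 8.0])
print((a + b).values)  # [1.5, 4.0, 12.0]
print((a - b).values)  # [0.5, 0.0, -4.0]
print((a * b).values)  # [0.5, 4.0, 32.0]
print((a / b).values)  # [2.0, 1.0, 0.5]
print(a @ b)           # 36.5
```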

Token Arrays

You can create a TokenArray object to work with multiple tokens:

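from embeddingpaw import Token, TokenArray

# Example tokens; any strings work here
token1, token2, token3 = Token("cat"), Token("dog"), Token("car")
token_array = TokenArray([token1, token2, token3])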

The TokenArray class provides methods for manipulating and analyzing the array of tokens:

  • append(token): Append a token to the array.
  • pop(): Remove the last token from the array.
  • delete(text): Delete a token from the array based on its text.
  • pca(n_components=3): Apply Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings.
  • cluster_tokens(range_k=range(2, 10)): Cluster the tokens and show the result in a table.
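
For readers unfamiliar with PCA, the sketch below shows the idea behind reducing embeddings to three components. It uses NumPy's SVD on random data and is only an illustration of the technique; the function `pca` here is a standalone helper written for this example, and the library's own `pca` method may be implemented differently.

```python
import numpy as np


def pca(embeddings, n_components=3):
    """Project row vectors onto their top principal components via SVD."""
    X = np.asarray(embeddings, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA requires mean-centered data
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


# Random vectors standing in for 10 embeddings of dimension 8.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 8))
reduced = pca(embeddings, n_components=3)
print(reduced.shape)  # (10, 3)
```

Three components is a convenient choice because the result can be drawn directly as points in a 3D scatter plot.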

Visualizing Embeddings

EmbeddingPaw includes a TokenVisualizer class for visualizing token embeddings in a 3D scatter plot:

from embeddingpaw import TokenVisualizer

visualizer = TokenVisualizer(token_array)
visualizer.show_web()  # Render the visualization in a web browser
visualizer.show_notebook()  # Render the visualization in a Jupyter notebook

Embedding Database

The EmbeddingPawDatabase class allows you to manage and interact with an embedding database:

from embeddingpaw import EmbeddingPawDatabase

db = EmbeddingPawDatabase()

The database provides methods for adding, deleting, and loading tokens:

  • add_token(token): Add a token to the database.
  • delete_token(text): Delete a token from the database based on its text.
  • load_token_from_txt(path): Load tokens from a text file.
  • load_token_from_json(path): Load tokens from a JSON file.
  • load_token_from_excel(path): Load tokens from an Excel file.
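
The following is a minimal sketch of how a pickle-backed embedding store could work. It is a guess based only on the `.pkl` extension in `embedding_db_path`, not the actual `EmbeddingPawDatabase` code. The class `ToyEmbeddingDB`, its method names, the one-text-per-line file layout, and the `fake_embed` function are all hypothetical.

```python
import os
import pickle
import tempfile


class ToyEmbeddingDB:
    """Hypothetical pickle-backed store mapping text -> embedding vector."""

    def __init__(self, path):
        self.path = path
        self.embeddings = {}
        if os.path.exists(path):
            with open(path, "rb") as f:
                self.embeddings = pickle.load(f)

    def _save(self):
        with open(self.path, "wb") as f:
            pickle.dump(self.embeddings, f)

    def add(self, text, vector):
        self.embeddings[text] = list(vector)
        self._save()

    def delete(self, text):
        self.embeddings.pop(text, None)  # ignore texts that are not stored
        self._save()

    def load_from_txt(self, txt_path, embed):
        # Assumes one text per line; blank lines are skipped.
        with open(txt_path, encoding="utf-8") as f:
            for line in f:
                text = line.strip()
                if text:
                    self.embeddings[text] = embed(text)
        self._save()


def fake_embed(text):
    """Placeholder embedding so the example runs without an API."""
    return [float(len(text)), 0.0]


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "embeddings_db.pkl")
    txt_path = os.path.join(tmp, "words.txt")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("cat\ndog\n\ncar\n")

    db = ToyEmbeddingDB(db_path)
    db.load_from_txt(txt_path, fake_embed)
    db.delete("dog")

    # Reopening the file shows that changes were persisted to disk.
    reloaded = ToyEmbeddingDB(db_path)
    print(sorted(reloaded.embeddings))  # ['car', 'cat']
```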

Contributing

Contributions to EmbeddingPaw are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.

License

EmbeddingPaw is released under the MIT License.
