EmbeddingPaw is a Python library for playing with text embeddings. It provides a simple and intuitive interface for creating, manipulating, and visualizing embeddings using an OpenAI-like API.
To install EmbeddingPaw from source, clone the repository, then build and install it with pip:
git clone https://github.com/Kira-Pgr/EmbeddingPaw.git
cd EmbeddingPaw
python setup.py sdist bdist_wheel
pip install .
To get started with EmbeddingPaw, you need to create an instance of the EmbeddingPaw class with your API configuration:
from embeddingpaw import EmbeddingPaw
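config = EmbeddingPaw(
    base_url="http://localhost:1234/v1",  # OpenAI-compatible endpoint (here, a local server)
    api_key="sk-xxxx",  # API key for that endpoint
    embedding_db_path="embeddings_db.pkl"  # path to the embedding database file
)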
You can use LM Studio to run your own local embedding server, or use the OpenAI API by providing your API key.
You can create a Token object by providing the text you want to embed:
from embeddingpaw import Token
token = Token("Hello, world!")
The Token class automatically retrieves the embedding for the given text using the configured API.
EmbeddingPaw provides various operations that you can perform on Token objects:
- get_similarity(token): Calculate the cosine similarity between two tokens.
- get_closest_token(num=1): Find the closest token(s) in the embedding database.
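To make the two operations above concrete, here is a minimal pure-Python sketch of cosine similarity and a nearest-neighbour lookup. It is not EmbeddingPaw's implementation: the function names `cosine_similarity` and `closest`, the dictionary-based database, and the 3-dimensional toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def closest(query, database, num=1):
    """Return the `num` keys of `database` whose vectors are most similar to `query`."""
    ranked = sorted(
        database,
        key=lambda text: cosine_similarity(query, database[text]),
        reverse=True,
    )
    return ranked[:num]


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical data).
toy_db = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
print(closest([1.0, 0.0, 0.0], toy_db, num=2))  # ['cat', 'dog']
```

A score of 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions.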
You can also perform arithmetic operations on token embeddings using the following operators:
- Addition (+): Add the embeddings of two tokens.
- Subtraction (-): Subtract the embeddings of two tokens.
- Multiplication (*): Multiply the embeddings of two tokens.
- Division (/): Divide the embeddings of two tokens.
- Matrix Multiplication (@): Perform matrix multiplication on the embeddings of two tokens.
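The operators listed above can be illustrated with a small stand-in class. This sketch assumes that +, -, * and / act element-wise and that @ on two 1-D vectors reduces to a dot product (as it does in NumPy); the README does not confirm these semantics for Token objects, and `ToyVector` is a hypothetical class that is not part of EmbeddingPaw.

```python
class ToyVector:
    """Hypothetical stand-in for a token's embedding; not part of EmbeddingPaw."""

    def __init__(self, values):
        self.values = list(values)

    def _check(self, other):
        if len(self.values) != len(other.values):
            raise ValueError("vectors must have the same length")

    def _elementwise(self, other, op):
        self._check(other)
        return ToyVector(op(x, y) for x, y in zip(self.values, other.values))

    def __add__(self, other):
        return self._elementwise(other, lambda x, y: x + y)

    def __sub__(self, other):
        return self._elementwise(other, lambda x, y: x - y)

    def __mul__(self, other):
        # Assumed element-wise (Hadamard) product.
        return self._elementwise(other, lambda x, y: x * y)

    def __truediv__(self, other):
        # Assumed element-wise division; raises ZeroDivisionError on a zero entry.
        return self._elementwise(other, lambda x, y: x / y)

    def __matmul__(self, other):
        # For two 1-D vectors, matrix multiplication is the dot product.
        self._check(other)
        return sum(x * y for x, y in zip(self.values, other.values))


a = ToyVector([1.0, 2.0, 4.0])
b = ToyVector([2.0, 4.0, 8.0])
print((a + b).values)  # [3.0, 6.0, 12.0]
print((b / a).values)  # [2.0, 2.0, 2.0]
print(a @ b)           # 42.0
```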
You can create a TokenArray object to work with multiple tokens:
from embeddingpaw import TokenArray
token_array = TokenArray([token1, token2, token3])
The TokenArray class provides methods for manipulating and analyzing the array of tokens:
- append(token): Append a token to the array.
- pop(): Remove the last token from the array.
- delete(text): Delete a token from the array based on its text.
- pca(n_components=3): Apply Principal Component Analysis (PCA) to reduce the dimensionality of the embeddings.
- cluster_tokens(range_k=range(2, 10)): Cluster the tokens and show the result in a table.
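For readers unfamiliar with PCA, the following sketch shows the technique itself using NumPy's SVD. It makes no claim about how EmbeddingPaw's pca method is implemented; `pca_reduce` is a hypothetical helper, and the random matrix merely stands in for real embeddings.

```python
import numpy as np


def pca_reduce(embeddings, n_components=3):
    """Project row vectors onto their first `n_components` principal directions."""
    X = np.asarray(embeddings, dtype=float)
    centered = X - X.mean(axis=0)  # PCA operates on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Ten fake 16-dimensional "embeddings" (random data, for illustration only).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(10, 16))
reduced = pca_reduce(fake_embeddings, n_components=3)
print(reduced.shape)  # (10, 3)
```

Three components is a convenient choice when the result is meant for a 3D scatter plot.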
EmbeddingPaw includes a TokenVisualizer class for visualizing token embeddings in a 3D scatter plot:
from embeddingpaw import TokenVisualizer
visualizer = TokenVisualizer(token_array)
visualizer.show_web() # Render the visualization in a web browser
visualizer.show_notebook() # Render the visualization in a Jupyter notebook
The EmbeddingPawDatabase class allows you to manage and interact with an embedding database:
from embeddingpaw import EmbeddingPawDatabase
db = EmbeddingPawDatabase()
The database provides methods for adding, deleting, and loading tokens:
- add_token(token): Add a token to the database.
- delete_token(text): Delete a token from the database based on its text.
- load_token_from_txt(path): Load tokens from a text file.
- load_token_from_json(path): Load tokens from a JSON file.
- load_token_from_excel(path): Load tokens from an Excel file.
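The README does not specify the file layouts the loaders expect. As a rough sketch only, the code below assumes a text file with one entry per line and a JSON file containing an array of strings; both layouts, and the helper names `read_texts_from_txt` and `read_texts_from_json`, are assumptions rather than EmbeddingPaw's documented behavior. It only reads the texts and does not embed them.

```python
import json
import os
import tempfile


def read_texts_from_txt(path):
    """Assumed layout: one text per line; blank lines are skipped."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def read_texts_from_json(path):
    """Assumed layout: a JSON array of strings."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, list) or not all(isinstance(item, str) for item in data):
        raise ValueError("expected a JSON array of strings")
    return data


# Write two small sample files to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    txt_path = os.path.join(tmp, "words.txt")
    json_path = os.path.join(tmp, "words.json")
    with open(txt_path, "w", encoding="utf-8") as handle:
        handle.write("cat\n\ndog\n")
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(["car", "bus"], handle)
    txt_texts = read_texts_from_txt(txt_path)
    json_texts = read_texts_from_json(json_path)

print(txt_texts)   # ['cat', 'dog']
print(json_texts)  # ['car', 'bus']
```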
Contributions to EmbeddingPaw are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request on the GitHub repository.
EmbeddingPaw is released under the MIT License.