IPFS_Embeddings_Py

endomorphosis edited this page Dec 20, 2024 · 1 revision

This module generates embeddings. It can generate a single embedding, or it can take as input the pinset of an IPFS node or a Hugging Face dataset numbering in the millions of rows and generate embeddings for the entire pinset or dataset. The module is designed to pipeline and parallelize embedding generation to maximize GPU usage, and it uses ipfs_accelerate_py as its source of computation. It first processes the dataset to ensure that each row has an IPFS CID, which tags the data as it flows through the queues of the processing pipeline. The data is tokenized, the tokens are grouped into chunks, and the chunks are embedded using the maximum batch size found. The embeddings are then reassembled into an index, and the index is split into shards of no more than 4096 rows and no more than 25 MB download size each. The module also supports ingesting the vector results into Qdrant and Elasticsearch, but the search libraries are included in the ipfs_faiss module.
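The chunking and sharding steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the module's actual implementation: the function names `content_id`, `chunk_tokens`, and `shard_rows` are hypothetical, a SHA-256 hex digest stands in for a real IPFS CID, the serialized JSON size of a row stands in for its download size, and 25 MB is read as 25 MiB.

```python
import hashlib
import json
from typing import Dict, Iterable, Iterator, List

MAX_SHARD_ROWS = 4096                # row limit stated in the description
MAX_SHARD_BYTES = 25 * 1024 * 1024   # 25 MB, read here as 25 MiB (assumption)


def content_id(text: str) -> str:
    """Hypothetical stand-in for an IPFS CID: SHA-256 hex digest of the row text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk_tokens(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def shard_rows(
    rows: Iterable[Dict],
    max_rows: int = MAX_SHARD_ROWS,
    max_bytes: int = MAX_SHARD_BYTES,
) -> Iterator[List[Dict]]:
    """Group rows into shards that respect both the row limit and the byte limit.

    Size is approximated by the UTF-8 length of each row's JSON encoding.
    A single row larger than max_bytes still gets a shard of its own.
    """
    shard: List[Dict] = []
    shard_bytes = 0
    for row in rows:
        row_bytes = len(json.dumps(row).encode("utf-8"))
        # Close the current shard if adding this row would break either limit.
        if shard and (len(shard) >= max_rows or shard_bytes + row_bytes > max_bytes):
            yield shard
            shard, shard_bytes = [], 0
        shard.append(row)
        shard_bytes += row_bytes
    if shard:
        yield shard


if __name__ == "__main__":
    # Placeholder rows: an ID plus a tiny dummy vector in place of a real embedding.
    rows = [{"cid": content_id(str(i)), "embedding": [0.0] * 4} for i in range(10000)]
    print([len(s) for s in shard_rows(rows)])
```

Writing `shard_rows` as a generator keeps only one shard in memory at a time, which matters for inputs with millions of rows. A real pipeline would run each stage as a consumer on its own queue so that tokenizing, chunking, and embedding overlap.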