Best way of indexing a large text file #7
Replies: 2 comments 3 replies
-
Thanks @ReneReiterer… Vectra is in its early stages and will gain more features over time, so there are a few pieces you'll need to pull in from other libraries. You basically need a text chunker that breaks your large text file into chunks. I'd recommend the chunker in LangChain.JS; the class is called RecursiveCharacterTextSplitter. You need to give the class a chunk size (I'd recommend 1600 characters) and an overlap size (I'd recommend 200 characters). You can pass the whole file into this chunker, generate embeddings for each chunk, and store each chunk in a LocalIndex. Store the text of the chunk as metadata for that chunk; then, when you query the index with the embeddings for a user's query, take the top 5 chunks and add their text to your prompt.
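To make the chunk size and overlap parameters concrete, here's a minimal sketch of a fixed-size character chunker. It's a simplified stand-in for LangChain.JS's RecursiveCharacterTextSplitter (which additionally tries to split on paragraph and sentence boundaries before falling back to raw characters); the function name `chunkText` is just illustrative.

```typescript
// Fixed-size chunker with overlap: each chunk is up to `chunkSize` characters,
// and consecutive chunks share `chunkOverlap` characters so that sentences cut
// at a boundary still appear whole in at least one chunk.
function chunkText(text: string, chunkSize = 1600, chunkOverlap = 200): string[] {
  const step = chunkSize - chunkOverlap; // distance between chunk start positions
  if (step <= 0) throw new Error("chunkOverlap must be smaller than chunkSize");
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this chunk reached the end
  }
  return chunks;
}
```

From there, the flow is roughly: embed each chunk, then call Vectra's `insertItem` with the vector and the chunk text stashed in `metadata`, so a later `queryItems` call can hand the matched text straight to your prompt.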
-
Hello,
first off, I really love your package, it's so easy to work with.
But since I am pretty new to everything about vector databases, I can't really get my head around how to best index a large text file so it can be efficiently searched by your package. Any tips or ideas would be really helpful.