Customize Chunk Size Per Connector #1632

Open
stianrincon opened this issue Jun 13, 2024 · 0 comments

After using Danswer for a while at BrightInsight, we propose adding a feature to customize chunk sizes when creating a connector.

Main Goals:

  1. Customize Chunk Size: Allow increasing or decreasing the vector database chunk size. Currently, this is set by DOC_EMBEDDING_CONTEXT_SIZE.
  2. Customize Chunk Overlap: Allow increasing or decreasing the vector database chunk overlap. Currently, this is set by CHUNK_OVERLAP.
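
To make the two knobs concrete, here is a minimal sketch of how chunk size and chunk overlap interact in a sliding-window chunker. It is an illustration only, not Danswer's actual chunking code: the function name `chunk_tokens` is hypothetical, and whitespace-split words stand in for real tokenizer output.

```python
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens.

    Consecutive windows share chunk_overlap tokens. Hypothetical helper,
    not part of Danswer.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input, so no
        # trailing chunk made up only of overlap tokens is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

A larger chunk size gives each embedding more context but makes retrieval coarser; a larger overlap reduces the chance that a relevant passage is split across a chunk boundary, at the cost of more chunks to store.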

Specific Details:

  • This modification will be off by default. To turn it on, we will use the environment setting ENABLE_VECTOR_DB_SETTINGS. This way, Danswer will continue working as usual unless this setting is enabled.
  • If ENABLE_VECTOR_DB_SETTINGS is true, when adding a new connector, two new fields will appear: one for DOC_EMBEDDING_CONTEXT_SIZE and another for CHUNK_OVERLAP.
  • Update the connector_credential_pair table to store the values of DOC_EMBEDDING_CONTEXT_SIZE and CHUNK_OVERLAP, so the same settings are reused whenever the connector is re-synced.
  • Modify the chunking logic to check whether a connector has DOC_EMBEDDING_CONTEXT_SIZE and CHUNK_OVERLAP stored in the database. If it does not, fall back to the existing logic.
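
The lookup described in the points above could be sketched roughly as follows. This is a sketch under stated assumptions, not the proposed implementation: `ConnectorChunkSettings` and `resolve_chunk_settings` are hypothetical names, the two global values are placeholders rather than Danswer's real defaults, and falling back per field (instead of requiring both values) is one possible reading of the proposal.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

# Placeholder global values for illustration; the real ones come from
# Danswer's configuration.
DOC_EMBEDDING_CONTEXT_SIZE = 512
CHUNK_OVERLAP = 0


@dataclass
class ConnectorChunkSettings:
    """Hypothetical per-connector values stored on connector_credential_pair."""
    doc_embedding_context_size: Optional[int] = None
    chunk_overlap: Optional[int] = None


def resolve_chunk_settings(
    settings: Optional[ConnectorChunkSettings],
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[int, int]:
    """Return (chunk_size, chunk_overlap) to use for one connector."""
    env = os.environ if env is None else env
    enabled = env.get("ENABLE_VECTOR_DB_SETTINGS", "").lower() == "true"

    # Feature off, or nothing stored for this connector: existing behaviour.
    if not enabled or settings is None:
        return DOC_EMBEDDING_CONTEXT_SIZE, CHUNK_OVERLAP

    # Assumption: each field falls back to its global value independently.
    size = settings.doc_embedding_context_size
    overlap = settings.chunk_overlap
    return (
        DOC_EMBEDDING_CONTEXT_SIZE if size is None else size,
        CHUNK_OVERLAP if overlap is None else overlap,
    )


print(resolve_chunk_settings(
    ConnectorChunkSettings(256, 32),
    env={"ENABLE_VECTOR_DB_SETTINGS": "true"},
))
```

Keeping the resolution in a single function means the indexing pipeline only needs one call site to change, and connectors created before the feature was enabled keep using the global values.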