Commit
Integrate document vector indexing (#13)
* Add docs and document loading script
* Add METADATA_KEY
* Update DB environment variable names
* Add QDRANT to the environment for configuration
* Migrate and test load_docs and add the GUIDE.md
* Implement wbdocs to schema converter
* Update document metadata
* Add other fields
* Add disciple field
* Update the document schema
* Apply black format
* Fix typing for qdrant file
* Fix linting
* Fix lint
* Add note on the advantages of using the metadata standard
* Fix wbdocs metadata mapper
* Add context generation script and schema2info
* Add APIPrompt
* Fix static method
* Add the indexing guide to the documentation

Signed-off-by: Aivin V. Solatorio <avsolatorio@gmail.com>
1 parent f873907, commit 8f088e4
Showing 24 changed files with 1,697 additions and 76 deletions.
@@ -0,0 +1,39 @@
# Guide for indexing documents and data

Create the following directory structure:

```
data/sources/<data_type>/<collection>/
- <extension>/
- metadata/
```

For example:

```
data/sources/docs/prwp/
- pdf/
- metadata/
```

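The layout above can also be created from a script. The following is a minimal sketch, not part of LLM4Data: the helper name `create_source_dirs` and its parameters are hypothetical, and it only mirrors the `<data_type>/<collection>/<extension>` pattern described in this guide.

```python
from pathlib import Path


def create_source_dirs(root, data_type, collection, extension):
    """Create <root>/<data_type>/<collection>/{<extension>,metadata}/ and return both paths.

    Hypothetical helper for illustration; it is not provided by LLM4Data.
    """
    collection_dir = Path(root) / data_type / collection
    content_dir = collection_dir / extension
    metadata_dir = collection_dir / "metadata"
    for directory in (content_dir, metadata_dir):
        # parents=True builds intermediate directories; exist_ok=True makes reruns safe.
        directory.mkdir(parents=True, exist_ok=True)
    return content_dir, metadata_dir
```

Calling `create_source_dirs("data/sources", "docs", "prwp", "pdf")` would produce the `prwp` example layout shown above.
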
## Content

Each `<extension>` directory (for example, `pdf/`) contains the files to be indexed, stored in the format that the directory name specifies. These files are passed to the LangChain loader that handles that format.

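As a rough illustration of how the content directories could be discovered before loading, the sketch below groups files by their extension directory. It is an assumption about one possible approach, not the logic used by `load_docs`; the function name `files_by_extension` is hypothetical, and the actual choice of LangChain loader is left out.

```python
from pathlib import Path


def files_by_extension(collection_dir):
    """Map each extension directory name to the sorted list of matching files in it.

    Hypothetical helper for illustration; it is not provided by LLM4Data.
    """
    found = {}
    for sub in sorted(Path(collection_dir).iterdir()):
        # Skip loose files and the metadata directory; only content directories are scanned.
        if not sub.is_dir() or sub.name == "metadata":
            continue
        # Keep only files whose suffix matches the directory name, e.g. pdf/*.pdf.
        found[sub.name] = sorted(sub.glob(f"*.{sub.name}"))
    return found
```

Each list in the result could then be handed to whichever loader handles that extension.
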
## Metadata

The `metadata` directory contains the metadata for the documents to be indexed. The contents of these files are stored in the index together with the vector for each chunk of the content.

To maximize the interoperability and reusability of LLM4Data and of the applications built on top of it, we use the [schema guide](https://mah0001.github.io/schema-guide/) to define the metadata for the documents and data.

Using the standardized schema allows you to easily integrate your own data and documents with applications built on top of LLM4Data, such as the Chat4Dev application.

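To make the idea of storing metadata alongside each chunk's vector concrete, here is a minimal sketch. Everything in it is an assumption for illustration: that metadata files are JSON, that a metadata file shares its document's file stem, and that the payload uses the keys `text` and `metadata`. The function names are hypothetical, and the actual storage format used by LLM4Data may differ.

```python
import json
from pathlib import Path


def load_metadata(metadata_dir, document_path):
    """Read the metadata file that shares the document's stem.

    Assumes JSON files named <stem>.json; this naming convention is hypothetical.
    """
    metadata_path = Path(metadata_dir) / f"{Path(document_path).stem}.json"
    with open(metadata_path, encoding="utf-8") as handle:
        return json.load(handle)


def build_payloads(chunks, metadata):
    """Pair every text chunk with the same document-level metadata.

    The "text" and "metadata" keys are illustrative, not LLM4Data's actual payload keys.
    """
    # Each payload is what would be stored next to that chunk's vector.
    return [{"text": chunk, "metadata": metadata} for chunk in chunks]
```

Because every chunk carries the full document metadata, a search hit on any chunk can be traced back to its source document without a second lookup.
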
## Indexing

To index the documents and metadata, run the following command:

```bash
python -m llm4data.scripts.indexing.docs.load_docs --path=data/sources/docs/prwp/pdf --strict
```

This processes the documents and stores the generated vectors in the configured vector index.
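
If you need to run the indexing step from Python, for instance to index several collections in turn, the command above can be wrapped as in the sketch below. The module path and the `--path` and `--strict` flags come from the command shown in this guide; the wrapper functions `build_index_command` and `index_collection` are hypothetical, and treating `--strict` as optional is an assumption.

```python
import subprocess
import sys


def build_index_command(path, strict=True):
    """Build the argument list for the load_docs indexing script."""
    command = [
        sys.executable,  # use the same interpreter that runs this script
        "-m",
        "llm4data.scripts.indexing.docs.load_docs",
        f"--path={path}",
    ]
    if strict:
        # Assumption: the script also accepts being run without --strict.
        command.append("--strict")
    return command


def index_collection(path, strict=True):
    """Run the indexing command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_index_command(path, strict), check=True)
```

Passing the arguments as a list avoids shell quoting issues when a path contains spaces.
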
File renamed without changes.
@@ -189,7 +189,6 @@ def llm2sql_answer(
    drop_na=True,
    num_samples=20,
):

    if params is None:
        params = {}
