2 feature request docs upload on the fly 2 #4

Merged
merged 7 commits on Aug 21, 2023
5 changes: 4 additions & 1 deletion README.md
@@ -3,6 +3,8 @@
![GitHub](https://img.shields.io/github/license/RCGAI/SimplyRetrieve)
![GitHub release (with filter)](https://img.shields.io/github/v/release/RCGAI/SimplyRetrieve)

Latest: Users can now create and append knowledge on-the-fly through the newly added `Knowledge Tab` in the GUI.

## What is SimplyRetrieve?

*SimplyRetrieve* is an open-source tool with the goal of providing a fully localized, lightweight and user-friendly GUI and API platform for the *Retrieval-Centric Generation* (RCG) approach to the machine learning community.
@@ -37,12 +39,13 @@ This tool is constructed based mainly on the awesome and familiar libraries of [
- Git clone this repository.
- On a GPU-based Linux machine, activate your favourite Python venv and install the necessary packages
- `pip install -r requirements.txt`
- If you would like to use your own data as a knowledge source, you can follow these steps. However, if you prefer to start with a simpler example, you can skip these steps and use the default simple sample knowledge source provided by the tool. Note that the sample knowledge source is intended only for demonstration purposes and should not be used for performance evaluations. To achieve accurate results, it is recommended to use your own knowledge source or the Wikipedia source for general usage.
- Optional: If you would like to use your own data as a knowledge source, you can follow these steps. However, if you prefer to start with a simpler example, you can skip these steps and use the default simple sample knowledge source provided by the tool. Note that the sample knowledge source is intended only for demonstration purposes and should not be used for performance evaluations. To achieve accurate results, it is recommended to use your own knowledge source or the Wikipedia source for general usage.
- Prepare knowledge source for retrieval: Put related documents (PDF, etc.) into the `chat/data/` directory and run the data preparation script (`cd chat/`, then the following command)
```
CUDA_VISIBLE_DEVICES=0 python prepare.py --input data/ --output knowledge/ --config configs/default_release.json
```
- Supported document formats are `pdf, txt, doc, docx, ppt, pptx, html, md, csv`, and the list can be easily expanded by editing the configuration file. Follow the tips in [this issue](https://github.com/nltk/nltk/issues/1787) if an NLTK-related error occurs.
- **Latest: The Knowledge Base creation feature is now available through the `Knowledge Tab` of the GUI tool. Users can now add knowledge on-the-fly, so running the above prepare.py script before launching the tool is no longer necessary.**
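The extension-based loader dispatch that `prepare.py` relies on can be sketched as follows. This is a minimal stand-in, not the tool's actual configuration: the real tool maps extensions to LangChain loader classes, while the loader functions below are hypothetical.

```python
# Sketch of extension-to-loader dispatch, mirroring how prepare.py picks a
# document loader by file extension. The loaders here are hypothetical
# stand-ins for the real LangChain loader classes.

def load_txt(path):
    with open(path, encoding="utf-8") as f:
        return [f.read()]

def load_md(path):
    # Markdown is read as plain text in this simplified sketch.
    return load_txt(path)

# Maps a file extension to the callable that loads it; supporting a new
# format is just adding a new entry, which is what editing the config does.
LOADERS = {".txt": load_txt, ".md": load_md}

def documents_load(doc_path, loaders=LOADERS):
    ext = "." + doc_path.split(".")[-1]
    if ext not in loaders:
        raise ValueError(f"unsupported document format: {ext}")
    return loaders[ext](doc_path)
```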

## How to run this tool?
After setting up the prerequisites above, change to the `chat` directory (`cd chat/`) and execute the command below. Then `grab a coffee!` as it will take just a few minutes to load.
38 changes: 35 additions & 3 deletions chat/chat.py
@@ -13,6 +13,7 @@
from prompts.initialize import initialize_prompts, save_prompts
from retrieval.retrieve import embedding_create, index_retrieve
from retrieval.retrieve import retriever_mokb, retriever_weighting, initialize_retriever
from prepare import upload_knowledge, insert_knowledge
from configs.config import read_config, save_config

# Command Line Arguments Setting
@@ -76,6 +77,17 @@ def update_config(*config):
print("Update Config Completed")
return

# Update KnowledgeBase
def update_knowledge(config, path_files, k_dir, k_basename, k_disp, k_desc, progress=gr.Progress()):
global kwargs, retriever_name, knowledge, index, encoder, retriever_mode, embed_mode, drop_retriever
config_loaded = json.loads(config)
config_loaded = insert_knowledge(config_loaded, k_dir, k_basename, k_disp, k_desc)
kwargs["retriever_config"]["retriever"] = config_loaded["retriever_config"]["retriever"]
status = upload_knowledge(kwargs, path_files, k_dir, k_basename, progress)
if args.retriever:
retriever_name, knowledge, index, encoder, retriever_mode, embed_mode = initialize_retriever(kwargs)
return status, gr.Dropdown.update(choices=retriever_name), json.dumps(config_loaded, indent=4)

# Load Logs
def load_logs():
return df
@@ -99,7 +111,7 @@ def main():
with gr.Row():
chkbox_retriever = gr.Checkbox(label="Use KnowledgeBase", value=retriever_chk, visible=args.retriever)
drop_retmode = gr.Dropdown(retriever_mode, value=retriever_mode[0], type="index", multiselect=False, visible=args.retriever, label="KnowledgeBase Mode")
drop_retriever = gr.Dropdown(retriever_name, value=retriever_name[0], type="index", multiselect=False, visible=args.retriever, label="KnowledgeBase")
drop_retriever = gr.Dropdown(retriever_name, value=retriever_name[0], type="value", multiselect=False, visible=args.retriever, label="KnowledgeBase")
with gr.Column():
chkbox_retweight = gr.Checkbox(label="Prompt-Weighting", value=0, visible=args.retriever)
slider_retweight = gr.Slider(label="KnowledgeBase Weightage", minimum=0, maximum=100, value=100, step=1)
@@ -118,7 +130,8 @@ def bot(history, *retriever):
mode = retriever[1]
if mode == None:
mode = 0
idx = retriever[2]
#idx = retriever[2]
idx = retriever_name.index(retriever[2])
if idx == None:
idx = 0
flag_weight = retriever[3]
@@ -204,7 +217,8 @@ def api_user_bot(user_message, *retriever):
mode = retriever[1]
if mode == None:
mode = 0
idx = retriever[2]
#idx = retriever[2]
idx = retriever_name.index(retriever[2])
if idx == None:
idx = 0
flag_weight = retriever[3]
@@ -358,6 +372,19 @@ def api_user_llm(user_message):
col_count=(2,"dynamic"),
interactive=False, type="array", wrap=False)

# KnowledgeBase Creation Tab
with gr.Tab("Knowledge"):
with gr.Row():
files_knowledge = gr.File(file_count="multiple")
gr.Markdown("If the *KnowledgeBase Filename* already exists, knowledge will be appended to it.")
with gr.Row():
dir_knowledgebase = gr.Textbox(label='KnowledgeBase Directory', value="knowledge/", visible=args.fullfeature)
name_knowledgebase = gr.Textbox(label='KnowledgeBase Filename', value="knowledge_new", visible=args.fullfeature)
disp_knowledgebase = gr.Textbox(label='KnowledgeBase Display Name', value="knowledge New", visible=args.fullfeature)
desc_knowledgebase = gr.Textbox(label='KnowledgeBase Description', value="My personal knowledge", visible=args.fullfeature)
btn_k_upload = gr.Button("Create Knowledge", visible=args.fullfeature)
progress_k = gr.Textbox(label='Progress', value="Ready", visible=args.fullfeature)

# Events of Chat AI
retriever = [chkbox_retriever, drop_retmode, drop_retriever, chkbox_retweight, slider_retweight, chkbox_logging]
analysis = [anls_query, anls_prompt, anls_res, anls_rkslscore, anls_qkslscore, anls_rktlscore, anls_qktlscore]
@@ -380,6 +407,11 @@ def api_user_llm(user_message):
load_log_btn.click(fn=load_logs, inputs=None, outputs=datahistory, api_name="load-logs")
save_log_btn.click(fn=save_logs, inputs=save_log_path, api_name="save-logs")

# Events of KnowledgeBase Creation
btn_k_upload.click(fn=update_knowledge, inputs=[config_txt, files_knowledge, dir_knowledgebase,
name_knowledgebase, disp_knowledgebase, desc_knowledgebase],
outputs=[progress_k, drop_retriever, config_txt], api_name="upload-knowledge")

# App Main Settings
app.queue(max_size=100, api_open=args.api, concurrency_count=args.concurrencycount)
app.launch(share=False, server_name="0.0.0.0")
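The chat.py hunks above change the KnowledgeBase dropdown from `type="index"` to `type="value"`, so the handlers now receive the selected knowledge base's name and recover its position with `retriever_name.index(...)`. A minimal sketch of that mapping, with hypothetical retriever names:

```python
# Sketch of the dropdown change above: the handler receives the selected
# knowledge base's name (a value) instead of its position, and maps it back
# to an index itself. The retriever names below are hypothetical.

retriever_name = ["Expert Knowledge", "Junior Knowledge", "knowledge New"]

def resolve_retriever(selected):
    """Map a dropdown value back to its list index, defaulting to 0."""
    if selected is None:
        return 0
    return retriever_name.index(selected)
```

A plausible motivation for the switch: a name stays meaningful even after the choices list is rebuilt when new knowledge is appended on-the-fly, whereas a raw index could drift.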
4 changes: 2 additions & 2 deletions chat/configs/default_release.json
@@ -31,8 +31,8 @@
{
"name": "Expert Knowledge",
"description": "about knowledge of the expert",
"knowledgebase": "knowledge/local_knowledgebase.tsv",
"index": "knowledge/local_index.index",
"knowledgebase": "knowledge/knowledge_sample.tsv",
"index": "knowledge/knowledge_sample.index",
"index_type": "hnsw"
}
],
8 changes: 4 additions & 4 deletions chat/configs/default_release_multikb.json
@@ -31,15 +31,15 @@
{
"name": "Expert Knowledge",
"description": "about knowledge of the expert",
"knowledgebase": "knowledge/local_knowledgebase.tsv",
"index": "knowledge/local_index.index",
"knowledgebase": "knowledge/knowledge_sample.tsv",
"index": "knowledge/knowledge_sample.index",
"index_type": "hnsw"
},
{
"name": "Junior Knowledge",
"description": "about knowledge of the junior",
"knowledgebase": "knowledge/local_knowledgebase_junior.tsv",
"index": "knowledge/local_index_junior.index",
"knowledgebase": "knowledge/knowledge_junior.tsv",
"index": "knowledge/knowledge_junior.index",
"index_type": "hnsw"
}
],
File renamed without changes.
69 changes: 69 additions & 0 deletions chat/prepare.py
@@ -17,6 +17,7 @@
from langchain.text_splitter import RecursiveCharacterTextSplitter, TokenTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
import faiss
import gradio as gr

parser = argparse.ArgumentParser()
parser.add_argument('--config', type=str, default=None)
@@ -83,6 +84,11 @@ def documents_load_config(doc_path, loaders):
docs.extend(loaders['.'+doc_path.split('.')[-1]](doc_path).load())
return docs

def documents_load_file(doc_file, loaders):
docs = []
docs.extend(loaders['.'+doc_file.name.split('.')[-1]](doc_file.name).load())
return docs

def documents_split(docs, encoder, chunk_size, chunk_overlap):
#text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
text_splitter = TokenTextSplitter(encoding_name=encoder, chunk_size=chunk_size, chunk_overlap=chunk_overlap)
@@ -145,6 +151,69 @@ def index_save(index_split, path, filename, fileformat):
faiss.write_index(index_split, savepath)
return

def insert_knowledge(config, k_dir, k_basename, k_disp, k_desc):
config_new = config
for item in config_new["retriever_config"]["retriever"]:
if item["knowledgebase"] == os.path.join(k_dir, k_basename + ".tsv"):
item["name"] = k_disp
item["description"] = k_desc
return config_new
new_knowledge = {"name": k_disp,
"description": k_desc,
"knowledgebase": os.path.join(k_dir, k_basename + ".tsv"),
"index": os.path.join(k_dir, k_basename + ".index"),
"index_type": "hnsw"
}
config_new["retriever_config"]["retriever"].append(new_knowledge)
return config_new

def upload_knowledge(config, path_files, k_dir, k_basename, progress=gr.Progress()):
global cnt_save
if os.path.exists(os.path.join(k_dir, k_basename+args.out_docsext)):
cnt_save = 1
else:
cnt_save = 0
progress(0.1, desc="Preparing")
os.makedirs(k_dir, exist_ok=True)
kwargs = config
print("configs:", kwargs)
initialize_loaders(kwargs)
docslist = path_files
print("total number of readable documents:", len(docslist))
print("readable documents:", docslist)

progress(0.3, desc="Loading, Splitting and Saving Documents")
print("loading, splitting and saving documents...")
cnt_passage = 0
cnt_split = 0
for item in tqdm(docslist):
docs = documents_load_file(item, kwargs['loader_config']['ext_types'])
docs_split = documents_split(docs, args.split_encoder, args.split_chunk_size, args.split_chunk_overlap)
documents_save(docs_split, k_dir, k_basename, args.out_docsext, cnt_split)
cnt_passage += len(docs)
cnt_split += len(docs_split)
print("total number of loaded passages:", cnt_passage)
print("total number of split passages:", cnt_split)

progress(0.5, desc="Creating and Saving Embedding")
print("creating embedding")
embed_split = embedding_create(args.embed_encoder, k_dir, k_basename, args.out_docsext)
print("total number of embeddings:", len(embed_split))
print("saving embedding")
embedding_save(embed_split, k_dir, k_basename, args.out_embedext)

progress(0.8, desc="Creating and Saving Index")
print("creating index")
embed_split = np.array(embed_split)
index_split = index_create(embed_split, args.index_method, args.index_hnsw_m, args.index_ivfpq_nlist, args.index_ivfpq_nsegment, args.index_ivfpq_nbit)
print("total number of indexes:", index_split.ntotal)
print("saving index")
index_save(index_split, k_dir, k_basename, args.out_indexext)

print("documents preparation completed")

return "Ready"

def main():
print(args)

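The `insert_knowledge` helper above either updates an existing config entry in place or appends a new one. A self-contained sketch of that behavior, using a trimmed (and lightly corrected) copy of the function and a minimal hypothetical config:

```python
import os

def insert_knowledge(config, k_dir, k_basename, k_disp, k_desc):
    # Trimmed copy of the helper above: if a knowledge base with the same
    # TSV path already exists, update its display name and description in
    # place; otherwise append a new entry.
    for item in config["retriever_config"]["retriever"]:
        if item["knowledgebase"] == os.path.join(k_dir, k_basename + ".tsv"):
            item["name"] = k_disp
            item["description"] = k_desc
            return config
    config["retriever_config"]["retriever"].append({
        "name": k_disp,
        "description": k_desc,
        "knowledgebase": os.path.join(k_dir, k_basename + ".tsv"),
        "index": os.path.join(k_dir, k_basename + ".index"),
        "index_type": "hnsw",
    })
    return config

# First call appends a new entry.
config = {"retriever_config": {"retriever": []}}
config = insert_knowledge(config, "knowledge/", "knowledge_new",
                          "knowledge New", "My personal knowledge")
```

Calling it again with the same directory and basename takes the append-mode path: the entry count stays the same and only the name and description change, which is how the GUI's append behavior is reflected in the config.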
6 changes: 6 additions & 0 deletions docs/README.md
@@ -46,3 +46,9 @@ This manual provides simple instructions on operating the GUI of this tool. The
- Click the `Load Logs` button to display the query-response history and analytical info.
- Click the `Save Logs` button to save the query-response history and analytical info.
- Logs will be saved in the subdirectory of `analysis/`. You can change the `Save Log Path` next to the Save Logs button.

## Knowledge Tab
- `Drag and drop` new documents (PDF, etc.), then click the `Create Knowledge` button to create a knowledge base on-the-fly directly from the GUI. Multiple documents are supported.
- If the same KnowledgeBase Filename already exists, the new knowledge will be automatically appended to that knowledge base.
- After knowledge base creation, the Chat Tab is automatically updated to include the new knowledge, which can be selected from the KnowledgeBase dropdown menu under the Chat Tab. For appended knowledge, the knowledge base currently in use is updated automatically.
- To keep the newly created knowledge base after reloading the tool, click the `Save Config` button under the Config Tab to save a copy of the automatically updated configs.
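The `Save Config` step amounts to serializing the automatically updated in-memory config back to a JSON file, so newly created knowledge bases survive a reload. A minimal sketch, with a hypothetical path and config contents:

```python
import json

# Hypothetical in-memory config after on-the-fly knowledge creation; the
# entry shape mirrors the retriever entries seen in the config diffs above.
config = {
    "retriever_config": {
        "retriever": [{
            "name": "knowledge New",
            "description": "My personal knowledge",
            "knowledgebase": "knowledge/knowledge_new.tsv",
            "index": "knowledge/knowledge_new.index",
            "index_type": "hnsw",
        }]
    }
}

def save_config(config, path):
    # Persist the updated config so it is picked up on the next launch.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)

def load_config(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

This is a sketch of the persistence idea only; the actual tool routes saving through its own `save_config` helper in `configs/config.py`.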