-
Notifications
You must be signed in to change notification settings - Fork 3k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Structured Index of Documents (#9411)
* Structured Index of Documents * 替换pdf为url * 更改下载方式 * 更新README * 更新README * 修改环境安装指令
- Loading branch information
1 parent
297dbce
commit a26ddc4
Showing
14 changed files
with
1,382 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,181 @@ | ||
# 文档层次化索引 | ||
|
||
## 方法 | ||
|
||
1. 加载数据(load):把需要处理的 pdf 或者 html 文档加载到流程中。 | ||
2. 文档语篇结构解析(parse):使用大语言模型对文档进行语篇结构解析,根据语义重新切分文章,并解析出文档的语篇结构树。 | ||
3. 层次化摘要生成(summary):根据语篇结构树,自底向上对文档解析结果进行层次化摘要生成,生成不同层次信息的摘要。 | ||
4. 层次化索引构建(index):通过文本编码器,将这些不同层次的文本摘要片段嵌入到稠密检索的向量空间中,从而构建一个层次化文本索引。这种索引不仅包含了局部信息,还包含了较高层次的全局信息,能够支持对多种粒度信息的召回,以适应用户查询中的不同信息需求。 | ||
|
||
## 安装 | ||
|
||
### 环境依赖 | ||
|
||
推荐安装 gpu 版本的[PaddlePaddle](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/conda/linux-conda.html),以 cuda11.7的 paddle 为例,安装命令如下: | ||
|
||
```bash | ||
conda install paddlepaddle-gpu==2.6.2 cudatoolkit=11.7 -c https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/ -c conda-forge | ||
``` | ||
安装其他依赖: | ||
```bash | ||
pip install -r requirements.txt | ||
``` | ||
|
||
### 数据准备 | ||
|
||
- 源文档:需要构建层次化索引的文档语料,如路径`data/source`下的文档示例。每篇文档为单个文件,目前支持 PDF 或 HTML 格式。 | ||
脚本`data/source/download.sh`可用于下载示例文档: | ||
```bash | ||
apt install jq -y # 安装 jq 工具, 需要系统权限,若已安装可跳过 | ||
cd data/source | ||
bash download.sh | ||
``` | ||
- 查询文件:用户查询文本,目前支持 json 格式,单条查询为`query_id: query_text`,如查询文件示例`data/query.json`。 | ||
|
||
|
||
## 运行 | ||
|
||
### 索引构建 | ||
|
||
为单个文档文件构建层次化索引: | ||
```bash | ||
python construct_index.py \ | ||
--source data/source/2308.12950.pdf \ | ||
--parse_model_name_or_path Qwen/Qwen2-72B-Instruct \ | ||
--summarize_model_name_or_path Qwen/Qwen2-72B-Instruct \ | ||
--encode_model_name_or_path BAAI/bge-large-en-v1.5 \ | ||
--log_dir .logs | ||
``` | ||
|
||
为整个路径下的所有文档文件构建层次化索引: | ||
```bash | ||
python construct_index.py \ | ||
--source data/source \ | ||
--parse_model_name_or_path Qwen/Qwen2-72B-Instruct \ | ||
--summarize_model_name_or_path Qwen/Qwen2-72B-Instruct \ | ||
--encode_model_name_or_path BAAI/bge-large-en-v1.5 \ | ||
--log_dir .logs | ||
``` | ||
|
||
可调整参数包括: | ||
- `source`: 需要构建层次化索引的所有源文件的目录路径,或需要构建层次化索引的单个源文件 | ||
|
||
- `parse_model_name_or_path`: 用于文档语篇结构解析(parse)的模型的名称或路径 | ||
|
||
- `parse_model_url`: 用于文档语篇结构解析(parse)的模型的 URL。如果不需要则不要写这个参数 | ||
|
||
- `summarize_model_name_or_path`: 用于文档层次化摘要(summarize)的模型的名称或路径 | ||
|
||
- `summarize_model_url`: 用于文档层次化摘要(summarize)的模型的 URL。如果不需要则不要写这个参数 | ||
|
||
- `encode_model_name_or_path`: 用于文本编码的模型的名称或路径 | ||
|
||
- `log_dir`: 保存日志文件的路径 | ||
|
||
层次化索引的结果会保存在 `data/index/{encode_model_name_or_path}`, 每个源文档在此路径下有两个对应的缓存文件用于检索:`.pkl`文件包含源文档的层次化摘要文本,`.npy`文件包含对应的摘要文本编码向量。 | ||
例如,对 `data/source/CodeLlama.pdf` 构建的层次化索引缓存文件包括 `index/BAAI/bge-large-en-v1.5/CodeLlama.npy` 和 `index/BAAI/bge-large-en-v1.5/CodeLlama.pkl`。 | ||
|
||
### 检索输出 | ||
|
||
在层次化索引中检索查询相关摘要片段,并输出检索结果。 | ||
|
||
以文件形式查询多条文本: | ||
```bash | ||
python query.py \ | ||
--search_result_dir data/search_result \ | ||
--encode_model_name_or_path BAAI/bge-large-en-v1.5 \ | ||
--log_dir .logs \ | ||
--query_filepath data/query.json \ | ||
--top_k 5 \ | ||
--embedding_batch_size 128 | ||
``` | ||
|
||
以文本形式查询单条文本: | ||
```bash | ||
python query.py \ | ||
--search_result_dir data/search_result \ | ||
--encode_model_name_or_path BAAI/bge-large-en-v1.5 \ | ||
--log_dir .logs \ | ||
--query_text "What is the relationship between CodeLlama and Llama?" \ | ||
--top_k 5 \ | ||
--embedding_batch_size 1 | ||
``` | ||
|
||
可调整参数为: | ||
- `search_result_dir`: 保存查询的检索结果的路径 | ||
|
||
- `encode_model_name_or_path`: 用于文本编码的模型的名称或路径 | ||
|
||
- `query_filepath`: query 的文件路径。如果有,它必须是一个查询字典的 JSON 文件 | ||
|
||
- `query_text`: 单条 query 的文本。如果有,它必须是一个字符串 | ||
|
||
- `top_k`: 设置为每条查询返回前 top_k 个结果 | ||
|
||
- `embedding_batch_size`: 编码 query 时的批处理大小 | ||
|
||
- `log_dir`: 保存日志文件的路径 | ||
|
||
检索结果保存在`{search_result_dir}/{encode_model_name_or_path}`路径下。此路径下的每个结果文件对应一次查询调用,包含若干条查询,即每次会在`{search_result_dir}/{encode_model_name_or_path}`路径下产生一个`query_{时间戳}.json`的文件记录查询结果,由查询 ID 唯一标识单次查询中的每条查询。若通过`query_text`传入查询文本,则查询 ID 设置为`"0"`。 | ||
|
||
例如上述单条查询的检索结果如下: | ||
```json | ||
{ | ||
"0": { | ||
"query": "What is the relationship between CodeLlama and Llama?", | ||
"hits": [ | ||
{ | ||
"corpus_id": 122, | ||
"score": 0.7032119035720825, | ||
"content": "CoDE LLAMA is a family of large language models for code, based on LLAMA 2, designed for state-of-the-art performance in programming tasks, including infilling, large context handling, and zero-shot instruction-following, with a focus on safety and alignment.", | ||
"source": "data/source/2308.12950.pdf", | ||
"level": 0 | ||
}, | ||
{ | ||
"corpus_id": 127, | ||
"score": 0.6490256786346436, | ||
"content": "CoDE LLAMA models are general-purpose code generation tools, with specialized versions like CoDE LLAMA -PyTHON for Python code and CoDE LLAMA -INsTRUCT for understanding and executing instructions.", | ||
"source": "data/source/2308.12950.pdf", | ||
"level": 3 | ||
}, | ||
{ | ||
"corpus_id": 128, | ||
"score": 0.6398724317550659, | ||
"content": "CoDE LLAMA -PyTHON is specialized for Python code generation, while CoDE LLAMA -INsTRUCT models are designed to understand and execute instructions.", | ||
"source": "data/source/2308.12950.pdf", | ||
"level": 4 | ||
}, | ||
{ | ||
"corpus_id": 161, | ||
"score": 0.6116989254951477, | ||
"content": "CoDE LLAMA models are designed for real-world applications, excelling in infilling and large context handling, and they achieve state-of-the-art performance on code generation benchmarks while ensuring safety and alignment.", | ||
"source": "data/source/2308.12950.pdf", | ||
"level": 2 | ||
}, | ||
{ | ||
"corpus_id": 129, | ||
"score": 0.6056838631629944, | ||
"content": "CoDE LLAMA -INsTRUCT are instruction-following models designed to understand and execute instructions.", | ||
"source": "data/source/2308.12950.pdf", | ||
"level": 5 | ||
} | ||
] | ||
} | ||
} | ||
``` | ||
其中,每条 query 检索结果的格式如下: | ||
``` | ||
查询ID: { | ||
"query": 查询文本, | ||
"hits": [ | ||
{ | ||
"corpus_id": 本条语料在所有语料中的编号, | ||
"score": 相似度分数, | ||
"content": 语料摘要内容, | ||
"source": 语料来源文档的路径, | ||
"level": 本条语料在源文档中的信息粒度层级, 0代表最高级, 数字越大,信息粒度越细 | ||
}, | ||
... | ||
] | ||
} | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from dataclasses import dataclass, field | ||
|
||
|
||
@dataclass | ||
class StructuredIndexerArguments: | ||
""" | ||
Arguments for StructuredIndexer. | ||
""" | ||
|
||
log_dir: str = field(default=".logs", metadata={"help": "log directory"}) | ||
|
||
|
||
@dataclass | ||
class StructuredIndexerEncodeArguments(StructuredIndexerArguments): | ||
""" | ||
Arguments for encoding corpus in StructuredIndexer. | ||
""" | ||
|
||
encode_model_name_or_path: str = field( | ||
default="BAAI/bge-large-en-v1.5", metadata={"help": "encode model name or path"} | ||
) | ||
|
||
|
||
@dataclass | ||
class StructuredIndexerPipelineArguments(StructuredIndexerEncodeArguments): | ||
""" | ||
Arguments for building StructuredIndex pipeline for a single corpus file. | ||
""" | ||
|
||
source: str = field(default="data/source", metadata={"help": "source file or directory"}) | ||
parse_model_name_or_path: str = field( | ||
default="Qwen/Qwen2-7B-Instruct", metadata={"help": "parse model name or path"} | ||
) | ||
parse_model_url: str = field(default=None, metadata={"help": "parse model url if you use api"}) | ||
summarize_model_name_or_path: str = field( | ||
default="Qwen/Qwen2-7B-Instruct", | ||
metadata={"help": "summarize model name or path"}, | ||
) | ||
summarize_model_url: str = field(default=None, metadata={"help": "summarize model url if you use api"}) | ||
|
||
|
||
@dataclass | ||
class RetrievalArguments(StructuredIndexerEncodeArguments): | ||
""" | ||
Arguments for StructuredIndex to retrieve. | ||
""" | ||
|
||
search_result_dir: str = field(default="search_result", metadata={"help": "search result directory"}) | ||
query_filepath: str = field(default="query.json", metadata={"help": "query file path"}) | ||
query_text: str = field(default=None, metadata={"help": "query text"}) | ||
top_k: int = field(default=5, metadata={"help": "top k results for each query"}) | ||
embedding_batch_size: int = field(default=128, metadata={"help": "embedding batch size for queries"}) |
48 changes: 48 additions & 0 deletions
48
slm/pipelines/examples/structured_index/construct_index.py
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
import os | ||
|
||
from arguments import StructuredIndexerPipelineArguments | ||
from src.structured_index import StructuredIndexer | ||
|
||
from paddlenlp.trainer import PdArgumentParser | ||
|
||
if __name__ == "__main__": | ||
parser = PdArgumentParser(StructuredIndexerPipelineArguments) | ||
(args,) = parser.parse_args_into_dataclasses() | ||
|
||
structured_indexer = StructuredIndexer(log_dir=args.log_dir) | ||
assert os.path.exists(args.source) | ||
if os.path.isfile(args.source): | ||
structured_indexer.pipeline( | ||
filepath=args.source, | ||
parse_model_name_or_path=args.parse_model_name_or_path, | ||
parse_model_url=args.parse_model_url, | ||
summarize_model_name_or_path=args.summarize_model_name_or_path, | ||
summarize_model_url=args.summarize_model_url, | ||
encode_model_name_or_path=args.encode_model_name_or_path, | ||
) | ||
else: | ||
for root, _, files in os.walk(args.source): | ||
for file in files: | ||
filepath = os.path.join(root, file) | ||
structured_indexer.pipeline( | ||
filepath=filepath, | ||
parse_model_name_or_path=args.parse_model_name_or_path, | ||
parse_model_url=args.parse_model_url, | ||
summarize_model_name_or_path=args.summarize_model_name_or_path, | ||
summarize_model_url=args.summarize_model_url, | ||
encode_model_name_or_path=args.encode_model_name_or_path, | ||
) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
{ | ||
"0" : "What is big model alignment?", | ||
"1" : "What are the benefits of aligning large models?", | ||
"2" : "How to improve the decoding speed of large language model inference?", | ||
"3" : "What is the difference between CodeLlama and Llama?", | ||
"4" : "What is Grouped Multiple-Degradation Restoration with Image Degradation Similarity?" | ||
} |
30 changes: 30 additions & 0 deletions
30
slm/pipelines/examples/structured_index/data/source/download.sh
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,30 @@ | ||
#!/bin/bash | ||
|
||
download() { | ||
local url=$1 | ||
local ext=$2 | ||
|
||
# 获取文件的basename | ||
local filename=$(basename "$url") | ||
|
||
# 检查文件名是否以ext结尾 | ||
if [[ "$filename" != *".$ext" ]]; then | ||
filename="$filename.$ext" | ||
fi | ||
|
||
# 下载文件 | ||
echo "Downloading $url as $filename" | ||
curl -o "$filename" "$url" | ||
} | ||
|
||
# 读取JSON文件并解析所有ext和对应的URL | ||
json_file="source_url.json" | ||
exts=$(jq -r 'keys[]' "$json_file") | ||
|
||
# 遍历每个ext并下载对应的文件 | ||
for ext in $exts; do | ||
urls=$(jq -r --arg ext "$ext" '.[$ext][]' "$json_file") | ||
for url in $urls; do | ||
download "$url" "$ext" | ||
done | ||
done |
10 changes: 10 additions & 0 deletions
10
slm/pipelines/examples/structured_index/data/source/source_url.json
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
{ | ||
"pdf": [ | ||
"https://arxiv.org/pdf/2406.15877", | ||
"https://arxiv.org/pdf/2407.12273", | ||
"https://arxiv.org/pdf/2308.12950", | ||
"https://arxiv.org/pdf/1810.04805", | ||
"https://arxiv.org/pdf/2402.12374", | ||
"https://aclanthology.org/2023.ccl-2.7.pdf" | ||
] | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,41 @@ | ||
# Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from arguments import RetrievalArguments | ||
from src.structured_index import StructuredIndexer | ||
|
||
from paddlenlp.trainer import PdArgumentParser | ||
|
||
if __name__ == "__main__": | ||
parser = PdArgumentParser(RetrievalArguments) | ||
(args,) = parser.parse_args_into_dataclasses() | ||
|
||
structured_indexer = StructuredIndexer(log_dir=args.log_dir) | ||
|
||
from src.utils import load_data | ||
|
||
if args.query_text is None: | ||
queries_dict = load_data(args.query_filepath, mode="Searching") | ||
else: | ||
|
||
assert isinstance(args.query_text, str) | ||
queries_dict = {"0": args.query_text} | ||
|
||
structured_indexer.search( | ||
queries_dict=queries_dict, | ||
output_dir=args.search_result_dir, | ||
model_name_or_path=args.encode_model_name_or_path, | ||
top_k=args.top_k, | ||
embedding_batch_size=args.embedding_batch_size, | ||
) |
Oops, something went wrong.