Feat omniparse #1408

seehi · 2024-07-23T06:43:36Z

Features

Add an omniparse section to the config
Wrap the omniparse API in an SDK
When RAG parses PDF files, use omniparse instead of PyPDF, the llamaindex default

Feature Docs

Influence

Result

Other

codecov-commenter · 2024-07-23T06:52:02Z

⚠️ Please install the Codecov app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 48.90110% with 93 lines in your changes missing coverage. Please review.

Project coverage is 30.82%. Comparing base (a369912) to head (a771112).
Report is 2 commits behind head on main.

Files	Patch %	Lines
metagpt/utils/omniparse_client.py	34.61%	51 Missing ⚠️
metagpt/rag/parsers/omniparse.py	42.85%	32 Missing ⚠️
metagpt/rag/engines/simple.py	41.66%	7 Missing ⚠️
metagpt/rag/schema.py	89.28%	3 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1408      +/-   ##
==========================================
+ Coverage   30.65%   30.82%   +0.16%     
==========================================
  Files         320      324       +4     
  Lines       19423    19603     +180     
==========================================
+ Hits         5954     6042      +88     
- Misses      13469    13561      +92


better629 · 2024-07-23T09:30:21Z

examples/rag/omniparse.py

+    logger.info(document_parse_ret)
+
+    # pdf
+    pdf_parse_ret = await client.parse_pdf(file_input=TEST_PDF)


Should we expose only parse_file and route to _parse_pdf and the other format-specific parsers internally?

In fact, parse_document already implements this routing, but supporting multiple file formats through it is more cumbersome: it needs the file's (filename, file_bytes, mime_type) triple.
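
To make the triple mentioned in the reply concrete, here is a minimal sketch of how a single entry point could build it from a file name. The helper name `build_file_triple` and the `application/octet-stream` fallback are assumptions for illustration and are not taken from the PR.

```python
import mimetypes
from pathlib import Path


def build_file_triple(path: str, data: bytes) -> tuple[str, bytes, str]:
    """Return (filename, file_bytes, mime_type) for an upload.

    Hypothetical helper: the thread only states that the triple is required.
    """
    filename = Path(path).name
    # guess_type returns (type, encoding); type is None for unknown names.
    mime_type, _ = mimetypes.guess_type(filename)
    # Fall back to a generic binary type (an assumption, not the PR's choice).
    return filename, data, mime_type or "application/octet-stream"
```

A caller would read the bytes first and then pass the triple on, which is the extra bookkeeping the reply describes as cumbersome compared with a format-specific method such as parse_pdf.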

better629 · 2024-07-23T09:31:44Z

metagpt/utils/omniparse_client.py

+        headers = headers or {}
+        _headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}
+        headers.update(**_headers)
+        async with httpx.AsyncClient() as client:


Could this use the existing apost from https://github.com/geekan/MetaGPT/blob/main/metagpt/utils/ahttp_client.py?

apost does not support file upload yet, and aiohttp handles uploads differently from httpx: the payload has to be assembled with aiohttp.FormData.

Using apost would mean writing a new file-upload wrapper and reworking how the omniparse input parameters are packaged, which is a fairly large change.
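
The difference between the two upload styles can be sketched roughly as follows. The form field name `"file"` and the helper names are assumptions for illustration; the PR's client may use different names. The third-party imports are done lazily so that the pure helper works without either library installed.

```python
def httpx_files(filename: str, data: bytes, mime_type: str) -> dict:
    """Multipart payload in the shape httpx accepts through `files=`."""
    return {"file": (filename, data, mime_type)}


def aiohttp_form(filename: str, data: bytes, mime_type: str):
    """Equivalent payload for aiohttp, which expects an aiohttp.FormData."""
    import aiohttp  # lazy import: only needed for the aiohttp path

    form = aiohttp.FormData()
    form.add_field("file", data, filename=filename, content_type=mime_type)
    return form


async def upload_with_httpx(url: str, filename: str, data: bytes, mime_type: str) -> int:
    """Post one file with httpx and return the HTTP status code."""
    import httpx  # lazy import: only needed when actually uploading

    async with httpx.AsyncClient() as client:
        resp = await client.post(url, files=httpx_files(filename, data, mime_type))
        return resp.status_code
```

With aiohttp the form object would be passed as `data=` to the session's post call, so an apost-based client would need its own packaging step for the same triple.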

better629 · 2024-07-23T09:38:38Z

metagpt/rag/engines/simple.py

+            pdf_parser = OmniParse(
+                api_key=config.omniparse.api_key,
+                base_url=config.omniparse.base_url,
+                parse_options=OmniParseOptions(parse_type=OmniParseType.PDF, result_type=ParseResultType.MD),


What happens if the input_files passed to SimpleDirectoryReader are txt or doc files?

No impact. The file_extractor parameter of SimpleDirectoryReader specifies the parser for each file type. Currently only PDF files use OmniParse; other files are still parsed by the llamaindex built-in readers.

    @staticmethod
    def _get_file_extractor() -> dict[str:BaseReader]:
        """
        Get the file extractor.
        Currently, only PDF use OmniParse. Other document types use the built-in reader from llama_index.

        Returns:
            dict[file_type: BaseReader]
        """
        file_extractor: dict[str:BaseReader] = {}
        if config.omniparse.base_url:
            pdf_parser = OmniParse(
                api_key=config.omniparse.api_key,
                base_url=config.omniparse.base_url,
                parse_options=OmniParseOptions(parse_type=OmniParseType.PDF, result_type=ParseResultType.MD),
            )
            file_extractor[".pdf"] = pdf_parser

        return file_extractor

    file_extractor = cls._get_file_extractor()
    documents = SimpleDirectoryReader(
        input_dir=input_dir, input_files=input_files, file_extractor=file_extractor
    ).load_data()

If omniparse.base_url is configured, the pdf_parser from OmniParse is used. But what if the files in input_files are doc files?

SimpleDirectoryReader determines each file's type from input_files and matches it to the corresponding parser. With OmniParse enabled, file_extractor only sets file_extractor[".pdf"] = pdf_parser, so only .pdf files go through OmniParse and everything else still uses the default llamaindex parser. For details, see the SimpleDirectoryReader.load_file source code:

    file_suffix = input_file.suffix.lower()
    if file_suffix in default_file_reader_suffix or file_suffix in file_extractor:
        # use file readers
        if file_suffix not in file_extractor:
            # instantiate file reader if not already
            reader_cls = default_file_reader_cls[file_suffix]
            file_extractor[file_suffix] = reader_cls()
        reader = file_extractor[file_suffix]

        # load data -- catch all errors except for ImportError
        try:
            kwargs = {"extra_info": metadata}
            if fs and not is_default_fs(fs):
                kwargs["fs"] = fs
            docs = reader.load_data(input_file, **kwargs)
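
The suffix-based dispatch described above can be illustrated with a small standalone sketch. It does not use llama_index: `DefaultReader`, `PdfOverrideReader`, `DEFAULT_READERS` and `load` are hypothetical stand-ins that only mimic the lookup order, where an entry in file_extractor wins and a built-in reader is used otherwise.

```python
from pathlib import Path


class DefaultReader:
    """Stand-in for a built-in reader (hypothetical)."""

    def load_data(self, path: Path) -> str:
        return f"default:{path.name}"


class PdfOverrideReader:
    """Stand-in for the OmniParse-backed reader (hypothetical)."""

    def load_data(self, path: Path) -> str:
        return f"omniparse:{path.name}"


# Suffixes that have a built-in reader in this toy model.
DEFAULT_READERS = {".pdf": DefaultReader, ".docx": DefaultReader}


def load(paths: list[str], file_extractor: dict) -> list[str]:
    extractor = dict(file_extractor)  # copy so the caller's mapping is not mutated
    results = []
    for p in map(Path, paths):
        suffix = p.suffix.lower()
        # Only instantiate a default reader when no override is registered.
        if suffix not in extractor and suffix in DEFAULT_READERS:
            extractor[suffix] = DEFAULT_READERS[suffix]()
        reader = extractor.get(suffix)
        if reader is None:
            results.append(f"text:{p.name}")  # plain-text fallback for unknown suffixes
        else:
            results.append(reader.load_data(p))
    return results
```

Calling `load(["a.pdf", "b.docx"], {".pdf": PdfOverrideReader()})` routes only the PDF to the override, which is the behaviour the reply relies on.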

better629 · 2024-07-23T09:47:59Z

metagpt/utils/omniparse_client.py

+from metagpt.utils.common import aread_bin
+
+
+class OmniParseClient:


An omniparse-client already exists; should we use it directly?

OmniParseClient can be used on its own as an independent client (SDK) for the omniparse service. The OmniParse wrapper integrates with llamaindex and follows its data-parsing interface, but internally it holds an OmniParseClient that does the actual document parsing, which keeps the two layers decoupled.
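
The layering described in this reply is plain composition. A minimal sketch, assuming a client with an async parse_pdf(file_input=...) method as used in the example file; `FakeOmniParseClient` and `ReaderAdapter` are hypothetical stand-ins, not the PR's classes.

```python
import asyncio


class FakeOmniParseClient:
    """Hypothetical stand-in for OmniParseClient; the real one calls the HTTP service."""

    async def parse_pdf(self, file_input: bytes) -> str:
        return f"# parsed {len(file_input)} bytes"


class ReaderAdapter:
    """Hypothetical stand-in for the OmniParse reader: it exposes a
    load_data-style interface and delegates the parsing to the client."""

    def __init__(self, client):
        self._client = client  # composition: the adapter owns a client

    async def aload_data(self, data: bytes) -> list[str]:
        text = await self._client.parse_pdf(file_input=data)
        return [text]

    def load_data(self, data: bytes) -> list[str]:
        # Synchronous entry point; assumes no event loop is already running.
        return asyncio.run(self.aload_data(data))
```

Because the adapter only depends on the client's method, the client stays usable on its own and can be swapped for a fake in unit tests.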

seehi · 2024-07-27T03:14:59Z

omniparse already has an SDK; we could consider using it directly.

HuiDBK · 2024-07-27T05:52:11Z

The SDK provided by omniparse is not yet complete, and many of its interface functions fail with errors and cannot be used.

better629 · 2024-08-06T03:42:59Z

lgtm

HuiDBK added 7 commits July 18, 2024 20:40

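Integrate omniparse into MetaGPT

22b9990

Code optimization

5287e02

Unit test changes

79334de

Address code review comments; improve unit tests

758acf8

Code optimization

f9d3a8c

Code optimization

6c39c80

Resolve conflicts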

340e148

seehi had a problem deploying to unittest July 23, 2024 06:43 — with GitHub Actions Failure

seehi temporarily deployed to pre-commit July 23, 2024 06:43 — with GitHub Actions Inactive

Reduce the size of the omniparse example files

8a4e8f8

HuiDBK had a problem deploying to unittest July 23, 2024 07:59 — with GitHub Actions Failure

Tidy up the rag directory under examples

a771112

HuiDBK had a problem deploying to unittest July 23, 2024 08:55 — with GitHub Actions Failure

better629 reviewed Jul 23, 2024


Resolve conflicts

015212d

HuiDBK had a problem deploying to unittest August 6, 2024 03:30 — with GitHub Actions Failure

better629 merged commit 22e1009 into geekan:main Aug 6, 2024
0 of 2 checks passed
