Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat omniparse #1408

Merged
merged 10 commits into from
Aug 6, 2024
Merged

Feat omniparse #1408

merged 10 commits into from
Aug 6, 2024

Conversation

seehi
Copy link
Contributor

@seehi seehi commented Jul 23, 2024

Features

  • config新增omniparse配置
  • 基于omniparse的API封装成SDK
  • RAG在解析pdf文件时指定使用omniparse,而不是llamaindex默认的PyPDF

Feature Docs

Influence

image

Result

image image

Other

@codecov-commenter
Copy link

codecov-commenter commented Jul 23, 2024

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 48.90110% with 93 lines in your changes missing coverage. Please review.

Project coverage is 30.82%. Comparing base (a369912) to head (a771112).
Report is 2 commits behind head on main.

Files Patch % Lines
metagpt/utils/omniparse_client.py 34.61% 51 Missing ⚠️
metagpt/rag/parsers/omniparse.py 42.85% 32 Missing ⚠️
metagpt/rag/engines/simple.py 41.66% 7 Missing ⚠️
metagpt/rag/schema.py 89.28% 3 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1408      +/-   ##
==========================================
+ Coverage   30.65%   30.82%   +0.16%     
==========================================
  Files         320      324       +4     
  Lines       19423    19603     +180     
==========================================
+ Hits         5954     6042      +88     
- Misses      13469    13561      +92     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

logger.info(document_parse_ret)

# pdf
pdf_parse_ret = await client.parse_pdf(file_input=TEST_PDF)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we only use parse_file and route to different _parse_pdf and so on inside?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, parse_document has already implemented this function, but it is more troublesome to be compatible with multiple file formats. It requires the file's triple (filename, file_bytes, mime_type) information.

headers = headers or {}
_headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {}
headers.update(**_headers)
async with httpx.AsyncClient() as client:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use existed apost from https://github.com/geekan/MetaGPT/blob/main/metagpt/utils/ahttp_client.py ?

Copy link
Contributor

@HuiDBK HuiDBK Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apost does not support file upload yet and the upload method is a bit different from httpx file upload. aiohttp needs to be organized through aiohttp.FormData

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to use apost, you need to repackage a file upload function, and the omniparse input parameter format package also needs to be rewritten, which is a bit of a big change.

pdf_parser = OmniParse(
api_key=config.omniparse.api_key,
base_url=config.omniparse.base_url,
parse_options=OmniParseOptions(parse_type=OmniParseType.PDF, result_type=ParseResultType.MD),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if input_files are txts or docs in SimpleDirectoryReader.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No impact, use file_extractor parameter of SimpleDirectoryReader to specify the parser of the corresponding file. Currently, only PDF will use OmniParse. Other files are still parsed by llamaindex built-in

@staticmethod
def _get_file_extractor() -> dict[str:BaseReader]:
    """
    Get the file extractor.
    Currently, only PDF use OmniParse. Other document types use the built-in reader from llama_index.

    Returns:
        dict[file_type: BaseReader]
    """
    file_extractor: dict[str:BaseReader] = {}
    if config.omniparse.base_url:
        pdf_parser = OmniParse(
            api_key=config.omniparse.api_key,
            base_url=config.omniparse.base_url,
            parse_options=OmniParseOptions(parse_type=OmniParseType.PDF, result_type=ParseResultType.MD),
        )
        file_extractor[".pdf"] = pdf_parser

    return file_extractor

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

file_extractor = cls._get_file_extractor()
documents = SimpleDirectoryReader(
    input_dir=input_dir, input_files=input_files, file_extractor=file_extractor
).load_data()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you config with omniparse.base_url, you will use pdf_parser from OmniParse. But what if files are docs in input_files.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimpleDirectoryReader will determine the file type based on input_files and match the corresponding file parser. When using OmniParse, file_extractor specifies file_extractor[".pdf"] = pdf_parser, so only .pdf files will use OmniParse, and others will still use the default llamaindex parser. For details, see the SimpleDirectoryReader.load_file source code

        file_suffix = input_file.suffix.lower()
        if file_suffix in default_file_reader_suffix or file_suffix in file_extractor:
            # use file readers
            if file_suffix not in file_extractor:
                # instantiate file reader if not already
                reader_cls = default_file_reader_cls[file_suffix]
                file_extractor[file_suffix] = reader_cls()
            reader = file_extractor[file_suffix]

            # load data -- catch all errors except for ImportError
            try:
                kwargs = {"extra_info": metadata}
                if fs and not is_default_fs(fs):
                    kwargs["fs"] = fs
                docs = reader.load_data(input_file, **kwargs)

from metagpt.utils.common import aread_bin


class OmniParseClient:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here already has a omniparse-client, should we use it directly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OmniParseClient can be used alone as an independent client (SDK) of omniparse service, while the encapsulated OmniParse is integrated into llamaindex to meet its data parsing specifications, but it maintains an OmniParseClient internally for real document parsing, so that the encapsulation is more decoupled

@seehi
Copy link
Contributor Author

seehi commented Jul 27, 2024

omniparse已有SDK,可以考虑直接使用

@HuiDBK
Copy link
Contributor

HuiDBK commented Jul 27, 2024

The SDK written by omniparse is not yet complete, and many interface functions will directly report errors without being used.
image
image
image

@better629
Copy link
Collaborator

lgtm

@better629 better629 merged commit 22e1009 into geekan:main Aug 6, 2024
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants