-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat omniparse #1408
Feat omniparse #1408
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1408 +/- ##
==========================================
+ Coverage 30.65% 30.82% +0.16%
==========================================
Files 320 324 +4
Lines 19423 19603 +180
==========================================
+ Hits 5954 6042 +88
- Misses 13469 13561 +92 ☔ View full report in Codecov by Sentry. |
logger.info(document_parse_ret) | ||
|
||
pdf_parse_ret = await client.parse_pdf(file_input=TEST_PDF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we only use parse_file
and route to different _parse_pdf
and so on inside?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, parse_document
has already implemented this function, but it is more troublesome to be compatible with multiple file formats. It requires the file's triple (filename, file_bytes, mime_type) information.
headers = headers or {} | ||
_headers = {"Authorization": f"Bearer {self.api_key}"} if self.api_key else {} | ||
headers.update(**_headers) | ||
async with httpx.AsyncClient() as client: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use existed apost
from https://github.com/geekan/MetaGPT/blob/main/metagpt/utils/ahttp_client.py
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
apost
does not support file upload yet and the upload method is a bit different from httpx file upload. aiohttp needs to be organized through aiohttp.FormData
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want to use apost
, you need to repackage a file upload function, and the omniparse input parameter format package also needs to be rewritten, which is a bit of a big change.
pdf_parser = OmniParse( | ||
api_key=config.omniparse.api_key, | ||
base_url=config.omniparse.base_url, | ||
parse_options=OmniParseOptions(parse_type=OmniParseType.PDF, result_type=ParseResultType.MD), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if input_files
are txts or docs in SimpleDirectoryReader
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No impact, use file_extractor
parameter of SimpleDirectoryReader
to specify the parser of the corresponding file. Currently, only PDF will use OmniParse
. Other files are still parsed by llamaindex built-in
@staticmethod
def _get_file_extractor() -> dict[str:BaseReader]:
"""
Get the file extractor.
Currently, only PDF use OmniParse. Other document types use the built-in reader from llama_index.
Returns:
dict[file_type: BaseReader]
"""
file_extractor: dict[str:BaseReader] = {}
if config.omniparse.base_url:
pdf_parser = OmniParse(
api_key=config.omniparse.api_key,
base_url=config.omniparse.base_url,
parse_options=OmniParseOptions(parse_type=OmniParseType.PDF, result_type=ParseResultType.MD),
)
file_extractor[".pdf"] = pdf_parser
return file_extractor
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
file_extractor = cls._get_file_extractor()
documents = SimpleDirectoryReader(
input_dir=input_dir, input_files=input_files, file_extractor=file_extractor
).load_data()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you config with omniparse.base_url, you will use pdf_parser from OmniParse. But what if files are docs in input_files.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SimpleDirectoryReader
will determine the file type based on input_files
and match the corresponding file parser. When using OmniParse
, file_extractor
specifies file_extractor[".pdf"] = pdf_parser
, so only .pdf
files will use OmniParse
, and others will still use the default llamaindex
parser. For details, see the SimpleDirectoryReader.load_file
source code
file_suffix = input_file.suffix.lower()
if file_suffix in default_file_reader_suffix or file_suffix in file_extractor:
# use file readers
if file_suffix not in file_extractor:
# instantiate file reader if not already
reader_cls = default_file_reader_cls[file_suffix]
file_extractor[file_suffix] = reader_cls()
reader = file_extractor[file_suffix]
# load data -- catch all errors except for ImportError
try:
kwargs = {"extra_info": metadata}
if fs and not is_default_fs(fs):
kwargs["fs"] = fs
docs = reader.load_data(input_file, **kwargs)
from metagpt.utils.common import aread_bin | ||
|
||
|
||
class OmniParseClient: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here already has a omniparse-client
, should we use it directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OmniParseClient
can be used alone as an independent client (SDK) of omniparse service, while the encapsulated OmniParse
is integrated into llamaindex
to meet its data parsing specifications, but it maintains an OmniParseClient
internally for real document parsing, so that the encapsulation is more decoupled
omniparse已有SDK,可以考虑直接使用 |
lgtm |
Features
Feature Docs
Influence
Result
Other