Modify `_get_sub_docs` to use Custom Separator #254

adreichert · 2024-06-27T17:23:08Z

Summary

This PR modifies `_get_sub_docs` to use the separator passed into the `LlamaParse` constructor. I'm making this change because the string `\n---\n` occasionally occurs in our documents. When pagination matters, we need a separator that is less likely to appear in our documents, such as `\n$$$$$$$$\n`.
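To illustrate the problem, here is a minimal sketch of separator-based page splitting. It is not the library's actual `_get_sub_docs` implementation: the helper `split_pages` and the constant `DEFAULT_SEPARATOR` are hypothetical names used only for this example.

```python
# Hypothetical constant: the default page separator named in this PR.
DEFAULT_SEPARATOR = "\n---\n"


def split_pages(text, page_separator=None):
    """Hypothetical helper: split parsed text into pages.

    Falls back to the default separator when none is supplied.
    """
    separator = page_separator or DEFAULT_SEPARATOR
    return text.split(separator)


# Page one contains a Markdown horizontal rule that matches the default separator.
page_one = "**HEADING**\n---\nBody text"
page_two = "Second page"
custom = "\n$$$$$$$$\n"

# Joined with the default separator, the rule inside page one causes an extra split.
default_pages = split_pages(page_one + DEFAULT_SEPARATOR + page_two)
# Joined with a separator that does not occur in the content, the split is exact.
custom_pages = split_pages(page_one + custom + page_two, custom)

print(len(default_pages))  # 3
print(len(custom_pages))   # 2
```

With the default separator, the two-page input comes back as three pieces; with the custom separator, it comes back as two.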

Testing

CLI

Automated Tests passed

% export LLAMA_CLOUD_API_KEY=llx-[...]
% make test                                                                      
pytest tests
====================================================================== test session starts =======================================================================
platform darwin -- Python 3.11.7, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/areichert/Documents/llama_parse
configfile: pyproject.toml
plugins: anyio-4.4.0
collected 3 items                                                                                                                                                

tests/test_reader.py ...                                                                                                                                   [100%]

======================================================================= 3 passed in 15.01s =======================================================================

Test Script

We parsed this two-page document, which contains a `\n---\n` where the background color changes.

[...]
**TO HELP ENGAGE EMPLOYEES**

---

Fulkrum has been providing inspection, [...]

Test Script

import llama_parse
LLAMAPARSE_API_KEY = '[...]'
OPENAI_API_KEY = '[...]'

def parse(split_by_page, page_separator):
    print(f"{split_by_page=}, {page_separator=}")
    parser = llama_parse.LlamaParse(
        result_type='markdown',
        api_key=LLAMAPARSE_API_KEY,
        verbose=False,
        invalidate_cache=True,
        gpt4o_mode=True,
        gpt4o_api_key=OPENAI_API_KEY,
        ignore_errors=True,
        split_by_page=split_by_page,
        page_separator=page_separator,
    )
    result = parser.load_data('fulkrum.pdf')
    print(f"{len(result)} pages")


if __name__ == '__main__':
    parse(False, None)
    parse(True, None)
    parse(True, "\n$$$$$$$$\n")

Results

When not splitting by page, the output list contains 1 element.
When using the default separator, we get 5 pages, which is incorrect for this two-page document.
When using a separator that doesn't appear in the document, we get the correct number of pages (2).

% python test2.py
split_by_page=False, page_separator=None
1 pages
split_by_page=True, page_separator=None
5 pages
split_by_page=True, page_separator='\n$$$$$$$$\n'
2 pages
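Splitting on a separator produces one more piece than the number of times it occurs, so any occurrence inside page content inflates the page count. As a rough sketch, a candidate separator could be checked against known page text before use. The helper `colliding_pages` below is hypothetical and not part of `llama_parse`.

```python
def colliding_pages(pages, separator):
    """Hypothetical helper: return indices of pages whose text contains the separator."""
    return [index for index, page in enumerate(pages) if separator in page]


# Illustrative page contents, not taken from the real document.
pages = ["**HEADING**\n---\nBody text", "Second page"]

# The default-style separator collides with the horizontal rule on the first page.
print(colliding_pages(pages, "\n---\n"))       # [0]
# The custom separator does not occur in any page.
print(colliding_pages(pages, "\n$$$$$$$$\n"))  # []
```

An empty result suggests the separator can be used without splitting a page in two, at least for the pages checked.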

Move _get_sub_docs to private function

b644065

hexapode approved these changes Jul 16, 2024

Merge branch 'main' into adreichert/split-on-custom-separator

6dd9230

jerryjliu merged commit 8938286 into run-llama:main Jul 17, 2024
8 checks passed
