Modify _get_sub_docs to use Custom Separator #254

Merged

Conversation

adreichert
Contributor

@adreichert adreichert commented Jun 27, 2024

Summary

This PR modifies _get_sub_docs to split on the separator passed into the LlamaParse constructor. I'm making this change because the default separator, \n---\n, occasionally occurs in the content of our documents, which produces spurious page breaks. When pagination is important, we need a separator that is less likely to occur in our documents, such as \n$$$$$$$$\n.
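
To illustrate why the separator matters, here is a minimal sketch. It is not the library's actual implementation: it assumes the parsed output is a single string with pages joined by the separator, and that sub-documents come from a plain str.split. The function name split_pages and the sample text are hypothetical.

```python
def split_pages(text: str, page_separator: str = "\n---\n") -> list[str]:
    """Split parsed output into per-page chunks on the given separator.

    Hypothetical helper for illustration only; not the library's code.
    """
    return text.split(page_separator)


# Hypothetical two-page document whose first page contains a Markdown rule.
page_one = "**TITLE**\n---\nBody of page one"
page_two = "Body of page two"

# Joined with the default separator, the in-page rule is mistaken for a page break.
default_joined = "\n---\n".join([page_one, page_two])
default_pages = split_pages(default_joined)

# Joined with a separator that never occurs in the content, the split is exact.
custom_separator = "\n$$$$$$$$\n"
custom_joined = custom_separator.join([page_one, page_two])
custom_pages = split_pages(custom_joined, custom_separator)

print(len(default_pages))  # 3
print(len(custom_pages))   # 2
```

Under these assumptions, every occurrence of the separator inside the content adds one extra chunk, so the number of chunks only matches the number of pages when the separator is absent from the content.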

Testing

CLI

Automated Tests passed

% export LLAMA_CLOUD_API_KEY=llx-[...]
% make test
pytest tests
====================================================================== test session starts =======================================================================
platform darwin -- Python 3.11.7, pytest-8.2.2, pluggy-1.5.0
rootdir: /Users/areichert/Documents/llama_parse
configfile: pyproject.toml
plugins: anyio-4.4.0
collected 3 items                                                                                                                                                

tests/test_reader.py ...                                                                                                                                   [100%]

======================================================================= 3 passed in 15.01s =======================================================================

Test Document

We parsed this two-page document, which contains a \n---\n where the background color changes.

[...]
**TO HELP ENGAGE EMPLOYEES**

---

Fulkrum has been providing inspection, [...]

Test Script

import llama_parse
LLAMAPARSE_API_KEY = '[...]'
OPENAI_API_KEY = '[...]'

def parse(split_by_page, page_separator):
    """Parse fulkrum.pdf with the given settings and print how many documents come back."""
    print(f"{split_by_page=}, {page_separator=}")
    parser = llama_parse.LlamaParse(
        result_type='markdown',
        api_key=LLAMAPARSE_API_KEY,
        verbose=False,
        invalidate_cache=True,
        gpt4o_mode=True,
        gpt4o_api_key=OPENAI_API_KEY,
        ignore_errors=True,
        split_by_page=split_by_page,
        page_separator=page_separator,
    )
    result = parser.load_data('fulkrum.pdf')
    print(f"{len(result)} pages")


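if __name__ == '__main__':
    parse(False, None)            # no splitting
    parse(True, None)             # split on the default separator
    parse(True, "\n$$$$$$$$\n")   # split on a separator that does not appear in the document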

Results

  • When not splitting by page, the output list contains 1 element.
  • When using the default separator, we get 5 elements, which is incorrect for a two-page document.
  • When using a separator that doesn't appear in the document, we get the correct number of pages (2).
% python test2.py
split_by_page=False, page_separator=None
1 pages
split_by_page=True, page_separator=None
5 pages
split_by_page=True, page_separator='\n$$$$$$$$\n'
2 pages
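
One way to pick a separator that avoids this kind of collision is to check candidates against representative text from your documents. The sketch below assumes such text is available as a plain string. The helper first_unused_separator, the candidate list, and the sample text are all hypothetical and not part of llama_parse.

```python
from typing import Iterable, Optional


def first_unused_separator(text: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate that never occurs in text, or None if all occur."""
    for candidate in candidates:
        if candidate not in text:
            return candidate
    return None


# Hypothetical sample content containing a Markdown horizontal rule.
sample_text = "**TO HELP ENGAGE EMPLOYEES**\n---\nMore content"
candidates = ["\n---\n", "\n$$$$$$$$\n"]

chosen = first_unused_separator(sample_text, candidates)
print(repr(chosen))
```

The chosen value could then be passed as page_separator, as in the test script above. A check like this only covers the text it was given, so a separator that is absent from one sample can still appear in other documents.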

@jerryjliu jerryjliu merged commit 8938286 into run-llama:main Jul 17, 2024
8 checks passed