[MDS-6105] Permit Condition Extraction Improvements #3230

Merged
merged 19 commits into from
Sep 3, 2024

Conversation

simensma-fresh
Collaborator

@simensma-fresh simensma-fresh commented Aug 28, 2024

Objective

MDS-6105

This PR changes the permit condition extraction pipeline to improve the accuracy of condition extraction, and adds a couple of extra features.

permit_condition_pipeline

Azure Document Intelligence for text extraction, OCR, layout analysis
Instead of extracting text from PDFs using PyPDF, we switched to Azure Document Intelligence (General Layout Model).

  • Better accuracy for extracted text
  • Built-in support for OCR
  • The layout model gives additional metadata for each extracted paragraph, such as a bounding box and a type (title, sectionHeader, pageHeader, etc.)
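As a rough illustration of how the paragraph type metadata can be used, here is a minimal sketch that drops page furniture before any condition parsing. The record shape (`text`, `role`) and the set of roles to drop are assumptions made for this example, not the pipeline's actual schema.

```python
# Hypothetical paragraph records: the "role" and "text" keys are assumptions
# for this sketch, not the schema used by the pipeline in this PR.
NON_CONDITION_ROLES = {"pageHeader", "pageFooter", "pageNumber"}


def drop_layout_noise(paragraphs):
    """Keep only paragraphs whose layout role does not mark them as page furniture."""
    return [p for p in paragraphs if p.get("role") not in NON_CONDITION_ROLES]


paragraphs = [
    {"text": "Mines Act Permit", "role": "pageHeader"},
    {"text": "A. General", "role": "sectionHeader"},
    {"text": "1. The permittee shall ...", "role": None},
    {"text": "Page 3 of 12", "role": "pageFooter"},
]
kept = drop_layout_noise(paragraphs)
```

Paragraphs without a role are kept, since most condition text is likely to be plain body text.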

API Changes
The API response for the results endpoint now contains a meta dict, which includes the bounding box of the condition and GPT4's answers to questions about the condition (the questions are defined in the prompt template yml file).


Pipeline Changes
Previously, we asked GPT4 to do everything (extract all the text, answer questions, derive structure). This yielded pretty good results, but on larger documents it always made some mistakes (e.g. missing parent clauses, list item inconsistencies, or mismatched section numbers). With the introduction of Azure AI Document Intelligence, the approach changes to a more manual one: the pipeline extracts the text and derives the structure itself, and GPT4 is only used to answer meta questions about each condition.

  1. pdf_converter - Extracts text paragraphs, layout hints, and bounding boxes using Azure Document Intelligence (AzureDocumentIntelligenceConverter).
  2. filter_paragraphs - Uses the layout hints and bounding boxes to identify the paragraphs that actually contain conditions, and removes things like page headers and footers.
  3. parse_hierarchy - Extracts the numbering part of each paragraph (e.g. A, (1), a.) and uses it, together with the bounding boxes, to identify the section, paragraph, subparagraph, clause, and subclause for each condition.
  4. prompt_builder / llm / json_fixer - These steps ask GPT4 questions about each paragraph (does it require a report? when is the due date? etc.).
  5. combine_metadata - Combines the output from GPT4 with the conditions.
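To make the numbering extraction in step 3 more concrete, here is a minimal sketch of splitting a paragraph into its numbering and its remaining text. The regular expression and the `split_numbering` helper are assumptions for illustration and only cover the "A.", "1.", "(1)", "(a)" styles; the actual parse_hierarchy step may handle more cases.

```python
import re

# Assumed numbering styles: "A.", "1.", "a.", "(1)", "(a)", "(iv)".
# Letters are capped at 4 characters so roman numerals fit but most words do not.
NUMBERING = re.compile(
    r"^\s*(?:\((?P<paren>\d{1,3}|[A-Za-z]{1,4})\)"  # "(1)", "(a)", "(iv)"
    r"|(?P<dot>\d{1,3}|[A-Za-z]{1,4})\.)"           # "A.", "1.", "a."
    r"\s+(?P<rest>.*)$",
    re.DOTALL,
)


def split_numbering(text):
    """Return (numbering, remaining_text); numbering is None if nothing matched."""
    match = NUMBERING.match(text)
    if match is None:
        return None, text.strip()
    return match.group("paren") or match.group("dot"), match.group("rest")
```

A short word that ends a sentence (for example "Mine. ...") would still be picked up as numbering by this pattern, which is one reason the bounding boxes are a useful second signal.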

@@ -1,18 +1,21 @@
###
# Utility script to compare extracted permit conditions from CSV files to generate a csv and html report of how well they match
# Usage: python compare_extraction_results.py --csv_pairs <auto_extracted_csv> <manual_extracted_csv> --csv_pairs <auto_extracted_csv> <manual_extracted_csv> ...

Just included a couple of tweaks to the report generation so that the report HTML includes the extracted meta dict.
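For readers unfamiliar with the repeated `--csv_pairs` flag in the usage line, this is a sketch of how such a flag could be parsed with `argparse`. The `build_parser` function, the metavar names, and the file names are hypothetical; the script in this PR may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser that accepts --csv_pairs any number of times, two paths each."""
    parser = argparse.ArgumentParser(
        description="Compare auto-extracted and manually extracted permit conditions."
    )
    parser.add_argument(
        "--csv_pairs",
        nargs=2,              # each occurrence takes exactly two values
        action="append",      # repeated occurrences accumulate into a list
        metavar=("AUTO_CSV", "MANUAL_CSV"),
        required=True,
    )
    return parser


# Hypothetical file names, passed explicitly so the sketch does not read sys.argv.
args = build_parser().parse_args(
    ["--csv_pairs", "auto1.csv", "manual1.csv", "--csv_pairs", "auto2.csv", "manual2.csv"]
)
pairs = [tuple(pair) for pair in args.csv_pairs]
```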

@@ -43,34 +44,69 @@ def authenticate_with_oauth():
    return oauth_session


def refresh_token(oauth_session):

Added automatic refreshing of the auth token here, as it would sometimes time out because of how long the extraction process takes.
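The general idea of refreshing a token before it expires during a long-running job can be sketched as follows. This is not the `refresh_token` implementation from this PR: the `RefreshingToken` class, the `fetch_token` callable, and the leeway value are all assumptions for illustration.

```python
import time


class RefreshingToken:
    """Caches a token and fetches a new one shortly before the cached one expires."""

    def __init__(self, fetch_token, leeway_seconds=30, clock=time.monotonic):
        # fetch_token is a hypothetical callable returning (token, lifetime_seconds).
        self._fetch_token = fetch_token
        self._leeway = leeway_seconds
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        """Return a valid token, refreshing it if it is missing or about to expire."""
        if self._token is None or self._clock() >= self._expires_at - self._leeway:
            token, lifetime = self._fetch_token()
            self._token = token
            self._expires_at = self._clock() + lifetime
        return self._token
```

Calling `get()` before each request in a polling loop keeps the token fresh without refreshing on every call.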

{
    "text": "(a) The operational setback distance at the crest is defined with a Factor of Safety FoS > 1.1 for short term analysis of deep-seated failures and areas with less than a FoS of 1.1 are identifiable as a high-risk zone;",
    "meta": {"bounding_box": {"left": 2.4115}},
},

This is an example from a permit that contains the following structure:

Section (Uppercase letter) -> Paragraph (number) -> Subparagraph (lowercase letter) -> Clause (roman numeral) -> Subclause (lowercase letter)

It serves as a good test case, as lowercase letters are in this case used for both subparagraphs and subclauses.
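To show why the bounding box helps here, the sketch below assigns a hierarchy level from the left offset alone, so that two "(a)" items at different indents land on different levels. It assumes each level is indented further than its parent, and the `numbering` key, the offsets, and the tolerance are made up for the example; it is not the parse_hierarchy implementation.

```python
LEVELS = ["section", "paragraph", "subparagraph", "clause", "subclause"]


def assign_levels(items, tolerance=0.1):
    """Assign a level name to each item based on its left offset, in reading order."""
    open_indents = []  # left offsets of the currently open ancestors
    levels = []
    for item in items:
        left = item["meta"]["bounding_box"]["left"]
        # Close every open level that is indented as far as, or further than, this item.
        while open_indents and open_indents[-1] >= left - tolerance:
            open_indents.pop()
        open_indents.append(left)
        depth = min(len(open_indents) - 1, len(LEVELS) - 1)
        levels.append(LEVELS[depth])
    return levels


def make_item(numbering, left):
    """Build a record shaped like the test fixture above (values are hypothetical)."""
    return {"numbering": numbering, "meta": {"bounding_box": {"left": left}}}


items = [
    make_item("A", 1.0),
    make_item("1", 1.4),
    make_item("a", 1.9),   # subparagraph
    make_item("i", 2.4),
    make_item("a", 2.9),   # subclause, same letter as the subparagraph
    make_item("b", 2.9),
    make_item("b", 1.9),   # back out to subparagraph level
]
levels = assign_levels(items)
```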

@simensma-fresh simensma-fresh added 🏭 CI/CD This pull request includes CI/CD changes. 👍 Ready for review Pull request has been double checked by the author and is ready for comments and feedback. 💾 Backend This pull request includes backend changes. labels Aug 29, 2024
@simensma-fresh simensma-fresh marked this pull request as ready for review August 29, 2024 02:14

Quality Gate passed for 'bcgov-sonarcloud_mds_permits'

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
80.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

"app.permit_conditions.converters.azure_document_intelligence_converter.DocumentAnalysisClient"
)
def test_run(mock_client, converter, tmp_path):
    os.environ["DEBUG_MODE"] = "faalse"

this typo probably doesn't matter, pointing it out anyway

@simensma-fresh simensma-fresh merged commit 82c9023 into develop Sep 3, 2024
8 checks passed
@simensma-fresh simensma-fresh deleted the MDS-6086_Permit-condition-extraction branch September 3, 2024 17:48
simensma-fresh added a commit that referenced this pull request Sep 5, 2024
* [MDS-6086] Added support for async permit condition extraction using celery

* MDS-6086 Tweaks after PR feedback

* MDS-6986 Fixed tests + added github action to run permit service tests

* MDS-6086 Moved tests folder

* MDS-6086 Run elasticsearch using https locally

* MDS-6086 Fixed celery setup to accept certs

* Fixed cert job startup issue

* Fixes

* Permit condition extraction using azure document intelligence

* MDS-6086 Fixed tests

* Added more tests, cleanup

* Added more tests, cleanup, plug in GPT4 to answer questions

* MDS-6086 Added missing tests, cleanup

* Tweak sonar-project

* Update to reportPaths

* Updated tests

* Add missing test

* MDS-6086 Added tests for permit condition pipeline

* MDS-6086 Fix sonarcloud issues