[MDS-6105] Permit Condition Extraction Improvements #3230

simensma-fresh · 2024-08-28T22:22:41Z

Objective

This PR changes the permit condition extraction pipeline to improve the accuracy of condition extraction, and adds a couple of extra features.

Azure Document Intelligence for text extraction, OCR, and layout analysis
Instead of extracting text from PDFs using PyPDF, we switched to Azure Document Intelligence (General Layout Model).

Better accuracy for extracted text
Built-in support for OCR
The layout model gives additional metadata for each extracted paragraph, such as a bounding box and a type (title, sectionHeading, pageHeader, etc.)
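As a rough illustration of the bounding box metadata, the sketch below reduces a paragraph polygon to a box. It assumes the polygon arrives as a list of (x, y) points; the `left` key matches the test data in this PR, while `top`, `right`, `bottom`, the `role` key, and both helper names are hypothetical and not taken from the converter itself.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def polygon_to_bounding_box(polygon: List[Point]) -> Dict[str, float]:
    """Reduce a paragraph polygon to an axis-aligned bounding box."""
    if not polygon:
        raise ValueError("polygon must contain at least one point")
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    # Only "left" is confirmed by the PR's test data; the other keys are assumed.
    return {"left": min(xs), "top": min(ys), "right": max(xs), "bottom": max(ys)}


def to_paragraph(text: str, role: str, polygon: List[Point]) -> Dict:
    """Build a paragraph dict in the {"text": ..., "meta": {...}} shape (hypothetical helper)."""
    return {
        "text": text,
        "meta": {"role": role, "bounding_box": polygon_to_bounding_box(polygon)},
    }
```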

API Changes
The API response for the results endpoint now contains a meta dict, which includes the bounding box of the condition and answers to questions about the condition from GPT4 (defined in the prompt template yml file).
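A client might read the new meta dict roughly like this. This is a sketch only: it assumes the GPT4 answers sit next to `bounding_box` as sibling keys inside `meta`, and `summarize_condition` and the `requires_report` key are hypothetical names used for illustration.

```python
from typing import Any, Dict


def summarize_condition(condition: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the bounding box and the question answers out of a condition's meta dict."""
    meta = condition.get("meta") or {}
    bbox = meta.get("bounding_box") or {}
    # Assumption: every meta key other than "bounding_box" is an answer from GPT4.
    answers = {key: value for key, value in meta.items() if key != "bounding_box"}
    return {
        "text": condition.get("text", ""),
        "left": bbox.get("left"),
        "answers": answers,
    }
```

Conditions without a meta dict fall through with `left` set to `None` and an empty `answers` dict, so older responses would still be readable.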

Pipeline Changes
Previously, we asked GPT4 to do everything (extract all the text, answer questions, derive structure). This yielded pretty good results, but for larger documents it always made some mistakes (e.g. missing parent clauses, list item inconsistencies, or mismatched section numbers). With the introduction of Azure AI Document Intelligence, the approach changes to a more manual one: the structure is derived in code, and GPT4 is only used to answer meta questions about each condition.

pdf_converter - Extracts text paragraphs, layout hints, and bounding boxes from Azure Document Intelligence (AzureDocumentIntelligenceConverter)
filter_paragraphs - Uses the layout hints and bounding boxes to identify the paragraphs that actually contain conditions, and removes things like page headers and footers
parse_hierarchy - Extracts the numbering part of each paragraph (e.g. A, (1), a.) and uses that, together with the bounding boxes, to identify the section, paragraph, subparagraph, clause, and subclause for each condition
prompt_builder / llm / json_fixer - These steps are used to ask GPT questions about each paragraph (does it require a report? when is the due date? etc.)
combine_metadata - Combines the output from GPT4 with the conditions
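The filtering step could look roughly like the sketch below. It assumes each paragraph is a dict with `text` and a `meta` dict carrying the layout role under a `role` key (a hypothetical key name), and it only drops by role; the actual step also uses bounding boxes, which this sketch leaves out.

```python
from typing import Any, Dict, List

# Layout roles that never hold a permit condition.
NON_CONDITION_ROLES = {"pageHeader", "pageFooter", "pageNumber"}


def filter_paragraphs_sketch(paragraphs: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only paragraphs that could contain a condition."""
    kept = []
    for paragraph in paragraphs:
        role = (paragraph.get("meta") or {}).get("role")
        if role in NON_CONDITION_ROLES:
            continue  # page furniture, not a condition
        if not paragraph.get("text", "").strip():
            continue  # empty paragraphs carry no condition text
        kept.append(paragraph)
    return kept
```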

simensma-fresh · 2024-08-28T22:28:09Z

services/permits/app/compare_extraction_results.py

@@ -1,18 +1,21 @@
 ###
 # Utility script to compare extracted permit conditions from CSV files to generate a csv and html report of how well they match
 # Usage: python compare_extraction_results.py --csv_pairs <auto_extracted_csv> <manual_extracted_csv> --csv_pairs <auto_extracted_csv> <manual_extracted_csv> ...


Just a couple of tweaks to the report generation so that the report HTML includes the extracted meta dict.
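For reference, a repeated two-value flag like `--csv_pairs` can be parsed with argparse as sketched below. This is not the script's actual parser; `build_parser`, `parse_pairs`, and the metavar names are hypothetical.

```python
import argparse
from typing import List, Sequence, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Compare auto-extracted and manually extracted permit conditions."
    )
    # nargs=2 with action="append" collects one [auto, manual] pair per flag occurrence.
    parser.add_argument(
        "--csv_pairs",
        nargs=2,
        action="append",
        metavar=("AUTO_CSV", "MANUAL_CSV"),
        required=True,
        help="Pair of CSV files to compare; repeat the flag for more pairs.",
    )
    return parser


def parse_pairs(argv: Sequence[str]) -> List[Tuple[str, str]]:
    args = build_parser().parse_args(argv)
    return [(auto_csv, manual_csv) for auto_csv, manual_csv in args.csv_pairs]
```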

simensma-fresh · 2024-08-28T23:37:43Z

services/permits/app/extract_and_validate_pdf.py

@@ -43,34 +44,69 @@ def authenticate_with_oauth():
    return oauth_session


+def refresh_token(oauth_session):


Added automatic refreshing of the auth token here, as it would sometimes time out because of how long the extraction process takes.
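The general idea, independent of the OAuth library used in the script, is to refresh shortly before expiry rather than after a request fails. The sketch below is not the `refresh_token` implementation from this PR: `RefreshingToken`, `fetch_token`, and `leeway_seconds` are hypothetical names, and it assumes the token endpoint returns the standard `access_token` and `expires_in` fields.

```python
import time
from typing import Any, Callable, Dict, Optional


class RefreshingToken:
    """Cache an access token and fetch a new one shortly before it expires."""

    def __init__(
        self,
        fetch_token: Callable[[], Dict[str, Any]],
        leeway_seconds: float = 60.0,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch_token = fetch_token  # performs the actual token request
        self._leeway = leeway_seconds    # refresh this long before expiry
        self._clock = clock              # injectable so tests can control time
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a valid token, refreshing it if it is missing or about to expire."""
        if self._token is None or self._clock() >= self._expires_at - self._leeway:
            payload = self._fetch_token()
            self._token = payload["access_token"]
            self._expires_at = self._clock() + float(payload["expires_in"])
        return self._token
```

Calling `get()` before each request in a long-running extraction loop keeps the token fresh without refreshing on every call.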

simensma-fresh · 2024-08-28T23:42:03Z

services/permits/tests/test_parse_hierarchy.py

+        {
+            "text": "(a) The operational setback distance at the crest is defined with a Factor of Safety FoS > 1.1 for short term analysis of deep-seated failures and areas with less than a FoS of 1.1 are identifiable as a high-risk zone;",
+            "meta": {"bounding_box": {"left": 2.4115}},
+        },


This is an example from a permit that contains the following structure:

Section (Uppercase letter) -> Paragraph (number) -> Subparagraph (lowercase letter) -> Clause (roman numeral) -> Subclause (lowercase letter)

It serves as a good test case because lowercase letters are used here for both subparagraphs and subclauses.
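To show why the bounding box matters for this case, here is a simplified sketch that extracts the numbering marker with a regex and then assigns the level purely from the left edge of the bounding box. It is not the `parse_hierarchy` implementation: it assumes every level appears in the document and that deeper levels are indented further, and the function names and the `tolerance` parameter are hypothetical.

```python
import re
from bisect import bisect_right
from typing import Any, Dict, List, Optional, Tuple

LEVELS = ["section", "paragraph", "subparagraph", "clause", "subclause"]

# Matches markers such as "A.", "(1)", "a.", "(iv)" at the start of a paragraph.
MARKER_RE = re.compile(r"^\s*\(?([A-Z]|[a-z]|[ivxlcdm]{1,4}|\d{1,3})[.)]\s+")


def extract_marker(text: str) -> Optional[str]:
    match = MARKER_RE.match(text)
    return match.group(1) if match else None


def indentation_columns(lefts: List[float], tolerance: float) -> List[float]:
    """Group left edges into columns; edges within `tolerance` share a column."""
    columns: List[float] = []
    for left in sorted(lefts):
        if not columns or left - columns[-1] > tolerance:
            columns.append(left)
    return columns


def parse_hierarchy_sketch(
    paragraphs: List[Dict[str, Any]], tolerance: float = 0.1
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (marker, level) per paragraph; unnumbered paragraphs get (None, None)."""
    numbered = [p for p in paragraphs if extract_marker(p["text"]) is not None]
    columns = indentation_columns(
        [p["meta"]["bounding_box"]["left"] for p in numbered], tolerance
    )
    result = []
    for paragraph in paragraphs:
        marker = extract_marker(paragraph["text"])
        if marker is None:
            result.append((None, None))
            continue
        left = paragraph["meta"]["bounding_box"]["left"]
        depth = min(bisect_right(columns, left) - 1, len(LEVELS) - 1)
        result.append((marker, LEVELS[depth]))
    return result
```

With this approach the two "(a)" markers in such a permit resolve to different levels because they start at different left offsets, even though the marker text is identical.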

sonarqubecloud · 2024-08-29T16:40:26Z

Quality Gate passed for 'bcgov-sonarcloud_mds_permits'

Issues
3 New issues
0 Accepted issues

Measures
0 Security Hotspots
80.9% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

taraepp · 2024-09-03T15:15:48Z

services/permits/tests/test_azure_document_intelligence_converter.py

+    "app.permit_conditions.converters.azure_document_intelligence_converter.DocumentAnalysisClient"
+)
+def test_run(mock_client, converter, tmp_path):
+    os.environ["DEBUG_MODE"] = "faalse"


this typo probably doesn't matter, pointing it out anyway
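For context on why the typo is likely harmless: if the flag is read by comparing against "true" (an assumption, since the parsing code is not shown in this diff), any other value, including "faalse", evaluates to false. `is_debug_mode` below is a hypothetical helper, not the service's code.

```python
import os
from typing import Mapping


def is_debug_mode(environ: Mapping[str, str] = os.environ) -> bool:
    """Treat only the literal string "true" (case-insensitive) as enabled."""
    return environ.get("DEBUG_MODE", "false").strip().lower() == "true"
```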

* [MDS-6086] Added support for async permit condition extraction using celery
* MDS-6086 Tweaks after PR feedback
* MDS-6986 Fixed tests + added github action to run permit service tests
* MDS-6086 Moved tests folder
* MDS-6086 Run elasticsearch using https locally
* MDS-6086 Fixed celery setup to accept certs
* Fixed cert job startup issue
* Fixes
* Permit condition extraction using azure document intelligence
* MDS-6086 Fixed tests
* Added more tests, cleanup
* Added more tests, cleanup, plug in GPT4 to answer questions
* MDS-6086 Added missing tests, cleanup
* Tweak sonar-project
* Update to reportPaths
* Updated tests
* Add missing test
* MDS-6086 Added tests for permit condition pipeline
* MDS-6086 Fix sonarcloud issues

simensma-fresh added 12 commits August 26, 2024 13:53

[MDS-6086] Added support for async permit condition extraction using …

75e89f3

…celery

MDS-6086 Tweaks after PR feedback

e11ba7d

MDS-6986 Fixed tests + added github action to run permit service tests

4c0a294

MDS-6086 Moved tests folder

a742cbf

MDS-6086 Run elasticsearch using https locally

21777a3

MDS-6086 Fixed celery setup to accept certs

90e39ee

Fixed cert job startup issue

fb3cf22

Fixes

3e8b7b5

Permit condition extraction using azure document intelligence

88cf462

MDS-6086 Fixed tests

5bb55d6

Added more tests, cleanup

267bb81

Added more tests, cleanup, plug in GPT4 to answer questions

b9fa122

MDS-6086 Added missing tests, cleanup

088b03e

simensma-fresh added 4 commits August 28, 2024 23:57

Tweak sonar-project

92c7727

Update to reportPaths

6c8e26b

Updated tests

f7f1b49

Add missing test

431d4f1

simensma-fresh requested review from matbusby-fw, henryoforeh-dev, taraepp and asinn134 August 29, 2024 02:13

simensma-fresh added 🏭 CI/CD This pull request includes CI/CD changes. 👍 Ready for review Pull request has been double checked by the author and is ready for comments and feedback. 💾 Backend This pull request includes backend changes. labels Aug 29, 2024

simensma-fresh marked this pull request as ready for review August 29, 2024 02:14

simensma-fresh added 2 commits August 29, 2024 16:10

MDS-6086 Added tests for permit condition pipeline

3aaf28c

MDS-6086 Fix sonarcloud issues

8233995

henryoforeh-dev approved these changes Aug 29, 2024

taraepp reviewed Sep 3, 2024

taraepp approved these changes Sep 3, 2024

simensma-fresh merged commit 82c9023 into develop Sep 3, 2024
8 checks passed

simensma-fresh deleted the MDS-6086_Permit-condition-extraction branch September 3, 2024 17:48
