[MDS-6105] Permit Condition Extraction Improvements #3230
Conversation
@@ -1,18 +1,21 @@
###
# Utility script to compare extracted permit conditions from CSV files and generate a CSV and HTML report of how well they match
# Usage: python compare_extraction_results.py --csv_pairs <auto_extracted_csv> <manual_extracted_csv> --csv_pairs <auto_extracted_csv> <manual_extracted_csv> ...
Just included a couple of tweaks to the report generation to include the extracted meta dict in the report HTML.
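The repeated `--csv_pairs` flag from the usage string could be wired up with `argparse` roughly like this. This is a minimal sketch of one way to parse it; the parser setup shown is an assumption, not the script's actual code:

```python
import argparse


def parse_args(argv=None):
    # Each --csv_pairs takes two paths: the auto-extracted CSV and the
    # manually extracted CSV to compare it against. The flag may be
    # repeated, so the pairs are collected with action="append".
    parser = argparse.ArgumentParser(
        description="Compare auto-extracted permit conditions to manual extractions"
    )
    parser.add_argument(
        "--csv_pairs",
        nargs=2,
        action="append",
        metavar=("AUTO_CSV", "MANUAL_CSV"),
        required=True,
        help="Pair of CSVs to compare; repeat the flag for multiple pairs",
    )
    return parser.parse_args(argv)
```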
@@ -43,34 +44,69 @@ def authenticate_with_oauth():
    return oauth_session


def refresh_token(oauth_session):
Added automatic refreshing of the auth token here, as it would sometimes time out due to how long the extraction process takes.
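The refresh decision for a long-running job typically boils down to an expiry check before each batch of calls. A hypothetical sketch (the field name `expires_at` and the margin follow common OAuth2 client conventions and are assumptions, not this PR's actual code):

```python
import time

# Refresh a little before actual expiry so an in-flight request
# never races the token's deadline. (Illustrative value.)
REFRESH_MARGIN_SECONDS = 60


def needs_refresh(token, now=None):
    """Return True if the token expires within the refresh margin.

    `token` is an OAuth2 token dict with an "expires_at" epoch timestamp,
    as commonly returned by OAuth2 client libraries (an assumption here).
    """
    now = time.time() if now is None else now
    return token.get("expires_at", 0) - now < REFRESH_MARGIN_SECONDS
```

A long extraction loop would call `needs_refresh` before each chunk of work and re-authenticate only when it returns True.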
{
    "text": "(a) The operational setback distance at the crest is defined with a Factor of Safety FoS > 1.1 for short term analysis of deep-seated failures and areas with less than a FoS of 1.1 are identifiable as a high-risk zone;",
    "meta": {"bounding_box": {"left": 2.4115}},
},
This is an example from a permit that contains the following structure:
Section (Uppercase letter) -> Paragraph (number) -> Subparagraph (lowercase letter) -> Clause (roman numeral) -> Subclause (lowercase letter)
It serves as a good test case because lowercase letters are in this case used for both subparagraphs and subclauses.
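The ambiguity described above can be made concrete with a small marker classifier: a lowercase-letter marker could sit at two levels, and a roman-numeral marker like "(i)" at three. The function and level names below are illustrative, not the pipeline's actual implementation:

```python
import re

# Nesting scheme from the comment above:
# Section (A.) -> Paragraph (1.) -> Subparagraph ((a)) ->
# Clause ((i)) -> Subclause ((a))
ROMAN = re.compile(r"\(([ivxlcdm]+)\)")
LOWER = re.compile(r"\(([a-z]+)\)")


def candidate_levels(marker):
    """Return the hierarchy levels a numbering marker could belong to."""
    if re.fullmatch(r"[A-Z]\.", marker):
        return ["section"]
    if re.fullmatch(r"\d+\.", marker):
        return ["paragraph"]
    if ROMAN.fullmatch(marker):
        # Roman numerals are also valid lowercase letters ("(i)", "(v)"),
        # so these markers are ambiguous across three levels.
        return ["subparagraph", "clause", "subclause"]
    if LOWER.fullmatch(marker):
        # Lowercase letters appear at both the subparagraph and
        # subclause levels in this permit.
        return ["subparagraph", "subclause"]
    return []
```

Resolving which level actually applies requires context (e.g. the marker's indentation or its parent in the document layout), which is where the layout information from Azure Document Intelligence helps.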
Quality Gate passed for 'bcgov-sonarcloud_mds_permits'
    "app.permit_conditions.converters.azure_document_intelligence_converter.DocumentAnalysisClient"
)
def test_run(mock_client, converter, tmp_path):
    os.environ["DEBUG_MODE"] = "faalse"
This typo probably doesn't matter; pointing it out anyway.
* [MDS-6086] Added support for async permit condition extraction using celery
* MDS-6086 Tweaks after PR feedback
* MDS-6986 Fixed tests + added github action to run permit service tests
* MDS-6086 Moved tests folder
* MDS-6086 Run elasticsearch using https locally
* MDS-6086 Fixed celery setup to accept certs
* Fixed cert job startup issue
* Fixes
* Permit condition extraction using azure document intelligence
* MDS-6086 Fixed tests
* Added more tests, cleanup
* Added more tests, cleanup, plug in GPT4 to answer questions
* MDS-6086 Added missing tests, cleanup
* Tweak sonar-project
* Update to reportPaths
* Updated tests
* Add missing test
* MDS-6086 Added tests for permit condition pipeline
* MDS-6086 Fix sonarcloud issues
Objective
MDS-6105
This PR includes changes to the permit condition extraction pipeline in an attempt to increase the accuracy of condition extraction, in addition to adding a couple of extra features.
Azure Document Intelligence for text extraction, OCR, layout analysis
Instead of extracting text from PDFs using PyPDF, we switched to Azure Document Intelligence (General Layout Model)
API Changes
The API response from the results endpoint now contains a meta dict, which includes the bounding box of the condition and answers to questions about the condition from GPT4 (defined in a prompt template yml file).
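Extrapolating from the test fixture elsewhere in this PR, a single condition in the response could look roughly like this. The `questions` field and the exact `bounding_box` keys are assumptions for illustration, not a full API spec:

```python
# Illustrative shape of one condition in the results response.
condition = {
    "text": "(a) The operational setback distance at the crest ...",
    "meta": {
        # Position of the condition on the source page, from the
        # layout analysis (only "left" appears in the PR's fixture).
        "bounding_box": {"left": 2.4115},
        # GPT4 answers, keyed by the questions in the prompt template
        # yml file (key name assumed for illustration).
        "questions": {},
    },
}
```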
Pipeline Changes
Previously, we essentially asked GPT4 to do everything: extract all the text, answer questions, and derive the structure. This yielded pretty good results, but for larger documents it always made some mistakes (e.g. missing parent clauses, list item inconsistencies, or mismatches in section numbers). With the introduction of Azure AI Document Intelligence, the approach is now more manual: the layout model extracts the text and structure, and GPT4 is only used to answer meta questions about each condition.
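The split described above can be sketched as a small enrichment step: conditions arrive already extracted and structured by the layout model, and the language model is only asked questions about each one. All names here are illustrative, not the PR's actual interfaces:

```python
def enrich_conditions(conditions, ask_meta_questions):
    """Attach model answers to already-extracted conditions.

    `conditions` is a list of dicts with "text" and "meta" keys (as in the
    extraction output); `ask_meta_questions` is any callable that takes a
    condition's text and returns a dict of answers (e.g. a GPT4 call).
    """
    enriched = []
    for condition in conditions:
        meta = dict(condition.get("meta", {}))
        # The LLM no longer extracts text or derives structure; it only
        # answers questions about a condition the layout model found.
        meta["questions"] = ask_meta_questions(condition["text"])
        enriched.append({**condition, "meta": meta})
    return enriched
```

Because the question-answering step is isolated behind a callable, it can be swapped for a stub in tests while the layout-driven extraction is exercised separately.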