Refactor[Requests]: Validate requests from endpoints using pydantic #35
Conversation
Walkthrough
The pull request introduces several changes across multiple files, primarily focusing on enhancing the functionality and structure of data processing and API interactions. Key updates include modifications to the request helpers in backend/app/requests, which now validate API responses with Pydantic models.
Changes
Possibly related PRs
Suggested reviewers
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Codecov Report
Attention: Patch coverage is 50.00% with 4 lines in your changes missing coverage.
Additional details and impacted files

@@              Coverage Diff               @@
##           release/v0.3.0      #35      +/-   ##
==================================================
+ Coverage           54.68%   57.15%   +2.46%
==================================================
  Files                  37       37
  Lines                1772     1706      -66
==================================================
+ Hits                  969      975       +6
+ Misses                803      731      -72

☔ View full report in Codecov by Sentry.
Actionable comments posted: 10
🧹 Outside diff range and nitpick comments (7)
backend/app/requests/schemas.py (3)
5-7: Consider adding field description and validation.

While the model is correctly defined, consider adding:
- Field description using `Field` for better API documentation
- Validation to ensure page numbers are positive integers
-    page_number: Optional[int] = None
+    page_number: Optional[int] = Field(
+        None,
+        description="The page number where the sentence appears in the document",
+        ge=0
+    )
8-11: Add text validation and field descriptions.

Consider adding constraints and documentation for the text field:

-    text: str
+    text: str = Field(
+        ...,
+        description="The actual text content of the sentence",
+        min_length=1
+    )
-    metadata: Optional[SentenceMetadata] = None
+    metadata: Optional[SentenceMetadata] = Field(
+        None,
+        description="Additional metadata about the sentence"
+    )
12-16: Enhance field validation and documentation.

Add proper validation for language codes and word count:

+from typing import Literal
+
 class TextExtractionResponse(BaseModel):
-    content: List[StructuredSentence]
-    word_count: int
-    lang: str
+    content: List[StructuredSentence] = Field(
+        ...,
+        description="List of structured sentences extracted from the document"
+    )
+    word_count: int = Field(
+        ...,
+        description="Total number of words in the extracted text",
+        ge=0
+    )
+    lang: str = Field(
+        ...,
+        description="ISO 639-1 language code of the text",
+        min_length=2,
+        max_length=2,
+        regex="^[a-z]{2}$"
+    )

backend/app/processing/file_preprocessing.py (1)
31-45: LGTM! Consider adding type hints for better code clarity.

The refactoring from list comprehension to explicit loops improves readability and maintainability. The metadata handling with fallback is well implemented.

Consider adding type hints to improve code clarity:

-    docs = []
-    metadatas = []
+    docs: list[str] = []
+    metadatas: list[dict] = []

backend/app/requests/__init__.py (2)
34-46: Consider reviewing the timeout duration.

The 360-second timeout seems quite long for a text extraction operation. Consider:
- Adding the timeout as a configurable setting
- Implementing a more reasonable default timeout
- Adding retry logic for long-running operations (see the sketch below)
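One way to act on these suggestions is to pull the timeout from settings and retry transient failures at the session level. A rough sketch, assuming a hypothetical `settings.api_request_timeout` and the standard `requests`/`urllib3` retry helpers (nothing here is taken from the PR itself):

```python
# Illustrative only: the setting name and retry policy are assumptions, not
# part of the current codebase.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries: int = 3, backoff_factor: float = 1.0) -> requests.Session:
    """Create a session that retries transient failures with exponential backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,
        status_forcelist=[502, 503, 504],
        allowed_methods=["POST"],  # extraction calls are POST requests
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    session.mount("http://", HTTPAdapter(max_retries=retry))
    return session


# Usage sketch: read the timeout from settings instead of hardcoding 360 seconds.
# session = build_session()
# response = session.post(url, files=files, headers=headers,
#                         timeout=settings.api_request_timeout)  # hypothetical setting
```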
87-87: Consider making the references parameter configurable.

The `references` parameter is hardcoded to `True`. Consider making this configurable if there are cases where references aren't needed.

-        params={"references": True}
+        params={"references": settings.INCLUDE_REFERENCES}  # Add to settings

backend/app/processing/process_queue.py (1)
Line range hint 303-308: Avoid duplicate appending of page numbers

There is a possibility that the page numbers are being appended multiple times to `page_numbers`, especially if both conditionals lead to appending. Please ensure that page numbers are only appended when appropriate, and consider deduplicating the `page_numbers` list if necessary (a small dedup sketch follows below).
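If deduplication turns out to be needed, an order-preserving pass is enough. A minimal sketch, assuming `page_numbers` is a flat list of integers as the review implies:

```python
def dedupe_page_numbers(page_numbers: list[int]) -> list[int]:
    """Drop duplicate page numbers while keeping first-seen order."""
    return list(dict.fromkeys(page_numbers))


assert dedupe_page_numbers([1, 1, 2, 3, 2]) == [1, 2, 3]
```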
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (6)
- backend/app/processing/file_preprocessing.py (3 hunks)
- backend/app/processing/process_queue.py (5 hunks)
- backend/app/requests/__init__.py (3 hunks)
- backend/app/requests/schemas.py (1 hunks)
- backend/app/utils.py (0 hunks)
- backend/tests/processing/test_process_queue.py (4 hunks)
💤 Files with no reviewable changes (1)
- backend/app/utils.py
🧰 Additional context used
🔇 Additional comments (9)
backend/app/requests/schemas.py (1)
1-4: LGTM! Clean and appropriate imports.

The imports are well-structured and include all necessary typing utilities and Pydantic components.
backend/app/processing/file_preprocessing.py (1)
91-91: Verify the impact of removing the asset.type parameter.

The removal of the `asset.type` parameter from the `extract_text_from_file` call might affect other parts of the codebase.

Let's verify other usages of this function:
backend/tests/processing/test_process_queue.py (1)
1-1: LGTM! Import aligns with pydantic refactoring.

The addition of the ExtractFieldsResponse import supports the PR objective of using pydantic for request validation.
backend/app/requests/__init__.py (5)
4-4: LGTM! Import changes align with pydantic validation.

The addition of pydantic schema imports supports the PR's objective of structured request validation.
51-52: LGTM! Improved response handling with pydantic.

The use of the `TextExtractionResponse` pydantic model ensures proper response validation and typing.
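For context, this is roughly what that validation buys at the call site. The import path and the sample payload below are assumptions for illustration; only the field names come from schemas.py:

```python
# Sketch of validating a parse response through the Pydantic model; the payload
# is made up and only mirrors the fields defined in backend/app/requests/schemas.py.
from pydantic import ValidationError

from app.requests.schemas import TextExtractionResponse  # assumed import path

payload = {
    "content": [{"text": "Hello world.", "metadata": {"page_number": 1}}],
    "word_count": 2,
    "lang": "en",
}

try:
    extraction = TextExtractionResponse(**payload)
    print(extraction.word_count)  # typed attribute access instead of dict indexing
except ValidationError as err:
    # A malformed response fails fast with a structured error instead of
    # surfacing later as a KeyError deeper in the pipeline.
    print(err)
```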
95-97: LGTM! Well-structured response handling.

The use of the `ExtractFieldsResponse` pydantic model ensures proper validation and typing of the extracted data.
Line range hint 99-103: LGTM! Comprehensive error handling.

The credit limit exception handling is well-implemented with appropriate error details and logging.
34-34: Verify the impact of API endpoint changes.

The endpoint has changed from `/v1/extract/file/content` to `/v1/parse`. This breaking change needs careful consideration.

Also applies to: 42-42
backend/app/processing/process_queue.py (1)
163-163: Ensure proper process status handling when re-queuing

When not all process steps are ready, the process is re-added to the execution queue without updating its status. Please verify that this behavior correctly maintains the process status and does not lead to unintended side effects, such as processes being stuck in an inconsistent state.
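One defensive option is to make the re-queue observable in the process status. The sketch below is self-contained and purely illustrative; the enum values and queue helper are stand-ins, not the project's actual API:

```python
from enum import Enum
from queue import Queue


class ProcessStatus(Enum):  # stand-in for the app's real enum
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    STOPPED = "stopped"


def requeue_if_not_ready(process: dict, all_steps_ready: bool, queue: Queue) -> bool:
    """Re-add the process to the queue, recording the wait state explicitly."""
    if all_steps_ready:
        return False
    process["status"] = ProcessStatus.PENDING  # make the wait visible to observers
    queue.put(process["id"])
    return True


# Usage with toy data:
q: Queue = Queue()
proc = {"id": 7, "status": ProcessStatus.IN_PROGRESS}
requeue_if_not_ready(proc, all_steps_ready=False, queue=q)
assert proc["status"] is ProcessStatus.PENDING and q.get() == 7
```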
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (1)
backend/app/requests/__init__.py (1)
Line range hint 148-148: Fix file handle leak in highlight_sentences_in_pdf.

Similar to the earlier issue, this function also has a file handle leak.

Apply this change:

-    file = open(file_path, "rb")
-    files["file"] = (os.path.basename(file_path), file)
+    with open(file_path, "rb") as file:
+        files["file"] = (os.path.basename(file_path), file)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- backend/app/processing/file_preprocessing.py (4 hunks)
- backend/app/processing/process_queue.py (5 hunks)
- backend/app/requests/__init__.py (4 hunks)
- backend/tests/processing/test_process_queue.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- backend/app/processing/file_preprocessing.py
- backend/app/processing/process_queue.py
- backend/tests/processing/test_process_queue.py
🧰 Additional context used
🔇 Additional comments (2)
backend/app/requests/__init__.py (2)
4-4: LGTM! Import changes align with Pydantic integration.

The addition of Pydantic models for response validation improves type safety and request handling.
60-60: LGTM! Good improvements in type safety and resource management.

The changes show several improvements:
- Added return type hint for better type safety
- Proper file handling with context manager
- Enhanced error handling for credit limit scenarios
Also applies to: 74-75, 87-87, 92-97
backend/app/requests/__init__.py
Outdated
 def extract_text_from_file(api_token: str, file_path: str, metadata: bool=True) -> TextExtractionResponse:
     # Prepare the headers with the Bearer token
     headers = {"x-authorization": f"Bearer {api_token}"}
     files = {}
     file = open(file_path, "rb")
     files["file"] = (os.path.basename(file_path), file)

     response = requests.post(
-        f"{settings.pandaetl_server_url}/v1/extract/file/content",
+        f"{settings.pandaetl_server_url}/v1/parse",
         files=files,
         headers=headers,
         timeout=360,
         params={"metadata": metadata}
Critical: Fix file handle leak while maintaining new improvements.
While the type hints and metadata parameter are good improvements, there are critical issues that need to be addressed:
- The file handle is not properly closed (potential resource leak)
- Error logging might expose sensitive information
Apply these changes:
-def extract_text_from_file(api_token: str, file_path: str, metadata: bool=True) -> TextExtractionResponse:
-    headers = {"x-authorization": f"Bearer {api_token}"}
-    files = {}
-    file = open(file_path, "rb")
-    files["file"] = (os.path.basename(file_path), file)
+def extract_text_from_file(api_token: str, file_path: str, metadata: bool = True) -> TextExtractionResponse:
+    headers = {"x-authorization": f"Bearer {api_token}"}
+    files = {}
+
+    with open(file_path, "rb") as file:
+        files["file"] = (os.path.basename(file_path), file)
+        response = requests.post(
+            f"{settings.pandaetl_server_url}/v1/parse",
+            files=files,
+            headers=headers,
+            timeout=360,
+            params={"metadata": metadata}
+        )
Also, consider sanitizing the error logging to exclude potentially sensitive response data.
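A minimal sketch of what sanitized failure logging could look like; the helper name and logger setup are assumptions, not code from this module:

```python
# Log only the status code, URL, and a truncated body snippet so tokens and
# full response payloads never end up in application logs.
import logging

import requests

logger = logging.getLogger(__name__)


def log_request_failure(response: requests.Response, max_chars: int = 200) -> None:
    """Record enough to debug a failed call without dumping the whole payload."""
    snippet = response.text[:max_chars].replace("\n", " ")
    logger.error(
        "Extraction request failed: status=%s url=%s body_snippet=%r",
        response.status_code,
        response.url,
        snippet,
    )
```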
Also applies to: 51-52
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
backend/tests/processing/test_process_queue.py (2)
51-58: Fix inconsistent indentation in mock data structure.

The indentation in the mock data structure is inconsistent, which affects readability.

Apply this diff to fix the indentation:

     mock_extract_data.return_value = ExtractFieldsResponse(fields=[{"field1": "value1"}],
         references=[[{
-        "name": "ESG_Reporting_Assurance",
-        "sources": [
-            "Assurance"
-        ]
-        }]]
+            "name": "ESG_Reporting_Assurance",
+            "sources": [
+                "Assurance"
+            ]
+        }]]
     )
69-69: Add type validation for page_numbers array.

While the assertion checks the structure, it would be beneficial to validate that `page_numbers` is an empty array rather than any falsy value.

-    assert result["context"] == [[{'name': 'ESG_Reporting_Assurance', 'sources': ['Assurance'], 'page_numbers': []}]]
+    assert result["context"] == [[{'name': 'ESG_Reporting_Assurance', 'sources': ['Assurance'], 'page_numbers': list()}]]
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- backend/tests/processing/test_process_queue.py (4 hunks)
🧰 Additional context used
🔇 Additional comments (2)
backend/tests/processing/test_process_queue.py (2)
166-173: Previous review comment about reducing test data duplication is still applicable.
Line range hint 1-173: Add test cases to improve process_queue.py coverage.

According to the coverage report, process_queue.py has 4 lines missing coverage (50% patch coverage). Consider adding test cases for:
- Error scenarios in extract_process
- Edge cases in ChromaDB initialization
- Different status transitions in update_process_step_status
Let's identify the specific uncovered lines:
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
backend/tests/processing/test_process_queue.py (1)
51-58: Improve readability of mock response structure.

The mock data structure is correct but could be more readable with consistent indentation.

    mock_extract_data.return_value = ExtractFieldsResponse(
        fields=[{"field1": "value1"}],
        references=[[{
            "name": "ESG_Reporting_Assurance",
            "sources": ["Assurance"]
        }]]
    )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- backend/app/requests/__init__.py (4 hunks)
- backend/app/requests/schemas.py (1 hunks)
- backend/tests/processing/test_process_queue.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- backend/app/requests/__init__.py
- backend/app/requests/schemas.py
🧰 Additional context used
🔇 Additional comments (2)
backend/tests/processing/test_process_queue.py (2)
1-1: LGTM! Import aligns with Pydantic validation objectives.

The import of ExtractFieldsResponse is correctly placed and supports the PR's goal of validating requests using Pydantic.
70-70: LGTM! Comprehensive assertion for context structure.

The assertion properly validates the expected format of the context data.
    mock_extract_data.return_value = ExtractFieldsResponse(fields=[{"field1": "value1"}],
        references=[[{
        "name": "ESG_Reporting_Assurance",
        "sources": [
            "Assurance"
        ]
        }]]
    )
💡 Codebase verification
Based on the code paths found in process_queue.py and the existing test coverage, I can now provide a specific response about the missing test coverage.
Add test cases for these specific untested scenarios
The following critical paths lack test coverage (a sketch of one such test follows after the lists below):
- Process status checks (`ProcessStatus.STOPPED` condition)
- Asset preprocessing readiness checks (`all_process_steps_ready`)
- Error handling for empty process steps ("No process found!" exception)
- Document content validation (word count > 500 check)
- Previous/next sentence handling in vectorstore operations
- Empty PDF content handling
The existing tests only cover:
- Basic exception handling (ValueError and CreditLimitExceeded)
- Happy path for extract process
- Process step status updates
- Short reference matching
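A minimal, self-contained sketch of the missing error-path style of test. `execute_process` below is a stand-in rather than the project's real entry point; only the `pytest.raises`/mocked-repository pattern and the "No process found!" message come from this review:

```python
from unittest.mock import MagicMock

import pytest


def execute_process(process_id: int, repo) -> None:
    """Stand-in for the real queue entry point; only the error path matters here."""
    steps = repo.get_process_steps(process_id)
    if not steps:
        raise Exception("No process found!")


def test_execute_process_raises_when_no_steps():
    repo = MagicMock()
    repo.get_process_steps.return_value = []  # simulate an empty step list

    with pytest.raises(Exception, match="No process found!"):
        execute_process(process_id=1, repo=repo)
```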
🔗 Analysis chain
Add test cases to improve coverage.
According to the PR comments, the patch coverage for process_queue.py is at 50.00% with 4 lines missing coverage. Consider adding test cases for error scenarios and edge cases to improve coverage.
Let me gather more information about the test coverage by examining the existing test file and comparing it with the code paths we found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Identify untested code paths in process_queue.py
# Test: Search for conditional statements and error handling in process_queue.py
rg -A 3 "if |try:|except:|raise" backend/app/processing/process_queue.py
Length of output: 6695
Script:
#!/bin/bash
# Check existing test coverage for the identified code paths
# Look for test cases in test_process_queue.py
rg -A 3 "def test_|async def test_" backend/tests/processing/test_process_queue.py
# Look for mocked error scenarios
rg "raise|except" backend/tests/processing/test_process_queue.py
Length of output: 2159
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests