
feat(clp-package): Add support for extracting JSON streams from archives. #569

Merged
merged 40 commits
Nov 18, 2024

Conversation

haiqi96
Contributor

@haiqi96 haiqi96 commented Oct 31, 2024

Description

This PR adds a JSON extraction task to the clp-package to prepare for log-viewer support. The PR contains the following changes:

  1. Added a new JsonExtractionJob to the query-scheduler.
  2. Refactored the query scheduler and extract_task so that they support both IR extraction and JSON extraction.
  3. Renamed the extraction jobs so that IR extraction and JSON extraction jobs are categorized as "stream extraction" jobs, and refactored multiple places in the codebase to use the new naming standard.
  4. Added support for submitting a JSON extraction job from the decompress.sh script.

Validation performed

For CLP (unstructured):
Confirmed that a compressed file can be opened in the log viewer via the CLP WebUI.

For CLP-json:
Confirmed that a JSONL file can be extracted when submitting a job via the decompress.sh script.

Summary by CodeRabbit

  • New Features

    • Introduced support for JSON extraction alongside existing IR extraction capabilities.
    • Added new command options for JSON extraction and updated relevant job handling logic.
    • Enhanced user interface to accommodate JSON extraction queries.
  • Bug Fixes

    • Improved error handling for extraction job submissions.
  • Documentation

    • Updated configuration files and environment variables to reflect the transition from IR to stream terminology.
  • Refactor

    • Renamed variables and methods to align with the new streaming data focus across multiple components.
    • Updated naming conventions for MongoDB collections and related configurations.

Contributor

coderabbitai bot commented Oct 31, 2024

Walkthrough

The pull request introduces significant changes across multiple files to support JSON extraction alongside existing IR extraction functionalities. Key modifications include the addition of new constants, classes, and methods, as well as renaming attributes and functions to reflect a shift from "IR" to "stream" terminology. These updates enhance command handling, job configuration, and overall flexibility in data extraction processes, ensuring that both IR and JSON types can be processed effectively.

Changes

File Change Summary
components/clp-package-utils/clp_package_utils/general.py Added constant EXTRACT_JSON_CMD, renamed ir_output_dir to stream_output_dir, updated references in generate_container_config and validate_worker_config to accommodate new naming conventions.
components/clp-package-utils/clp_package_utils/scripts/decompress.py Renamed handle_extract_ir_cmd to handle_extract_stream_cmd, added support for JSON extraction, updated command handling logic, and introduced a new parser for JSON extraction commands.
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py Renamed submit_and_monitor_ir_extraction_job_in_db to submit_and_monitor_extraction_job_in_db, updated to handle multiple extraction job types, and modified command handling to support both IR and JSON extraction.
components/job-orchestration/job_orchestration/scheduler/constants.py Added new enum value EXTRACT_JSON to QueryJobType.
components/job-orchestration/job_orchestration/scheduler/job_config.py Introduced ExtractJsonJobConfig class with attributes archive_id and target_chunk_size.
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py Added StreamExtractionHandle, IrExtractionHandle, and JsonExtractionHandle classes, updated job handling logic to support both extraction types, and improved error handling with new utility functions.
components/job-orchestration/job_orchestration/scheduler/scheduler_data.py Added ExtractJsonJob class and modified ExtractIrJob class to remove specific properties.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py Updated parameter names from "ir" to "stream" in various functions, reflecting a shift in terminology.
components/clp-py-utils/clp_py_utils/clp_config.py Renamed ir_output to stream_output, introduced StreamOutput class, and updated validation logic accordingly.
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py Changed command-line argument from --ir-collection to --stream-collection, updating associated logic.
components/job-orchestration/job_orchestration/executor/query/celeryconfig.py Updated import and task routing from extract_ir_task to extract_stream_task.
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py Updated make_command function to handle both IR and JSON extraction, renamed extract_ir to extract_stream, and modified parameters accordingly.
components/log-viewer-webui/server/settings.json Renamed MongoDB configuration keys from MongoDbIrFilesCollectionName to MongoDbStreamFilesCollectionName and updated directory paths.
components/log-viewer-webui/server/src/DbManager.js Renamed variables and methods to reflect changes from IR to stream, added new method submitAndWaitForExtractStreamJob.
components/log-viewer-webui/server/src/app.js Updated MongoDB collection name from irFilesCollectionName to streamFilesCollectionName.
components/log-viewer-webui/server/src/routes/static.js Renamed irFilesDir to streamFilesDir and updated static file serving logic accordingly.
components/package-template/src/etc/clp-config.yml Changed ir_collection_name to stream_collection_name and removed ir_output section, adding a new stream_output section.
components/log-viewer-webui/server/.env Removed IR_DATA_DIR and added STREAMS_DATA_DIR with updated path.
components/log-viewer-webui/client/src/ui/QueryStatus.jsx Introduced functionality for handling JSON extraction queries, modified control flow to accommodate new parameters.

Possibly related PRs

Suggested reviewers

  • davemarco

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1d1599e and a5eb217.

📒 Files selected for processing (1)
  • components/package-template/src/etc/clp-config.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/package-template/src/etc/clp-config.yml


@haiqi96 haiqi96 marked this pull request as draft October 31, 2024 01:40
@haiqi96 haiqi96 changed the title from "Draft: add" to "Draft: add json extraction flow to package" Oct 31, 2024
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

🧹 Outside diff range and nitpick comments (7)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)

75-75: Consider adding a docstring to document the new enum value.

Since this is a new feature for JSON extraction, it would be helpful to add documentation explaining its purpose and usage.

Add a docstring to the enum class:

 class QueryJobType(IntEnum):
+    """Enum representing different types of query jobs.
+    
+    SEARCH_OR_AGGREGATION: Standard search or aggregation query
+    EXTRACT_IR: Extract IR (Intermediate Representation) from logs
+    EXTRACT_JSON: Extract JSON content from logs
+    """
     SEARCH_OR_AGGREGATION = 0
     EXTRACT_IR = auto()
     EXTRACT_JSON = auto()
components/job-orchestration/job_orchestration/scheduler/job_config.py (2)

52-55: Add class documentation and field validation.

The class implementation looks good, but could benefit from these improvements:

  1. Add a docstring describing the purpose of the class and its attributes
  2. Consider adding validation for archive_id to ensure it's not empty
  3. Consider adding bounds validation for target_chunk_size when present

Here's a suggested implementation:

 class ExtractJsonJobConfig(QueryJobConfig):
+    """Configuration for JSON extraction jobs.
+
+    Attributes:
+        archive_id: Unique identifier for the archive to extract from
+        target_chunk_size: Optional size limit for extracted chunks in bytes
+    """
     archive_id: str
     target_chunk_size: typing.Optional[int] = None
+
+    @validator("archive_id")
+    def validate_archive_id(cls, v):
+        if not v.strip():
+            raise ValueError("archive_id cannot be empty")
+        return v
+
+    @validator("target_chunk_size")
+    def validate_chunk_size(cls, v):
+        if v is not None and v <= 0:
+            raise ValueError("target_chunk_size must be positive when specified")
+        return v
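Note that the sketch above assumes pydantic's field validator decorator is available in job_config.py; with pydantic v1 that would require an import along the lines of:

from pydantic import validator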

55-55: Remove extra blank line.

There's an unnecessary extra blank line after the class definition. Maintain consistent spacing with other classes in the file.

 class ExtractJsonJobConfig(QueryJobConfig):
     archive_id: str
     target_chunk_size: typing.Optional[int] = None
-
components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (1)

71-79: Consider adding class documentation

Please add a docstring to describe:

  • The purpose of this job type
  • Expected configuration parameters
  • Any specific behaviour or constraints

Example addition:

 class ExtractJsonJob(QueryJob):
+    """Represents a job for extracting JSON data from archives.
+
+    This job type handles the extraction of JSON-formatted data using the specified
+    configuration parameters from ExtractJsonJobConfig.
+    """
     extract_json_config: ExtractJsonJobConfig
components/job-orchestration/job_orchestration/executor/query/extract_ir_task.py (1)

34-34: Standardize the capitalization in log messages

Consider standardizing the capitalization of 'JSON' in your log messages to align with common conventions. Use 'Start JSON extraction' instead of 'Start Json extraction' for consistency and clarity.

Also applies to: 52-52

components/clp-package-utils/clp_package_utils/scripts/decompress.py (1)

Line range hint 167-167: Use the appropriate JobType when generating the container name.

At line 167, the container_name is generated using JobType.IR_EXTRACTION regardless of the actual job type. This may cause confusion or incorrect container labelling when handling JSON extraction commands.

Modify the code to select the correct JobType based on the job_command. For example:

-container_name = generate_container_name(JobType.IR_EXTRACTION)
+if job_command == EXTRACT_IR_CMD:
+    job_type = JobType.IR_EXTRACTION
+elif job_command == EXTRACT_JSON_CMD:
+    job_type = JobType.JSON_EXTRACTION
+else:
+    logger.exception(f"Unexpected command: {job_command}")
+    return -1
+
+container_name = generate_container_name(job_type)

Ensure that JobType.JSON_EXTRACTION is defined in clp_package_utils.general.

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

940-952: Address the TODO: improve variable names for better clarity

There's a TODO comment suggesting finding better names for waiting_jobs_map and job_key. Improving these names would enhance code readability and maintainability.

Would you like assistance in proposing more descriptive variable names? I can help generate suggestions or open a GitHub issue to track this task.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b92a6c6 and 496e8d8.

📒 Files selected for processing (8)
  • components/clp-package-utils/clp_package_utils/general.py (1 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (4 hunks)
  • components/job-orchestration/job_orchestration/executor/query/extract_ir_task.py (5 hunks)
  • components/job-orchestration/job_orchestration/scheduler/constants.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/job_config.py (1 hunks)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (14 hunks)
  • components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • components/clp-package-utils/clp_package_utils/general.py
🔇 Additional comments (10)
components/job-orchestration/job_orchestration/scheduler/constants.py (1)

75-75: The enum addition looks good, let's verify its usage.

The addition of EXTRACT_JSON to QueryJobType follows the correct pattern and maintains enum value consistency.

Let's verify the implementation of this new job type across the codebase:

✅ Verification successful

The EXTRACT_JSON job type is properly integrated across the codebase

The verification shows comprehensive implementation of the EXTRACT_JSON functionality:

  • Properly defined in job orchestration constants
  • Fully implemented job config class ExtractJsonJobConfig with required fields
  • Integrated into the query scheduler with proper handling
  • Command-line interface support in CLP package utils
  • Consistent error handling and job state management
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the implementation of EXTRACT_JSON job type
# Expected: Find corresponding handler implementations and configurations

# Check for job type usage
echo "Checking for EXTRACT_JSON usage:"
rg "EXTRACT_JSON" -A 3

# Look for related JSON extraction implementations
echo -e "\nChecking for JSON extraction related implementations:"
rg "ExtractJson(Job|Config)" -A 3

Length of output: 12690

components/job-orchestration/job_orchestration/scheduler/job_config.py (1)

52-54: Verify integration with dependent components.

The class looks well-structured, but let's verify its integration with other components.

✅ Verification successful

Integration verification completed successfully

The ExtractJsonJobConfig class is properly integrated across the codebase:

  • Correctly defined in QueryJobType enum as EXTRACT_JSON
  • Properly used in query scheduler for job handling
  • Consistently used in decompress utilities with correct configuration parameters
  • Well-integrated with the job orchestration system
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage and integration of ExtractJsonJobConfig

# Test 1: Check for consistent usage in decompress.py
echo "Checking decompress.py usage patterns..."
rg -A 5 "ExtractJsonJobConfig" "components/clp-package-utils/clp_package_utils/scripts/native/decompress.py"

# Test 2: Verify QueryJobType enum includes EXTRACT_JSON
echo "Verifying QueryJobType enum..."
rg -A 10 "class QueryJobType" "components/job-orchestration/job_orchestration/scheduler/query/"

Length of output: 746


Script:

#!/bin/bash
# Let's try a different approach to verify the integration

# Check for QueryJobType enum in the entire codebase
echo "Searching for QueryJobType enum definition..."
rg -A 10 "class QueryJobType.*Enum" 

# Check all imports and usages of ExtractJsonJobConfig
echo "Checking ExtractJsonJobConfig usages..."
rg "ExtractJsonJobConfig" -A 3

# Check for any job type constants or enums related to EXTRACT_JSON
echo "Checking for EXTRACT_JSON job type..."
rg "EXTRACT_JSON" -B 2 -A 2

Length of output: 14013

components/job-orchestration/job_orchestration/scheduler/scheduler_data.py (2)

14-14: Import addition looks good!

The ExtractJsonJobConfig import follows the established pattern and is properly grouped with related job configuration imports.


71-79: Implementation is correct and well-structured!

The ExtractJsonJob class properly implements the QueryJob interface and follows the established patterns seen in ExtractIrJob.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

147-150: Ensure 'extraction_config' is initialized before use

If an unsupported job type is passed, extraction_config might be undefined when calling submit_and_monitor_extraction_job_in_db. Verify that extraction_config is always initialized before it is used.

Run the following script to check for potential use of uninitialized variables:

components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (5)

45-49: Correctly imported ExtractJsonJobConfig

The addition of ExtractJsonJobConfig to the imports ensures proper configuration handling for JSON extraction jobs.


58-58: Added ExtractJsonJob to scheduler imports

Including ExtractJsonJob in the imports allows the scheduler to manage JSON extraction jobs appropriately.


74-75: Initialization of active_archive_json_extractions

The new dictionary active_archive_json_extractions is properly initialized to track JSON extraction jobs and their associated waiting jobs.


Line range hint 380-389: Ensure correct task function is used for EXTRACT_JSON jobs

The code dispatches both EXTRACT_JSON and EXTRACT_IR job types using the extract_ir.s task function. Please verify if extract_ir.s is appropriate for EXTRACT_JSON jobs or if a separate task function should be used.


803-808: Verify use of 'ir_collection' when checking for JSON files

In json_file_exists_for_archive, the function checks the ir_collection for orig_file_id matching archive_id to determine if the JSON file exists. Please verify if accessing ir_collection is appropriate for JSON files or if a different collection should be used.

    results_cache_uri: str,
    ir_collection: str,
) -> Optional[List[str]]:
    if StorageEngine.CLP == storage_engine:
        logger.info("Start IR extraction")
        extract_ir_config = ExtractIrJobConfig.parse_obj(job_config_obj)

⚠️ Potential issue

Add error handling when parsing job configuration objects

When parsing job_config_obj into ExtractIrJobConfig and ExtractJsonJobConfig, consider handling potential ValidationError exceptions that Pydantic may raise if the input data is invalid. Implementing try-except blocks will enhance the robustness of your code by preventing unhandled exceptions.

Also applies to: 53-53
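
As a rough sketch of the suggested guard (not the repository's actual code; it assumes pydantic's ValidationError and the module-level logger used elsewhere in the task):

from pydantic import ValidationError

try:
    extract_ir_config = ExtractIrJobConfig.parse_obj(job_config_obj)
except ValidationError as e:
    logger.error(f"Invalid IR extraction job configuration: {e}")
    return None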

@@ -63,7 +84,7 @@ def extract_ir(
    clp_metadata_db_conn_params: dict,
    results_cache_uri: str,
) -> Dict[str, Any]:

🛠️ Refactor suggestion

Rename the extract_ir function to reflect its generalized purpose

Since the extract_ir function now handles both IR and JSON extraction tasks, consider renaming it to reflect its broader functionality. A name like extract_data or perform_extraction would more accurately describe the function's purpose.

@haiqi96 haiqi96 force-pushed the json_extraction_temp branch from 4a06a8d to be75c9c Compare November 5, 2024 16:58
@haiqi96 haiqi96 changed the title from "Draft: add json extraction flow to package" to "clp-package: Add support for Json extraction task" Nov 5, 2024
@haiqi96 haiqi96 marked this pull request as ready for review November 5, 2024 21:29
@kirkrodrigues kirkrodrigues self-requested a review November 5, 2024 21:54
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (13)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (1)

31-35: Consider enhancing index creation with monitoring and performance options.

While the index creation logic is sound, consider these improvements:

  1. Add success logging for better operational monitoring
  2. Consider adding index options like background: true for large collections

Here's a suggested enhancement:

     try:
         with MongoClient(results_cache_uri) as results_cache_client:
             stream_collection = results_cache_client.get_default_database()[stream_collection_name]
 
-            file_split_id_index = IndexModel(["file_split_id"])
-            orig_file_id_index = IndexModel(["orig_file_id", "begin_msg_ix", "end_msg_ix"])
-            stream_collection.create_indexes([file_split_id_index, orig_file_id_index])
+            file_split_id_index = IndexModel(["file_split_id"], background=True)
+            orig_file_id_index = IndexModel(
+                ["orig_file_id", "begin_msg_ix", "end_msg_ix"],
+                background=True
+            )
+            stream_collection.create_indexes([file_split_id_index, orig_file_id_index])
+            logger.info("Successfully created indices for stream collection")
components/log-viewer-webui/server/src/app.js (1)

40-40: Consider using consistent casing for the property name.

The property name StreamFilesCollectionName uses PascalCase, which is inconsistent with JavaScript/Node.js conventions where object properties typically use camelCase. Consider renaming it to streamFilesCollectionName.

-                StreamFilesCollectionName: settings.MongoDbStreamFilesCollectionName,
+                streamFilesCollectionName: settings.MongoDbStreamFilesCollectionName,
components/log-viewer-webui/server/src/DbManager.js (3)

87-87: Follow JavaScript naming conventions for private fields

The private field name #StreamFilesCollection should follow JavaScript camelCase naming conventions for variables and private fields.

Apply this change:

-    #StreamFilesCollection;
+    #streamFilesCollection;

Line range hint 141-145: Update method name and documentation to use Stream terminology

The method name and documentation still reference "Ir"/"IR" while the implementation uses the new Stream terminology. This inconsistency should be addressed for clarity and maintainability.

Consider these changes:

  1. Rename the method from getExtractedIrFileMetadata to getExtractedStreamFileMetadata
  2. Update the JSDoc comment to replace "IR file" with "Stream file"
-     * Gets the metadata for an IR file extracted from part of an original file, where the original
+     * Gets the metadata for a Stream file extracted from part of an original file, where the original
      * file has the given ID and the extracted part contains the given log event index.
      *
      * @param {string} origFileId
      * @param {number} logEventIdx
      * @return {Promise<object>} A promise that resolves to the extracted IR file's metadata.
      */
-    async getExtractedIrFileMetadata (origFileId, logEventIdx) {
+    async getExtractedStreamFileMetadata (origFileId, logEventIdx) {

Line range hint 180-191: Improve parameter naming and code style

Two suggestions for improvement:

  1. The parameter name should follow camelCase convention
  2. The collection assignment can be more concise

Apply these changes:

-     * @param {string} config.StreamFilesCollectionName
+     * @param {string} config.streamFilesCollectionName
      */
    #initMongo (config) {
        this.#fastify.register(fastifyMongo, {
            forceClose: true,
            url: `mongodb://${config.host}:${config.port}/${config.database}`,
        }).after((err) => {
            if (err) {
                throw err;
            }
-            this.#StreamFilesCollection =
-                this.#fastify.mongo.db.collection(config.StreamFilesCollectionName);
+            this.#streamFilesCollection = this.#fastify.mongo.db.collection(config.streamFilesCollectionName);
        });
    }
components/clp-package-utils/clp_package_utils/scripts/decompress.py (2)

Line range hint 169-171: Update JobType for JSON extraction.

The container name is generated using JobType.IR_EXTRACTION even when handling JSON extraction commands. Consider adding a new job type for JSON extraction or using a more generic type.

-    container_name = generate_container_name(JobType.IR_EXTRACTION)
+    container_name = generate_container_name(
+        JobType.IR_EXTRACTION if job_command == EXTRACT_IR_CMD else JobType.JSON_EXTRACTION
+    )

256-261: Enhance JSON extraction parser configuration.

Consider the following improvements:

  1. Move the default chunk size to a named constant for better maintainability
  2. Enhance the help text to provide more context about the parameters
+DEFAULT_JSON_CHUNK_SIZE = 100000  # Add at the top with other constants
+
     # Json extraction command parser
     json_extraction_parser = command_args_parser.add_parser(EXTRACT_JSON_CMD)
-    json_extraction_parser.add_argument("archive_id", type=str, help="Archive ID")
+    json_extraction_parser.add_argument("archive_id", type=str, help="Unique identifier for the archive to extract JSON from")
     json_extraction_parser.add_argument(
         "--target-chunk-size",
         type=int,
-        help="Target chunk size",
-        default=100000
+        help="Target size in bytes for each JSON chunk during extraction",
+        default=DEFAULT_JSON_CHUNK_SIZE
     )
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (1)

Line range hint 102-160: Consider improving error handling and variable declarations

While the implementation is functionally correct, consider these improvements:

  1. Move variable declarations (extraction_config and job_type) closer to where they're first used
  2. Add more specific error messages for unsupported commands
-    extraction_config: QueryJobConfig
-    job_type: QueryJobType
     if EXTRACT_IR_CMD == command:
-        job_type = QueryJobType.EXTRACT_IR
+        job_type: QueryJobType = QueryJobType.EXTRACT_IR
         orig_file_id: str
         if parsed_args.orig_file_id:
             orig_file_id = parsed_args.orig_file_id
         else:
             orig_file_path = parsed_args.orig_file_path
             orig_file_id = get_orig_file_id(clp_config.database, parsed_args.orig_file_path)
             if orig_file_id is None:
                 logger.error(f"Cannot find orig_file_id corresponding to {orig_file_path}")
                 return -1
-        extraction_config = ExtractIrJobConfig(
+        extraction_config: QueryJobConfig = ExtractIrJobConfig(
             orig_file_id=orig_file_id,
             msg_ix=parsed_args.msg_ix,
             target_uncompressed_size=parsed_args.target_uncompressed_size,
         )
     elif EXTRACT_JSON_CMD == command:
-        job_type = QueryJobType.EXTRACT_JSON
-        extraction_config = ExtractJsonJobConfig(
+        job_type: QueryJobType = QueryJobType.EXTRACT_JSON
+        extraction_config: QueryJobConfig = ExtractJsonJobConfig(
             archive_id=parsed_args.archive_id, target_chunk_size=parsed_args.target_chunk_size
         )
     else:
-        logger.exception(f"Unsupported stream extraction command: {command}")
+        logger.error(f"Unsupported stream extraction command: {command}. Supported commands: {EXTRACT_IR_CMD}, {EXTRACT_JSON_CMD}")
         return -1
components/clp-py-utils/clp_py_utils/clp_config.py (1)

Line range hint 348-367: Add docstring to explain the purpose of StreamOutput class

While the implementation is solid, consider adding a docstring to explain:

  • The purpose of this class in relation to stream extraction
  • The significance of the target_uncompressed_size parameter
  • The relationship between IR and JSON extraction outputs

Add this docstring:

 class StreamOutput(BaseModel):
+    """Configures output settings for stream extraction operations (IR and JSON).
+    
+    Attributes:
+        directory: Path where extracted stream files will be stored
+        target_uncompressed_size: Target size in bytes for uncompressed output files
+    """
     directory: pathlib.Path = pathlib.Path("var") / "data" / "stream"
     target_uncompressed_size: int = 128 * 1024 * 1024
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (3)

291-306: Enhance error handling and logging in stream extraction ID retrieval.

The function should provide more detailed error information when the archive or target IDs cannot be retrieved.

Apply this diff to improve error handling:

 def get_archive_and_target_ids_for_stream_extraction(
     db_conn, job_config: Dict[str, Any], job_type: QueryJobType
 ) -> Tuple[Optional[str], Optional[str]]:
     if QueryJobType.EXTRACT_IR == job_type:
         extract_ir_config = ExtractIrJobConfig.parse_obj(job_config)
         return get_archive_and_file_split_ids_for_ir_extraction(db_conn, extract_ir_config)

     extract_json_config = ExtractJsonJobConfig.parse_obj(job_config)
     archive_id = extract_json_config.archive_id
     if check_if_archive_exists(db_conn, extract_json_config.archive_id):
         return archive_id, archive_id
     else:
-        logger.error(f"archive {archive_id} does not exist")
+        logger.error(
+            f"Archive {archive_id} does not exist. Job config: {extract_json_config}"
+        )

     return None, None

Line range hint 630-708: Reduce code duplication in stream extraction job handling.

The job handling logic for IR and JSON extraction contains duplicated code blocks for status updates and logging. Consider extracting common functionality into helper methods.

Create a helper method for status updates:

+def update_job_status(
+    db_conn,
+    job_id: str,
+    status: QueryJobStatus,
+    prev_status: QueryJobStatus,
+    start_time: datetime.datetime,
+    num_tasks: int = 0,
+    duration: float = 0,
+) -> bool:
+    """Helper method to update job status with consistent logging."""
+    success = set_job_or_task_status(
+        db_conn,
+        QUERY_JOBS_TABLE_NAME,
+        job_id,
+        status,
+        prev_status,
+        start_time=start_time,
+        num_tasks=num_tasks,
+        duration=duration,
+    )
+    if not success:
+        logger.error(f"Failed to set job {job_id} as {status.to_str()}")
+    return success

 elif job_type in [QueryJobType.EXTRACT_IR, QueryJobType.EXTRACT_JSON]:
     archive_id, target_id = get_archive_and_target_ids_for_stream_extraction(
         db_conn, job_config, job_type
     )
     if not target_id:
-        if not set_job_or_task_status(
+        if not update_job_status(
             db_conn,
             job_id,
-            QueryJobStatus.FAILED,
-            QueryJobStatus.PENDING,
-            start_time=datetime.datetime.now(),
-            num_tasks=0,
-            duration=0,
-        ):
-            logger.error(f"Failed to set job {job_id} as failed")
+            status=QueryJobStatus.FAILED,
+            prev_status=QueryJobStatus.PENDING,
+            start_time=datetime.datetime.now()
+        ):
         continue

Line range hint 867-912: Improve type hints and error handling in stream extraction completion handler.

The function could benefit from more specific type hints and better error handling for edge cases.

Apply this diff to improve the code:

 async def handle_finished_stream_extraction_job(
-    db_conn, job: QueryJob, task_results: List[Any]
+    db_conn, job: Union[ExtractIrJob, ExtractJsonJob], task_results: List[Dict[str, Any]]
 ) -> None:
     global active_jobs
     global active_archive_json_extractions
     global active_file_split_ir_extractions

     job_id = job.id
     new_job_status = QueryJobStatus.SUCCEEDED

     num_tasks = len(task_results)
     if 1 != num_tasks:
+        task_ids = [str(result.get('task_id', 'unknown')) for result in task_results]
         logger.error(
-            f"Unexpected number of tasks for extraction job {job_id}. "
-            f"Expected 1, got {num_tasks}."
+            f"Unexpected number of tasks for extraction job {job_id}. Expected 1, got {num_tasks}. "
+            f"Task IDs: {', '.join(task_ids)}"
         )
         new_job_status = QueryJobStatus.FAILED
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (1)

34-35: Consistency in Logging Messages

Consider updating the log message to "Starting IR extraction" for grammatical consistency and clarity.

Apply this diff to adjust the log message:

-logger.info("Start IR extraction")
+logger.info("Starting IR extraction")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 496e8d8 and 9518074.

📒 Files selected for processing (14)
  • components/clp-package-utils/clp_package_utils/general.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/decompress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (4 hunks)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (5 hunks)
  • components/clp-py-utils/clp_py_utils/clp_config.py (7 hunks)
  • components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (1 hunks)
  • components/job-orchestration/job_orchestration/executor/query/celeryconfig.py (1 hunks)
  • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (6 hunks)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
  • components/log-viewer-webui/server/settings.json (1 hunks)
  • components/log-viewer-webui/server/src/DbManager.js (4 hunks)
  • components/log-viewer-webui/server/src/app.js (1 hunks)
  • components/log-viewer-webui/server/src/routes/static.js (1 hunks)
  • components/package-template/src/etc/clp-config.yml (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/clp-package-utils/clp_package_utils/general.py
🔇 Additional comments (33)
components/log-viewer-webui/server/settings.json (2)

9-9: Verify MongoDB collection migration requirements.

The collection name change from "ir-files" to "stream-files" may require data migration for existing deployments.

#!/bin/bash
# Check if there are any references to the old collection name that need updating
rg -l "ir-files" --type js --type py

Consider adding a migration script to handle existing data in the "ir-files" collection if needed.
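
If a migration turns out to be necessary, a minimal one-off sketch (hypothetical and not part of this PR; the connection URI and database name below are placeholders) could rename the old collection in place:

from pymongo import MongoClient

# Placeholder URI; use the deployment's actual results-cache URI.
with MongoClient("mongodb://localhost:27017/clp-query-results") as client:
    db = client.get_default_database()
    if "ir-files" in db.list_collection_names():
        db["ir-files"].rename("stream-files", dropTarget=False)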


12-12: Ensure proper stream directory setup.

The path change from "ir" to "stream" requires proper directory setup in new and existing deployments.

Consider:

  1. Adding directory creation in deployment scripts
  2. Documenting the path change in upgrade guides
  3. Ensuring proper permissions are set for the new directory
✅ Verification successful

Let's search more broadly for directory setup patterns to verify the implementation details and documentation.


Let's examine the configuration files and setup scripts to understand how the directory structure is managed.


Let's verify if the directory creation is handled in the startup or installation scripts.


Let's examine these Python scripts to verify if they handle the directory creation during startup.


Directory setup is properly handled by the system

After examining the codebase, I can confirm that:

  1. The start_clp.py script automatically creates all required directories including the stream directory (var/data/stream) during system startup
  2. Directory creation is handled through mkdir(parents=True, exist_ok=True) calls
  3. Proper permissions are set through the container user management system

The path change from "ir" to "stream" is safely managed by the existing infrastructure and requires no additional setup in deployments.
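
For illustration only, the directory-creation pattern referred to above, using the default stream output path from clp_config.py:

from pathlib import Path

stream_output_dir = Path("var") / "data" / "stream"
stream_output_dir.mkdir(parents=True, exist_ok=True)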

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining references to the old directory path
rg -l "var/data/ir" --type js --type py --type sh

Length of output: 36


Script:

#!/bin/bash
# Search for directory references in deployment or setup scripts
fd -e sh -e py setup install deploy | xargs rg -l "var/data"

# Search for any documentation files mentioning directory paths
fd -e md -e txt | xargs rg -l "var/data"

# Check for any configuration files that might reference these paths
fd -e json -e yaml -e conf | xargs rg -l "var/data"

Length of output: 450


Script:

#!/bin/bash
# Let's examine the content of the config files we found
cat components/log-viewer-webui/server/settings.json
cat components/package-template/src/etc/clp-config.yml

# Check for any installation or setup scripts in the package template
fd -e sh -e py . components/package-template/src/

Length of output: 3262


Script:

#!/bin/bash
# Let's examine the startup scripts to see if they handle directory creation
cat components/package-template/src/sbin/start-clp.sh

# Check if there are any installation or setup scripts in the root of the project
fd -e sh -e py setup install init | xargs rg -l "mkdir"

Length of output: 2144


Script:

#!/bin/bash
# Let's check the content of the start_clp.py script which might handle directory creation
cat components/clp-package-utils/clp_package_utils/scripts/start_clp.py

# Also check the package utils for directory setup
cat components/clp-package-utils/clp_package_utils/scripts/native/compress.py

Length of output: 51358

components/job-orchestration/job_orchestration/executor/query/celeryconfig.py (2)

12-12: LGTM: Task route updated to match new stream extraction pattern

The task route has been properly updated to use extract_stream_task.extract_stream, maintaining the correct queue assignment to QueueName.QUERY.

Let's verify the task registration:

#!/bin/bash
# Description: Verify the task registration is properly defined
# Expected: The new task should be properly defined in the extract_stream_task module

# Check for the task definition
ast-grep --pattern 'def extract_stream($$$)'

7-7: LGTM: Import statement correctly updated for stream extraction

The change from extract_ir_task to extract_stream_task aligns with the PR's objective of refactoring extraction tasks to support both IR and JSON extraction under a unified "stream" concept.

Let's verify this change is consistent across the codebase:

✅ Verification successful

Import change is properly integrated and consistent

The verification confirms that:

  • The extract_stream_task.py module exists in the correct location
  • All references to the task module are using the new extract_stream_task name
  • The task is properly imported in both the celery config and scheduler
  • No legacy references to extract_ir_task remain in the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the module renaming is consistent
# Expected: No references to the old module name should exist

# Check for any remaining references to the old module name
rg "extract_ir_task" --type py

Length of output: 46


Script:

#!/bin/bash
# Let's verify the task module existence and its references
fd "extract_stream_task.py" --type f

# Check for any imports or references to extract_stream_task
rg "extract_stream_task" --type py

# Look for celery task definitions related to stream extraction
ast-grep --pattern 'app.task(name="extract_stream")'

Length of output: 681

components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (1)

27-27: LGTM! Variable renaming is consistent.

The variable renaming maintains consistency with the new stream-oriented terminology.

components/log-viewer-webui/server/src/app.js (1)

40-40: Verify MongoDB collection name changes.

The change from irFilesCollectionName to StreamFilesCollectionName is part of the IR to Stream renaming. Let's verify this change is consistent across the codebase.

✅ Verification successful

MongoDB collection name changes are consistent across the codebase

The verification shows that:

  • No references to the old irfiles naming pattern were found
  • All occurrences consistently use the new StreamFiles naming pattern
  • The settings.json correctly defines MongoDbStreamFilesCollectionName as "stream-files"
  • The change is properly reflected in both configuration and code files
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the MongoDB collection name changes are consistent
# Test 1: Check for any remaining references to the old collection name
rg -i 'irfiles(collection|dir)'

# Test 2: Check the new collection name usage
rg -i 'streamfiles(collection|dir)'

# Test 3: Verify settings.json has the updated property
cat components/log-viewer-webui/server/settings.json

Length of output: 2163

components/log-viewer-webui/server/src/routes/static.js (1)

25-28: Consider updating the route prefix to match new terminology

The route prefix "/ir" might be misleading as we're moving away from IR-specific terminology to a more generic "stream" concept. This could affect API consistency and documentation.

Let's verify if this route is referenced elsewhere:

Consider updating the prefix to "/stream" to maintain consistency with the new terminology, but ensure all clients are updated accordingly.

components/package-template/src/etc/clp-config.yml (2)

86-87: Validate stream output configuration impact.

The new stream_output section correctly replaces the old ir_output section. The directory path and configuration structure look appropriate.

Let's verify the deployment impact:

#!/bin/bash
# Description: Check for references to the old and new output paths
# Test: Search for references to ensure consistent updates
rg -g '!*.{log,md}' "(var/data/ir|var/data/stream)"

50-50: Verify collection renaming migration strategy.

The change from ir_collection_name to stream_collection_name looks good, but ensure there's a migration strategy for existing deployments.

Let's check for any migration scripts or related changes:

components/log-viewer-webui/server/src/DbManager.js (1)

Line range hint 1-236: Consider completing the IR to Stream terminology transition

The codebase is transitioning from IR-specific to Stream-generic terminology, but some references remain unchanged:

  • QUERY_JOB_TYPE.EXTRACT_IR constant
  • submitAndWaitForExtractIrJob method name and implementation

This partial transition might cause confusion and maintenance issues.

Let's verify the usage of IR terminology in the codebase:

Consider creating a tracking issue to:

  1. Document all places where IR terminology needs to be updated
  2. Plan the migration strategy
  3. Update all references consistently
  4. Update related tests and documentation
components/clp-package-utils/clp_package_utils/scripts/decompress.py (3)

17-17: LGTM! Import addition is well-placed.

The EXTRACT_JSON_CMD import is correctly grouped with related command imports.


Line range hint 150-162: LGTM! Function rename reflects broader purpose.

The rename from handle_extract_ir_cmd to handle_extract_stream_cmd better represents its expanded functionality to handle both IR and JSON extraction.


268-269: LGTM! Command handling is consistent.

The addition of JSON extraction command handling follows the established pattern and maintains consistency with the existing code structure.

components/clp-package-utils/clp_package_utils/scripts/native/decompress.py (5)

15-19: LGTM: Import and constant additions are appropriate

The new imports and constant additions properly support the JSON extraction functionality.

Also applies to: 25-25


78-81: LGTM: Function signature properly generalized

The updated signature appropriately supports multiple extraction job types through the generic job_type and job_config parameters.

Also applies to: 84-87


91-99: LGTM: Job submission and monitoring logic properly updated

The changes maintain robust error handling while supporting multiple job types.


298-304: LGTM: JSON extraction parser properly configured

The parser is well-structured with clear help messages and appropriate required flags.


310-311: LGTM: Command routing properly updated

The changes correctly route both IR and JSON extraction commands to the stream extraction handler.

components/clp-py-utils/clp_py_utils/clp_config.py (2)

273-273: LGTM: Consistent renaming from IR to stream collection

The renaming from ir_collection_name to stream_collection_name aligns well with the PR's objective to support both IR and JSON extraction under the unified "stream" terminology.

Also applies to: 287-292


430-430: Verify stream output directory creation in deployment

The integration of StreamOutput into CLPConfig looks correct. However, ensure that:

  1. The stream output directory is properly created during deployment
  2. Appropriate permissions are set for the directory

Also applies to: 440-440, 460-464, 533-533

components/clp-package-utils/clp_package_utils/scripts/start_clp.py (6)

289-289: LGTM: Collection name parameter updated consistently.

The change from --ir-collection to --stream-collection aligns with the broader refactoring to support both IR and JSON extraction under the stream terminology.


663-666: LGTM: Environment variables updated consistently.

The environment variables have been correctly updated from IR-specific to stream-specific naming:

  • CLP_STREAM_OUTPUT_DIR replaces IR-specific output directory
  • CLP_STREAM_COLLECTION replaces IR-specific collection name

713-713: LGTM: Directory creation updated consistently.

The change from IR output directory to stream output directory maintains consistency with the new terminology.


936-938: LGTM: MongoDB settings updated consistently.

The MongoDB-related settings have been properly updated to use stream terminology:

  • MongoDbStreamFilesCollectionName replaces IR-specific collection name
  • StreamFilesDir replaces IR-specific directory path

964-964: LGTM: Mount configuration updated consistently.

The mount configuration has been updated from IR output directory to stream output directory, maintaining consistency with the new terminology.


Line range hint 289-964: Verify the impact of terminology changes across the codebase.

While the changes in this file are consistent, we should verify that all IR to stream terminology changes have been applied consistently across the entire codebase.

components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (7)

16-16: Import Statement Update Confirmed

The import of ExtractIrJobConfig and ExtractJsonJobConfig is necessary to handle both IR and JSON extraction configurations. This change is appropriate and aligns with the new functionality.


28-31: Updated Parameters in make_command Function

The addition of stream_output_dir, job_config_obj, and stream_collection parameters to the make_command function correctly accommodates the requirements for both IR and JSON extraction tasks.


51-69: Proper Handling of JSON Extraction Command

The new conditional branch for StorageEngine.CLP_S correctly introduces the logic for JSON extraction. The construction of the command and the parsing of ExtractJsonJobConfig are implemented appropriately.


78-78: Function Renaming Reflects Expanded Functionality

Renaming the function to extract_stream appropriately reflects its support for both IR and JSON extraction, enhancing code readability and clarity.


87-87: Updated Task Name Improves Clarity

Setting task_name to "Stream Extraction" provides a clearer description of the task being executed, aligning with the function's broadened scope.


104-105: Retrieval of New Environment Variables

The addition of stream_output_dir and stream_collection from environment variables ensures that the necessary configurations are accessible for the extraction processes.


112-115: Correct Parameters Passed to make_command

Passing stream_output_dir, job_config_obj, and stream_collection to the make_command function ensures it has all required information to construct the appropriate extraction command.

Comment on lines 21 to 23
let {StreamFilesDir} = settings;
if (false === path.isAbsolute(StreamFilesDir)) {
    StreamFilesDir = path.resolve(rootDirname, StreamFilesDir);

🛠️ Refactor suggestion

Use consistent camelCase naming convention

The variable StreamFilesDir should follow JavaScript naming conventions by using camelCase instead of PascalCase, as it's not a class name.

Apply this diff to maintain consistency:

-    let {StreamFilesDir} = settings;
-    if (false === path.isAbsolute(StreamFilesDir)) {
-        StreamFilesDir = path.resolve(rootDirname, StreamFilesDir);
+    let {streamFilesDir} = settings;
+    if (false === path.isAbsolute(streamFilesDir)) {
+        streamFilesDir = path.resolve(rootDirname, streamFilesDir);

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)

359-393: Consider adding error handling for MongoDB operations.

While the functions are well-structured and handle both extraction types effectively, the MongoDB operations could benefit from explicit error handling to gracefully handle connection issues or query failures.

Add try-catch blocks around MongoDB operations:

 def is_target_extracted(
     results_cache_uri: str, stream_collection_name: str, target_id: str, job_type: QueryJobType
 ) -> bool:
     target_key: str
     if QueryJobType.EXTRACT_IR == job_type:
         target_key = "file_split_id"
     else:
         target_key = "orig_file_id"

-    with pymongo.MongoClient(results_cache_uri) as results_cache_client:
-        stream_collection = results_cache_client.get_default_database()[stream_collection_name]
-        results_count = stream_collection.count_documents({target_key: target_id})
-        return 0 != results_count
+    try:
+        with pymongo.MongoClient(results_cache_uri) as results_cache_client:
+            stream_collection = results_cache_client.get_default_database()[stream_collection_name]
+            results_count = stream_collection.count_documents({target_key: target_id})
+            return 0 != results_count
+    except Exception as e:
+        logger.error(f"Failed to check extraction status: {e}")
+        return False

Line range hint 867-912: Consider extracting common logging patterns.

While the function effectively handles job completion with proper error handling, the logging statements could be made more maintainable by extracting common patterns.

Extract logging into helper functions:

+def log_task_completion(job_id: str, task_id: str, duration: float, success: bool = True) -> None:
+    status = "succeeded" if success else "failed"
+    logger.info(
+        f"extraction task job-{job_id}-task-{task_id} {status} in {duration} second(s)."
+    )
+
 async def handle_finished_stream_extraction_job(
     db_conn, job: QueryJob, task_results: List[Any]
 ) -> None:
     # ... existing code ...
     
-    logger.info(
-        f"extraction task job-{job_id}-task-{task_id} succeeded in "
-        f"{task_result.duration} second(s)."
-    )
+    log_task_completion(job_id, task_id, task_result.duration)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9518074 and ebe6395.

📒 Files selected for processing (2)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
  • components/log-viewer-webui/server/src/routes/static.js (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/log-viewer-webui/server/src/routes/static.js
🔇 Additional comments (5)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (5)

291-306: Well-structured stream extraction utility function!

The function effectively handles both IR and JSON extraction types while maintaining good separation of concerns and SQL safety.


395-418: Well-implemented job creation function with proper type safety!

The function effectively handles both extraction types while maintaining type safety through proper configuration parsing and providing informative logging.


Line range hint 458-467: Excellent integration of JSON extraction with existing task group creation!

The changes effectively integrate JSON extraction while maintaining code reusability and consistency with the existing IR extraction functionality.


973-974: Clean integration of JSON extraction job status handling!

The changes effectively integrate JSON extraction job status handling while maintaining code consistency and reusability.


Line range hint 990-1004: Clean configuration integration!

The stream collection configuration is properly integrated and propagated through the job handling system.

Also applies to: 1088-1088

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)

81-183: Add docstrings and type hints to abstract methods.

The StreamExtractionHandle class hierarchy is well-designed, but could be more maintainable with:

  1. Docstrings explaining the purpose and contract of each abstract method
  2. Return type hints for all methods

Apply this diff to improve the abstract class:

 class StreamExtractionHandle(ABC):
     def __init__(self, job_id: str):
         self.job_id = job_id

     @abstractmethod
-    def get_stream_id(self) -> Optional[str]: ...
+    def get_stream_id(self) -> Optional[str]:
+        """
+        Get the identifier for the stream being extracted.
+        
+        Returns:
+            str: The stream ID if available, None otherwise
+        """
+        ...

     @abstractmethod
-    def get_archive_for_stream_extraction(self, db_conn) -> Optional[str]: ...
+    def get_archive_for_stream_extraction(self, db_conn) -> Optional[str]:
+        """
+        Get the archive ID containing the stream to be extracted.
+        
+        Args:
+            db_conn: Database connection
+            
+        Returns:
+            str: The archive ID if found, None otherwise
+        """
+        ...

Line range hint 912-957: Improve error handling and logging consistency.

The job completion handling could be improved in several ways:

  1. The logging messages at lines 954 and 956 are nearly identical
  2. The error handling for unexpected task counts could be more specific

Consider these improvements:

-            logger.info(f"Completed stream extraction job {job_id}.")
-        else:
-            logger.info(f"Completed stream extraction job {job_id} with failing tasks.")
+            logger.info(f"Successfully completed stream extraction job {job_id}")
+        else:
+            logger.error(
+                f"Stream extraction job {job_id} failed: "
+                f"{'unexpected task count' if num_tasks != 1 else 'task execution failed'}"
+            )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ff15840 and e65799e.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

454-469: LGTM! Well-implemented archive existence check.

The function uses parameterized queries for SQL injection prevention and has clear error handling.

Comment on lines 686 to 700
# NOTE: The following two if blocks of `is_stream_extraction_active` and
# `is_stream_extracted` should not be reordered.
# We must:
# 1. First, check if the stream is in the process of being extracted
# (`is_stream_extraction_active`).
# 2. Then, check if the stream has already been extracted
# (`is_stream_extracted`).
#
# This order ensures correctness because `is_stream_extracted` returns True if
# any chunk of the stream has been extracted, but it does not guarantee that *all*
# chunks are extracted. If `is_stream_extracted` is checked first and the job is
# marked as successful based on its result, it is possible that the extraction job
# is still in progress, meaning the specific chunk requested by the web UI might
# not yet be ready.

Contributor

🛠️ Refactor suggestion

Consider encapsulating the extraction state checks.

The ordering dependency between is_stream_extraction_active and is_stream_extracted is well documented but could be encapsulated to prevent future maintenance issues.

Consider creating a dedicated method to handle these checks:

+    def check_extraction_state(
+        self,
+        db_conn,
+        job_id: str,
+        results_cache_uri: str,
+        stream_collection_name: str
+    ) -> Tuple[bool, Optional[QueryJobStatus]]:
+        """
+        Check the extraction state in the correct order:
+        1. First, check if stream is being extracted
+        2. Then, check if stream has been extracted
+        
+        Returns:
+            Tuple[bool, Optional[QueryJobStatus]]: 
+            - bool: True if state handling is complete
+            - QueryJobStatus: The new job status if complete, None otherwise
+        """
+        if self.is_stream_extraction_active():
+            self.mark_job_as_waiting()
+            return True, QueryJobStatus.RUNNING
+            
+        if self.is_stream_extracted(results_cache_uri, stream_collection_name):
+            return True, QueryJobStatus.SUCCEEDED
+            
+        return False, None

Committable suggestion skipped: line range outside the PR's diff.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (4)

81-100: Add docstrings to abstract methods for better maintainability.

The abstract class provides a good interface, but adding docstrings would make it more maintainable and help future implementations.

Apply this diff to add docstrings:

 class StreamExtractionHandle(ABC):
     def __init__(self, job_id: str):
         self.job_id = job_id
         self.archive_id: Optional[str] = None

     @abstractmethod
-    def get_stream_id(self) -> Optional[str]: ...
+    def get_stream_id(self) -> Optional[str]:
+        """
+        Returns the unique identifier for the stream being extracted.
+        :return: String identifier for the stream, or None if not available
+        """
+        ...

     @abstractmethod
-    def is_stream_extraction_active(self) -> bool: ...
+    def is_stream_extraction_active(self) -> bool:
+        """
+        Checks if the stream is currently being extracted.
+        :return: True if extraction is in progress, False otherwise
+        """
+        ...

     @abstractmethod
-    def mark_job_as_waiting(self) -> None: ...
+    def mark_job_as_waiting(self) -> None:
+        """
+        Marks the current job as waiting for stream extraction to complete.
+        """
+        ...

     @abstractmethod
-    def is_stream_extracted(self, results_cache_uri: str, stream_collection_name: str) -> bool: ...
+    def is_stream_extracted(self, results_cache_uri: str, stream_collection_name: str) -> bool:
+        """
+        Checks if the stream has been extracted to the results cache.
+        :param results_cache_uri: URI of the results cache
+        :param stream_collection_name: Name of the collection storing extracted streams
+        :return: True if stream is extracted, False otherwise
+        """
+        ...

     @abstractmethod
-    def create_stream_extraction_job(self) -> QueryJob: ...
+    def create_stream_extraction_job(self) -> QueryJob:
+        """
+        Creates a new job for extracting the stream.
+        :return: Configured QueryJob instance
+        """
+        ...

Line range hint 903-948: Improve error handling in stream extraction job completion.

The error handling in handle_finished_stream_extraction_job could be more robust:

  1. Consider using early returns to reduce nesting
  2. Add more detailed error messages

Apply this diff to improve the error handling:

 async def handle_finished_stream_extraction_job(
     db_conn, job: QueryJob, task_results: List[Any]
 ) -> None:
     global active_jobs
     global active_archive_json_extractions
     global active_file_split_ir_extractions

     job_id = job.id
-    new_job_status = QueryJobStatus.SUCCEEDED
+    new_job_status = QueryJobStatus.FAILED  # Default to failed, set to success only if all checks pass

     num_tasks = len(task_results)
     if 1 != num_tasks:
         logger.error(
-            f"Unexpected number of tasks for extraction job {job_id}. "
-            f"Expected 1, got {num_tasks}."
+            f"Invalid task count for extraction job {job_id}: expected 1, got {num_tasks}. "
+            f"This indicates a potential scheduling issue."
         )
-        new_job_status = QueryJobStatus.FAILED
-    else:
-        task_result = QueryTaskResult.parse_obj(task_results[0])
-        task_id = task_result.task_id
-        if not QueryJobStatus.SUCCEEDED == task_result.status:
-            logger.error(
-                f"extraction task job-{job_id}-task-{task_id} failed. "
-                f"Check {task_result.error_log_path} for details."
-            )
-            new_job_status = QueryJobStatus.FAILED
-        else:
-            logger.info(
-                f"extraction task job-{job_id}-task-{task_id} succeeded in "
-                f"{task_result.duration} second(s)."
-            )
+        return

+    task_result = QueryTaskResult.parse_obj(task_results[0])
+    task_id = task_result.task_id
+    
+    if not QueryJobStatus.SUCCEEDED == task_result.status:
+        logger.error(
+            f"Stream extraction task job-{job_id}-task-{task_id} failed. "
+            f"Error details available at: {task_result.error_log_path}"
+        )
+        return
+
+    logger.info(
+        f"Stream extraction task job-{job_id}-task-{task_id} completed successfully "
+        f"in {task_result.duration:.2f} second(s)."
+    )
+    new_job_status = QueryJobStatus.SUCCEEDED

76-77: Consider using a thread-safe data structure for job tracking.

The use of global dictionaries for tracking active jobs and extractions could lead to race conditions in a multi-threaded environment.

Consider using a thread-safe data structure or encapsulating the state management in a class:

from threading import Lock

class JobTracker:
    def __init__(self):
        self._lock = Lock()
        self._active_jobs = {}
        self._active_ir_extractions = {}
        self._active_json_extractions = {}

    def add_job(self, job_id: str, job: QueryJob) -> None:
        with self._lock:
            self._active_jobs[job_id] = job

    def remove_job(self, job_id: str) -> None:
        with self._lock:
            self._active_jobs.pop(job_id, None)

1009-1010: Use tuple for job type comparison.

Replace the list with a tuple for better performance when checking job types.

Apply this diff:

-            elif job_type in [QueryJobType.EXTRACT_JSON, QueryJobType.EXTRACT_IR]:
+            elif job_type in (QueryJobType.EXTRACT_JSON, QueryJobType.EXTRACT_IR):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e65799e and fb8e2a6.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

677-691: Consider encapsulating the extraction state checks.

The ordering dependency between is_stream_extraction_active and is_stream_extracted is well documented but could be encapsulated to prevent future maintenance issues.

Comment on lines 445 to 459
def check_if_archive_exists(
    db_conn,
    archive_id: str,
) -> bool:

    query = f"""SELECT 1
    FROM {CLP_METADATA_TABLE_PREFIX}archives WHERE
    id = %s
    """
    with contextlib.closing(db_conn.cursor(dictionary=True)) as cursor:
        cursor.execute(query, (archive_id,))
        if cursor.fetchone():
            return True

    return False
Contributor

💡 Codebase verification

SQL injection vulnerabilities found in multiple database operations

Several SQL queries are using string formatting (f-strings) to construct queries instead of using parameterized queries, particularly in:

  • compression_scheduler.py:

    • update_compression_task_metadata: Direct string interpolation for task_id
    • update_compression_job_metadata: Direct string interpolation for field values and job_id
    • Multiple UPDATE queries with interpolated status values and job IDs
  • query_scheduler.py:

    • Multiple SELECT queries with interpolated table names and status values
    • INSERT query in insert_query_tasks_into_db with interpolated values

These should be rewritten using parameterized queries similar to the pattern used in check_if_archive_exists.
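As a concrete illustration of the fix (the function and table names below are hypothetical, not the exact queries flagged above), an interpolated update can be rewritten in the same parameterized style as check_if_archive_exists:

import contextlib

def update_job_status(db_conn, jobs_table_name: str, job_id: str, new_status: int) -> None:
    # The table name comes from a trusted constant; the user-influenced values
    # (status, job_id) are bound as parameters instead of being interpolated
    # into the SQL string.
    query = f"UPDATE {jobs_table_name} SET status = %s WHERE id = %s"
    with contextlib.closing(db_conn.cursor()) as cursor:
        cursor.execute(query, (new_status, job_id))
    db_conn.commit()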

🔗 Analysis chain

Use parameterized queries consistently.

Good use of parameterized query in check_if_archive_exists. However, this pattern should be applied consistently throughout the file.

Let's verify other SQL queries in the file:


Let me analyze the SQL queries in more detail by checking their full context.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential SQL injection vulnerabilities
rg -A 3 "execute\(" | rg -B 3 "SELECT|UPDATE|INSERT|DELETE"

Length of output: 4240


Script:

#!/bin/bash
# Get more context around SQL queries
rg -B 5 -A 10 "execute\(" components/job-orchestration/job_orchestration/scheduler/

Length of output: 23781

Member

@kirkrodrigues kirkrodrigues left a comment

Skimmed things and they generally look good. I won't have time to finish a complete review within 24 hours, but will likely get to it later tonight.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)

81-103: Add return type hints to abstract methods.

The abstract methods in StreamExtractionHandle should include return type hints for better type safety and documentation.

Apply this diff:

-    def get_stream_id(self) -> Optional[str]: ...
+    @abstractmethod
+    def get_stream_id(self) -> Optional[str]: ...

-    def is_stream_extraction_active(self) -> bool: ...
+    @abstractmethod
+    def is_stream_extraction_active(self) -> bool: ...

-    def is_stream_extracted(self, results_cache_uri: str, stream_collection_name: str) -> bool: ...
+    @abstractmethod
+    def is_stream_extracted(self, results_cache_uri: str, stream_collection_name: str) -> bool: ...

-    def mark_job_as_waiting(self) -> None: ...
+    @abstractmethod
+    def mark_job_as_waiting(self) -> None: ...

-    def create_stream_extraction_job(self) -> QueryJob: ...
+    @abstractmethod
+    def create_stream_extraction_job(self) -> QueryJob: ...

Line range hint 908-938: Improve error handling in stream extraction job completion.

The error handling in handle_finished_stream_extraction_job could be enhanced:

  1. The assumption of exactly one task result is rigid and could be made more flexible
  2. Error messages could provide more context about the failure

Apply this diff:

 async def handle_finished_stream_extraction_job(
     db_conn, job: QueryJob, task_results: List[Any]
 ) -> None:
     global active_jobs
     global active_archive_json_extractions
     global active_file_split_ir_extractions

     job_id = job.id
     new_job_status = QueryJobStatus.SUCCEEDED

     num_tasks = len(task_results)
-    if 1 != num_tasks:
+    if num_tasks == 0:
         logger.error(
-            f"Unexpected number of tasks for extraction job {job_id}. "
-            f"Expected 1, got {num_tasks}."
+            f"No task results received for extraction job {job_id}."
         )
         new_job_status = QueryJobStatus.FAILED
-    else:
+    elif num_tasks > 1:
+        logger.warning(
+            f"Multiple task results ({num_tasks}) received for extraction job {job_id}. "
+            f"Processing only the first result."
+        )
+
+    if new_job_status != QueryJobStatus.FAILED:
         task_result = QueryTaskResult.parse_obj(task_results[0])
         task_id = task_result.task_id
         if not QueryJobStatus.SUCCEEDED == task_result.status:
             logger.error(
-                f"extraction task job-{job_id}-task-{task_id} failed. "
-                f"Check {task_result.error_log_path} for details."
+                f"Stream extraction task job-{job_id}-task-{task_id} failed with status "
+                f"{task_result.status}. Check {task_result.error_log_path} for details."
             )
             new_job_status = QueryJobStatus.FAILED
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fb8e2a6 and 3d86291.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

682-696: Consider encapsulating the extraction state checks.

The ordering dependency between is_stream_extraction_active and is_stream_extracted is well documented but could be encapsulated to prevent future maintenance issues.

Comment on lines +76 to +77
# Dictionary that maps IDs of clp-s archives being extracted to IDs of jobs waiting for them
active_archive_json_extractions: Dict[str, List[str]] = {}
Contributor

@coderabbitai coderabbitai bot Nov 16, 2024

🛠️ Refactor suggestion

Consider encapsulating global state in a JobStateManager class.

The use of global dictionaries for tracking job states could lead to maintainability and thread-safety issues. Consider encapsulating this state in a dedicated class.

Example implementation:

class JobStateManager:
    def __init__(self):
        self._active_jobs: Dict[str, QueryJob] = {}
        self._active_file_split_ir_extractions: Dict[str, List[str]] = {}
        self._active_archive_json_extractions: Dict[str, List[str]] = {}
        self._lock = asyncio.Lock()  # For thread safety

    async def add_waiting_ir_job(self, file_split_id: str, job_id: str) -> None:
        async with self._lock:
            if file_split_id not in self._active_file_split_ir_extractions:
                self._active_file_split_ir_extractions[file_split_id] = []
            self._active_file_split_ir_extractions[file_split_id].append(job_id)

    async def add_waiting_json_job(self, archive_id: str, job_id: str) -> None:
        async with self._lock:
            if archive_id not in self._active_archive_json_extractions:
                self._active_archive_json_extractions[archive_id] = []
            self._active_archive_json_extractions[archive_id].append(job_id)

Contributor Author

Let's not assume multithreading for this file. We currently only use single threading.

Contributor

@haiqi96, understood. Since the file is single-threaded, encapsulating the state for thread-safety may not be necessary.

(_/)
( •_•)
/ >💧


✏️ Learnings added
Learnt from: haiqi96
PR: y-scope/clp#569
File: components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py:76-77
Timestamp: 2024-11-17T23:24:08.758Z
Learning: The `query_scheduler.py` file operates with single threading, and multithreading is not used.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (4)

76-77: Add empty line before comment for better readability

Add an empty line before the comment to improve readability and maintain consistency with the project's commenting style.

 active_file_split_ir_extractions: Dict[str, List[str]] = {}
 
+
 # Dictionary that maps IDs of clp-s archives being extracted to IDs of jobs waiting for them
 active_archive_json_extractions: Dict[str, List[str]] = {}

182-186: Rename function for clarity

The function name does_document_exist is too generic. Consider renaming it to mongodb_doc_exists to better reflect its specific purpose and implementation.

-def does_document_exist(mongodb_uri, collection_name, field, value):
+def mongodb_doc_exists(mongodb_uri, collection_name, field, value):

113-114: Improve error message clarity

The error message could be more specific about which parameters failed to resolve to an existing archive.

-            raise ValueError("Job parameters don't resolve to an existing archive")
+            raise ValueError(f"Failed to resolve archive for file_split_id={self._file_split_id}")

451-464: Rename function for consistency

The function name check_if_archive_exists could be shortened to archive_exists for better consistency with other function names in the codebase.

-def check_if_archive_exists(
+def archive_exists(
     db_conn,
     archive_id: str,
 ) -> bool:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 3d86291 and 4b84812.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

955-961: ⚠️ Potential issue

Potential race condition in job cleanup

The code assumes that the job being completed is the one that did the extraction, but there could be a race condition if multiple jobs complete simultaneously.

Consider adding synchronization or using a more robust cleanup mechanism:

     if QueryJobType.EXTRACT_IR == job.get_type():
         extract_ir_config: ExtractIrJobConfig = job.get_config()
-        waiting_jobs = active_file_split_ir_extractions.pop(extract_ir_config.file_split_id)
+        file_split_id = extract_ir_config.file_split_id
+        if file_split_id in active_file_split_ir_extractions:
+            waiting_jobs = active_file_split_ir_extractions.pop(file_split_id)
     else:
         extract_json_config: ExtractJsonJobConfig = job.get_config()
-        waiting_jobs = active_archive_json_extractions.pop(extract_json_config.archive_id)
+        archive_id = extract_json_config.archive_id
+        if archive_id in active_archive_json_extractions:
+            waiting_jobs = active_archive_json_extractions.pop(archive_id)

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (4)

82-104: Add type hints to abstract methods for better type safety.

The abstract methods in StreamExtractionHandle would benefit from type hints to ensure proper implementation in derived classes.

Apply this diff to add type hints:

-    def get_stream_id(self) -> str: ...
+    def get_stream_id(self) -> str:
+        """Returns the ID of the stream being extracted."""
+        ...

-    def is_stream_extraction_active(self) -> bool: ...
+    def is_stream_extraction_active(self) -> bool:
+        """Returns True if the stream is currently being extracted."""
+        ...

-    def is_stream_extracted(self, results_cache_uri: str, stream_collection_name: str) -> bool: ...
+    def is_stream_extracted(self, results_cache_uri: str, stream_collection_name: str) -> bool:
+        """Returns True if the stream has been extracted."""
+        ...

-    def mark_job_as_waiting(self) -> None: ...
+    def mark_job_as_waiting(self) -> None:
+        """Marks the job as waiting for stream extraction to complete."""
+        ...

-    def create_stream_extraction_job(self) -> QueryJob: ...
+    def create_stream_extraction_job(self) -> QueryJob:
+        """Creates and returns a new stream extraction job."""
+        ...

182-186: Rename function to be more descriptive.

The function name document_exist could be more descriptive of its purpose in checking MongoDB documents.

Apply this diff to rename the function:

-def document_exist(mongodb_uri, collection_name, field, value):
+def mongodb_document_exists(mongodb_uri: str, collection_name: str, field: str, value: str) -> bool:

Line range hint 663-682: Use specific exception handling.

The current exception handling catches all ValueError instances. Consider catching specific exceptions for better error handling.

Apply this diff to improve error handling:

         try:
             if QueryJobType.EXTRACT_IR == job_type:
                 job_handle = IrExtractionHandle(job_id, job_config, db_conn)
             else:
                 job_handle = JsonExtractionHandle(job_id, job_config, db_conn)
-        except ValueError:
+        except ValueError as ve:
+            logger.exception(f"Invalid job parameters: {str(ve)}")
+            if not set_job_or_task_status(
+                db_conn,
+                QUERY_JOBS_TABLE_NAME,
+                job_id,
+                QueryJobStatus.FAILED,
+                QueryJobStatus.PENDING,
+                start_time=datetime.datetime.now(),
+                num_tasks=0,
+                duration=0,
+            ):
+                logger.error(f"Failed to set job {job_id} as failed")
+            continue
+        except Exception as e:
             logger.exception("Failed to initialize extraction job handle")

Line range hint 910-977: Consider extracting job cleanup logic into a separate method.

The job cleanup logic in handle_finished_stream_extraction_job could be extracted into a separate method for better maintainability.

Consider creating a helper method:

def cleanup_waiting_jobs(
    db_conn,
    job: QueryJob,
    job_id: str,
    new_job_status: QueryJobStatus,
    start_time: datetime.datetime
) -> None:
    """Cleanup waiting jobs after stream extraction completion."""
    waiting_jobs: List[str]
    if QueryJobType.EXTRACT_IR == job.get_type():
        extract_ir_config: ExtractIrJobConfig = job.get_config()
        waiting_jobs = active_file_split_ir_extractions.pop(extract_ir_config.file_split_id)
    else:
        extract_json_config: ExtractJsonJobConfig = job.get_config()
        waiting_jobs = active_archive_json_extractions.pop(extract_json_config.archive_id)

    waiting_jobs.remove(job_id)
    for waiting_job in waiting_jobs:
        logger.info(f"Setting status to {new_job_status.to_str()} for waiting jobs: {waiting_job}.")
        set_job_or_task_status(
            db_conn,
            QUERY_JOBS_TABLE_NAME,
            waiting_job,
            new_job_status,
            QueryJobStatus.RUNNING,
            num_tasks_completed=0,
            duration=(datetime.datetime.now() - start_time).total_seconds(),
        )
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4b84812 and 7a3398f.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
🔇 Additional comments (1)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (1)

106-145: Use double underscores for private members in derived classes.

The derived classes IrExtractionHandle and JsonExtractionHandle use single underscore for private members. For stronger encapsulation, use double underscores.

Also applies to: 147-180
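For illustration, a double-underscore prefix makes Python name-mangle the attribute, which makes accidental external access harder; the attribute name below is an assumption, not necessarily one used by the handles.

class JsonExtractionHandle:
    def __init__(self, archive_id: str):
        self.__archive_id = archive_id  # stored as _JsonExtractionHandle__archive_id

handle = JsonExtractionHandle("archive-001")
# handle.__archive_id would raise AttributeError; the mangled name still works:
print(handle._JsonExtractionHandle__archive_id)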

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (3)

76-77: Consider encapsulating extraction state management

While the code operates in a single-threaded context, managing state through global dictionaries can make the code harder to maintain and test. Consider creating a dedicated class to manage extraction state, even without thread-safety concerns.

Example implementation:

class ExtractionStateManager:
    def __init__(self):
        self._active_file_split_ir_extractions: Dict[str, List[str]] = {}
        self._active_archive_json_extractions: Dict[str, List[str]] = {}
    
    def add_waiting_ir_job(self, file_split_id: str, job_id: str) -> None:
        if file_split_id not in self._active_file_split_ir_extractions:
            self._active_file_split_ir_extractions[file_split_id] = []
        self._active_file_split_ir_extractions[file_split_id].append(job_id)
    
    def add_waiting_json_job(self, archive_id: str, job_id: str) -> None:
        if archive_id not in self._active_archive_json_extractions:
            self._active_archive_json_extractions[archive_id] = []
        self._active_archive_json_extractions[archive_id].append(job_id)

182-186: Consider a more specific function name

The function name document_exists could be more specific to its purpose, such as check_mongodb_document_exists or verify_stream_document_exists, to better indicate its role in checking MongoDB documents.
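A minimal sketch of the renamed helper, assuming the pymongo count_documents pattern used elsewhere in this review:

import pymongo

def verify_stream_document_exists(
    mongodb_uri: str, collection_name: str, field: str, value: str
) -> bool:
    # True if at least one document in the collection matches {field: value}.
    with pymongo.MongoClient(mongodb_uri) as client:
        collection = client.get_default_database()[collection_name]
        return 0 != collection.count_documents({field: value})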


Line range hint 661-751: Consider extracting job type handling into separate methods

The job type handling logic in handle_pending_query_jobs is becoming complex. Consider extracting the handling of each job type into separate methods for better maintainability.

Example refactor:

def handle_stream_extraction_job(
    self,
    db_conn,
    job_id: str,
    job_type: QueryJobType,
    job_config: Dict[str, Any],
    results_cache_uri: str,
    stream_collection_name: str,
) -> None:
    try:
        job_handle = self._create_job_handle(job_id, job_type, job_config, db_conn)
    except ValueError:
        self._handle_job_initialization_failure(db_conn, job_id)
        return

    if job_handle.is_stream_extraction_active():
        self._handle_active_extraction(db_conn, job_id, job_handle)
        return

    if job_handle.is_stream_extracted(results_cache_uri, stream_collection_name):
        self._handle_completed_extraction(db_conn, job_id, job_handle)
        return

    self._create_and_dispatch_extraction_job(db_conn, job_id, job_handle)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7a3398f and 1d1599e.

📒 Files selected for processing (1)
  • components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (19 hunks)
🔇 Additional comments (2)
components/job-orchestration/job_orchestration/scheduler/query/query_scheduler.py (2)

82-104: Well-designed abstraction for stream extraction handling

The StreamExtractionHandle abstract base class provides a clean interface for handling different types of extractions. The abstraction effectively separates concerns and makes the code more maintainable.


1132-1134: LGTM: Clean integration of stream collection configuration

The addition of the stream collection name parameter is well-integrated into the existing configuration flow.
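As a rough usage sketch (the config attribute path is an assumption; only the is_stream_extracted signature is taken from this PR), the propagated collection name would be consumed roughly like this:

# Hypothetical wiring: the exact attribute names on the CLP config object are
# assumptions for illustration.
stream_collection_name = clp_config.results_cache.stream_collection_name

if job_handle.is_stream_extracted(results_cache_uri, stream_collection_name):
    logger.info(f"Streams for job {job_id} already extracted; marking job as succeeded.")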

Member

@kirkrodrigues kirkrodrigues left a comment

For the PR title, how about:

feat(clp-package): Add support for extracting JSON streams from archives.

@haiqi96 haiqi96 changed the title feat(clp-package): Add support for Json extraction task feat(clp-package): Add support for extracting JSON streams from archives. Nov 18, 2024
@kirkrodrigues kirkrodrigues merged commit d969aaf into y-scope:main Nov 18, 2024
6 checks passed
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
…ves. (y-scope#569)

Co-authored-by: kirkrodrigues <2454684+kirkrodrigues@users.noreply.github.com>