refactor(clp-package): Unify the metadata schema for JSON and IR streams. #620

Merged
merged 3 commits into from
Dec 6, 2024

Conversation

Contributor

@haiqi96 haiqi96 commented Dec 2, 2024

Description

This PR continues the work in #596.
It makes three changes to the stream metadata so that IR and JSON metadata share the same schema:

  1. Remove file_split_id from the IR metadata, since the WebUI does not use it.
  2. Replace orig_file_id and archive_id with stream_id.
  3. Rename is_last_ir_chunk to is_last_chunk.

With the changes above, we are able to simplify the webui code in the same PR.
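
As a rough illustration of the three changes listed above, the sketch below maps a pre-PR metadata document to the unified shape. The field names come from this description; the helper `toUnifiedStreamMetadata`, the fallback order between the two legacy identifiers, and the default for a missing flag are assumptions made for the example, not code from this PR.

```javascript
// Hypothetical helper (not part of the PR): converts a legacy metadata
// document into the unified schema described above.
function toUnifiedStreamMetadata(doc) {
  const {
    orig_file_id: origFileId,
    archive_id: archiveId,
    file_split_id: _fileSplitId, // dropped: the WebUI does not use it
    is_last_ir_chunk: isLastIrChunk,
    ...rest
  } = doc;

  return {
    ...rest,
    // Assumption: use whichever legacy identifier is present.
    stream_id: rest.stream_id ?? origFileId ?? archiveId,
    // Assumption: a missing flag is treated as false.
    is_last_chunk: rest.is_last_chunk ?? isLastIrChunk ?? false,
  };
}

// Example with made-up values:
const unified = toUnifiedStreamMetadata({
  orig_file_id: "example-file-id",
  file_split_id: "example-split-id",
  begin_msg_ix: 0,
  end_msg_ix: 100,
  is_last_ir_chunk: true,
  path: "example-file-id_0_100",
});
console.log(unified);
```

Fields that are not renamed or removed (begin_msg_ix, end_msg_ix, path) pass through unchanged.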

Validation performed

Manually tested both CLP and CLP-S log viewing.
Manually verified that the extracted stream metadata in MongoDB matches expectations.

Summary by CodeRabbit

Release Notes

  • New Features

    • Enhanced metadata handling for extracted streams with updated identifiers.
    • Added a new stream_id property to the response of extraction jobs.
  • Bug Fixes

    • Improved error handling for connection failures in result caching.
    • Enhanced error handling for file decompression scenarios.
  • Documentation

    • Updated type definitions and return types for API calls related to stream extraction.
    • Clarified parameter documentation for the decompression methods.

These changes improve the overall functionality and reliability of the application, ensuring better alignment with the updated data structure.

Contributor

coderabbitai bot commented Dec 2, 2024

Walkthrough

The changes in this pull request involve modifications primarily to the output handling and result processing components across several files. Key updates include renaming constants in BSON document constructions, altering method signatures to accommodate unused parameters, and refining error handling in various methods. The changes reflect a restructuring of how results are stored and retrieved, particularly focusing on the identifiers used in the database schema. Overall, the functionality remains consistent while aligning the code with a new organizational structure.

Changes

File Change Summary
components/core/src/clp/clo/OutputHandler.cpp Updated flush method in ResultsCacheOutputHandler to change BSON keys; marked parameters as [[maybe_unused]] in CountOutputHandler::add_result.
components/core/src/clp/clo/clo.cpp Renamed BSON key in extract_ir function; enhanced error handling for connection failures; updated output handler creation in search function.
components/core/src/clp/clo/constants.hpp Removed OrigFileId from cResultsCacheKeys, added to SearchOutput, and updated StreamId in IrOutput.
components/core/src/clp_s/JsonConstructor.cpp Renamed constant from cOrigFileId to cStreamId in BSON construction; retained existing error handling and logic.
components/core/src/clp_s/archive_constants.hpp Removed cOrigFileId, added cStreamId in results_cache::decompression.
components/log-viewer-webui/client/src/api/query.js Renamed ExtractIrResp to ExtractStreamResp, added stream_id, removed other properties, and updated return type of submitExtractStreamJob.
components/log-viewer-webui/server/src/DbManager.js Changed query filter from orig_file_id to stream_id in getExtractedStreamFileMetadata method.
components/core/src/clp/clp/FileDecompressor.hpp Updated parameter documentation in decompress_to_ir method from is_last_ir_chunk to is_last_chunk.
components/core/src/clp/clp/decompression.cpp Enhanced error handling and control flow in decompress function; updated parameter name in decompress_to_ir.
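
To make the query.js row above more concrete, here is a minimal sketch of what the renamed response type and a runtime shape check might look like on the client. The property list is an assumption pieced together from the review notes in this thread (stream_id, begin_msg_ix, end_msg_ix, path, plus the renamed is_last_chunk flag); the guard `isExtractStreamResp` is hypothetical and not part of the PR.

```javascript
/**
 * Assumed shape of the extraction response after this PR.
 *
 * @typedef {object} ExtractStreamResp
 * @property {string} stream_id
 * @property {number} begin_msg_ix
 * @property {number} end_msg_ix
 * @property {boolean} is_last_chunk
 * @property {string} path
 */

/**
 * Hypothetical runtime check that a value has the assumed shape.
 *
 * @param {unknown} value
 * @return {boolean}
 */
const isExtractStreamResp = (value) => (
  null !== value &&
  "object" === typeof value &&
  "string" === typeof value.stream_id &&
  Number.isInteger(value.begin_msg_ix) &&
  Number.isInteger(value.end_msg_ix) &&
  "boolean" === typeof value.is_last_chunk &&
  "string" === typeof value.path
);
```

A check like this would reject a response that still carries only the old orig_file_id key, which can help catch a server and client that are on different schema versions.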

Possibly related PRs

Suggested reviewers

  • gibber9809

@haiqi96 haiqi96 marked this pull request as ready for review December 2, 2024 22:29
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
components/log-viewer-webui/server/src/DbManager.js (1)

174-176: LGTM: Query field aligned with stream_id convention

The update to use stream_id in the MongoDB query is consistent with the standardization effort.

Consider adding an index on stream_id, begin_msg_ix, and end_msg_ix fields if not already present, to optimize this query:

await this.#streamFilesCollection.createIndex({
  stream_id: 1,
  begin_msg_ix: 1,
  end_msg_ix: 1
});
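
For context, the kind of lookup such an index would serve can be sketched as a filter on stream_id plus a range check on the two message indices. The helper names below are hypothetical, and the half-open range [begin_msg_ix, end_msg_ix) is an assumption; the actual filter in getExtractedStreamFileMetadata may differ.

```javascript
// Hypothetical helper: builds a MongoDB filter for the stream file that
// contains a given log event index. Assumes each document covers the
// half-open range [begin_msg_ix, end_msg_ix).
function buildStreamFileFilter(streamId, logEventIdx) {
  return {
    stream_id: streamId,
    begin_msg_ix: {$lte: logEventIdx},
    end_msg_ix: {$gt: logEventIdx},
  };
}

// Plain-JavaScript equivalent of the filter, useful for checking the
// range logic without a database.
function matchesStreamFile(doc, streamId, logEventIdx) {
  return doc.stream_id === streamId &&
    doc.begin_msg_ix <= logEventIdx &&
    logEventIdx < doc.end_msg_ix;
}
```

With the MongoDB Node.js driver, the filter object could be passed to a collection's `findOne` call. The equality field comes first in the suggested compound index, followed by the two range fields.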
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between ec0821d and b599a08.

📒 Files selected for processing (7)
  • components/core/src/clp/clo/OutputHandler.cpp (1 hunks)
  • components/core/src/clp/clo/clo.cpp (1 hunks)
  • components/core/src/clp/clo/constants.hpp (1 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (1 hunks)
  • components/core/src/clp_s/archive_constants.hpp (1 hunks)
  • components/log-viewer-webui/client/src/api/query.js (2 hunks)
  • components/log-viewer-webui/server/src/DbManager.js (1 hunks)
🧰 Additional context used
📓 Learnings (1)
components/log-viewer-webui/client/src/api/query.js (1)
Learnt from: junhaoliao
PR: y-scope/clp#596
File: components/log-viewer-webui/client/src/api/query.js:16-23
Timestamp: 2024-11-21T15:51:33.203Z
Learning: In `components/log-viewer-webui/client/src/api/query.js`, the `ExtractJsonResp` type definition is accurate as-is and does not require modification. When suggesting changes to type definitions, ensure they align with the server-side definitions, referencing the source code if necessary.
🔇 Additional comments (6)
components/core/src/clp/clo/constants.hpp (2)

15-15: Verify SearchOutput namespace organization

The addition of OrigFileId to SearchOutput namespace is logical, maintaining clear separation between IR and search output constants.


8-8: Verify usage of new StreamId constant

The addition of StreamId aligns with the PR objective of standardizing metadata. However, we should verify all usages of this new constant.

✅ Verification successful

No issues found with StreamId constant usage

The new StreamId constant is properly used in the codebase, and there are no remaining instances of hardcoded "stream_id" strings that need to be updated. The constant is referenced in:

  • components/core/src/clp/clo/clo.cpp for results cache keys
  • components/core/src/clp_s/archive_constants.hpp which defines a similar constant
  • components/webui/imports/ui/SearchView/SearchResults/SearchResultsTable/index.jsx which uses the stream ID in a different context

The old file_split_id references are in different components and contexts, not related to this standardization change.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any remaining references to file_split_id to ensure complete migration
rg -i "file_split_id" --type cpp --type hpp

# Search for usage of the new StreamId constant
rg "StreamId\[\]" --type cpp --type hpp

Length of output: 141


Script:

#!/bin/bash
# Search for any remaining references to file_split_id
rg -i "file_split_id"

# Search for usage of the new StreamId constant
rg "StreamId"

# List all hpp and cpp files to understand the codebase structure
fd "\.(hpp|cpp)$"

Length of output: 37011

components/log-viewer-webui/client/src/api/query.js (2)

22-22: Confirm API response type alignment

The return type change to ExtractStreamResp appears correct, but we should verify the server-side response structure matches.

✅ Verification successful

API response type is properly defined and aligned

The ExtractStreamResp type is well-defined in the same file with all required properties (stream_id, begin_msg_ix, end_msg_ix, is_last_ir_chunk, path), and the server-side endpoint at /query/extract-stream handles the corresponding request parameters (extractJobType, streamId, logEventIdx). The type change is safe and correctly implemented.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for extract-stream endpoint implementation
rg -A 10 "extract-stream" --type cpp --type hpp

Length of output: 75


Script:

#!/bin/bash
# Search for extract-stream endpoint implementation in all files
rg -A 10 "extract-stream" -i

# Also search for ExtractStreamResp type definition
rg "ExtractStreamResp" -A 5

Length of output: 2831


5-12: Verify type definition completeness

The new ExtractStreamResp type includes is_last_ir_chunk which seems IR-specific. Consider if this should be in a separate type for IR responses.

✅ Verification successful

Let me gather more context about the usage of this type and related IR functionality.


Based on the gathered context, I can now provide a final response.

The is_last_ir_chunk property is correctly placed in ExtractStreamResp

The codebase shows that IR (Intermediate Representation) is a specific stream type in the system, with dedicated handling:

  • The server has an EXTRACT_IR job type distinct from EXTRACT_JSON
  • The client UI checks for IR streams with IS_IR_STREAM condition
  • The is_last_ir_chunk property is specifically used when handling IR stream extractions

The property belongs in the ExtractStreamResp type as it's a necessary part of the IR extraction response flow, not warranting a separate type definition.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of ExtractStreamResp type
rg "ExtractStreamResp" --type js --type ts

# Search for references to is_last_ir_chunk
rg "is_last_ir_chunk" --type js --type ts

Length of output: 380


Script:

#!/bin/bash
# Search for the implementation of extract stream functionality
ast-grep --pattern 'function extractStream($$$) { $$$ }'

# Search for IR related code context
rg -B 2 -A 2 "ir" --type js --type ts -i

# Look for the query.js file content
cat components/log-viewer-webui/client/src/api/query.js

Length of output: 31926

components/core/src/clp_s/archive_constants.hpp (1)

32-32: Verify CLP-S decompression functionality

The replacement of cOrigFileId with cStreamId in the decompression namespace aligns with the standardization objective. However, we should verify this change doesn't affect CLP-S decompression functionality.

✅ Verification successful

No impact on CLP-S decompression functionality

The change from cOrigFileId to cStreamId is safe as:

  • The old constant cOrigFileId has no references in the codebase
  • The new constant cStreamId is properly used in JsonConstructor.cpp for decompression results
  • The decompression functionality is well-tested through test-Segment.cpp which verifies data integrity after compression/decompression cycles
  • The change aligns with the standardization without affecting the core decompression logic
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of the old and new constants
rg "cOrigFileId" --type cpp --type hpp
rg "cStreamId" --type cpp --type hpp

# Look for any decompression-related tests
fd -e cpp -e hpp | rg -i "test.*decompress"

Length of output: 177


Script:

#!/bin/bash
# Search for usage of the old and new constants with correct file extensions
rg "cOrigFileId" -t cc -t h -t cpp -t hpp
rg "cStreamId" -t cc -t h -t cpp -t hpp

# Look for decompression-related tests with proper file extensions
find . -type f \( -name "*.cpp" -o -name "*.hpp" -o -name "*.cc" -o -name "*.h" \) -exec grep -l -i "test.*decompress" {} \;

# Search for decompression-related functionality
rg "decompress" -t cc -t h -t cpp -t hpp -A 5

Length of output: 381


Script:

#!/bin/bash
# Let's examine the test file found and check for usage of constants
cat ./components/core/tests/test-Segment.cpp

# Try searching with ripgrep without file type specifications
rg "cOrigFileId"
rg "cStreamId"

# Look for decompression functionality in the codebase
rg "decompress" -A 5

Length of output: 68017

components/core/src/clp_s/JsonConstructor.cpp (1)

125-125: LGTM: Consistent with stream_id standardization

The renaming from cOrigFileId to cStreamId aligns well with the PR objectives to standardize the metadata scheme.

components/core/src/clp/clo/OutputHandler.cpp (resolved)
components/core/src/clp/clo/clo.cpp (resolved)
gibber9809 previously approved these changes Dec 5, 2024
Contributor

@gibber9809 gibber9809 left a comment

LGTM.

@gibber9809
Contributor

LGTM.

I think for the title you may just need a space after the ":" character.

@haiqi96 haiqi96 changed the title chore(clp-package):Use a common scheme for JSON and IR metadata. chore(clp-package): Use a common scheme for JSON and IR metadata. Dec 5, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
components/core/src/clp/clp/FileDecompressor.hpp (1)

42-42: LGTM! Consider adding parameter description

The parameter rename from is_last_ir_chunk to is_last_chunk improves clarity. Consider adding a brief description of what this parameter indicates.

Add a description like:

-     * size_t begin_message_ix, size_t end_message_ix, bool is_last_chunk) -> bool;
+     * size_t begin_message_ix, size_t end_message_ix, bool is_last_chunk) -> bool;
+     * @param is_last_chunk Indicates whether this is the final chunk in the sequence
components/core/src/clp/clp/decompression.cpp (1)

285-285: Consider implementing final chunk handling

The is_last_chunk parameter is marked as unused, but it could be valuable for implementing special handling for the final chunk (e.g., cleanup operations, metadata updates, or validation).

Consider implementing final chunk handling:

-                                     [[maybe_unused]] bool is_last_chunk) {
+                                     bool is_last_chunk) {
             auto dest_ir_file_name = orig_file_id;
             dest_ir_file_name += "_" + std::to_string(begin_message_ix);
             dest_ir_file_name += "_" + std::to_string(end_message_ix);
+            if (is_last_chunk) {
+                // Add final chunk handling (e.g., metadata updates, validation)
+                SPDLOG_DEBUG("Processing final chunk for {}", orig_file_id);
+            }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b599a08 and 826138c.

📒 Files selected for processing (7)
  • components/core/src/clp/clo/clo.cpp (3 hunks)
  • components/core/src/clp/clo/constants.hpp (1 hunks)
  • components/core/src/clp/clp/FileDecompressor.hpp (1 hunks)
  • components/core/src/clp/clp/decompression.cpp (1 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (2 hunks)
  • components/core/src/clp_s/archive_constants.hpp (1 hunks)
  • components/log-viewer-webui/client/src/api/query.js (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (5)
  • components/core/src/clp/clo/clo.cpp
  • components/core/src/clp_s/JsonConstructor.cpp
  • components/core/src/clp_s/archive_constants.hpp
  • components/log-viewer-webui/client/src/api/query.js
  • components/core/src/clp/clo/constants.hpp
🔇 Additional comments (1)
components/core/src/clp/clp/decompression.cpp (1)

Line range hint 287-290: Verify consistency with stream_id changes

The PR objectives mention replacing orig_file_id with stream_id, but the code still uses orig_file_id in file naming. This might need to be updated for consistency.

Let's verify the usage of these identifiers across the codebase:

@haiqi96 haiqi96 changed the title chore(clp-package): Use a common scheme for JSON and IR metadata. refactor(clp-package): Use a common scheme for JSON and IR metadata. Dec 5, 2024
Collaborator

@junhaoliao junhaoliao left a comment

the webUI changes lgtm

Member

@kirkrodrigues kirkrodrigues left a comment

For the PR title, how about:

refactor(clp-package): Unify the metadata schema for JSON and IR streams.

@haiqi96 haiqi96 changed the title refactor(clp-package): Use a common scheme for JSON and IR metadata. refactor(clp-package): Unify the metadata schema for JSON and IR streams. Dec 6, 2024
@haiqi96 haiqi96 merged commit 60d85d0 into y-scope:main Dec 6, 2024
24 checks passed
@haiqi96 haiqi96 deleted the stream_id_fix branch December 6, 2024 20:28