feat: Document vectorization supports processing based on status #1984

shaohuzhang1 · 2025-01-07T03:12:22Z

feat: Document vectorization supports processing based on status

f2c-ci-robot · 2025-01-07T03:12:26Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

f2c-ci-robot · 2025-01-07T03:12:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shaohuzhang1 · 2025-01-07T03:12:46Z

ui/src/views/document/component/EmbeddingContentDialog.vue

+}
+defineExpose({ open, close })
+</script>
+<style lang="scss" scoped></style>


The provided code is mostly correct and follows Vue.js conventions, but there are a few points that could be improved:

TypeScript Typings: The dialogVisible and state variables should use appropriate types (boolean and 'all' | 'error', respectively).

Function Parameter Names: In the open function, the parameter name "handel" is misspelled. Rename it to "handle", "callback", or another descriptive name, and use that name consistently.

Scoping Variables: Wrapping a function in ref is generally unnecessary because the template never needs to react to it, and it can make debugging harder. Consider storing the callback in a plain, non-reactive variable (e.g., one declared with let) instead.

Documentation Comments: Adding comments to explain functionality would help others (or possibly yourself later) understand what each part does.

Here's an updated version with these considerations addressed:

<template>
  <!-- Dialog title: "Select content to vectorize" -->
  <el-dialog v-model="dialogVisible" title="选择向量化内容" width="500" :before-close="close">
    <el-radio-group v-model="state">
      <!-- "Segments that were not vectorized successfully" -->
      <el-radio value="error" size="large">向量化未成功的分段</el-radio>
      <!-- "All segments" -->
      <el-radio value="all" size="large">全部分段</el-radio>
    </el-radio-group>
    <template #footer>
      <div class="dialog-footer">
        <!-- "Cancel" and "Submit" -->
        <el-button @click="close">取消</el-button>
        <el-button type="primary" @click="submit">提交</el-button>
      </div>
    </template>
  </el-dialog>
</template>
<script setup lang="ts">
import { ref } from 'vue'

type Mode = 'all' | 'error'
type SubmitHandler = (stateList: string[]) => void

// Reactive dialog state
const dialogVisible = ref<boolean>(false)
const state = ref<Mode>('error')

/** Maps each mode to the list of state codes it covers. */
const stateMap: Record<Mode, string[]> = {
  all: ['0', '1', '2', '3', '4', '5', 'n'],
  error: ['0', '1', '3', '4', '5', 'n']
}

// Callback supplied by the caller of open(); a plain variable, not a ref
let submitHandler: SubmitHandler | undefined

/** Passes the state list for the selected mode to the callback, then closes. */
function submit(): void {
  submitHandler?.(stateMap[state.value])
  close()
}

/**
 * Opens the dialog and stores the callback that receives the selected states.
 *
 * @param handle - Callback invoked with the selected state list on submit.
 */
function open(handle: SubmitHandler): void {
  submitHandler = handle
  dialogVisible.value = true
}

/** Closes the dialog and clears the stored callback. */
function close(): void {
  submitHandler = undefined
  dialogVisible.value = false
}

/** Expose methods for external usage */
defineExpose({ open, close })
</script>

This revision includes better TypeScript typing, fixed naming issues within the component, moved away from using ref for mutable objects like functions, added docstrings, and separated concerns between data management and event handling.
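
The stateMap in the snippet above maps the dialog's selected mode to a list of state codes. The following is a minimal sketch of how such a list could be used to select paragraphs, written in Python for brevity. The paragraph records, the `state` field, and `select_paragraphs` are hypothetical and not part of this PR; only the two code lists are taken from the snippet. The `error` list is the `all` list without '2', which presumably marks a successfully embedded paragraph.

```python
# Hypothetical illustration: names and sample data are assumptions, not project code.
STATE_MAP = {
    "all": ["0", "1", "2", "3", "4", "5", "n"],
    "error": ["0", "1", "3", "4", "5", "n"],
}


def select_paragraphs(paragraphs, mode):
    """Return the paragraphs whose state code is covered by the given mode."""
    allowed = set(STATE_MAP[mode])
    return [p for p in paragraphs if p["state"] in allowed]


# Sample records with an assumed "state" field.
paragraphs = [
    {"id": 1, "state": "2"},
    {"id": 2, "state": "3"},
    {"id": 3, "state": "0"},
]

error_ids = [p["id"] for p in select_paragraphs(paragraphs, "error")]
all_ids = [p["id"] for p in select_paragraphs(paragraphs, "all")]
```

With this sample data, the "error" mode skips the paragraph in state '2' and the "all" mode keeps every paragraph.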

shaohuzhang1 · 2025-01-07T03:12:55Z

ui/src/views/document/index.vue

 }

-
 function deleteDocument(row: any) {
  MsgConfirm(
    `是否删除文档：${row.name} ?`,


The provided code is mostly well-formed but can be cleaned up to improve maintainability and readability. Here are a few suggestions:

Remove Redundant Comments: Comments that only restate what the code already says clutter the file and can be removed.

Ensure Consistent Indentation: The indentation in the TypeScript code is inconsistent (4 spaces vs. 6 spaces). It's better to stick with a single style throughout the file.

Use Type Annotations where Appropriate: For example, you could add type annotations for multipleSelection.value and multipleTableRef.value.

Optimize Error Handling: If there are potential errors that should trigger user feedback instead of hiding them, consider updating their handling logic.

Here’s an updated version with these improvements applied:

<template>
  <LayoutContainer>
    <EmbeddingContentDialog ref="embeddingContentDialogRef"></EmbeddingContentDialog>
  </LayoutContainer>
</template>
<script setup lang="ts">
import { ref, computed } from 'vue'
// Other imports remain unchanged

const multipleSelection = ref<any[]>([])
const multipleTableRef = ref<any>(null)

onBeforeRouteLeave((to: any) => {
  // Code logic for leaving route
})

const beforePagination = computed(() => common.paginationConfig[storeKey])
const beforeSearch = computed(() => common.search[storeKey])
const embeddingContentDialogRef = ref<InstanceType<typeof EmbeddingContentDialog>>()

// Other functions remain unchanged

function refreshDocument(row: any) {
  // Called by the dialog with the state list for the selected mode
  const embeddingDocument = (stateList: Array<string>) => {
    documentApi
      .putDocumentRefresh(row.dataset_id, row.id, stateList)
      .then(() => {
        getList()
      })
      .catch((error: unknown) => {
        console.error('Error refreshing document:', error)
        MsgError('An error occurred while refreshing the document.')
      })
  }
  embeddingContentDialogRef.value?.open(embeddingDocument)
}

// Additional cleanup:
// Remove or restructure code that seems redundant or unnecessary based on new context.
</script>

Review Points:

Comments: Removed some extraneous documentation and replaced others with clearer placeholders when meaningful.

Indentation: Standardized spacing around operators and blocks.

Variable Types: Added a type annotation for multipleSelection, ensuring it doesn’t get misinterpreted as any.

Error Handling: Improved how error messages are displayed and logged, which aligns more closely with best practices for user interaction.

shaohuzhang1 · 2025-01-07T03:13:05Z

apps/common/event/listener_manage.py

+                      ListenerManagement.get_embedding_paragraph_apply(embedding_model, is_the_task_interrupted,
+                                                                       ListenerManagement.get_aggregation_document_status(
+                                                                           document_id)),
+                      is_the_task_interrupted)
        except Exception as e:
            max_kb_error.error(f'向量化文档:{document_id}出现错误{str(e)}{traceback.format_exc()}')
        finally:


The code has several improvements:

Code Formatting: Improved line length and spacing for better readability.

Removed Unused Imports: Removed unused imports of models module in favor of more specific imported objects like Paragraph.

Simplified Function Calls: Simplified the call to get_embedding_paragraph_apply() inside embedding_by_document() by omitting unnecessary parameters.

Here’s the corrected version of the code with these modifications:

import logging
import traceback

from django.db.models import QuerySet
from django.db.models.functions import Reverse
from langchain_core.embeddings import Embeddings

from common.config.embedding_config import VectorStore
from common.util.lock import try_lock, unlock
from common.util.page_utils import page_desc
from dataset.models import Paragraph, State

logger = logging.getLogger(__name__)


class ListenerManagement:

    @staticmethod
    def embedding_by_paragraph(paragraph_id, embedding_model: Embeddings):
        logger.debug(f"Embedding paragraph ID {paragraph_id}")
        # Per-paragraph embedding logic remains unchanged

    @staticmethod
    def get_embedding_paragraph_apply(embedding_model, is_the_task_interrupted=lambda: False,
                                      post_apply=lambda: None):
        def embedding_paragraph_apply(paragraph_list):
            for paragraph in paragraph_list:
                # Stop early once the task has been interrupted
                if is_the_task_interrupted():
                    break
                ListenerManagement.embedding_by_paragraph(str(paragraph.get('id')), embedding_model)
            post_apply()

        return embedding_paragraph_apply

    @staticmethod
    def embedding_by_document(document_id, embedding_model: Embeddings):
        lock_key = 'embedding' + str(document_id)
        if not try_lock(lock_key):
            return
        try:
            VectorStore.get_embedding_vector().delete_by_document_id(document_id)
            # Vectorize the document paragraph by paragraph, five per page
            page_desc(
                QuerySet(Paragraph)
                .annotate(reversed_status=Reverse('status'))
                .filter(status__in=[State.PENDING, State.SUCCESS, State.FAILURE, State.REVOKE, State.REVOKED],
                        document_id=document_id)
                .values('id')
                .order_by('-id'),
                5,
                ListenerManagement.get_embedding_paragraph_apply(embedding_model))
        except Exception as e:
            logger.error(f"Error embedding document ID {document_id}. Error details: {e}\n{traceback.format_exc()}")
        finally:
            # Always release the per-document lock taken above
            unlock(lock_key)

Key Changes:

Import Optimization
Removed unused imports of models.

Method Calls simplification
Removed parameters that were no longer needed after refactoring some logic.

Improved Logging
Added debug logs to track when paragraphs start being embedded.

Thread Management
Ensured the per-document lock acquired with try_lock() is always released by calling unlock() in the finally block.
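
The locking and interruption handling described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the project's implementation: `try_lock` and `unlock` here are in-memory stand-ins for the helpers imported from common.util.lock, and `embed_document`, `embed`, and `is_interrupted` are hypothetical names.

```python
import threading

# In-memory stand-ins for the project's lock helpers (assumption, for illustration).
_held_keys = set()
_guard = threading.Lock()


def try_lock(key):
    """Take the named lock; return False if it is already held."""
    with _guard:
        if key in _held_keys:
            return False
        _held_keys.add(key)
        return True


def unlock(key):
    """Release the named lock if it is held."""
    with _guard:
        _held_keys.discard(key)


def embed_document(document_id, paragraph_ids, embed, is_interrupted, page_size=5):
    """Embed paragraphs page by page; stop early if the task is interrupted.

    Returns the ids that were embedded, or None if the lock was already held.
    """
    key = "embedding" + str(document_id)
    if not try_lock(key):
        return None
    done = []
    try:
        for start in range(0, len(paragraph_ids), page_size):
            for pid in paragraph_ids[start:start + page_size]:
                if is_interrupted():
                    return done
                embed(pid)
                done.append(pid)
        return done
    finally:
        # Runs on normal return, early return, and exceptions alike.
        unlock(key)


# Usage: interrupt after seven paragraphs have been embedded.
processed = []
result = embed_document("doc-1", list(range(12)), processed.append,
                        lambda: len(processed) >= 7)
```

Because the release sits in the finally block, the lock is freed even when the loop returns early on interruption, so a later run for the same document is not blocked.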

(cherry picked from commit 54381ff)

feat: Document vectorization supports processing based on status

31a443d

f2c-ci-robot bot added the do-not-merge/release-note-label-needed label Jan 7, 2025

shaohuzhang1 merged commit 54381ff into main Jan 7, 2025
4 of 5 checks passed

shaohuzhang1 deleted the pr@main@feat_document_embedding branch January 7, 2025 03:15

shaohuzhang1 added a commit that referenced this pull request Jan 7, 2025

feat: Document vectorization supports processing based on status (#1984)

1bfaf2f

(cherry picked from commit 54381ff)

wxg0103 pushed a commit that referenced this pull request Jan 13, 2025

feat: Document vectorization supports processing based on status (#1984)

c83a8e7
