Document word count is incorrect when segments change #10335

Closed
5 tasks done
nadirvishun opened this issue Nov 6, 2024 · 2 comments · Fixed by #10449

nadirvishun commented Nov 6, 2024

Self Checks

  • This is only for bug report, if you would like to ask a question, please head to Discussions.
  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (I have read and agree to the Language Policy).
  • [FOR CHINESE USERS] Please be sure to submit issues in English, otherwise they will be closed. Thank you! :)
  • Please do not modify this template :) and fill in all the required fields.

Dify version

0.11.0

Cloud or Self Hosted

Self Hosted (Docker), Self Hosted (Source)

Steps to reproduce

Both the documents and document_segments tables in the database contain a word_count field, and the value is wrong in both (and, as a result, at the dataset level):

  • In segments: In QA mode, the content field holds the question and the answer field holds the answer, but only content is counted. The length of answer is not included in word_count, as shown in the following figure:

    image

  • In documents: The value is only correct right after the document is first indexed. When a segment is created, batch created, updated, or deleted, the document's word_count is not updated to match, as shown in the following figure:

    image

  • In datasets: The word count is the sum over all documents, so it inherits the same error:

    image
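
To make the intended relationship between the three counts concrete, here is a minimal Python sketch. It is not Dify's actual code: the `Segment` and `Document` classes and the `dataset_word_count` helper are hypothetical names used only for illustration, and the count is character based, which is an assumption taken from the use of CHAR_LENGTH in the SQL in this report.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segment:
    """Hypothetical stand-in for a row in document_segments."""
    content: str
    answer: Optional[str] = None  # only populated in QA mode

    @property
    def word_count(self) -> int:
        # Count the question (content) and, when present, the answer too.
        return len(self.content) + len(self.answer or "")


@dataclass
class Document:
    """Hypothetical stand-in for a row in documents."""
    segments: List[Segment] = field(default_factory=list)

    @property
    def word_count(self) -> int:
        # A document's count is the sum of its segments' counts.
        return sum(segment.word_count for segment in self.segments)


def dataset_word_count(documents: List[Document]) -> int:
    # A dataset's count is the sum of its documents' counts.
    return sum(document.word_count for document in documents)
```

With these definitions, adding or removing a segment changes the document and dataset totals automatically, which is the behavior the stored word_count fields currently fail to match.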

Existing data can be corrected with SQL along the following lines (for reference only):

  • First, fix the segments:
    UPDATE document_segments
    SET word_count = CHAR_LENGTH(content) + COALESCE(CHAR_LENGTH(answer), 0);
  • Then fix the documents:
    UPDATE documents
    SET word_count = t.word_count
    FROM (
        SELECT document_id, SUM(word_count) AS word_count
        FROM document_segments
        GROUP BY document_id
    ) AS t
    WHERE documents.id = t.document_id;

✔️ Expected Behavior

The word count is correct.

❌ Actual Behavior

The word count is incorrect.


dosubot bot commented Nov 6, 2024

The issue you're experiencing with the word count in documents and document_segments seems to be related to how the word count is calculated and updated. Here are some insights based on the information available:

  1. Segment Word Count: The word count for document_segments is initially set using the length of the content field. However, in qa mode, it appears that only the content is being counted, and the answer field is not included in the word count. This discrepancy needs to be addressed in the code to ensure both content and answer are considered when calculating the word count for segments.

  2. Document Word Count: The word count for a document is supposed to be dynamically updated whenever segments are created, updated, or deleted. The Document class has a word_count property that calculates the total word count by summing up the word counts of all segments associated with the document. If this is not happening, it might be due to missing triggers or updates in the code that handle these changes [1][2].

  3. Dataset Word Count: The word count for the entire dataset is aggregated from individual documents. If the document word counts are incorrect, this will naturally lead to incorrect dataset word counts as well [2].

To resolve these issues, you may need to review and modify the code responsible for updating word counts in both document_segments and documents to ensure they reflect changes accurately. This might involve ensuring that both content and answer fields are included in the word count calculation for segments and verifying that document word counts are recalculated whenever segment changes occur.
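
One way to keep a stored document total in sync is to apply a delta on every segment change. The sketch below is only an illustration under stated assumptions: `DocumentWordCounter` and `segment_word_count` are hypothetical names, not part of Dify's codebase, and the count is character based to match the CHAR_LENGTH approach used in the SQL above.

```python
from typing import Dict, Optional


def segment_word_count(content: str, answer: Optional[str] = None) -> int:
    # Hypothetical helper: count content plus the QA answer when present.
    return len(content) + len(answer or "")


class DocumentWordCounter:
    """Hypothetical in-memory model of a document's stored word_count."""

    def __init__(self) -> None:
        self.word_count = 0
        self._segment_counts: Dict[str, int] = {}

    def create_segment(self, segment_id: str, content: str,
                       answer: Optional[str] = None) -> None:
        if segment_id in self._segment_counts:
            raise ValueError(f"segment {segment_id!r} already exists")
        count = segment_word_count(content, answer)
        self._segment_counts[segment_id] = count
        self.word_count += count  # add the new segment's count

    def update_segment(self, segment_id: str, content: str,
                       answer: Optional[str] = None) -> None:
        new_count = segment_word_count(content, answer)
        # Apply only the difference between the new and old counts.
        self.word_count += new_count - self._segment_counts[segment_id]
        self._segment_counts[segment_id] = new_count

    def delete_segment(self, segment_id: str) -> None:
        # Remove the segment and subtract its count from the total.
        self.word_count -= self._segment_counts.pop(segment_id)
```

Batch creation can be modeled as repeated calls to `create_segment`. In a real database-backed implementation the same delta would need to be applied inside the transaction that changes the segment, so the two values cannot drift apart.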


@dosubot dosubot bot added the 🐞 bug Something isn't working label Nov 6, 2024
@JohnJyong
Collaborator

@nadirvishun Currently we only count the query length; this will be improved in the next version.
@JzoNgKVO If the word count does not reach 1K, the front end displays 0K.
