Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Citation Minor Bugs #3294

Merged
merged 1 commit into from
Dec 1, 2024
Merged

Fix Citation Minor Bugs #3294

merged 1 commit into from
Dec 1, 2024

Conversation

yuhongsun96
Copy link
Contributor

Description

  • Citations where the LLM produces double brackets do not work, for example [[, 4, ]]
  • Citations that have some characters in the token after the end of the citation, for example [, 1, ].

How Has This Been Tested?

Unit tests haven't broken and a new test was added to cover these two cases

Accepted Risk (provide if relevant)

That code is complicated and brittle, hopefully no new bugs

Related Issue(s) (provide if relevant)

N/A

Mental Checklist:

  • All of the automated tests pass
  • All PR comments are addressed and marked resolved
  • If there are migrations, they have been rebased to latest main
  • If there are new dependencies, they are added to the requirements
  • If there are new environment variables, they are added to all of the deployment methods
  • If there are new APIs that don't require auth, they are added to PUBLIC_ENDPOINT_SPECS
  • Docker images build and basic functionalities work
  • Author has done a final read through of the PR right before merge

Backporting (check the box to trigger backport action)

Note: You have to check that the action passes, otherwise resolve the conflicts manually and tag the patches.

  • This PR should be backported (make sure to check that the backport attempt succeeds)

Copy link

vercel bot commented Dec 1, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
internal-search ✅ Ready (Inspect) Visit Preview 💬 Add feedback Dec 1, 2024 8:05pm

@yuhongsun96 yuhongsun96 merged commit 9bd0cb9 into main Dec 1, 2024
12 of 13 checks passed
@yuhongsun96 yuhongsun96 deleted the citation-fixes branch December 1, 2024 21:55
@@ -385,6 +385,16 @@ def process_text(
"Here is some text[[1]](https://0.com). Some other text",
["doc_0"],
),
# ['To', ' set', ' up', ' D', 'answer', ',', ' if', ' you', ' are', ' running', ' it', ' yourself', ' and',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was removed previously, but maybe we could add back a comment explaining the logic here?

    """
    Key aspects:
    1. Stream Processing:
    - Processes tokens one by one, allowing for real-time handling of large texts.
    2. Citation Detection:
    - Uses regex to find citations in the format [number].
    - Example: [1], [2], etc.
    3. Citation Mapping:
    - Maps detected citation numbers to actual document ranks using doc_id_to_rank_map.
    - Example: [1] might become [3] if doc_id_to_rank_map maps it to 3.
    4. Citation Formatting:
    - Replaces citations with properly formatted versions.
    - Adds links if available: [[1]](https://example.com)
    - Handles cases where links are not available: [[1]]()
    5. Duplicate Handling:
    - Skips consecutive citations of the same document to avoid redundancy.
    6. Output Generation:
    - Yields DanswerAnswerPiece objects for regular text.
    - Yields CitationInfo objects for each unique citation encountered.
    7. Context Awareness:
    - Uses context_docs to access document information for citations.
    This function effectively processes a stream of text, identifies and reformats citations,
    and provides both the processed text and citation information as output.
    """

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous comment (above one) wasn't very clear, but some steps of the logic would probably help keep this logic maintainable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants