Fix Citation Minor Bugs #3294

Merged (1 commit), Dec 1, 2024
Contributor

This docstring was removed previously, but maybe we could add back a comment explaining the logic here?

    """
    Key aspects:
    1. Stream Processing:
    - Processes tokens one by one, allowing for real-time handling of large texts.
    2. Citation Detection:
    - Uses regex to find citations in the format [number].
    - Example: [1], [2], etc.
    3. Citation Mapping:
    - Maps detected citation numbers to actual document ranks using doc_id_to_rank_map.
    - Example: [1] might become [3] if doc_id_to_rank_map maps it to 3.
    4. Citation Formatting:
    - Replaces citations with properly formatted versions.
    - Adds links if available: [[1]](https://example.com)
    - Handles cases where links are not available: [[1]]()
    5. Duplicate Handling:
    - Skips consecutive citations of the same document to avoid redundancy.
    6. Output Generation:
    - Yields DanswerAnswerPiece objects for regular text.
    - Yields CitationInfo objects for each unique citation encountered.
    7. Context Awareness:
    - Uses context_docs to access document information for citations.
    This function effectively processes a stream of text, identifies and reformats citations,
    and provides both the processed text and citation information as output.
    """
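
Steps 2 through 5 of the docstring above can be sketched on a complete string, leaving out the token-by-token streaming. This is a minimal illustration, not the project's implementation: `Doc` and `format_citations` are hypothetical names, and "consecutive" is assumed here to mean two citations of the same document with only whitespace between them.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a context document (not the project's class)."""
    document_id: str
    link: Optional[str] = None


CITATION = re.compile(r"\[(\d+)\]")


def format_citations(text, context_docs, doc_id_to_rank_map):
    """Rewrite [n] as [[rank]](link) and collect the cited document ids."""
    cited_ids = []
    last_rank = None
    prev_end = 0

    def replace(match):
        nonlocal last_rank, prev_end
        num = int(match.group(1))
        if not 1 <= num <= len(context_docs):
            return match.group(0)  # out-of-range numbers are left untouched
        doc = context_docs[num - 1]
        rank = doc_id_to_rank_map[doc.document_id]
        # Assumption: "consecutive" means nothing but whitespace in between.
        adjacent = text[prev_end:match.start()].strip() == ""
        prev_end = match.end()
        if rank == last_rank and adjacent:
            return ""  # drop the repeated citation
        last_rank = rank
        if doc.document_id not in cited_ids:
            cited_ids.append(doc.document_id)
        return f"[[{rank}]]({doc.link or ''})"

    return CITATION.sub(replace, text), cited_ids


docs = [Doc("doc_0", "https://0.com"), Doc("doc_1")]
ranks = {"doc_0": 1, "doc_1": 2}
formatted, cited = format_citations("A [1][1] B [2].", docs, ranks)
print(formatted)  # A [[1]](https://0.com) B [[2]]().
print(cited)      # ['doc_0', 'doc_1']
```

The streaming version has to do the same work while text arrives in pieces, which is why it also needs to hold back a trailing fragment that might be the start of a citation.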

Contributor

The previous comment (the one above) wasn't very clear, but documenting some of the steps would probably help keep this logic maintainable.

@@ -67,23 +67,25 @@ def process_token(
         if piece_that_comes_after == "\n" and in_code_block(self.llm_out):
             self.curr_segment = self.curr_segment.replace("```", "```plaintext")

-        citation_pattern = r"\[(\d+)\]"
+        citation_pattern = r"\[(\d+)\]|\[\[(\d+)\]\]"
         citations_found = list(re.finditer(citation_pattern, self.curr_segment))
-        possible_citation_pattern = r"(\[\d*$)"  # [1, [, etc
+        possible_citation_pattern = r"(\[+\d*$)"
         possible_citation_found = re.search(
             possible_citation_pattern, self.curr_segment
         )

         if len(citations_found) == 0 and len(self.llm_out) - self.past_cite_count > 5:
             self.current_citations = []

-        result = ""  # Initialize result here
+        result = ""
         if citations_found and not in_code_block(self.llm_out):
             last_citation_end = 0
             length_to_add = 0
             while len(citations_found) > 0:
                 citation = citations_found.pop(0)
-                numerical_value = int(citation.group(1))
+                numerical_value = int(
+                    next(group for group in citation.groups() if group is not None)
+                )

                 if 1 <= numerical_value <= self.max_citation_num:
                     context_llm_doc = self.context_docs[numerical_value - 1]
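
The effect of the two-alternative pattern in this hunk can be checked in isolation. The sketch below reuses the pattern string and the `next(...)` group lookup from the diff; the helper name `citation_numbers` is hypothetical and exists only for illustration.

```python
import re

old_pattern = r"\[(\d+)\]"
new_pattern = r"\[(\d+)\]|\[\[(\d+)\]\]"


def citation_numbers(segment):
    """Return the citation numbers found by the new pattern, in order."""
    numbers = []
    for citation in re.finditer(new_pattern, segment):
        # Only one of the two capture groups is set for any given match.
        numbers.append(
            int(next(group for group in citation.groups() if group is not None))
        )
    return numbers


# The old pattern matches only the inner "[4]" and leaves the outer brackets.
old_match = re.search(old_pattern, "[[4]].").group(0)
# The new pattern consumes the whole double-bracket citation.
new_match = re.search(new_pattern, "[[4]].").group(0)
```

Because a double-bracket match fills the second group and leaves the first as `None`, `int(citation.group(1))` would raise a `TypeError` on it, which is why the lookup takes the first group that is not `None`.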
@@ -131,14 +133,6 @@ def process_token(

                     link = context_llm_doc.link

-                    # Replace the citation in the current segment
-                    start, end = citation.span()
-                    self.curr_segment = (
-                        self.curr_segment[: start + length_to_add]
-                        + f"[{target_citation_num}]"
-                        + self.curr_segment[end + length_to_add :]
-                    )
-
                     self.past_cite_count = len(self.llm_out)
                     self.current_citations.append(target_citation_num)

@@ -149,6 +143,7 @@ def process_token(
                             document_id=context_llm_doc.document_id,
                         )

+                    start, end = citation.span()
                     if link:
                         prev_length = len(self.curr_segment)
                         self.curr_segment = (
@@ -385,6 +385,16 @@ def process_text(
             "Here is some text[[1]](https://0.com). Some other text",
             ["doc_0"],
         ),
+        # ['To', ' set', ' up', ' D', 'answer', ',', ' if', ' you', ' are', ' running', ' it', ' yourself', ' and',
Contributor

Nit

+        # ' need', ' access', ' to', ' certain', ' features', ' like', ' auto', '-sync', 'ing', ' document',
+        # '-level', ' access', ' permissions', ',', ' you', ' should', ' reach', ' out', ' to', ' the', ' D',
+        # 'answer', ' team', ' to', ' receive', ' access', ' [[', '4', ']].', '']
+        (
+            "Unique tokens with double brackets and a single token that ends the citation and has characters after it.",
+            ["... to receive access", " [[", "1", "]].", ""],
+            "... to receive access [[1]](https://0.com).",
+            ["doc_0"],
+        ),
     ],
 )
 def test_citation_extraction(
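
The new test case exercises a citation split across tokens (`" [["`, `"1"`, `"]]."`). A simplified sketch of the hold-back idea follows, reusing the two patterns from the diff. `stream_with_citations` and `_format` are hypothetical helpers, links are looked up by position, and the real `process_token` tracks more state. The sketch only covers splits like the one in the test and would mishandle a token that ends after a single closing bracket.

```python
import re

CITATION = re.compile(r"\[(\d+)\]|\[\[(\d+)\]\]")
POSSIBLE_CITATION = re.compile(r"(\[+\d*$)")


def _format(match, links):
    """Turn a matched citation into [[n]](link); leave unknown numbers alone."""
    num = int(next(group for group in match.groups() if group is not None))
    if not 1 <= num <= len(links):
        return match.group(0)
    return f"[[{num}]]({links[num - 1]})"


def stream_with_citations(tokens, links):
    """Yield text pieces, holding back a tail that may be an unfinished citation."""
    segment = ""
    for token in tokens:
        segment += token
        if POSSIBLE_CITATION.search(segment):
            continue  # segment ends with "[", "[[" or "[[1": wait for more tokens
        if segment:
            yield CITATION.sub(lambda m: _format(m, links), segment)
        segment = ""
    if segment:
        yield segment  # flush whatever was still held back at end of stream


tokens = ["... to receive access", " [[", "1", "]].", ""]
pieces = list(stream_with_citations(tokens, ["https://0.com"]))
answer = "".join(pieces)
```

With the earlier `possible_citation_pattern` (`\[\d*$`), the segment `" [["` still matches on its last bracket, but `\[+` makes the captured group cover the whole run of opening brackets rather than only the final one.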