Research Reference Citations - with CL #4978

Open
flooie opened this issue Jan 27, 2025 · 8 comments
@flooie
Contributor

flooie commented Jan 27, 2025

Enable matching and annotation of the new Reference Citations (e.g. "Smith at 211") when generating html_with_citations.

The open question is really:

  1. Does html_with_citations generate annotations for each citation individually, or does it resolve the citations together first? If it doesn't resolve citations, you may need to do that and rely on matched full case citations to generate the appropriate link. This is somewhat related to the issue about linkifying pincites, as nearly all reference citations are pincite citations.
@flooie flooie converted this from a draft issue Jan 27, 2025
@grossir grossir moved this from January 27 to Feb 7 to In progress in Case Law Sprint Jan 27, 2025
@grossir
Contributor

grossir commented Jan 28, 2025

This issue was not clear to me on first read. After talking with Bill, I understand it is actually 2 sub-issues:

  1. Implement logic to include the new eyecite.ReferenceCitation model in Opinion.html_with_citations. That new model is basically the name of one of the parties plus a pincite: "($plaintiff|$defendant) at \d{1,5}". I have been checking this and, so far, I think we don't have to do anything: the logic in eyecite's _resolve_reference_citations is enough (I need to double-check; Bill just sent me more examples).

  2. Implement logic in CourtListener to identify harder-to-catch cases of "ReferenceCitations" that are not handled in eyecite because they need fields from the database. These are cases 1 and 4 as described here:

So the spec is to find and figure out examples like:

  1. "As said in Roe..."
  2. "Roe at 222 says..."
  3. "Roe,410 U.S. 113 at 155."
  4. "As said in Citizens' Alliance..."

For that, we would need to pick up some of the OpinionCluster.case_name values from the DB for resolved citations, look for the party names in the opinion's text, and see whether they are unambiguously resolvable.
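A minimal sketch of the name-matching step described above, assuming the already-resolved full citations are available as a mapping from cluster id to case name. The helper names (party_names, find_name_references) and the input shape are hypothetical, not CourtListener or eyecite API; the only idea taken from the thread is to keep a party name when it points at exactly one resolved case.

```python
import re
from collections import defaultdict


def party_names(case_name: str) -> list[str]:
    """Split a case name like 'Roe v. Wade' into ['Roe', 'Wade']."""
    parts = re.split(r"\s+v\.?\s+", case_name, maxsplit=1)
    return [p.strip() for p in parts if p.strip()]


def find_name_references(text: str, resolved_cases: dict[int, str]):
    """Return sorted ((start, end), cluster_id) pairs for unambiguous names.

    resolved_cases is a hypothetical {cluster_id: case_name} mapping built
    from the full citations that were already resolved in this opinion.
    """
    owners = defaultdict(set)
    for cluster_id, case_name in resolved_cases.items():
        for name in party_names(case_name):
            owners[name].add(cluster_id)

    matches = []
    for name, ids in owners.items():
        if len(ids) != 1:
            continue  # the same party name appears in two cited cases
        (cluster_id,) = ids
        pattern = rf"(?<!\w){re.escape(name)}(?!\w)"
        for m in re.finditer(pattern, text):
            matches.append((m.span(), cluster_id))
    return sorted(matches)


resolved = {
    1: "Roe v. Wade",
    2: "Citizens' Alliance v. Smith",
    3: "Smith v. Jones",
}
text = "As said in Roe, and again in Citizens' Alliance. Smith disagreed."
found = find_name_references(text, resolved)
labels = [(text[s:e], cid) for (s, e), cid in found]
```

Here "Smith" is dropped because it is a party in two of the cited cases. A real implementation would also need to skip spans that overlap citations eyecite has already found, such as the full "Roe v. Wade" citation itself.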

@grossir
Contributor

grossir commented Jan 29, 2025

About sub-issue 1:
I thought it would work out of the box, but after checking real examples I found an unexpected bug.
In HTML sources, ReferenceCitations are usually wrapped in <i> or <em> tags. However, the citation span begins at the start of the name, after the opening tag. For example, in "See <i>Howsam</i> at 85. Something more...", the ReferenceCitation span covers the text "Howsam</i> at 85". When the annotations are applied, the result looks like:

<i>
<span class="citation" data-id="122248">
<a href="/opinion/122248/howsam-v-dean-witter-reynolds-inc/" aria-description="Citation for case: Howsam v. Dean Witter Reynolds, Inc.">
Howsam
</i> at 85
</a>
</span>

This is recognized as broken HTML, and causes the annotation to be skipped silently (see unbalanced_tags="skip"). Passing unbalanced_tags="wrap" makes it work, but it's probably undesired.

if opinion.source_is_html:  # If opinion was originally HTML...
    new_html = annotate_citations(
        plain_text=opinion.cleaned_text,
        annotations=generate_annotations(citation_resolutions),
        source_text=opinion.source_text,
        unbalanced_tags="skip",  # Don't risk overwriting existing tags
    )
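To make the failure concrete, here is a rough standard-library sketch of a tag-balance check on the text covered by a citation span. It is not eyecite's actual implementation of unbalanced_tags; the class and function names are hypothetical, and it only illustrates why a span such as "Howsam</i> at 85" counts as unbalanced while the same text with its opening tag does not.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag, so they cannot unbalance a span.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link", "wbr"}


class TagBalance(HTMLParser):
    """Track open tags and flag any closing tag that has no matching open."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if not self.stack or self.stack.pop() != tag:
            self.balanced = False


def is_balanced(fragment: str) -> bool:
    """True if every tag opened in the fragment is closed in it, and vice versa."""
    parser = TagBalance()
    parser.feed(fragment)
    parser.close()
    return parser.balanced and not parser.stack


span_without_open_tag = is_balanced("Howsam</i> at 85")
span_with_open_tag = is_balanced("<i>Howsam</i> at 85")
```

The first call returns False because the fragment closes an <i> it never opened; the second returns True.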

How to solve this?

  • I don't think it's desirable to get rid of the <i> or <em> tags, since they improve readability. Perhaps we can manually modify the citation span in CourtListener to include those immediately preceding tags?
  • This may also be happening when annotating other types of citations, and it is a silent failure that won't be tracked in the UnmatchedCitation table. Perhaps we should make it not fail silently.
  • <i>($plaintiff|$defendant)</i> may be a good pattern to find ReferenceCitations without pincites, in eyecite itself.
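As a rough illustration of the last bullet, the pattern could be built from known party names like this. The function name is hypothetical and this is not existing eyecite behavior; it assumes the party names come from citations that were already resolved, and that emphasis uses plain <i> or <em> tags without attributes.

```python
import re


def italic_name_pattern(names: list[str]) -> re.Pattern:
    """Compile a pattern matching <i>NAME</i> or <em>NAME</em> for known names."""
    if not names:
        raise ValueError("at least one party name is required")
    # Longest names first, so 'Dean Witter' is tried before 'Dean'.
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(n) for n in ordered)
    return re.compile(
        rf"<(?P<tag>i|em)>\s*(?P<name>{alternation})\s*</(?P=tag)>"
    )


pattern = italic_name_pattern(["Howsam", "Dean Witter"])
html = "See <i>Howsam</i>. But <em>Dean Witter</em> argued; Howsam replied."
italic_names = [m.group("name") for m in pattern.finditer(html)]
```

The unitalicized "Howsam" at the end is deliberately not matched, which keeps the pattern conservative.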

@mlissner
Member

What happens if you pass unbalanced_tags="wrap"?

@grossir
Contributor

grossir commented Jan 29, 2025

In the specific case I was looking at, it's duplicating the <span><a>... to account for the <i>

(citing 
<i>
<span class="citation" data-id="122248">
<a href="/opinion/122248/howsam-v-dean-witter-reynolds-inc/" aria-description="Citation for case: Howsam v. Dean Witter Reynolds, Inc.">
Howsam
</a>
</span>
</i>

<span class="citation" data-id="122248">
<a href="/opinion/122248/howsam-v-dean-witter-reynolds-inc/" aria-description="Citation for case: Howsam v. Dean Witter Reynolds, Inc.">
 at 84
</a>
</span>,

The text content of the original ReferenceCitation is "Howsam at 84", so it should be a single link.
Apart from that, I don't know what other side effects it may have; there is just the comment "# Don't risk overwriting existing tags".

@mlissner
Member

mlissner commented Jan 29, 2025

Oh, yeah, that's not great, eh?

This same problem came up at my old job a dozen years ago. HTML!

Seems like we need to identify that there's HTML before the token and then make sure it's inside too, so you wind up with:

<span class="citation" data-id="122248">
  <a href="/opinion/122248/howsam-v-dean-witter-reynolds-inc/" 
     aria-description="Citation for case: Howsam v. Dean Witter Reynolds, Inc.">
    <i>Howsam</i> at 85
  </a>
</span>

Is that possible?

@grossir
Contributor

grossir commented Jan 29, 2025

That's what I thought too as a possible solution:

  • check if the immediately preceding characters are an opening HTML tag
  • if they are, include them in the citation.span()

I am trying to do it in CourtListener; if it works, maybe we can move the fix into eyecite, or just leave it here.
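The two steps above could be sketched roughly as follows. This is an assumption-laden illustration, not the fix that was implemented: the function name is hypothetical, the span is taken to be offsets into the HTML source text (eyecite spans normally refer to the cleaned text), and only attribute-less <i> and <em> tags are considered.

```python
import re

# Hypothetical restriction: only bare emphasis tags directly before the span.
OPENING_TAG = re.compile(r"<(i|em)>\s*$")


def extend_span_over_opening_tag(
    source: str, span: tuple[int, int]
) -> tuple[int, int]:
    """Move the span start back over an <i>/<em> tag that the span closes."""
    start, end = span
    m = OPENING_TAG.search(source[:start])
    if m is None:
        return span  # nothing to include
    closing = f"</{m.group(1)}>"
    if closing not in source[start:end]:
        # The tag closes after the citation; extending would unbalance it.
        return span
    return (m.start(), end)


source = "See <i>Howsam</i> at 85. Something more"
start = source.index("Howsam")
end = source.index(" at 85") + len(" at 85")
new_start, new_end = extend_span_over_opening_tag(source, (start, end))
extended_text = source[new_start:new_end]
```

With the extended span, the annotation would wrap "<i>Howsam</i> at 85" as a whole, which is the single-link output sketched earlier in the thread.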

@grossir
Contributor

grossir commented Jan 29, 2025

Seems like the problem was already known but ignored 🙂

# Id. citation with an intervening HTML tag
# (We expect the HTML to be unchanged, since it's too risky to
# modify with another tag in the way)
('<div><p>the improper views of the Legislature.\" 2 <i>id.,</i> '
'at <b>73, bolded</b>.</p>\n<p>Nathaniel Gorham of Massachusetts'
'</p></div>',
'<div><p>the improper views of the Legislature.\" 2 <i>id.,</i> '
'at <b>73, bolded</b>.</p>\n<p>Nathaniel Gorham of Massachusetts'
'</p></div>'),

@mlissner
Member

This problem echoes through workplaces, projects, and code bases. :)
