Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Demo disambiguation favors disambiguate-add-names #60

Open
fbennett opened this issue Feb 21, 2020 · 16 comments
Open

Demo disambiguation favors disambiguate-add-names #60

fbennett opened this issue Feb 21, 2020 · 16 comments

Comments

@fbennett
Copy link

fbennett commented Feb 21, 2020

In the current disambiguation demo, citeproc-rs is adding names where disambiguation can be done by adding names or initials to existing partners. Steps to reproduce:

  1. Align the names, changing "Amadeus" to "Ariadne" to yield:

    K. Camembert et al. / K. Camembert et al.

  2. Add a char to the first given name in the first item to yield:

    Kurtz Camembert et al. / Kurt Camembert et al.

  3. Add a char to the last family name in the first item.

Expected
Further disambiguation is unnecessary, and with the style set to etal-use-first="1" and etal-min="1" or etal-min="2", a single name should be preferred. The result should remain:

Kurtz Camembert et al. / Kurt Camembert et al.

Åctual Result
The first given name is reverted to an initial, and two names are added:

K. Camembert, A. Rossi, I. Irrelevantz / K. Camembert, A. Rossi, I. Irrelevant

@cormacrelf
Copy link
Collaborator

cormacrelf commented Feb 22, 2020

It favours disambiguate-add-names because the spec demands adding names first, and stopping if that works. In this case, it does. “Adding names” here does not start with the previous attempt at disambiguation (ie Kurt/Kurtz), it starts from zero every time, which is the only deterministic approach. So it looks like, for the first reference:

K. Camembert et al (ambiguous, of degree 2 i.e. could have been produced by either citekey/citekey2)

Perform step 1 of disambiguation, add names until no further names yield a reduction in degree. It expands out to

K. Camembert, A. Rossi, I. Irrelevantz (unambiguous, degree 1)

And stops as successful. The same happens to the second cite, but it resolves to the second reference/citekey2 without the z. If adding names does not reduce the degree, then the name count bumper is reset to the last value that reduced it, or 0 if it was never reduced. Hence disambiguate-add-givenname can start with any number of additional names, as long as adding them reduced the degree.

(You’ll note we use different language to describe the process, and citeproc-rs doesn’t have “sets of ambiguous cites”, only sets of references that could have produced a cite. If you want me to expound a little, then I can, but I think it’s clear what’s going on in this example.)

@cormacrelf
Copy link
Collaborator

Sticking with Kurt/Kurtz is the expected behaviour if you have givenname-disambiguation-rule="all-names", and this works. This is because those given names have been globally expanded before add-names comes into play. But the demo and the default are by-cite, which means the givenname disambiguation only occurs in step 2 of the process described in the spec, and not globally or at any point before add-names.

@fbennett
Copy link
Author

You're right that it follows the specification exactly, but it's not what a copy editor would expect (and not what citeproc-js currently produces), and I'm pretty sure that users will complain.

@cormacrelf
Copy link
Collaborator

Are you saying citeproc-js reverses the first two disambiguation passes, because it’s simply better?

@fbennett
Copy link
Author

Yes, with by-cite it attempts given name addition/expansion before adding a name, to keep the cite as compact as possible.

@cormacrelf
Copy link
Collaborator

Only with by-cite? I don’t immediately see why the others wouldn’t also benefit, e.g. primary-name getting the second name Kurtzed. Right?

@fbennett
Copy link
Author

By second name, do you mean the A.Rossi names in the example? My understanding has been that styles that apply primary-name disambiguation do that and only that. The ones that I have seen mostly have year-suffix as a fallback, in case the limited disambiguation by name fails to resolve everything.

@fbennett
Copy link
Author

(Following up on the speculation above, I checked the CSL repo. Out of 217 styles that use primary-name, just one (swedish-legal.csl) uses it without year-suffix as a disambiguation fallback. There were previously two such styles, but the other international-journal-of-spatial-data-infrastructures-research) was amended on August 22 last year specifically to add the attribute.)

@cormacrelf
Copy link
Collaborator

cormacrelf commented Feb 22, 2020

Ah yes, my mistake, I think you’re right only by-cite needs it. In all other GNDRs, names that would have initials and/or given names added during cite disambiguation have already been added globally. I think this also means that the description of all-names as being for both name and cite disambiguation is redundant. If all names are as unique as possible already, you can’t disambiguate a cite by expanding them.

The upshot is you can always choose exactly one never have to do more than one of add-names and add-givennames, and you choose which based only on the GNDR: by-cite needs add-givennames, which includes an add-names pass; for the rest, add-givennames is guaranteed to do nothing, so you can skip it. This will make the program faster!

@fbennett
Copy link
Author

I think that's right, yes. Impressive and handy demo, by the way.

@bwiernik
Copy link

Thanks for this discussion folks! Did either of you happen to write tests for the disambiguation behavior?

@cormacrelf
Copy link
Collaborator

No but I'll take it

@bwiernik
Copy link

Cool. Did you see the clarified spec linked above?

@cormacrelf
Copy link
Collaborator

cormacrelf commented Nov 27, 2020

https://gist.github.com/cormacrelf/84bc9592cd10602d05a52bed938adece

While I was at it I wrote a better tool to convert between the two test case formats in case anyone feels like going full YAML anytime. I'll publish that tomorrow, it does have a caveat where it writes out fully parsed names and also won't handle ALL the weird sections/modes, but at least it does multiline strings for the csl: | field. I think you guys have some new sections (VERSION? I thought CSL had a version field for exactly that! 😄 ) that you can figure out.

@cormacrelf
Copy link
Collaborator

One note is that (it may be my setup that's wrong, but) citeproc-js fails that test, by not writing out the et al.. Source is in https://github.com/cormacrelf/citeproc-rs/tree/master/citeproc-js-runner, I already updated my citeproc/citeproc-test-runner packages.

@fbennett
Copy link
Author

There is nothing wrong with your setup, and the test fails here as well. At first I thought, "WTF?" but on a closer look I realized it's caused by et-al-min="1"/et-al-use-first="1". With et-al-min="2" the logical outcome is the same, but "et al." is included. It's by intention, and me being off-spec bad boy again. Suppressing "et al." in the former case was a hack to support bibliographies of individual members of staff. It should not be default behavior, and the test should pass. Thanks for flagging it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

3 participants