Escaping hyphens in the Pages field #943

Closed
tom-a-horrocks opened this issue Apr 5, 2018 · 31 comments

Comments

@tom-a-horrocks

Hi all,

I'm trying to export to bibtex the citation for a journal article with page numbers "16-1", "16-2", "16-3", and "16-4". I'd like the page range to appear in bibtex as '16-1--16-4'. Unfortunately, if the Pages field in Zotero is '16-1-16-4', then all hyphens are converted to en dashes and the corresponding bibtex field is '16--1--16--4'. Is there any way to escape hyphens here, or alternatively force '16-1' and '16-4' to be interpreted as strings?

@tom-a-horrocks
Author

tom-a-horrocks commented Apr 5, 2018

After a bit more reading I've discovered this is more of a bibtex issue. I've tried including an @string definition for a hyphen, but unfortunately that is also converted to an en dash. I have found one solution is to include a command in the .bib's preamble:

\documentclass{article}
\begin{filecontents}{test.bib}
@preamble{{\providecommand*\hyphen{-}}}

@article{test,
  author  = "Other, A. N.",
  journal = "J. Irrep. Res.",
  title   = "Some things I did",
  pages   = "081401\hyphen 1--081401\hyphen4",
  year    = "2011"
}
\end{filecontents}
\begin{document}
\nocite{*}
\bibliography{test}
\bibliographystyle{ieeetr}
\end{document}

https://tex.stackexchange.com/questions/21773/hyphenating-a-number-in-the-bibtex-pages-field

Is it at all possible to do this within zotero/better-bibtex? I'd like to avoid editing the .bib directly if possible.

retorquere added a commit that referenced this issue Apr 5, 2018
@blip-bloop
Collaborator

🤖 this is your friendly neighborhood build bot announcing test build 5.0.116.6221.issue-943 ("adjust test cases for #943").

@retorquere
Owner

retorquere commented Apr 5, 2018

OK so the hyphen issue is partly my fault, as BBT was a little zealous in changing anything dash-y into en-dashes. 6221 changes that. That should make what you want to do easier. Not trivial though.

There are two ways to get \hyphens in that field:

  • Insert the \hyphens using BBT's "raw inserts", which would look like 081401<pre>\hyphen</pre> 1--081401<pre>\hyphen</pre>4. Mind that the <pre> bits will show up as-is if you use Zotero for non-BibTeX (i.e. Word) citations. This is strictly a BBT feature that Zotero doesn't know about, so Zotero will treat the <pre> tags as literal text in a Word bibliography.
  • Change the generated bibtex (which should look like 081401-1--081401-4 in 6221) using a postscript.

For the preamble you'll have to use a postscript in any case as it stands. I am considering adding a preamble field, but I think I'd have to add two (what works for BibTeX will not necessarily work for BibLaTeX). The postscript would look like

if (Translator.BetterBibLaTeX) {
  if (!Translator.preambleWritten) {
    Zotero.write('@preamble{{\\providecommand*\\hyphen{-}}}\n');
    Translator.preambleWritten = true;
  }

  if (this.has.pages) this.has.pages.bibtex = this.has.pages.bibtex.replace(/([0-9])-([0-9])/g, '$1\\hyphen$2');
}

which means:

  • if no preamble was written yet, do so now
  • if the pages field has <number>-<number>, replace that hyphen with \hyphen.
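
To see what the pages replacement does on its own, here is a minimal standalone sketch. The regex and replacement string are the ones from the postscript above; the function name escapePageHyphens is hypothetical and not part of BBT, and the snippet runs in plain Node rather than inside a BBT postscript.

```javascript
// Hypothetical helper (not a BBT API): wraps the regex used in the postscript above.
function escapePageHyphens(pages) {
  // A single hyphen directly between two digits becomes \hyphen;
  // the double hyphen of the range is untouched because its first "-" is followed by "-".
  return pages.replace(/([0-9])-([0-9])/g, '$1\\hyphen$2');
}

console.log(escapePageHyphens('081401-1--081401-4'));
// prints 081401\hyphen1--081401\hyphen4
```

Note that the range itself still has to be entered as -- in the Pages field; the regex only rewrites single hyphens that sit directly between two digits.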

@retorquere
Owner

really need that feedback.

@bothide

bothide commented Apr 9, 2018

As far as I recall, a page range in a bib file should always be given as "1-3", i.e., with a single hyphen. Depending on the .bst file, the single hyphen for the page range in the .bib file will be expanded to an em-dash or, in some rare cases, to an en-dash.

@retorquere
Owner

I think that's mostly what it does now, right? Have you tested the new behavior?

@bothide

bothide commented Apr 10, 2018

I have not tested it, but I believe you. My comment was meant as just that. Another comment is that the page range "16-1 -- 16-4" is in many journals written as "16(4)".

@tom-a-horrocks
Author

tom-a-horrocks commented Apr 10, 2018

Thanks for your work on this. Note that in the meantime I've simply used 16:1-4, which should be fine for me.

The page numbers '16-1',...'16-4' are what are printed on the conference abstract itself. What's happening is that '16' is an electronic article identifier (separate to DOI). What complicated matters is that there's no field for this identifier except perhaps for issue, which isn't available for conference abstracts (@inproceedings) -- and sometimes journal articles have an issue number AND an electronic identifier anyway. I guess writing 16(1-4) in the page field may be a realistic compromise?

Note that these identifiers can change significantly. For example, I have another which is We MIN 06, and I'm yet to settle on a principled way to get these into the bibliography.

@retorquere
Owner

@njbart, is it correct that I should use a single hyphen for page ranges? This is mostly related to import, because on output I'm going to pass on what's in the pages field as-is, only translating a Unicode en-dash to -- and Unicode em-dashes to ---.

@bothide

bothide commented Apr 10, 2018 via email

@retorquere
Owner

I'm not always using --, I'm just translating U+2013 to -- and U+2014 to ---. Hyphens (regardless of how many you have) will be left untouched.
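
As an illustration of the translation described here, a minimal sketch. It assumes nothing beyond the two mappings stated above; the function name dashesToTeX is made up for the example and is not BBT's actual code.

```javascript
// Hypothetical sketch (not BBT source): only Unicode en/em dashes are translated.
function dashesToTeX(pages) {
  return pages
    .replace(/\u2013/g, '--')   // U+2013 EN DASH -> TeX en-dash ligature
    .replace(/\u2014/g, '---'); // U+2014 EM DASH -> TeX em-dash ligature
}

console.log(dashesToTeX('16-1\u201316-4')); // prints 16-1--16-4
console.log(dashesToTeX('12-17'));          // prints 12-17 (ASCII hyphens untouched)
```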

@bothide

bothide commented Apr 10, 2018

I was referring to the average user who uses "12--17" instead of the more preferable "12-17" in his/her .bib file.

@retorquere
Owner

If I can be sure that a user never wants a double-dash in the pages field (@njbart?) then perhaps I could replace them, but it seems iffy.

In some cases, I need some work to be left for cleanup by the user; can't algorithmically catch them all ¯\_(ツ)_/¯. A postscript is always an option.

@bothide

bothide commented Apr 13, 2018 via email

@retorquere
Owner

Except if @njbart's interpretation of the biblatex wiki is correct, any number of non-braced consecutive dash symbols of various kinds would constitute a \bibrangedash. biblatex has its own parsing and interpretation rules, and will output TeX code as a result, but the input isn't necessarily interpreted the way (La)TeX itself would interpret it.

@njbart offered a heuristic to determine what dash-like things to brace and which not, but "longest" to me is ambiguous on whether it means pre-processing length (in which case double-hyphen would be longer than em-dash) or post-processing length (in which case em-dash is longer than double-hyphen).

Not at all sure I'm going to do this yet, as I'd have to do further parsing of the pages field for multiple ranges, and parsing of Zotero input is brittle. But I'm considering doing it.

@njbart
Contributor

njbart commented Apr 13, 2018

What I had in mind was post-processing length, i.e., en-dash=double-hyphen longer than single-hyphen (and em-dash=triple-hyphen longer than double-hyphen – though I’m not sure the latter situation ever occurs in the wild).

@retorquere
Owner

Neither have I, but nothing surprises me at this point. The state of references ready-to-import for Zotero is not stellar, and all kinds of stuff ends up in the database.

@moewew

moewew commented Apr 13, 2018

Hope you don't mind me butting in here. I can only say things with confidence for the biblatex side. BibTeX as you know is an inhomogeneous realm of .bst files that do not always follow the same line.

@bothide is right when they say that the dash can be considered a kind of meta character in the pages field. For the standard BibTeX styles, as far as I can see, what happens is simply that single -s are doubled up to become -- (this is done using the function substring$, which treats braces and macro constructs simply as ASCII chars, so no amount of brace protection can help here).
However, the BibTeX documentation states (http://mirrors.ctan.org/biblio/bibtex/base/btxdoc.pdf, p. 11):

pages One or more page numbers or range of numbers, such as 42--111 or 7,41,73--97 or 43+ (the ‘+’ in this last example indicates pages following that don’t form a simple range). To make it easier to maintain Scribe-compatible databases, the standard styles convert a single dash (as in 7-33) to the double dash used in TeX to denote number ranges (as in 7--33).

So it seems that back when BibTeX was devised the preferred way was actually a double dash and the single dash was only used for backwards compatibility reasons. I don't know if there are any more authoritative sources nowadays that recommend - over --, but popular use may simply have made - the more prevalent and the de-facto standard: It's simpler to type, after all.
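
The dash doubling described in the documentation quote can be paraphrased in a few lines of JavaScript. This is a sketch under the assumption that a style doubles every isolated hyphen and leaves runs of two or more alone; the function name is hypothetical, the real logic lives in each .bst file, and individual styles may differ.

```javascript
// Hypothetical paraphrase of a standard style's dash doubling; not taken from any .bst file.
function dashifyLikeStandardStyles(pages) {
  // An isolated "-" becomes "--"; runs of two or more hyphens are kept as they are.
  return pages.replace(/-+/g, run => (run.length === 1 ? '--' : run));
}

console.log(dashifyLikeStandardStyles('7-33'));       // prints 7--33
console.log(dashifyLikeStandardStyles('16-1--16-4')); // prints 16--1--16--4
console.log(dashifyLikeStandardStyles('081401\\hyphen 1--081401\\hyphen4'));
// prints 081401\hyphen 1--081401\hyphen4 (no isolated hyphen left to double)
```

Under this assumption the second call shows why the hyphens inside 16-1 and 16-4 come out as en-dashes in such styles, and the third shows why hiding them behind a \hyphen command avoids that.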

For biblatex the 'meta' capacity of - is made clearer by the fact that pages is not a literal field that is largely left as is, but a range field that is parsed by Biber.

I do, however, not agree with the sentiment that numeric fields should always be written without braces. It is a feature of the .bib file syntax that "numerical values" do not need braces (or quotes):

For numerical values, curly braces and double quotes can be omitted.

(Nicolas Markey: Tame the BeaST, p. 20, http://mirrors.ctan.org/info/bibtex/tamethebeast/ttb_en.pdf)

But this is clearly worded as optional here and I haven't seen anyone else endorsing leaving out the braces. In fact pages = 1-45, will fail, so pages = 1, is risky if you want to add something later on. The risk is lower for export tools such as yours here, but I still think it is better to go with the braces. Still the only advice I have seen with regards to number fields and braces is to always write the braces even if they are not required.

biblatex actually has two levels at which it can deal with page ranges: Biber parses page ranges in the pages field, but pages as given in the optional postnote argument to \cite and friends are not passed on to Biber and are parsed by biblatex with (La)TeX code.

  1. Biber parses the pages field as a range field and tries to make sense of it from that perspective using Perl RegEx.

    Roughly, Biber splits the field at , and ; and then treats each bit separately. At first a RegExp that matches "(non-dash chars)(dash chars)(non-dash chars)" tries to read off the start and end of a page range. If that does not match, a fallback pattern "(any char)(at least two dash chars)(any char)" tries to find the start and end of the range. The range is then written to the .bbl as <start>\bibrangedash <end>.

    Note that brace protection does not do anything for Biber. Furthermore, any number and all kinds of dashes are treated equally as long as RegExp recognises the character as dash-like, the only exception being the fallback pattern that specifically needs at least two dash-like characters to match (so pages = {16-1--16-4}, with double ASCII dash works, but pages = {16-1–16-4}, with U+2013 does not; adding braces in the obvious position changes nothing for Biber).

    If all else fails, the field is read as literal and just dumped to the .bbl file without digestion. A warning is issued in that case.

    https://github.com/plk/biber/blob/d88ad8e580cffb1f4dc4a676e9a794a0b9e9b06b/lib/Biber/Input/file/bibtex.pm#L994-L1033
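
    The two-stage matching described above can be sketched as follows. This is a rough JavaScript paraphrase of the prose description only: the patterns are assumptions, not copies of Biber's Perl regular expressions, and the names PRIMARY, FALLBACK and parsePages are invented for the example. The description does not say how a chunk without any dash is handled, so in this sketch such chunks simply end up in the literal branch.

```javascript
// Sketch only: patterns paraphrase the description above and are NOT copied from Biber.
// \p{Pd} is the Unicode "dash punctuation" class (needs the u flag, Node 10+).
const PRIMARY = /^(\P{Pd}+)\p{Pd}+(\P{Pd}*)$/u; // (non-dash)(dashes)(non-dash)
const FALLBACK = /^(.+?)\p{Pd}{2,}(.*)$/u;      // (anything)(two or more dashes)(anything)

function parsePages(field) {
  // Split at "," and ";" and treat every chunk separately.
  return field.split(/[,;]/).map(raw => {
    const chunk = raw.trim();
    const m = chunk.match(PRIMARY) || chunk.match(FALLBACK);
    return m
      ? { start: m[1].trim(), end: m[2].trim() }
      : { literal: chunk }; // nothing matched: keep the text as it is
  });
}

console.log(parsePages('16-1--16-4'));      // [ { start: '16-1', end: '16-4' } ]
console.log(parsePages('16-1\u201316-4')); // [ { literal: '16-1–16-4' } ]
```

    The second call reproduces the U+2013 case mentioned below: with only single dash characters between the parts, neither pattern can tell the range separator from the hyphens inside the page identifiers.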

  2. biblatex also parses pages and other fields potentially containing page ranges on a LaTeX level.
    The passage of the biblatex Wiki @njbart quotes is referring specifically not to the pages field, but rather to postnote and friends that do not get pre-chewed, normalised input from Biber. Ideally the pages field would still be formatted in a way that it can also be parsed by the LaTeX range parser since custom styles may well apply the range parser also for pages. This will prove difficult due to an unforeseen interference in biblatex's macros, so need not be your primary aim at the moment.

    The LaTeX range parser builds on low-level LaTeX and can only deal with Unicode characters if a Unicode engine is used (XeTeX, LuaTeX). With pdfTeX only ASCII chars are gracefully handled. So it is a good idea to only export ASCII chars to the pages field if possible (I believe you are already doing that).

    The range parsing then works similarly to Biber's routine. It splits at ;, , and \bibrangessep. Each chunk is then split at the first occurrence of \bibrangedash, -- or - (a -- is never matched as just a single -). The command then prints the start and end of the range with \bibrangedash in between.

    Certain characters can be hidden in this step by wrapping them in curly braces. Unfortunately this only works theoretically at the moment, because the \ifpages test can't deal with these hidden characters and the braces surrounding them. This means that presently a hyphen needs to be hidden with a command \newcommand*{\pagehyphen}{-} that can be made invisible itself with \NumCheckSetup{\let\pagehyphen\@empty}: then \cite[16\pagehyphen 1-16\pagehyphen 14]{sigfridsson} gives the expected output. I'll have a look if \cite[6{-}1-6{-}14]{sigfridsson} can be salvaged, but that looks really tough.

What does that mean for you?

  • You don't need to add braces. They don't have a benefit for the cases you have considered so far.

  • Converting U+2013/U+2014 makes sense and should be unproblematic.

  • There is no value in 'over-normalising' -- back to - for the average biblatex user. The same holds for the BibTeX standard styles.

    Converting -- to - might be a good or bad idea (depending on how you look at it) for BibTeX styles that do not convert - to -- internally (I seem to remember the French prefer a - in page ranges and not --): It is good if people are somehow used to typing -- in Zotero and actually want their BibTeX style to determine the dash regardless of their input. It is bad if people explicitly want an en-dash in styles that (intentionally or not) do not convert - to --. My money is on not converting -- back to -, this makes the next step easier.

  • I would do nothing about 6-1-6-14 if I were you; even for a human it is almost incomprehensible what that ought to mean. With biblatex and Biber 6-1--6-14 will give the expected output and is easier on the eye for humans as well: A user can be expected to input this instead of 6-1-6-14 - no BBT intervention needed. For BibTeX one would have to resort to \hyphen from above. I would find it a bit too intrusive, though, if BBT were to do the @preamble stuff by default.

@bothide

bothide commented Apr 16, 2018

Let me just repeat my comment that a convenient (and, seemingly, nearly a de facto standard) way of writing a page range of the type 6-1 through 6-14 is 6(14). This is used by, e.g., the American Physical Society publications such as the Physical Review journals.

@retorquere
Owner

IOW the current behavior in the regular release is OK as-is?

@moewew

moewew commented Apr 17, 2018

I don't use BBT (or Zotero for that matter), so verification would have to come from someone who does. But from what I can read here things should be fine if BBT does not change - to -- any more (I think you mentioned that build 6221 does not do this any more, is that part of the regular release now?).

I had a look at normalizeDashes and

.replace(/([0-9])\s-\s([0-9])/g, '$1--$2') // treat space-hyphen-space like an en-dash when it's between numbers

still seems to convert some - to --.

normalizeDashes also seems to replace U+2012 (figure dash) with an em-dash

.replace(/[\u2012\u2014\u2015]/g, '---') // em-dash

I'd probably go for an en-dash or even a hyphen instead.
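
To make the two observations concrete, the sketch below applies the two quoted replace calls to sample input and sets a possible alternative next to them. Only the two regexes in quotedRules come from the thread; the wrapper functions, their names, and the alternative mapping in suggestedRules are assumptions for illustration and not BBT's actual code before or after the fix.

```javascript
// The two replace() calls are the ones quoted above; the wrapper and its name are hypothetical.
function quotedRules(pages) {
  return pages
    .replace(/([0-9])\s-\s([0-9])/g, '$1--$2') // space-hyphen-space between digits -> "--"
    .replace(/[\u2012\u2014\u2015]/g, '---');  // figure dash, em dash, horizontal bar -> "---"
}

// Hypothetical alternative following the suggestion: spaced hyphens are left alone
// and the figure dash (U+2012) is treated as an en-dash.
function suggestedRules(pages) {
  return pages
    .replace(/\u2012/g, '--')
    .replace(/[\u2014\u2015]/g, '---');
}

console.log(quotedRules('12 - 17'));         // prints 12--17
console.log(quotedRules('12\u201217'));      // prints 12---17
console.log(suggestedRules('12 - 17'));      // prints 12 - 17
console.log(suggestedRules('12\u201217'));   // prints 12--17
```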

@retorquere
Owner

I'll get those changed later today.

@blip-bloop
Collaborator

🤖 this is your friendly neighborhood build bot announcing test build 5.0.129.6395.issue-943 ("adjust tests for #943").

@tolot27

tolot27 commented May 8, 2018

I still get the following warning in my BBT exported BibLaTeX file: @% ? hyphen found in pages field, did you mean to use an en-dash?

I thought - will now be kept as is and it is not required to put -- between pages. What did I miss?

@label-gun label-gun bot reopened this May 8, 2018
@retorquere
Owner

Fixed, will be in the next release.

@blip-bloop
Collaborator

🤖 this is your friendly neighborhood build bot announcing test build 5.0.137.6668.master ("re-fixes #943").

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 17, 2021