Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

latex_to_unicode produces problematic filename #3920

Closed
bdcaf opened this issue Apr 4, 2018 · 16 comments · Fixed by #6706
Closed

latex_to_unicode produces problematic filename #3920

bdcaf opened this issue Apr 4, 2018 · 16 comments · Fixed by #6706
Assignees
Labels
bug Confirmed bugs or reports that are very likely to be bugs cleanup-ops

Comments

@bdcaf
Copy link

bdcaf commented Apr 4, 2018

I encountered this when retrieving information for doi:10.1088/1752-7155/7/1/017106.
Which retrieves the authors: Patrik {\v{S}}pan{\v{e}}l and Kseniya Dryahina and David Smith.

A cleanup/rename pdf gives following filename: S}pane{l}EtAl/Spanel2013 - A quantitative study.pdf (note the braces are produced like this) which not only is wrong, but also gives an error: Could not save file. Error in field 'file': Braces don't match.

I have directory pattern set to: [authEtAl:latex_to_unicode] and file format pattern set to [bibtexkey] - [shorttitle:latex_to_unicode]. And bibtex key pattern is: [auth:latex_to_unicode][year]. Also note that the bibtex key doesn't contain any }.

@Siedlerchr Siedlerchr added bug Confirmed bugs or reports that are very likely to be bugs cleanup-ops labels Apr 6, 2018
@Siedlerchr
Copy link
Member

I tried it locally and can confirm the behaviour. In fact it seems like the file directory pattern is not interpreted correctly.

        String fileNamePattern = "[bibtexkey] - [shorttitle:latex_to_unicode]";
        String directoryPattern = "[authEtAl:latex_to_unicode]";
  1. Rename PDF is correct . Sample file Toot.pdf -> Toot - A quantitative study.pdf
  2. Move Files Cleanup breaks it then -> S}pan{e}lEtAl/Toot - A quantitative study.pdf

@Siedlerchr
Copy link
Member

Siedlerchr commented Apr 6, 2018

Okay, the problem is not the latex2unicode itself, but our Authorlist parser.

@Siedlerchr
Copy link
Member

@JabRef/developers Does anyone of you know why the first brace is removed?
That actually is the underlying root problem here:

/**
* Returns the name as "Last, Jr, F." omitting the von-part and removing
* starting braces.
*
* @return "Last, Jr, F." as described above or "" if all these parts
* are empty.
*/
public String getNameForAlphabetization() {
StringBuilder res = new StringBuilder();
getLast().ifPresent(res::append);
getJr().ifPresent(jr -> res.append(", ").append(jr));
getFirstAbbr().ifPresent(firstA -> res.append(", ").append(firstA));
while ((res.length() > 0) && (res.charAt(0) == '{')) {
res.deleteCharAt(0);
}
return res.toString();
}

@tobiasdiez
Copy link
Member

Probably to sort {{JabRef}} under J and not {.

@stefan-kolb
Copy link
Member

stefan-kolb commented Apr 16, 2018

This should not be done in the model, but only inside the model for the UI, so we need to move this part of the code @Siedlerchr
Or at least the filename generation should not depend on it.

@stefan-kolb
Copy link
Member

stefan-kolb commented May 25, 2018

Tests:

    @Test
    public void testAuthEtAlBraces() {
        assertEquals("{\v{S}}pan{\v{e}}l",
                BibtexKeyGenerator.authEtal("Patrik {\\v{S}}pan{\\v{e}}l and Kseniya Dryahina and David Smith", "", "EtAl"));
        assertEquals("\\v{S}pan\\v{e}lEtAl",
                BibtexKeyGenerator.authEtal("Patrik \\v{S}pan\\v{e}l and Kseniya Dryahina and David Smith", "", "EtAl"));
    }

It is actually problemetic what to expect here.

  • Making the braces unbalanced is leading to problems in any code except sorting!
  • Keeping the braces produces problematic output for cases like key generation imho (not sure if key generation should only produce alphanumeric keys?!)

@bdcaf
Copy link
Author

bdcaf commented May 25, 2018

FWIW - as user I would expect the first. Or one using the correct unicode symbols.

@koppor
Copy link
Member

koppor commented Jun 1, 2018

DevCall:

  • AuthorClass: Strangest method to get names is taken.
  • Remove non-ASCII characters to ensure compatibility with pdflatex

@Siedlerchr Siedlerchr removed this from the v4.4 milestone Oct 2, 2018
@Siedlerchr Siedlerchr added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Apr 20, 2019
@Siedlerchr
Copy link
Member

Still present in 5.0 dev

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

I can try to fix this one.

@koppor koppor removed the good first issue An issue intended for project-newcomers. Varies in difficulty. label Jul 9, 2020
@koppor
Copy link
Member

koppor commented Jul 9, 2020

This is a hard one, but just goahead!

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

k3KAW8Pnf7mkmdSMPHz27 commented Jul 10, 2020

@koppor right now directory names are allowed to contain unicode. Unless there have been complaints, shouldn't that remain the case?

I currently believe there are two issues the solution depends on

  1. Is unicode allowed in the directory path?
  2. Is it too much of a performance hit to call the LatexToUnicodeAdapter on all auth... patterns? (the "latex-free" string will not have been cached)

I prefer to use the LatexToUnicodeAdapter because I think it is a better user experience if both Gödel and G{\"o}del generate the same directory name, and changing the directory structure is an infrequent event. I'd guess that it would be very hard to generate the same directory name for both Gödels without using LatexToUnicodeAdapter.

@Siedlerchr
Copy link
Member

Unicode path names are perfectly valid.
you can even use emoji on windows 10.

https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file#file-and-directory-names

Regarding performance. Don't know but I would also go for the latex2unicode adapter. Seems reasonable to me. Don't you have the latex free author already in the author list class?
The author patterns are just equivalent to methods for getting the authors

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

@Siedlerchr , @koppor earlier pointed out that,

  • Remove non-ASCII characters to ensure compatibility with pdflatex

but perhaps that is only relevant for the bibtex key? I am not entirely sure about the use case (except for organizing pdfs). Does people use it to organize plots, etc. that they later import into a .tex file?

Don't you have the latex free author already in the author list class?

No, the latex-free methods cache full "patterns" (e.g., authorsLastOnly), unfortunately not individual authors or this particular pattern.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

k3KAW8Pnf7mkmdSMPHz27 commented Jul 10, 2020

Actually, you could use AuthorList#getAsLastNamesLatexFree and split that string. That would remove the performance bottle-neck at the cost of having a "hacky" solution.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

k3KAW8Pnf7mkmdSMPHz27 commented Jul 10, 2020

Never mind. Unless the user has the exact right preferences AuthorList#getAsLastNamesLatexFree would amount to the exact same solution, with an extra split operation at the end.

Authors fields are currently not latex-free. I'd consider it an option to change that, and cache latex-free Authors instead of AuthorLists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Confirmed bugs or reports that are very likely to be bugs cleanup-ops
Projects
Archived in project
7 participants