Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New quality check and cleanup for & #585

Closed
2 tasks done
koppor opened this issue Jul 4, 2022 · 25 comments · Fixed by JabRef/jabref#9758 or JabRef/jabref#10419
Closed
2 tasks done

New quality check and cleanup for & #585

koppor opened this issue Jul 4, 2022 · 25 comments · Fixed by JabRef/jabref#9758 or JabRef/jabref#10419
Assignees

Comments

@koppor
Copy link
Member

koppor commented Jul 4, 2022

Addition to JabRef#8948

@KonstantinEger
Copy link

Hi, seems like this is still free to take. We would like to take this as a first issue and post a PR soon :)

@koppor koppor moved this from Free to take to Reserved in Candidates for University Projects Mar 29, 2023
@koppor
Copy link
Member Author

koppor commented Mar 29, 2023

Thank you for choosing JabRef! Looking forward to your PR! - Would it be possible to tell us which course recommended JabRef? 🤩 - Here or via email (which you find on my profile)

As a general advice for newcomers: check out Contributing for a start. Also, guidelines for setting up a local workspace is worth having a look at.

Feel free to ask here at GitHub, if you have any issue related questions. If you have questions about how to setup your workspace use JabRef's Gitter chat. Try to open a (draft) pull-request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.

@Siedlerchr
Copy link
Member

Quality check is now implemented: JabRef#9758

@Zylence
Copy link

Zylence commented Apr 22, 2023

Regarding the Cleanup action, would you like the UI component as checkbox or entry inside 'field formatters'? (We currently use a checkbox for testing purposes)

Image

@koppor
Copy link
Member Author

koppor commented Apr 24, 2023

This is a part of JabRef#8712. When working on this, maybe completely fix JabRef#8712?

@Zylence
Copy link

Zylence commented Apr 26, 2023

Hello, looking into this, I just realized that the cleanup action for unescaped ampersands is already implemented in jabref. So I am definitely up for a new task. :)
I have to add though that I will not be able to work as much on it the next weeks - so if this is a time critical task, someone else should take it. 

As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template.

We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign. 

Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities).

And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part)

@Siedlerchr
Copy link
Member

Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings

@koppor
Copy link
Member Author

koppor commented Apr 26, 2023

As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template.

While you are talking about, it is more complicated:

%: Yes, there needs to be an escape check!

$ can be an indicator of math formulas. Can you try the following: If $ appear in pairs, and there is a space before the first $ and a space after the second $ (or beginning/end of the string), it is ok?

Example: Title with a math formula: $x=x+1$.

Now, the thing with _ gets even more difficult: It is perfectly valid in math formulas as follows: $a_2+b_2=c_2$.

Maybe, this detailed parsing is too much.

Idea: A LaTeX parser should be used to check if the title parses correctly. SnuggleTeX (Example: https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/java/uk/ac/ed/ph/snuggletex/samples/MinimalExample.java) seems to be good. A fork of it can be retrieved from maven central: https://github.com/rototor/snuggletex. Note: Playing around with that will take some time... In the long run, this could replace our latex2unicode converter. -- In the short run, it should only be checked if the content can be parsed using LaTeX.

We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign.

This is a hard one. As Christoph stated: # is typically used to concatenate strings: https://docs.jabref.org/advanced/strings. The current entry editor (and all the logic code) has no good support for that. The "magic" is happening at reading/writing a field. Not sure whether you really want to dive into that. -- Maybe it's as simple as checking the entry editor content similar to the $ above: # must appear in pairs.

Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities).

Cleanup for % is good.

No straight-forward cleanup for # possible. One could search for defined strings. If a matched string is found, then corrected. Example: User inputs #kopp and #kubovy, which is corrected to #kopp# and #kubovy#, because kopp and kubovy are defined strings. Maybe, its also possible to just correct without checking the strings. Needs some more thougt/testing.

And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part)

I would leave that part as is. No checking whatever.

I could neverthless try to understand it: Take an arbitrary command of the table. If it renders in JabRef correctly: No worries. If it does not render correctly: Add it to a check. -- However, this heavily depends on the latex2unicode converter. Instead of checking for things it cannot handle, the converter should be made better (or replaced by SnuggleTeX as outlined above). -- In other words: The user should not adapt their bibtex library because JabRef cannot handle it.

@Zylence
Copy link

Zylence commented Apr 27, 2023

Thank you for the detailed response. I will look into it and try out snuggletex this weekend. 

$ can be an indicator of math formulas. Can you try the following: If $ appear in pairs, and there is a space before the first $ and a space after the second $ (or beginning/end of the string), it is ok?

In case we decide against snuggletex and need to check it manually, we can surely do that. But why check for spaces at the beginning and end? As far as I know they will not effect the functionality and the user may just leave them out. In any case, we can use the BibStringChecker (#) for orientation, it already implements the desired behaviour (just without the checks for spaces).

@Zylence
Copy link

Zylence commented Apr 27, 2023

Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings

I am sorry, initially I missunderstood the task: I thought the task was simply to warn the users whenever they use latex special characters, but now I realized that this would be very restricting and that the real task is to check whether it's valid latex to guarantee a pain free workflow for the user when integrating the bib files into his own documents. 

@Siedlerchr
Copy link
Member

It's a mix of both. But I would start with the low hanging fruits:

  1. a) Check and b) cleanup for % and & ( I think % is already there -> LatexCleanup)
  2. Check for balanced $ (e.g. always pairs)
  3. Check if underscore is only inside a group of $$ (can be probaly done easily with regex)
  4. If an underscore is outside $$ offer a cleanup either escaping it with \ or replacing it with \textunderscore

For the moment I would not consider # and the more complex cases. That should be fine.

@Zylence
Copy link

Zylence commented May 1, 2023

It's a mix of both. But I would start with the low hanging fruits:

  1. a) Check and b) cleanup for % and & ( I think % is already there -> LatexCleanup)
  2. Check for balanced $ (e.g. always pairs)
  3. Check if underscore is only inside a group of $$ (can be probaly done easily with regex)
  4. If an underscore is outside $$ offer a cleanup either escaping it with \ or replacing it with \textunderscore

For the moment I would not consider # and the more complex cases. That should be fine.

Okay, but I would like to use a proper parser like snuggletex "from the get go", since the code I write without using it would need to be replaced eventually anyways. Further I just now had some time to try it out and really like it! :-)

I have already tried some integration testing with snuggletex in jabref - it does not seem to immediately break any Tests:

Before:
before

After:
after

Using snuggletex it could be as simple as this (basicly takes care of all integrity checks):

SnuggleEngine engine = new SnuggleEngine();
SnuggleSession session = engine.createSession();
SnuggleInput input = new SnuggleInput(field.getValue()); //field is a BibEntryField
session.parseInput(input);
if (!session.getErrors().isEmpty()) {
    ErrorCode code = session.getErrors().get(0).getErrorCode();
     // filter only for tokenization errors
    if (code.getErrorGroup().equals(CoreErrorGroup.TTE)) { 
        //todo use custom translation
        String message = CorePackageDefinitions.getPackage().getErrorMessageBundle().getString(code.getName()); 
        //todo "handle" parsing error error
        session.getErrors().clear();
    }
}

This is jsut a mock and has to be improved but just for fun I edited one of the IntegrityChecks to use Snuggletex, seems promising:

works_as_expected

We'd need to copy and slightly adjust the libraries properties file into our own for custom translation, but that's no big deal. And Since CoreErrorCode is exposed, we could also easily filter for specific errors if we did not want to catch them all.

If you want to try how snuggletex integrates in jabref yourself, you can add these lines to the build.gradle:

implementation "de.rototor.snuggletex:snuggletex:1.3.0"
implementation ("de.rototor.snuggletex:snuggletex-jeuclid:1.3.0") {
    exclude group: "org.apache.xmlgraphics"
}

and this line to module-info.java:

requires snuggletex.core;

If you @koppor , @Siedlerchr and the other maintainers allow it, I'd like to integrate it. That way, the integrity checks are easily implemented, and we do not rely on handwritten comparisons or checking edge cases, but instead use a proper parser. The only thing left to do would be cleanup actions.

PS: I have not checked if all functions of it work, but tokenization and parsing seem fine.

@Siedlerchr
Copy link
Member

Siedlerchr commented May 1, 2023

Thanks for the investigation. We can try, but I think we run into problems with jlink + jpackage and the method too large error. You can try this locally, run a gradlew jpackage and see if it works. https://bugs.openjdk.org/browse/JDK-8240567

If this doesn't work we need your "manual" fix.

  1. Create a branch with the snuggletex implementation (if this fails due to the jdk issue we can still keep the code and put it here at koppor's repo)
  2. Create a branch without the external dependency reliance

@koppor
Copy link
Member Author

koppor commented May 2, 2023

The error messages are really great! Looking forward to see it integrated.

Regarding the jlink issue, we need to spend energy again to get openjdk/jdk#10704 done.

@Zylence
Copy link

Zylence commented May 13, 2023

If this doesn't work we need your "manual" fix.

It did indeed fail, unfortunately. :( 

I see you are working on this fix since quite a while, just out of curiosity when do you hope to have this resolved?

Regarding the snuggletex implementation, unfortunately it can not be used to handle all our use cases (for example it can not check for # out of the box) - but we have access to the parsed tokens which means we can implement our own functionality on top of that. (This also raises the question, whether we should consider rewriting the cleanup actions based on that.)

 Create a branch with the snuggletex implementation (if this fails due to the jdk issue we can still keep the code and put it here at koppor's repo)

I am currently working on getting it cleaned up (as of now it's barely a hack).

Create a branch without the external dependency reliance

Just started doing checks for _, %, $ and cleanup for $, I am aiming to get this ready over the upcoming holidays (~2 weeks)

@koppor
Copy link
Member Author

koppor commented May 14, 2023

No worries regarding method too large. More motivation to get it fixed.

Regarding #, maybe it's a SnuggleTex limitation? Is it able to parse field = a # b # {text}?

@Zylence
Copy link

Zylence commented May 15, 2023

Is it able to parse field = a # b # {text}

No matter the constellation, if there is at least one # in the text, Snuggletex does treat it as error TTEG04, with this message: "Argument placeholder tokens (e.g. # 1) may only appear in command and environment definitions".

It can parse it, but it treats # differently than we do. We can easily work around this since we have access to the tokens. For starters, since the # integrity check is already implemented, we may as well just exclude TTEG04.

@koppor
Copy link
Member Author

koppor commented May 15, 2023

Excluding TTEG04 is OK for me.

Can you file an issue at https://github.com/davemckain/snuggletex/issues?

@Zylence
Copy link

Zylence commented May 17, 2023

Can you file an issue at https://github.com/davemckain/snuggletex/issues?

I've looked into the library, there is no bibtex reference anywhere. It does not seem like it was ever ment to support including / validating bibtex files. It's build just for plain latex and hence only allows # in "inside command/environment definitions" LaTeXTokeniser.java.

Their website states: "SnuggleTeX’s parser pretends that TeX never happened and may behave slightly differently to what experienced LaTeX users might expect." And since this is a bibtex and no latex feature, I do not think this change should be made to the source, nor that the maintainer would do it.

Should I still file the issue?

@Zylence
Copy link

Zylence commented May 17, 2023

Just created a PR with what is ready so far.

It is by no means perfect and does break some unit tests (because of localization). But its functional. The error messages are ported from the original snuggletex repo, we may need to rewrite them and can probably exclude some.

@koppor
Copy link
Member Author

koppor commented May 17, 2023

Ah, thank you, I was too quick in writing my thought. THank you for digging deeper!

You must not pass the field value directly to SnuggleTeX.

The handling of # is solved not very well in JabRef.

See test cases for examples

Seeing the code of org.jabref.logic.bibtex.FieldWriter#formatAndResolveStrings, you should just ignore checking if the field value contains #.

Future work can provide a more proper data model for a BibTeX field

@koppor koppor mentioned this issue Jun 12, 2023
6 tasks
@ThiloteE
Copy link
Member

"method too large" problem is fixed in the current development version.

@Zylence
Copy link

Zylence commented Jul 25, 2023

"method too large" problem is fixed in the current development version.

Thats good news! The requested test cases in the koppor repo for the snuggletex integration where also added. So I guess we are good to go?

@koppor
Copy link
Member Author

koppor commented Sep 26, 2023

I tried the escape ampersands:

image

Works!

@Siedlerchr
Copy link
Member

But we need to disable it for file field

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
5 participants