Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IEEE Xplore title em dash cleanup #11078

Closed
wants to merge 12 commits into from
Closed

IEEE Xplore title em dash cleanup #11078

wants to merge 12 commits into from

Conversation

InAnYan
Copy link
Member

@InAnYan InAnYan commented Mar 22, 2024

Closes JabRef#286.

I added a test to DoiFetcherTest for doi https://doi.org/10.1109/PERCOMW.2015.7133989.

The parsed title of the article should be:

Towards situation-aware adaptive workflows: SitOPT — A general purpose situation-aware workflow management system

But previously it was:

Towards situation-aware adaptive workflows: SitOPT — A general purpose situation-aware workflow management system

In order to make the cleanup, I added some code to the format method of the FieldContentFormatter. (Is that the right file for cleanup?).

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@InAnYan InAnYan changed the title Fix for issue 286 IEEE Xplore title em dash cleanup Mar 22, 2024
@InAnYan
Copy link
Member Author

InAnYan commented Mar 22, 2024

Oops, sorry. The test should be in field format test, but not in DoiFetchTester. Moved

@InAnYan
Copy link
Member Author

InAnYan commented Mar 22, 2024

I've created a new formatter, but can't understand where to add it to format fields. Currently, the title for article https://doi.org/10.1109/PERCOMW.2015.7133989 is not formatted. But I've made the class and tests. Maybe I should've not added it to the Formatters class (NormalizeEnDashesFormatter is not added there too)

@Siedlerchr
Copy link
Member

There should be a method doPostcleaup where you can add the formatter
If you want to apply for all fields, you can use InternalField. INTERNAL_ALL_FIELD

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


@Override
public String format(String value) {
return value.replaceAll("—", "—");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing the input, isnt that a double encoded unicode character?

Should the importer run https://github.com/JabRef/jabref/blob/main/src/main/java/org/jabref/logic/formatter/bibtexfields/HtmlToUnicodeFormatter.java twice for the title field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But that will leave the ampersand.

I've tried to add additional HtmlToUnicodeFormatter.java to DoiFetcher.doPostCleanup. Is that the right place?

It replaced the & with ampersand, but #x2014; was left as is. So, probably there should be a separate formatter?

I read the original issue and you have provided a sketch of the solution. You said to add a new cleanup class and integrate it with IEEE class. But how to add the cleanup to the IEEE? There seems to be no doPostCleanup method. What if we just change the parseJsonResponse and directly cleanup from there?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the IEEE doPostCleanup method, it's just a single method that is called manually because only the EntryBasedParserFetcher has it defined as interface method.
See thee DoiFetcher for example, it defines it manually

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that the JabRef doesn't use IEEE class when importing the DOI https://doi.org/10.1109/PERCOMW.2015.7133989

@InAnYan
Copy link
Member Author

InAnYan commented Mar 23, 2024

I've changed the class for the formatter.

Calling HtmlToUnicodeFormatter twice doesn't solve the problem completely.

And it seems IEEE class is not used for fetching https://doi.org/10.1109/PERCOMW.2015.7133989. (Maybe that's because this is an old issue, and the way JabRef works has changed?).

I've added the formatter to DoiFetcher.postCleanup. It works.

Siedlerchr
Siedlerchr previously approved these changes Mar 23, 2024
@Siedlerchr
Copy link
Member

Pleease the failing tests here
grafik
just click on details
for openrewrite you can execute the rewriteRun task

# Conflicts:
#	src/main/resources/l10n/JabRef_en.properties

@Override
public String format(String value) {
return value.replace("—", "—");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is Unicode or plain ascii? I think, we fetch in plain BibTeX, thus, it should be --, shouldn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The source code is in Unicode.

In NormalizeEnDashesFormatter en dash was replaced with --, em dash should be replaced with it too or something longer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep --.

Normally, it would be ---, but this is very strange in Paper titles. Never saw it. Therefore, I proposed --.

@koppor
Copy link
Member

koppor commented Mar 24, 2024

Calling HtmlToUnicodeFormatter twice doesn't solve the problem completely.

There needed to be a small fix. I coded it in #11091. Since the new code is much more smaller than the code proposed in this PR (and also handles more cases), I close this PR.

Thank you for digging out where to call the formatter! 👍

@koppor koppor closed this Mar 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New cleanup for em dash
3 participants