Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prepare for 2.0 release #43

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

Conversation

anthonyvdotbe
Copy link
Contributor

@anthonyvdotbe anthonyvdotbe commented Aug 17, 2020

Closes #26, closes #25, closes #14.

To build, a JDK 11+ is required. To build the Javadoc, a JDK 15+ is required (otherwise it fails on the xom module. There's already an RC available at http://jdk.java.net/15/, and the final release is due next month).

The commits should be self-explanatory.

W.r.t. the cleanup: which of these commits must be reverted, and what can effectively be removed (also considering that it's trivial to resurrect anything from a Git repo)?

Edit: to clarify: before starting out with this PR, I deleted everything that isn't required to build the final JARs, purely for my own convenience. Obviously, I'm aware that some of it needs to stay. However, given the reorganization of the project's structure, I believe it would be useful to go over these commits, and decide what to do with each of them:

  • retain them, given that anything that's deleted can trivially be resurrected (e.g. gwt-src seems to have been added at the time in order to enable HTML parsing in env.js. However, env.js is effectively dead, and GWT itself is mostly irrelevant as well nowadays)
  • revert them, either as is, or by additionally moving the affected files. Due to the separation in multiple Maven modules, there are now 3 top-level directories containing sources (htmlparser, saxtree, xom). So to avoid clutter in the top-level directory, I believe it would be best to move any other remaining top-level directories elsewhere, typically into the new htmlparser module's directory (e.g. translator-src would move to htmlparser/src/translator).

So for the record: I'm not putting any of this up for debate: if you say "revert all commits as is", then I'll do so without further ado.

W.r.t. "Make jchardet & ICU4J heuristics a no-op": jchardet is problematic. When relying on its automatic module name, Maven gives the following warning:

[WARNING] ****************************************************************************************************************************************
[WARNING] * Required filename-based automodules detected: [jchardet-1.0.jar]. Please don't publish this project to a public artifact repository! *
[WARNING] ****************************************************************************************************************************************

So I really believe we should eliminate the dependency altogether. The easiest way to do so, is by simply making the related heuristic a no-op. In fact, I believe the same should be done with the ICU4J dependency. The usage of optional Maven dependencies and requires static clauses are indications that the current situation isn't quite right.

The current commit is the minimal change required to eliminate the dependencies.

Now there are 3 options:

  1. remove Heuristics altogether, given that

    • it's only used if: it's explicitly enabled, and the given InputSource doesn't wrap a Reader, and both of the basic sniffers (Bom & Meta) fail
    • it's trivial for users to do the encoding detection themselves, using any library of their choosing
    • none of the current sniffers are state of the art
  2. use an SPI: htmlparser should introduce an interface EncodingSniffer and its module-info.java should say uses EncodingSniffer;. Then the current sniffers should be moved into their own Maven module(s), implement EncodingSniffer, and be made available as services to htmlparser

  3. merge as is: this is no more than a behavioral regression (except for the removal of the sniffers, but I don't see how direct usage of these classes could possibly be justified), so we could just wait for someone to report it. Then once (if ever) it happens, one of the former 2 options can be implemented

What's your opinion?

(Edit: @carlosame I edited this post to clarify the "Clean up"-commits. And contrary to what you're saying, normalization checking no longer requires ICU4J. So please remove any comments that are not/no longer relevant, or at least bundle all your comments into a single one.)
(Edit2: @carlosame ICU4J is over 30x the size of htmlparser, and both its encoding detection (which "isn't very good") and normalization checking are disabled by default. Anyway, I get your point. And again: 3 of your comments are irrelevant at this point, and the other 2 could easily be merged into 1. So please clean up your existing comments & stop adding new comments (just to be clear: add as much feedback as you want, but do so by editing an existing comment, not adding new ones all the time).)

@carlosame
Copy link

The easiest way to do so, is by simply making the related heuristic a no-op

I thought that there was already an agreement about not introducing backward incompatibilities at this point.

The usage of optional Maven dependencies and requires static clauses are indications that the current situation isn't quite right.

On ICU4J, the requires need not be static, and the Maven dependency could be marked as not optional. Saying "I'm going to use setCheckingNormalization(boolean) so I need to add ICU4J to the path" isn't exactly obvious or straight-forward.

Copy link

@carlosame carlosame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks like something used by the C++ translator (d6df8ad).

Copy link

@carlosame carlosame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The htmlparser project without the C++ translation? Have you talked with @hsivonen about that?

pom.xml Outdated Show resolved Hide resolved
Copy link

@carlosame carlosame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too bad that you did this (2928550) after changing the repository layout, otherwise this could have been a good candidate to cherry-pick just now.

@carlosame
Copy link

normalization checking no longer requires ICU4J.

There is no compelling need to remove ICU4J, and now you are using an ICU4J snapshot in COMPOSING_CHARACTERS which may be out of sync with the JDK's unicode. The unicode data used by the JDK changes over releases.

Not a huge issue, but I do not see the need to remove ICU4J.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants