Epic: merge Nokogumbo into Nokogiri #2204
Comments
Closes #170
A future version of Nokogiri will provide Nokogumbo's API (see sparklemotion/nokogiri#2204). This change will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API and become a "shim" -- gracefully defer to Nokogiri by refusing to load itself.
Some contractual assumptions I'm making about Nokogiri:
- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton method, module, and namespace (including classes `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and `Nokogiri::HTML5::DocumentFragment`)
- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will provide a similar `::Nokogiri::Gumbo` module which will provide the same constants and singleton methods as `::Nokogumbo`:
  - `Nokogumbo.parse()` will be provided as `Nokogiri::Gumbo.parse()`
  - `Nokogumbo.fragment()` → `Nokogiri::Gumbo.fragment()`
  - `Nokogumbo::DEFAULT_MAX_ATTRIBUTES` → `Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES`
  - `Nokogumbo::DEFAULT_MAX_ERRORS` → `Nokogiri::Gumbo::DEFAULT_MAX_ERRORS`
  - `Nokogumbo::DEFAULT_MAX_TREE_DEPTH` → `Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH`
This change checks for the existence of `Nokogiri::HTML5`, `Nokogiri::Gumbo`, and an expected singleton method on each. We could do a more- or less-thorough check here.
This change also provides an "escape hatch" using an environment variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to avoid the "shim" behavior. This escape hatch might be unnecessary, but this change is invasive enough to make me want to be cautious.
In "shim" mode, `Nokogumbo.parse()` and `.fragment()` will be forwarded to the Nokogiri implementation. The `Nokogumbo::DEFAULT*` constants will always be defined, but when in "shim" mode will be set to the `Nokogiri`-provided values.
Nokogumbo will emit a single warning message at `require`-time when it is in "shim" mode. This message points users to sparklemotion/nokogiri#2205 which will explain what's going on and help people migrate their applications (but is an empty placeholder right now).
I did not include deprecation warning messages in `Nokogumbo.parse` and `.fragment`. If you feel strongly that we should, let me know.
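The "shim" forwarding described above could look roughly like the sketch below. It is not Nokogumbo's actual code: `FakeNokogiri` and `ShimNokogumbo` are hypothetical stand-ins so the example runs without either gem installed, and the constant values and return values are placeholders.

```ruby
# Stand-in for the module Nokogiri is expected to provide. The constant
# values and return values are placeholders, not Nokogiri's real ones.
module FakeNokogiri
  module Gumbo
    DEFAULT_MAX_ATTRIBUTES = 400
    DEFAULT_MAX_ERRORS = 0
    DEFAULT_MAX_TREE_DEPTH = 400

    def self.parse(*args)
      [:nokogiri_parse, *args]
    end

    def self.fragment(*args)
      [:nokogiri_fragment, *args]
    end
  end
end

# Hypothetical shim: the same constants and singleton methods, all delegated.
module ShimNokogumbo
  # Constants are always defined, but take the Nokogiri-provided values.
  DEFAULT_MAX_ATTRIBUTES = FakeNokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES
  DEFAULT_MAX_ERRORS = FakeNokogiri::Gumbo::DEFAULT_MAX_ERRORS
  DEFAULT_MAX_TREE_DEPTH = FakeNokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH

  # Forward every argument untouched to the Nokogiri-side implementation.
  def self.parse(*args)
    FakeNokogiri::Gumbo.parse(*args)
  end

  def self.fragment(*args)
    FakeNokogiri::Gumbo.fragment(*args)
  end
end

p ShimNokogumbo.parse("<p>hi</p>")
```

Copying the constants rather than redefining them keeps the limits in one place, so callers see the same values whichever module they reference.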
Closes #170
A future version of Nokogiri will provide Nokogumbo's API (see sparklemotion/nokogiri#2204). This change will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API and decide whether to use Nokogiri's implementation or Nokogumbo's implementation.
Some contractual assumptions I'm making about Nokogiri:
- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton method, module, and namespace (including classes `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and `Nokogiri::HTML5::DocumentFragment`)
- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will provide a similar `::Nokogiri::Gumbo` module which will provide the same public API as `::Nokogumbo`.
This change checks for the existence of `Nokogiri::HTML5`, `Nokogiri::Gumbo`, and an expected singleton method on each. We could do a more- or less-thorough check here.
This change also provides an "escape hatch" using an environment variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to force Nokogumbo to use its own implementation. This escape hatch might be unnecessary, but this change is invasive enough to make me want to be cautious.
Nokogumbo will emit a single warning message at `require`-time when it uses Nokogiri's implementation. This message points users to sparklemotion/nokogiri#2205 which will explain what's going on and help people migrate their applications (but is an empty placeholder right now).
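As a rough illustration of the detection and escape hatch described above, a check along these lines would cover both. This is a sketch under stated assumptions, not the code in the PR: `nokogiri_provides_html5?` is a hypothetical helper, `:parse` is only a guess at the "expected singleton method", and treating the variable as active whenever it is set (regardless of value) is also an assumption. The `Fake*` modules are stand-ins so it runs without Nokogiri installed.

```ruby
# Hypothetical helper; `root` stands in for ::Nokogiri and `env` for ENV.
def nokogiri_provides_html5?(root, env = ENV)
  # Escape hatch (assumption: any value counts as "set").
  return false if env.key?("NOKOGUMBO_IGNORE_NOKOGIRI_HTML5")

  # Both modules must exist directly under `root` and respond to the
  # probed singleton method (:parse is an assumption).
  [:HTML5, :Gumbo].all? do |name|
    root.const_defined?(name, false) &&
      root.const_get(name, false).respond_to?(:parse)
  end
end

# Stand-in for a Nokogiri release without the HTML5 API.
module FakeOldNokogiri; end

# Stand-in for a Nokogiri release that provides the HTML5 API.
module FakeNewNokogiri
  module HTML5
    def self.parse(html); end
  end

  module Gumbo
    def self.parse(html); end
  end
end

p nokogiri_provides_html5?(FakeOldNokogiri, {})
p nokogiri_provides_html5?(FakeNewNokogiri, {})
p nokogiri_provides_html5?(FakeNewNokogiri, { "NOKOGUMBO_IGNORE_NOKOGIRI_HTML5" => "1" })
```

Passing `false` as the second argument to `const_defined?` and `const_get` limits the lookup to the given namespace, so an unrelated top-level `HTML5` constant cannot produce a false positive.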
This is great news! Thank you! |
Punch list updates:
Also see #2217 which merges the code and commit history (parked on a branch until I release v1.11.3). |
merge nokogumbo history
---
**What problem is this PR intended to solve?**
This is one step of many to merge Nokogumbo into Nokogiri (see [Epic: merge Nokogumbo into Nokogiri · Issue #2204 · sparklemotion/nokogiri](#2204)).
- Commit history for Nokogumbo is preserved in the Nokogiri repository
- Nokogumbo contributors are added to the Nokogiri gemspec, README, and copyright declarations
- All nokogumbo files should mention they are originally licensed under Apache 2.0 (an interpretation of APL2.0 clause 4.c) and mention that they have been changed (clause 4.b)
Just a quick update! Work on completing this integration got held up because I spent some time moving Nokogiri's CI to GitHub Actions (#2207, #2244, #2247, #2226) and improving the test coverage around gem packaging and installation (#1718, #2151, #2152, #2153). Now that that's done (see https://github.com/sparklemotion/nokogiri/actions), I'm in a good place to pick this back up and merge Nokogumbo's actions-based test suite. Hopefully this will go quickly, since the bulk of the work is done and I'd love to ship a release candidate and get some feedback.
@stevecheckoway You mentioned in another thread that you'd like the html5lib-tests to be imported from your fork+branch. Is that still what you'd prefer? An alternative (or intermediate step) might be to submodule it. Another question: it looks like html5lib-tests is still active, meaning that it's drifted since your fork. Are you thinking about rebasing at all? What's your mental model for how those tests should be maintained going forward? |
A submodule seems fine. It'd be great if they would respond to my pull requests and either accept the changes or reject them. But based on my reading of the errors (a part of the spec that's in flux), the tests were wrong. I'll look at rebasing.
Submodule sounds good to me, too. See #2273. |
I just rebased my There are also some new tree-building tests that don't pass. I looked at one and I don't see how the test matches the spec (but does agree with the browser). I'll follow up on that and see if I can figure out what the issue is. Here's an example of the new tests that are failing:
but my reading (and gumbo's code) produce
That is, the |
I asked about it on the pull request that introduced the tests. html5lib/html5lib-tests#135 |
@stevecheckoway OK, thanks for looking into it. I have #2273 which I should finish today, and once that's done it'll be on |
Everything from the "Functional Merger" section of this issue's description is done and on |
@rubys and @stevecheckoway - I'd like to "deprecate" the method For context, my reasoning is that fetching data over the network is a problem that is orthogonal to what I'd like Nokogiri to focus on: parsing and manipulation. There are quite a few networking libraries and HTTP clients that solve the networking problem already. I do want to make sure that Nokogiri integrates well with them (e.g., supporting IO objects), but I'd prefer not to maintain a responsibility in that domain if it's possible to avoid. If you feel strongly it should be a method in Nokogiri, can you help me understand your mental model?
- ci pipeline and scripts have been replicated, or intentionally dropped (e.g., gentoo and libxml2 build variations)
- Rakefile and Gemfile no longer needed
- CHANGELOG not needed
- LICENSE recognized in a few ways (see #2204)
[skip ci]
It predates my involvement with the project. I don't use it myself. |
…l4-namespace introduce html4 namespace
---
**What problem is this PR intended to solve?**
As the Nokogumbo merger progresses (see #2204), we now have an `HTML5` module and namespace, but the previous libxml2- (and nekohtml-) based functionality is parked under the ambiguous `HTML` module and namespace. I'd like to disambiguate, and also introduce an opportunity for us to use `HTML` for more general use in the future (e.g., perhaps detection of HTML doc format and choosing the right DOM parser).
This PR moves everything currently under `HTML` to `HTML4`, and makes `HTML` an alias for `HTML4`. It updates doc strings and class names.
Some changes in behavior that I want to note:
- objects will report a class of `Nokogiri::HTML4::XXX` where they previously reported `Nokogiri::HTML::XXX`
- some of the exported C symbols have been renamed (e.g., `mNokogiriHTML` is now `mNokogiriHTML4`) which might impact anyone writing C code and linking against Nokogiri's dylib
**Have you included adequate test coverage?**
I've left the tests alone (except for the addition of some "HTML/HTML4 equivalence" tests) to demonstrate there's no behavioral breakage.
**Does this change affect the behavior of either the C or the Java implementations?**
Notably, I've updated the Java files to rename classes and variables, and use the proper module and class names, so that it stays in sync with CRuby despite not having an `HTML5` module/namespace.
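The aliasing described in this PR can be illustrated with plain Ruby constants. The sketch below uses a hypothetical `FakeNokogiri` namespace rather than Nokogiri itself. It shows that both names resolve to the same module, while classes report the `HTML4` name because that is the constant they were first defined under.

```ruby
# Hypothetical namespace standing in for Nokogiri; not the real library.
module FakeNokogiri
  module HTML4
    class Document; end
  end

  # `HTML` becomes a second name for the very same module object.
  HTML = HTML4
end

doc = FakeNokogiri::HTML::Document.new

p FakeNokogiri::HTML.equal?(FakeNokogiri::HTML4)
p doc.class.name
p doc.is_a?(FakeNokogiri::HTML4::Document)
```

Code that compares class names as strings (for example in log assertions or serialized output) is the kind of caller that would notice the rename, while `is_a?` checks against either constant keep working.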
This should ensure that we don't break anybody who has upgraded Nokogiri but hasn't dropped Nokogumbo yet. Related to #2204
OK, everything except updating the tutorials is done, which I think means we should ship a release candidate and ask people to give feedback (likely in a "Discussion" thread). @sparklemotion/nokogiri-core Let me know if there's anything y'all think we need to do before cutting that release candidate, or if there's anything missing from the list in this issue. Otherwise, I'll cut a release in the next day or so. |
Perhaps just delete it as it never shipped with Nokogiri? As to mental model: I have a number of apps that parse HTML. Guess where the largest source of HTML would be. :-). You guessed it: the internet. So this was a common enough use case to factor out of the application and into a gem. If it doesn't belong in the nokogiri gem, a new gem could always be created to include this function. This is a common pattern with Rails: things that were once core but no longer are considered core are moved out into gems. P.S. Sorry for the delay in responding. |
Yeah, I thought about it, but it's been in Nokogumbo since 2013, and the pledge we're making with this integration is that apps can drop the dependency on Nokogumbo and everything will Just Work™. I don't want to support the method, but I also don't want to introduce any unnecessary hurdles to upgrade/adoption. In 4ac9350 I introduced a warning, and I think that's sufficient for now.
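For readers unfamiliar with the "keep it working but warn" approach mentioned above, here is a minimal sketch. It does not reproduce the warning added in 4ac9350: `FakeHTML5`, its message text, and the `fetcher:` parameter are all hypothetical, with the fetcher injected only so the example runs without network access.

```ruby
# Hypothetical stand-in for a parser module; not Nokogiri's code.
module FakeHTML5
  def self.parse(html)
    "parsed #{html.bytesize} bytes"
  end

  # Deprecated convenience method: it still works, but warns on each call.
  # `fetcher` is a hypothetical injection point standing in for an HTTP client.
  def self.get(uri, fetcher:)
    warn("FakeHTML5.get is deprecated: fetch the document with an HTTP client and call FakeHTML5.parse")
    parse(fetcher.call(uri))
  end
end

fetcher = ->(_uri) { "<!DOCTYPE html><title>hi</title>" }
puts FakeHTML5.get("https://example.com/", fetcher: fetcher)
```

The caller-side migration is the same idea in reverse: obtain the document body with whatever HTTP client the application already uses, then hand the string (or IO) to the parser.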
I've shipped a release candidate at v1.12.0.rc1, and started a discussion. Please let us know any feedback either here or in that discussion thread! |
Planning to ship v1.12 final this coming weekend, around July 31st. |
Nokogiri v1.12.0 has shipped. Closing. https://github.com/sparklemotion/nokogiri/releases/tag/v1.12.0 |
The maintainers of Nokogiri and Nokogumbo are planning to merge the two gems together so that Nokogiri assimilates Nokogumbo's HTML5 parsing functionality.
This issue is intended to be a "parent" issue which can be followed to understand the plan and how it's progressing, as the work will likely take O(weeks) and will be on a feature branch driven by multiple PRs.
This description will be edited as we go to reflect current state and progress.
Background
This work has previously been discussed:
Goals
Here's what success looks like:
For more specific objectives, see the "Punchlist" section below.
Note: No JRuby support at this time
The Nokogumbo code relies on a parser implemented in C, and a C extension that is tightly coupled to libxml2. As a result, the `Nokogiri::HTML5` module will not be immediately available on JRuby, which uses Xerces in place of libxml2.
We ask that all downstream libraries be aware of this platform limitation as they consider using `HTML5` parsing methods post-merger.
Risks
This work doesn't feel very risky, but if I had to name the riskiest bits:
Some things that could have been risky but aren't:
Frequently Asked Questions
Why is this going into v1.12 and not v2.0?
This is not a breaking change. We want everyone on v1.11 to upgrade, and will be making efforts to make that upgrade painless.
Will Nokogiri's current HTML API or parsing behavior change?
No. Nokogiri's existing HTML parsing functionality, available under the `Nokogiri::HTML` module/namespace, will not change in this release. The Nokogumbo additions to Nokogiri are all contained under the `Nokogiri::HTML5` module/namespace, and do not conflict with existing Nokogiri functionality.
In the future, we may explore how Nokogiri might be smarter about HTML4 versus HTML5 parsing, but those changes will be introduced carefully. See some more thoughts on this topic at #2064 (comment)
Will JRuby support HTML5?
Not initially, though we hope to work on this for a future release. See the section above for more information.
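A downstream library that wants to work on both CRuby and JRuby might guard its use of the HTML5 parser along the lines below. This is only a sketch: `preferred_html_parser` and the `Fake*` modules are hypothetical, and with the real gem the check would more likely be written as `defined?(Nokogiri::HTML5)`.

```ruby
# Hypothetical helper for a downstream library; `root` stands in for ::Nokogiri.
def preferred_html_parser(root)
  if root.const_defined?(:HTML5, false)
    root.const_get(:HTML5, false)
  else
    # Fall back to the existing HTML parser when HTML5 is unavailable.
    root.const_get(:HTML, false)
  end
end

# Stand-in for a platform where the HTML5 module is available.
module FakeCRubyNokogiri
  module HTML; end
  module HTML5; end
end

# Stand-in for a platform without the HTML5 module.
module FakeJRubyNokogiri
  module HTML; end
end

p preferred_html_parser(FakeCRubyNokogiri)
p preferred_html_parser(FakeJRubyNokogiri)
```

The two parsers can build different trees from the same input, so a fallback like this changes behavior across platforms and is worth documenting for users of the downstream library.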
Punchlist
Some finer-grained objectives (which will be modified over time as we discover new work to be done):
Pre-merger
Team Merger
Legal/License Merger
`gumbo-parser/src/README.md`
`LICENSE-DEPENDENCIES.md` to include gumbo and nokogumbo under Apache License 2.0 (APL2.0 clause 4.a)
Functional Merger
`rake test`
`lib/nokogiri` and C file moved into `ext/nokogiri`
`nokogiri.so`
Forward-looking Changes
`Nokogiri::HTML5.get`
`Nokogiri::HTML4` as an alias for `Nokogiri::HTML` so that at some point we can deprecate and change the `HTML` functionality (introduce html4 namespace #2278)
Pre-release
`/nokogumbo-import` contents should either be deleted or moved somewhere else
`gumbo_` symbols. do these need to be exported?
Release candidate
Final release
/cc @rubys @stevecheckoway