Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epic: merge Nokogumbo into Nokogiri #2204

Closed
33 tasks done
flavorjones opened this issue Mar 13, 2021 · 20 comments
Closed
33 tasks done

Epic: merge Nokogumbo into Nokogiri #2204

flavorjones opened this issue Mar 13, 2021 · 20 comments
Milestone

Comments

@flavorjones
Copy link
Member

flavorjones commented Mar 13, 2021

The maintainers of Nokogiri and Nokogumbo are planning to merge the two gems together so that Nokogiri assimilates Nokogumbo's HTML5 parsing functionality.

This issue is intended to be a "parent" issue which can be followed to understand the plan and how it's progressing, as the work will likely take O(weeks) and will be on a feature branch driven by multiple PRs.

This description will be edited as we go to reflect current state and progress.

Background

This work has previously been discussed:

Goals

Here's what success looks like:

  • Nokogiri v1.12 supports HTML5 parsing using Nokogumbo's API, code, and fork of Gumbo.
  • The Nokogumbo maintainers feel welcomed into the Nokogiri maintainer team.
  • Upgrade paths for users are easy to navigate, and users don't get "stuck" on old versions.
  • The Gumbo parser ships as a precompiled library in all the native platform gems supported by Nokogiri v1.11.
  • The end-state of Nokogumbo is a (mostly) empty gem that prompts users to replace their dependency on Nokogumbo with Nokogiri.

For more specific objectives, see the "Punchlist" section below.

Note: No JRuby support at this time

⚠️ JRuby support is not going to be addressed as part of this work. However, we feel it is important to think about and we hope to work on this in the future. If you're interested in helping with HTML5 support on JRuby, please comment on this issue or ping the maintainers on the mailing list or the Discord channel.

The Nokogumbo code relies on a parser implemented in C, and a C extension that is tightly coupled to libxml2. As a result, the Nokogiri::HTML5 module will not be immediately available on JRuby, which uses Xerces in place of libxml2.

We ask that all downstream libraries be aware of this platform limitation as they consider using HTML5 parsing methods post-merger.

Risks

This work doesn't feel very risky, but if I had to name the riskiest bits:

  • Precompiling Gumbo for all our platforms might be challenging.
  • Gumbo is no longer supported upstream by Google, so we are assuming responsibility for keeping it secure, performant, and up-to-date with the HTML5 spec.
  • There's no way to prevent using a new Nokogiri with an old Nokogumbo and seeing lots of annoying "constant redefined" warning messages.

Some things that could have been risky but aren't:

  • License compatibility.
    • The Nokogumbo license is APL2.0. APL2.0 is compatible with MIT, and so we will maintain that license for specific files, preserving the notices required by APL2.0. We are already doing this with several files in the JRuby Java implementation.

Frequently Asked Questions

Why is this going into v1.12 and not v2.0?

This is not a breaking change. We want everyone on v1.11 to upgrade, and will be making efforts to make that upgrade painless.

Will Nokogiri's current HTML API or parsing behavior change?

No. Nokogiri's existing HTML parsing functionality, available under the Nokogiri::HTML module/namespace, will not change in this release. The Nokogumbo additions to Nokogiri are all contained under the Nokogiri::HTML5 module/namespace, and do not conflict with existing Nokogiri functionality.

In the future, we may explore how Nokogiri might be smarter about HTML4 versus HTML5 parsing, but those changes will be introduced carefully. See some more thoughts on this topic at #2064 (comment)

Will JRuby support HTML5?

Not initially, though we hope to work on this for a future release. See the section above for more information.

Punchlist

Some finer-grained objectives (which will be modified over time as we discover new work to be done):

Pre-merger

  • Release Nokogumbo version which no-ops if Nokogiri is providing HTML5 functionality

Team Merger

  • Nokogumbo contributors are added to the Nokogiri gemspec, README, and copyright declarations.
  • Nokogumbo contributors are added to @sparklemotion/nokogiri-core
  • Send some welcome gifts to the Nokogumbo maintainers.

Legal/License Merger

  • All nokogumbo files should mention they are originally licensed under Apache 2.0 (an interpretation of APL2.0 clause 4.c) and mention that they have been changed (clause 4.b)
  • Changed gumbo-parser files should "carry prominent notices stating that You changed the files" (APL2.0 clause 4.b)
    • I believe this intention is covered by gumbo-parser/src/README.md
  • Update LICENSE-DEPENDENCIES.md to include gumbo and nokogumbo under Apache License 2.0 (APL2.0 clause 4.a)

Functional Merger

  • Commit history for Nokogumbo is preserved in the Nokogiri repository.
  • Unit tests include an integration contract for feat: Nokogumbo detects Nokogiri's HTML5 API rubys/nokogumbo#171
  • Nokogumbo unit tests integrated into rake test
  • Nokogumbo's HTML5 files moved into lib/nokogiri and C file moved into ext/nokogiri
  • extconf.rb builds Nokogumbo and Gumbo into nokogiri.so
  • Nokogumbo unit tests are green on a dev system
  • Valgrind memory testing is green on HTML5 functionality
  • Gumbo can be precompiled for linux 64-bit and 32-bit.
  • Gumbo can be precompiled for windows 64-bit and 32-bit.
  • Gumbo can be precompiled for darwin x64-64 and arm64.
  • Gumbo can be compiled from the vanilla gem on platforms without a native gem.
  • Import the html5lib tests into the repository and the test suite, and get it green in CI
  • Run the gumbo parser tests on all platforms
  • Integrate test coverage platform breadth from Nokogumbo's CI

Forward-looking Changes

  • deprecate Nokogiri::HTML5.get
  • introduce Nokogiri::HTML4 as an alias for Nokogiri::HTML so that at some point we can deprecate and change the HTML functionality (introduce html4 namespace #2278)

Pre-release

Release candidate

Final release

  • :shipit:

/cc @rubys @stevecheckoway

@flavorjones flavorjones added this to the v1.12.0 milestone Mar 13, 2021
flavorjones added a commit to rubys/nokogumbo that referenced this issue Mar 14, 2021
Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
sparklemotion/nokogiri#2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and become a "shim" -- gracefully defer to Nokogiri by refusing to
load itself.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)

- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same constants and singleton methods as `::Nokogumbo`:

  - `Nokogumbo.parse()` will be provided as `Nokogiri::Gumbo.parse()`
  - `Nokogumbo.fragment()` → `Nokogiri::Gumbo.fragment()`
  - `Nokogumbo::DEFAULT_MAX_ATTRIBUTES` → `Nokogiri::Gumbo::DEFAULT_MAX_ATTRIBUTES`
  - `Nokogumbo::DEFAULT_MAX_ERRORS` → `Nokogiri::Gumbo::DEFAULT_MAX_ERRORS`
  - `Nokogumbo::DEFAULT_MAX_TREE_DEPTH` → `Nokogiri::Gumbo::DEFAULT_MAX_TREE_DEPTH`

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to avoid
the "shim" behavior. This escape hatch might be unnecessary, but this
change is invasive enough to make me want to be cautious.

In "shim" mode, `Nokogumbo.parse()` and `.fragment()` will be
forwarded to the Nokogiri implementation. The `Nokogumbo::DEFAULT*`
constants will always be defined, but when in "shim" mode will be set
to the `Nokogiri`-provided values.

Nokogumbo will emit a single warning message at `require`-time when it
is in "shim" mode. This message points users to
sparklemotion/nokogiri#2205 which will
explain what's going on and help people migrate their
applications (but is an empty placeholder right now).

I did not include deprecation warning messages in `Nokogumbo.parse`
and `.fragment`. If you feel strongly that we should, let me know.
flavorjones added a commit to rubys/nokogumbo that referenced this issue Mar 15, 2021
Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
sparklemotion/nokogiri#2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and whether to use Nokogiri's implementation or Nokogumbo's
implementation.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)

- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same public API as `::Nokogumbo`.

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to force
Nokogumbo to use its own implementation. This escape hatch might be
unnecessary, but this change is invasive enough to make me want to be
cautious.

Nokogumbo will emit a single warning message at `require`-time when it
is uses Nokogiri's implementation. This message points users to
sparklemotion/nokogiri#2205 which will
explain what's going on and help people migrate their
applications (but is an empty placeholder right now).
flavorjones added a commit to rubys/nokogumbo that referenced this issue Mar 15, 2021
Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
sparklemotion/nokogiri#2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and whether to use Nokogiri's implementation or Nokogumbo's
implementation.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)

- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same public API as `::Nokogumbo`.

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to force
Nokogumbo to use its own implementation. This escape hatch might be
unnecessary, but this change is invasive enough to make me want to be
cautious.

Nokogumbo will emit a single warning message at `require`-time when it
is uses Nokogiri's implementation. This message points users to
sparklemotion/nokogiri#2205 which will
explain what's going on and help people migrate their
applications (but is an empty placeholder right now).
@SamSaffron
Copy link

This is great news! Thank you!

flavorjones added a commit that referenced this issue Apr 7, 2021
Closes #170

A future version of Nokogiri will provide Nokogumbo's API (see
#2204). This change
will allow Nokogumbo to detect whether Nokogiri provides the HTML5 API
and whether to use Nokogiri's implementation or Nokogumbo's
implementation.

Some contractual assumptions I'm making about Nokogiri:

- Nokogiri will faithfully reproduce the `::Nokogiri::HTML5` singleton
  method, module, and namespace (including classes
  `Nokogiri::HTML5::Node`, `Nokogiri::HTML5::Document`, and
  `Nokogiri::HTML5::DocumentFragment`)

- Nokogiri will not provide a `::Nokogumbo` module/namespace, but will
  provide a similar `::Nokogiri::Gumbo` module which will provide the
  same public API as `::Nokogumbo`.

This change checks for the existence of `Nokogiri::HTML5`,
`Nokogiri::Gumbo`, and an expected singleton method on each. We could
do a more- or less-thorough check here.

This change also provides an "escape hatch" using an environment
variable `NOKOGUMBO_IGNORE_NOKOGIRI_HTML5` which can be set to force
Nokogumbo to use its own implementation. This escape hatch might be
unnecessary, but this change is invasive enough to make me want to be
cautious.

Nokogumbo will emit a single warning message at `require`-time when it
is uses Nokogiri's implementation. This message points users to
#2205 which will
explain what's going on and help people migrate their
applications (but is an empty placeholder right now).
@flavorjones
Copy link
Member Author

Punch list updates:

Also see #2217 which merges the code and commit history (parked on a branch until I release v1.11.3).

flavorjones added a commit that referenced this issue Apr 8, 2021
merge nokogumbo history

---

**What problem is this PR intended to solve?**

This is one step of many to merge Nokogumbo into Nokogiri (see [Epic: merge Nokogumbo into Nokogiri · Issue #2204 · sparklemotion/nokogiri](#2204)).

- Commit history for Nokogumbo is preserved in the Nokogiri repository
- Nokogumbo contributors are added to the Nokogiri gemspec, README, and copyright declarations
- All nokogumbo files should mention they are originally licensed under Apache 2.0 (an interpretation of APL2.0 clause 4.c) and mention that they have been changed (clause 4.b)
@flavorjones flavorjones pinned this issue Apr 17, 2021
@hino-ryu

This comment was marked as off-topic.

@flavorjones
Copy link
Member Author

Just a quick update!

Work on completing this integration got held up because I spent some time moving Nokogiri's CI to Github Actions (#2207, #2244, #2247, #2226) and improving the test coverage around gem packaging and installation (#1718, #2151, #2152, #2153).

Now that that's done (see https://github.com/sparklemotion/nokogiri/actions), I'm in a good place to pick this back up and merge Nokogumbo's actions-based test suite. Hopefully this will go quickly, since the bulk of the work is done and I'd love to ship a release candidate and get some feedback.

@flavorjones
Copy link
Member Author

@stevecheckoway You mentioned in another thread that you'd like the html5lib-tests to be imported from your fork+branch. Is that still what you'd prefer? An alternative (or intermediate step) might be to submodule it.

Another question: it looks like html5lib-tests is still active, meaning that it's drifted since your fork. Are you thinking about rebasing at all? What's your mental model for how those tests should be maintained going forward?

@stevecheckoway
Copy link
Contributor

A submodule seems fine. It'd be great if they would respond to my pull requests either accept the changes or reject them. But based on my reading of the errors (a part of the spec that's in flux), the tests were wrong. I'll look at rebasing.

@flavorjones
Copy link
Member Author

Submodule sounds good to me, too. See #2273.

@stevecheckoway
Copy link
Contributor

I just rebased my all-error-fixes branch of html5lib-tests and discovered that a new test was added that changes the test format. Previously, comments always appeared on one line, now they can appear on multiple lines.

There are also some new tree-building tests that don't pass. I looked at one and I don't see how the test matches the spec (but does agree with the browser). I'll follow up on that and see if I can figure out what the issue is.

Here's an example of the new tests that are failing: <math></p><foo> This should produce the DOM:

<html>
  <body>
    <math>
    <p>
    <foo>

but my reading (and gumbo's code) produce

<html>
  <body>
    <math>
      <p>
      <foo>

That is, the p and foo elements should be children of body according to the test but are being parsed as children of math.

@stevecheckoway
Copy link
Contributor

I asked about it on the pull request that introduced the tests. html5lib/html5lib-tests#135

@flavorjones
Copy link
Member Author

@stevecheckoway OK, thanks for looking into it. I have #2273 which I should finish today, and once that's done it'll be on main for you to do with as you please.

@flavorjones
Copy link
Member Author

Everything from the "Functional Merger" section of this issue's description is done and on main.

@stevecheckoway
Copy link
Contributor

I dug into it a little more. Safari, Firefox, and Chrome all agree with the test. html5ever agrees with Gumbo and my reading of the spec. More details here. This could be a spec bug, so I filed an issue. It's more likely I'm just missing something subtle.

flavorjones added a commit that referenced this issue Jun 20, 2021
flavorjones added a commit that referenced this issue Jun 20, 2021
@flavorjones
Copy link
Member Author

@rubys and @stevecheckoway - I'd like to "deprecate" the method Nokogiri::HTML5.get in v1.12.0, by which I mean I'd like to emit a warning when it's used, letting users know it will be removed in a future release.

For context, my reasoning is that fetching data over the network is a problem that is orthogonal to what I'd like Nokogiri to focus on: parsing and manipulation. There are quite a few networking libraries and HTTP clients that solve the networking problem already. I do want to make sure that Nokogiri integrates will with them (e.g., supporting IO objects), but I'd prefer not to maintain a responsibility in that domain if it's possible to avoid.

If you feel strongly it should be a method in Nokogiri, can you help me understand your mental model?

flavorjones added a commit that referenced this issue Jun 21, 2021
- ci pipeline and scripts have been replicated, or intentionally
  dropped (e.g., gentoo and libxml2 build variations)
- Rakefile and Gemfile no longer needed
- CHANGELOG not needed
- LICENSE recognized in a few ways (see #2204)

[skip ci]
@stevecheckoway
Copy link
Contributor

It predates my involvement with the project. I don't use it myself.

flavorjones added a commit that referenced this issue Jun 21, 2021
…l4-namespace

introduce html4 namespace

---

**What problem is this PR intended to solve?**

As the Nokogumbo merger progresses (see #2204), we now have an `HTML5` module and namespace, but the previous libxml2-(and nekohtml-) based functionality is parked under the ambiguous `HTML` module and namespace.

I'd like to disambiguate, and also introduce an opportunity for us to use `HTML` for more general use in the future (e.g., perhaps detection of HTML doc format and choosing the right DOM parser).

This PR moves everything currently under `HTML` to `HTML4`, and makes `HTML` an alias for `HTML4`. It updates doc strings and class names.

Some changes in behavior that I want to note:

- objects will report a class of `Nokogiri::HTML4::XXX` where they previously reported `Nokogiri::HTML::XXX`
- some of the exported C symbols have been renamed (e.g., `mNokogiriHTML` is now `mNokogiriHTML4`) which might impact anyone writing C code and linking against Nokogiri's dylib


**Have you included adequate test coverage?**

I've left the tests alone (except for the addition of some "HTML/HTML4 equivalence" tests) to demonstrate there's no behavioral breakage.

**Does this change affect the behavior of either the C or the Java implementations?**

Notably, I've updated the Java files to rename classes and variable, and use the proper module and class names, so that it stays in sync with CRuby despite not having an `HTML5` module/namespace.
flavorjones added a commit that referenced this issue Jun 23, 2021
This should ensure that we don't break anybody who has upgraded
Nokogiri but hasn't dropped Nokogumbo yet.

Related to #2204
@flavorjones
Copy link
Member Author

OK, everything except updating the tutorials is done, which I think means we should ship a release candidate and ask people to give feedback (likely in a "Discussion" thread).

@sparklemotion/nokogiri-core Let me know if there's anything y'all think we need to do before cutting that release candidate, or if there's anything missing from the list in this issue. Otherwise, I'll cut a release in the next day or so.

@rubys
Copy link
Contributor

rubys commented Jun 24, 2021

I'd like to "deprecate" the method Nokogiri::HTML5.get

Perhaps just delete it as it never shipped with Nokogiri?

As to mental model: I have a number of apps that parse HTML. Guess where the largest source of HTML would be. :-). You guessed it: the internet. So this was a common enough use case to factor out of the application and into a gem.

If it doesn't belong in the nokogiri gem, a new gem could always be created to include this function. This is a common pattern with Rails: things that were once core but no longer are considered core are moved out into gems.

P.S. Sorry for the delay in responding.

@flavorjones
Copy link
Member Author

Perhaps just delete it as it never shipped with Nokogiri?

Yeah,I thought about it, but it's been in Nokogumbo since 2013, and the pledge we're making with this integration is that apps can drop the dependency on Nokogumbo and everything will Just Work™. I don't want to support the method, but I also don't want to introduce any unnecessary hurdles to upgrade/adoption.

In 4ac9350 I introduced a warning, and I think that's sufficient for now.

@flavorjones
Copy link
Member Author

I've shipped a release candidate at v1.12.0.rc1, and started a discussion.

Please let us know any feedback either here or in that discussion thread!

@flavorjones
Copy link
Member Author

Planning to ship v1.12 final this coming weekend, around July 31st.

@flavorjones
Copy link
Member Author

Nokogiri v1.12.0 has shipped. Closing.

https://github.com/sparklemotion/nokogiri/releases/tag/v1.12.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants
@rubys @SamSaffron @flavorjones @stevecheckoway @hino-ryu and others