Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docs: Test examples that recreate lang analyzers #29535

Merged
merged 28 commits into from
May 9, 2018

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Apr 16, 2018

We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built an analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in anlayzer. Unsuprisingly, some of the examples aren't quite
right.

This adds a mechanism that tests that the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
_analyze API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both APIs return the same results.
Each of these calls to _anlayze takes about 20ms on my laptop which
seems fine.

Related to #29499

We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built *an* analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in anlayzer. Unsuprisingly, some of the examples aren't quite
right.

This adds a mechanism that tests that the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
`_analyze` API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both APIs return the same results.
Each of these calls to `_anlayze` takes about 20ms on my laptop which
seems fine.
@nik9000 nik9000 added >docs General docs changes >test Issues or PRs that are addressing/adding tests :Delivery/Build Build or test infrastructure v7.0.0 v6.3.0 labels Apr 16, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra

@nik9000
Copy link
Member Author

nik9000 commented Apr 16, 2018

I've filed this as ":core/build" because ":core/build" includes testing infrastructure. It might be more correct to file it as ":search/anlaysis" but the bulk of the work is in the testing infrastructure.

I thought of three ways to do this:

  1. Write a unit test that parses the docs looking for these snippets and runs the tests.
  2. Create special syntax in the docs tests that kicks off these test.
  3. Allow the docs tests to write things in YAML and have that kick off the test.

I opted for option 3 because it seemed the simplest thing at the time. Option two would have been fairly simple as well. Option 1 might produce faster tests but would require a lot of extra complexity that reproduces some of the docs testing infrastructure. Since option 3 takes ~20 milliseconds per language I think we're ok as far as speed is concerned.

@nik9000
Copy link
Member Author

nik9000 commented Apr 16, 2018

We also talking about doing this using some mechanism to force us to manually check the lucene analyzers when we upgrade lucene. Given how quickly these tests run I don't think that is worth it.

for (int i = 0; i < size; i++) {
testText.add(randomRealisticUnicodeOfCodepointLength(between(1, 15))
// Don't look up stashed values
.replace("$", "\\$"));
Copy link
Contributor

@mayya-sharipova mayya-sharipova Apr 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do strings generated from randomRealisticUnicodeOfCodepointLength also contain spaces, punctuation marks etc, so that we can test a tokenizer part of analyzers?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't look like they do. It'd be cool to insert spaces. I don't think I could easily use the same unicode page for both space separated strings though. That might not be too bad though.

"arabic_keywords",
"arabic_normalization",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for correcting documentation on all analyzers

@nik9000
Copy link
Member Author

nik9000 commented Apr 19, 2018

@mayya-sharipova, I've pushed a patch to add spaces. It isn't perfect, but it does add spaces. Have a look at the comment I sent for a more thorough explanation of what I mean.

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nik9000 Thanks for the change, Nik! For the search side (analyzers and a test to test that tokens are similar), I have left a small comment. Other than that everything looks fine.
May be somebody from the Core/Infra team can review it as well for the testing infrastructure part.

if (false == secondTokens.hasNext()) {
fail(second + " has fewer tokens than " + first + ". "
+ first + " has [" + firstTokens.next() + "] but " + second + " is out of tokens. "
+ first + "'s last token was [" + previousFirst + "] and "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where you assign something to previousFirst and previousSecond besides null?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah. I used to have it working. I'll fix.

testText.add(b.toString()
// Don't look up stashed values
.replace("$", "\\$"));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for this change

@nik9000 nik9000 added v6.4.0 and removed v6.3.0 labels Apr 25, 2018
@nik9000
Copy link
Member Author

nik9000 commented Apr 27, 2018

I found a few more bugs in the configurations by running with -Dtests.iters=50 and pushed fixes.

@nik9000
Copy link
Member Author

nik9000 commented Apr 27, 2018

@elasticmachine recheck this please.

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, I left really minor nits, but nothing that needs another review

private static CompareAnalyzers parse(XContentParser parser) throws IOException {
XContentLocation location = parser.getTokenLocation();
CompareAnalyzers section = PARSER.parse(parser, location);
assert parser.currentToken() == Token.END_OBJECT;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a message here so it'll be helpful if someone accidentally misses a closing token?

*/
NamedXContentRegistry XCONTENT_REGISTRY = new NamedXContentRegistry(Arrays.asList(
List<NamedXContentRegistry.Entry> DEFAULT_EXECUTABLE_CONTEXTS = unmodifiableList(Arrays.asList(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can be final

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started out declaring it public static final out of habit but checkstyle failed because they are all forced on that field because it is in an interface.

* {@link NamedXContentRegistry} that parses the default list of
* {@link ExecutableSection}s available for tests.
*/
NamedXContentRegistry XCONTENT_REGISTRY = new NamedXContentRegistry(DEFAULT_EXECUTABLE_CONTEXTS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, this could be final I think?

@nik9000 nik9000 merged commit f9dc868 into elastic:master May 9, 2018
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 9, 2018
…or-you

* elastic/master: (22 commits)
  Docs: Test examples that recreate lang analyzers  (elastic#29535)
  BulkProcessor to retry based on status code (elastic#29329)
  Add GET Repository High Level REST API (elastic#30362)
  add a comment explaining the need for RetryOnReplicaException on missing mappings
  Add `coordinating_only` node selector (elastic#30313)
  Stop forking groovyc (elastic#30471)
  Avoid setting connection request timeout (elastic#30384)
  Use date format in `date_range` mapping before fallback to default (elastic#29310)
  Watcher: Increase HttpClient parallel sent requests (elastic#30130)
  Mute ML upgrade test (elastic#30458)
  Stop forking javac (elastic#30462)
  Client: Deprecate many argument performRequest (elastic#30315)
  Docs: Use task_id in examples of tasks (elastic#30436)
  Security: Rename IndexLifecycleManager to SecurityIndexManager (elastic#30442)
  [Docs] Fix typo in cardinality-aggregation.asciidoc (elastic#30434)
  Avoid NPE in `more_like_this` when field has zero tokens (elastic#30365)
  Build: Switch to building javadoc with html5 (elastic#30440)
  Add a quick tour of the project to CONTRIBUTING (elastic#30187)
  Reindex: Use request flavored methods (elastic#30317)
  Silence SplitIndexIT.testSplitIndexPrimaryTerm test failure. (elastic#30432)
  ...
nik9000 added a commit that referenced this pull request May 9, 2018
We have a pile of documentation describing how to rebuild the built in
language analyzers and, previously, our documentation testing framework
made sure that the examples successfully built *an* analyzer but they
didn't assert that the analyzer built by the documentation matches the
built in anlayzer. Unsuprisingly, some of the examples aren't quite
right.

This adds a mechanism that tests that the analyzers built by the docs.
The mechanism is fairly simple and brutal but it seems to be working:
build a hundred random unicode sequences and send them through the
`_analyze` API with the rebuilt analyzer and then again through the
built in analyzer. Then make sure both APIs return the same results.
Each of these calls to `_anlayze` takes about 20ms on my laptop which
seems fine.
nik9000 added a commit to nik9000/elasticsearch that referenced this pull request May 9, 2018
Adds documentation for how to rebuild all the built in analyzers and
tests for that documentation using the mechanism added in elastic#29535.

Closes elastic#29499
dnhatn added a commit that referenced this pull request May 10, 2018
* master:
  Upgrade to Lucene-7.4-snapshot-6705632810 (#30519)
  add version compatibility from 6.4.0 after backport, see #30319 (#30390)
  Security: Simplify security index listeners (#30466)
  Add proper longitude validation in geo_polygon_query (#30497)
  Remove Discovery.AckListener.onTimeout() (#30514)
  Build: move generated-resources to build (#30366)
  Reindex: Fold "with all deps" project into reindex (#30154)
  Isolate REST client single host tests (#30504)
  Solve Gradle deprecation warnings around shadowJar (#30483)
  SAML: Process only signed data (#30420)
  Remove BWC repository test (#30500)
  Build: Remove xpack specific run task (#30487)
  AwaitsFix IntegTestZipClientYamlTestSuiteIT#indices.split tests
  LLClient: Add setJsonEntity (#30447)
  Expose CommonStatsFlags directly in IndicesStatsRequest. (#30163)
  Silence IndexUpgradeIT test failures. (#30430)
  Bump Gradle heap to 1792m (#30484)
  [docs] add warning for read-write indices in force merge documentation (#28869)
  Avoid deadlocks in cache (#30461)
  Test: remove hardcoded list of unconfigured ciphers (#30367)
  mute SplitIndexIT due to #30416
  Docs: Test examples that recreate lang analyzers  (#29535)
  BulkProcessor to retry based on status code (#29329)
  Add GET Repository High Level REST API (#30362)
  add a comment explaining the need for RetryOnReplicaException on missing mappings
  Add `coordinating_only` node selector (#30313)
  Stop forking groovyc (#30471)
  Avoid setting connection request timeout (#30384)
  Use date format in `date_range` mapping before fallback to default (#29310)
  Watcher: Increase HttpClient parallel sent requests (#30130)

# Conflicts:
#	x-pack/plugin/core/src/test/java/org/elasticsearch/xpack/core/LocalStateCompositeXPackPlugin.java
dnhatn added a commit that referenced this pull request May 10, 2018
* 6.x:
  Upgrade to Lucene-7.4-snapshot-6705632810 (#30519)
  Remove Discovery.AckListener.onTimeout() (#30514)
  Build: move generated-resources to build (#30366)
  Reindex: Fold "with all deps" project into reindex (#30154)
  Isolate REST client single host tests (#30504)
  Remove BWC repository test (#30500)
  Build: Remove xpack specific run task (#30487)
  AwaitsFix IntegTestZipClientYamlTestSuiteIT#indices.split tests
  LLClient: Add setJsonEntity (#30447)
  [docs] add warning for read-write indices in force merge documentation (#28869)
  Avoid deadlocks in cache (#30461)
  BulkProcessor to retry based on status code (#29329)
  Avoid setting connection request timeout (#30384)
  Test: remove hardcoded list of unconfigured ciphers (#30367)
  Add GET Repository High Level REST API (#30362)
  mute SplitIndexIT due to #30416
  Docs: Test examples that recreate lang analyzers  (#29535)
  add a comment explaining the need for RetryOnReplicaException on missing mappings
  Pass the task to broadcast actions (#29672)
  Stop forking groovyc (#30471)
  Add `coordinating_only` node selector (#30313)
  Fix accidental error in changelog
  Use date format in `date_range` mapping before fallback to default (#29310)
  Watcher: Increase HttpClient parallel sent requests (#30130)
  [Security][Tests] Azeri(Turkish) locale tripps opensaml dependency
nik9000 added a commit that referenced this pull request May 14, 2018
Adds documentation for how to rebuild all the built in analyzers and
tests for that documentation using the mechanism added in #29535.

Closes #29499
nik9000 added a commit that referenced this pull request May 14, 2018
Adds documentation for how to rebuild all the built in analyzers and
tests for that documentation using the mechanism added in #29535.

Closes #29499
@mark-vieira mark-vieira added the Team:Delivery Meta label for Delivery team label Nov 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Delivery/Build Build or test infrastructure >docs General docs changes Team:Delivery Meta label for Delivery team >test Issues or PRs that are addressing/adding tests v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants