Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add "Dreyfus" (Erlang side of Search) to CouchDB #2037

Merged
merged 83 commits into from
Jun 21, 2019

Conversation

kocolosk
Copy link
Member

@kocolosk kocolosk commented May 25, 2019

Overview

This contribution pulls in the "Dreyfus" OTP application, which teaches CouchDB how to communicate with the "Clouseau" Java application for Lucene-powered Search and Mango text indexes. Once this PR is merged it will no longer be necessary to use a custom fork of CouchDB in order to enable Search.

Testing recommendations

In order to fully test this PR you'll need a build of Clouseau. Those can be found at

There's also a public Maven repo:

I've gone ahead and built a Docker image which can be used with Kubernetes:

and updated the upstream Helm chart to use it. If you want a one-liner to install CouchDB+Search on a Kube cluster of your choice you can try:

helm install incubator/couchdb --set image.repository=kocolosk/couchdb,image.tag=dreyfus-by-default,clusterSize=1,enableSearch=true --name test-search

Related Issues or Pull Requests

apache/couchdb-documentation#418
helm/charts#14822

Checklist

  • Code is written and works correctly;
  • Changes are covered by tests;
  • Documentation reflects the changes;

Posting a draft PR for discussion. Will lift the draft flag once documentation and a Clouseau build are available.

rnewson and others added 30 commits July 20, 2015 12:09
This is the first public release of Dreyfus under the Apache Software
License, version 2.
When loading design docs with the ID, we want to add ejson_body
to the list of options so the body is converted to term from binary.
This makes dreyfus compatible with COUCHDB2.0

BugzID:50663
Previous if the anonymous opening process died it would cause
dreyfus_index_manager to exit and kill all other dreyfus_index
processes. This lead to bad things happening in multiple places.

This fix just tracks the anonymous opener processs in the ?BY_PID table
so we know about it when it dies. The exit status is then reused the
same as would have been for an error returned when opening the index.

Fixes: apache#2
BugzID: 52884
This metric is different from what we track in clouseau. This basically will track
the overall time it took for the search request, where as the one in Clouseau will
only track the search latency at the shard level.

This will come in handy for tracking the search latency issues as seen  by the end users
and also allow us to add pager triggers based on this.
…fus-httpd-search

Add new metric to track the "search request" latency.
Rexi receives an error message that does not match any of the tuples
for the first argument of dreyfus_fabric:handle_error_message/6. This
leads to a function_clause error that bubbles up to chttpd which gets
logged because there is a stack trace involved. By adding an extra
clause that matches all error messages, we attempt to see if progress
can be made by calling handle_error. If progress can succeed, then
nothing gets logged. If the progress cannot succeed, an error will be
sent back to the user again via chttpd:send_error/2. However, if no
stack trace is involved in the error, we won't log the message. Adding
an extra couch_log:error here assures we have something in the logs to
make note of the error message to help us debug.

BugzId: 75421
If a database is deleted very soon after creation, it's possible for
the open_int call to return {not_found, no_db_file}. Expect this and
error cleanly out of gen_server initialisation.

BugzID: 77650
Counters were not stored as proper orddict (in a sorted manner).
Because of this when some Workers replied, they could not be found in
Counters and updated. dreyfus_fabric_search will keep waiting for
these Counters, which would lead to timeout on a request.

BugzID: 84146
Currently the only way to get the disk size information for the search
index is to use the _search_info end point. But using this point would
lead to opening the search index which is non trivial as lucene has an
overhead for opening the index.

These changes would add a new end point _search_disk_size to get the
disk size information without opening the search index. Sarnie can use
this new end point and avoid opening the search index.

BugzId:87336
…nd-point

Add new end point to get disk size information for search index
@wohali
Copy link
Member

wohali commented Jun 14, 2019 via email

@kocolosk
Copy link
Member Author

Yep understood and agreed.

@rnewson
Copy link
Member

rnewson commented Jun 14, 2019 via email

kocolosk added a commit to kocolosk/charts that referenced this pull request Jun 15, 2019
There is a companion open source project to CouchDB called Clouseau
that provides Lucene-powered fulltext search to CouchDB. This patch
provides a way to include Clouseau as a sidecar in each CouchDB Pod.

The latest release of CouchDB does not yet support this sidecar out of
the box; see apache/couchdb#2037. This integration is therefore
disabled by default.

Signed-off-by: Adam Kocoloski <kocolosk@apache.org>
Previously we had been using module_loaded(dreyfus_index) as a check
for the presence of the Search system. There are two issues with this
approach going forward:

1. Dreyfus is going to be included in every build
2. An Erlang release loads modules lazily and so this check could
   accidentally fail even on a Search-enabled system.

This patch changes the check to one that makes an RPC request
to the Clouseau (Java) subsystem. This should be a low-cost operation,
but I haven't benchmarked it.
@kocolosk
Copy link
Member Author

kocolosk commented Jun 15, 2019

I made a small change to replace the check for the dreyfus_index module with a test message to the Clouseau node. Here's a summary how CouchDB behaves when Clouseau is not connected:

Search

  • Updates to a Search design document (i.e., one with a top-level "indexes" field) succeed. I doubt we want to change this as it would have significant implications for replication.
  • Search requests fail after ~30 seconds with a 500 status code and {"error":"timeout","reason":"The request could not be processed in a reasonable amount of time."}. This could be improved.

Mango

  • Attempts to create a Mango text index via the _index endpoint fail ~immediately with a 503 status and the following response: {"error":"required index service unavailable","reason":"text"}. I think this is the right behavior.
  • The Mango query optimizer will skip any existing "text" indexes if the Clouseau subsystem is not available at query time. This could result in a previously fast query falling back to a full database scan. I'm OK with this.

I think there are three followups:

  1. Reject the Search request immediately instead of timing out with a generic error
  2. Consider some retry logic to mask Clouseau reboots during upgrades as suggested by @rnewson. I could see doing a short bounded exponential backoff for a couple of seconds.
  3. Quantify the cost of the RPC-based check for Clouseau's presence; both the amount of time to do the check, and the number of times the check is invoked in the request path for a healthy system.

@rnewson
Copy link
Member

rnewson commented Jun 15, 2019

for ref this is what dreyfus does for retries: https://github.com/cloudant-labs/dreyfus/blob/master/src/dreyfus_httpd.erl#L585 but obviously that is only for direct _search http requests, not via mango.

@chintan-mishra
Copy link

As a user, I have a question.

Why not include Clouseau as well?

fooka03 pushed a commit to fooka03/charts that referenced this pull request Jun 17, 2019
…m#14822)

* Bump to latest CouchDB release

Signed-off-by: Adam Kocoloski <kocolosk@apache.org>

* Add experimental integration with Lucene search

There is a companion open source project to CouchDB called Clouseau
that provides Lucene-powered fulltext search to CouchDB. This patch
provides a way to include Clouseau as a sidecar in each CouchDB Pod.

The latest release of CouchDB does not yet support this sidecar out of
the box; see apache/couchdb#2037. This integration is therefore
disabled by default.

Signed-off-by: Adam Kocoloski <kocolosk@apache.org>
Signed-off-by: Nigel Foucha <nigel.foucha@gmail.com>
@kocolosk
Copy link
Member Author

@chintan-mishra good question.

The Closeau codebase is a very different thing. It's all Scala/Java, so the release packaging machinery that we use for the main Erlang codebase wouldn't be applicable. It runs as a completely separate process, so the scripts that we use to start/stop/manage CouchDB wouldn't apply. It's also a relatively mature project and not one where the current IBM maintainers foresee a lot of additional contributions.

I'd like to hear your opinion as a user. I can imagine you want the Search add-on to be enabled as simply as possible. I'd like to see that too, but personally I think the right approach is to ensure that when the Clouseau service is installed using the convenience binaries CouchDB recognizes it and uses it automatically, rather than pulling the whole codebase "in-tree".

@AlexanderKaraberov
Copy link
Contributor

AlexanderKaraberov commented Jun 18, 2019

Hello @kocolosk

It's also a relatively mature project and not one where the current IBM maintainers foresee a lot of additional contributions.

The most fantastic thing would be to update Clouseau to Java 8+ and the latest Lucene . I've already opened an issue for this: cloudant-labs/clouseau#22
but I'm still not sure about the approximate timeframe for this to land. Are there any plans in the Cloudant roadmap to update Clouseau? JDK8 compatibility would be a major boost for this project as well as Lucene upgrade.

@rnewson
Copy link
Member

rnewson commented Jun 18, 2019

Updating to Java 8 is non-trivial as several dependencies need updating and there is no upstream support for them anymore. a Pull Request for all that work would move things forward, of course, but an Issue reporting what we already know isn't going to.

As the author of dreyfus/clouseau, I should point out that those projects apply to CouchDB 2.0 specifically, they are tightly coupled/aligned with that architecture.

Future versions of CouchDB that use FoundationDB as the base layer will need an alternative solution, which I, and my team, are working on right now (http://github.com/cloudant-labs/fdblucene is the fundamental piece). This work is pure Java and targets JDK 8, though I personally develop against JDK 11.

@kocolosk
Copy link
Member Author

The latest commit preserves the exponential backoff for Search requests, but if the final error message indicates an inability to connect to Clouseau the error message will be

{"error":"ou_est_clouseau","reason":"Could not connect to the Clouseau Java service at clouseau@127.0.0.1"}

That seems like a reasonable improvement to me.

I also did an ad hoc measurement of the time required to do the clouseau_rpc:connected() check. Unfortunately, on a Kubernetes deployment with CouchDB and Clouseau in the same Pod this check cost more than 3 milliseconds, which seems an unacceptably long amount of time to waste on the happy path. I will think a bit more about how to guard those Mango requests in a way that doesn't introduce too much latency to a well-configured system.

@kocolosk
Copy link
Member Author

OK, so now the connected() check first looks at the hidden nodes to see if we already have a connection to Clouseau. This check is much cheaper, around 0.1 ms. If no connection exists already, it tries the version() RPC to see if a connection can be made.

Personally I'm satisfied with the PR at this point.

@chintan-mishra
Copy link

chintan-mishra commented Jun 19, 2019

@kocolosk yes I would prefer Full-Text Search to be one of the config options in the CouchDB config.

Edit: The approach mentioned by you looks easy. I am up for that type of setup.

Copy link
Member

@wohali wohali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't the foggiest of how to start reviewing this, but with the 30s backoff that fails if you don't have Clouseau running sounds fine to me.

I asked @kocolosk to note the new configurable parameter(s) in rel/overlay/etc/default.ini, but @rnewson if there are any other dreyfus specific ones, would you mind bringing those over too? Thanks.

Those configurables should also be documented in the documentation PR, if you please. Thanks.

@janl
Copy link
Member

janl commented Jun 20, 2019

very happy to see this, thanks everyone!

kocolosk added 2 commits June 20, 2019 15:32
The clouseau_rpc:version() call actually takes a few milliseconds to
complete, so instead we first check for a hidden clouseau node already
connected to our node. If we don't find it, we do the version() RPC.
@kocolosk kocolosk force-pushed the dreyfus-by-default branch from 9a31987 to 0d32708 Compare June 20, 2019 19:34
@kocolosk
Copy link
Member Author

Doing a traditional merge commit here to be consistent with our past process on git-subtree merges. Thanks for the review everyone!

@kocolosk kocolosk merged commit cbf8804 into apache:master Jun 21, 2019
anasinnyk pushed a commit to MacPaw/charts that referenced this pull request Jun 29, 2019
…m#14822)

* Bump to latest CouchDB release

Signed-off-by: Adam Kocoloski <kocolosk@apache.org>

* Add experimental integration with Lucene search

There is a companion open source project to CouchDB called Clouseau
that provides Lucene-powered fulltext search to CouchDB. This patch
provides a way to include Clouseau as a sidecar in each CouchDB Pod.

The latest release of CouchDB does not yet support this sidecar out of
the box; see apache/couchdb#2037. This integration is therefore
disabled by default.

Signed-off-by: Adam Kocoloski <kocolosk@apache.org>
Signed-off-by: Andrii Nasinnyk <anasinnyk@macpaw.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.