Skip to content
This repository has been archived by the owner on Jun 3, 2020. It is now read-only.

Latest commit

 

History

History
161 lines (128 loc) · 9.96 KB

DeveloperDocs_SearchServer.md

File metadata and controls

161 lines (128 loc) · 9.96 KB

SearchServer - Developer Notes

Overview

The spacewalk webui performs searches through a XMLRPC interface to a separate java process that uses Lucene for the search engine. We call this separate process the "SearchServer".

A typical search involves

  • WebUI forming a query in LuceneQueryParser syntax.
  • WebUI sends the query to SearchServer through XMLRPC
  • SearchServer looks to see what index should be searched: Errata, Packages, System, Documentation
  • Lucene searches index files, and a list of results are obtained.
  • Result consists of the object ID and the score from lucene (float 1.0 max value, 0.0 lowest)
  • SearchServer trims the returned list so only values which are above a threshold score (set in configuration file) are returned.
  • Additionally the max number of results returned is limited by the configuration setting of "search.max_hits_returned"
  • WebUI uses the returned ids to flesh out the info it wants to display
  • This step is also responsible for filtering the data to retain proper user/org permission rules.
  • Fleshed out objects displayed

Basic Operations

Index

  • Upon start up SearchServer reads the database table, "rhnIndexerWork" to determine what has been indexed previously.

  • Each row corresponds to a different object_type, e.g.: errata, packages, systems.

  • Once SearchServer knows the last time the index operation ran, and the last id it indexed, it asks the database for objects which are new and/or modified since then.

  • These new objects are passed into Lucene so they may be indexed for later searching.

  • After startup, SearchServer will poll the database for updated changes. Typical polling period is 5 minutes, but this configurable.

Retrieval

When a request comes in through XMLRPC and matches the "index" namespace, lucene is used to search the indexes for a match.

  • Format of message:
  • Session ID
  • Index name - controls which index to search, e.g.: docs, errata, hwdevice, package, server
  • This corresponds to the directory name for the index, typically stored at: /usr/share/rhn/search/indexes
  • Query - this is what we are searching for, could be a package name, system name, phrase for doc search etc
  • In our code, the entry point for search functionality is defined here: "com.redhat.satellite.search.index.IndexManager::search"

XMLRPC

  • By default we listen on 127.0.0.1, port 2828

  • The address to listen on is configured by : "search.rpc_address"

  • The port number is configured by : "search.rpc_port"

  • Namespace: index

  • Handler: com.redhat.satellite.search.rpc.handlers.IndexHandler

  • Description: This handles the basic searches through lucene indexes. Most searches use this namespace.

  • Namespace: db

  • Handler: com.redhat.satellite.search.rpc.handlers.DatabaseHandler

  • Description: This handles searches which ONLY look at the database, it will not search any lucene indexes. It's mainly used for errata search by a date range.

  • Namespace: admin

  • Handler: com.redhat.satellite.search.rpc.handlers.AdminHandler

  • Description: This is the administration interface, currently is supports triggering a "reindex" for any index type we know about. This is useful if you want the index to be updated immediately,

  • Example: this is invoked by the python backend after a server is registered and from webui when a new custom errata is created.

Lucene

  • Index files are stored at: /usr/share/rhn/search/indexes

  • Documentation "doc" indexes are delivered through a RPM "spacewalk-doc-indexes".

  • All other indexes are generated based on the database SearchServer is connected to.

  • How can I see what is indexed

  • You can use Luke to open a lucene index in a GUI and see what is available

  • We include a luke jar in our git repo, this is not bundled for delivery, it is only for development * spacewalk/search-server/spacewalk-search/scripts/lukeall-0.8.1.jar

  • Example run: * java -jar spacewalk/search-server/scripts/lukeall-0.8.1.jar * Select index to open: "/usr/share/rhn/search/indexes/package"
    * Click OK * Select the "Documents" view, then browse through the items.

    • Note: Lucene does not store all values in a way we can view, only fields which were marked as "Stored" will be displayed.
      * Searches will likely not work as desired from Luke. Since we are using a custom NGramAnalyzer, we need to import this into Luke in order for our searches to function.
  • How to Clean

  • If you want to clean the existing indexes you can run a script: /etc/init.d/rhn-search cleanindex

  • REMEMBER there are 2 parts to cleaning the indexes * Delete the directory containing the indexes * Adjust the database 'rhnIndexerWork' entry for the specified 'object_type'. This is what will trigger the data to be re-read. * NEVER delete the doc indexes, they come pre-packaged in a RPM. (spacewalk-doc-indexes)

Database

  • iBatis is the ORM SearchServer uses

  • Location of queries

  • spacewalk/search-server/src/config/com/redhat/satellite/search/db

Howto Build/Run

Build

Run

  • You can run search-server as a service by calling service rhn-search start, alternatively run rhn-search console to see the search-server running in the foreground
  • You can interact with the SearchServer through spacewalk/search-server/spacewalk-search/scripts/search.py
  • ./search.py --help
  • ./search.py --username admin --password spacewalk --package firefox
  • When is --serverAddr helpful * During development you may want to run a spacewalk instance and db on a remote machine while running search-server on your local box. * --serverAddr allows you to specify the spacewalk instance you want to log into for retrieving the session-id
    • NOTE: The search-server must be running locally (to search.py), unless you open up the rpc binding to occur on the network interface and not 127.0.0.1

Configuration

  • Configuration values from: /etc/rhn/rhn.conf are inherited by SearchServer

  • SearchServer specific configuration resides in git at: spacewalk/search-server/spacewalk-search/src/config/search/rhn_search/

  • Example

    search.index_work_dir=/usr/share/rhn/search/indexes search.rpc_handlers=index:com.redhat.satellite.search.rpc.handlers.IndexHandler,db:com.redhat.satellite.search.rpc.handlers.DatabaseHandler,admin:com.redhat.satellite.search.rpc.handlers.AdminHandler search.max_hits_returned=500 search.connection.driver_class=oracle.jdbc.driver.OracleDriver search.score_threshold=.10 search.system_score_threshold=.01 search.errata_score_threshold=.20 search.errata.advisory_score_threshold=.30 search.min_ngram = 1 search.max_ngram = 5 search.doc.limit_results = false search.schedule.interval = 300000 search.log.explain.results = false

  • search.index_work_dir : Specifies where Lucene indexes are kept

  • search.rpc_handlers : semi-colon separated list of classes to act as handlers for XMLRPC calls.

  • search.max_hits_returned : maximum number of results which will be returned to the caller

  • search.connection.driver_class : JDBC class

  • search.score_threshold : minimum score a result needs to be returned back to caller

  • search.system_score_threshold : minimum score a system search result needs to be returned back to caller

  • search.errata_score_threshold : minimum score an errata search result needs to be returned back to caller

  • search.errata.advisory_score_threshold : minimum score an errata advisory result needs to be returned back to caller

  • search.min_ngram : minimum length of n-gram characters (any change to this value requires 'clean-index' to be run, plus doc-indexes need to be modified and rebuilt)

  • search.max_ngram : maximum length of n-gram characters (any change to this value requires 'clean-index' to be run, plus doc-indexes need to be modified and rebuilt)

  • search.doc.limit_results : true means we limit the number of results both on search.score_threshold and restrict max hits to be below search.max_hits_returned, false means to return all doc search matches

  • search.schedule.interval : value is in milliseconds, controls the interval SearchServer polls the database for changes, default is 5 minutes (300000)

  • search.log.explain.results : used during development/debugging. If set to true, will log additional info hinting at what influenced the score of each result.

Documentation Search

  • Documentation search requires pre-packaged indexes to be installed, these indexes will not be re-generated if they are deleted.

  • rpm -q spacewalk-doc-indexes

  • The 'clean-index' script is intelligent enough to leave document indexes alone

  • To build updated doc indexes

  • cd /spacewalk/search-server/spacewalk-doc-indexes

  • run ./crawl_www.sh

  • rpmbuild -ba ./spacewalk-doc-indexes.spec

  • The build process involves running "nutch" to crawl the online documents and generated index files which are then packaged into the spacewalk-doc-indexes rpm

Misc Notes

  • NGram

  • We are using a ngram analyzer. This gives us a "fuzzy" like search behavior, it allows spelling errors to be forgiven and likely matches to be found.

  • Drawback - people are often confused initially when they search on a specific term and see unrelated matches.

  • Since results are presented with highest score first, users will be presented with best choices first, followed by potentially undesired results.

  • Security Limitations

  • SearchServer does little limiting of results to user permissions, it is expected that the webui/api will handle user/org permission filtering. Typically this is handled automatically in the webui/api code when we "flesh" out a DTO from the list of ids.

Further Questions

Send an email to jmatthews AT redhat DOT com