Tuning for large code bases

JVM

In general it is recommended to run both the indexer and the web application with -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/some/sensible/place/to/store/jvm/dumps so that the JVM produces a heap dump on an out-of-memory error. The dump can then be analyzed with tools such as jhat or Eclipse Memory Analyzer (http://www.eclipse.org/mat/).
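
For example, when running the indexer via the Python wrapper, these options can be passed through with -J (the dump path below is just a placeholder; use whatever location suits your system):

    $ opengrok-indexer \
         -J=-XX:+HeapDumpOnOutOfMemoryError \
         -J=-XX:HeapDumpPath=/var/tmp/opengrok_dumps \
         --jar opengrok.jar -- -s /source -d /data ...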

Indexer

If you run the indexer via the opengrok-indexer script, keep in mind that by default it does not set the Java heap size, so the JVM default is used. This might not be enough, especially for large projects such as AOSP or when indexing lots of mid-sized projects.
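
A minimal sketch of raising the indexer heap via the Python wrapper (the 4 GB value is only an illustration; size it to your data):

    $ opengrok-indexer -J=-Xmx4g ...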

Git merge commits

It is possible to disable the handling of merge commits in Git via the global or per-project configuration. If you have a repository with a rich history, this might help.
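
As a sketch, assuming the configuration property is named mergeCommitsEnabled (check the configuration reference of your OpenGrok version), the corresponding read-only configuration snippet might look like:

    <void property="mergeCommitsEnabled">
        <boolean>false</boolean>
    </void>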

Lucene flush buffer size

Lucene 4.x sets the following indexer defaults:

    DEFAULT_RAM_PER_THREAD_HARD_LIMIT_MB = 1945;
    DEFAULT_MAX_THREAD_STATES = 8;
    DEFAULT_RAM_BUFFER_SIZE_MB = 16.0;

  • With 8 thread states and a per-thread hard limit of 1945 MB, the indexing buffers could in theory grow to roughly 16 GB (8 × 1945 MB), though DEFAULT_RAM_BUFFER_SIZE_MB should not really allow it; keeping the buffer around 1-2 GB is advisable.

  • The Lucene RAM_BUFFER_SIZE_MB can now be tuned with the indexer parameter -m. For example, to run the indexer with an 8 GB heap on a 64-bit server JVM and tuned document flushing (assuming the indexer is run via the Python wrapper; otherwise pass the indexer options directly):

    $ opengrok-indexer -J=-Xmx8g -J=-server --jar opengrok.jar -- \
         -m 256 -s /source -d /data ...

On Solaris you might also want to use -J=-d64 to force a 64-bit JVM.

Open file and process hard and soft limits

The initial index creation is resource intensive, and the error java.io.IOException: error=24, Too many open files often appears in the logs. To avoid this, increase the ulimit value for open files.

A hard and soft open-file limit of 10240 has been found to work for mid-sized repositories, so the recommendation is to start with 10240.
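
For example, on Linux the limit can be raised in the shell that starts the indexer or the application server (make the change permanent via /etc/security/limits.conf or your service manager as appropriate):

    $ ulimit -n 10240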

If you get a similar error for threads, java.lang.OutOfMemoryError: unable to create new native thread, it might be due to strict security limits, and you need to increase those limits as well.
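
On Linux, native threads count toward the max user processes limit (RLIMIT_NPROC), so raising it may help (8192 here is just an illustrative value):

    $ ulimit -u 8192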

Web application

The heap size limit for the web application should be derived from the size of the data generated by the indexer and should also reflect the size of the WFST structures generated by the Suggester. The former creates memory pressure especially for multi-project searches. For precise tuning it might be prudent to estimate the memory footprint of a single all-project search (using a memory profiler), determine how many requests the web application can serve simultaneously, multiply the two values, and make sure the heap limit is bigger than the result.
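
For illustration only (the actual numbers depend entirely on your deployment): if a profiler shows a single all-project search peaking at about 200 MB and the application is expected to serve 20 such searches concurrently, the searches alone call for at least 20 × 200 MB = 4 GB of heap, before accounting for the Suggester data described below.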

For the Suggester data, it should be sufficient to compute the sum of the sizes of all *.wfst files under the data root and bump the heap limit by that value.
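
A quick way to get that sum, assuming GNU find and the data root from the earlier examples (/data):

    $ find /data -name '*.wfst' -printf '%s\n' \
         | awk '{ s += $1 } END { printf "%.0f MB\n", s / 1024 / 1024 }'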

Tomcat threads

The web application utilizes several thread pools, which are usually sized based on the number of online CPUs (cores) in the system. By default, Tomcat allows only about 200 threads for the basic Connector. The more CPUs (cores) the system has, the higher the chance this limit will be reached, so it might be necessary to raise it.

Also, when using the per-project workflow, there are usually many indexer processes running in parallel, each making several RESTful API calls. Combined, these can lead to many threads being created in the web application.

Configuration snippet example:

    <Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               maxThreads="1024" />

There is also the maxConnections attribute, which caps the number of connections the server will accept and process at any given time.
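
It can be set on the same Connector element (the value below is only an illustration):

    <Connector port="8080" protocol="HTTP/1.1"
               maxThreads="1024"
               maxConnections="8192" />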

Tomcat heap

Tomcat by default supports only small deployments. For bigger ones you might need to increase its heap (assuming 64-bit Java); the same will most probably apply to other containers as well. For Tomcat you can easily get this done by creating $CATALINA_BASE/bin/setenv.sh:

    # cat $CATALINA_BASE/bin/setenv.sh
    JAVA_OPTS="$JAVA_OPTS -server"

    # OpenGrok memory boost to cover all-project searches
    # (7 MB * 247 projects + 300 MB for cache should be enough)
    # 64-bit Java allows for more, so let's use 8 GB to be on the safe side.
    # We might need to allow more for concurrent all-project searches.
    JAVA_OPTS="$JAVA_OPTS -Xmx8g"

    export JAVA_OPTS

Tomcat/Apache tuning for HTTP headers

For Tomcat you might also hit the limit on HTTP header size (OpenGrok uses a header to send the project list when requesting search results).

Increase (or add) maxHttpHeaderSize in conf/server.xml, for example:

  <Connector port="8888" protocol="HTTP/1.1"
             connectionTimeout="20000"
             maxHttpHeaderSize="65536"
             redirectPort="8443" />

Refer to the documentation of other containers for more info on how to achieve the same.

Failure to do so will result in HTTP 400 errors after the first query, with the error "Error parsing HTTP request header".

The same tuning for Apache (handy in case you are running Apache as a reverse proxy in front of Tomcat) can be done with the LimitRequestLine and LimitRequestFieldSize directives:

    LimitRequestLine 65536
    LimitRequestFieldSize 65536

Multi-project search speed tip

If multi-project search is performed frequently, it might be good to warm up the file system cache after each reindex. This can be done e.g. with https://github.com/hoytech/vmtouch
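
For instance, to load the index data into the page cache after a reindex (the path assumes the index lives under the data root from the earlier examples; -t touches the pages into memory, -v prints progress):

    $ vmtouch -vt /data/index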

Suggester

It is recommended to store the Suggester data on SSD/flash storage. This benefits both the suggester rebuild operation (which happens during reindex and also periodically) and web application startup (which performs suggester initialization).
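
One way to do this, assuming the Suggester keeps its data in a suggester directory under the data root (verify the location in your deployment), is to relocate it to flash storage and leave a symlink behind:

    $ mv /data/suggester /ssd/opengrok-suggester
    $ ln -s /ssd/opengrok-suggester /data/suggester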