On large AEM installations, reindexing an existing index can be very slow. Most of the time goes to text extraction from binaries such as PDF files, MS Office documents, images, and movie files. To speed up reindexing, the steps below pre-extract the text from the old copy of the index.
Examples of fulltext-enabled Lucene indexes in AEM: /oak:index/ntBaseLucene, /oak:index/damAssetLucene, and /oak:index/lucene.
Disabling text extraction on some or all binary files not only greatly speeds up reindexing but also reduces the overall size of the index. However, there is a trade-off: the contents of any file type excluded from text extraction are not searchable. For example, if you exclude "application/pdf" (PDF files), you can't search on words contained in those PDFs.
Note that if you completely disable all text extraction, step 2 below isn't necessary.
- (Only applicable if the pre-extraction in step 2 was done) Go to http://host/system/console/configMgr/org.apache.jackrabbit.oak.plugins.blob.datastore.DataStoreTextProviderService and set the Path configuration value to /mnt/preExtraction/store.
- Go to http://host/crx/de/index.jsp (enable CRXDE if not enabled) and log in as admin
- Browse to each of the index nodes that you want to reindex and set property reindex=true. Here's one example index: /oak:index/lucene
- Click Save All on the top left
- Monitor reindexing via the error.log file. See here for how to monitor indexing.
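The CRXDE steps above can also be scripted. The sketch below is a minimal example, assuming the Sling POST servlet is reachable on the instance and that the log lives under /mnt/crx/author/crx-quickstart/logs/error.log; the AEM_URL, AEM_CREDS and AEM_LOG variables and the helper function names are hypothetical and not part of AEM.

```shell
#!/bin/sh
# Assumed placeholders: adjust host, port, credentials and log path to your system.
AEM_URL="${AEM_URL:-http://localhost:4502}"
AEM_CREDS="${AEM_CREDS:-admin:admin}"
AEM_LOG="${AEM_LOG:-/mnt/crx/author/crx-quickstart/logs/error.log}"

# Build the URL of an index node, e.g. /oak:index/lucene
reindex_url() {
  printf '%s%s\n' "$AEM_URL" "$1"
}

# Set reindex=true (as a Boolean property) on one index node via the Sling POST servlet.
trigger_reindex() {
  curl -sf -u "$AEM_CREDS" \
    -F "reindex=true" -F "reindex@TypeHint=Boolean" \
    "$(reindex_url "$1")"
}

# Follow the log and show only lines that mention reindexing.
watch_reindex() {
  tail -f "$AEM_LOG" | grep --line-buffered -i "reindex"
}
```

For example, `trigger_reindex /oak:index/lucene` followed by `watch_reindex` would correspond to the last three steps above. Nothing runs until one of the functions is called.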
A. Create the checkpoint using the JMX console (if AEM is running)
- Go to this URL on the host: http://host/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3DSegment+node+store+checkpoint+management%2Ctype%3DCheckpointManager
- Click "createCheckpoint(long p1)"
- Enter 864000000 and click "Invoke"
- Copy the checkpoint id to a text file
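The value 864000000 is the checkpoint lifetime. Assuming the argument is in milliseconds, as Oak's checkpoint management MBean describes it, this corresponds to 10 days. The following sketch shows how to derive the value for a different lifetime; the variable names are illustrative only.

```shell
#!/bin/sh
# Checkpoint lifetime in milliseconds: days * hours * minutes * seconds * 1000
days=10
lifetime_ms=$((days * 24 * 60 * 60 * 1000))
echo "$lifetime_ms"   # prints 864000000
```

Choose a lifetime that comfortably exceeds the expected duration of the reindexing run, so the checkpoint does not expire before the run finishes.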
B. OR, stop AEM and create a checkpoint using oak-run console
- Run the command below to open an oak-run console shell
java -Xmx2g -jar oak-run-1.22.4.jar console /mnt/crx/author/crx-quickstart/repository/segmentstore
- When the oak-run console shell opens, run this command to create a checkpoint
checkpoint 864000000
- Copy the checkpoint id to a text file
- Enter command
:exit
or hit [Ctrl]+c to close the shell
- S3 DataStore systems (include S3 DS jars in the classpath):
nohup java -Xmx2g -classpath ./oak-run-1.22.4.jar:/mnt/preExtraction/jackson-core-2.9.5.jar:/mnt/preExtraction/jackson-annotations-2.9.5.jar:/mnt/preExtraction/jackson-databind-2.9.5.jar:/mnt/crx/author/crx-quickstart/install/15/aws-java-sdk-osgi-1.10.27.jar \
org.apache.jackrabbit.oak.run.Main index \
--reindex --read-write \
--pre-extracted-text-dir /mnt/preExtraction/store \
--index-paths=/oak:index/socialLucene,/oak:index/authorizables,/oak:index/commerceLucene,/oak:index/cqProjectLucene,/oak:index/cqPageLucene,/oak:index/damAssetLucene,/oak:index/ntBaseLucene,/oak:index/slingeventJob,/oak:index/workflowDataLucene,/oak:index/versionStoreIndex \
--checkpoint=890f552c-d7b5-459f-8097-8964b3905efd \
--s3ds=/mnt/crx/author/crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.SharedS3DataStore.config \
/mnt/crx/author/crx-quickstart/repository/segmentstore &
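- Azure DataStore systems (include the Azure DS jars in the classpath; the classpath in this example still lists the AWS SDK jar from the S3 command, so substitute the jars that match your installation):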
nohup java -Xmx2g -classpath ./oak-run-1.22.4.jar:/mnt/preExtraction/jackson-core-2.9.5.jar:/mnt/preExtraction/jackson-annotations-2.9.5.jar:/mnt/preExtraction/jackson-databind-2.9.5.jar:/mnt/crx/author/crx-quickstart/install/15/aws-java-sdk-osgi-1.10.27.jar \
org.apache.jackrabbit.oak.run.Main index \
--reindex --read-write \
--pre-extracted-text-dir /mnt/preExtraction/store \
--index-paths=/oak:index/socialLucene,/oak:index/authorizables,/oak:index/commerceLucene,/oak:index/cqProjectLucene,/oak:index/cqPageLucene,/oak:index/damAssetLucene,/oak:index/ntBaseLucene,/oak:index/slingeventJob,/oak:index/workflowDataLucene,/oak:index/versionStoreIndex \
--checkpoint=890f552c-d7b5-459f-8097-8964b3905efd \
--azureblobds=/mnt/crx/author/crx-quickstart/install/org.apache.jackrabbit.oak.plugins.blob.datastore.AzureDataStore.config \
/mnt/crx/author/crx-quickstart/repository/segmentstore &
- The offline indexing cycle should automatically import the new index to the source repository at the end. However, if it fails with an error or starts reindexing when you start AEM, then follow the steps documented here
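The --index-paths value in the commands above is a single comma-separated string, which is easy to mistype when the list is long. As a minimal sketch, a small helper (hypothetical, not part of oak-run) can assemble the argument from individual index paths:

```shell
#!/bin/sh
# Join all arguments with commas; the subshell keeps the IFS change local.
join_index_paths() {
  (IFS=,; printf '%s\n' "$*")
}

index_paths_arg="--index-paths=$(join_index_paths \
  /oak:index/damAssetLucene \
  /oak:index/ntBaseLucene)"

echo "$index_paths_arg"
```

The resulting string can then replace the literal --index-paths line in the oak-run command, with one path listed per line in the script.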