Added a 1 TB corpus for the big5 workload. (opensearch-project#278)
gkamat authored Apr 23, 2024
1 parent 1adbc68 commit 8a1ef8d
Showing 2 changed files with 33 additions and 0 deletions.
19 changes: 19 additions & 0 deletions big5/README.md
@@ -45,6 +45,7 @@ This workload allows the following parameters to be specified using `--workload-
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `bulk_size` (default: 5000): The number of documents in each bulk during indexing.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `corpus_size` (default: "100"): The size of the data corpus to use, in GiB. The currently provided sizes are 60, 100, and 1000. Note that there are [certain considerations when using the 1 TB data corpus](#considerations-when-using-the-1-tb-data-corpus); an example invocation is sketched after this parameter list.
* `document_compressed_size_in_bytes`: If specifying an alternate data corpus, the compressed size of the corpus.
* `document_count`: If specifying an alternate data corpus, the number of documents in that corpus.
* `document_file`: If specifying an alternate data corpus, the file name of the corpus.
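
For illustration, here is a sketch of how the `corpus_size` parameter might be passed on the command line. The target host is a placeholder and the parameter values are only examples; check the flags against your installed `opensearch-benchmark` version.

```shell
# Hypothetical invocation: run the big5 workload against an existing cluster,
# selecting the 1 TB corpus via the corpus_size workload parameter.
# Replace <cluster-endpoint> with your target host.
opensearch-benchmark execute-test \
  --workload=big5 \
  --pipeline=benchmark-only \
  --target-hosts=<cluster-endpoint>:9200 \
  --workload-params="corpus_size:1000,bulk_indexing_clients:8"
```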
@@ -179,6 +180,24 @@ Running range-auto-date-histo-with-metrics [
------------------------------------------------------
```

### Considerations when Using the 1 TB Data Corpus

*Caveat*: This corpus is currently being made available as an alpha-test feature. Keep the following points in mind when carrying out performance runs with it:

* Use a load generation host with sufficient disk space to hold the corpus.
* Ensure the target cluster has adequate storage and at least 3 data nodes.
* Specify an appropriate shard count and number of replicas so that shards are evenly distributed and appropriately sized (a quick way to check shard distribution is sketched at the end of this section).
* Running the workload requires an instance type with at least 8 cores and 32 GB memory.
* Install the `pbzip2` decompressor to speed up decompression of the corpus (see the sketch after this list).
* Allow sufficient time for the workload to run. _Approximate_ times for the various steps involved, using an 8-core load generation host:
- 15 minutes to download the corpus
- 4 hours to decompress the corpus (assuming `pbzip2` is available) and pre-process it
- 4 hours to index the data
- 30 minutes for the force-merge
- 8 hours to run the set of included queries
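
The `pbzip2` step above can be prepared as follows; this is a rough sketch and the package names are assumed (the workload still runs without `pbzip2`, just with slower single-threaded decompression):

```shell
# Install pbzip2 (parallel bzip2); the package name is assumed to be
# "pbzip2" on both Debian/Ubuntu and RHEL-family distributions.
sudo apt-get install -y pbzip2      # Debian/Ubuntu
# sudo yum install -y pbzip2        # RHEL/Amazon Linux (may need EPEL)

# Manual decompression, if you want to pre-decompress the corpus yourself:
# -d decompresses, -p sets the number of processor cores to use.
pbzip2 -d -p8 documents-1000.json.bz2
```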

More details will be added in due course.
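
One quick way to confirm that shards are evenly distributed and reasonably sized after indexing is the `_cat/shards` API. The index name `big5` and the endpoint below are assumptions; adjust them to match your cluster.

```shell
# List every shard of the big5 index with its node assignment and on-disk
# size, so uneven placement or oversized shards are easy to spot.
curl -s "http://<cluster-endpoint>:9200/_cat/shards/big5?v&h=index,shard,prirep,node,store"
```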

### License

Please see the included LICENSE.txt file for details about the license applicable to this workload and its associated artifacts.
14 changes: 14 additions & 0 deletions big5/workload.json
@@ -28,6 +28,20 @@
"compressed-bytes": 6023614688,
"uncompressed-bytes": 107321418111
}
{% elif corpus_size == 1000 %}
{
"source-file": "documents-1000.json.bz2",
"document-count": 1020000000,
"compressed-bytes": 53220934846,
"uncompressed-bytes": 943679382267
}
{% elif corpus_size == "1000-full" %}
{
"source-file": "documents-1000-full.json.bz2",
"document-count": 1160800000,
"compressed-bytes": 60567183163,
"uncompressed-bytes": 1073936121222
}
{% elif corpus_size == 60 %}
{
"source-file": "documents-60.json.bz2",
