From fd51d9ad43d33c4a9df27f4ad13c1d285232e1b9 Mon Sep 17 00:00:00 2001
From: "opensearch-trigger-bot[bot]" <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
Date: Wed, 10 Jul 2024 10:01:12 -0700
Subject: [PATCH] Updated README to recommend use of an external data store with large corpora. (#336) (#337)

(cherry picked from commit 4ea81b9716214548d8ab5928de6bd5f16aed65aa)

Signed-off-by: Govind Kamat
Signed-off-by: github-actions[bot]
Co-authored-by: github-actions[bot]
---
 big5/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/big5/README.md b/big5/README.md
index a35e9d14..a755c7d1 100755
--- a/big5/README.md
+++ b/big5/README.md
@@ -182,14 +182,16 @@ Running range-auto-date-histo-with-metrics [
 
 ### Considerations when Using the 1 TB Data Corpus
 
-*Caveat*: This corpus is being made available as a feature that is currently being alpha tested. Some points to note when carrying out performance runs using this corpus:
+*Caveat*: This corpus is being made available as a feature that is currently in beta test. Some points to note when carrying out performance runs using this corpus:
 
 * Due to CloudFront download size limits, the uncompressed size of the 1 TB corpus is actually 0.95 TB (~0.9 TiB). This [issue has been noted](https://github.com/opensearch-project/opensearch-benchmark/issues/543) and will be resolved in due course.
+* Use an external data store to record metrics. Using the in-memory store will likely result in the system running out of memory and becoming unresponsive, resulting in inaccurate performance numbers.
 * Use a load generation host with sufficient disk space to hold the corpus.
 * Ensure the target cluster has adequate storage and at least 3 data nodes.
 * Specify an appropriate shard count and number of replicas so that shards are evenly distributed and appropriately sized.
 * Running the workload requires an instance type with at least 8 cores and 32 GB memory.
 * Install the `pbzip2` decompressor to speed up decompression of the corpus.
+* Set the client timeout to a sufficiently large value, since some queries take a long time to complete.
 * Allow sufficient time for the workload to run. _Approximate_ times for the various steps involved, using an 8-core loadgen host:
   - 15 minutes to download the corpus
   - 4 hours to decompress the corpus (assuming `pbzip2` is available) and pre-process it