Added a 1 TB corpus for the big5 workload. (opensearch-project#278)
gkamat authored Apr 23, 2024
1 parent 1adbc68 commit 8a1ef8d
Showing 2 changed files with 33 additions and 0 deletions.
19 changes: 19 additions & 0 deletions big5/README.md
@@ -45,6 +45,7 @@ This workload allows the following parameters to be specified using `--workload-
* `bulk_indexing_clients` (default: 8): Number of clients that issue bulk indexing requests.
* `bulk_size` (default: 5000): The number of documents in each bulk during indexing.
* `cluster_health` (default: "green"): The minimum required cluster health.
* `corpus_size` (default: "100"): The size of the data corpus to use, in GiB. The currently provided sizes are 60, 100, and 1000. Note that there are [certain considerations when using the 1 TB data corpus](#considerations-when-using-the-1-tb-data-corpus); an example invocation is sketched after this parameter list.
* `document_compressed_size_in_bytes`: If specifying an alternate data corpus, the compressed size of the corpus.
* `document_count`: If specifying an alternate data corpus, the number of documents in that corpus.
* `document_file`: If specifying an alternate data corpus, the file name of the corpus.
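
For illustration, here is a sketch of how the `corpus_size` parameter might be passed on the command line. The target host is a placeholder and the parameter values are only examples; check the flags against your installed `opensearch-benchmark` version.

```shell
# Hypothetical invocation: run the big5 workload against an existing cluster,
# selecting the 1 TB corpus via the corpus_size workload parameter.
# Replace <cluster-endpoint> with your target host.
opensearch-benchmark execute-test \
  --workload=big5 \
  --pipeline=benchmark-only \
  --target-hosts=<cluster-endpoint>:9200 \
  --workload-params="corpus_size:1000,bulk_indexing_clients:8"
```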
@@ -179,6 +180,24 @@ Running range-auto-date-histo-with-metrics [
------------------------------------------------------
```

### Considerations when Using the 1 TB Data Corpus

*Caveat*: This corpus is currently being made available as an alpha-test feature. Keep the following points in mind when carrying out performance runs with it:

* Use a load generation host with sufficient disk space to hold the corpus.
* Ensure the target cluster has adequate storage and at least 3 data nodes.
* Specify an appropriate shard count and number of replicas so that shards are evenly distributed and appropriately sized (a quick way to check shard distribution is sketched at the end of this section).
* Running the workload requires an instance type with at least 8 cores and 32 GB memory.
* Install the `pbzip2` decompressor to speed up decompression of the corpus (see the sketch after this list).
* Allow sufficient time for the workload to run. _Approximate_ times for the various steps involved, using an 8-core load generation host:
- 15 minutes to download the corpus
- 4 hours to decompress the corpus (assuming `pbzip2` is available) and pre-process it
- 4 hours to index the data
- 30 minutes for the force-merge
- 8 hours to run the set of included queries
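
The `pbzip2` step above can be prepared as follows; this is a rough sketch and the package names are assumed (the workload still runs without `pbzip2`, just with slower single-threaded decompression):

```shell
# Install pbzip2 (parallel bzip2); the package name is assumed to be
# "pbzip2" on both Debian/Ubuntu and RHEL-family distributions.
sudo apt-get install -y pbzip2      # Debian/Ubuntu
# sudo yum install -y pbzip2        # RHEL/Amazon Linux (may need EPEL)

# Manual decompression, if you want to pre-decompress the corpus yourself:
# -d decompresses, -p sets the number of processor cores to use.
pbzip2 -d -p8 documents-1000.json.bz2
```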

More details will be added in due course.
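
One quick way to confirm that shards are evenly distributed and reasonably sized after indexing is the `_cat/shards` API. The index name `big5` and the endpoint below are assumptions; adjust them to match your cluster.

```shell
# List every shard of the big5 index with its node assignment and on-disk
# size, so uneven placement or oversized shards are easy to spot.
curl -s "http://<cluster-endpoint>:9200/_cat/shards/big5?v&h=index,shard,prirep,node,store"
```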

### License

Please see the included LICENSE.txt file for details about the license applicable to this workload and its associated artifacts.
14 changes: 14 additions & 0 deletions big5/workload.json
@@ -28,6 +28,20 @@
"compressed-bytes": 6023614688,
"uncompressed-bytes": 107321418111
}
{% elif corpus_size == 1000 %}
{
"source-file": "documents-1000.json.bz2",
"document-count": 1020000000,
"compressed-bytes": 53220934846,
"uncompressed-bytes": 943679382267
}
{% elif corpus_size == "1000-full" %}
{
"source-file": "documents-1000-full.json.bz2",
"document-count": 1160800000,
"compressed-bytes": 60567183163,
"uncompressed-bytes": 1073936121222
}
{% elif corpus_size == 60 %}
{
"source-file": "documents-60.json.bz2",
