-
I have thought a bit more about it and wrote some stuff below in preparation for the meeting 🙂

## Work to be done on milli

We should add profiling instrumentation to milli so that we can measure the time taken by each indexing phase, for each task. There should be a setting to save this profiling information to a `.csv` file which can be easily plotted and analysed.

## Datasets

I propose using at least two datasets to start with.

### 1. Scidocs

https://github.com/beir-cellar/beir/wiki/Datasets-available

Sample document:

```json
{
  "_id": "632589828c8b9fca2c3a59e97451fde8fa7d188d",
  "title": "A hybrid of genetic algorithm and particle swarm optimization for recurrent network design",
  "text": "An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded as a swarm, and each elite corresponds to a particle within it. In this regard, the elites are enhanced by PSO, an operation which mimics the maturing phenomenon in nature. These enhanced elites constitute half of the population in the new generation, whereas the other half is generated by performing crossover and mutation operation on these enhanced elites. HGAPSO is applied to recurrent neural/fuzzy network design as follows. For recurrent neural network, a fully connected recurrent neural network is designed and applied to a temporal sequence production problem. For recurrent fuzzy network design, a Takagi-Sugeno-Kang-type recurrent fuzzy network is designed and applied to dynamic plant control. The performance of HGAPSO is compared to both GA and PSO in these recurrent networks design problems, demonstrating its superiority.",
  "metadata": {
    "authors": ["1725986"],
    "year": 2004,
    "cited_by": ["93e1026dd5244e45f6f9ec9e35e9de327b48e4b0", "870cb11115c8679c7e34f4f2ed5f469badedee37", "etc (very many)", "21e58c2114c2e33d7792881f95dd73ed4532e916"],
    "references": ["57fdc130c1b1c3dd1fd11845fe86c60e2d3b7193", "etc (a few)"]
  }
}
```

This dataset has many things that we want to test:

By disk space, it is medium-sized (257MB), but by number of documents, it is very small (25K).

### 2. Tweets (twitter_cikm_2010)

https://archive.org/details/twitter_cikm_2010

A dataset of timestamped tweets from geolocated users.

Sample tweets:

Sample geolocation:
It is medium-sized both by disk space (~600MB) and by number of documents (~5M).

### 3. E-commerce?

An e-commerce dataset should have many filterable fields. Some fields, such as stock or price, should be modified regularly.

## Design of the test

A fixed number of parallel Meilisearch clients send requests to a single Meilisearch instance. Most of the requests are search queries, sometimes with filters. Some of the requests add or remove documents. To determine the documents to add, we use the two datasets described above. We start from an empty database and add the documents one by one, in an arbitrary order.

Each operation (search/add/delete) is scheduled in advance at a precise timestamp. For example:

```json
{
  "type": "search",
  "query": "Sal",
  "filter": "author = Stephen King",
  "time": "13:24:00.000",
  "client": 2
}
{
  "type": "search",
  "query": "Sale",
  "filter": "author = Stephen King",
  "time": "13:24:00.150",
  "client": 2
}
{
  "type": "add",
  "document": { ... },
  "time": "13:24:00.150",
  "client": 17
}
```

The clients do not wait for the responses before sending a new request. They measure the time it took to receive each response and save it somewhere. The time-to-be-searchable is not measured by the clients but by the Meilisearch instance. The Meilisearch instance could also log the time it takes to process the search queries, but that time wouldn't include the time taken by the web server. The number of clients could be as high as 2000, all sending simultaneous requests.

## Creating the operations

The operations are created from the dataset.

Adding documents:
Deleting documents:
Deleting documents (alternative):
Search:
Timestamps:
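The schedule of operations described above can be generated mechanically from a dataset. Below is a minimal Python sketch, where all function names, intervals, and the round-robin client assignment are illustrative assumptions: searches simulate a user typing a query prefix by prefix every 150 ms (as in the `"Sal"`/`"Sale"` example), and document additions are spread over the clients.

```python
import json
from datetime import datetime, timedelta

# Reference start time for the schedule, matching the example timestamps
START = datetime.strptime("13:24:00.000", "%H:%M:%S.%f")

def fmt(t):
    # Format as "13:24:00.000" (milliseconds), as in the examples
    return t.strftime("%H:%M:%S.%f")[:-3]

def typing_searches(query, client, start, step_ms=150, filter_=None):
    """One search operation per typed prefix ("S", "Sa", "Sal", ...)."""
    ops = []
    for i in range(1, len(query) + 1):
        op = {
            "type": "search",
            "query": query[:i],
            "time": fmt(start + timedelta(milliseconds=(i - 1) * step_ms)),
            "client": client,
        }
        if filter_:
            op["filter"] = filter_
        ops.append(op)
    return ops

def add_ops(documents, n_clients, start, step_ms=500):
    """Schedule one document addition every step_ms, round-robin over clients."""
    return [
        {
            "type": "add",
            "document": doc,
            "time": fmt(start + timedelta(milliseconds=i * step_ms)),
            "client": i % n_clients,
        }
        for i, doc in enumerate(documents)
    ]

if __name__ == "__main__":
    ops = typing_searches("Sale", client=2, start=START,
                          filter_="author = Stephen King")
    ops += add_ops([{"id": 1}, {"id": 2}], n_clients=20, start=START)
    ops.sort(key=lambda op: op["time"])
    print(json.dumps(ops, indent=2))
```

The same generator could then be extended with deletion operations following whichever of the two deletion strategies above we settle on.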
## Plots

We want to construct these plots:
## Saved Data

The plots can be drawn from this raw data:

- Indexing time
- Search time
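As an illustration of what the saved raw data could look like, here is a small Python mock-up of a phase timer that appends one CSV row per indexing phase per task. This is only a sketch of the instrumentation proposed above (milli itself is Rust); the names and the column layout are assumptions.

```python
import csv
import time
from contextlib import contextmanager

@contextmanager
def phase_timer(rows, task_id, phase):
    """Record the wall-clock duration of one indexing phase as a row."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        rows.append({"task": task_id, "phase": phase,
                     "duration_s": time.perf_counter() - t0})

def save_csv(rows, path):
    """Dump the collected rows to a .csv file for later plotting."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["task", "phase", "duration_s"])
        writer.writeheader()
        writer.writerows(rows)

# Example: time two phases of a fake indexing task.
rows = []
with phase_timer(rows, task_id=1, phase="tokenize"):
    time.sleep(0.01)
with phase_timer(rows, task_id=1, phase="write_index"):
    time.sleep(0.01)
save_csv(rows, "indexing_profile.csv")
```

One row per (task, phase) pair keeps the file trivial to load and plot with any tool.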
-
## 11/08/2022 - Step 1: Determine the metrics to benchmark

### Indexing

Main metrics:
team-core metrics:
### Search

Main metrics:
team-core metrics:
TBD @ManyTheFish

### Tools
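As a tooling sketch, the headline search metrics could be derived from the saved raw latencies like this. The choice of metrics (median, p99, max) is an assumption, and the function name is illustrative.

```python
from statistics import median, quantiles

def latency_summary(latencies_ms):
    """Summarise raw per-request search latencies into headline metrics."""
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile
    qs = quantiles(latencies_ms, n=100)
    return {
        "median_ms": median(latencies_ms),
        "p99_ms": qs[98],
        "max_ms": max(latencies_ms),
    }
```

Running it over the per-client CSV files would give one summary per benchmark run, which is enough to spot regressions between two Meilisearch versions.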
-
## Discussion goals

### Why do we want to benchmark use cases?

### First use case to benchmark
As you can see here, we have identified many use cases that we would like to test. Since we have never implemented this benchmark, we must choose the first use case now and define a technical stack that allows us to host the others later, so that we can focus our efforts efficiently.
### Why does it make sense to choose the SaaS use case to start?
For several reasons, it seems wise to choose the SaaS use case; here they are.
#### Unpredictable indexing throughput
Improving indexing speed is currently one of our main concerns. It is the feedback we hear most often, and we have seen users move on to other solutions because of it.
Interestingly, indexing problems are mostly mentioned when updates or additions are made to a large database and must become searchable as soon as possible.
In the case of SaaS, document additions are mostly emitted by end-user actions and are difficult to predict or delay at a specific point in time.
For example, let's imagine the case of a CRM product; if I modify information for a contact, I expect this change to be visible as soon as possible for the sake of my business operations.
Choosing this use case will allow us to measure changes more effectively in this aspect.
#### A competitive target
We see that Algolia has focused mainly on the e-commerce use case. Focusing on the SaaS use case first would allow us to capture market share by offering value to teams that don't necessarily have a subject-matter expert to implement search or the financial budget to use Algolia, and that expect to scale their business.
The analysis of the cloud assistant pricing results shows that the SaaS use case is rather prominent.
Private source: https://analytics.amplitude.com/meili/chart/he0b3z0?source=slack&source+detail=link
## Next Steps

## Next Meeting