-
I have thought a bit more about it and wrote some stuff below in preparation for the meeting 🙂

## Work to be done on milli

We should add profiling instrumentation to milli so that we can measure the time taken by each indexing phase, for each task. There should be a setting to save this profiling information to a `.csv` file which can be easily plotted and analysed.

## Datasets

I propose using at least two datasets to start with.

### 1. Scidocs

https://github.com/beir-cellar/beir/wiki/Datasets-available

Sample document:

```json
{
  "_id": "632589828c8b9fca2c3a59e97451fde8fa7d188d",
  "title": "A hybrid of genetic algorithm and particle swarm optimization for recurrent network design",
  "text": "An evolutionary recurrent network which automates the design of recurrent neural/fuzzy networks using a new evolutionary learning algorithm is proposed in this paper. This new evolutionary learning algorithm is based on a hybrid of genetic algorithm (GA) and particle swarm optimization (PSO), and is thus called HGAPSO. In HGAPSO, individuals in a new generation are created, not only by crossover and mutation operation as in GA, but also by PSO. The concept of elite strategy is adopted in HGAPSO, where the upper-half of the best-performing individuals in a population are regarded as elites. However, instead of being reproduced directly to the next generation, these elites are first enhanced. The group constituted by the elites is regarded as a swarm, and each elite corresponds to a particle within it. In this regard, the elites are enhanced by PSO, an operation which mimics the maturing phenomenon in nature. These enhanced elites constitute half of the population in the new generation, whereas the other half is generated by performing crossover and mutation operation on these enhanced elites. HGAPSO is applied to recurrent neural/fuzzy network design as follows. For recurrent neural network, a fully connected recurrent neural network is designed and applied to a temporal sequence production problem. For recurrent fuzzy network design, a Takagi-Sugeno-Kang-type recurrent fuzzy network is designed and applied to dynamic plant control. The performance of HGAPSO is compared to both GA and PSO in these recurrent networks design problems, demonstrating its superiority.",
  "metadata": {
    "authors": ["1725986"],
    "year": 2004,
    "cited_by": ["93e1026dd5244e45f6f9ec9e35e9de327b48e4b0", "870cb11115c8679c7e34f4f2ed5f469badedee37", "etc (very many)", "21e58c2114c2e33d7792881f95dd73ed4532e916"],
    "references": ["57fdc130c1b1c3dd1fd11845fe86c60e2d3b7193", "etc (a few)"]
  }
}
```

This dataset has many things that we want to test:

By disk space, it is medium-sized (257MB), but by number of documents, it is very small (25K).

### 2. Tweets (twitter_cikm_2010)

https://archive.org/details/twitter_cikm_2010

A dataset of timestamped tweets from geolocated users.

Sample tweets:

Sample geolocation:
It is medium-sized both by disk space (~600MB) and by number of documents (~5M).

### 3. E-commerce?

An e-commerce dataset should have many filterable fields. Some fields, such as stock or price, should be modified regularly.

## Design of the test

A fixed number of parallel Meilisearch clients send requests to a single Meilisearch instance. Most of the requests are search queries, sometimes with filters. Some of the requests add or remove documents. To determine the documents to add, we use the two datasets described above. We start from an empty database and add the documents one by one, in an arbitrary order.

Each operation (search/add/delete) is scheduled in advance at a precise timestamp. For example:

```json
{
  "type": "search",
  "query": "Sal",
  "filter": "author = Stephen King",
  "time": "13:24:00.000",
  "client": 2
}
{
  "type": "search",
  "query": "Sale",
  "filter": "author = Stephen King",
  "time": "13:24:00.150",
  "client": 2
}
{
  "type": "add",
  "document": { ... },
  "time": "13:24:00.150",
  "client": 17
}
```

The clients do not wait for the responses before sending a new request. They measure the time it took to receive each response and save it somewhere. The time-to-be-searchable is not measured by the clients but by the Meilisearch instance. The Meilisearch instance could also log the time it takes to process the search queries, but that time wouldn't include the time taken by the web server. The number of clients could be as high as 2000, all sending simultaneous requests.

## Creating the operations

The operations are created from the dataset.

Adding documents:
Deleting documents:
Deleting documents (alternative):
Search:
Timestamps:
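The schedule of operations described above can be generated mechanically from a dataset. Below is a minimal Python sketch, where all function names, intervals, and the round-robin client assignment are illustrative assumptions: searches simulate a user typing a query prefix by prefix every 150 ms (as in the `"Sal"`/`"Sale"` example), and document additions are spread over the clients.

```python
import json
from datetime import datetime, timedelta

# Reference start time for the schedule, matching the example timestamps
START = datetime.strptime("13:24:00.000", "%H:%M:%S.%f")

def fmt(t):
    # Format as "13:24:00.000" (milliseconds), as in the examples
    return t.strftime("%H:%M:%S.%f")[:-3]

def typing_searches(query, client, start, step_ms=150, filter_=None):
    """One search operation per typed prefix ("S", "Sa", "Sal", ...)."""
    ops = []
    for i in range(1, len(query) + 1):
        op = {
            "type": "search",
            "query": query[:i],
            "time": fmt(start + timedelta(milliseconds=(i - 1) * step_ms)),
            "client": client,
        }
        if filter_:
            op["filter"] = filter_
        ops.append(op)
    return ops

def add_ops(documents, n_clients, start, step_ms=500):
    """Schedule one document addition every step_ms, round-robin over clients."""
    return [
        {
            "type": "add",
            "document": doc,
            "time": fmt(start + timedelta(milliseconds=i * step_ms)),
            "client": i % n_clients,
        }
        for i, doc in enumerate(documents)
    ]

if __name__ == "__main__":
    ops = typing_searches("Sale", client=2, start=START,
                          filter_="author = Stephen King")
    ops += add_ops([{"id": 1}, {"id": 2}], n_clients=20, start=START)
    ops.sort(key=lambda op: op["time"])
    print(json.dumps(ops, indent=2))
```

The same generator could then be extended with deletion operations following whichever of the two deletion strategies above we settle on.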
## Plots

We want to construct these plots:
## Saved Data

The plots can be drawn from this raw data:

- Indexing time
- Search time
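As an illustration of what the saved raw data could look like, here is a small Python mock-up of a phase timer that appends one CSV row per indexing phase per task. This is only a sketch of the instrumentation proposed above (milli itself is Rust); the names and the column layout are assumptions.

```python
import csv
import time
from contextlib import contextmanager

@contextmanager
def phase_timer(rows, task_id, phase):
    """Record the wall-clock duration of one indexing phase as a row."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        rows.append({"task": task_id, "phase": phase,
                     "duration_s": time.perf_counter() - t0})

def save_csv(rows, path):
    """Dump the collected rows to a .csv file for later plotting."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["task", "phase", "duration_s"])
        writer.writeheader()
        writer.writerows(rows)

# Example: time two phases of a fake indexing task.
rows = []
with phase_timer(rows, task_id=1, phase="tokenize"):
    time.sleep(0.01)
with phase_timer(rows, task_id=1, phase="write_index"):
    time.sleep(0.01)
save_csv(rows, "indexing_profile.csv")
```

One row per (task, phase) pair keeps the file trivial to load and plot with any tool.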
-
## 11/08/2022 - Step 1: Determine the metrics to benchmark

### Indexing

Main metrics:
team-core metrics:
### Search

Main metrics:
team-core metrics:
TBD @ManyTheFish

### Tools
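As a tooling sketch, the headline search metrics could be derived from the saved raw latencies like this. The choice of metrics (median, p99, max) is an assumption, and the function name is illustrative.

```python
from statistics import median, quantiles

def latency_summary(latencies_ms):
    """Summarise raw per-request search latencies into headline metrics."""
    # quantiles(..., n=100) returns 99 cut points; index 98 is the 99th percentile
    qs = quantiles(latencies_ms, n=100)
    return {
        "median_ms": median(latencies_ms),
        "p99_ms": qs[98],
        "max_ms": max(latencies_ms),
    }
```

Running it over the per-client CSV files would give one summary per benchmark run, which is enough to spot regressions between two Meilisearch versions.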
-
## Discussion goals

### Why do we want to benchmark use cases?

### First use case to benchmark
As you can see here, we have identified many use cases that we would like to test. Since we have never implemented this benchmark, we must choose the first use case now and define a technical stack that allows us to host the others later, so that we can focus our efforts efficiently.
### Why does it make sense to choose the SaaS use case to start?
For several reasons, it seems wise to choose the SaaS use case; here they are.
#### Unpredictable indexing throughput
Improving indexing speed is currently one of our main concerns. It is the feedback we hear most often, and we have seen users move on to other solutions because of it.
Interestingly, indexing problems are mostly mentioned when updates or additions are made to a large database and must become searchable as soon as possible.
In the case of SaaS, document additions are mostly emitted by end-user actions and are difficult to predict or delay at a specific point in time.
For example, let's imagine the case of a CRM product; if I modify information for a contact, I expect this change to be visible as soon as possible for the sake of my business operations.
Choosing this use case will allow us to measure changes more effectively in this aspect.
#### A competitive target
We see that Algolia has focused mainly on the e-commerce use case. Focusing on the SaaS use case first would allow us to capture market share by offering value to teams that don't necessarily have a subject-matter expert to implement search or the financial budget to use Algolia, and that expect to scale their business.
The analysis of the cloud assistant pricing results shows that the SaaS use case is rather prominent.
Private source: https://analytics.amplitude.com/meili/chart/he0b3z0?source=slack&source+detail=link
## Next Steps

## Next Meeting