-
I have briefly reviewed the Quickwit code. Based on my understanding, Quickwit batches data received over HTTP or gRPC, transforms it into splits, and uploads them to OSS (Object Storage Service). Once a split reaches the "Published" state, it can be queried by a Searcher directly from object storage. Given this design, I don't fully understand why Quickwit doesn't support deploying multiple Indexer nodes to serve HTTP/gRPC write requests. The official documentation states that multiple Indexer nodes are supported only for Kafka-based ingestion. I ran a simple local test with multiple Indexer nodes writing data via HTTP and haven't encountered any issues. Could you explain why Quickwit does not support multiple Indexer nodes for writing outside of Kafka-based scenarios?
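For reference, here is a minimal sketch of the kind of HTTP ingest request used in that local test, assuming a hypothetical index id `my-logs` and an indexer listening on the default REST port 7280; the path follows Quickwit's ingest API, with one NDJSON document per line in the body:

```bash
# Send one NDJSON document to a single indexer's ingest endpoint.
# "indexer-1" and the index id "my-logs" are placeholders.
curl -XPOST "http://indexer-1:7280/api/v1/my-logs/ingest" \
  --data-binary '{"timestamp": 1690000000, "severity": "INFO", "message": "hello from indexer-1"}'
```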
-
Hi @geek-frio
Your analysis is correct. You can indeed send HTTP requests to several indexers, but you need to distribute the requests yourself (note that you have to use a PostgreSQL metastore in this case; the file/S3-backed metastore does not support concurrent writes).
One thing that could also bite you with the HTTP API is that, if an indexer's disk crashes, you will lose data (by default around 1 minute of data, the time it takes to upload the data to S3).
The Kafka source has the advantage of exactly-once semantics, and scaling is also easier: new indexers are simply assigned partitions and start indexing them after the assignment.
We are currently working on the distributed ingest API. This means you will not have to take care of distributing requests to indexers yourself; we will also make the ingest API durable so that you can't lose data if a disk crashes.
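A minimal sketch of such a multi-indexer setup, assuming hypothetical hostnames, credentials, bucket, and topic names (check the configuration reference for your Quickwit release, as field names and command flags can change between versions): every indexer points at the same PostgreSQL metastore, and the Kafka source is declared once per index.

```bash
# Node config shared by all indexers: a PostgreSQL metastore plus an S3 index root.
# All values below (hosts, credentials, bucket) are placeholders; the "version"
# field should match your Quickwit release.
cat > quickwit.yaml <<'EOF'
version: 0.4
metastore_uri: postgres://quickwit:password@pg-host:5432/quickwit-metastore
default_index_root_uri: s3://my-bucket/quickwit-indexes
EOF

# Kafka source config; topic and brokers are placeholders.
cat > kafka-source.yaml <<'EOF'
version: 0.4
source_id: my-kafka-source
source_type: kafka
params:
  topic: my-topic
  client_params:
    bootstrap.servers: kafka-1:9092,kafka-2:9092
EOF

# Attach the source to a (hypothetical) index.
quickwit source create --index my-logs --source-config kafka-source.yaml
```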
-
@geek-frio I saw your trace-db work related to Apache SkyWalking. I think Quickwit could be a pretty good fit (perf example on the GitHub archive dataset).
-
Yes, I have been using Quickwit 0.3 as the database for my SkyWalking tracing DB since I found Quickwit last year. I am now using Quickwit v0.4 to explore replacing our existing Alibaba Cloud Logstore service in order to reduce costs, with Kafka as the source for Quickwit. However, recently some Quickwit Indexer nodes periodically drop to 0 CPU and stop consuming. I am investigating the cause of this issue, which is why I am considering whether using multiple indexers through HTTP ingest is feasible.
-
I understand your explanation: theoretically, I can use Quickwit's multiple-Indexer mode as long as I handle load balancing and disk-crash issues properly. From my perspective, disk crashes are rare because existing cloud services often have RAID backups for their disks. Additionally, given the lower data-safety requirements for tracing and log data, a small amount of data loss would be acceptable even if such an event occurred. What can be confirmed is that Quickwit's storage cost offers significant savings compared to our existing cloud log service or ES storage.
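Since the load balancing is left to the user here, a minimal sketch of one way to distribute HTTP ingest requests, assuming an nginx reverse proxy in front of two hypothetical indexer hosts (hostnames, ports, and paths are placeholders):

```bash
# Round-robin ingest requests across two indexers via nginx (the default
# behavior of an upstream block). All names below are placeholders.
cat > /etc/nginx/conf.d/quickwit-ingest.conf <<'EOF'
upstream quickwit_indexers {
    server indexer-1:7280;
    server indexer-2:7280;
}
server {
    listen 8080;
    location /api/v1/ {
        proxy_pass http://quickwit_indexers;
    }
}
EOF
nginx -s reload
```

Any HTTP-aware load balancer would work the same way; the only requirement is that every indexer shares the PostgreSQL metastore mentioned above.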
-
This is not normal at all. We fixed one or two issues on this recently. Can you share some logs on this?
Well, it's hard to give an exact number without knowing more about the workload. For log use cases, we would expect a 10x reduction, but it could be less or more...
-
@fmassot Unfortunately, I didn't save the error logs from the previous occurrence; I will find a log snippet when I can. Based on my recollection of troubleshooting the issue last time, I noticed some interesting patterns: the problem occurs when the GC collection task is triggered, after which no more logs are printed for new splits, and some splits with new split IDs are not included in the subsequent pipeline processing steps. I have two initial suspicions:
1. Some operation following the new-split actor is causing the actor's runtime to freeze, resulting in no further consumption (unlikely to be a panic, because I haven't seen any panic logs).
2. When multiple Indexers perform GC operations simultaneously on the PostgreSQL meta database, there might be some issue causing a freeze.
You mentioned earlier that you recently fixed several related issues. Could you provide me with the corresponding issue links?
-
@geek-frio I have these two issues in my mind:
Finally, we optimized the PostgreSQL metastore queries in 0.4 and 0.5: some queries were locking rows, which increased latencies and potentially led to pipeline failures.
Since 0.4, GC is run only on the node that runs the Janitor service. Overall, I would say we fixed several such issues, mostly in the 0.4 version. We still have this issue open, but I don't think you were impacted by it. If you rerun Quickwit, don't hesitate to share your logs/metrics here or on Discord; I would be happy to help you debug and choose the right Quickwit setup.
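As a hedged sketch of what that split looks like operationally (assuming the `--service` flag available in 0.4+ releases; the config file is the one each node already uses):

```bash
# On every ingest node: run only the indexer service.
quickwit run --config quickwit.yaml --service indexer

# On exactly one node: run the Janitor service, which performs GC.
# Searchers and other services can be colocated or run elsewhere as needed.
quickwit run --config quickwit.yaml --service janitor
```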
-
@geek-frio I converted the issue into a GitHub discussion as it's more appropriate and could be useful for other readers. |