-
I have briefly reviewed the Quickwit code. Based on my understanding, Quickwit batches data received over HTTP or gRPC, transforms it into splits, and uploads them to OSS (Object Storage Service). Once a split reaches the "Published" state, it can be queried by a Searcher directly from object storage. Given this design, I don't fully understand why Quickwit doesn't support deploying multiple Indexer nodes to serve HTTP/gRPC write requests. The official documentation states that multiple Indexer nodes are supported only for Kafka-based ingestion. I ran a simple local test with multiple Indexer nodes writing data via HTTP and haven't encountered any issues. Could you explain why Quickwit does not support multiple Indexer nodes for writing outside of Kafka-based scenarios?
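For reference, here is a minimal sketch of the kind of HTTP ingest request used in that local test, assuming a hypothetical index id `my-logs` and an indexer listening on the default REST port 7280; the path follows Quickwit's ingest API, with one NDJSON document per line in the body:

```bash
# Send one NDJSON document to a single indexer's ingest endpoint.
# "indexer-1" and the index id "my-logs" are placeholders.
curl -XPOST "http://indexer-1:7280/api/v1/my-logs/ingest" \
  --data-binary '{"timestamp": 1690000000, "severity": "INFO", "message": "hello from indexer-1"}'
```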
-
Hi @geek-frio
Your analysis is correct. You can indeed send HTTP requests to several indexers, but you need to distribute the requests yourself (note that you have to use a PostgreSQL metastore in this case; the file/S3-backed metastore does not support concurrent writes).
One thing that could also bite you with the HTTP API is that, if an indexer's disk crashes, you will lose data (by default around 1 minute of data, the time it takes to upload the data to S3).
The Kafka source has the advantage of exactly-once semantics, and scaling is also easier: new indexers are simply assigned partitions and start indexing them after the assignment.
We are currently working on the distributed ingest API. This means you will not have to take care of distributing requests to indexers yourself; we will also make the ingest API durable so that you can't lose data if a disk crashes.
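A minimal sketch of such a multi-indexer setup, assuming hypothetical hostnames, credentials, bucket, and topic names (check the configuration reference for your Quickwit release, as field names and command flags can change between versions): every indexer points at the same PostgreSQL metastore, and the Kafka source is declared once per index.

```bash
# Node config shared by all indexers: a PostgreSQL metastore plus an S3 index root.
# All values below (hosts, credentials, bucket) are placeholders; the "version"
# field should match your Quickwit release.
cat > quickwit.yaml <<'EOF'
version: 0.4
metastore_uri: postgres://quickwit:password@pg-host:5432/quickwit-metastore
default_index_root_uri: s3://my-bucket/quickwit-indexes
EOF

# Kafka source config; topic and brokers are placeholders.
cat > kafka-source.yaml <<'EOF'
version: 0.4
source_id: my-kafka-source
source_type: kafka
params:
  topic: my-topic
  client_params:
    bootstrap.servers: kafka-1:9092,kafka-2:9092
EOF

# Attach the source to a (hypothetical) index.
quickwit source create --index my-logs --source-config kafka-source.yaml
```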
-
@geek-frio I saw your trace-db work related to Apache SkyWalking. I think Quickwit could be a pretty good fit (perf example on the GitHub archive dataset).
-
Yes, I have been using Quickwit 0.3 as the database for my SkyWalking tracing DB since I found Quickwit last year. I am now using Quickwit v0.4 to explore replacing our existing Alibaba Cloud Logstore service in order to reduce costs, with Kafka as the source for Quickwit. However, recently some Quickwit Indexer nodes periodically drop to 0 CPU and stop consuming. I am investigating the cause of this issue, which is why I am considering whether using multiple indexers through HTTP ingest is feasible.
-
I understand your explanation: theoretically, I can use Quickwit's multiple-Indexer mode as long as I handle load balancing and disk-crash issues properly. From my perspective, disk crashes are rare because existing cloud services often have RAID backups for their disks. Additionally, given the lower data-safety requirements for tracing and log data, a small amount of data loss would be acceptable even if such an event occurred. What can be confirmed is that Quickwit's storage cost offers significant savings compared to our existing cloud log service or ES storage.
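Since the load balancing is left to the user here, a minimal sketch of one way to distribute HTTP ingest requests, assuming an nginx reverse proxy in front of two hypothetical indexer hosts (hostnames, ports, and paths are placeholders):

```bash
# Round-robin ingest requests across two indexers via nginx (the default
# behavior of an upstream block). All names below are placeholders.
cat > /etc/nginx/conf.d/quickwit-ingest.conf <<'EOF'
upstream quickwit_indexers {
    server indexer-1:7280;
    server indexer-2:7280;
}
server {
    listen 8080;
    location /api/v1/ {
        proxy_pass http://quickwit_indexers;
    }
}
EOF
nginx -s reload
```

Any HTTP-aware load balancer would work the same way; the only requirement is that every indexer shares the PostgreSQL metastore mentioned above.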
-
This is not normal at all. We fixed one or two issues on this recently. Can you share some logs on this?
Well, it's hard to give an exact number without knowing more about the workload. For log use cases, we would expect a 10x reduction, but it could be less or more...
-
@fmassot Unfortunately, I didn't save the error logs from the previous occurrence; I will find a log snippet when I can. Based on my recollection of troubleshooting the issue last time, I noticed some interesting patterns: the problem occurs when the GC collection task is triggered, after which no more logs are printed for new splits, and some splits with new split IDs are not included in the subsequent pipeline processing steps. I have two initial suspicions:
1. Some operation following the new-split actor is causing the actor's runtime to freeze, resulting in no further consumption (unlikely to be a panic, because I haven't seen any panic logs).
2. When multiple Indexers perform GC operations simultaneously on the PostgreSQL meta database, there might be some issue causing a freeze.
You mentioned earlier that you recently fixed several related issues. Could you provide me with the corresponding issue links?
-
@geek-frio I have these two issues in my mind:
Finally, we optimized the PostgreSQL metastore queries in 0.4 and 0.5: some queries were locking rows, which increased latencies and potentially led to pipeline failures.
Since 0.4, GC is run only on the node that runs the Janitor service. Overall, I would say we fixed several such issues, mostly in the 0.4 version. We still have this issue open, but I don't think you were impacted by it. If you rerun Quickwit, don't hesitate to share your logs/metrics here or on Discord; I would be happy to help you debug and choose the right Quickwit setup.
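As a hedged sketch of what that split looks like operationally (assuming the `--service` flag available in 0.4+ releases; the config file is the one each node already uses):

```bash
# On every ingest node: run only the indexer service.
quickwit run --config quickwit.yaml --service indexer

# On exactly one node: run the Janitor service, which performs GC.
# Searchers and other services can be colocated or run elsewhere as needed.
quickwit run --config quickwit.yaml --service janitor
```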
-
@geek-frio I converted the issue into a GitHub discussion as it's more appropriate and could be useful for other readers. |