
Asynchronous Multi Site Ingester Replication #4600

Closed
M3r1 opened this issue Oct 30, 2021 · 15 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

@M3r1

M3r1 commented Oct 30, 2021

We are trying to deploy an on-premise Loki architecture across 2 data centers and we want our deployment to be resilient to site-level malfunctions.

We configured our Promtail agents across all data centers to send logs to a single centralized Loki deployment hosted in a single datacenter.

We want to deploy a backup Loki on a different site that we will use instead of the main one during an outage. We use Cassandra for both the chunk and index store, and we can easily replicate Cassandra across the 2 data centers so that the data is available in the backup site when needed; however, the data held in the "main" Loki's ingesters won't be available to the backup site until it is flushed to the backend storage.

Now, we understand that the data won't truly be lost thanks to the WAL, but an outage can make important log lines unavailable, both those originating from the failed site and those sent into it. Those lines might be extremely useful when figuring out the root cause of the outage, and missing them can keep applications hosted on the failed datacenter from getting a clear picture of their state. Potentially a few hours of logs might be unavailable to us at the time of an outage despite the effort we put into setting up a backup deployment.

We haven't fully thought through the technical details of a solution, but we think that giving a set of ingesters a way to "fetch" for themselves chunks stored on a remote set of ingesters would increase Loki's resilience. A mechanism that prevents both sets of ingesters from flushing the same data to the backend storage, both when the sites are up and especially when the master site is down, is key to this solution and would be challenging to execute.

An alternative solution would be to give each set of ingesters some kind of site identifier. In that case queriers would turn to multiple sites to fetch logs, depending on the query, and a single site outage would lead to the loss of only the logs originating from that site. We realize this solution introduces some complications: a set of ingesters and distributors would have to be deployed in each site, and fetching logs might come with inconsistent latencies.

Any help or feedback would be greatly appreciated.

@taisho6339
Contributor

I don't know whether Loki should support such a high level of reliability, but I'll share my idea as a user.

If I were you, I would use nginx request mirroring, roughly like the diagram below.

nginx instances in multiple datacenters  ------> distributors in datacenter A -------> ingesters in datacenter A
                        (request mirroring) | -----> distributors in datacenter B -------> ingesters in datacenter B
  • nginx instances receive the push requests and are deployed across your multiple datacenters
  • when an nginx instance receives a request, it routes it to the distributors in datacenter A and to the ones in datacenter B
  • the distributors and ingesters in each datacenter save chunks to their own storage independently
  • if datacenter A is down, you can see your logs from datacenter B
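
For illustration, a minimal sketch of that mirroring using nginx's mirror directive; the hostnames and port below are placeholders, not anything from this thread:

```nginx
# Hypothetical sketch: duplicate every Loki push to a second datacenter.
server {
    listen 3100;

    location /loki/api/v1/push {
        mirror /mirror_push;                          # async copy of each push
        proxy_pass http://loki-distributor-dc-a:3100;
    }

    location = /mirror_push {
        internal;                                     # only reachable as a subrequest
        proxy_pass http://loki-distributor-dc-b:3100$request_uri;
    }
}
```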

@M3r1
Author

M3r1 commented Oct 31, 2021

First of all thank you for the help!

If I were you, I would use nginx request mirroring, roughly like the diagram below.

nginx instances in multiple datacenters  ------> distributors in datacenter A -------> ingesters in datacenter A
                        (request mirroring) | -----> distributors in datacenter B -------> ingesters in datacenter B

We are definitely considering a solution in that direction. If we were to execute the solution you offered, how would we replicate the data that was written to the backup Loki while one site was down back to that site the moment it comes back online?

I don't know whether Loki should support such a high level of reliability, but I'll share my idea as a user.
Our organization is extremely sensitive to downtime; I suppose most organizations are. However, I'd say we are probably more sensitive than most: we perform resilience tests by deliberately bringing down a site every other month.

With that said, I do understand that our use case is probably unique, and maybe Loki is better off prioritizing support and improvements in other areas going forward; however, I think a lot of users would appreciate it if this subject were addressed in the documentation.

Again, thanks for the help; we don't take it for granted.

@taisho6339
Contributor

taisho6339 commented Oct 31, 2021

I'm just one of Loki's users, but your topic is interesting to me, so let me discuss it.
I want to clarify what the problem is here.

You mean that a site failure takes down all of the ingesters in the site along with the other servers, so you can't see the logs held on them during the outage, right?
Also, you can't recover logs from the WAL during the outage, and the logs in the WAL might be lost due to the site failure.

@M3r1
Author

M3r1 commented Oct 31, 2021

I'm just one of Loki's users, but your topic is interesting to me, so let me discuss it. I want to clarify what the problem is here.

You mean that a site failure takes down all of the ingesters in the site along with the other servers, so you can't see the logs held on them during the outage, right? Also, you can't recover logs from the WAL during the outage, and the logs in the WAL might be lost due to the site failure.

The end goal is to keep Loki up 100% of the time.

Loki is hosted in a single datacenter and receives logs originating both from the datacenter it is hosted in and from the others.

If the site hosting Loki fails, Loki is unavailable across our entire ecosystem.

The solution is to deploy a backup Loki in a different site; when the main Loki site fails we will switch to using the backup Loki.

Loki stores data in ingesters and in long-term storage. Making the data in the long-term storage available in 2 sites is easy: we use Cassandra, and we can set up a multi-site Cassandra cluster and replicate the data to the other site.

However, the data in the ingesters won't be available in the backup site because it gets replicated to the other site only after it is flushed to the backend storage.

Ingester data missing during a site failure means, in our eyes, that Loki is fairly limited in its multi-site resilience capabilities, and that is the problem we are trying to solve or at least mitigate.

A side note: I mentioned the WAL to emphasize that we understand the data won't be lost forever, just unavailable during the site downtime:

  1. If the malfunction is a complete site shutdown, the ingesters will recover their chunks from the WAL when we bring them back up.
  2. If the malfunction is a network problem keeping us from reaching the datacenter, the ingester data will still be flushed to that site's backend storage and preserved.
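
For reference, the WAL recovery in point 1 corresponds to Loki's ingester WAL settings; a minimal sketch (the directory and memory ceiling are illustrative placeholders, not our actual values):

```yaml
# Minimal sketch of the ingester WAL settings (values are illustrative).
ingester:
  wal:
    enabled: true                 # write incoming data to the WAL on disk
    dir: /loki/wal                # placeholder path; must survive restarts
    replay_memory_ceiling: 4GB    # cap memory used while replaying the WAL
```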

Again, thanks for the help.

@taisho6339
Contributor

Thank you for the clarification.
I understand now.

My idea is the following image.

(diagram: loki_failure_design)

  • Prepare two completely independent Loki clusters
  • Logs are "constantly" replicated by Nginx to the two clusters
    • This means you can see exactly the same logs from both of your clusters at any time
  • If one site is down, you can query the Loki cluster in the other site
  • If the failed site recovers, you need to do nothing.
  • You may not even need to use Cassandra's replication feature.

@M3r1
Author

M3r1 commented Nov 1, 2021

Thanks for the detailed response.

Your idea is definitely something we are heavily considering.

However, in this architecture, since there is no replication mechanism between the two Loki clusters, all the logs written to the backup site while the master site is down won't be available in the master site when it comes back up.

This creates a "hole" in the data available in one site, so the two clusters potentially won't really be fully synchronized.

This fact is significant to us since our master site has significantly better hardware and we want to use it every second it is up. This means we want to provide our Loki clients with a single Loki data source pointing to the master site; when the master site fails, the Loki DNS address will change to point to the backup site.

In fact, we want users to have a single Grafana data source for reasons other than the one I just mentioned; we actually prioritized a single Grafana data source as the main attribute we want to maintain in the multi-site architecture.
Our ecosystem also contains Prometheus. We used to have 2 Prometheus data sources, one for each site, which was a great inconvenience for clients. To overcome this issue we introduced Thanos into our ecosystem, which allowed us to achieve a single Grafana data source. Since we made that change, Prometheus usage increased dramatically and we can't imagine going back now.

I guess if we were to use nginx mirroring, since both Loki data sources would be identical, it is not quite the same use case as we had with Prometheus; but we actually prioritize having a single Loki data source higher than preserving ingester data during a site failure.

@taisho6339
Contributor

taisho6339 commented Nov 1, 2021

I see.
How about the following?

(diagram: loki_failure_design (3))

  • Replicate on the Cassandra side across the multiple datacenters
  • Send queries through Nginx to the queriers
  • Put an L7 LB in front of the multiple Nginx instances
    • Basically this LB routes to your master site and, if that site fails, it routes to the other

This architecture doesn't create a "hole", I think, because Cassandra replication is enabled and there is only one data source.
(But I don't know very much about Cassandra replication.)
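
For concreteness, cross-datacenter replication in Cassandra is usually configured per keyspace with NetworkTopologyStrategy; a minimal sketch, where the keyspace and datacenter names are placeholder assumptions:

```cql
-- Hypothetical keyspace for Loki's chunks/index, replicated to both sites.
-- 'dc-a' and 'dc-b' must match the datacenter names in the Cassandra topology.
CREATE KEYSPACE IF NOT EXISTS loki
WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'dc-a': 3,
  'dc-b': 3
};
```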

If you need more complex routing on the LB side for failover, you could use Envoy instead of Nginx.
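
For the simple active/backup case, plain nginx can already do the failover with a backup upstream; a rough sketch with placeholder hostnames:

```nginx
# Hypothetical failover: prefer the master site, fall back to the backup site.
upstream loki_query {
    server nginx-dc-a:3100;           # master site
    server nginx-dc-b:3100 backup;    # only used when dc-a is unreachable
}

server {
    listen 8080;
    location / {
        proxy_pass http://loki_query;
    }
}
```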

@M3r1
Author

M3r1 commented Nov 1, 2021

Interesting, this solution definitely answers our resilience and convenience needs. A similar one came up in our discussions.

This solution bears a significant drawback: since the ingesters are separate, they have no way of knowing whether an ingester in the other site has already written to Cassandra the chunks it holds in memory. This means that each log line will be written to Cassandra twice.

Aside from the additional storage costs, we fear our query performance would take a significant hit since double the data would be pulled in each query.

@taisho6339
Contributor

taisho6339 commented Nov 1, 2021

This means that each log line will be written to Cassandra twice.

Right.
Also, I've realized that a chunk in one site can differ from the corresponding chunk in the other site because the Loki clusters are separate.
My second idea therefore carries the risk of producing different chunks and chunk keys for almost the same logs.

@M3r1
Author

M3r1 commented Nov 2, 2021

We are leaning toward accepting the fact that ingester data will be unavailable during site downtime; we think it is probably the best option of the bunch.

This feature request was opened in the hope that this topic will be addressed in future releases.

@owen-d
Member

owen-d commented Nov 2, 2021

Hey yall, great discussion going on in here. I suspect we'll introduce zone-aware replication at some point in the future, meaning that we could host ingesters across multiple zones and we'd only write to replication_factor ingesters in distinct zones. This would ensure that a single zone failure wouldn't result in data loss and all ingester data would be available via the other zones in the interim. I realize that doesn't solve your problem now, but hopefully it will in the future :) .
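
For anyone finding this later, a rough sketch of what zone-aware ingester placement looks like in the Cortex-style ring configuration Loki inherits; the exact key names and defaults can differ between versions, so treat this as an assumption to check against the docs for your release:

```yaml
# Sketch only: zone-aware replication, with each ingester declaring its zone.
ingester:
  lifecycler:
    availability_zone: zone-a        # set to zone-b / zone-c on the other ingesters
    ring:
      replication_factor: 3
      zone_awareness_enabled: true   # spread the RF copies across distinct zones
```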

P.S. I'd highly advise moving away from Cassandra and towards a single-store configuration using the boltdb-shipper unless you have very intentional reasons for choosing Cassandra. It's where our current and future development efforts are focused, and it is fast, cost-effective, and easier to operate in my opinion.
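
For reference, a minimal single-store sketch with boltdb-shipper; the object store, start date, and paths are placeholders to adapt:

```yaml
# Sketch of a boltdb-shipper (single-store) schema + storage config.
schema_config:
  configs:
    - from: 2022-01-01               # placeholder start date for the new schema
      store: boltdb-shipper
      object_store: s3               # or gcs / azure / filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    shared_store: s3                 # must match object_store above
```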

@M3r1
Author

M3r1 commented Nov 2, 2021

Hey yall, great discussion going on in here. I suspect we'll introduce zone-aware replication at some point in the future, meaning that we could host ingesters across multiple zones and we'd only write to replication_factor ingesters in distinct zones. This would ensure that a single zone failure wouldn't result in data loss and all ingester data would be available via the other zones in the interim. I realize that doesn't solve your problem now, but hopefully it will in the future :)

Thank you so much!!! It is definitely reassuring that a feature of this kind is in the works. Should I keep the issue open or close it knowing something is coming in the future?

P.S. I'd highly advise moving away from Cassandra and towards a single-store configuration using the boltdb-shipper unless you have very intentional reasons for choosing Cassandra. It's where our current and future development efforts are focused, and it is fast, cost-effective, and easier to operate in my opinion.

We are operating on a completely private network, which means the object storage we use is our own and it still heavily relies on HDD disks.
We originally deployed Loki with the single-store setup on that object storage; however, we got extremely slow results due to its performance: we get around 35mbps of download speed from our object service.
On top of that, without getting into how our architecture is built, we are very limited in the number of CPU cores we can give Loki, which made the object storage bottleneck even more painful.

The solution for us was Cassandra: working from a block-storage-based database made pulling data from the backend storage around 15 times faster.
With that said, once we make the switch to SSD disks we will move to single-store without second-guessing it.

@owen-d
Member

owen-d commented Nov 2, 2021

I'm fine leaving this issue open, especially so folk can 👍 it :). There's already a lot of code for this in Cortex and I suspect it'd just be a matter of prioritization on our side.

@taisho6339
Contributor

taisho6339 commented Nov 12, 2021

Maybe we can already use the zone-aware replication feature thanks to this PR:
#4574. So can we close this issue?

@stale

stale bot commented Jan 9, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task,
our sincere apologies if you find yourself at the mercy of the stalebot.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Jan 9, 2022
@stale stale bot closed this as completed Apr 18, 2022