Implement basic HAST integration and an automated HA failover scenario for it #104

yaroslav-gwit opened this issue Jan 5, 2024 · 0 comments
yaroslav-gwit commented Jan 5, 2024

HAST is FreeBSD's alternative to DRBD. This means we can use it to synchronise the storage state between two Hoster nodes in a primary/secondary fashion.

Unlike DRBD, HAST is an official part of the FreeBSD OS, so it should be more stable: the user-space and kernel-space utilities ship together and always speak the same language.

Here are some docs for more details:

https://docs.freebsd.org/en/books/handbook/disks/#disks-hast
https://cobug.org/slides/hast/
https://man.freebsd.org/cgi/man.cgi?query=hast.conf&sektion=5&format=html

The initial implementation will not try to wrap the HAST management itself; it's pretty easy to work with as is. Instead, HAST will be integrated into our HA offering in order to support synchronous replication.
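To give an idea of how little wrapping is needed, here is a rough manual setup based on the handbook chapter linked above. Every resource name, host name, address and disk path below is hypothetical:

    # /etc/hast.conf -- identical on both nodes
    resource hoster0 {
        on hoster-node1 {
            local /dev/ada1
            remote 10.0.0.2
        }
        on hoster-node2 {
            local /dev/ada1
            remote 10.0.0.1
        }
    }

    # On both nodes: initialise the resource metadata and start hastd
    hastctl create hoster0
    service hastd onestart

    # Assign the roles; /dev/hast/hoster0 only appears on the primary
    hastctl role primary hoster0     # run on hoster-node1
    hastctl role secondary hoster0   # run on hoster-node2

The /dev/hast/hoster0 device on the primary can then be used like any other GEOM provider, e.g. as a vdev for a ZFS pool.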

Our current model is based on the async nature of ZFS replication (sketched below the list), which has a lot of advantages:

  • Scalability - you can scale such setups nearly indefinitely
  • Low network overhead - WAN replication is supported and even encouraged (for better data redundancy)
  • Data locality - VM or Jail data is located on a local ZFS dataset, so you don't have to rely on the network shares being available
  • Easy and painless disaster recovery
  • Etc
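For context, the current async model boils down to periodic snapshot shipping, roughly like this (dataset, snapshot and host names are made up for illustration):

    # Take a new snapshot and send the delta to the standby node
    zfs snapshot tank/vms/test-vm@replication-101
    zfs send -i tank/vms/test-vm@replication-100 tank/vms/test-vm@replication-101 | \
        ssh standby-node zfs receive -F tank/vms/test-vm

Everything written between two such snapshots is invisible to the standby node, which is exactly the gap discussed below.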

But it brings its own set of challenges:

  • Because the ZFS replication is async, there are gaps in the data, so a failed-over VM/Jail may not be in a clean state 100% of the time
  • Financial and healthcare orgs cannot tolerate gaps in their data caused by a failover
  • Switching the replication direction automatically is not always safe, because you might destroy some data on the receiving side

This is where HAST comes in: we can very easily "cluster" together some storage in synchronous replication mode, and fail over as needed without losing a single bit of data (apart from what was still in transit on the primary node at the moment of failure).
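Under that model, an automated failover on the surviving node could boil down to something like the sketch below, assuming a ZFS pool (hypothetically named hastpool) was created on top of the HAST device:

    # Promote the local HAST resource; /dev/hast/hoster0 becomes available
    hastctl role primary hoster0

    # Bring up the pool living on the HAST device and restart the workload
    zpool import -f hastpool
    # ...start the affected VMs/Jails from their datasets on hastpool...

Once the old primary comes back, it would be demoted with "hastctl role secondary hoster0" and resynchronised by hastd, unless a split-brain has occurred (see below).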

I'll also have to add some docs on how to set up HAST to work with Hoster HA, how to handle split-brain, etc.
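For reference, the split-brain recovery procedure from the handbook amounts to discarding the changes on one node and doing a full resync from the other (resource name hypothetical again):

    # On the node whose changes we decided to throw away:
    hastctl role init hoster0        # take the resource out of the cluster
    hastctl create hoster0           # re-initialise the local metadata
    hastctl role secondary hoster0   # full resync from the surviving primary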
