Accessing MARS volume from Node B (secondary role) #16
The /dev/mars/mydata device only appears on the primary side (active node). The secondary side (passive role) only updates the underlying disk /dev/my-vg/my-data (to be prepared for a primary switch), but never shows an additional virtual /dev/mars/something while in the passive role. It would not make sense due to the asynchronous update (time delay). Anyway, ordinary filesystems like ext4 or xfs cannot be mounted from two nodes at once without causing disastrous inconsistencies / data loss. There exist some special filesystems like ocfs2 which can theoretically do it, but as explained in some of my slides, you don't want to operate such a nervous beast over long distances or over slow network links. There have been some ideas to show a /dev/mars/mydata in read-only mode during pause-replay, e.g. for taking a backup from a "snapshot". However, some filesystems like xfs want to replay their own transaction log on any uncleanly unmounted filesystem, even when mounted read-only. Therefore I haven't implemented it up to now. If you want this functionality, it can alternatively be done via "detach", obeying some very important rules described in mars-manual.pdf (otherwise you risk inconsistencies caused not by MARS but by properties of xfs & co).
I'm glad you covered all these different scenarios in your answer, because what I would like to explore is MARS's ability to implement an active-active multi-geo cluster. What I would like to understand is what causes the inconsistencies in this type of scenario when using GFS2 or OCFS2 filesystems. Is it concurrency at the block level combined with high latency?
Problem: would GFS2 or OCFS2 work well over long distances or through network bottlenecks?

A) Architectural answer. NO, never in general. It is not a problem of MARS, or of its brother DRBD. It is a problem of Distributed Systems in general. The name of the problem is: DSM = Distributed Shared Memory. A logically shared filesystem like OCFS2 is at the same time physically distributed, and this is nothing else but a particular variant of DSM (with the additional property of persistence, which need not necessarily be present in a pure DSM model). In academic research, there exists a body of several hundred or thousand research papers on DSM. They can be summarized by the following two opinions of researchers working in the field:
Please look carefully at my formulation: both opinions are not contrary to each other. They just look at the same coin from different perspectives / sides. It is like the famous glass of water, filled only by half.

B) Practical answer: DRBD supports active-active operation, while MARS currently does not (but might in the future). Simply use DRBD for testing your application use case, and look whether it works for your application. Here the crucial point is "for your application". As in my theoretical answer, there exist workloads where it can work, provided that some further conditions are met. However, there exist other workloads where it certainly does not work.

Example: shortly after Dijkstra published his famous article on Semaphores at the end of the 1960s (which were supposed to be implemented on single-processor or on SMP machines), a research group tried to implement a variant called "Distributed Semaphore". The goal was to create a Distributed operating system kernel by using the DS = Distributed Semaphore in place of the LS = Local Semaphore. Essentially they hoped that any Distributed System can be implemented the same way as a local system, by using the same methods. They just implemented the DS part of such a system. And the result was disastrous in terms of performance. DS simply is slower by several orders of magnitude than LS. They tried to improve it, but there was no chance. Finally somebody came up with strong arguments that there is really no chance. This was the birthday of a new science field called "Distributed Systems". If both DS and DSM worked as originally (naively) expected, there would be no need for research on Distributed Systems at all. The whole field would not be needed, because the fundamental problems would be generally solved. But they aren't.

IMPORTANT: please notice that EC can only work with partners knowing each other [example: you need to call bind() before calling connect()].
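The bind-before-connect point can be sketched with Python's standard socket API. This is a minimal illustration, not MARS code: message passing needs a named rendezvous (the receiver binds an address, the sender must know it), while a DSM-style exchange needs no named partner at all (here faked with a plain dict as the "shared memory"):

```python
import socket
import threading

def message_passing_demo() -> bytes:
    """Message passing requires a named partner: the receiver binds first,
    and the sender can only connect() to that known address."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))   # bind() must come first ...
    server.listen(1)
    addr = server.getsockname()     # ... and the sender must learn this address

    received = []

    def serve():
        conn, _ = server.accept()
        received.append(conn.recv(64))
        conn.close()

    t = threading.Thread(target=serve)
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(addr)            # impossible without knowing the partner's name
    client.sendall(b"ping")
    client.close()
    t.join()
    server.close()
    return received[0]

# DSM-style contrast: data is placed at a location, and some later,
# anonymous reader fetches it; writer and reader never name each other.
shared_memory = {}
shared_memory["mailbox"] = b"ping"   # writer names a location, not a partner
```

The socket variant fails outright if the address is unknown; the dict variant has no notion of an address of a partner at all, only of a data location.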
Each message must be sent to a known partner whose name must be known (like the address on a letter, or a phone number). In contrast, DSM may be used anonymously: a message is placed at a certain location instead of being sent to a named partner, and later somebody else may fetch it from there. The partners don't need to know each other. This is a fundamental difference at model level. There are some more differences, not explained here for space reasons.

Your next question: the mentioned inconsistencies of ordinary filesystems like xfs or ext4 are caused by low-level properties of their implementations, which simply don't support such usage. Theoretically it could be solved by different implementation techniques. However, this would blow up their code. So their engineers decided not to support such a feature. MARS also does not support such a feature (like active-active in combination with OCFS2) at the moment, for similar reasons. In future, I could support it. At the moment, I don't have the time for implementing it. Anyway, it would only help for those application use cases where DSM is bearable, or probably only for a true subset of them: since Einstein we know that "simultaneity" does not exist in the large-scale universe. Only over very short distances does it seem to exist in our subjective perception. The same is also true for Distributed Systems. Consequence: the greater the distances (or the worse the network bottlenecks / packet loss), the worse DSM or an implementation of it like OCFS2 will work. Therefore I currently don't give a high priority to this new MARS feature. But if you pay me for implementing it, I might decide to do it :)

Another practical answer: at 1&1, there was a practical experiment with a real customer application. Some years ago, some architect had decided to use NFS for implementing a shared storage for a certain application, sold to end customers. The NFS worked well until about 20000 customers were using the application.
After having more than 20000 customers on a single cluster, it reached its scalability limit. Then they decided to replace NFS with OCFS2: it worked again. But after reaching about 35000 customers (if I remember the numbers correctly), the scaling problems popped up again, and stable operation was no longer possible. There were a lot of incidents involving performance. Then, after many fruitless attempts, they came up with GlusterFS, which then worked for more than 35000 customers. But after reaching about 50000 customers, they were at their scaling limit again. Stable operation of their workload was impossible after crossing that limit. DAMN HELL, is this just an accident? No, there is a systematic reason behind it.

After explaining to that team what I write here and what you can find in my GUUG2017 slides, they decided to move from a BIG CLUSTER architecture to a SHARDING architecture. Please look at my slides where I explain the differences between BIG CLUSTER and SHARDING. And guess: sharding works well for that team, until today. No problems anymore, although in the meantime they have reached a magnitude of almost a million customers using that application. A tip for you from me: just avoid big clusters. They don't scale, although there are many evangelists on the internet claiming the opposite. I know from both theoretical research and practical experience at a big company that these evangelists are wrong with their claims with respect to general workloads.

Your last question: I think that a combination of DRBD with MARS is possible, while I am unsure whether it really is a good idea. Personally, I never tried it (but I know somebody who probably has). There should be strong reasons for it, and please: run a PoC = Proof of Concept first. I won't give any guarantee. My experience is that such complicated stacks are typically workarounds which may or may not work, always depending on the properties of the workload.
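The SHARDING idea described in that story can be sketched in a few lines: each customer is mapped by a stable hash onto one small, independent shard, so routing is pure local arithmetic and no cluster-wide coordination is needed. The shard count and names below are illustrative assumptions, not taken from MARS or the 1&1 setup:

```python
import hashlib

NUM_SHARDS = 16  # illustrative: a fleet of small, independent clusters

def shard_of(customer_id: str) -> int:
    """Stable mapping: the same customer always lands on the same shard,
    and no shard ever needs to talk to another one to answer this."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Routing is purely local computation -- the opposite of a BIG CLUSTER,
# where every node may have to coordinate with every other node.
```

Growth is then handled by adding shards rather than by making one cluster bigger, so the failure and performance domains stay small.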
Please read the new mars-architecture-guide.pdf, which has TONS of explanations about your issue. I won't close this discussion for now, because other people might find it interesting.
I can access the MARS volume on Node A (primary role) by following the documentation and mounting the new device MARS created at /dev/mars/[resource_name]. However, when I look at Node B I can't find the same device, which makes me wonder: is there a specific process I need to follow to gain access to my original data from Node B?