Overview:
The V4V technology is a new approach to inter-domain communications on a Xen virtualization platform. Most existing inter-domain communications frameworks use shared memory between domains and some form of event channel interrupt mechanism. V4V is a departure from this approach, instead relying on the hypervisor to broker all communications. Domains manage their own data rings and there is no sharing of actual memory. Domains communicate with one or more other domains (or themselves in the trivial case) using source and destination addresses and ports. This forms a standard 4-tuple that can unambiguously specify where a given block of data came from and where it is destined. Note that V4V also defines several protocols (making a 5-tuple), but the protocol is not used in the core V4V framework; rather it may be implemented at a higher layer in the communications stack. It should be noted that V4V provides TCP/IP-like network semantics, with the core components described in this document being roughly analogous to Layer 3 (the Network Layer) in the OSI model. The V4V core provides reliable delivery of "network packets" using the 4-tuple described above. The term "message" will be used to indicate a discrete block of data within a V4V ring, as opposed to "packet".
The term "domain" will be used to indicate guest VMs or domains on a Xen platform. Note that Domain 0 can also use V4V in the same fashion as other de-privileged guest domains. Since there is nothing inherently special in the way Domain 0 would use V4V, no differentiation will be made.
Details:
Addressing:
As noted above, v4v uses a 4-tuple address scheme where each end of the communication channel is defined by an address structure as follows.
struct v4v_addr { uint32_t port; domid_t domain; };
Domain IDs are unique on any given platform and serve as the end point address. The port value is analogous to a TCP/IP port that specifies some service at a particular address.
Rings:
The basic construct in V4V is the v4v_ring. A domain that wants to communicate with other domains must register a v4v_ring with the V4V management code in the hypervisor. Rings are identified by a v4v_ring_id, which is defined as follows:
struct v4v_ring_id { struct v4v_addr addr; domid_t partner; };
The ring ID defines the local address values and a partner domain. If a partner domain is specified, then only communication between the two domains is possible. An ANY value for partner allows a given ring to accept traffic from any other domain. The following defines the ring itself. The domain portion of the id field is always set to the local domain ID.
struct v4v_ring {
    uint64_t magic;
    struct v4v_ring_id id;
    uint32_t len;
    V4V_VOLATILE uint32_t rx_ptr;
    V4V_VOLATILE uint32_t tx_ptr;
    uint64_t reserved[4];
    V4V_VOLATILE uint8_t ring[0];
};
The length of the ring is specified in len and the actual ring data buffer starts at ring[0] in the structure. The rx_ptr is the receive pointer into the ring, indicating where the next message to be read by the domain is located. This pointer is only ever modified by the domain that owns the ring, as it consumes messages in the ring. The tx_ptr is the transmit pointer into the ring, indicating where the next received message can be written into the ring. It also represents the end of the message data to be read by the ring-owning domain. This pointer is only ever modified by the hypervisor as it writes new messages into the domain's ring.
For clarity it should be stated that a ring's data area, starting at ring[0], only contains received messages passed to it by the V4V management code in the hypervisor. V4V rings are not shared memory rings with messages moving through them in both directions.
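To illustrate the pointer semantics, here is a minimal sketch of how a ring owner might compute the amount of unread data; the function name is illustrative only and is not part of the V4V interface.

    /* Illustrative only: bytes of unread message data in a ring as seen by
     * the ring owner.  tx_ptr advances as the hypervisor writes messages;
     * rx_ptr advances as the owner consumes them, wrapping at len. */
    static uint32_t v4v_ring_unread(const struct v4v_ring *r)
    {
        if (r->tx_ptr >= r->rx_ptr)
            return r->tx_ptr - r->rx_ptr;           /* no wrap */
        return r->len - (r->rx_ptr - r->tx_ptr);    /* data wraps past the end */
    }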
Register and Unregister Rings:
A key aspect of V4V is that each domain creates its own ring memory and registers it with the V4V management code. In most cases this involves creating a block of system memory then presenting V4V with the physical addresses of the pages backing the newly allocated buffer. The following structure is used to pass that information to V4V.
struct v4v_pfn_list {
    uint64_t magic;
    uint32_t npage;
    uint32_t pad;
    uint64_t reserved[3];
    v4v_pfn_t pages[0];
};
This describes the number of pages in the ring and the Page Frame Number of each page.
A ring is registered using the V4VOP_register_ring hypercall, passing in the new v4v_ring descriptor and the v4v_pfn_list descriptor. On success, the ring is active and the domain may start sending immediately or be notified of received traffic. Diagram 1 shows the creation of a V4V ring.
Unregistering a ring is done with another hypercall, V4VOP_unregister_ring. The domain completely owns the ring and can unregister it at any point in time.
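A minimal registration sketch follows. The helpers alloc_ring_memory() and build_pfn_list(), the hypercall wrapper v4v_hypercall(), and the constant names are placeholders for whatever the guest OS provides; only the ops and structures described above come from V4V.

    /* Sketch: register a one-page V4V ring on port 4000, accepting traffic
     * from any partner domain. */
    static struct v4v_ring *example_ring_setup(domid_t local_domid)
    {
        struct v4v_ring *ring = alloc_ring_memory(PAGE_SIZE);
        struct v4v_pfn_list *pfns = build_pfn_list(ring, 1 /* npage */);

        ring->magic          = V4V_RING_MAGIC;      /* illustrative constant name */
        ring->id.addr.port   = 4000;
        ring->id.addr.domain = local_domid;         /* always the local domain ID */
        ring->id.partner     = V4V_DOMID_ANY;       /* accept traffic from any domain */
        ring->len            = PAGE_SIZE - sizeof(struct v4v_ring);
        ring->rx_ptr = ring->tx_ptr = 0;

        if (v4v_hypercall(V4VOP_register_ring, ring, pfns) != 0)
            return NULL;    /* registration failed */

        /* The ring is live; the hypervisor may now write messages into it.
         * When no longer needed: v4v_hypercall(V4VOP_unregister_ring, ring, pfns); */
        return ring;
    }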
VIRQ:
VIRQs, or virtual interrupts, are interrupts delivered on the Xen platform device's IRQ; the interrupts are sourced from within the hypervisor. This is a generic Xen mechanism that V4V uses and it will not be described in further detail here. V4V uses a dedicated VIRQ number to indicate a change in V4V state of interest to a domain. Such a domain must first register for these notifications using the appropriate hypercalls.
The reception of a VIRQ_V4V event indicates two possible changes of V4V state that a domain would be interested in:
- One or more rings that a domain owns have received messages.
- One or more destination rings that a domain attempted to send messages to, but could not, now have sufficient space to receive them.
A VIRQ_V4V event could mean either or both of the above has occurred.
Ring Receive:
The domain that owns a ring is free to read data from its ring at any point. The terminating condition indicating there are no more messages to be read is rx_ptr == tx_ptr. Note that the ring is not actually circular, so a domain must handle the case where the ring wraps around (i.e. when tx_ptr < rx_ptr). As a domain reads messages from its ring it moves rx_ptr forward to indicate the message was consumed. Each message in the ring is prefixed with the following descriptor.
struct v4v_ring_message_header {
    uint32_t len;
    struct v4v_addr source;
    uint16_t pad;
    uint32_t protocol;
    uint8_t data[0];
};
As stated, a ring can be read at any time (e.g. using a polling algorithm), but V4V also provides an interrupt mechanism to indicate message arrival. In addition (and perhaps more usefully), the domain owning a ring can receive virtual interrupts (VIRQ_V4V) to indicate the arrival of messages (see above).
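A rough sketch of a receive loop under the layout above follows. It assumes that hdr.len covers the whole message (header plus payload), and the helpers ring_copy_out(), deliver() and V4V_ROUNDUP() are placeholders supplied by the guest (the last one applies the ring's message alignment); none of these names come from the description above.

    /* Sketch: drain all messages currently queued in a ring. */
    static void v4v_drain(struct v4v_ring *ring)
    {
        while (ring->rx_ptr != ring->tx_ptr) {
            struct v4v_ring_message_header hdr;

            /* copy the header out, wrapping at ring->len if necessary */
            ring_copy_out(ring, ring->rx_ptr, &hdr, sizeof(hdr));

            /* hand the payload to the consumer */
            deliver(&hdr.source, hdr.protocol,
                    ring, ring->rx_ptr + sizeof(hdr), hdr.len - sizeof(hdr));

            /* consume: move rx_ptr past the message, wrapping at ring->len */
            ring->rx_ptr = V4V_ROUNDUP(ring->rx_ptr + hdr.len) % ring->len;
        }
    }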
Ring Notify:
V4V provides a facility to notify the management code in the hypervisor that receive processing has been done and/or that there are pending sends. The V4VOP_notify hypercall should be made when either or both of these conditions exist.
To notify of receive activity, no additional information is supplied to the notify hypercall (the change is implicit in that the rx_ptr changed).
When a domain is ready to send messages to 1 or more destination rings, the notify hypercall is used to query the state of the destination rings to determine if they can receive the data. The following structures are used to specify what the notifying domain is interested in.
struct v4v_ring_data_ent {
    struct v4v_addr ring;
    uint16_t flags;
    uint32_t space_required;
    uint32_t max_message_size;
};

struct v4v_ring_data {
    uint64_t magic;
    uint32_t nent;
    uint32_t pad;
    uint64_t reserved[4];
    struct v4v_ring_data_ent data[0];
};
The caller supplies the above structures, including N v4v_ring_data_ent structures after the main descriptor. Within each v4v_ring_data_ent structure, the caller fills in the ring and space_required fields for the destination ring to query. V4V fills in the flags and max_message_size fields as output.
The max_message_size indicates how much message data can be sent at the current time. If max_message_size < space_required at the time of the call, V4V will internally request that a VIRQ_V4V notification be raised when enough space becomes available.
The flags can indicate:
- V4V_RING_DATA_F_EMPTY - The ring is empty
- V4V_RING_DATA_F_EXISTS - The ring exists
- V4V_RING_DATA_F_PENDING - Pending interrupt exists - do not rely on this field - for profiling only
- V4V_RING_DATA_F_SUFFICIENT - Sufficient space to queue space_required bytes exists
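A sketch of a send-side notify for a single destination ring follows. The hypercall wrapper v4v_hypercall() and the V4V_RING_DATA_MAGIC constant name are placeholders, not part of the interface described above.

    /* Sketch: ask V4V whether a destination ring can accept 'need' bytes. */
    static int v4v_can_send(domid_t dest_domid, uint32_t dest_port, uint32_t need)
    {
        uint8_t buf[sizeof(struct v4v_ring_data) +
                    sizeof(struct v4v_ring_data_ent)];
        struct v4v_ring_data *rd = (struct v4v_ring_data *)buf;
        struct v4v_ring_data_ent *ent = rd->data;

        memset(buf, 0, sizeof(buf));
        rd->magic           = V4V_RING_DATA_MAGIC;   /* illustrative name */
        rd->nent            = 1;
        ent->ring.domain    = dest_domid;
        ent->ring.port      = dest_port;
        ent->space_required = need;

        v4v_hypercall(V4VOP_notify, rd);

        /* If there is not enough space now, V4V will raise VIRQ_V4V later. */
        return (ent->flags & V4V_RING_DATA_F_SUFFICIENT) != 0;
    }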
Sending:
There are two hypercalls for sending messages, V4VOP_send and V4VOP_sendv. They both take a source and destination v4v_addr and a protocol value. The send op takes a buffer and length, whereas the sendv op takes a list of buffers and a count of items in the list. If the message(s) cannot be sent, a return code indicating the caller should try again will be returned and V4V will internally request that a VIRQ_V4V notification be raised when enough space becomes available.
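A sketch of a gathered send via V4VOP_sendv follows. The v4v_iov layout, the V4V_PROTO_DGRAM constant, the -EAGAIN value and the v4v_hypercall() wrapper are assumptions made for illustration only; the description above only guarantees "a return code indicating the caller should try again".

    /* Illustrative iov layout; not taken from the text above. */
    struct v4v_iov { uint64_t iov_base; uint32_t iov_len; uint32_t pad; };

    /* Sketch: send two buffers to (dest_domid, port 4000) in one sendv op. */
    static int example_sendv(domid_t local_domid, domid_t dest_domid,
                             void *hdr_buf, uint32_t hdr_len,
                             void *body_buf, uint32_t body_len)
    {
        struct v4v_addr src  = { .port = 4000, .domain = local_domid };
        struct v4v_addr dest = { .port = 4000, .domain = dest_domid };
        struct v4v_iov iov[2] = {
            { (uintptr_t)hdr_buf,  hdr_len,  0 },
            { (uintptr_t)body_buf, body_len, 0 },
        };

        int rc = v4v_hypercall(V4VOP_sendv, &src, &dest, iov, 2, V4V_PROTO_DGRAM);
        if (rc == -EAGAIN) {
            /* destination ring is full: wait for VIRQ_V4V and retry the send */
        }
        return rc;
    }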
V4V IPTables:
Built into the V4V management code is an IPTables like firewall. Three hypercalls allow rules to be added, deleted and listed. The implementation is much the same as Linux IPTables (public information could be referenced here).
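As a purely illustrative sketch, adding an ACCEPT rule might look like the following. The rule structure, op name and hypercall wrapper are all guesses for illustration; the text above only states that three hypercalls exist for adding, deleting and listing rules.

    /* Assumed rule layout and op name: NOT part of the documented interface. */
    struct v4v_viptables_rule {
        struct v4v_addr src;      /* (source domain, source port) */
        struct v4v_addr dst;      /* (destination domain, destination port) */
        uint32_t accept;          /* 1 = ACCEPT, 0 = REJECT */
    };

    struct v4v_viptables_rule rule = {
        .src = { .port = 5000, .domain = 3 },
        .dst = { .port = 5000, .domain = 4 },
        .accept = 1,
    };

    /* v4v_hypercall() is the same placeholder wrapper used earlier. */
    v4v_hypercall(V4VOP_viptables_add, &rule);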
Motivation:
The motivation for V4V is to invent a new approach to inter-domain communications on a Xen platform that is simpler, more secure and less prone to failure. The existing approaches fall short on many of these criteria.
Security:
V4V provides a much higher level of isolation between domains because no memory is shared. Each domain completely owns its rings. Only the domain that owns a ring and the hypervisor can access it. The hypervisor (as a trusted component of the system) brokers all communications and ensures the integrity of the rings.
Fault Tolerance:
V4V is more fault tolerant than existing approaches. Since the hypervisor brokers all activity, it has complete control over V4V. An individual domain can manage the lifetime of its rings without any ill effect on other domains. A domain that corrupts or misuses its own ring cannot damage (or even see) rings owned by other domains. Domain shutdowns or crashes with open rings are trivially handled in the V4V management code.
Simplicity:
The interface and semantics for using V4V are quite simple. Its likeness to TCP/IP means it fits easily into existing protocol frameworks within operating systems. The internal workings of V4V are also far simpler.
Performance:
Though the reliance on copying data from one domain to another may seem like a major performance issue, it turns out not to be. Data copies on modern systems are extremely fast due to memory/bus speeds and advanced instructions that allow larger copies per CPU cycle. In addition, due to locality of reference with respect to CPU caches when using V4V, most copies will occur in cache, drastically speeding them up. Finally, V4V does not introduce any more VMEXIT overhead than existing solutions.
At the moment, V4V doesn't keep track of established connections, which has an impact on the firewall's ability to track connections and makes every ring of a guest potentially able to receive messages from any other guest.
One possible solution is to provide the ability to have private rings that are only for receiving data from a specific guest.
- connect hypercall
The notification mechanism currently present doesn't allow any communication between the hypervisor and the guest. The mechanism just notifies the guest that something happened, and it is up to the guest to find out which elements changed.
One way to improve here would be to offer a list of events along with the notification. Instead of reinventing a brand new mechanism, we could have a V4V ring that is only used by the hypervisor to write events. Those events would allow the guest to identify which rings need processing and which destinations now have space. It could also carry other events: connection requests, etc.
The current approach to securing V4V is through the use of a firewall-like interface known as viptables. This mechanism is described briefly in the overview section of the V4V wiki page. It is a simple filtering mechanism which allows dom0 to specify an ACCEPT / REJECT policy for packets being sent from one endpoint to another. In this case an 'endpoint' is a (domid, port) 2-tuple / ordered pair.
The rules could be represented as follows:
ACCEPT: (X, Y) -> (X', Y')
Such a rule would allow the source domain X to send data over V4V from source port Y to destination domain X' on destination port Y'. Similarly, a rule could specify that the REJECT action be taken for matching communications over V4V, in which case the data would be rejected and the sender notified through an error value returned from the hypercall.
The approach v4vtables takes to securing communications over V4V between VMs is definitely "the right way to do it". There are however a few issues with the approach. This section will deal with several issues raised in xen-devel discussions around [V4v_Patchset_10]. We'll also address some concerns raised internally with regard to XSM.
General issues w/r to v4vtables raised in Patchset 10 came from Tim D. These had less to do with security than with general implementation details, but since v4vtables is a security mechanism they are discussed here. Specifically see issues like #126 and any others with the phrase "v4vtables misery".
[Issue 126] More v4vtables mess. Tim also asked for a more explicit description of the calling convention for v4vtables rules. From this it's reasonable to conclude that v4vtables needs more love before it'll be ready for upstream.
There is the possibility of a denial of service on the hypervisor caused by a guest. The scenario here would be that a guest creates a very large number of V4V rings/sendv vectors/notify requests and exhausts hypervisor resources. This seems to be a "very bad thing" so some limitations need to be considered. This is issue 6.
Another DoS situation exists currently. In this scenario a guest sending unwanted or unexpected data to another guest could saturate that guest's V4V rings with garbage. This would effectively deny service to the guest owning the V4V rings. This scenario could be addressed by v4vtables if they were modified to allow guests to add rules at any point limiting senders to their rings. This is part of issue 7.
It's been suggested on the list that we need a mechanism to disable V4V in situations where it's not being used. This was brought up and tagged by Ross as [issue].
Tim D. briefly mentioned XSM w/r to adding v4vtables rules in issue 132 but it's become clear that the issue is more fundamental. v4vtables supplies functionality that overlaps with what XSM is designed to do. Adding v4vtables to Xen effectively adds objects to the hypervisor that belong to a specific domain (message rings). Access to these objects for communication with the guest to which they belong is effectively an access control decision. We've invented v4vtables as an access control mechanism that governs access to this specific object type.
Xen has however already accepted XSM as a generic access control mechanism intended to solve similar problems. That's not to say that XSM is a perfect fit to replace v4vtables; in fact it can't replace v4vtables completely. Still, it's likely a good idea to use XSM where possible and use v4vtables to extend this functionality where necessary. This includes not only considering the use of XSM for access control on V4V message exchange but also on the manipulation of v4vtables objects.
This section documents some recommendations to keep V4V moving forward. This is all open for discussion and none of it is set in stone. Please edit this document with suggestions / objections / ideas.
The requested flag to disable V4V system-wide is a pretty heavy-handed approach but it's likely a good thing to have. This should be a Xen command line option. It may be best to actually have V4V disabled by default and provide the cmdline option to enable V4V. Semantics like the flask-enforcing flag may be right:
- v4v=1 to enable
- v4v=0 to disable
- disable by default (when no cmdline option is given, opt-in semantics)
To address the concerns over a DoS from guests creating a large number of V4V resources it's probably sufficient to introduce limits on a per-guest basis. This would be something like adding per-VM config options like the following:
- v4v-rings-max=N to allow VM to create N V4V rings
- v4v-rings-max=0 to disallow VM from creating rings (note the VM could still send data)
- v4v-sendv-max=N to allow VM to send N sendv vectors during a sendv op
- v4v-sendv-max=0 to disallow VM from sending data (note the VM could still receive data)
- v4v-send-max=N to allow VM to send a maximum to N bytes in a sendv op
- v4v-notify-max=N to allow VM to send N rings to check in a notify op
- default to 0 when not specified (RJP: I am leaning towards defaulting to reasonable limits?)
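As a purely hypothetical illustration (none of these options exist yet), a per-VM config fragment under the proposal above might read:

    v4v-rings-max=4
    v4v-sendv-max=8
    v4v-send-max=262144
    v4v-notify-max=8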
To get XSM involved in controlling access to V4V objects we first need to enumerate the objects and the actions that are performed on them. The objects will likely be easy enough to enumerate. From a quick chat yesterday there's obviously the ring itself but there will likely be others, including those belonging to the v4vtables machinery. Some work should be done to fill in the following data:
Actions for V4V Ring: Create, Destroy, Send
Actions for v4vtable entries: Create, Delete, Read
Further this data should be linked to the structures (source file and line) where the XSM label will live and where the access control hooks need to be placed. Similarly we'll need to work up a patch to the default XSM policy which adds the necessary object classes and access vectors.
So for the sake of argument let's assume that we implement the XSM stuff above and we can write XSM policy like the following:
allow domA_t self:v4v { create delete };
allow domB_t domA_t:v4v send;
This would allow a VM with the label domB_t to send data to a V4V ring with the label domA_t (presumably belonging to the VM labeled domA_t). This gives us the same semantics as the SELinux extensions to DBus.
With this, we've achieved ~50% of the protections offered by v4vtables: we can restrict which VMs are able to communicate over V4V. What we're lacking is the notion of a 'port'. Unfortunately the ordered pair of (network address, port) doesn't map well to the Flask policy. SELinux has a mechanism for labeling port numbers but the language doesn't allow complex types so the label of the node (an IP address) cannot be associated with the port. I'd suggest we don't try to extend XSM in the same way with the same limitations and use v4vtables instead.
Now that we've extended XSM to govern v4vtables rules it makes sense to expose the v4vtables hypercalls to guests beyond dom0. This makes v4vtables much closer to a real firewall in that the guest is in control of their own policy. Still dom0 will be able to create and delete policy as well and with the XSM rules it's possible for dom0 to add rules that the guest cannot manipulate or even see:
allow domA_t self:v4v_rule { create delete read };
allow dom0_t domA_t:v4v_rule { delete read };
allow dom0_t dom0_t:v4v_rule { create delete read };
This would allow both domA and dom0 to create and delete rules, but dom0 would be able to delete rules that belong to domA while domA would not be able to manipulate rules created by dom0. Obviously there would need to be (and likely already are) hard-coded policies to prevent a VM from creating v4vtables policy where it is anything other than the destination (ingress only). This also assumes that there are no transition rules for v4v_rule objects, as I believe they're all stored in a single list (no labeled parent object to base a transition on). Some additional thought on this last point may be useful.
There has been some discussion around exposing v4vtables to the guest. This would allow it to protect itself to some extent. In this case some default constraints need to be placed on rule manipulation. This is for basic sanity and for systems that are unable or unwilling to use XSM:
- dom0 or some privileged domain can manipulate all rules
- guests can manipulate rules provided they are ingress rules (the guest creating the rule is the destination)
Next steps include a proposal for forward progress on V4V access control with XSM and v4vtables.
So here is some of the reasoning behind why we think v4v is a good solution for inter-domain communication (and why we think it is better than the current shared memory grant method that is used).
Reasons why the v4v method is quite good even though it does memory copies:
- Memory transfer speeds through the FSB in modern chipsets are quite fast. Speeds on the order of 10-12 Gb/s (over say 2 DRAM channels) can be realized.
- Transfers on a single clock cycle using SSE(2)(3) instructions allow moving up to 128 bits at a time.
- Locality of reference arguments with respect to processor caches imply even more speed-up due to likely cache hits (this may in fact make the most difference in mem copy speed).
Reasons why the v4v method is better than the shared memory grant method:
- v4v provides much better domain isolation since one domain's memory is never seen by another and the hypervisor (the most trusted component) brokers all interactions. This also implies that the structure of the ring can be trusted.
- Use of v4v obviates the event channel availability issue since it doesn't consume individual channel bits when using VIRQs. (This point is now obsolete since V4V was switched to normal event channel use.)
- The projected overhead of VMEXITs (originally cited as a major limiting factor) did not manifest itself as an issue. In fact, it can be seen that in the worst case v4v does not cause many more VMEXITs than the shared memory grant method, and in general it is at parity with the existing method.
- The implementation specifics of v4v make its use in both Windows and Unix/Linux-type OSes very simple and natural (ReadFile/WriteFile and sockets respectively). In addition, v4v uses TCP/IP protocol semantics, which are widely understood, and does not introduce an entirely new protocol set that must be learned.
Some of the downsides to using the shared memory grant method:
- This method imposes an implicit ordering on domain destruction. When this ordering is not honored, the grantor domain cannot shut down while the grantee still holds references. In the extreme case where the grantee domain hangs or crashes without releasing its granted pages, both domains can end up hung and unstoppable - the DEADBEEF issue. We discovered this issue does not occur with libvchan.
- You can't trust any ring structures, because the entire set of pages that are granted is available to be written by the other guest.
- The PV connect/disconnect state-machine is poorly implemented. There's no trivial mechanism to synchronize disconnecting/reconnecting and dom0 must also allow the two domains to see parts of xenstore belonging to the other domain in the process.
- Using the grant-ref model and having to map grant pages on each transfer causes updates to V->P memory mappings and thus leads to TLB misses and flushes (TLB flushes are expensive operations).
V4V had to be changed quite a bit to be accepted upstream. The API and hypervisor ABI changed which means we will need to build compat layers into the guest drivers. The VIRQ was replaced with a standard masked event channel. Since the number of event channels has been increased in upstream Xen, this is not a big deal. These changes mean that some of the information in this page and on the API link below are incorrect (or will be) with the new implementation. But the details about functionality in the overview and the "thoughts and justifications" are still relevant.
TODO: We had once collected some metrics on V4V vs. libvchan. I think they were posted to xen-devel but I have not been successful in finding them. It would be nice to have these data.