The Design of a Practical System for Fault-Tolerant Virtual Machines
(Translator's note: since I could not find an existing translation, I translated the paper myself to make it easier to summarize and review later, so I make absolutely no guarantee about translation quality. I only read Sections 1 and 2 to get the main ideas and did not pay much attention to the technical details in Section 3.)
ABSTRACT
We have implemented a commercial enterprise-grade system for providing fault-tolerant virtual machines, based on the approach of replicating the execution of a primary virtual machine (VM) via a backup virtual machine on another server. We have designed a complete system in VMware vSphere 4.0 that is easy to use, runs on commodity servers, and typically reduces performance of real applications by less than 10%. In addition, the data bandwidth needed to keep the primary and secondary VM executing in lockstep is less than 20 Mbit/s for several real applications, which allows for the possibility of implementing fault tolerance over longer distances. An easy-to-use, commercial system that automatically restores redundancy after failure requires many additional components beyond replicated VM execution. We have designed and implemented these extra components and addressed many practical issues encountered in supporting VMs running enterprise applications. In this paper, we describe our basic design, discuss alternate design choices and a number of the implementation details, and provide performance results for both micro-benchmarks and real applications.
We have implemented a commercial enterprise-grade system that provides fault-tolerant virtual machines by replicating the execution of a primary VM with a backup VM running on a different server. The complete system, built on VMware vSphere 4.0, is easy to use, runs on commodity servers, and typically costs real applications less than 10% in performance. Moreover, for several real applications the data bandwidth needed to keep the primary and backup VMs in lockstep is under 20 Mbit/s, which makes fault tolerance over longer distances possible. An easy-to-use commercial system that automatically restores redundancy after a failure needs many additional components beyond replicated VM execution; we designed and implemented these components and handled many practical issues that arise when supporting VMs running enterprise applications. The paper describes the basic design, discusses alternative design choices and implementation details, and gives performance results for both micro-benchmarks and real applications.
1. INTRODUCTION
A common approach to implementing fault-tolerant servers is the primary/backup approach, where a backup server is always available to take over if the primary server fails. The state of the backup server must be kept nearly identical to the primary server at all times, so that the backup server can take over immediately when the primary fails, and in such a way that the failure is hidden to external clients and no data is lost. One way of replicating the state on the backup server is to ship changes to all state of the primary, including CPU, memory, and I/O devices, to the backup nearly continuously. However, the bandwidth needed to send this state, particularly changes in memory, can be very large.
A common way to implement a fault-tolerant server is the primary/backup approach: a backup server is always ready to take over as soon as the primary fails. The backup's state must be kept almost identical to the primary's at all times, so that when the primary fails the backup can take over immediately, the failure stays invisible to external clients, and no data is lost. One way to replicate the primary's state on the backup is to continuously ship all changes to the primary's state, including CPU, memory, and I/O devices. However, the bandwidth this requires can be very large, especially for changes to memory.
A different method for replicating servers that can use much less bandwidth is sometimes referred to as the state-machine approach. The idea is to model the servers as deterministic state machines that are kept in sync by starting them from the same initial state and ensuring that they receive the same input requests in the same order. Since most servers or services have some operations that are not deterministic, extra coordination must be used to ensure that a primary and backup are kept in sync. However, the amount of extra information needed to keep the primary and backup in sync is far less than the amount of state (mainly memory updates) that is changing in the primary.
Another replication method that uses far less bandwidth is the state-machine approach. It models the servers as deterministic state machines: start them in the same initial state and feed them the same input requests in the same order, and they stay in sync. Since most servers or services do have some non-deterministic operations, extra coordination is needed to keep the primary and backup in sync, but the amount of extra information is still far smaller than the amount of state changing on the primary itself.
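To make the state-machine idea concrete for myself, here is a tiny Go sketch (my own illustration, not anything from the paper; the `Input` type and `Apply` logic are made up): two replicas stay identical purely because they apply the same inputs in the same order, and they would only diverge if `Apply` did something non-deterministic.

```go
package main

import "fmt"

// Input is one external request delivered to the replicated service.
type Input struct {
	Op  string
	Arg int
}

// Replica is a deterministic state machine: its next state depends only
// on its current state and the input it applies.
type Replica struct{ counter int }

// Apply must be deterministic: no clocks, no randomness, no data races.
func (r *Replica) Apply(in Input) int {
	switch in.Op {
	case "add":
		r.counter += in.Arg
	case "reset":
		r.counter = 0
	}
	return r.counter
}

func main() {
	primary, backup := &Replica{}, &Replica{}
	inputs := []Input{{"add", 3}, {"add", 4}, {"reset", 0}, {"add", 7}}

	// Feeding the identical input sequence to both replicas keeps them in
	// lockstep; only non-deterministic operations would need extra
	// coordination, which is exactly what the logging channel carries.
	for _, in := range inputs {
		p := primary.Apply(in)
		b := backup.Apply(in)
		fmt.Println(p, b, p == b) // always equal
	}
}
```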
Implementing coordination to ensure deterministic execution of physical servers is difficult, particularly as processor frequencies increase. In contrast, a virtual machine running on top of a hypervisor is an excellent platform for implementing the state-machine approach. A VM can be considered a well-defined state machine whose operations are the operations of the machine being virtualized (including all its devices). As with physical servers, VMs have some non-deterministic operations (e.g. reading a time-of-day clock or delivery of an interrupt), and so extra information must be sent to the backup to ensure that it is kept in sync. Since the hypervisor has full control over the execution of a VM, including delivery of all inputs, the hypervisor is able to capture all the necessary information about non-deterministic operations on the primary VM and to replay these operations correctly on the backup VM.
Coordinating physical servers so that they execute deterministically is very hard, especially as processor frequencies rise. By contrast, a virtual machine running on a hypervisor is an excellent platform for the state-machine approach: a VM is a well-defined state machine whose operations are exactly the operations of the machine being virtualized (including all of its devices). Like physical servers, VMs have some non-deterministic operations, so extra coordination information must be sent to the backup to keep the two sides in sync. Because the hypervisor has full control over the VM's execution, including the delivery of all inputs, it can capture all the necessary information about non-deterministic operations on the primary VM and replay them correctly on the backup VM.
Hence, the state-machine approach can be implemented for virtual machines on commodity hardware, with no hardware modifications, allowing fault tolerance to be implemented immediately for the newest microprocessors. In addition, the low bandwidth required for the state-machine approach allows for the possibility of greater physical separation of the primary and the backup. For example, replicated virtual machines can be run on physical machines distributed across a campus, which provides more reliability than VMs running in the same building.
Therefore, the state-machine approach can be implemented with virtual machines on commodity hardware, with no special hardware modifications, and fault tolerance becomes available immediately even on the newest microprocessors. The low bandwidth also makes greater physical separation of the primary and backup possible: for example, replicated VMs can run on physical machines spread across a campus rather than inside a single building, which gives better reliability.
We have implemented fault-tolerant VMs using the primary/backup approach on the VMware vSphere 4.0 platform, which runs fully virtualized x86 virtual machines in a highly-efficient manner. Since VMware vSphere implements a complete x86 virtual machine, we are automatically able to provide fault tolerance for any x86 operating systems and applications. The base technology that allows us to record the execution of a primary and ensure that the backup executes identically is known as deterministic replay. VMware vSphere Fault Tolerance is based on deterministic replay, but adds in the necessary extra protocols and functionality to build a complete fault-tolerant system. In addition to providing hardware fault tolerance, our system automatically restores redundancy after a failure by starting a new backup virtual machine on any available server in the local cluster. At this time, the production versions of both deterministic replay and VMware FT support only uni-processor VMs. Recording and replaying the execution of a multi-processor VM is still work in progress, with significant performance issues because nearly every access to shared memory can be a non-deterministic operation.
We implemented fault-tolerant VMs on VMware vSphere 4.0; these are fully virtualized x86 virtual machines. Because VMware vSphere implements a complete x86 virtual machine, we automatically get fault tolerance for any x86 operating system and application. The base technology that lets us record the primary's execution and make the backup execute identically is called deterministic replay. VMware vSphere Fault Tolerance is built on deterministic replay but adds the extra protocols and functionality needed for a complete fault-tolerant system. Besides providing hardware fault tolerance, the system automatically restores redundancy after a failure by starting a new backup VM on any available server in the local cluster. So far, VMware FT supports only uni-processor VMs; recording and replaying the execution of multi-processor VMs is still work in progress and runs into serious performance problems, because almost every access to shared memory can become a non-deterministic operation.
Bressoud and Schneider describe a prototype implementation of fault-tolerant VMs for the HP PA-RISC platform. Our approach is similar, but we have made some fundamental changes for performance reasons and investigated a number of design alternatives. In addition, we have had to design and implement many additional components in the system and deal with a number of practical issues to build a complete system that is efficient and usable by customers running enterprise applications. Similar to most other practical systems discussed, we only attempt to deal with fail-stop failures, which are server failures that can be detected before the failing server causes an incorrect externally visible action.
Bressoud and Schneider described a prototype implementation of fault-tolerant VMs on the HP PA-RISC platform. Our approach is similar, but we made some fundamental changes for performance reasons and explored a number of alternative designs. We also had to design and implement many additional components and deal with many practical issues to build a complete, efficient system capable of running enterprise applications. Like most other practical systems discussed, we only try to handle fail-stop failures: server failures that can be detected before the failing server produces an incorrect, externally visible action.
The rest of the paper is organized as follows. First, we describe our basic design and detail our fundamental protocols that ensure that no data is lost if a backup VM takes over after a primary VM fails. Then, we describe in detail many of the practical issues that must be addressed to build a robust, complete and automated system. We also describe several design choices that arise for implementing fault-tolerant VMs and discuss the tradeoffs in these choices. Next, we give performance results for our implementation for some benchmarks and some real enterprise applications. Finally, we describe related work and conclude.
The rest of the paper is organized as follows. First we describe the basic design of VM-FT and the details of the fundamental protocols, which guarantee that no data is lost when a backup VM takes over after the primary fails. Then we describe the many practical issues that must be handled. We also discuss several design choices for implementing fault-tolerant VMs and the trade-offs among them. Next we give performance results of our implementation on some benchmarks and some real enterprise applications. Finally we describe related work and conclude.
2 BASIC FT DESIGN
Figure 1 shows the basic setup of our system for fault-tolerant VMs. For a given VM for which we desire to provide fault tolerance (the primary VM), we run a backup VM on a different physical server that is kept in sync and executes identically to the primary virtual machine, though with a small time lag. We say that the two VMs are in virtual lockstep. The virtual disks for the VMs are on shared storage (such as a Fibre Channel or iSCSI disk array), and therefore accessible to the primary and backup VM for input and output. (We will discuss a design in which the primary and backup VM have separate non-shared virtual disks in Section 4.1.) Only the primary VM advertises its presence on the network, so all network inputs come to the primary VM. Similarly, all other inputs (such as keyboard and mouse) go only to the primary VM.
Figure 1 shows the basic setup of the VM-FT system. For a VM we want to protect (the primary VM), we run a backup VM on a different physical server; it is kept in sync with the primary and executes identically, though with a small time lag. We call this virtual lockstep. The VMs' virtual disks sit on shared storage (such as a Fibre Channel or iSCSI disk array), so both the primary and the backup VM can access them for input and output. Only the primary VM advertises its presence on the network, so all network input goes to the primary, and likewise all other input (keyboard, mouse) goes only to the primary.
All input that the primary VM receives is sent to the backup VM via a network connection known as the logging channel. For server workloads, the dominant input traffic is network and disk. Additional information, as discussed below in Section 2.1, is transmitted as necessary to ensure that the backup VM executes non-deterministic operations in the same way as the primary VM. The result is that the backup VM always executes identically to the primary VM. However, the outputs of the backup VM are dropped by the hypervisor, so only the primary produces actual outputs that are returned to clients. As described in Section 2.2, the primary and backup VM follow a specific protocol, including explicit acknowledgments by the backup VM, in order to ensure that no data is lost if the primary fails.
All input the primary VM receives is forwarded to the backup VM over a network connection called the logging channel. For server workloads, the dominant input traffic is network and disk. Additional control information (Section 2.1) is also sent so that the backup executes non-deterministic operations the same way the primary did. The backup's outputs, however, are dropped by the hypervisor, so only the primary produces actual output returned to clients. As described in Section 2.2, the primary and backup follow a specific protocol, including explicit acknowledgments from the backup, to ensure that no data is lost if the primary fails.
To detect if a primary or backup VM has failed, our system uses a combination of heartbeating between the relevant servers and monitoring of the traffic on the logging channel. In addition, we must ensure that only one of the primary or backup VM takes over execution, even if there is a split-brain situation where the primary and backup servers have lost communication with each other.
To detect whether the primary or backup VM has failed, the system combines heartbeating between the relevant servers with monitoring of the traffic on the logging channel. In addition, we must make sure that only one of the primary or backup VM takes over execution, even in a split-brain situation where the two servers have lost contact with each other. (I read this as: the point is to prevent split brain.)
In the following sections, we provide more details on several important areas. In Section 2.1, we give some details on the deterministic replay technology that ensures that primary and backup VMs are kept in sync via the information sent over the logging channel. In Section 2.2, we describe a fundamental rule of our FT protocol that ensures that no data is lost if the primary fails. In Section 2.3, we describe our methods for detecting and responding to a failure in a correct fashion.
The following sections give more detail in several important areas. Section 2.1 covers the deterministic replay technique, which keeps the primary and backup in sync via the information sent on the logging channel. Section 2.2 describes the fundamental rule of our FT protocol that guarantees no data is lost if the primary fails. Section 2.3 describes how we detect and respond to failures correctly.
2.1 Deterministic Replay Implementation
As we have mentioned, replicating server (or VM) execution can be modeled as the replication of a deterministic state machine. If two deterministic state machines are started in the same initial state and provided the exact same inputs in the same order, then they will go through the same sequences of states and produce the same outputs. A virtual machine has a broad set of inputs, including incoming network packets, disk reads, and input from the keyboard and mouse. Non-deterministic events (such as virtual interrupts) and non-deterministic operations (such as reading the clock cycle counter of the processor) also affect the VM's state. This presents three challenges for replicating execution of any VM running any operating system and workload: (1) correctly capturing all the input and non-determinism necessary to ensure deterministic execution of a backup virtual machine, (2) correctly applying the inputs and non-determinism to the backup virtual machine, and (3) doing so in a manner that doesn't degrade performance. In addition, many complex operations in x86 microprocessors have undefined, hence non-deterministic, side effects. Capturing these undefined side effects and replaying them to produce the same state presents an additional challenge.
As mentioned, replicating the execution of a server (or VM) can be modeled as replicating a deterministic state machine: if two deterministic state machines start in the same state and then receive the same inputs in the same order, they go through the same sequence of states and produce the same outputs. A VM has a wide variety of inputs, including incoming network packets, data read from disk, and keyboard and mouse input. Non-deterministic events (such as virtual interrupts) and non-deterministic operations (such as reading the CPU's cycle counter) also affect the VM's state. This gives three challenges for replicating the execution of any VM running any OS and workload: (1) correctly capturing all inputs and non-determinism needed to make the backup VM execute deterministically; (2) correctly applying those inputs and non-deterministic events to the backup VM; and (3) doing both without hurting performance. In addition, many complex operations on x86 microprocessors have undefined, hence non-deterministic, side effects; capturing these side effects and replaying them so the backup reaches the same state is a further challenge.
VMware deterministic replay provides exactly this functionality for x86 virtual machines on the VMware vSphere platform. Deterministic replay records the inputs of a VM and all possible non-determinism associated with the VM execution in a stream of log entries written to a log file. The VM execution may be exactly replayed later by reading the log entries from the file. For non-deterministic operations, sufficient information is logged to allow the operations to be reproduced with the same state change and output. For non-deterministic events such as timer or IO completion interrupts, the exact instruction at which the event occurred is also recorded. During replay, the event is delivered at the same point in the instruction stream. VMware deterministic replay implements an efficient event recording and event delivery mechanism that employs various techniques, including the use of hardware performance counters developed in conjunction with AMD and Intel.
VMware deterministic replay provides exactly this for x86 VMs on the vSphere platform. It records the VM's inputs and all non-determinism associated with the VM's execution as a stream of log entries; by reading the log entries back, the VM's execution can later be replayed exactly. For non-deterministic operations, enough information is logged to reproduce the same state change and output. For non-deterministic events such as timer or IO-completion interrupts, the exact instruction at which the event occurred is also recorded, and during replay the event is delivered at the same point in the instruction stream. VMware deterministic replay implements an efficient event recording and delivery mechanism, using among other things hardware performance counters developed together with AMD and Intel.
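My rough guess at what a replay log entry has to carry (the field names are invented; this is not VMware's actual format): an input or non-deterministic value, plus the exact instruction position at which an asynchronous event must be re-delivered.

```go
package replay

// LogEntryKind distinguishes inputs from non-deterministic events.
type LogEntryKind int

const (
	NetworkInput   LogEntryKind = iota // incoming packet data
	DiskRead                           // data returned by a disk read
	TimerInterrupt                     // virtual timer interrupt
	ClockRead                          // result of reading the cycle counter
)

// InstructionPoint identifies "where" in the guest's execution an event
// happened, so replay can deliver it at exactly the same point.
type InstructionPoint struct {
	BranchCount uint64 // e.g. a retired-branch count from hardware performance counters
	RIP         uint64 // guest instruction pointer
}

// LogEntry is one record in the replay log / on the logging channel.
type LogEntry struct {
	Kind LogEntryKind
	At   InstructionPoint // only meaningful for asynchronous events
	Data []byte           // the input bytes or the value to reproduce
}
```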
Bressoud and Schneider mention dividing the execution of VM into epochs, where non-deterministic events such as interrupts are only delivered at the end of an epoch. The notion of epoch seems to be used as a batching mechanism because it's too expensive to deliver each interrupt separately at the exact instruction where it occurred. However, our event delivery mechanism is efficient enough that VMware deterministic replay has no need to use epochs. Each interrupt is recorded as it occurs and efficiently delivered at the appropriate instruction while being replayed.
Bressoud and Schneider mention dividing a VM's execution into epochs, with non-deterministic events such as interrupts delivered only at the end of an epoch. The epoch is essentially a batching mechanism, because delivering every interrupt individually at exactly the instruction where it occurred was too expensive. Our event delivery mechanism, however, is efficient enough that VMware deterministic replay does not need epochs: every interrupt is recorded as it occurs and is efficiently delivered at the right instruction during replay.
2.2 FT Protocol
For VMware FT, we use deterministic replay to produce the necessary log entries to record the execution of the primary VM, but instead of writing the log entries to disk, we send them to the backup VM via the logging channel. The backup VM replays the entries in real time, and hence executes identically to the primary VM. However, we must augment the log entries with a strict FT protocol on the logging channel in order to ensure that we achieve fault tolerance. Our fundamental requirement is the following:
Output Requirement: if the backup VM ever takes over after a failure of the primary, the backup VM will continue executing in a way that is entirely consistent with all outputs that the primary VM has sent to the external world.
Note that after a failover occurs (i.e. the backup VM takes over after the failure of the primary VM), the backup VM will likely start executing quite differently from the way the primary VM would have continued executing, because of the many non-deterministic events happening during execution. However, as long as the backup VM satisfies the Output Requirement, no externally visible state or data is lost during a failover to the backup VM, and the clients will notice no interruption or inconsistency in their service.
After a failover (i.e. the backup takes over following the primary's failure), the backup will likely start executing quite differently from how the primary would have continued, because of the many non-deterministic events during execution. But as long as the backup satisfies the Output Requirement, no externally visible state or data is lost during the failover, and clients see no interruption or inconsistency in the service.
The Output Requirement can be ensured by delaying any external output (typically a network packet) until the backup VM has received all information that will allow it to replay execution at least to the point of that output operation. One necessary condition is that the backup VM must have received all log entries generated prior to the output operation. These log entries will allow it to execute up to the point of the last log entry. However, suppose a failure were to happen immediately after the primary executed the output operation. The backup VM must know that it must keep replaying up to the point of the output operation and only “go live” (stop replaying and take over as the primary VM, as described in Section 2.3) at that point. If the backup were to go live at the point of the last log entry before the output operation, some non-deterministic event (e.g. timer interrupt delivered to the VM) might change its execution path before it executed the output operation
The Output Requirement can be met by delaying any external output (typically a network packet) until the backup has received all the information that lets it replay execution at least up to the point of that output. (In other words, the primary cannot emit the output right away; the corresponding log entries must first reach the backup.) If the primary fails right after performing an output, the backup must know to keep replaying up to the output point and only go live there; if it went live at an earlier log entry, some non-deterministic event (e.g. a timer interrupt) could change its execution path and it might miss the output.
Given the above constraints, the easiest way to enforce the Output Requirement is to create a special log entry at each output operation. Then, the Output Requirement may be enforced by this specific rule:
Output Rule: the primary VM may not send an output to the external world until the backup VM has received and acknowledged the log entry associated with the operation producing the output.
If the backup VM has received all the log entries, including the log entry for the output-producing operation, then the backup VM will be able to exactly reproduce the state of the primary VM at that output point, and so if the primary dies, the backup will correctly reach a state that is consistent with that output. Conversely, if the backup VM takes over without receiving all necessary log entries, then its state may quickly diverge such that it is inconsistent with the primary’s output. The Output Rule is in some ways analogous to the approach described in [11], where an “externally synchronous” IO can actually be buffered, as long as it is actually written to disk before the next external communication.
If the backup has received every log entry up to and including the one for the output-producing operation, then when the primary dies the backup can reproduce the primary's state at that output point and reaches a state consistent with that output; if it took over without all the necessary entries, its state could quickly diverge from the primary's output. The Output Rule essentially says that output to the outside world may be buffered, as long as it is flushed before the next external communication.
Note that the Output Rule does not say anything about stopping the execution of the primary VM. We need only delay the sending of the output, but the VM itself can continue execution. Since operating systems do non-blocking network and disk outputs with asynchronous interrupts to indicate completion, the VM can easily continue execution and will not necessarily be immediately affected by the delay in the output. In contrast, previous work [3, 9] has typically indicated that the primary VM must be completely stopped prior to doing an output until the backup VM has acknowledged all necessary information from the primary VM. As an example, we show a chart illustrating the requirements of the FT protocol in Figure 2. This figure shows a timeline of events on the primary and backup VMs. The arrows going from the primary line to the backup line represent the transfer of log entries, and the arrows going from the backup line to the primary line represent acknowledgments. Information on asynchronous events, inputs, and output operations must be sent to the backup as log entries and acknowledged. As illustrated in the figure, an output to the external world is delayed until the primary VM has received an acknowledgment from the backup VM that it has received the log entry associated with the output operation. Given that the Output Rule is followed, the backup VM will be able to take over in a state consistent with the primary's last output.
Note that the Output Rule does not mean the primary VM stops running: we only delay sending the output, and the VM itself keeps executing. Because network and disk IO is non-blocking with asynchronous completion interrupts, the VM is largely unaffected by the output delay, unlike earlier work that stopped the primary completely before every output. (The rest of the paragraph just walks through Figure 2: log entries flow from primary to backup, acknowledgments flow back, and an external output is held until the acknowledgment for its log entry has arrived.)
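Here is a minimal Go sketch of how I picture the Output Rule: ship every log entry immediately, keep executing, but hold an external output until the backup's acknowledgment for the corresponding entry has come back. The channel/ack plumbing below is my simplification, not the hypervisor's real mechanism.

```go
package main

import (
	"fmt"
	"sync"
)

type LogEntry struct {
	Seq    uint64
	Output []byte // non-nil when this is the "output-producing" entry
}

// ackTracker lets the primary wait until the backup has acknowledged a
// particular log sequence number.
type ackTracker struct {
	mu    sync.Mutex
	cond  *sync.Cond
	acked uint64 // highest sequence number acknowledged by the backup
}

func newAckTracker() *ackTracker {
	t := &ackTracker{}
	t.cond = sync.NewCond(&t.mu)
	return t
}

func (t *ackTracker) ack(seq uint64) {
	t.mu.Lock()
	if seq > t.acked {
		t.acked = seq
	}
	t.cond.Broadcast()
	t.mu.Unlock()
}

func (t *ackTracker) waitFor(seq uint64) {
	t.mu.Lock()
	for t.acked < seq {
		t.cond.Wait()
	}
	t.mu.Unlock()
}

func main() {
	tracker := newAckTracker()
	toBackup := make(chan LogEntry, 16)
	wire := make(chan []byte, 1)

	// Backup: buffer each entry for replay, then acknowledge it.
	go func() {
		for e := range toBackup {
			tracker.ack(e.Seq) // the entry is now safe on the backup side
		}
	}()

	// Primary: log entries are shipped immediately and the VM itself is not
	// paused, but the externally visible output is released only after the
	// backup has acknowledged the output-producing entry (the Output Rule).
	out := LogEntry{Seq: 2, Output: []byte("HTTP reply")}
	toBackup <- LogEntry{Seq: 1}
	toBackup <- out
	go func() {
		tracker.waitFor(out.Seq)
		wire <- out.Output
	}()

	fmt.Printf("client sees: %s\n", <-wire)
	close(toBackup)
}
```

The point the paper stresses is that only the output path waits: the goroutine holding back the reply does not stall the rest of the primary's execution.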
We cannot guarantee that all outputs are produced exactly once in a failover situation. Without the use of transactions with two-phase commit when the primary intends to send an output, there is no way that the backup can determine if a primary crashed immediately before or after sending its last output. Fortunately, the network infrastructure (including the common use of TCP) is designed to deal with lost packets and identical (duplicate) packets. Note that incoming packets to the primary may also be lost during a failure of the primary and therefore won’t be delivered to the backup. However, incoming packets may be dropped for any number of reasons unrelated to server failure, so the network infrastructure, operating systems, and applications are all written to ensure that they can compensate for lost packets
We cannot guarantee that every output is produced exactly once in a failover. Put simply, without two-phase-commit-style transactions around the primary's outputs, the backup cannot tell whether the primary crashed just before or just after sending its last output. Fortunately, the network infrastructure (including TCP) is designed to cope with lost and duplicated packets, so the occasional duplicate output is acceptable (the outputs here are at packet granularity). Also note that incoming packets may be lost while the primary is failing and thus never reach the backup, but packets can be dropped for many reasons unrelated to server failure anyway, so the network stack, operating systems, and applications already compensate for lost packets.
2.3 Detecting and Responding to Failure
As mentioned above, the primary and backup VMs must respond quickly if the other VM appears to have failed. If the backup VM fails, the primary VM will go live — that is, leave recording mode (and hence stop sending entries on the logging channel) and start executing normally. If the primary VM fails, the backup VM should similarly go live, but the process is a bit more complex. Because of its lag in execution, the backup VM will likely have a number of log entries that it has received and acknowledged, but have not yet been consumed because the backup VM hasn't reached the appropriate point in its execution yet. The backup VM must continue replaying its execution from the log entries until it has consumed the last log entry. At that point, the backup VM will stop replaying and start executing as a normal VM. In essence, the backup VM has been promoted to the primary VM (and is now missing a backup VM). Since it is no longer a backup VM, the new primary VM will now produce output to the external world when the guest OS does output operations. During the transition to normal mode, there may be some device-specific operations needed to allow this output to occur properly. In particular, for the purposes of networking, VMware FT automatically advertises the MAC address of the new primary VM on the network, so that physical network switches will know on what server the new primary VM is located. In addition, the newly promoted primary VM may need to reissue some disk IOs (as described in Section 3.4).
As said above, each VM must react quickly when the other appears to have failed. If the backup fails, the primary goes live: it stops sending log entries and runs like an ordinary standalone VM. If the primary fails, the backup similarly goes live, but the process is a bit more involved. Because of its execution lag, the backup may have received and acknowledged log entries it has not yet consumed; it must keep replaying until it has consumed the last one, and only then leave replay mode and run as a normal VM, effectively promoted to primary. During this transition some device-specific operations may be needed so that the new primary's output works properly; for networking in particular, VMware FT advertises the VM's MAC address on the network so the physical switches learn which server the new primary now lives on, and the new primary may also need to reissue some disk IOs (Section 3.4).
There are many possible ways to attempt to detect failure of the primary and backup VMs. VMware FT uses UDP heartbeating between servers that are running fault-tolerant VMs to detect when a server may have crashed. In addition, VMware FT monitors the logging traffic that is sent from the primary to the backup VM and the acknowledgments sent from the backup VM to the primary VM. Because of regular timer interrupts, the logging traffic should be regular and never stop for a functioning guest OS. Therefore, a halt in the flow of log entries or acknowledgments could indicate the failure of a VM. A failure is declared if heartbeating or logging traffic has stopped for longer than a specific timeout (on the order of a few seconds).
There are many possible ways to detect failures of the primary and backup. VMware FT uses UDP heartbeating between the servers, and it also monitors the logging traffic from primary to backup and the acknowledgments going back. Because of regular timer interrupts, the logging traffic for a healthy guest OS is steady and never stops, so if heartbeats or logging traffic stop for longer than a specified timeout (on the order of a few seconds), a failure is declared. In short, failures are detected by timeouts.
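A toy version of the timeout-based detection (the intervals and the whole UDP/logging-channel transport are assumptions on my part): silence on heartbeats and on the logging channel for longer than the timeout is treated as a failure.

```go
package main

import (
	"fmt"
	"time"
)

// failureDetector declares a peer dead if neither a heartbeat nor any
// logging-channel traffic has been seen for `timeout`.
type failureDetector struct {
	lastSeen time.Time
	timeout  time.Duration
}

// observe would be called whenever a heartbeat, log entry, or ack arrives.
func (d *failureDetector) observe(now time.Time) { d.lastSeen = now }

func (d *failureDetector) failed(now time.Time) bool {
	return now.Sub(d.lastSeen) > d.timeout
}

func main() {
	d := &failureDetector{lastSeen: time.Now(), timeout: 2 * time.Second}

	// Regular timer interrupts mean a healthy guest always generates log
	// traffic, so silence on the logging channel is itself a strong failure
	// signal. Here no further traffic is observed, simulating a crash.
	ticker := time.NewTicker(200 * time.Millisecond)
	defer ticker.Stop()
	for t := range ticker.C {
		if d.failed(t) {
			fmt.Println("peer declared failed; attempt to go live")
			return
		}
	}
}
```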
However, any such failure detection method is susceptible to a split-brain problem. If the backup server stops receiving heartbeats from the primary server, that may indicate that the primary server has failed, or it may just mean that all network connectivity has been lost between still functioning servers. If the backup VM then goes live while the primary VM is actually still running, there will likely be data corruption and problems for the clients communicating with the VM. Hence, we must ensure that only one of the primary or backup VM goes live when a failure is detected. To avoid split-brain problems, we make use of the shared storage that stores the virtual disks of the VM. When either a primary or backup VM wants to go live, it executes an atomic test-and-set operation on the shared storage. If the operation succeeds, the VM is allowed to go live. If the operation fails, then the other VM must have already gone live, so the current VM actually halts itself ("commits suicide"). If the VM cannot access the shared storage when trying to do the atomic operation, then it just waits until it can. Note that if shared storage is not accessible because of some failure in the storage network, then the VM would likely not be able to do useful work anyway because the virtual disks reside on the same shared storage. Thus, using shared storage to resolve split-brain situations does not introduce any extra unavailability.
However, failure detection like this cannot by itself resolve split brain. If the backup loses contact with the primary while both are actually still running fine, each may conclude that the other has died; both then become primary at the same time, and split brain occurs, likely corrupting data and confusing clients.
VM-FT's solution is an atomic test-and-set operation implemented on the shared storage, somewhat like the CAS principle: whichever VM wins the test-and-set goes live, and the loser halts itself.
Fundamentally this is a "nesting doll" problem: the decision is just delegated to the shared storage. (Raft, which comes later, solves this kind of problem.)
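A sketch of the go-live arbitration as I understand it: both sides race to perform a single atomic test-and-set on the shared storage, the winner goes live and the loser halts. The in-memory `sharedStorage` below just stands in for the disk array.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// sharedStorage stands in for the Fibre Channel / iSCSI array that both
// hosts can reach; the flag is the "someone has already gone live" bit.
type sharedStorage struct{ goneLive atomic.Bool }

// testAndSet atomically claims the right to go live. Exactly one caller
// ever sees true.
func (s *sharedStorage) testAndSet() bool { return s.goneLive.CompareAndSwap(false, true) }

func tryGoLive(name string, s *sharedStorage, wg *sync.WaitGroup) {
	defer wg.Done()
	if s.testAndSet() {
		fmt.Println(name, "goes live as the primary")
	} else {
		// The other VM already went live: halt ("commit suicide") rather
		// than risk a second primary talking to clients.
		fmt.Println(name, "halts itself")
	}
}

func main() {
	var wg sync.WaitGroup
	storage := &sharedStorage{}

	// Simulate split brain: both VMs think the other has failed and race
	// to go live at the same time; exactly one wins.
	wg.Add(2)
	go tryGoLive("primary", storage, &wg)
	go tryGoLive("backup", storage, &wg)
	wg.Wait()
}
```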
One final aspect of the design is that once a failure has occurred and one of the VMs has gone live, VMware FT automatically restores redundancy by starting a new backup VM on another host. Though this process is not covered in most previous work, it is fundamental to making fault-tolerant VMs useful and requires careful design. More details are given in Section 3.1.
At runtime the system dynamically checks redundancy and dynamically brings in a new backup host.
3 PRACTICAL IMPLEMENTATION OF FT
(Section 3 is more about implementation detail than about the overall architecture, so I did not translate it.)
Section 2 described our fundamental design and protocols for FT. However, to create a usable, robust, and automatic system, there are many other components that must be designed and implemented.
3.1 Starting and Restarting FT VMs
One of the biggest additional components that must be designed is the mechanism for starting a backup VM in the same state as a primary VM. This mechanism will also be used when re-starting a backup VM after a failure has occurred. Hence, this mechanism must be usable for a running primary VM that is in an arbitrary state (i.e. not just starting up). In addition, we would prefer that the mechanism does not significantly disrupt the execution of the primary VM, since that will affect any current clients of the VM.
For VMware FT, we adapted the existing VMotion functionality of VMware vSphere. VMware VMotion [10] allows the migration of a running VM from one server to another server with minimal disruption — VM pause times are typically less than a second. We created a modified form of VMotion that creates an exact running copy of a VM on a remote server, but without destroying the VM on the local server. That is, our modified FT VMotion clones a VM to a remote host rather than migrating it. The FT VMotion also sets up a logging channel, and causes the source VM to enter logging mode as the primary, and the destination VM to enter replay mode as the new backup. Like normal VMotion, FT VMotion typically interrupts the execution of the primary VM by less than a second. Hence, enabling FT on a running VM is an easy, non-disruptive operation.
Another aspect of starting a backup VM is choosing a server on which to run it. Fault-tolerant VMs run in a cluster of servers that have access to shared storage, so all VMs can typically run on any server in the cluster. This flexibility allows VMware vSphere to restore FT redundancy even when one or more servers have failed. VMware vSphere implements a clustering service that maintains management and resource information. When a failure happens and a primary VM now needs a new backup VM to re-establish redundancy, the primary VM informs the clustering service that it needs a new backup. The clustering service determines the best server on which to run the backup VM based on resource usage and other constraints and invokes an FT VMotion to create the new backup VM. The result is that VMware FT typically can re-establish VM redundancy within minutes of a server failure, all without any noticeable interruption in the execution of a fault-tolerant VM
3.2 Managing the Logging Channel
There are a number of interesting implementation details in managing the traffic on the logging channel. In our implementation, the hypervisors maintain a large buffer for logging entries for the primary and backup VMs. As the primary VM executes, it produces log entries into the log buffer, and similarly, the backup VM consumes log entries from its log buffer. The contents of the primary's log buffer are flushed out to the logging channel as soon as possible, and log entries are read into the backup's log buffer from the logging channel as soon as they arrive. The backup sends acknowledgments back to the primary each time that it reads some log entries from the network into its log buffer. These acknowledgments allow VMware FT to determine when an output that is delayed by the Output Rule can be sent. Figure 3 illustrates this process.
If the backup VM encounters an empty log buffer when it needs to read the next log entry, it will stop execution until a new log entry is available. Since the backup VM is not communicating externally, this pause will not affect any clients of the VM. Similarly, if the primary VM encounters a full log buffer when it needs to write a log entry, it must stop execution until log entries can be flushed out. This stop in execution is a natural flow-control mechanism that slows down the primary VM when it is producing log entries at too fast a rate. However, this pause can affect clients of the VM, since the primary VM will be completely stopped and unresponsive until it can log its entry and continue execution. Therefore, our implementation must be designed to minimize the possibility that the primary log buffer fills up.
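The buffering behaviour maps nicely onto a bounded queue, so here is a small sketch (a Go channel playing the role of the log buffer, obviously not the hypervisor's real data structure): the backup blocks when the buffer is empty, and the primary blocks when it is full.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A bounded buffer: the primary's log buffer and the logging channel
	// collapsed into one queue for illustration.
	logBuf := make(chan string, 4)

	// Backup: consumes entries as it replays; if the buffer is empty it
	// simply blocks, which is invisible to clients because the backup
	// never talks to the outside world.
	go func() {
		for e := range logBuf {
			time.Sleep(50 * time.Millisecond) // pretend replay is a bit slow
			fmt.Println("backup replayed:", e)
		}
	}()

	// Primary: produces entries as it executes; if the buffer is full the
	// send blocks, i.e. the primary VM is paused. That pause IS visible to
	// clients, which is why the buffer should rarely fill up.
	for i := 0; i < 10; i++ {
		logBuf <- fmt.Sprintf("entry %d", i)
	}
	close(logBuf)
	time.Sleep(time.Second) // let the backup drain before exiting
}
```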
One reason that the primary log buffer may fill up is because the backup VM is executing too slowly and therefore consuming log entries too slowly. In general, the backup VM must be able to replay an execution at roughly the same speed as the primary VM is recording the execution. Fortunately, the overhead of recording and replaying in VMware deterministic replay is roughly the same. However, if the server hosting the backup VM is heavily loaded with other VMs (and hence overcommitted on resources), the backup VM may not be able to get enough CPU and memory resources to execute as fast as the primary VM, despite the best efforts of the backup hypervisor's VM scheduler.
Beyond avoiding unexpected pauses if the log buffers fill up, there is another reason why we don't wish the execution lag to become too large. If the primary VM fails, the backup VM must "catch up" by replaying all the log entries that it has already acknowledged before it goes live and starts communicating with the external world. The time to finish replaying is basically the execution lag time at the point of the failure, so the time for the backup to go live is roughly equal to the failure detection time plus the current execution lag time. Hence, we don't wish the execution lag time to be large (more than a second), since that will add significant time to the failover time.
Therefore, we have an additional mechanism to slow down the primary VM to prevent the backup VM from getting too far behind. In our protocol for sending and acknowledging log entries, we send additional information to determine the real-time execution lag between the primary and backup VMs. Typically the execution lag is less than 100 milliseconds. If the backup VM starts having a significant execution lag (say, more than 1 second), VMware FT starts slowing down the primary VM by informing the scheduler to give it a slightly smaller amount of the CPU (initially by just a few percent). We use a slow feedback loop, which will try to gradually pinpoint the appropriate CPU limit for the primary VM that will allow the backup VM to match its execution. If the backup VM continues to lag behind, we continue to gradually reduce the primary VM's CPU limit. Conversely, if the backup VM catches up, we gradually increase the primary VM's CPU limit until the backup VM returns to having a slight lag.
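One way to picture the slow feedback loop (the step sizes, bounds, and thresholds below are invented; the paper only says the adjustment is gradual and the usual lag is under 100 ms):

```go
package main

import "fmt"

// adjustCPULimit nudges the primary's CPU allocation based on the observed
// execution lag of the backup: a deliberately sluggish controller that
// takes small steps.
func adjustCPULimit(limitPct float64, lagMs float64) float64 {
	const targetLagMs = 100 // keep the backup within ~100 ms of the primary
	switch {
	case lagMs > 1000: // backup far behind: slow the primary a little
		limitPct -= 2
	case lagMs < targetLagMs && limitPct < 100: // backup caught up: give CPU back
		limitPct += 2
	}
	if limitPct < 50 {
		limitPct = 50 // never starve the primary completely
	}
	if limitPct > 100 {
		limitPct = 100
	}
	return limitPct
}

func main() {
	limit := 100.0
	for _, lag := range []float64{1500, 1400, 1200, 900, 400, 80, 60} {
		limit = adjustCPULimit(limit, lag)
		fmt.Printf("observed lag %4.0f ms -> primary CPU limit %3.0f%%\n", lag, limit)
	}
}
```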
Note that such slowdowns of the primary VM are very rare, and typically happen only when the system is under extreme stress. All the performance numbers of Section 5 include the cost of any such slowdowns
3.3 Operation on FT VMs
Another practical matter is dealing with the various control operations that may be applied to the primary VM. For example, if the primary VM is explicitly powered off, the backup VM should be stopped as well, and not attempt to go live. As another example, any resource management change on the primary (such as increased CPU share) should also be applied to the backup. For these kinds of operations, special control entries are sent on the logging channel from the primary to the backup, in order to effect the appropriate operation on the backup.
In general, most operations on the VM should be initiated only on the primary VM. VMware FT then sends any necessary control entry to cause the appropriate change on the backup VM. The only operation that can be done independently on the primary and backup VMs is VMotion. That is, the primary and backup VMs can be VMotioned independently to other hosts. Note that VMware FT ensures that neither VM is moved to the server where the other VM is, since that situation would no longer provide fault tolerance.
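A hypothetical encoding of such control entries (the operation names and fields are mine): they travel on the same logging channel but are management commands, not replayed guest events.

```go
package ftcontrol

// ControlOp is a management action initiated on the primary that must be
// mirrored on the backup; it is not part of guest execution, so it is not
// a replay entry.
type ControlOp int

const (
	PowerOff      ControlOp = iota // primary powered off: backup must stop, not go live
	SetCPUShare                    // resource-management change to mirror on the backup
	DetachLogging                  // e.g. the backup is being VMotioned elsewhere
)

// ControlEntry travels on the logging channel interleaved with replay log
// entries, distinguished by a kind tag in the framing.
type ControlEntry struct {
	Op    ControlOp
	Value int64 // e.g. the new CPU share for SetCPUShare; unused otherwise
}
```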
VMotion of a primary VM adds some complexity over a normal VMotion, since the backup VM must disconnect from the source primary and re-connect to the destination primary VM at the appropriate time. VMotion of a backup VM has a similar issue, but adds an additional complexity. For a normal VMotion, we require that all outstanding disk IOs be quiesced(i.e. completed) just as the final switchover on the VMotion occurs. For a primary VM, this quiescing is easily handled by waiting until the physical IOs complete and delivering these completions to the VM. However, for a backup VM, there is no easy way to cause all IOs to be completed at any required point, since the backup VM must replay the primary VM's execution and complete IOs at the same execution point. The primary VM may be running a workload in which there are always disk IOs in flight during normal execution. VMware FT has a unique method to solve this problem. When a backup VM is at the final switchover point for a VMotion, it requests via the logging channel that the primary VM temporarily quiesce all of its IOs. The backup VM's IOs will then naturally be quiesced as well at a single execution point as it replays the primary VM's execution of the quiescing operation.
3.4 Implementation Issues For Disk IOs
There are a number of subtle implementation issues related to disk IO. First, given that disk operations are non-blocking and so can execute in parallel, simultaneous disk operations that access the same disk location can lead to non-determinism. Also, our implementation of disk IO uses DMA directly to/from the memory of the virtual machines, so simultaneous disk operations that access the same memory pages can also lead to non-determinism. Our solution is generally to detect any such IO races (which are rare), and force such racing disk operations to execute sequentially in the same way on the primary and backup.
Second, a disk operation can also race with a memory access by an application (or OS) in a VM, because the disk operation directly accesses the memory of a VM via DMA. For example, there could be a non-deterministic result if an application/OS in a VM is reading a memory block at the same time a disk read is occurring to that block. This situation is also unlikely, but we must detect it and deal with it if it happens. One solution is to set up page protection temporarily on pages that are targets of disk operations. The page protections result in a trap if the VM happens to make an access to a page that is also the target of an outstanding disk operation, and the VM can be paused until the disk operation completes. Because changing MMU protections on pages is an expensive operation, we choose instead to use bounce buffers. A bounce buffer is a temporary buffer that has the same size as the memory being accessed by a disk operation. A disk read operation is modified to read the specified data to the bounce buffer, and the data is copied to guest memory only as the IO completion is delivered. Similarly, for a disk write operation, the data to be sent is first copied to the bounce buffer, and the disk write is modified to write data from the bounce buffer. The use of the bounce buffer can slow down disk operations, but we have not seen it cause any noticeable performance loss.
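The bounce-buffer idea in miniature (a deliberately simplified model, not the real DMA path): the device fills a private buffer, and guest-visible memory is only updated at the IO-completion point, which deterministic replay places at the same instruction on both VMs.

```go
package main

import "fmt"

// guestMem stands in for the VM's physical memory that DMA would normally
// target directly.
var guestMem = make([]byte, 16)

// diskRead models a disk read with a bounce buffer: the device fills a
// private temporary buffer, so a racing guest access to guestMem can never
// observe a partially completed, timing-dependent DMA.
func diskRead(data []byte) (bounce []byte) {
	bounce = make([]byte, len(data))
	copy(bounce, data) // the "DMA" lands here, not in guest memory
	return bounce
}

// deliverCompletion copies the bounce buffer into guest memory at the point
// where the IO completion is delivered.
func deliverCompletion(bounce []byte, offset int) {
	copy(guestMem[offset:], bounce)
}

func main() {
	b := diskRead([]byte("blockdata"))
	// ... guest keeps running; reads of guestMem still see the old bytes ...
	deliverCompletion(b, 0)
	fmt.Printf("guest memory after completion: %q\n", guestMem[:9])
}
```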
Third, there are some issues associated with disk IOs that are outstanding (i.e. not completed) on the primary when a failure happens, and the backup takes over. There is no way for the newly-promoted primary VM to be sure if the disk IOs were issued to the disk or completed successfully. In addition, because the disk IOs were not issued externally on the backup VM, there will be no explicit IO completion for them as the newly-promoted primary VM continues to run, which would eventually cause the guest operating system in the VM to start an abort or reset procedure. We could send an error completion that indicates that each IO failed, since it is acceptable to return an error even if the IO completed successfully. However, the guest OS might not respond well to errors from its local disk. Instead, we re-issue the pending IOs during the go-live process of the backup VM. Because we have eliminated all races and all IOs specify directly which memory and disk blocks are accessed, these disk operations can be re-issued even if they have already completed successfully (i.e. they are idempotent).
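A sketch of the go-live handling of in-flight disk IOs (the `DiskIO` record and `issue` function are placeholders of mine): because every IO explicitly names the memory and disk blocks it touches and races have been eliminated, blindly re-issuing the whole pending set is safe even if some of the IOs had already completed.

```go
package main

import "fmt"

// DiskIO fully describes an outstanding operation: which disk blocks and
// which guest memory it touches. That is what makes blind re-issue safe.
type DiskIO struct {
	Write     bool
	DiskBlock int64
	MemOffset int64
	Length    int
}

func issue(io DiskIO) {
	// In the real system this would be handed back to the storage stack.
	fmt.Printf("re-issued IO: write=%v block=%d mem=%d len=%d\n",
		io.Write, io.DiskBlock, io.MemOffset, io.Length)
}

// goLive is called when the backup is promoted: IOs that were outstanding on
// the failed primary will never get completion interrupts here, so they are
// simply re-issued; they are idempotent, so duplicates are harmless.
func goLive(pending []DiskIO) {
	for _, io := range pending {
		issue(io)
	}
}

func main() {
	goLive([]DiskIO{
		{Write: false, DiskBlock: 100, MemOffset: 4096, Length: 512},
		{Write: true, DiskBlock: 200, MemOffset: 8192, Length: 512},
	})
}
```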
3.5 Implementation Issues for Network IO
VMware vSphere provides many performance optimizations for VM networking. Some of these optimizations are based on the hypervisor asynchronously updating the state of the virtual machine's network device. For example, receive buffers can be updated directly by the hypervisor while the VM is executing. Unfortunately these asynchronous updates to a VM's state add non-determinism. Unless we can guarantee that all updates happen at the same point in the instruction stream on the primary and the backup, the backup's execution can diverge from that of the primary. The biggest change to the networking emulation code for FT is the disabling of the asynchronous network optimizations. The code that asynchronously updates VM ring buffers with incoming packets has been modified to force the guest to trap to the hypervisor, where it can log the updates and then apply them to the VM. Similarly, code that normally pulls packets out of transmit queues asynchronously is disabled for FT, and instead transmits are done through a trap to the hypervisor (except as noted below).
The elimination of the asynchronous updates of the network device combined with the delaying of sending packets described in Section 2.2 has provided some performance challenges for networking. We've taken two approaches to improving VM network performance while running FT. First, we implemented clustering optimizations to reduce VM traps and interrupts. When the VM is streaming data at a sufficient bit rate, the hypervisor can do one transmit trap per group of packets and, in the best case, zero traps, since it can transmit the packets as part of receiving new packets. Likewise, the hypervisor can reduce the number of interrupts to the VM for incoming packets by only posting the interrupt for a group of packets.
Our second performance optimization for networking involves reducing the delay for transmitted packets. As noted earlier, the hypervisor must delay all transmitted packets until it gets an acknowledgment from the backup for the appropriate log entries. The key to reducing the transmit delay is to reduce the time required to send a log message to the backup and get an acknowledgment. Our primary optimizations in this area involve ensuring that sending and receiving log entries and acknowledgments can all be done without any thread context switch. The VMware vSphere hypervisor allows functions to be registered with the TCP stack that will be called from a deferred-execution context (similar to a tasklet in Linux) whenever TCP data is received. This allows us to quickly handle any incoming log messages on the backup and any acknowledgments received by the primary without any thread context switches. In addition, when the primary VM enqueues a packet to be transmitted, we force an immediate log flush of the associated output log entry (as described in Section 2.2) by scheduling a deferred-execution context to do the flush.
4 DESIGN ALTERNATIVES
(What follows covers design alternatives, experiments, performance evaluation, and so on; there is not much more I need from it, so I am leaving it out for now.)