Fault Tolerance for Service Function Chains
Traffic in enterprise networks typically traverses a sequence of middleboxes forming a service function chain, or simply a chain. Tolerating failures along a chain is imperative to the availability and reliability of enterprise applications: service outages due to chain failures severely impact customers and cause significant financial losses. Making a chain fault-tolerant is challenging because, when a failure occurs, the state of the faulty middleboxes must be recovered correctly and quickly while the chain continues to provide high throughput and low latency. In this paper, we present FTC, a novel system design and protocol for fault-tolerant service function chaining. FTC provides strong consistency with up to f middlebox failures for chains of length f + 1 or longer without requiring dedicated replica nodes. In FTC, state updates caused by packet processing at a middlebox are collected, piggybacked onto the packet, and sent along the chain to be replicated. We implement and evaluate a prototype of FTC. Our results for chains of 2 to 5 middleboxes show that FTC improves throughput by 2 to 3.5x compared with the state of the art [50] and adds only 20 µs of latency overhead per middlebox. In a geo-distributed cloud deployment, our system recovers lost state in 271 ms.
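The piggybacking idea described above can be illustrated with a small Python sketch. This is not the FTC implementation: the class and function names (Packet, Middlebox, run_chain, recover), the per-middlebox packet counter used as example state, and the choice to wrap leftover updates from the end of the chain back to its first middleboxes are all assumptions made for illustration. The sketch also ignores concurrency, packet loss, and the timing of packet release.

```python
from dataclasses import dataclass, field


@dataclass
class Packet:
    payload: str
    # Piggybacked updates as (origin_index, key, value, hops_left) tuples.
    piggyback: list = field(default_factory=list)


class Middlebox:
    def __init__(self, index):
        self.index = index
        self.state = {}     # this middlebox's own state
        self.replicas = {}  # origin_index -> replicated copy of that state

    def process(self, packet, f):
        # 1. Apply updates carried by the packet to the local replicas and
        #    keep forwarding those that still need more copies.
        remaining = []
        for origin, key, value, hops in packet.piggyback:
            self.replicas.setdefault(origin, {})[key] = value
            if hops > 1:
                remaining.append((origin, key, value, hops - 1))
        # 2. Example packet processing: count the packets seen so far.
        count = self.state.get("packets", 0) + 1
        self.state["packets"] = count
        # 3. Piggyback our own update so the next f middleboxes replicate it.
        if f > 0:
            remaining.append((self.index, "packets", count, f))
        packet.piggyback = remaining
        return packet


def run_chain(chain, packet, f):
    """Send one packet through the chain, replicating state along the way."""
    for mb in chain:
        packet = mb.process(packet, f)
    # Assumption of this sketch: updates still pending at the end of the
    # chain are delivered to the first middleboxes of the chain.
    for origin, key, value, hops in packet.piggyback:
        for mb in chain[:hops]:
            mb.replicas.setdefault(origin, {})[key] = value
    packet.piggyback = []
    return packet


def recover(chain, failed_index):
    """Replace a failed middlebox, restoring its state from its successor."""
    successor = chain[(failed_index + 1) % len(chain)]
    fresh = Middlebox(failed_index)
    fresh.state = dict(successor.replicas.get(failed_index, {}))
    chain[failed_index] = fresh
    return fresh
```

With a chain of three middleboxes and f = 1, each middlebox's counter ends up replicated at its successor, so `recover(chain, 1)` rebuilds the state of the second middlebox from the copy held by the third. No dedicated replica node is involved, which is the property the abstract highlights.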