Scalable Tail Latency Estimation for Data Center Networks

by Kevin Zhao, et al.

In this paper, we consider how to provide fast estimates of flow-level tail latency performance for very large-scale data center networks. Network tail latency is often a crucial metric for cloud application performance, and it can be affected by a wide variety of factors, including network load, inter-rack traffic skew, traffic burstiness, flow size distributions, oversubscription, and topology asymmetry. Network simulators such as ns-3 and OMNeT++ can provide accurate answers, but they are very hard to parallelize, taking hours or days to answer what-if questions for a single configuration at even moderate scale. Recent work with MimicNet has shown how to use machine learning to improve simulation performance, but at the cost of a long training step per configuration, and with assumptions about workload and topology uniformity that typically do not hold in practice. We address this gap by developing a set of techniques to provide fast performance estimates for large-scale networks with general traffic matrices and topologies. A key step is to decompose the problem into a large number of parallel, independent single-link simulations; we carefully combine these link-level simulations to produce accurate estimates of end-to-end flow-level performance distributions for the entire network. Like MimicNet, we exploit symmetry where possible to gain additional speedups, but without relying on machine learning, so there is no training delay. On large-scale networks where ns-3 takes 11 to 27 hours to simulate five seconds of network behavior, our techniques run in one to two minutes, with 99th percentile accuracy within 9% for flow completion times.
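
The decompose-then-combine idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the FIFO link model, the independence assumption when summing per-link delays, and the names `simulate_link_fifo`, `combine_path_delays`, and `percentile` are all hypothetical choices made here for illustration. Each link is simulated on its own (so the calls could run in parallel), and an end-to-end delay distribution for a path is then estimated by sampling one delay per link and summing.

```python
import math
import random


def simulate_link_fifo(arrivals, capacity):
    """Toy single-link model (an assumption, not the paper's simulator).

    arrivals: list of (arrival_time, flow_size) tuples.
    capacity: link rate in size units per time unit.
    Returns per-flow delay (queueing + transmission), in arrival order.
    """
    delays = []
    link_free_at = 0.0
    for arrival_time, size in sorted(arrivals):
        start = max(arrival_time, link_free_at)   # wait if the link is busy
        link_free_at = start + size / capacity    # transmission finishes here
        delays.append(link_free_at - arrival_time)
    return delays


def combine_path_delays(link_samples, path, n_draws=10000, seed=0):
    """Estimate end-to-end delays along `path` by summing one randomly drawn
    delay per link. Treating links as independent is an assumption of this
    sketch, not a claim about the paper's combination step."""
    rng = random.Random(seed)
    return [
        sum(rng.choice(link_samples[link]) for link in path)
        for _ in range(n_draws)
    ]


def percentile(samples, q):
    """q-th percentile (0 < q <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical workload: two links, each simulated independently.
    workload = {
        "tor_uplink": [(0.0, 10.0), (0.5, 40.0), (0.6, 5.0)],
        "core_link": [(0.0, 20.0), (0.1, 20.0), (3.0, 10.0)],
    }
    link_samples = {
        name: simulate_link_fifo(flows, capacity=10.0)
        for name, flows in workload.items()
    }
    end_to_end = combine_path_delays(link_samples, ["tor_uplink", "core_link"])
    print("estimated p99 delay:", percentile(end_to_end, 99))
```

Because each `simulate_link_fifo` call depends only on its own link's arrivals, the per-link step is trivially parallel; the accuracy of the combined estimate depends on how well the independence assumption holds for the workload in question.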




SWP: Microsecond Network SLOs Without Priorities

The increasing use of cloud computing for latency-sensitive applications...

DeepConfig: Automating Data Center Network Topologies Management with Machine Learning

In recent years, many techniques have been developed to improve the perf...

Expander Datacenters: From Theory to Practice

Recent work has shown that expander-based data center topologies are rob...

RepNet: Cutting Tail Latency in Data Center Networks with Flow Replication

Data center networks need to provide low latency, especially at the tail...

FLIT-level InfiniBand network simulations of the DAQ system of the LHCb experiment for Run-3

The LHCb (Large Hadron Collider beauty) experiment is designed to study ...

Backpressure Flow Control

Effective congestion control in a multi-tenant data center is becoming i...

Large-Scale Cell-Level Quality of Service Estimation on 5G Networks Using Machine Learning Techniques

This study presents a general machine learning framework to estimate the...
