
Thursday, June 21, 2012

Data Center TCP (DCTCP)

SIGCOMM 2010 paper:
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf

This is certainly one of the best-written papers I have read in a long time. The paper is very easy to read (except for the analysis in Section 3.3) and has powerful results. It brings into focus the bufferbloat problem, which has lately received attention in home networks; this paper shows that it exists in data center networks as well.


Preliminaries
TCP Incast
Read blog post by Brad Hedlund

The paper analyzes measurements from a 6,000-server data center and studies performance anomalies. The network should provide low latency for short message flows, high burst tolerance for incast flows (queries), and high throughput for long-lived, bandwidth-hungry background flows. DCTCP is meant to be used strictly for intra-DC flows. The primary insight of the paper is that the switch buffer occupancy of the large flows should be kept as low as possible without hurting their throughput. This helps in two ways:
1) For delay sensitive short messages this reduces network latency
2) For query traffic, which can cause incast, it helps ensure that sufficient buffer space is available to absorb micro-bursts and hence avoid packet loss; loss adds latency because the sender has to wait for the TCP retransmission timeout (RTO) to expire. Reducing the RTO can fix the incast problem (see the paper) but does not help with point 1 above.

Most commodity switches are shared-memory switches that have a logically common buffer for all switch ports/interfaces (fate sharing). The MMU enforces fairness by setting a maximum amount of memory each interface can take. Packets arriving beyond this limit are dropped, which is exactly what happens during incast.
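To make the shared-memory behavior concrete, here is a toy model of such a buffer (my own illustrative sketch, not the paper's; real MMUs typically use dynamic per-port thresholds rather than the static cap shown here):

```python
# Toy model of a shared-memory switch buffer. A packet is admitted only
# if both the shared pool and the per-interface cap have room; otherwise
# it is dropped -- this is where incast and buffer-pressure losses occur.
class SharedBuffer:
    def __init__(self, total_bytes, per_port_max):
        self.total = total_bytes          # shared pool across all ports
        self.per_port_max = per_port_max  # MMU's per-interface limit
        self.used = 0
        self.port_used = {}

    def enqueue(self, port, pkt_bytes):
        """Admit the packet or drop it (returns False on drop)."""
        if self.used + pkt_bytes > self.total:
            return False  # shared pool exhausted (buffer pressure)
        if self.port_used.get(port, 0) + pkt_bytes > self.per_port_max:
            return False  # per-port cap exceeded (incast burst)
        self.used += pkt_bytes
        self.port_used[port] = self.port_used.get(port, 0) + pkt_bytes
        return True

    def dequeue(self, port, pkt_bytes):
        self.used -= pkt_bytes
        self.port_used[port] -= pkt_bytes
```

Note how the buffer-pressure effect described below falls out of this model: long flows on other ports inflate `used`, so a burst on an otherwise idle port is dropped even though that port is under its own cap.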

Buffer pressure: Since the packet buffer space is shared among all interfaces, long flows on other interfaces can reduce the space available for bursts of short flows on free interfaces leading to loss and timeouts. This does not require the incoming flows to be synchronized.

DCTCP uses the ECN bit, which switches set when their queue length exceeds a threshold K; senders then reduce the congestion window in proportion to the extent of congestion (i.e., the fraction of marked packets). The paper gives concrete guidelines for selecting the values of the estimation gain g and the marking threshold K.
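The sender-side logic can be sketched in a few lines (a simplified rendering of the paper's update rule; cwnd and g values here are illustrative, and real implementations also handle slow start, timeouts, etc.):

```python
# Sketch of the DCTCP sender: maintain alpha, an EWMA estimate of the
# fraction of packets that were ECN-marked, and cut cwnd in proportion
# to alpha once per window instead of halving it like standard TCP.
class DctcpSender:
    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd
        self.g = g        # EWMA gain; the paper derives bounds on g
        self.alpha = 0.0  # estimated fraction of marked packets

    def on_window_acked(self, acked, marked):
        """Called once per window: F = fraction of ACKs carrying ECN-Echo."""
        F = marked / acked if acked else 0.0
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            # DCTCP multiplicative decrease: cwnd <- cwnd * (1 - alpha/2)
            self.cwnd *= (1 - self.alpha / 2)
        else:
            self.cwnd += 1  # standard additive increase
```

The key property: under mild, transient marking alpha stays small and the window is trimmed gently, while under persistent heavy marking alpha approaches 1 and the decrease converges to TCP's usual halving.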

The evaluations are thorough, with key takeaways stated after each experiment.


Tell me why?
1) Why did the authors not compare the performance of their algorithm to TCP Vegas? It would have been interesting to see how DCTCP fares against it, though I agree with the authors' claim that measuring slight increases in RTT is difficult in a data center environment.
2) The authors show that jittering reduces the response time at higher percentiles at the cost of increasing median response time. Why is this bad if all the responses meet the application deadline?
3) Why do the authors compare K against the instantaneous queue length instead of an average queue length? Don't they want to avoid reacting to instantaneous packet bursts?
4) The authors say that using delayed ACKs is important; however, another paper identified delayed ACKs as a serious problem in data centers.



Data center numbers for people who don't have access to data centers
1) The latency targets for data center applications like web search, retail, advertising, and recommendation systems are ~10 ms to ~100 ms. Applications use the Partition/Aggregate workflow pattern. Tasks not completed before their deadline are cancelled and degrade the quality of the final result.
2) 99.91% of traffic is TCP traffic.
3) In the absence of queuing, intra-rack RTT ≈ 100 μs and inter-rack RTT ≈ 250 μs.
4) Each rack in a cluster holds ~40 servers. The data center in this paper has 44 servers per rack. There are typically 20 to 80 VMs per hypervisor.
5) ToR switches with 48 1 Gbps ports and 2 10 Gbps ports have 4 MB of shared packet buffer.
6) Median number of concurrent flows active in a 50ms window on each worker node is 36.
7) To avoid TCP incast, developers limit the size of each response to 2 KB and introduce application-level jittering to desynchronize the responses from worker nodes.
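Putting numbers 3) and 5) together shows why bufferbloat matters here. In the worst case, a 4 MB shared buffer backed up behind a single 1 Gbps port takes tens of milliseconds to drain, swamping the ~100 μs no-queuing RTT (a back-of-the-envelope calculation of my own, not a figure from the paper):

```python
# Back-of-the-envelope: queuing delay if long flows fill the shared
# buffer and it all drains through one 1 Gbps port.
buffer_bytes = 4 * 1024 * 1024   # 4 MB ToR shared buffer
link_bps = 1e9                   # one 1 Gbps port
drain_seconds = buffer_bytes * 8 / link_bps
print(round(drain_seconds * 1000, 1), "ms")  # roughly 33.6 ms vs. a 0.1 ms RTT
```

Even if only a fraction of the buffer sits behind any one port in practice, a few milliseconds of queuing is still one to two orders of magnitude above the fabric RTT, which is exactly the latency tax DCTCP tries to eliminate for short flows.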


Notes to future me
Read TCP Vegas paper
How Facebook handles scale
Remember TCP Large Send Offload

Random quote:
The biggest challenge is to find a novel solution to a novel problem which people in academia deem novel
