
Saturday, June 2, 2012

A Scalable, Commodity Data Center Network Architecture

First read this nice article on data center architectures by Brad Hedlund:
http://bradhedlund.com/2011/04/21/data-center-scale-openflow-sdn/

The objective of the paper is the following: achieve full bisection bandwidth (if the network is split into two equal halves, this is the bandwidth available between the halves) between all the hosts in the network using commodity, incrementally scalable Ethernet switches, instead of the over-subscribed hierarchies of traditional designs.

Here is a traditional L3/L2 design:


There are typically 20-40 servers per rack. Each server is connected to a top-of-rack (ToR) switch via a 1 Gbps link. The ToR switches have 10 Gbps uplinks to aggregation routers (ARs). ToRs have 48 ports, while ARs and core routers (CRs) have 128 ports; ARs and CRs have only 10 Gbps ports. So the target is a data center with 25,000+ hosts and a full bisection bandwidth of 1 Gbps per host.
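As a rough illustration of what over-subscription means here (the number of ToR uplinks below is my assumption, not something stated in the paper):

# Back-of-the-envelope oversubscription at the ToR in the traditional design.
servers_per_rack = 40     # from the text: 20-40 servers per rack
host_link_gbps = 1
uplinks_per_tor = 2       # assumed for illustration
uplink_gbps = 10

down = servers_per_rack * host_link_gbps   # 40 Gbps toward the hosts
up = uplinks_per_tor * uplink_gbps         # 20 Gbps toward the ARs
print(f"ToR oversubscription: {down / up:.0f}:1")   # 2:1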

The paper proposes a Clos network/fat-tree as the new network architecture for achieving this goal. They propose an L3/L2 architecture:



k - Number of ports of the switches used in the fat-tree
There are k pods, and each pod has 2 layers of k/2 switches. Each edge switch is connected to k/2 hosts. There are (k/2)² CRs. The topology can support (k/2)·(k/2)·k = k³/4 hosts. Between any two hosts in different pods there are (k/2)² equal-cost paths. We must spread the traffic over these paths somehow to achieve the full bisection bandwidth.
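To make the arithmetic concrete, here is a small Python sketch (mine, not the paper's) that computes these quantities for a given switch port count k; with k = 48 it reproduces the 27,648-host figure from the paper.

# Sizes of a fat-tree built from k-port switches, following the formulas above.
def fat_tree_sizes(k):
    pods = k
    switches_per_pod = 2 * (k // 2)        # k/2 edge + k/2 aggregation switches
    core_switches = (k // 2) ** 2
    hosts = (k // 2) ** 2 * k              # k^3 / 4
    paths_between_pods = (k // 2) ** 2     # equal-cost paths between hosts in different pods
    return pods, switches_per_pod, core_switches, hosts, paths_between_pods

print(fat_tree_sizes(4))    # (4, 4, 4, 16, 4)
print(fat_tree_sizes(48))   # (48, 48, 576, 27648, 576)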
Hosts connected to the same edge switch form a /24 subnet. VMs can migrate only within their own subnet; the paper does not discuss VLANs or migration between subnets.


Addressing (For a 10/8 network)
CR - 10.k.j.i - i and j lie in the range [1,k/2]
Pod switch - 10.pod.switch.1 (switch number lies in [0,k-1])
Host - 10.pod.switch.ID (ID lies in [2,k/2+1]) - This is very wasteful
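A quick sketch (my own, not code from the paper) that spells the scheme out; the function names are just for illustration.

# Addressing for a 10/8 fat-tree network built from k-port switches.
def core_address(k, j, i):
    # core switch (j, i), with i and j in [1, k/2]
    return f"10.{k}.{j}.{i}"

def pod_switch_address(pod, switch):
    # switch in [0, k-1]: 0..k/2-1 are edge switches, k/2..k-1 are upper (AR) switches
    return f"10.{pod}.{switch}.1"

def host_address(pod, switch, host_id):
    # host_id in [2, k/2 + 1]; hosts hang off edge switch `switch` in pod `pod`
    return f"10.{pod}.{switch}.{host_id}"

k = 4
print(core_address(k, 1, 2))     # 10.4.1.2
print(pod_switch_address(0, 1))  # 10.0.1.1
print(host_address(0, 1, 3))     # 10.0.1.3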


Routing
Here is what the routing table of a pod switch looks like:


- There are k/2 prefixes for the subnets in the same pod and k/2 suffixes for the k/2 possible host IDs. Prefix matches are preferred over suffix matches (see the sketch after this list).
- For intra-pod traffic, the lower-level (edge) pod switches send the packet to the upper level (split evenly across the upper-layer switches), which sends it back down to the destination edge switch. For inter-pod traffic, the ARs spread the traffic among the CRs based on the last byte of the destination IP address (the host ID). TCP ordering is not broken because packets to the same destination always follow the same path up to the core.
- The CRs store aggregated prefixes for all the pods' subnets (i.e., k /16 networks), which should not take up much space in the forwarding tables. The traffic spreading takes place only on the way up, from the edge switches to the CRs.
- The paper advocates the use of a central controller (OpenFlow, anyone?) to generate the routing tables and statically install them on the switches. Since the routing tables will not change, this seems alright.
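To illustrate the two-level lookup described above, here is a small Python sketch for an upper (AR) switch in a k-port fat-tree. The port numbering and the exact suffix-to-uplink assignment are my assumptions; the paper only requires that prefix matches win and that suffixes spread inter-pod traffic over the uplinks.

# Two-level routing table of an upper pod switch: /24 prefixes point down,
# a default route falls through to a suffix table keyed by the host-ID byte.
def build_tables(k, pod):
    # primary table: one /24 prefix per subnet in this pod -> downlink port
    prefixes = {f"10.{pod}.{sw}.0/24": sw for sw in range(k // 2)}
    # secondary table: host ID -> uplink port (ports k/2 .. k-1), assumed round-robin
    suffixes = {hid: k // 2 + (hid - 2) % (k // 2) for hid in range(2, k // 2 + 2)}
    return prefixes, suffixes

def lookup(dst_ip, prefixes, suffixes):
    o = [int(x) for x in dst_ip.split(".")]
    prefix = f"10.{o[1]}.{o[2]}.0/24"
    if prefix in prefixes:
        return prefixes[prefix]   # intra-pod destination: prefix match, go down
    return suffixes[o[3]]         # inter-pod destination: suffix match, go up

prefixes, suffixes = build_tables(4, pod=0)
print(lookup("10.0.1.2", prefixes, suffixes))  # 1: down to edge switch 1
print(lookup("10.2.0.3", prefixes, suffixes))  # 3: up toward a core switch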
