
Friday, June 1, 2012

VL2: A Scalable and Flexible Data Center Network

This paper raises some important research questions in the field of data center networking. It combines a measurement study with a new architecture.
The paper proposes an architecture that allows the data center to become a single large Layer-2 domain. I will not go into the advantages of doing so: low cost, zero configuration (apart from DHCP), scalable services, easy VM migration, etc.

Q1. How to provide high-capacity throughput to intra-DC flows in the face of over-subscription?

Q2. How to isolate the traffic of one service from another? What prevents one service from flooding the network with its own traffic?
The paper does not really solve this problem.

Q3. How to make Ethernet broadcasts scalable? VLANs don't scale well.
Only ARP and DHCP are made scalable; multicast is not discussed.

System design:
1) Clos topology (the links between the ARs and CRs form a complete bipartite graph; see the sketch below)
2) Virtual Layer-2 (hence the name VL2)
3) VLB and ECMP
4) Central directory service

Here is the topology (the paper's figure, omitted here): ToRs connect up to the ARs, and the AR and CR layers form a complete bipartite graph.

The CRs, ARs, and ToRs run OSPF to learn the switch-level topology and compute shortest-path routes.
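
To make the wiring concrete, here is a minimal sketch (Python, with illustrative switch counts and hypothetical names, not taken from the paper) that builds such a Clos topology: a complete bipartite graph between the AR and CR layers, with each ToR dual-homed to a pair of ARs as in the paper's figure.

```python
from itertools import product

def build_clos(num_cr=4, num_ar=6, tors_per_ar_pair=8):
    """Build an illustrative VL2-style Clos topology as a set of links."""
    crs = [f"CR{i}" for i in range(num_cr)]
    ars = [f"AR{i}" for i in range(num_ar)]
    links = set()

    # Complete bipartite graph between the AR and CR layers.
    for ar, cr in product(ars, crs):
        links.add((ar, cr))

    # Each ToR dual-homes to a pair of adjacent ARs.
    tors = []
    for pair in range(num_ar // 2):
        ar_a, ar_b = ars[2 * pair], ars[2 * pair + 1]
        for t in range(tors_per_ar_pair):
            tor = f"ToR{pair}_{t}"
            tors.append(tor)
            links.add((tor, ar_a))
            links.add((tor, ar_b))

    return crs, ars, tors, links

crs, ars, tors, links = build_clos()
print(len(links), "links;", len(tors), "ToRs")
```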

The measurement study of the data center they analyzed showed two properties:
1) The traffic pattern inside a DC is very unpredictable and changes rapidly
2) The ratio of server-to-server traffic volume to traffic volume leaving/entering the DC is 4:1

VL2 routing
- VLB is used to spread traffic across the CRs: take a random path from the source ToR up to a CR, and from that CR down to the destination ToR.
- Anycast addresses are used at the CRs and ToRs to enable ECMP forwarding.
- There are two address spaces: application-specific addresses (AAs) and location-specific addresses (LAs). The directory system maintains the AA-to-LA mapping. Every server in the DC runs a VL2 agent in its network stack. The agent traps ARP requests and instead sends a unicast query to the directory service, caching the reply. It is also responsible for tunneling (IP-in-IP) packets to the destination host's ToR switch, and it rate-limits broadcast traffic to prevent storms (see the sketch after this list).
- DHCP is implemented with relay agents in each ToR and a central DHCP server. The directory service is notified of the LA-AA mapping when a new server joins the network and again after a server migrates to a new rack (detected via gratuitous ARP).
- The directory service has two components:
1) Read-optimized directory servers (they store soft state, which may be stale)
2) Write-optimized, strongly consistent RSM servers, which run Paxos for reliability and fault tolerance.
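
To tie these pieces together, here is a hedged sketch of the agent's send path as described above; VL2Agent, resolve, and the dict-based directory are hypothetical stand-ins for the real kernel agent and the remote directory service:

```python
import random

class VL2Agent:
    def __init__(self, directory, intermediate_anycast_las):
        self.directory = directory                    # AA -> destination-ToR LA (stand-in for the directory service)
        self.anycast_las = intermediate_anycast_las   # anycast LAs of the intermediate switches
        self.cache = {}                               # cached AA -> LA mappings (soft state, may go stale)

    def resolve(self, dest_aa):
        # Trap what would have been an ARP broadcast and turn it into a
        # unicast directory lookup, caching the reply.
        if dest_aa not in self.cache:
            self.cache[dest_aa] = self.directory[dest_aa]
        return self.cache[dest_aa]

    def send(self, payload, src_aa, dest_aa):
        tor_la = self.resolve(dest_aa)
        # VLB: pick a random intermediate anycast address for this flow.
        intermediate_la = random.choice(self.anycast_las)
        # Double IP-in-IP encapsulation: outer header toward the intermediate
        # switch, next header toward the destination ToR, inner AA packet intact.
        return {
            "outer_dst": intermediate_la,
            "mid_dst": tor_la,
            "inner": {"src": src_aa, "dst": dest_aa, "payload": payload},
        }

# Illustrative addresses only.
directory = {"10.1.1.5": "20.0.3.1"}                  # AA -> LA of that server's ToR
agent = VL2Agent(directory, ["30.0.0.1", "30.0.0.2"])
print(agent.send(b"hello", "10.1.1.4", "10.1.1.5"))
```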

Here is how routing is done (the paper's figure, omitted here): the source agent stacks IP headers so the packet bounces off a random intermediate switch before reaching the destination ToR.

VL2 utilizes both VLB and ECMP, each complementing the other's weakness. If you use only VLB:
Each agent is given a random intermediate router address. We don't need anycast addresses; each router interface can have a different IP, and OSPF takes care of reachability. We stack up the IP headers at the source and off we go. Now, what happens if the intermediate router fails? We would have to update a huge number of VL2 agents to use a new random router.
If we use ECMP alone: configure every router's interface with the same unicast address (e.g., 10.0.0.1). ECMP ensures that any one of these can be taken as the next hop. In this case we don't need multiple encapsulations; a single encapsulation with H(ft) | dest_ToR_IP is enough. But commodity switches support only 16-way ECMP, so we can't spread traffic across all the intermediate routers.
The solution is to combine VLB and ECMP: we define several anycast addresses, each associated with only as many intermediate switches as ECMP can accommodate. The agent is given a random intermediate anycast address and performs multiple encapsulations. OSPF and ECMP react to link/switch failures, eliminating the need to notify agents. A sketch follows.
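
Here is a small sketch of that combination, under my reading of the paper; the anycast addresses, the 16-way limit constant, and the CRC32 stand-in for the flow hash H(ft) are illustrative choices, not the paper's actual values:

```python
import random
import zlib

ECMP_WAYS = 16  # commodity-switch ECMP limit mentioned above

def assign_anycast(intermediate_switches):
    """Partition switches into ECMP-sized groups, one anycast LA per group."""
    groups = {}
    for i in range(0, len(intermediate_switches), ECMP_WAYS):
        anycast_la = f"30.0.0.{i // ECMP_WAYS + 1}"   # hypothetical anycast addresses
        groups[anycast_la] = intermediate_switches[i:i + ECMP_WAYS]
    return groups

def pick_path(groups, five_tuple):
    # VLB: the agent picks a random anycast address, i.e. a random group.
    anycast_la = random.choice(list(groups))
    # ECMP: the flow hash selects one switch within that group. On failure,
    # OSPF withdraws the dead switch and ECMP rehashes, so agents need no
    # notification.
    members = groups[anycast_la]
    switch = members[zlib.crc32(repr(five_tuple).encode()) % len(members)]
    return anycast_la, switch

groups = assign_anycast([f"Int{i}" for i in range(40)])   # 40 switches -> 3 anycast groups
print(pick_path(groups, ("10.1.1.4", "10.1.1.5", 6, 5555, 80)))
```

With 40 intermediate switches this yields three anycast groups, so a failed switch only affects flows hashed onto it within its own group.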

Drawbacks
1) The complete bipartite graph will limit the scalability of this approach.
2) Does not talk about how to handle multiple tenants, each using the same private IP address space.
3) Configuring routers for OSPF is not scalable and is an error-prone process. I am not sure OSPF itself will scale (each router stores all the LSAs and broadcasts HELLO packets). Areas would have to be configured, which creates further configuration overhead.
4) Every outgoing packet needs to be encapsulated (and the agent computes a hash and puts it in each packet). This increases latency and reduces useful bandwidth. If we had multiple layers in the data center hierarchy, I don't think this approach would scale.
5) Hosts that need to communicate with the outside world are given both an LA and an AA. This essentially fixes their position in the topology and further increases the burden on OSPF (these IPs are not aggregated).
6) VL2 is only eventually consistent for AA-LA mappings, which implies that VM migration will see more downtime and packet loss.

Random
- A commodity switch stores only 16K entries in its routing table
- A ToR informs the directory server when it receives a packet for which it has no mapping, which triggers the directory server to update the VL2 agents' caches. This could enable DoS attacks on the directory servers.
- The servers had delayed ACKs turned on, which should be turned off.
- What is a hose model? It is not really clear from the paper; the cited reference discusses it in terms of VPNs.


