
Thursday, May 31, 2012

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

This is an influential paper in the Data Center arena. You can also see a video presentation of the paper here.

The basic idea of the paper is very simple: make Ethernet scalable by making its addressing hierarchical, like IP. That's it. I personally think SEATTLE is a more elegant solution to the problem.

The first two sections give the usual rhetoric about how L2 is simple but not scalable, and how L3 is scalable but hinders mobility and is prone to configuration errors. The Related Work section is thorough.

Design:
1) Fat-tree topology


2) Fabric manager (a centralized directory service)

3) A distributed Location Discovery Protocol (LDP): switches periodically exchange packets, much like LSPs in OSPF, both to check whether a neighboring switch/host is alive and to learn their own pod and position numbers.
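
To make item 3 concrete, here is a minimal sketch of what a Location Discovery Message (LDM) might carry, going by the fields the paper describes (switch identifier, pod, position, tree level, up/down direction). The names, types, and level numbering are my own illustration, not the paper's wire format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LDM:
    """One Location Discovery Message, sent periodically on every switch port."""
    switch_id: int            # unique identifier of the sending switch
    pod: Optional[int]        # pod number, None until the switch has learned it
    position: Optional[int]   # position within the pod, None until learned
    level: Optional[int]      # tree level; here 0 = edge, 1 = aggregation, 2 = core (my convention)
    is_up_port: bool          # whether this LDM goes out an upward-facing port
```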

The paper defines a 48-bit Pseudo MAC address (PMAC) which encodes the location of a host in the topology. It has the form pod.position.port.vmid (hierarchical, like an IP address).
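
A quick sketch of how such an address can be packed into 48 bits, assuming the 16-bit pod, 8-bit position, 8-bit port, 16-bit vmid split used in the paper (the host in the example output is made up):

```python
def encode_pmac(pod: int, position: int, port: int, vmid: int) -> int:
    """Pack pod.position.port.vmid into a 48-bit PMAC (16/8/8/16-bit fields)."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac: int) -> tuple:
    """Split a 48-bit PMAC back into (pod, position, port, vmid)."""
    return (pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF

# A host behind port 2 of the edge switch at pod 1, position 1, vmid 0:
print(":".join(f"{b:02x}" for b in encode_pmac(1, 1, 2, 0).to_bytes(6, "big")))
# -> 00:01:01:02:00:00
```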

Let's see how a packet travels from Host 1 (10.0.1.2) in Pod 0 to Host 2 (10.1.1.2) in Pod 1.

- Host 1 sends an ARP: Who has 10.1.1.2?
- The edge switch intercepts the ARP. If it sees a source MAC it hasn't seen before, it generates a PMAC for it (the edge switch assigns a monotonically increasing vmid to the MACs it sees on a particular port) and creates an entry in its local PMAC table mapping the host's AMAC (actual MAC) and IP to the PMAC. It then communicates the IP-to-PMAC mapping to the fabric manager, which uses it to reply to ARP requests from other hosts; the local table is used to translate the PMAC back to the AMAC on packets destined to the host. The edge switch also asks the fabric manager whether it knows the PMAC for 10.1.1.2.
If it does:
The edge switch constructs an ARP reply with the PMAC from the fabric manager and sends it to the host. The host thus learns the PMAC (48 bits) instead of the AMAC; this is why the PMAC is kept at 48 bits.
If it does not:
The ARP is broadcast as in standard Ethernet. The target host replies with its AMAC. The target host's edge switch rewrites the AMAC to the PMAC before forwarding the reply to the querying host and the fabric manager. (This is how the fabric manager gets repopulated after a failure.)
- 10.0.1.2 sends the packet to the PMAC it has learnt. The last-hop edge switch translates the PMAC back to the AMAC before delivering it to the target host. End hosts never see their own PMACs but cache the PMACs of other hosts. (The sketch right after this list pulls the edge-switch steps together.)
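
Here is a toy sketch (mine, not the paper's implementation) of that proxy-ARP logic at the edge switch. The FabricManager below is just a dictionary, whereas in PortLand it is a separate process reached over OpenFlow; encode_pmac is reused from the earlier sketch.

```python
class FabricManager:
    """Stands in for the central directory; in reality a separate process."""
    def __init__(self):
        self.ip_to_pmac = {}                    # soft state: IP -> PMAC

    def register(self, ip, pmac):
        self.ip_to_pmac[ip] = pmac

    def lookup(self, ip):
        return self.ip_to_pmac.get(ip)          # None on a miss


class EdgeSwitch:
    def __init__(self, pod, position, fm):
        self.pod, self.position, self.fm = pod, position, fm
        self.pmac_table = {}                    # (AMAC, IP) -> PMAC
        self.next_vmid = {}                     # per-port vmid counter

    def on_arp_request(self, port, src_amac, src_ip, target_ip):
        # New source MAC on this port: assign a PMAC, remember the mapping
        # locally, and push the IP -> PMAC binding to the fabric manager.
        if (src_amac, src_ip) not in self.pmac_table:
            vmid = self.next_vmid.get(port, 0)
            self.next_vmid[port] = vmid + 1
            pmac = encode_pmac(self.pod, self.position, port, vmid)
            self.pmac_table[(src_amac, src_ip)] = pmac
            self.fm.register(src_ip, pmac)

        # Reply with the target's PMAC if the fabric manager knows it;
        # a None return means "fall back to broadcasting the ARP".
        return self.fm.lookup(target_ip)


fm = FabricManager()
sw = EdgeSwitch(pod=0, position=1, fm=fm)
print(sw.on_arp_request(port=2, src_amac="aa:bb:cc:dd:ee:ff",
                        src_ip="10.0.1.2", target_ip="10.1.1.2"))   # None: miss, so broadcast
```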

Forwarding is pretty simple. If the destination PMAC is in the same pod but a different position, forward it to an aggregation switch; if it is in a different pod, the aggregation switch forwards it to a core switch. Through LDP, each switch learns which switch is connected to it on which port. The forwarding state on a switch is aggregatable like IP, so it does not take up much space.
The PMACs can support expansion of the DC (adding another pod). The multi-rooted tree topology, however, should remain fixed, which is a fair assumption.
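
As a rough sketch (again mine, reusing decode_pmac from above), the forwarding decision boils down to comparing the pod and position fields of the destination PMAC against the switch's own; real PortLand switches do this with aggregatable prefix entries in their flow tables rather than code like this.

```python
def forward(level, my_pod, my_position, dst_pmac):
    """Return a coarse forwarding decision for a frame addressed to dst_pmac."""
    pod, position, port, _vmid = decode_pmac(dst_pmac)
    if level == 2:                              # core: choose the downlink toward the pod
        return ("down-to-pod", pod)
    if level == 1:                              # aggregation switch
        if pod == my_pod:
            return ("down-to-position", position)
        return ("up", None)                     # any core uplink will do (ECMP)
    # level == 0: edge switch
    if pod == my_pod and position == my_position:
        return ("deliver", port)                # rewrite PMAC -> AMAC, send out this port
    return ("up", None)                         # any aggregation uplink will do
```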

A good note:
When considering a distributed versus centralized protocol, there is an inherent trade-off between protocol simplicity and system robustness.

The fabric manager holds only soft state, essentially the IP-to-PMAC mappings reported by the edge switches.

Now the fabric manager will contain 1 million entries (corresponding to all the hosts in the DC). Is this scalable? DHT anyone?
The switches send LDMs to each other and the fabric manager also monitors connectivity to each switch module. Is all this broadcast traffic scalable?
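
As a back-of-envelope check on the first question (my numbers, not the paper's): even at a generous 32 bytes per IP-to-PMAC entry, a million hosts is only tens of megabytes of soft state, so raw memory is probably not the bottleneck; the query and monitoring load in the second question is the harder part.

```python
# Rough estimate of the fabric manager's table size (assumed 32 B/entry:
# 4 B IPv4 + 6 B PMAC + hash-table overhead; the exact figure is a guess).
hosts = 1_000_000
bytes_per_entry = 32
print(f"{hosts * bytes_per_entry / 2**20:.0f} MiB")   # ~31 MiB
```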

Drawbacks:
1) What about traffic coming in from the Internet? How do you scalably handle all that traffic?
2) The paper says the approach works for any multi-rooted tree and takes the fat-tree topology as an example, but that is easier said than done. If there is one more layer between the aggregation and core layers, their LDP cannot work without help from the fabric manager.
3) VL2 runs OSPF and PortLand runs LDP, both of which require some configuration and broadcasts. Also, both schemes use indirection, which involves caching information for speedup. So when a cache entry becomes invalid, the central controller/fabric manager has to take care of things, and this is tedious.
4) How does it account for middleboxes (MBs) in a data center? What if an MB is filtering on destination MAC or something?
5) It cannot be deployed alongside an existing data center topology, because legacy switches don't understand MAC prefixes or LDP. Even existing OpenFlow switches don't understand LDP packets, so these will have to be processed by the controller. Not backwards compatible.

