
Friday, June 22, 2012

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

This is a SIGCOMM 2008 paper:
http://www.cs.princeton.edu/~chkim/Research/SEATTLE/seattle.pdf

Just for fun, you can see network conference acceptance stats here

I had read the paper before but felt too lazy to write a post about it. I finally got around to it. This paper can be described in two words: Consistent Hashing. Ok, maybe three: Consistent Hashing Ethernet. Definitely four: Consistent Hashing Ethernet Scalable. Read my earlier post on Consistent Hashing to refresh your concepts.

The basic premise of the paper is very simple, but things get convoluted when you introduce caching, timeouts, mobility and failures. I felt the paper could have addressed these scenarios more lucidly. Also, I felt the paper was badly organized, especially in the middle sections. The good thing is that it only requires changes to the switch architecture and control plane, not to the end hosts. But if a simple protocol like TRILL took (what, about 7 years?) to get adopted by Cisco, this will take another decade or so. Here is the rundown.

Why is Ethernet good?
- Plug-and-play
- Host mobility

Why is Ethernet bad?
- Source MAC learning requires flooding. DHCP and ARP packets are flooded, which wastes link bandwidth and processing resources.
- Broadcast storms because there is no TTL in the header. 
- Spanning tree computation under-utilizes links. Frames are not always sent via shortest paths. Root bridge must be selected carefully as it carries most of the traffic.
- Lots of security vulnerabilities and privacy concerns: snooping on other people's traffic, ARP spoofing, DoS attacks.


L3/L2 architecture: Pools of Ethernet domains connected by IP routers.
- Subnetting wastes IP addresses and increases configuration overhead for both routers and DHCP servers.
- Host mobility is restricted to within the subnet.


Enter VLAN
- No restrictions on host mobility, though we need to trunk the VLAN at all the bridges.
- Scopes broadcasts to small domains
- Each subnet is treated as a VLAN (http://www.youtube.com/watch?v=PpuWBh3EnVM)
- Need to compute multiple spanning trees now and choose root bridges carefully so as to effectively utilize all links and balance load
- Limited number: 4094 (the VLAN ID is 12 bits, and the values 0 and 4095 are reserved)
- Switches still need to maintain forwarding table data for all hosts in VLANs provisioned to them

In SEATTLE, the switches run a link-state routing protocol (OSPF) among themselves. This is scalable because it maintains only the switch-level topology and does not disseminate end-host information. As the network gets bigger, though, it would hit a limit: large forwarding tables and LSA broadcasts will cause scalability issues. The paper proposes a remedy for that by dividing the switches into regions (very similar to OSPF areas). Since we are running OSPF between switches, if a switch fails all the other switches will eventually know about it. OSPF also helps route packets along the shortest path.

We use consistent hashing to evenly distribute and store (IP,MAC) and (MAC,host_switch_id) key-value pairs among the existing switches. Each switch chooses the MAC address of one of its interfaces as its unique network-wide switch_id. Since we use consistent hashing, the information is spread almost equally across all the switches. So, instead of every switch storing all the information, each one now stores only a subset of it. Switches cache information to avoid doing repeated lookups (more below). Powerful switches can have more nodes in the key space (virtual switches) to hold more information. We can also hash information for services like PRINTER and DHCP_SERVER and store it on switches, implement anycasting by storing the same IP address with different MAC addresses, or change the MAC address for fail-over. This is a tangent though.

It uses loop-free multicasting to handle general broadcasts, and groups instead of VLANs (broadcasting in a group sends unicasts to group members). This was a little hand-wavy in the paper, though.

H - Hash function used by all switches
H(k) = rk, the resolver switch for key k. If (k,v) = (MAC,host_switch_id), rk is called the location resolver. If (k,v) = (IP,MAC), rk is called the address resolver.

We have two (k,v) stores (IP,MAC) and (MAC,host_switch_id). 
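To make H concrete, here is a minimal Python sketch of a consistent-hash ring over switch IDs, with virtual nodes for the more powerful switches. The class name `Ring`, the `vnodes` parameter and the use of SHA-1 are my own assumptions for illustration; the paper does not prescribe any of them.

```python
import bisect
import hashlib


def _point(label: str) -> int:
    """Map a string to a ring position (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")


class Ring:
    """Hypothetical consistent-hash ring; not the paper's implementation."""

    def __init__(self) -> None:
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> switch_id

    def add_switch(self, switch_id: str, vnodes: int = 1) -> None:
        # A powerful switch can ask for more virtual nodes to hold more keys.
        for i in range(vnodes):
            p = _point(f"{switch_id}#{i}")
            bisect.insort(self._points, p)
            self._owner[p] = switch_id

    def remove_switch(self, switch_id: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != switch_id]
        self._owner = {p: s for p, s in self._owner.items() if s != switch_id}

    def resolver(self, key: str) -> str:
        """H(k): the first switch clockwise from the key's ring position."""
        if not self._points:
            raise LookupError("no switches in the ring")
        i = bisect.bisect_right(self._points, _point(key)) % len(self._points)
        return self._owner[self._points[i]]
```

The property that matters: when a switch leaves, only the keys it was resolving move to another switch. Every other key keeps its resolver, so most published entries stay where they are.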

How is the packet sent from src to dst? Does the host switch see the MAC address and tunnel the packet to the appropriate switch, or do they attach a separate header like TRILL? The paper does not say anything about this.

What information does each switch store?
- (MAC,port) key value pairs of directly attached hosts.
- (IP,MAC) and (MAC,host_switch_id) key value pairs of directly attached hosts. The switch must delete these once it detects that the host MAC is unreachable. The location resolver informs the switch if the host has migrated to another switch. In this case the switch updates the entry, forwards the packets to the new switch, informs other switches of this change, and possibly deletes the entry after a timeout.
- (MAC,host_switch_id) key value pairs for the part of key space it is responsible for.
- (MAC,host_switch_id) key value pairs for the MACs to which hosts directly attached to the switch are sending packets. This is not very big because most hosts communicate with a small number of popular hosts. This is removed after timeout.
- (IP,MAC) or (IP,MAC,host_switch_id) translation for the part of key space it is responsible for. This information is used only for replying to ARP requests.

When a host ha joins switch sa
The switch publishes (IPa,MACa) and (MACa,sa) to the resolvers determined by H. It is now responsible for monitoring the liveness of the resolvers (how frequently is this done?) and republishing the host information. What if the publisher is the resolver?
The switches also learn through OSPF when another switch dies or a new switch enters the network. When a switch dies, they go through their cached (k,v) pairs and remove the entries where v is the dead switch. If a new switch has come up and (k,v) pairs have been re-published to it, the old switch still keeps those pairs to answer queries from other switches that don't know about the new switch yet, and removes them after a fixed timeout. How is this timeout decided?
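A rough Python sketch of the publish and purge steps, as I understand them. The `Switch` class, its three table names and `publish_host` are hypothetical names of mine, and `H` is passed in as any function from a key to a switch ID.

```python
from typing import Callable, Dict


class Switch:
    """Hypothetical per-switch state; the table names are illustrative."""

    def __init__(self, switch_id: str) -> None:
        self.switch_id = switch_id
        self.arp_store: Dict[str, str] = {}  # IP -> MAC, for keys this switch resolves
        self.loc_store: Dict[str, str] = {}  # MAC -> host_switch_id, for keys it resolves
        self.loc_cache: Dict[str, str] = {}  # MAC -> host_switch_id, learned on demand

    def on_switch_down(self, dead_id: str) -> None:
        """Drop every location entry whose value is the failed switch."""
        for table in (self.loc_store, self.loc_cache):
            for mac in [m for m, sw in table.items() if sw == dead_id]:
                del table[mac]


def publish_host(ip: str, mac: str, host_switch_id: str,
                 switches: Dict[str, Switch], H: Callable[[str], str]) -> None:
    """Store (IP, MAC) and (MAC, host_switch_id) at the resolvers chosen by H."""
    switches[H(ip)].arp_store[ip] = mac
    switches[H(mac)].loc_store[mac] = host_switch_id
```

In this sketch, a publisher that happens to be its own resolver simply writes into its own tables. Whether the paper intends exactly that is the open question above.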

ha attached to switch sa wants to communicate with hb attached to switch sb
ha broadcasts an ARP request. sa intercepts it and forwards it to the resolver. The resolver sends back MACb and, if the optimization is enabled, sb as well. sa sends an ARP reply to ha.
With the optimization, sa now tunnels all packets destined to MACb directly to sb.
Without optimization:
sa calculates H(MACb) and tunnels the packet to the resolver (IP tunneling, I guess). The resolver de-tunnels it and finds that the packet must be sent to MACb, which it knows how to reach. So it tunnels the packet to sb and also tells sa, so that sa can cache this information and use the shortest path from next time. When a switch comes alive after a reboot, it uses this process to repopulate its cache.
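The non-optimized path plus caching can be sketched in Python as below. `tunnel_path` and its arguments are made-up names for illustration: `sa_cache` is sa's (MAC,host_switch_id) cache, and `loc_stores` maps each switch ID to the location entries that switch resolves.

```python
from typing import Callable, Dict, List


def tunnel_path(dst_mac: str,
                sa_cache: Dict[str, str],
                loc_stores: Dict[str, Dict[str, str]],
                H: Callable[[str], str]) -> List[str]:
    """Return the switch IDs a frame leaving sa is tunneled to, in order."""
    if dst_mac in sa_cache:           # cache hit: tunnel straight to sb
        return [sa_cache[dst_mac]]
    rk = H(dst_mac)                   # cache miss: go through the location resolver
    sb = loc_stores[rk][dst_mac]      # the resolver knows where MACb is attached
    sa_cache[dst_mac] = sb            # the resolver tells sa, which caches the location
    return [rk] if rk == sb else [rk, sb]
```

Only the first packet takes the detour through the resolver. Later packets to the same MAC hit the cache and go directly to sb.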


Host Dynamics
3 types of dynamics:
1) Host changes location
The new host switch updates the (MAC,host_switch_id) entry to reflect the changes. 

2) Host changes MAC
The host's switch updates (IP,MAC), deletes the old (MAC,host_switch_id) entry and inserts a new one. Other hosts might still have the old information cached. So the switch maintains a MAC revocation list and sends a gratuitous ARP to hosts sending packets to the old MAC. It also informs the sending host's switch about the new entry (NewMAC, switchid). In the meantime it can still forward the packets sent to the old MAC to the host.

3) Host changes IP
Host switch deletes old (IP,MAC) and inserts the new one

If both location and MAC change: when the new switch registers (oldIP, newMAC), the address resolver looks up (oldMAC, oldSwitchid) and tells the old switch about the MAC address change. The old switch then maintains a revocation list, sending gratuitous ARPs to the senders and the new location of the host (newMAC, newSwitchid) to the sending switches.
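Here is how I picture the revocation list, as a small Python sketch. `RevocationList` and the action tuples it returns are hypothetical, not something the paper specifies.

```python
from typing import Dict, List, Tuple


class RevocationList:
    """Hypothetical MAC revocation list kept by a switch."""

    def __init__(self) -> None:
        self._revoked: Dict[str, str] = {}  # old MAC -> new MAC

    def revoke(self, old_mac: str, new_mac: str) -> None:
        self._revoked[old_mac] = new_mac

    def on_frame(self, dst_mac: str, sender: str) -> List[Tuple[str, ...]]:
        """Return the actions to take for a frame addressed to dst_mac."""
        new_mac = self._revoked.get(dst_mac)
        if new_mac is None:
            return [("forward", dst_mac)]
        return [
            ("gratuitous_arp", sender, new_mac),  # fix the sender's stale ARP entry
            ("forward", new_mac),                 # still deliver the frame meanwhile
        ]
```

A frame to a revoked MAC triggers both a correction to the sender and a forward to the new MAC, so no traffic is dropped while the stale caches drain.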

One drawback of SEATTLE is that it does not support ECMP so it does not utilize links efficiently (even if they are idle).
