Wednesday, June 27, 2012

I created a shell script in Windows and ran it on Linux...

and it did not work. Everything seemed to be in order. The problem is that on Windows the end-of-line marker is \r\n, while on Linux it is \n. You need to install dos2unix to convert your file to Unix line endings.
sudo apt-get install dos2unix
Also see:
http://www.virtualhelp.me/linux/164-dos2unix-missing-ubuntu-1004
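
If dos2unix is not available, a quick Python one-off can do the same conversion (the file name here is just an example):

with open('myscript.sh', 'rb') as f:
    data = f.read()
if b'\r\n' in data:
    # rewrite the file with Unix (\n) line endings
    with open('myscript.sh', 'wb') as f:
        f.write(data.replace(b'\r\n', b'\n'))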

Monday, June 25, 2012

Towards a Next Generation Data Center Architecture: Scalability and Commoditization

This paper is the precursor to the VL2 paper. Lots of interesting ideas. My random thoughts:

Preliminaries:
Redundancy models
N+1
N+1 means we have 1 backup device per group of N components, so we can tolerate one failure. The backup can replace any one of the N components, but only one at a time. The minimum number of components needed for the system to function is N.

1+1
1+1 means that every component has one dedicated backup. Each component can be replaced only by its own backup device. The minimum number of components needed for the system to function is 1.
In a 1+1 system the standby server is able to maintain state; e.g. in a phone network the standby server can keep ongoing calls alive because it constantly monitors all activity on the active server. In an N+1 system there is only one standby for N active servers, and the call state is lost in case of a switchover.

Destination NAT/Half-NAT + Direct Server Return + Source NAT/Full NAT
Read http://lbwiki.com/index.php/DSR to understand DSR. DSR requires the load balancer and the servers to be in the same Layer 2 domain. Source/Destination NAT require that all return packets pass through the load balancer as well, so the servers must set the load balancer as their default route. This severely limits the number of servers which a load balancer can support.


This figure is taken from the excellent book Load Balancing Servers, Firewalls, and Caches


- Workload inside a data center changes frequently which leads to sporadic congestion. VLB helps accommodate arbitrary traffic patterns.
- No link oversubscription: Any server can communicate with any other server at full NIC speed (the minimum of the sender's and receiver's NIC speeds, usually 1 Gbps)
- This is probably the earliest paper proposing the idea of software load balancers on commodity servers in data centers. Advantages: scale out, flexible load balancing algorithms, cheap. Disadvantages: can't forward at line rate.
- Packets coming from external sources are sent to an Ingress Server which performs packet encapsulation after contacting the directory service. This is a hidden hardware cost we must incur. Interestingly, VL2 does not even mention anything about Load Balancers or communication with external hosts.
- Uses MAC-in-MAC tunneling instead of the IP-in-IP tunneling used in VL2. The switches know only about other switches (not hosts) and their directly connected servers.
- The OS is changed but not the applications. The network stack performs encapsulation after examining each packet, which would be slower than usual.
- Hose traffic model: No server is asked to send or receive more than what its network interface allows.
- Monsoon proposes that we run IS-IS with the controller, so that it knows where every host is located in order to run the directory service.

Random Facts
- Networking gear consumes 10-20% of the data center's equipment budget
- An Ethernet port is 10% to 50% the cost of an equivalent Layer 3 port
- 16K MAC entries can be stored in commodity switches.

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

I started reading this paper a month ago but put it down after 2 pages. I couldn't follow it. I tried reading it again a week ago and put it down again. Why is this paper so hard to read? Why am I so stupid? I found two blog posts where students had reproduced the results in the paper and had found it very clear and easy to read.
Read blog post 1 2
So if anyone out there is reading this post, I would really appreciate it if you can provide links to blog posts, videos or ppts which explain this paper in an easy way. I want to understand this paper in depth.

Sunday, June 24, 2012

Hedera: Dynamic Flow Scheduling for Data Center Networks

To repeat the experiments get the code from:
http://reproducingnetworkresearch.wordpress.com/2012/06/09/hedera-2/
or
http://reproducingnetworkresearch.wordpress.com/2012/06/06/hedera/

The paper is straightforward and fairly easy to read.

Things learnt
- VLB spreads traffic per-flow instead of per-packet to preserve packet ordering; TCP treats out-of-order packets as a signal of congestion.
- The whole idea of the paper is that elephant flows must be scheduled/placed carefully to achieve good aggregate bandwidth.
- Global first fit works quite well and there might be no need to run the complicated simulated annealing. The only disadvantage might be the running time of the algorithm (see the sketch after this list).
- NetFPGA OpenFlow switches have 2 hardware tables: a 32-entry TCAM that matches on variable-length prefixes and a 32K-entry SRAM that matches flow entries with fully qualified 10-tuples
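
Here is a rough sketch of how I understand global first fit (my own Python, not the authors' implementation; link names and units are made up):

def global_first_fit(flow_demand, candidate_paths, capacity, reserved):
    # candidate_paths: list of paths, each path being a list of link ids
    # capacity/reserved: dicts mapping link id -> bandwidth in Mbps
    for path in candidate_paths:
        # take the first path where every link still has room for this flow
        if all(reserved[link] + flow_demand <= capacity[link] for link in path):
            for link in path:
                reserved[link] += flow_demand
            return path
    return None  # nothing fits; leave the flow to the default ECMP hashing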

git checkout new branch

I always forget the command for this so I am writing it here:

To view all branches (local and remote):
git branch -a
To create and switch to a new local branch:
git checkout -b branch_name
then pull the remote branch into it:
git pull origin branch_name

Friday, June 22, 2012

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

This is a SIGCOMM 2008 paper:
http://www.cs.princeton.edu/~chkim/Research/SEATTLE/seattle.pdf

Just for fun, you can see network conference acceptance stats here

I had read the paper before but I felt too lazy to write a post about it. I finally got around to it. This paper can be described in two words: Consistent Hashing. Ok maybe three: Consistent Hashing Ethernet. Definitely four: Consistent Hashing Ethernet Scalable. Read my earlier post on Consistent Hashing for refreshing your concepts. 

The basic premise of the paper is very simple, but things get convoluted when you introduce caching, timeouts, mobility and failures. I felt the paper could have addressed these scenarios more lucidly. Also, I felt the paper was badly organized, especially in the middle sections. The good thing is it only requires changes to the switch architecture and control plane and not the end hosts, but if a simple protocol like TRILL took about 7 years to get adopted by Cisco, this will take another decade or so. Here is the rundown.

Why is Ethernet good?
- Plug-and-play
- Host mobility

Why is Ethernet bad?
- Source MAC learning requires flooding. DHCP and ARP packets are flooded, which wastes link bandwidth and processing resources.
- Broadcast storms, because there is no TTL in the header.
- Spanning tree computation under-utilizes links. Frames are not always sent via shortest paths. The root bridge must be selected carefully as it carries most of the traffic.
- Lots of security vulnerabilities and privacy concerns: snooping on other people's traffic, ARP spoofing, DoS attacks.


L3/L2 architecture: Pools of Ethernet domains connected by IP routers.
- Subnetting wastes IP addresses and increases configuration overhead for both routers and DHCP servers.
- Host mobility is restricted to within the subnet


Enter VLAN
- No restrictions on host mobility. Though we need to trunk the VLAN at all the bridges.
- Scopes broadcasts to small domains
- Each subnet is treated as a VLAN (http://www.youtube.com/watch?v=PpuWBh3EnVM)
- Need to compute multiple spanning trees now and choose root bridges carefully so as to effectively utilize all links and balance load
- Limited number: 4094
- Switches still need to maintain forwarding table data for all hosts in VLANs provisioned to them

The switches run a link-state routing protocol (OSPF) among themselves. This is scalable because it is used to maintain only the switch-level topology and not to disseminate end-host information, although as the network gets bigger it would still hit a limit because of large forwarding tables and LSA broadcasts. The paper proposes a remedy for that by dividing the switches into regions (very similar to OSPF areas). Since we are running OSPF between switches, if a switch fails all the other switches will eventually know about it. Also, OSPF helps route packets along shortest paths.

We use consistent hashing to evenly distribute and store (IP,MAC) and (MAC,host_switch_id) key-value pairs among the existing switches. Each switch chooses the MAC address of one of its interfaces as its unique network-wide switch_id. Since we use consistent hashing, the information is spread almost equally across all the switches. So, instead of all switches storing all the information, each now stores only a subset of it. Switches cache information to avoid doing repeated lookups (more below). Powerful switches can take more points in the key space (virtual switches) to hold more information. We can also hash information for services like PRINTER and DHCP_SERVER and store it on switches, implement anycasting by storing the same IP address with different MAC addresses, or change the MAC address for fail-over. This is a tangent though.

It uses loop-free multicasting to handle general broadcasts, and groups instead of VLANs (broadcasting in a group sends unicasts to the group members). This was a little hand-wavy in the paper though.

H - hash function used by all switches
H(k) = r_k, the resolver switch for key k. If (k,v) = (MAC, host_switch_id), r_k is called the location resolver

We have two (k,v) stores (IP,MAC) and (MAC,host_switch_id). 
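
A minimal sketch of the consistent-hashing lookup as I picture it (the hash function and switch ids are placeholders, not SEATTLE's actual choices):

import hashlib
from bisect import bisect_right

def H(value):
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16)

def build_ring(switch_ids):
    # each switch takes one point on the hash ring (virtual switches would add more points)
    return sorted((H(s), s) for s in switch_ids)

def resolver(ring, key):
    points = [p for p, _ in ring]
    idx = bisect_right(points, H(key)) % len(ring)  # first switch clockwise from H(key)
    return ring[idx][1]

# which switch stores the (MAC, host_switch_id) entry for this MAC?
ring = build_ring(['sw-a', 'sw-b', 'sw-c', 'sw-d'])
print(resolver(ring, '00:1a:2b:3c:4d:5e'))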

How is the packet sent from src to dst? Does the host switch look at the MAC address and tunnel the packet to the appropriate switch, or does it attach a separate header like TRILL? The paper does not say anything about this.

What information does each switch store?
- (MAC,port) key value pairs of directly attached hosts.
- (IP,MAC) and (MAC,host_switch_id) key value pairs of directly attached hosts. The switch must delete these once it detects the host MAC is unreachable. The location resolver informs the switch if the host has migrated to another switch. In this case the switch updates the entry, forwards the packets to the new switch and informs other switches of this change, and possibly deletes the entry after a timeout.
- (MAC,host_switch_id) key value pairs for the part of key space it is responsible for.
- (MAC,host_switch_id) key value pairs for the MACs to which hosts directly attached to the switch are sending packets. This is not very big because most hosts communicate with a small number of popular hosts. This is removed after a timeout.
- (IP,MAC) or (IP,MAC,host_switch_id) translation for the part of key space it is responsible for. This information is used only for replying to ARP requests.

When a host ha joins switch sa
The switch publishes (IPa,MACa) and (MACa,sa) to the resolvers determined by H. It is now responsible for monitoring the liveness of the resolvers (how frequently is this done?) and republishing the host information. What if the publisher is the resolver?
The switches also learn through OSPF if another switch dies or a new switch enters the network. They go through their cached (k,v) pairs and remove the entries where v = the dead switch. If a new switch has come up and (k,v) pairs have been re-published to it, the old switch still maintains them, to answer queries from switches which don't yet know about the new switch, and removes the entries after a fixed timeout. How is this timeout decided?

ha attached to switch sa wants to communicate with hb attached to switch sb
ha broadcasts an ARP request. sa intercepts it and forwards it to the resolver. The resolver sends back MACb and, if the optimization is enabled, sb. sa sends the ARP reply to ha.
sa now tunnels all packets destined to MACb to sb.
Without the optimization:
sa calculates H(MACb) and tunnels the packet to the resolver (IP tunneling, I guess). The resolver de-tunnels it and finds that the packet must be sent to MACb, which it knows how to reach. So, it tunnels the packet to sb and also tells sa, so that sa can cache this information and use the shortest path from then on. In case a switch comes alive after a reboot it uses this process to repopulate its cache.


Host Dynamics
3 types of dynamics:
1) Host changes location
The new host switch updates the (MAC,host_switch_id) entry to reflect the changes. 

2) Host changes MAC
Update (IP,MAC), delete the old (MAC,host_switch_id) entry and insert a new one. Other hosts might still have cached the old information. So, the new switch maintains a MAC revocation list and sends a gratuitous ARP to hosts sending packets to the old MAC. It also informs the sending host's switch about the entry (NewMAC, switch_id). It can still forward packets sent to the old MAC to the host in the meantime.

3) Host changes IP
Host switch deletes old (IP,MAC) and inserts the new one

If both location and MAC change: when the new switch registers (old_IP, NewMAC), the address resolver looks up (oldMAC, oldSwitchid) and tells the old switch about the MAC address change. The old switch now maintains a revocation list, sending gratuitous ARPs to senders along with the new location of the host (newMAC, newSwitchid) to the sending switch.

One drawback of SEATTLE is that it does not support ECMP so it does not utilize links efficiently (even if they are idle).

Thursday, June 21, 2012

Data Center TCP (DCTCP)

SIGCOMM 2010 paper:
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf

This is certainly one of the best written papers I have read in a long time. The paper is very easy to read (except the analysis in section 3.3), with powerful results. The paper brings into focus the bufferbloat problem, which has received attention lately in home networks; this paper shows its existence in data center networks as well.


Preliminaries
TCP Incast
Read blog post by Brad Hedlund

The paper analyzes measurements from a 6000-server data center and studies performance anomalies. The network should provide low latency to short message flows, high burst tolerance for incast flows (queries) and high throughput for long-lived bandwidth-hungry background flows. DCTCP is to be used strictly for intra-DC flows. The primary insight of the paper is that the switch buffer occupancy of the large flows should be kept as low as possible without affecting their throughput. This helps in two ways:
1) For delay sensitive short messages this reduces network latency
2) For query traffic which can cause incast, it helps ensure that sufficient buffer space is available to absorb the micro-bursts and hence avoid packet loss, which causes latency because we have to wait for the TCP retransmission timeout to expire. Reducing the RTO can fix the incast problem (see paper) but does not help with point 1 above.

Most commodity switches are shared-memory switches that have a logically common buffer for all switch ports/interfaces (fate sharing). The MMU manages fairness by setting a maximum amount of memory an interface can take. If more packets arrive beyond this limit they are dropped (incast).

Buffer pressure: Since the packet buffer space is shared among all interfaces, long flows on other interfaces can reduce the space available for bursts of short flows on free interfaces leading to loss and timeouts. This does not require the incoming flows to be synchronized.

DCTCP uses the ECN bit which is set by switches when their queue length exceeds a threshold and senders reduce the congestion window in proportion to congestion (i.e. fraction of marked packets). The paper gives accurate guidelines for selecting the values of g and K.
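
For my own reference, this is roughly what the sender-side update looks like as I understood it (a sketch, not the authors' actual implementation):

def dctcp_update(alpha, cwnd, marked_acks, total_acks, g=1.0 / 16):
    # F is the fraction of ECN-marked ACKs in the last window of data;
    # the switch marks packets once its queue exceeds the threshold K
    F = float(marked_acks) / total_acks
    alpha = (1 - g) * alpha + g * F      # running estimate of congestion
    if marked_acks > 0:
        cwnd = cwnd * (1 - alpha / 2.0)  # cut in proportion to congestion instead of halving
    return alpha, cwnd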

Thorough evaluations with key takeaways after each experiment.


Tell me why?
1) Why did the authors not compare the performance of their algorithm to TCP Vegas? It would have been interesting to see how DCTCP fares against it, though I agree with the authors' claim that measuring slight increases in RTT in a data center environment can be difficult.
2) The authors show that jittering reduces the response time at higher percentiles at the cost of increasing median response time. Why is this bad if all the responses meet the application deadline?
3) Why do the authors compare K against the instantaneous queue length instead of the average queue length? Don't they want to avoid the effects of instantaneous packet bursts?
4) The authors say that using delayed ACKs is important; however, another paper identified them as a serious problem in DCs.



Data center numbers for people who don't have access to data centers
1) The latency targets for data center applications like web search, retail, advertising and recommendation systems are ~10ms to ~100ms. Applications use Partition/Aggregate workflow pattern. Tasks not completed before deadline are cancelled and affect the quality of the final result.
2) 99.91% of traffic is TCP traffic.
3) In absence of queuing, intra-rack RTT = 100μs and inter-rack RTT = 250μs.
4) Each rack in a cluster holds ~40 servers. The data center in this paper has 44 servers per rack. There are typically 20 to 80 VMs per hypervisor.
5) A ToR switch with 48 1 Gbps ports and 2 10 Gbps ports has about 4 MB of shared buffer.
6) Median number of concurrent flows active in a 50ms window on each worker node is 36.
7) To avoid TCP Incast developers limit size of response to 2 KB and introduce application level jittering to get asynchronous responses from worker nodes.


Notes to future me
Read TCP Vegas paper
How Facebook handles scale
Remember TCP Large Send Offload

Random quote:
The biggest challenge is to find a novel solution to a novel problem which people in academia deem novel

Monday, June 18, 2012

Bisection bandwidth of a data center

Read:
http://etherealmind.com/bisectional-bandwidth-l2mp-trill-bridges-design-value/

Assuming all the ports are 1Gbps, from the Fat-tree paper:


So, the architecture provides full bisection bandwidth and there is no oversubscription at the aggregation and edge switches.
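
A quick back-of-the-envelope check (my own arithmetic for a k-port fat-tree with 1 Gbps links; k = 48 is just an example):

k = 48                              # ports per switch
hosts = k ** 3 // 4                 # k^3/4 hosts, each with a 1 Gbps NIC
core_switches = (k // 2) ** 2       # (k/2)^2 core switches
core_capacity = core_switches * k   # Gbps of capacity across the core layer
bisection_demand = hosts // 2       # Gbps needed if one half sends to the other at full rate
print((hosts, core_switches, core_capacity, bisection_demand))
# (27648, 576, 27648, 13824)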

Sunday, June 17, 2012

Applying NOX to the Datacenter

Get the paper here

The paper demonstrates that NOX can be used to implement various data center architectures (Portland and VL2) and addressing schemes, and that it scales well and can provide fine-grained QoS, access control and middlebox traversal. The paper has many good insights.

1) Data center networks are a special-case networking paradigm which allows new architectures, specialized addressing schemes and easy modification of the hypervisor and network stack. This paves the way for easier innovation, as shown by recent research work some of which I have described in this blog. Also, since the data center is a single administrative domain, using a centralized controller is better than using distributed dynamic protocols.

2) The paper describes three types of data centers: Bare-metal (w/o any hypervisor), virtualized (~20VMs per physical server) and virtualized-multitenant (like Amazon EC2)

3) Networking requirements of a data center:
i) Scaling: allocate more servers/VMs effortlessly as demand increases. The data center's architecture and addressing schemes should be able to scale to at least a million VMs or hundreds of thousands of servers. At bigger scale the forwarding tables grow, the switches/routers require more ports (and hence become costlier), and broadcasts become expensive.
ii) VMs should be able to migrate without any service interruption
iii) Fault tolerance to avoid service interruption

4) Next the paper describes VL2 and Portland. Here are the cons:
In Portland every core switch is connected to every pod with a single link: this will not scale well as it requires too many ports on the core switches. VL2 runs OSPF over all its switches to build forwarding tables; this does not scale well because LSAs, which are broadcast, are expensive. VM migration in VL2 is not easy and requires the controller to push the new state to all the switches which are sending traffic to the VM. In both schemes ARPs and new flows prompt sending packets to the controller, which is not scalable in my opinion. The controller should be used only for bootstrapping and on link failures. The middleboxes will take care of QoS and load balancing. Doing fine-grained control is not scalable.

5) NOX can operate in both proactive and reactive mode. In proactive mode the controller computes all the forwarding data beforehand and pushes it onto the switches. This incurs no latency unlike the reactive approach where the first packet of each new flow is sent to the controller for processing.

6) An interesting point in this section was that for larger networks multiple NOX controllers can act in parallel, where each new-flow packet is sent to a designated controller in reactive mode. The controllers work independently but must maintain a globally consistent view of the network topology and host address mappings. Achieving consistency will add latency and complexity, although how much depends on the controller application under analysis.
The paper says an individual controller can handle 30K new flows per second with ~10ms latency.
A low cost 1Gbps switch can support 250K source/destination pair flow entries.

How to calculate over-subscription ratio: http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCInfra_3.html#wp1088848
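
For example (my own numbers, following the linked doc's definition of downlink to uplink bandwidth):

downlink_bw = 48 * 1    # Gbps offered to servers by a ToR with 48 x 1 Gbps ports
uplink_bw = 2 * 10      # Gbps towards the aggregation layer (2 x 10 Gbps uplinks)
print('over-subscription ratio = %.1f:1' % (float(downlink_bw) / uplink_bw))  # 2.4:1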

7) In case of link/switch failure the controller has to install new flow entries at all the switches. The complexity and response time will increase with multiple failures. Also, installing low-priority flow entries beforehand will not work in case of multiple failures.





Saturday, June 16, 2012

Simple Expander graph constructions


An expander graph is a sparse graph with strong connectivity properties. Below is code for simple constructions of some families of expander graphs using the Python networkx library:


1) LPS graph

From Michael Nielsen's blog
In this example the family of graphs is indexed by a prime number, p. The set of vertices for the graph G_p is just the set of points in Z_p, the field of integers modulo p. We construct a 3-regular graph by connecting each vertex x \neq 0 to x-1, x+1 and x^{-1}. The vertex x=0 is connected to p-1, 0 and 1. According to the lecture notes by Linial and Wigderson, this was proved to be a family of expanders by Lubotzky, Phillips and Sarnak in 1988.

Python code
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys

def miller_rabin_pass(a, s, d, n):
    a_to_power = pow(a, d, n)
    if a_to_power == 1:
        return True
    for i in xrange(s-1):
        if a_to_power == n - 1:
            return True
        a_to_power = (a_to_power * a_to_power) % n
    return a_to_power == n - 1


def miller_rabin(n):
    d = n - 1
    s = 0
    while d % 2 == 0:
        d >>= 1
        s += 1
    for repeat in xrange(20):
        a = 0
        while a == 0:
            a = random.randrange(n)
        if not miller_rabin_pass(a, s, d, n):
            return False
    return True

def ramanujan_graph(p):
    if not miller_rabin(p):
        print "p is not prime"
        return None
    g = nx.Graph()
 
    for i in range(p):
        g.add_node(i)
    for i in range(p):
        if i == 0:
           g.add_edge(0,p-1)
           g.add_edge(0,1)
        else:
           g.add_edge(i,i-1)
           g.add_edge(i,(i+1)%p)
           g.add_edge(i,(i**(p-2))%p)          
    g=nx.Graph(g) #remove parallel edges
    g.remove_edges_from(g.selfloop_edges()) #remove self loops
    return g
g = ramanujan_graph(41)
PL.figure()
nx.draw_circular(g)
plt.show()

2) Paley graph

The Paley graph of order q (where q is a prime or an integer power of a prime) has q nodes, each with degree (q-1)/2. The Paley graph is undirected when q % 4 == 1. The set of vertices for the graph G_p is just the set of points in Z_p. Two vertices a,b in G_p are connected if a-b = x^2 % p for some x in [1,p). See also: http://mathworld.wolfram.com/PaleyGraph.html

Python code

import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys

def paley_graph(p):
    if not p%4 == 1:
        return None
    square_list = []
    for i in range(1, (p-1)/2 + 1): #squares of 1..(p-1)/2 give all quadratic residues mod p
        square_list.append((i**2)%p)
     
    g = nx.Graph()
 
    for i in range(p):
        g.add_node(i)
    for i in range(p):
        for j in square_list:
           g.add_edge(i,(i+j)%p)
    g=nx.Graph(g) #remove parallel edges
    g.remove_edges_from(g.selfloop_edges()) #remove self loops
    return g      

g = paley_graph(17)
print nx.diameter(g)
PL.figure()
nx.draw_circular(g)
plt.show()

3) Margulis construction

The vertex set is Z_n x Z_n. Each vertex (i,j) is connected to (i+2j, j), (i+2j+1, j), (i, 2i+j) and (i, 2i+j+1), with all arithmetic done mod n (this is what the code below builds).

import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys

def margulis_graph(n):
    #Graph has n^2 vertices
    g = nx.Graph()
 
    for i in range(n):
        for j in range(n):
            g.add_node((i,j))

    for (i,j) in g.nodes():
        g.add_edge((i,j),((i+2*j)%n,j))
        g.add_edge((i,j),(i,(2*i+j)%n))
        g.add_edge((i,j),(i,(2*i+j+1)%n))
        g.add_edge((i,j),((i+2*j+1)%n,j))
    g=nx.Graph(g) #remove parallel edges
    g.remove_edges_from(g.selfloop_edges()) #remove self loops
    return g      

g = margulis_graph(5)
print nx.diameter(g)
PL.figure()
nx.draw_circular(g)
plt.show()

4) Hypercube graph
See Wikipedia. Use nx.hypercube_graph(n)
5) Superconcentrators
Smaller Explicit Superconcentrators

Acknowledgments
I would like to thank Abhishek Bhowmick and Ankit Garg for helpful discussions.

Saturday, June 9, 2012

Installing ns2 on Ubuntu 11.10


Prerequisites

Install development libraries for X11:

sudo apt-get install libx11-dev libxt-dev g++


Download ns-allinone package from here:
http://www.isi.edu/nsnam/ns/ns-build.html#allinone

Untar the package

tar -xvf ns-allinone-2.35.tar.gz

Install the package:

sudo ./install


Add the required path information to your .bashrc file like this

export PATH=$PATH:<Place your paths here>

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH: <place the LD_LIBRARY_PATHS>

export TCL_LIBRARY=$TCL_LIBRARY:<place the TCL_LIBRARY>


After these steps, you can now run the ns validation suite with

cd ns-2.35; ./validate


Wednesday, June 6, 2012

RBridges and TRILL

http://www.ieee-infocom.org/2004/Papers/26_1.PDF
http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_14-3/143_trill.html

Ivan Pepelnjak on TRILL:
"bridging forwarding paradigms are used to forward host-to-host data traffic, including building MAC address table by listening to data traffic and flooding of packets sent to unknown MAC destination. TRILL therefore retains most of the non-scalable properties of transparent bridging"

Brad Hedlund on TRILL in data center:
http://bradhedlund.com/2010/05/07/setting-the-stage-for-trill/

Cisco FabricPath:
http://www.youtube.com/watch?v=M85WYV8i3jg

Saturday, June 2, 2012

A Scalable, Commodity Data Center Network Architecture

First read this nice article on data center architectures by Brad Hedlund:
http://bradhedlund.com/2011/04/21/data-center-scale-openflow-sdn/

The objective of the paper is the following: achieve full bisection bandwidth (if the network is segmented into two equal parts, this is the bandwidth between the two parts) between all the hosts in a network using commodity Ethernet switches (which are incrementally scalable), in the face of over-subscription.

Here is a traditional L3/L2 design:


There are typically 20-40 servers per rack. Each server is connected to a ToR switch via a 1 Gbps link. ToR switches have 10 Gbps uplinks to ARs. ToRs have 48 ports while ARs and CRs have 128 ports; ARs and CRs have only 10 Gbps ports. So, the target is a DC with 25,000+ hosts with a bisection bandwidth of 1 Gbps.

The paper proposes a Clos network/fat-tree as the new network architecture for achieving this goal. They propose an L3/L2 architecture:



k - Number of ports of the switches used in the fat-tree
There are k pods, each pod has 2 layers of k/2 switches. Each edge switch is connected to k/2 hosts. There are (k/2)^2 CRs. The topology can support (k/2)*(k/2)*k = k^3/4 hosts. Between any 2 hosts there are (k/2)^2 paths. We must spread the traffic over these paths somehow to achieve the bisection bandwidth.
Hosts connected to the same edge switch form a /24 subnet. VMs can migrate only within their own subnet. No talk of VLANs or migration between subnets in the paper.
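
Plugging concrete values of k into the formulas above (my own arithmetic; k = 4 is the usual small example, k = 48 shows the scale possible with 48-port switches):

def fat_tree_sizes(k):
    pods = k
    switches_per_pod = 2 * (k // 2)   # k/2 edge + k/2 aggregation switches
    core = (k // 2) ** 2              # (k/2)^2 core switches
    hosts = (k // 2) * (k // 2) * k   # k^3/4 hosts
    paths = (k // 2) ** 2             # paths between hosts in different pods
    return pods, switches_per_pod, core, hosts, paths

print(fat_tree_sizes(4))    # (4, 4, 4, 16, 4)
print(fat_tree_sizes(48))   # (48, 48, 576, 27648, 576)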


Addressing (For a 10/8 network)
CR - 10.k.j.i - i and j lie in the range [1,k/2]
Pod switch - 10.pod.switch.1 (switch number lies in [0,k-1])
Host - 10.pod.switch.ID (ID lies in [2,k/2+1]) - This is very wasteful


Routing
Here is how the routing table looks like for a pod switch:


- There are k/2 prefixes for the subnets in the same pod and k/2 suffixes for the k/2 hosts. Prefixes are preferred over suffixes (a toy lookup sketch follows this list).
- For intra-pod traffic the lower-level pod switches send the packet to the upper level (evenly split between the upper-layer switches), which sends it back down to the other edge switch. For inter-pod traffic the ARs spread the traffic among the CRs based on the last byte of the IP address. TCP ordering is not broken because packets of a flow always follow the same path to the CR.
- The CRs aggregate the network prefixes of all the subnets (i.e. k /16 networks), which should not take too much space in the forwarding tables. The traffic spreading takes place only on the route from the edge to the CRs.
- The paper advocates the use of a central controller (OpenFlow anyone?) for generating the routing tables and statically storing them on the switches. Since the routing tables will not change, this seems alright.
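
Here is a toy version of the two-level lookup as I understand it (not the paper's exact table layout; ports and addresses are made up):

def lookup(dst_ip, prefix_table, suffix_table):
    subnet = '.'.join(dst_ip.split('.')[:3])   # the /24 part, e.g. '10.2.1'
    host_byte = int(dst_ip.split('.')[3])
    if subnet in prefix_table:                 # intra-pod traffic: prefix match wins
        return prefix_table[subnet]
    return suffix_table[host_byte % len(suffix_table)]  # inter-pod: suffix spreads over uplinks

# example for an upper-layer switch in pod 2 of a k=4 fat-tree
prefix_table = {'10.2.0': 'port-0', '10.2.1': 'port-1'}  # the pod's own subnets
suffix_table = ['uplink-0', 'uplink-1']                  # k/2 = 2 core-facing ports
print(lookup('10.2.0.3', prefix_table, suffix_table))    # intra-pod -> port-0
print(lookup('10.0.1.2', prefix_table, suffix_table))    # inter-pod -> uplink-0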

Friday, June 1, 2012

VL2: A Scalable and Flexible Data Center Network

The paper also tries to propose an architecture which allows the data center to become a single large Layer-2 domain. I will not go into the advantages of doing so: low cost, zero configuration (apart from DHCP), scalable services, easy VM migration, etc.
This paper raises some important research questions in the field of data center networking. It combines a measurement study and a new architecture.

Q1. How to provide high capacity throughput to intra-dc flows in face of over-subscription?

Q2. How to isolate the traffic of one service from another? What prevents one service from flooding the network with its own traffic?
The paper does not really solve this problem

Q3. How to make Ethernet broadcasts scalable? VLANs don't scale well.
Only ARP and DHCP are made scalable. No talk about multicast.

System design:
1) Clos topology (links between AR and CR form a complete bipartite graph)
2) Virtual Layer-2 (hence the name VL2)
3) VLB and ECMP
4) Central directory service

Here is the topology:



The CR, AR and ToR use OSPF to learn about the switch-level topology and shortest path routing.

The measurement study done for the data center they were analyzing showed two properties:
1) Traffic pattern inside a DC is very unpredictable and changes rapidly
2) Ratio of traffic volume between servers to traffic volume leaving/entering the DC is 4:1

VL2 routing
- VLB is used to spread traffic across multiple CRs. Take a random path from a source ToR to a CR and from CR to destination ToR switch.
- Use anycast address at the CR and ToR to enable ECMP forwarding.
- Two address spaces - application-specific addresses (AAs) and location-specific addresses (LAs). The directory system maintains a mapping from AA to LA. Every server in the DC has a VL2 agent running in the network stack. The agent traps the ARP request and instead sends a unicast to the directory service and caches its reply. It is also responsible for tunneling (IP in IP) the packet to the ToR switch of the destination host. The agent also rate-limits broadcast traffic to prevent storms.
- DHCP is implemented using relay agents in each ToR and a central DHCP server. The directory service is notified of the LA-AA mapping when a new server joins the network and also after a server migrates to a new rack (by looking for gratuitous ARP).
- The directory service has two components:
1) Read optimized directory servers (store soft state which may be stale)
2) Write optimized strongly consistent RSM servers. They run Paxos for reliability and fault tolerance.

Here is how routing is done -



VL2 utilizes both VLB and ECMP (each one complementing the other's weakness). If you use only VLB:
Each agent is provided a random intermediate router address. We don't need anycast addresses; each router interface can have a different IP and OSPF takes care of reachability. We stack up the IP headers at the source and off we go. Now, what happens if the intermediate router fails? We would have to update a huge number of VL2 agents to use a new random router.
If we use ECMP alone: configure each router's interface to have the same unicast address (10.0.0.1). ECMP makes sure that any one of these can be taken as the next hop. In this case we don't need to do multiple encapsulations; a single encapsulation with H(ft) | dest_ToR_IP is enough. But commodity switches support only 16-way ECMP, so we can't spread traffic across all the intermediate routers.
The solution is to combine both VLB and ECMP: we define several anycast addresses, each associated with only as many intermediate switches as ECMP can accommodate. The agent is given a random intermediate anycast address and we perform multiple encapsulations. OSPF and ECMP react upon link/switch failure, eliminating the need to notify agents.
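
A sketch of what the VL2 agent's encapsulation decision might look like (my own guess at the mechanics; names and the header layout are made up, not from the paper):

import hashlib

def flow_hash(five_tuple):
    return int(hashlib.md5(':'.join(map(str, five_tuple)).encode()).hexdigest(), 16)

def encapsulate(five_tuple, anycast_addrs, dst_tor_la, payload):
    # pick one anycast address, i.e. one group of <=16 intermediate switches;
    # ECMP then spreads flows across the switches within that group
    intermediate = anycast_addrs[flow_hash(five_tuple) % len(anycast_addrs)]
    # stack the headers: outer one to the intermediate switch, inner one to the destination ToR
    return [('IP', intermediate), ('IP', dst_tor_la), payload]

pkt = encapsulate(('10.1.0.5', '10.3.0.7', 6, 4321, 80),
                  anycast_addrs=['20.0.0.1', '20.0.0.2', '20.0.0.3'],
                  dst_tor_la='10.3.0.1', payload='data')
print(pkt)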

Drawbacks
1) Bipartite graph will limit scalability of this approach.
2) Does not talk about how to tackle multiple tenants each with same private IP address.
3) Configuring routers for OSPF is not scalable. This is an error prone process. Not sure if OSPF will be scalable (each router stores all the LSAs and broadcasts HELLO packets). Need to configure areas, which creates further configuration overhead.
4) Every outgoing packet needs to be encapsulated (and the agent computes the hash and puts it in each packet). This increases latency and reduces useful bandwidth. If we have multiple layers in the data center hierarchy I don't think this approach will scale.
5) Hosts which need to communicate with outside world are given both LA and AA. This essentially fixes their position in the topology and further increases the burden on OSPF (these IPs are not aggregated).
6) VL2 has eventual consistency when it comes to AA-LA mappings. This implies that VM migration will have more downtime and packet loss.

Random
- A commodity switch stores only 16K entries in its routing table
- ToR informs the directory server in case it receives a packet for which it has no mapping which triggers the directory server to update the VL2 agent caches. This can lead to DOS attacks on the directory servers.
- The servers had delayed ACKs turned on which should be turned off.
- What is a hose model? Not really clear from the paper. The reference cited talks about it in terms of VPNs.