
Monday, April 22, 2013

SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary Topologies

Drawbacks:
Not scalable
Requires end host modifications
Does not handle failures well

Wednesday, April 10, 2013

Java array creation running time


I created integer arrays of different sizes and measured the running time of array creation (e.g. int[] x = new int[67108864]):

Array length    Running time
67108864        0.052
134217728       0.102
268435456       0.206
536870912       0.41

The running time is linear in the array size.
To be fair, this depends on the JVM version (I have java version "1.6.0_27"), and there are algorithms out there which can give you constant-time initialization at the cost of extra memory (http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time/)
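
A minimal sketch of this kind of measurement (not the exact code I used) looks like this; run it with a large heap (e.g. -Xmx6g) so the larger allocations succeed.

public class ArrayCreationTiming {
    public static void main(String[] args) {
        int[] sizes = {67108864, 134217728, 268435456, 536870912};
        for (int size : sizes) {
            long start = System.nanoTime();
            int[] x = new int[size];         // allocation + zero-initialization
            long elapsed = System.nanoTime() - start;
            x[size - 1] = 1;                 // touch the array so the allocation is not optimized away
            System.out.printf("%d\t%.3f s%n", size, elapsed / 1e9);
        }
    }
}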


Monday, April 8, 2013

PAST: Scalable Ethernet for Data Centers

Here is the paper
PAST stands for Per-Address Spanning Tree routing, which is the core idea behind the paper. Much of the introduction is similar to SEATTLE.

Pure Ethernet
Pros:
1) Self-configuration (plug-and-play)
2) Allows host mobility without hassles.
Cons:
1) Not scalable: flooding-based learning does not scale beyond roughly 1,000 hosts, and the MAC table becomes huge because each switch has to know how to forward traffic to every host (MAC addresses don't aggregate). Broadcasts can be limited using VLANs, but only 4094 VLAN IDs are available.
2) Spanning tree limits the available bandwidth by removing links to avoid forwarding loops.

IP
Pros:
1) Allows ECMP forwarding (currently limited to layer-3 networks), making more efficient use of the available bandwidth.
2) IP addresses are aggregatable and hence scalable.
Cons:
1) Difficult to configure and manage (router config, DHCP server config --> error prone).
2) Host mobility is an issue, and live migration is needed for fault tolerance and to achieve high host utilization.


Ethernet + IP
Group Ethernet LANs via IP (run OSPF among the IP routers). Layer-3 routers allow ECMP routing and provide scalability.
Cons:
Subnetting limits host mobility to within a single LAN (subnet), and scalability becomes an issue as tenants grow. Subnetting also wastes IP addresses and increases configuration overhead for both routers and DHCP servers.



Problem statement
We want the ease and flexibility of Ethernet and the scalability and performance of IP, while using inexpensive commodity hardware (proprietary hardware is harder to deploy and loses the advantage of economies of scale) and working over arbitrary topologies.

PAST relies solely on the destination MAC address and VLAN tag to forward packets. So it can use the larger SRAM-based MAC tables instead of TCAMs (which have higher area and power per entry than SRAM).

The paper provides a good overview of the Trident switch chip and a useful table of the table sizes of some commercially available 10Gbps switches:

http://www.intel.com/content/www/us/en/switch-silicon/ethernet-switch-fm6000-series.html
The Intel FM6000 supports 64K IPv4 routes as well.


The lack of information about switch internals makes it difficult for networking researchers to account for the constraints of real hardware. This paper fills that gap (for the Trident switch chip).


The L2 table performs an exact-match lookup on VLAN ID and destination MAC address. It is much larger than the TCAM tables, and its output is either an output port or a group.

TCAMs can wildcard-match on most packet header fields. The rewrite TCAM supports output actions that modify packet headers, while the forwarding TCAM is used to choose an output port or group. Trident can support 1K ECMP groups, each with a maximum of 4 output ports, in the ECMP Group Table.
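
To make the distinction concrete, here is a toy model (my own illustration, not Trident's actual data structures) of an exact-match L2 table next to a wildcard-matching TCAM searched in priority order:

import java.util.*;

public class ForwardingTables {
    // Exact-match L2 table: key is "vlan/dstMac", value is an output port (or group id).
    static final Map<String, Integer> l2Table = new HashMap<String, Integer>();

    // Wildcard TCAM entry: a packet's header bits match if (bits & mask) == value.
    static class TcamEntry {
        long value, mask; int outputPort;
        TcamEntry(long value, long mask, int outputPort) {
            this.value = value; this.mask = mask; this.outputPort = outputPort;
        }
    }
    static final List<TcamEntry> tcam = new ArrayList<TcamEntry>(); // searched in priority order

    static Integer lookup(int vlan, long dstMac, long headerBits) {
        Integer port = l2Table.get(vlan + "/" + dstMac);   // cheap, SRAM-sized exact match
        if (port != null) return port;
        for (TcamEntry e : tcam) {                         // scarce, TCAM-sized wildcard match
            if ((headerBits & e.mask) == e.value) return e.outputPort;
        }
        return null; // no rule: on the switch described below this would go to the controller
    }
}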

Switches contain a control-plane processor that is responsible for programming the switch chip, listening to controller messages, and participating in control-plane protocols like spanning tree, OSPF, etc.

The IBM RackSwitch G8264 has the unique capability of installing OpenFlow rules that exact-match only on destination MAC and VLAN ID in the L2 table. The switch can install 700-1600 rules/s, and each rule installation takes 2-12 ms. Each packet that matches no rule is sent to the controller, but this path is limited to 200 packets/s, so reactive forwarding will not provide acceptable performance.

Given the small size of TCAMs, any routing mechanism that requires the flexibility of a TCAM must aggregate routes. The paper argues that, given the large size of L2 tables, we can fit one entry per routable address in every switch. I don't believe this argument at all. PortLand started out with the same argument and basically said: let's bring the benefits of aggregation to MAC addresses. But by doing so you again limit host placement (and, in PortLand's case, the topology as well). To get around this, layer-2 and layer-3 schemes make use of indirection to separate location addresses from application addresses, which leads to the familiar set of inefficiencies.
Plain Ethernet cannot exploit the multiple paths available in topologies like fat-tree and Jellyfish, because STP constructs a single tree to forward packets along in order to avoid forwarding loops; SPAIN and TRILL are attempts to address this.
TRILL runs IS-IS to build shortest-path routes between switches, but it uses broadcast for address resolution, limiting its scalability.

Multipath routing and Valiant load balancing (VLB) are used to fully exploit the available bandwidth in a network.
If there are multiple shortest paths between two hosts, we can use ECMP to increase path diversity and avoid hotspots and flow collisions. But ECMP is applicable only in architectures that compute shortest paths, like IP routing and TRILL (IS-IS).
VLB forwards traffic minimally to a random intermediate switch, after which the traffic follows a minimal path to its destination.

Core idea
Many possible spanning trees can be built for an address if the topology has high path diversity. If we make the individual per-address trees as disjoint as possible, we can improve aggregate network utilization.

Baseline: Destination-rooted shortest-path trees.
Build a BFS spanning tree for every address in the network. The tree is rooted at the destination host, so it provides a minimum-hop-count path from any point in the network to that destination, and no links are ever disabled in the topology. Every switch forwards traffic to the host along the path given by that host's tree. Forward and reverse paths can be asymmetric (since a separate tree is calculated for each host).

Three variants
PAST-R
Pick a uniformly random next hop among the BFS parents. Intuitively, we spread the traffic across all the links, since each destination gets a random spanning tree (a small sketch of this appears after the three variants).

PAST-W
Use weighted randomization in BFS: weight the next-hop selection by how many children the next-hop switch has.

The above two variants are similar to ECMP. But unlike ECMP, which balances load at per-flow granularity, PAST balances load at per-destination granularity.

NM-PAST
This is similar to VLB and achieves high performance under adversarial workloads. Select a random intermediate switch i and use it as the root of the BFS spanning tree for each host h. The switches along the path in the tree from h to i are then updated to route traffic towards h rather than i, making h the sink of the tree.

The evaluation found no significant difference between PAST-R and PAST-W, and NM-PAST performs better than ECMP and VLB.
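
Here is a minimal sketch (my own interpretation, not the authors' code) of building one destination-rooted BFS tree with a uniformly random choice among equal-cost next hops, in the spirit of PAST-R. PAST-W would bias the random choice and NM-PAST would root the BFS at a random intermediate switch instead of the destination; each resulting (switch, next hop) pair then becomes an exact-match L2 rule on (VLAN, destination MAC).

import java.util.*;

public class PastR {
    // adjacency: switch id -> list of neighboring switch ids.
    // Returns, for every switch, the next hop towards destSwitch in a random BFS tree.
    static Map<Integer, Integer> buildTree(Map<Integer, List<Integer>> adjacency,
                                           int destSwitch, Random rnd) {
        Map<Integer, Integer> dist = new HashMap<Integer, Integer>();
        Map<Integer, List<Integer>> parents = new HashMap<Integer, List<Integer>>();
        LinkedList<Integer> queue = new LinkedList<Integer>();
        dist.put(destSwitch, 0);
        queue.add(destSwitch);
        while (!queue.isEmpty()) {
            int u = queue.removeFirst();
            List<Integer> neighbors = adjacency.get(u);
            if (neighbors == null) continue;
            for (int v : neighbors) {
                if (!dist.containsKey(v)) {              // first time we reach v
                    dist.put(v, dist.get(u) + 1);
                    parents.put(v, new ArrayList<Integer>());
                    queue.add(v);
                }
                if (dist.get(v) == dist.get(u) + 1) {    // u is one hop closer to the destination
                    parents.get(v).add(u);
                }
            }
        }
        Map<Integer, Integer> nextHop = new HashMap<Integer, Integer>();
        for (Map.Entry<Integer, List<Integer>> e : parents.entrySet()) {
            List<Integer> candidates = e.getValue();     // all equal-cost choices towards the destination
            nextHop.put(e.getKey(), candidates.get(rnd.nextInt(candidates.size()))); // PAST-R: uniform random
        }
        return nextHop;
    }
}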


Implementation

PAST uses the OpenFlow controller / dumb-switch architecture. All switches forward ARP messages to the controller, and the controller sends back the ARP replies; ditto for DHCP. Switches run LLDP, so the controller knows the entire topology. PAST is topology independent.


What happens when?
1) A host joins/migrates: the migrated host sends a gratuitous ARP, which is forwarded to the controller. The controller calculates the new tree for the host and updates the rules on the switches.
2) A link/switch fails: all hell breaks loose. All affected trees must be recomputed (if the failed switch is in the core layer this can be very costly), which is the biggest drawback of the paper. The paper says the algorithm can compute trees for all hosts in a network with 8,000 hosts in 300 ms and with 100,000 hosts in 40 seconds, while installing a single rule on a switch takes up to 12 ms. The authors say it takes up to a minute for them to recover from a single failure (see the back-of-envelope sketch below).
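
A rough back-of-envelope (my own, using the numbers quoted above and assuming, pessimistically, that a failure in the core touches most trees and that each affected tree needs roughly one rule update per switch) shows why recovery is slow:

public class RecoveryEstimate {
    public static void main(String[] args) {
        int hosts = 100000;                  // network size quoted above
        double recomputeSeconds = 40.0;      // time to recompute all trees at this size
        int rulesPerSecondPerSwitch = 1600;  // upper end of the G8264 install rate
        // Pessimistic assumption: nearly every tree is affected, so each switch needs
        // on the order of one rule update per destination host.
        double installSecondsPerSwitch = (double) hosts / rulesPerSecondPerSwitch;
        System.out.printf("recompute ~%.0f s, rule installation ~%.0f s per switch%n",
                          recomputeSeconds, installSecondsPerSwitch);
        // Even though switches install rules in parallel, each individual switch is
        // still looking at on the order of a minute in this worst case.
    }
}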



Drawbacks

1) The authors say their design does not need special hardware, but they depend on the IBM G8264, which is currently the only switch that supports installing OpenFlow rules in the Ethernet forwarding table.
2) Installing a recomputed tree can lead to routing loops. This can be prevented by first removing the rules corresponding to that tree from the switches and using a barrier command to make sure they are gone before installing the new tree's rules, but that wastes time.
3) PAST requires one Ethernet table entry per routable address, so the scalability of the architecture is limited by the Ethernet table size (around 100K entries at most). The paper hopes that PAST will scale well in future networks because it uses SRAM-based Ethernet tables.
SPAIN, another paper similar to PAST, requires a switch to store all possible (VLAN, destination MAC) pairs, which likewise limits scalability. The paper notes that switches resort to flooding when these tables overflow, which is not acceptable in a big data center.


Wednesday, April 3, 2013

CloudNaaS: A Cloud Networking Platform for Enterprise Applications

Here is the paper
The paper is not well written. It is all over the place and is missing many details. The introduction moves in circles, saying the same thing again and again, and leaves you confused about what problem the paper is trying to solve. It is only on page 4, when the paper talks about the implementation, that you understand what the paper is about.

I have rewritten the premise of the paper in my own words:

A network admin has an enterprise network. He wants to migrate it to the cloud but is reluctant because clouds don't support middleboxes (MBs) and private IP addresses (customers have little or no control over configuring the network). The cloud provider needs to reduce the need to rewrite applications when moving them to the cloud, so customers should be allowed to keep the private IP addresses they were using in their enterprise networks. The paper says another key issue is allowing broadcasts; I don't think this is an issue, as broadcasts are generally bad and should be avoided unless absolutely necessary. There are ways of getting around ARP and DHCP broadcasts by making them unicasts to the network controller/directory service, and you can replace other broadcasts with IP multicast.

Now CloudNaaS comes into the picture: it asks the admin for the network policy and tries to satisfy the needs of this logical topology using the cloud's physical resources.


CloudNaaS is an under-the-cloud implementation, i.e. it is implemented by the cloud provider.
It allows customers to specify network requirements (nothing new):
1) Tenants specify bandwidth requirements for applications hosted in the cloud.
2) Tenants specify MB traversal.

Components:

Cloud controller
Takes the user's network policy and (as far as I can tell) turns it into a communication matrix for the network controller (see drawback 5 below)
Network controller
Monitors and configures the network devices
Decides the optimal placement of VMs in the cloud to satisfy the network policies - keeps monitoring and adjusting to continue meeting them

Drawbacks
1) Multiple VMs can have the same private IP address, so how do you distinguish between them? Each VM has a public and a private IP address, and the software switch inside the hypervisor rewrites the IP addresses of incoming and outgoing packets (how does this work with tunneling?). On migration these rules are updated, so communication still happens using public IP addresses (a small sketch of this rewriting appears after this list). To allow MB traversal, CloudNaaS installs rules in the software switches of the source VM and of each subsequent MB, which simply tunnel the packet along the policy path. This works, but now you need policy rules at each hop, which means changing policies is inflexible, more switch memory is used, and installing certain policies becomes infeasible. You still have to identify the previous hop somehow, but the paper does not talk about it.
2) This assumes the MB is a layer-3 device. How do I handle transparent firewalls/NATs? These devices would now have to be installed at choke points within the data center network (the paper says NAT service is provided at the cloud gateway), which has the disadvantages explained in the PLayer paper.
3) The software switch attached to an MB has to know the private-IP-to-public-IP translation for all of the tenant's hosts. Migration is still not seamless. And how does a load balancer work? It can't use DSR/DNAT/tunneling.
4) Bandwidth reservation schemes are not very useful in a cloud with a high churn rate. They are either overly conservative, leading to low network utilization, or overly lenient, resulting in poor isolation.
5) The paper is not very clear about the communication matrix and how VM placement is decided using it.
6) The paper does not talk about how the core network topology is set up; it implicitly assumes it somehow works in a scalable manner. The policies are pushed to the edge switches in the hypervisors. The paper suggests an optimization of breaking up the address space and allocating each subnet to the servers attached to a ToR. This is exactly what a traditional hierarchical data center looks like (and it doesn't allow seamless VM migration). Also, migration changes the public IP address of a host, which requires major rewriting of rules on all switches.
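
To make drawback 1 concrete, here is a toy sketch (my own illustration, not CloudNaaS code) of the address translation state the hypervisor's software switch would have to keep so that tenants can reuse overlapping private IPs while the fabric routes on public IPs:

import java.util.*;

public class EdgeAddressRewriter {
    // (tenant id + private IP) -> public IP, and the reverse direction.
    private final Map<String, String> privateToPublic = new HashMap<String, String>();
    private final Map<String, String> publicToPrivate = new HashMap<String, String>();

    void register(String tenantId, String privateIp, String publicIp) {
        privateToPublic.put(tenantId + "/" + privateIp, publicIp);
        publicToPrivate.put(publicIp, tenantId + "/" + privateIp);
    }

    // Outgoing packet from a tenant VM: rewrite the private destination to the public one.
    String rewriteEgressDst(String tenantId, String privateDstIp) {
        return privateToPublic.get(tenantId + "/" + privateDstIp);
    }

    // Incoming packet to a tenant VM: rewrite the public destination back to the private one.
    String rewriteIngressDst(String publicDstIp) {
        String tenantAndPrivate = publicToPrivate.get(publicDstIp);
        return tenantAndPrivate == null ? null : tenantAndPrivate.split("/", 2)[1];
    }
}

On migration, every mapping involving the moved VM has to be updated, which is one reason migration is not seamless here.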

Things learnt from the paper
The hardware OpenFlow switches use expensive and power-hungry TCAMs, while the software switches use DRAM. As a result, a software switch can store many more rules. So try to push more intelligence to the edge of the network while treating the core as a simple, dumb packet pusher.


Tuesday, April 2, 2013

Paxos demystified

A Policy-aware Switching Layer for Data Centers

A good paper with good ideas about the role and placement of middleboxes in data center topologies. The premise of the paper is simple: middleboxes cause a lot of agony in data centers, and 78% of data center downtime is caused by misconfiguration, which stems from the human involvement required to implement middlebox policies. Typical methods used to ensure correct MB traversal are:
1) Overloading path selection mechanisms (e.g. adjusting weights in the spanning tree protocol)
2) Putting MBs at choke points
3) Placing MBs at all possible choke points / incorporating MB functionality into switches - costly.

Problems
1) Ad-hoc practices like overloading existing path selection mechanisms are hard to configure and maintain (think link/switch failures). Other practices: removing links to create choke points is complex and loses fault tolerance; separate VLANs with MBs at the interconnection points hurt efficiency and rule out seamless VM migration.
2) Inefficiency - a packet should not traverse MBs it is not supposed to; this wastes resources and can also cause incorrect behavior.
3) Inflexibility - new instances of an MB may be deployed to balance load (how do we make this automatic?), or policies may change in the future. Doing this requires human intervention, which leads to errors.
4) If an MB fails, it leads to a network partition (since MBs are placed at choke points).

What we need
1) Correctness - Traffic should traverse middleboxes in the sequence specified by the network administrator under all network conditions.
2) Flexibility - The sequences of middleboxes should be easily reconfigured as application requirements change.
3) Efficiency - Traffic should not traverse unnecessary middleboxes.

Solution
1) Don't place MBs at choke points; instead, attach them to pswitches and configure the switches to implement the MB policies, i.e. take MBs off the physical network path. Data centers have low network latency, so sending packets to off-path MBs is not costly.
2) Separate the logical topology from the physical topology, or in other words, separate policy from reachability.

Architecture
Consists of Policy controller, Middlebox controller, and pswitches

The policy controller accepts policies (from the network admin) and converts them into pswitch rules.

Policies are of the form:
[Start Location, Traffic selector (5-tuple)] --> sequence of MB types

Pswitch rules are of the form:
[Previous Hop, Traffic selector (5-tuple)] --> Next Hop
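
As a concrete (and hypothetical: the names and the example policy below are mine, not the paper's) illustration, a policy listing a sequence of MB types can be unrolled into one per-hop pswitch rule per step:

import java.util.*;

public class PolicyCompiler {
    static class Rule {
        String previousHop, selector, nextHop;
        Rule(String previousHop, String selector, String nextHop) {
            this.previousHop = previousHop; this.selector = selector; this.nextHop = nextHop;
        }
        public String toString() {
            return "[" + previousHop + ", " + selector + "] --> " + nextHop;
        }
    }

    // Unroll [start location, selector] --> MB sequence into per-hop rules.
    static List<Rule> compile(String startLocation, String selector,
                              List<String> mbSequence, String finalDestination) {
        List<Rule> rules = new ArrayList<Rule>();
        String prev = startLocation;
        for (String mb : mbSequence) {
            rules.add(new Rule(prev, selector, mb));             // traffic from prev matching selector goes to mb
            prev = mb;
        }
        rules.add(new Rule(prev, selector, finalDestination));   // after the last MB, deliver to the destination
        return rules;
    }

    public static void main(String[] args) {
        // Hypothetical example: web traffic entering at the core router must
        // traverse a firewall and then a load balancer.
        for (Rule r : compile("CoreRouter", "tcp dst port 80",
                              Arrays.asList("Firewall", "LoadBalancer"), "WebServers")) {
            System.out.println(r);
        }
    }
}

In the real system the next hop of a rule is an MB type rather than a specific instance; which live instance receives the traffic is decided at forwarding time (see the pswitch operations below).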

Middlebox controller monitors the liveness of MBs and informs pswitches about addition or failure of MBs.

Pswitches
perform three operations:
1) Identify the previous hop traversed by the frame - based on its source MAC address (this can cause problems if the MB modifies the source MAC address) or its incoming interface.
2) Determine the next hop to be traversed by the frame
3) Forward the frame to its next hop using L2 encapsulation: a redirected frame is encapsulated in a new Ethernet frame identified by a new EtherType code. The destination MAC is set to the next MB or server, and the source MAC is set to that of the original frame or of the last MB instance traversed. Preserving the original source MAC address is required for correctness by some MBs.
Forwarding also allows balancing load across multiple instances of the same MB type (specify multiple next hops and use 5-tuple consistent hashing to distribute traffic among them), which is resilient to MB failure (a small sketch of this follows below).
If the next hop is a transparent device (say a firewall), we still need a MAC address to identify it, so that the destination MAC of packets sent to it can be set and the hop after the firewall can recognize the firewall as the previous hop. The paper gets around this by assigning such devices a fake MAC address, which is registered with the MB controller when the device first comes up. If the device is attached to a non-pswitch, however, we need a SrcMacRewriter between the MB and the switch: a stateless device that inserts a special source MAC address that uniquely identifies the MB.
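
Here is a minimal sketch (my own, not the paper's implementation) of hashing a flow's 5-tuple onto a ring of MB instances, the load-balancing step described above. Note how adding an instance moves some existing flows to a new instance, which is exactly the issue raised in drawback 8 below.

import java.util.*;

public class FlowHashRing {
    // Hash point on the ring -> MB instance name.
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();

    void addInstance(String mbInstance) {
        for (int replica = 0; replica < 16; replica++) {         // virtual nodes for better balance
            ring.put((mbInstance + "#" + replica).hashCode(), mbInstance);
        }
    }

    String pick(String srcIp, String dstIp, int srcPort, int dstPort, int proto) {
        if (ring.isEmpty()) return null;
        int h = (srcIp + "|" + dstIp + "|" + srcPort + "|" + dstPort + "|" + proto).hashCode();
        Map.Entry<Integer, String> e = ring.ceilingEntry(h);     // next instance clockwise on the ring
        return (e != null ? e : ring.firstEntry()).getValue();   // wrap around at the end
    }
}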

Drawbacks of the paper
1) The number of rules stored in pswitches does not scale well: each pswitch essentially stores rules for all the policies implemented in the data center. This will not scale if we move to a public cloud with multiple tenants.
2) It leaves MBs unmodified but modifies the switches (so it cannot be adopted in its current form).
3) The paper does not say much about how pswitches fit into a traditional data center topology (it briefly says that they would replace layer-2 switches), and the examples given focus on simple line topologies. There is a disconnect here.
4) It uses flooding-based L2-learning switches, which do not scale well, and broadcasts become a problem.
5) Extra SrcMacRewriter boxes are needed for transparent MBs.
6) Some MBs are stateful (e.g. firewalls), so packets in both the forward and reverse directions must go through the same MB instance; the pswitches must make sure this condition is met.
7) If a policy changes in the future, ensuring that all pswitches switch to the new policy simultaneously is not possible*. Some frames will violate the middlebox traversal guarantee (i.e. they might fully traverse neither the previous nor the new policy). This is an eventual-consistency model and has security vulnerabilities. An end-to-end model would be better here: we change the policy only at the ends and get stronger consistency guarantees.
8) Since pswitches use consistent hashing, adding a new MB instance of an existing type causes re-assignment of some existing flows. This is bad for stateful MBs.

* It is possible to get the consistency guarantees even in this case, but it requires essentially doubling the number of rules installed on the switches (http://conferences.sigcomm.org/sigcomm/2012/paper/sigcomm/p323.pdf).