
Thursday, December 6, 2012

Symbolic link 101

bash-4.1$ stat sample
  File: `sample'
  Size: 5             Blocks: 68         IO Block: 65536  regular file
Device: 1bh/27d    Inode: 502918152   Links: 1
Access: (0600/-rw-------)  Uid: (97399/bruce)   Gid: (   34/ UNKNOWN)
Access: 2013-01-20 18:18:31.624217805 -0500
Modify: 2013-01-20 18:18:31.624217805 -0500
Change: 2013-01-20 18:18:55.152443117 -0500

bash-4.1$ ln -s sample sample1

bash-4.1$ stat sample
  File: `sample'
  Size: 5             Blocks: 68         IO Block: 65536  regular file
Device: 1bh/27d    Inode: 502918152   Links: 1
Access: (0600/-rw-------)  Uid: (97399/bruce)   Gid: (   34/ UNKNOWN)
Access: 2013-01-20 18:18:31.624217805 -0500
Modify: 2013-01-20 18:18:31.624217805 -0500
Change: 2013-01-20 18:18:55.152443117 -0500

So, the link count did not increase. If you don't use the -s flag and create a hard link, the link count increases by 1 on each link operation. 
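
The same experiment can be scripted; here is a minimal Python sketch (the file names are arbitrary examples) showing st_nlink staying at 1 after a symbolic link and going to 2 after a hard link.

import os

with open("sample", "w") as f:
    f.write("hello")
print(os.stat("sample").st_nlink)        # 1

os.symlink("sample", "sample_soft")      # soft link: link count unchanged
print(os.stat("sample").st_nlink)        # still 1

os.link("sample", "sample_hard")         # hard link: link count increases
print(os.stat("sample").st_nlink)        # 2

print(os.lstat("sample_soft").st_nlink)  # lstat() stats the symlink itself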


Use the readlink command to read the contents of a symbolic link:


lrwxrwxrwx.  1 foo admin    5 Dec  6 23:21 avg_s.c -> avg.c


$ readlink avg_s.c
avg.c

So, the symbolic link contains the path to the file it is pointing to.


Soft link pros:
1) Can create links to files in other file systems or on remote file systems.
2) Can create links to directories. Hard links to directories are not allowed in Unix.
3) Editing a file with text editors like vim is more reliable. With hard links, edits to the file's inode are done in place, which makes them prone to inconsistency on failures.
Cons:
1) Wastes an inode and takes up space on disk.
2) Indirection makes access slower.


Hard link pros:
1) Faster access as there is no indirection
2) Protection from accidental deletion.
3) Versioning/backup software can easily make sure that a file is backed up only once.


How to get the contents of a directory?


#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <assert.h>

int main(int argc, char *argv[])
{
    DIR *dp = opendir(".");     /* open the current directory */
    assert(dp != NULL);

    struct dirent *d;
    while ((d = readdir(dp)) != NULL)   /* one entry per iteration, NULL at the end */
    {
        /* entry type, record length, inode number and name */
        printf("%d %hu %ld %s\n", (int)d->d_type, d->d_reclen, (long)d->d_ino, d->d_name);
    }
    closedir(dp);
    return 0;
}


Friday, November 16, 2012

Programming Interview Questions

1) Given an integer x and a sorted array a of N distinct integers, design a linear-time algorithm to determine if there exist two distinct indices i and j such that a[i] + a[j] == x
http://stackoverflow.com/questions/11928091/linear-time-algorithm-for-2-sum
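
One standard linear-time approach is the two-pointer scan sketched below (my own code, not necessarily what the linked answer uses).

def has_pair_with_sum(a, x):
    i, j = 0, len(a) - 1
    while i < j:
        s = a[i] + a[j]
        if s == x:
            return True
        elif s < x:
            i += 1          # need a larger sum: move the left pointer right
        else:
            j -= 1          # need a smaller sum: move the right pointer left
    return False

print(has_pair_with_sum([1, 3, 5, 8, 12], 13))   # True (5 + 8)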

2) Find the starting node of the linked list cycle
http://stackoverflow.com/questions/2936213/explain-how-finding-cycle-start-node-in-cycle-linked-list-work
Using the diagram in the linked answer, let m = distance from the head to the start of the cycle, n = cycle length, k = distance from the head to the point where the slow and fast pointers meet, and p, q = number of complete loops made by the slow and fast pointers. If the slow pointer has taken i steps when they meet, then
i = m + p.n + (k - m)
2i = m + q.n + (k - m)

The first equation simplifies to i = p.n + k and the second to 2i = q.n + k; combining them gives
k = (q - 2p).n    i.e. a multiple of cycle length

Now, let x be the distance from the meeting point forward to the start of the cycle, so that
k - m + x = n
Then,
k = (q - 2p - 1).n + k - m + x
m = (q - 2p - 1).n + x

So, if two pointers start at the beginning of the list and the meeting point and move the same number of steps, they will meet at the starting node of the cycle.
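
A sketch of the two-phase algorithm this derivation justifies (the Node class and function below are my own illustration):

class Node:
    def __init__(self, val):
        self.val = val
        self.next = None

def find_cycle_start(head):
    slow = fast = head
    while fast and fast.next:
        slow, fast = slow.next, fast.next.next
        if slow is fast:              # phase 1: meeting point inside the cycle
            p = head
            while p is not slow:      # phase 2: both advance one step at a time
                p, slow = p.next, slow.next
            return p                  # they meet at the start of the cycle
    return None                       # no cycle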

3) Given a string find all its permutations:
#include <stdio.h>
#include <string.h>

/* res[0..pos-1] holds the indices of base already placed in the permutation. */
void perm(char *base, int *res, int pos)
{
    int i, j;
    for (i = 0; i < strlen(base); i++) {
        for (j = 0; j < pos; j++)       /* skip indices already used */
            if (i == res[j])
                break;
        if (j == pos) {
            res[pos] = i;
            if (pos + 1 != strlen(base))
                perm(base, res, pos + 1);
            else {
                for (j = 0; j < strlen(base); j++)
                    printf("%c", base[res[j]]);
                printf("\n");
                return;
            }
        }
    }
}

int main()
{
    int i;
    char str[10];
    int res[10];
    printf("Enter a string: ");
    scanf("%9s", str);                  /* bounded read to avoid overflow */
    for (i = 0; i < strlen(str); i++) {
        res[0] = i;
        perm(str, res, 1);
    }
    return 0;
}
4) Counting inversions in an array:
http://stackoverflow.com/questions/337664/counting-inversions-in-an-array

5) Find the element appearing more than n/2 times in an array in O(n) time.
http://keithschwarz.com/interesting/code/?dir=majority-element
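
The standard O(n) answer is the Boyer-Moore majority vote, which I believe is what the linked code implements; a short sketch:

def majority_element(a):
    candidate, count = None, 0
    for x in a:
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # candidate is the majority element if one exists; verify with a second pass
    return candidate if a.count(candidate) > len(a) // 2 else None

print(majority_element([2, 7, 2, 2, 3, 2, 2]))   # 2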

6) Find the median of two sorted arrays
http://www.youtube.com/watch?v=_H50Ir-Tves

7) Each node of a linked list is of struct data type

struct node {
    struct node *next;
    struct node *what;
    void *data;
};

what is a pointer to a random node in the list. Duplicate the list in O(N) time without using any extra space.

http://stackoverflow.com/questions/1924741/deep-copy-linkedlist-without-destroying-original-list-and-extra-storage-space-u/1924968#1924968

8) Given an array of size N, find the contiguous subsequence of length at least L which has the maximum sum, in O(N) time.

http://stackoverflow.com/questions/7861387/maximum-contiguous-subsequence-sum-of-at-least-length-l
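
One way to get O(N) is sketched below with prefix sums (my own illustration, not necessarily the approach in the linked answers): for every end index, subtract the smallest prefix sum that lies at least L elements back.

def max_sum_at_least_L(a, L):
    prefix = [0]
    for x in a:
        prefix.append(prefix[-1] + x)
    best = float("-inf")
    min_prefix = float("inf")
    for i in range(L, len(a) + 1):
        min_prefix = min(min_prefix, prefix[i - L])   # best start for a window ending at i
        best = max(best, prefix[i] - min_prefix)
    return best

print(max_sum_at_least_L([1, -2, 3, 4, -1, 2], 2))    # 8 (the subarray 3, 4, -1, 2)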

Tuesday, November 6, 2012

Cache Basics

For uniprocessor:
http://www.ccs.neu.edu/course/com3200/parent/NOTES/cache-basics.html

What constrains the size of L1 cache?
1) Space constraints - since L1 is located on the CPU chip, it needs to be small.
2) The larger the cache, the higher the access latency.
3) Larger caches have more soft errors. By keeping L1 small we can omit the ECC bits, which also lowers its latency.

TLB and Page Tables
http://users.dickinson.edu/~braught/courses/cs354f97/Classes/Class17/Class17LN.html

Sunday, November 4, 2012

Mbps vs MBps vs MiBps

I forget what is what every year, so this is for future me:
http://en.wikipedia.org/wiki/Data_rate_units

Thursday, October 25, 2012

Find duplicates in N log k time in k sorted arrays

Construct a min-heap of size k from the first element of each array, then repeatedly call delmin() and insert the next element from the array the minimum came from. Duplicates show up as consecutive equal elements in the merged output.
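
A quick sketch with Python's heapq.merge, which keeps a heap of size k internally:

import heapq

def find_duplicates(sorted_arrays):
    duplicates = []
    prev = object()                            # sentinel that equals nothing
    for x in heapq.merge(*sorted_arrays):      # k-way merge, O(N log k)
        if x == prev:
            duplicates.append(x)
        prev = x
    return duplicates

print(find_duplicates([[1, 3, 5], [2, 3, 7], [3, 8]]))   # [3, 3]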

Euler cycle in an undirected graph

http://www.graph-magics.com/articles/euler.php

Finding a Hamiltonian cycle is an NP-complete problem

Monday, August 13, 2012

Great book on ns2

Printing on the network

lpstat -a : Lists all available printers
lpr -P <printer_name> <file_name> : Print file
lpq -P <printer_name> : See printer queue
lprm : To cancel a print job
enscript -P <printer_name> -2rhC -f Courier7 <file_name> : Good layout for printing source code

Wednesday, June 27, 2012

I created a shell script in Windows and ran it on Linux...

and it did not work, even though everything seemed to be in order. The problem is that in Windows the EOL marker is \r\n while in Linux it is \n. You need to install dos2unix to convert your file to Unix line endings.
sudo apt-get install dos2unix
Also see:
http://www.virtualhelp.me/linux/164-dos2unix-missing-ubuntu-1004

Monday, June 25, 2012

Towards a Next Generation Data Center Architecture: Scalability and Commoditization

This paper is the precursor to the VL2 paper. Lots of interesting ideas. My random thoughts:

Preliminaries:
Redundancy models
N+1
N+1 means we have 1 backup device per group, so we can tolerate one failure. The spare can replace any of the components, but only one at a time. The minimum number of components is N.

1+1
1+1 means that every component has one dedicated backup. Each component can be replaced by the one
and only other backup device. Minimum number of components is 1.
In a 1+1 system the standby server has the ability to maintain the state, e.g. in a phone network the standby server would be able to keep ongoing calls as it constantly monitors the state of all activities in the active server. In an N+1 system there's only one standby for N active servers and the call state is lost in case of a switchover.

Destination NAT/Half-NAT + Direct Server Return + Source NAT/Full NAT
Read http://lbwiki.com/index.php/DSR for understanding DSR. DSR requires the load balancer and server to be in the same Layer 2 domain. Source/Destination NAT require that all packets pass through them in reverse direction as well. Therefore, the servers should set up the load balancer as their default route. This severely limits the number of servers which a load balancer can support. 


This figure is taken from the excellent book Load Balancing Servers, Firewalls, and Caches


- Workload inside a data center changes frequently which leads to sporadic congestion. VLB helps accommodate arbitrary traffic patterns.
- No link oversubscription: Any server can communicate with any other server at full NIC speed (minimum of sender and receiver's NIC, usually 1 Gbps)
- This is probably the earliest paper proposing the idea of software load balancers on commodity servers in data centers. Advantages: scale-out, flexible load-balancing algorithms, cheap. Disadvantages: can't forward at line rate.
- Packets coming from external sources are sent to an Ingress Server which performs packet encapsulation after contacting the directory service. This is a hidden hardware cost we must incur. Interestingly, VL2 does not even mention anything about Load Balancers or communication with external hosts.
- Used MAC-in-MAC tunneling instead of IP-in-IP tunneling used in VL2. The switches know about other switches (not hosts) and directly connected servers only. 
- OS is changed but not the applications. The network stack performs encapsulation after examining each packet. Would be slower than usual.
- Hose traffic model: No server is asked to send or receive more than what its network interface allows.
- Monsoon proposes that we run IS-IS with the controller, so that it knows where every host is located in order to run the directory service.

Random Facts
- Networking gear consumes 10-20% of the data center's equipment budget
- An Ethernet port is 10% to 50% the cost of an equivalent Layer 3 port
- 16K MAC entries can be stored in commodity switches.

DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers

I started reading this paper a month ago but put it down after 2 pages. I couldn't follow the paper. I tried reading it again a week ago and put it down again. Why is this paper so hard to read? Why am I so stupid? I found two blog posts where students had reproduced the results in the paper and had found the paper very clear and easy to read. I was like
Read blog post 1 2
So if anyone out there is reading this post, I would really appreciate it if you can provide links to blog posts, videos or ppts which explain this paper in an easy way. I want to understand this paper in depth.

Sunday, June 24, 2012

Hedera: Dynamic Flow Scheduling for Data Center Networks

To repeat the experiments get the code from:
http://reproducingnetworkresearch.wordpress.com/2012/06/09/hedera-2/
or
http://reproducingnetworkresearch.wordpress.com/2012/06/06/hedera/

The paper is straightforward and fairly easy to read.

Things learnt
- VLB does per-flow instead of per-packet spreading of flows to preserve packet ordering. TCP treats unordered  packets as a signal of congestion.
- The whole idea of the paper is that Elephant flows must be scheduled/placed carefully to achieve good aggregate bandwidth.
- Global first fit works quite well and there might be no need to run the complicated simulated annealing. The only disadvantage might be the running time of the algorithm.
- NetFPGA Openflow switches have 2 hardware tables: 32 entry TCAM that does matching on variable length prefixes and 32K entry SRAM that matches flow entries with fully qualified 10 tuples

git checkout new branch

I always forget the command for this so I am writing it here:

To view the branches:
git branch -a 
To checkout a branch:
git checkout -b branch_name 
then do
git pull origin branch_name

Friday, June 22, 2012

Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises

This is a SIGCOMM 2008 paper:
http://www.cs.princeton.edu/~chkim/Research/SEATTLE/seattle.pdf

Just for fun, you can see network conference acceptance stats here

I had read the paper before but I felt too lazy to write a post about it. I finally got around to it. This paper can be described in two words: Consistent Hashing. Ok maybe three: Consistent Hashing Ethernet. Definitely four: Consistent Hashing Ethernet Scalable. Read my earlier post on Consistent Hashing for refreshing your concepts. 

The basic premise of the paper is very simple but things get convoluted when you introduce caching, timeouts, mobility and failures. I felt the paper could have addressed these scenarios more lucidly. Also, I felt the paper was badly organized, especially in the middle sections. The good thing is it only requires changes to the switch architecture and control plane and not the end hosts, but if a simple protocol like TRILL took about 7 years to get adopted by Cisco, this will take another decade or so. Here is the rundown.

Why is Ethernet good?
- Plug-and-play
- Host mobility

Why is Ethernet bad?
- Source MAC learning requires flooding. DHCP, ARP packets are flooded which waste link bandwidth and processing resources.
- Broadcast storms because there is no TTL in the header. 
- Spanning tree computation under-utilizes links. Frames are not always sent via shortest paths. Root bridge must be selected carefully as it carries most of the traffic.
- Lots of security vulnerabilities and privacy concerns: snooping on other people's traffic, ARP spoofing, DoS attacks.


L3/L2 architecture: Pools of Ethernet domains connected by IP routers.
- Subnetting wastes IP addresses and increases configuration overhead both of routers and DHCP servers.
- Host mobility is restricted only within the subnet


Enter VLAN
- No restrictions on host mobility. Though we need to trunk the VLAN at all the bridges.
- Scopes broadcasts to small domains
- Each subnet is treated as a VLAN (http://www.youtube.com/watch?v=PpuWBh3EnVM)
- Need to compute multiple spanning trees now and choose root bridges carefully so as to effectively utilize all links and balance load
- Limited number: 4094
- Switches still need to maintain forwarding table data for all hosts in VLANs provisioned to them

The switches run a link-state routing protocol (OSPF) among themselves. This is scalable as it is used to maintain only the switch-level topology and not to disseminate end-host information, although as the network gets bigger it would hit a limit because of large forwarding tables and LSA broadcasts. The paper proposes a remedy for that by dividing the switches into regions (very similar to OSPF areas). Since we are running OSPF between switches, if a switch fails all the other switches will eventually know about it. Also, OSPF helps route packets along the shortest path.

We use consistent hashing to evenly distribute and store (IP,MAC) and (MAC,host_switch_id) key value pairs among the existing switches. Each switch chooses the MAC address of one of its interfaces as its unique network-wide switch_id. Since we use consistent hashing, the information is almost equally spread across all the switches. So, instead of all switches storing all the information, they now store only a subset of it. Switches cache information to avoid doing repeated lookups (more below). Powerful switches can have more nodes in the key space (virtual switches) to hold more information. We can also hash information for services like PRINTER and DHCP_SERVER and store it on switches, or implement anycasting by storing the same IP address with different MAC addresses, or change the MAC address for fail-over. This is a tangent though.
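
As a toy illustration of that key-to-switch mapping (my own sketch, not code from the paper): hash each switch_id onto a ring and store a key on the first switch clockwise from the key's hash.

import hashlib, bisect

def h(value, bits=32):
    return int(hashlib.sha1(value.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    def __init__(self, switch_ids):
        self.points = sorted((h(s), s) for s in switch_ids)   # the ring

    def resolver(self, key):
        i = bisect.bisect(self.points, (h(key),)) % len(self.points)
        return self.points[i][1]          # switch responsible for this key

ring = Ring(["sw-a", "sw-b", "sw-c", "sw-d"])
print(ring.resolver("00:1b:44:11:3a:b7"))  # location resolver for this MAC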

It uses loop-free multicasting to handle general broadcasts and groups instead of VLANs (broadcasting in a group sends unicasts to group members). This was a little hand-wavy in the paper though.

H - hash function used by all switches
H(k) = r_k, the resolver switch for key k. If (k,v) = (MAC, host_switch_id), r_k is called the location resolver.

We have two (k,v) stores: (IP, MAC) and (MAC, host_switch_id).

How is the packet sent from src to dst? Does the host switch see the MAC address and tunnel the packet to the appropriate switch, or does it attach a separate header like TRILL? The paper does not say anything about this.

What information does each switch store?
- (MAC,port) key value pairs of directly attached hosts.
- (IP,MAC) and (MAC,host_switch_id) key value pairs of directly attached hosts. Switch must delete these once it detects the host MAC is unreachable. The location resolver informs the switch if the host has migrated to another switch. In this case the switch updates entry and forwards the packets to the new switch and informs other switches of this change, and possibly deletes the entry after timeout.
- (MAC,host_switch_id) key value pairs for the part of key space it is responsible for.
- (MAC,host_switch_id) key value pairs for the MACs to which hosts directly attached to the switch are sending packets. This is not very big because most hosts communicate with a small number of popular hosts. These entries are removed after a timeout.
- (IP,MAC) or (IP,MAC,host_switch_id) translation for the part of key space it is responsible for. This information is used only for replying to ARP requests.

When a host h_a joins switch s_a
The switch publishes (IP_a, MAC_a) and (MAC_a, s_a) to the resolvers determined by H. It is now responsible for monitoring the liveness of the resolver (how frequently is this done?) and republishing the host information. What if the publisher is the resolver?
The switches also receive the information through OSPF if another switch dies or a new switch enters the network. They go through the cached (k,v) pairs and remove those entries where v = the dead switch. If a new switch has come up and (k,v) pairs have been re-published to it, the old switch still maintains the (k,v), to answer queries from other switches which still don't know about the new switch, and removes the entry after a fixed timeout. How is this timeout decided?

h_a attached to switch s_a wants to communicate with h_b attached to switch s_b
h_a broadcasts an ARP request. s_a intercepts it and forwards it to the resolver. The resolver sends back MAC_b and, if the optimization is enabled, s_b. s_a sends the ARP reply to h_a.
s_a now tunnels all packets destined to MAC_b to s_b.
Without optimization:
s_a calculates H(MAC_b) and tunnels the packet to the resolver (IP tunneling, I guess). The resolver de-tunnels it and finds that the packet must be sent to MAC_b, which it knows how to reach. So, it tunnels the packet to s_b and also tells s_a, so that it can cache this information and use the shortest path from the next time. In case a switch comes alive after a reboot it uses this process to repopulate its cache.


Host Dynamics
3 types of dynamics:
1) Host changes location
The new host switch updates the (MAC,host_switch_id) entry to reflect the changes. 

2) Host changes MAC
Update (IP,MAC), delete the old (MAC,host_switch_id) entry and insert a new one. Other hosts might still have cached the old information. So, the new switch maintains a MAC revocation list, and sends a gratuitous ARP to hosts sending packets to the old MAC. It also informs the sending host's switch about the entry (NewMAC, switch_id). It can still forward the packets sent to the old MAC to the host in the meantime.

3) Host changes IP
Host switch deletes old (IP,MAC) and inserts the new one

If both location and MAC change: when the new switch registers (old_IP, newMAC), the address resolver looks up (oldMAC, oldSwitchid) and tells the old switch about the MAC address change. The old switch now maintains a revocation list, sending gratuitous ARPs to senders along with the new location of the host (newMAC, newSwitchid) to the sending switch.

One drawback of SEATTLE is that it does not support ECMP so it does not utilize links efficiently (even if they are idle).

Thursday, June 21, 2012

Data Center TCP (DCTCP)

SIGCOMM 2010 paper:
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf

This is certainly one of the best-written papers I have read in a long time. The paper is very easy to read (except the analysis in section 3.3) with powerful results. The paper brings into focus the bufferbloat problem, which has received attention lately in home networks. This paper shows its existence in data center networks as well.


Preliminaries
TCP Incast
Read blog post by Brad Hedlund

The paper analyzes measurements from a 6000 server data center and studies performance anomalies. The network should provide low latency to short message flows, high burst tolerance for incast flows (queries) and high throughput for long-lived bandwidth-hungry background flows. DCTCP is to be used strictly for intra-DC flows. The primary insight of the paper is that the switch buffer occupancy of the large flows should be kept as low as possible without affecting their throughput. This helps in two ways:
1) For delay sensitive short messages this reduces network latency
2) For query traffic which can cause Incast, it helps ensure that sufficient buffer space is available to absorb the micro-bursts and hence avoid packet loss, which would otherwise add latency because we have to wait for the TCP re-transmission timeout to expire. Reducing the RTO can fix the incast problem (see paper) but does not help with point 1 above.

Most commodity switches are shared-memory switches that have a logically common buffer for all switch ports/interfaces (fate sharing). The MMU manages fairness by setting a maximum amount of memory an interface can take. If more packets arrive beyond this limit they are dropped (Incast).

Buffer pressure: Since the packet buffer space is shared among all interfaces, long flows on other interfaces can reduce the space available for bursts of short flows on free interfaces leading to loss and timeouts. This does not require the incoming flows to be synchronized.

DCTCP uses the ECN bit which is set by switches when their queue length exceeds a threshold and senders reduce the congestion window in proportion to congestion (i.e. fraction of marked packets). The paper gives accurate guidelines for selecting the values of g and K.
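
As a rough sketch of the sender-side rule described in the paper (alpha <- (1-g)*alpha + g*F per window of data, then cwnd <- cwnd*(1 - alpha/2) if any packet was marked); g = 1/16 below is just an example value.

def dctcp_update(cwnd, alpha, marked, total, g=1.0 / 16):
    F = marked / float(total)           # fraction of ECN-marked packets this window
    alpha = (1 - g) * alpha + g * F     # smoothed congestion estimate
    if marked > 0:
        cwnd = cwnd * (1 - alpha / 2.0) # cut in proportion to congestion
    return cwnd, alpha

cwnd, alpha = dctcp_update(cwnd=100.0, alpha=0.0, marked=5, total=50)
print(cwnd)   # ~99.7: a small, proportional reduction instead of halving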

Thorough evaluations with key takeaways after each experiment.


Tell me why?
1) Why did the authors not compare the performance of their algorithm to TCP Vegas? It would have been interesting to see how DCTCP fares against it, though I agree with the authors' claim that measuring slight increases in RTT in a data center environment can be difficult.
2) The authors show that jittering reduces the response time at higher percentiles at the cost of increasing median response time. Why is this bad if all the responses meet the application deadline?
3) Why do the authors apply the threshold K to the instantaneous queue length instead of the average queue length? Don't they want to avoid the effects of instantaneous packet bursts?
4) The authors say that using Delayed ACKs is important, however, another paper identified them as a serious problem in DCs.



Data center numbers for people who don't have access to data centers
1) The latency targets for data center applications like web search, retail, advertising and recommendation systems are ~10ms to ~100ms. Applications use Partition/Aggregate workflow pattern. Tasks not completed before deadline are cancelled and affect the quality of the final result.
2) 99.91% of traffic is TCP traffic.
3) In absence of queuing, intra-rack RTT = 100μs and inter-rack RTT = 250μs.
4) Each rack in a cluster holds ~40 servers. The data center in this paper has 44 servers per rack. There are typically 20 to 80 VMs per hypervisor.
5) A ToR switch with 48 1 Gbps ports and 2 10 Gbps ports has 4 MB of shared buffer.
6) Median number of concurrent flows active in a 50ms window on each worker node is 36.
7) To avoid TCP Incast developers limit size of response to 2 KB and introduce application level jittering to get asynchronous responses from worker nodes.


Notes to future me
Read TCP Vegas paper
How Facebook handles scale
Remember TCP Large Send Offload

Random quote:
The biggest challenge is to find a novel solution to a novel problem which people in academia deem novel

Monday, June 18, 2012

Bisection bandwidth of a data center

Read:
http://etherealmind.com/bisectional-bandwidth-l2mp-trill-bridges-design-value/

Assuming all the ports are 1Gbps, from the Fat-tree paper:


So, the architecture provides full bisection bandwidth and there is no oversubscription at the aggregation and edge switches.

Sunday, June 17, 2012

Applying NOX to the Datacenter

Get the paper here

The paper demonstrates that NOX can be used to implement various data center architectures (Portland and VL2) and addressing schemes; along with the fact that it scales well, can provide fine-grained QoS, access control and middlebox traversal. The paper has many good insights.

1) Data center networks are a special-case networking paradigm which allows new architectures, specialized addressing schemes and easy modification of the hypervisor and network stack. This paves the way for easier innovation, as shown by recent research work, some of which I have described in this blog. Also, since the data center is a single administrative domain, using a centralized controller is better than using distributed dynamic protocols.

2) The paper describes three types of data centers: Bare-metal (w/o any hypervisor), virtualized (~20VMs per physical server) and virtualized-multitenant (like Amazon EC2)

3) Networking requirements of a data center:
i) Scaling: allocate more servers/VMs effortlessly as the demand increases. The data center's architecture and addressing schemes should be able to scale to at least a million VMs or hundreds of thousands of servers. With bigger scale the forwarding table entries grow, the switches/routers require more ports and hence become costlier, and broadcasts become expensive.
ii) VMs should be able to migrate without any service interruption
iii) Fault tolerance to avoid service interruption

4) Next the paper describes VL2 and Portland. Here are the cons:
In Portland every core switch is connected to every pod with a single link: this will not scale well as it requires too many ports on core routers. VL2 runs OSPF over all its switches to build forwarding tables. This does not scale well because LSPs, which are broadcast, are expensive. VM migration in VL2 is not easy and requires the controller to push the new state to all the switches which are sending traffic to the VM. In both schemes ARPs and new flows prompt sending packets to the controller, which is not scalable in my opinion. The controller should be used only for bootstrapping and on link failures. The middleboxes will take care of QoS and load balancing. Doing fine-grained control is not scalable.

5) NOX can operate in both proactive and reactive mode. In proactive mode the controller computes all the forwarding data beforehand and pushes it onto the switches. This incurs no latency unlike the reactive approach where the first packet of each new flow is sent to the controller for processing.

6) An interesting point in this section was that for larger networks multiple NOX controllers can act in parallel, where each new flow packet is sent to the designated controller in reactive mode. The controllers work independently but must maintain global consistency about the network topology and host address mappings. Achieving consistency will incur more latency and complication, although it depends on the controller application under analysis.
The paper says an individual controller can handle 30K new flows per second with ~10ms latency.
A low cost 1Gbps switch can support 250K source/destination pair flow entries.

How to calculate over-subscription ratio: http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCInfra_3.html#wp1088848

7) In case of link/switch failure the controller has to install new flow entries at all the switches. The complexity and response time will increase with multiple failures. Also, installing low-priority flow entries beforehand will not work in case of multiple failures.





Saturday, June 16, 2012

Simple Expander graph constructions


An expander graph is a sparse graph with high connectivity properties. I will show code for simple constructions of some families of expander graphs using the Python networkx library:


1) LPS graph

From Michael Nielsen's blog
In this example the family of graphs is indexed by a prime number, p. The set of vertices for the graph G_p is just the set of points in Z_p, the field of integers modulo p. We construct a 3-regular graph by connecting each vertex x ≠ 0 to x-1, x+1 and x^(-1). The vertex x = 0 is connected to p-1, 0 and 1. According to the lecture notes by Linial and Wigderson, this was proved to be a family of expanders by Lubotsky, Phillips and Sarnak in 1988.

Python code
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys

def miller_rabin_pass(a, s, d, n):
    a_to_power = pow(a, d, n)
    if a_to_power == 1:
        return True
    for i in xrange(s-1):
        if a_to_power == n - 1:
            return True
        a_to_power = (a_to_power * a_to_power) % n
    return a_to_power == n - 1


def miller_rabin(n):
    d = n - 1
    s = 0
    while d % 2 == 0:
        d >>= 1
        s += 1
    for repeat in xrange(20):
        a = 0
        while a == 0:
            a = random.randrange(n)
        if not miller_rabin_pass(a, s, d, n):
            return False
    return True

def ramanujan_graph(p):
    if not miller_rabin(p):
        print "p is not prime"
        return None
    g = nx.Graph()
 
    for i in range(p):
        g.add_node(i)
    for i in range(p):
        if i == 0:
            g.add_edge(0, p-1)
            g.add_edge(0, 1)
        else:
            g.add_edge(i, i-1)
            g.add_edge(i, (i+1) % p)
            g.add_edge(i, (i**(p-2)) % p)  # i**(p-2) mod p is the inverse of i
    g = nx.Graph(g) #remove parallel edges
    g.remove_edges_from(g.selfloop_edges()) #remove self loops
    return g
g = ramanujan_graph(41)
PL.figure()
nx.draw_circular(g)
plt.show()

2) Paley graph

The Paley graph of order q (where q is a prime or an integer power of a prime) has q nodes, each with degree (q-1)/2. The Paley graph is undirected when q % 4 = 1. The set of vertices for the graph G_p is just the set of points in Z_p. Two vertices a, b in graph G_p are connected if a-b = x^2 % p for some x in [1,p). See also: http://mathworld.wolfram.com/PaleyGraph.html

Python code

import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys

def paley_graph(p):
    if not p%4 == 1:
        return None
    square_list = []
    for i in range((p-1)/2):
        square_list.append((i**2)%p)
     
    g = nx.Graph()
 
    for i in range(p):
        g.add_node(i)
    for i in range(p):
        for j in square_list:
           g.add_edge(i,(i+j)%p)
    g=nx.Graph(g) #remove parallel edges
    g.remove_edges_from(g.selfloop_edges()) #remove self loops
    return g      

g = paley_graph(17)
print nx.diameter(g)
PL.figure()
nx.draw_circular(g)
plt.show()

3) Margulis construction

The vertex set is Z_n x Z_n, and each vertex (i, j) is connected to (i + 2j, j), (i + 2j + 1, j), (i, 2i + j) and (i, 2i + j + 1), all taken mod n (this is what the code below does).


import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys

def margulis_graph(n):
    #Graph has n^2 vertices
    g = nx.Graph()
 
    for i in range(n):
        for j in range(n):
            g.add_node((i,j))

    for (i,j) in g.nodes():
        g.add_edge((i,j),((i+2*j)%n,j))
        g.add_edge((i,j),(i,(2*i+j)%n))
        g.add_edge((i,j),(i,(2*i+j+1)%n))
        g.add_edge((i,j),((i+2*j+1)%n,j))
    g=nx.Graph(g) #remove parallel edges
    g.remove_edges_from(g.selfloop_edges()) #remove self loops
    return g      

g = margulis_graph(5)
print nx.diameter(g)
PL.figure()
nx.draw_circular(g)
plt.show()

4) Hypercube graph
See Wikipedia. Use nx.hypercube_graph(n)
5) Superconcentrators
Smaller Explicit Superconcentrators

Acknowledgments
I would like to thank Abhishek Bhowmick and Ankit Garg for helpful discussions.

Saturday, June 9, 2012

Installing ns2 on Ubuntu 11.10


Prerequisites

Install development libraries for X11:

sudo apt-get install libx11-dev libxt-dev g++


Download ns-allinone package from here:
http://www.isi.edu/nsnam/ns/ns-build.html#allinone

Untar the package

tar -xvf ns-allinone-2.35.tar.gz

Install the package:

sudo ./install


Put the required path information in your .bashrc file like this:

export PATH=$PATH:<Place your paths here>

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH: <place the LD_LIBRARY_PATHS>

export TCL_LIBRARY=$TCL_LIBRARY:<place the TCL_LIBRARY>


After these steps, you can now run the ns validation suite with

cd ns-2.35; ./validate


Wednesday, June 6, 2012

RBridges and TRILL

http://www.ieee-infocom.org/2004/Papers/26_1.PDF
http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_14-3/143_trill.html

Ivan Pepelnjak on TRILL:
"bridging forwarding paradigms are used to forward host-to-host data traffic, including building MAC address table by listening to data traffic and flooding of packets sent to unknown MAC destination. TRILL therefore retains most of the non-scalable properties of transparent bridging"

Brad Hedlund on TRILL in data center:
http://bradhedlund.com/2010/05/07/setting-the-stage-for-trill/

Cisco FabricPath:
http://www.youtube.com/watch?v=M85WYV8i3jg

Saturday, June 2, 2012

A Scalable, Commodity Data Center Network Architecture

First read this nice article on data center architectures by Brad Hedlund:
http://bradhedlund.com/2011/04/21/data-center-scale-openflow-sdn/

The objective of the paper is the following: achieve full bisection bandwidth (if the network is segmented into two equal parts, this is the bandwidth between the two parts) between all the hosts in a network using commodity Ethernet (incrementally scalable) switches in the face of over-subscription.

Here is a traditional L3/L2 design:


There are typically 20-40 servers per rack. Each server is connected to a ToR switch via a 1 Gbps link. ToR switches have 10 Gbps uplinks to ARs. ToRs have 48 ports while ARs and CRs have 128 ports. ARs and CRs have only 10 Gbps ports. So, the target is to have a DC with 25,000+ hosts and a bisection bandwidth of 1 Gbps.

The paper proposes a Clos Network/Fat-tree network as a new network architecture for achieving this goal. They propose a L3/L2 architecture:



k - Number of ports of the switches used in the fat-tree
There are k pods, each pod has 2 layers of k/2 switches. Each edge switch is connected to k/2 hosts. There are (k/2)^2 CRs. The topology can support (k/2)*(k/2)*k = k^3/4 hosts. Between any 2 hosts there are (k/2)^2 paths. We must spread the traffic on these paths somehow to achieve the bisection bandwidth.
Hosts connected to the same edge switch form a /24 subnet. VMs can migrate only within their own subnet. No talk of VLANs or migration between subnets in the paper.
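
Plugging a concrete port count into these formulas shows the scale; k = 48 here is just an example value.

k = 48
core_switches = (k // 2) ** 2       # 576
switches_per_pod = k                # k/2 edge + k/2 aggregation
hosts = k ** 3 // 4                 # 27648
paths_between_hosts = (k // 2) ** 2 # 576 equal-cost paths through the core
print("%d %d %d" % (core_switches, hosts, paths_between_hosts))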


Addressing (For a 10/8 network)
CR - 10.k.j.i - i and j lie in the range [1,k/2]
Pod switch - 10.pod.switch.1 (switch number lies in [0,k-1])
Host - 10.pod.switch.ID (ID lies in [2,k/2+1]) - This is very wasteful


Routing
Here is how the routing table looks like for a pod switch:


- There are k/2 prefixes for the subnets in the same pod and k/2 suffixes for k/2 hosts. Prefixes are preferred to suffixes.
- For intra-pod traffic the lower level pod switches send the packet to the upper level (evenly split between upper layer switches) which sends it back to the other edge switch. For inter-pod traffic the ARs spread the traffic among the CRs based on the last byte of the IP address. TCP ordering is not broken as packets always follow the same path to CR.
- The CRs aggregate the network prefixes of all the subnets (i.e. k /16 networks), which should not take too much space in the forwarding tables. The traffic spreading takes place only on the route from the edge switches to the CRs.
- The paper advocates the use of a central controller (Openflow anyone?) for generating the routing tables and statically storing it on the switches. Since, the routing tables will not change this seems alright.

Friday, June 1, 2012

VL2: A Scalable and Flexible Data Center Network

The paper also tries to propose an architecture which allows the data center to become a single large Layer-2 domain. I will not go into the advantages of doing so - low cost, zero-configuration (apart from DHCP), make services scalable, easy VM migration etc.
This paper raises some important research questions in the field of data center networking. It combines a measurement study and a new architecture.

Q1. How to provide high capacity throughput to intra-dc flows in face of over-subscription?

Q2. How to isolate the traffic of one service from another? What prevents one service from flooding the network with its own traffic?
The paper does not really solve this problem

Q3. How to make Ethernet broadcasts scalable? VLANs don't scale well.
Only ARP and DHCP are made scalable. No talk about multicast.

System design:
1) Clos topology (links between AR and CR form a complete bipartite graph)
2) Virtual Layer-2 (hence the name VL2)
3) VLB and ECMP
4) Central directory service

Here is the topology:



The CR, AR and ToR use OSPF to learn about the switch-level topology and shortest path routing.

The measurement study done for the data center they were analyzing showed two properties:
1) Traffic pattern inside a DC is very unpredictable and changes rapidly
2) Ratio of traffic volume between servers to traffic volume leaving/entering the DC is 4:1

VL2 routing
- VLB is used to spread traffic across multiple CRs. Take a random path from a source ToR to a CR and from CR to destination ToR switch.
- Use anycast address at the CR and ToR to enable ECMP forwarding.
- Two address spaces - application-specific addresses (AAs) and location-specific addresses (LAs). The directory system maintains a mapping from AA to LA. Every server in the DC has a VL2 agent running in the network stack. The agent traps the ARP request and instead sends a unicast to the directory service and caches its reply. It is also responsible for tunneling (IP in IP) the packet to the ToR switch of the destination host. The agent also rate-limits broadcast traffic to prevent storms.
- DHCP is implemented using relay agents in each ToR and a central DHCP server. The directory service is notified of the LA-AA mapping when a new server joins the network and also after a server migrates to a new rack (by looking for gratuitous ARP).
- The directory service has two components:
1) Read optimized directory servers (store soft state which may be stale)
2) Write optimized strongly consistent RSM servers. They run Paxos for reliability and fault tolerance.

Here is how routing is done -



VL2 utilizes both VLB and ECMP (each one complementing other's weakness). If you use only VLB:
Each agent is provided a random intermediate router address. We don't need anycast addresses. Each router interface can have a different IP. OSPF takes care of reachability. We stack up the IP headers at the source and off we go. Now, what happens if the intermediate router fails? We will have to update a huge number of VL2 agents to use a new random router.
If we use ECMP alone: Configure each router's interface to have the same unicast address (10.0.0.1). ECMP makes sure that any one of these can be taken as the next hop. In this case we don't need to do multiple encapsulations. A single encapsulation with H(ft) | dest_ToR_IP is enough. But commodity switches support only 16-way ECMP, so we can't spread traffic across all the intermediate routers.
Solution is to combine both VLB and ECMP: We define several anycast addresses, each associated with only as many intermediate switches as ECMP can accommodate. The agent is given a random intermediate router IP address. We perform multiple encapsulations. OSPF and ECMP reacts upon link/switch failure, eliminating the need to notify agents.
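
Here is a toy sketch of my reading of that combination (the addresses and group size below are my own assumptions): intermediate switches are grouped under anycast LAs so that each group stays within the ECMP limit, the agent picks a group at random per flow (VLB), and ECMP picks the concrete switch behind it.

import random

ECMP_WAYS = 16   # commodity switches support 16-way ECMP (per the paper)

def anycast_groups(intermediate_las, anycast_las):
    # Each anycast LA fronts at most ECMP_WAYS intermediate switches.
    groups = {}
    for i, la in enumerate(intermediate_las):
        groups.setdefault(anycast_las[i // ECMP_WAYS], []).append(la)
    return groups

def encapsulate(dest_aa, dest_tor_la, anycast_las):
    intermediate = random.choice(anycast_las)    # VLB: random intermediate group
    return [intermediate, dest_tor_la, dest_aa]  # stacked headers, outermost first

groups = anycast_groups(["10.1.0.%d" % i for i in range(32)], ["10.0.0.1", "10.0.0.2"])
print(len(groups["10.0.0.1"]))                            # 16 switches per anycast LA
print(encapsulate("20.0.0.5", "10.2.0.7", list(groups)))  # e.g. ['10.0.0.2', '10.2.0.7', '20.0.0.5']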

Drawbacks
1) Bipartite graph will limit scalability of this approach.
2) Does not talk about how to tackle multiple tenants each with same private IP address.
3) Configuring routers for OSPF is not scalable. This is an error prone process. Not sure if OSPF will be scalable (each router stores all the LSAs and broadcasts HELLO packets). Need to configure areas, which creates further configuration overhead.
4) Every outgoing packet needs to be encapsulated (the agent also computes the hash and puts it in each packet). This increases latency and reduces useful bandwidth. If we have multiple layers in the data center hierarchy I don't think this approach will scale.
5) Hosts which need to communicate with outside world are given both LA and AA. This essentially fixes their position in the topology and further increases the burden on OSPF (these IPs are not aggregated).
6) VL2 has eventual consistency when it comes to AA-LA mappings. This implies that VM migration will have more downtime and packet loss.

Random
- A commodity switch stores only 16K entries in its routing table
- ToR informs the directory server in case it receives a packet for which it has no mapping which triggers the directory server to update the VL2 agent caches. This can lead to DOS attacks on the directory servers.
- The servers had delayed ACKs turned on which should be turned off.
- What is a hose model? Not really clear from the paper. The reference cited talks about it in terms of VPNs.



Thursday, May 31, 2012

Portland: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

This is an influential paper in the Data Center arena. You can also see a video presentation of the paper here.

The basic idea of the paper is very simple: make Ethernet scalable by making it hierarchical like IP. That's it. I personally think SEATTLE is a more elegant solution to the problem.

First two sections give the same rhetoric about how L2 is simple but not scalable and how L3 is scalable but hinders mobility and is prone to configuration errors. The Related Work section is thorough.

Design:
1) Fat-tree topology


2) Fabric manager (centralized directory service)

3) Distributed location discovery protocol (basically send packets like LSPs in OSPF for checking if a switch/host is alive and also so that switches learn their pod and position numbers)

The paper defines a Pseudo MAC address (48 bits) - PMAC - which encodes the location of a host in the topology. It is of the form pod.position.port.vmid (hierarchical, like an IP address).
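
For concreteness, here is a sketch of packing and unpacking a PMAC; the field widths (16-bit pod, 8-bit position, 8-bit port, 16-bit vmid) are what I remember from the paper and add up to the 48 bits of a MAC address.

def encode_pmac(pod, position, port, vmid):
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    return ((pmac >> 32) & 0xFFFF, (pmac >> 24) & 0xFF,
            (pmac >> 16) & 0xFF, pmac & 0xFFFF)

pmac = encode_pmac(pod=1, position=2, port=3, vmid=7)
print("%012x" % pmac)     # 000102030007, i.e. the MAC 00:01:02:03:00:07
print(decode_pmac(pmac))  # (1, 2, 3, 7)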

Let's see how a packet travels from a host (1, 10.0.1.2) in Pod 0 to a host (2, 10.1.1.2) in Pod 1

- Host 1 sends an ARP: Who has 10.1.1.2?
- The edge switch intercepts the ARP and if it sees a source MAC it hasn't seen before it generates a PMAC for it (edge switch assigns a monotonically increasing vmid to MACs it sees on a particular port) and creates an entry in its local PMAC table mapping the host's AMAC (actual MAC) and IP to its PMAC. It then communicates the IP-PMAC mapping to Fabric manager. This data is used to reply to ARP requests from other hosts and for translating the PMAC to AMAC on packets destined to the host. The edge switch also asks the Fabric manager if it knows the PMAC for 10.1.1.2.
If it does:
The edge switch constructs an ARP reply with the PMAC from fabric manager and sends it to the host. Host learns the PMAC (48 bits) instead of AMAC. That is why PMAC is 48 bits long.
If it does not:
ARP is broadcasted like it is in standard Ethernet. The target host replies with its AMAC. The edge switch of target host rewrites the AMAC with PMAC before forwarding to querying host and fabric manager (This is how the fabric manager gets populated after failure)
- 10.0.1.2 sends the packet to the PMAC it has learnt. The last hop edge switch translates the PMAC to AMAC before sending it to the target host. End hosts don't see their own PMACs but cache PMACs of other hosts.

Forwarding is pretty simple. If the PMAC is in the same Pod but different position forward it to your aggregation switch. If it is in different Pod aggregation switch will forward it to the core switch. The switch through the LDP learns which switch is connected to it on which port. The forwarding data on a switch is aggregatable like IP so it does not take up much space. 
The PMACs can support expansion of the DC (adding another Pod). The multi-rooted tree topology however should remain fixed which is a fair assumption.

A good note:
When considering a distributed versus centralized protocol there is an inherent trade off between protocol simplicity and system robustness.

The fabric manager contains soft state data of the following type:

Now the fabric manager will contain 1 million entries (corresponding to all the hosts in the DC). Is this scalable? DHT anyone?
The switches send LDMs to each other and the fabric manager also monitors connectivity to each switch module. Is all this broadcast traffic scalable?

Drawbacks:
1) What about traffic coming in from the Internet? How do you scalably handle all that traffic?
2) The paper says that their approach works for multi-rooted trees and they take fat-tree topology as an example but that is easier said than done. If there is one more layer between the aggregation and core layer their LDP protocol cannot work without the help from fabric manager.
3) VL2 runs OSPF and Portland runs LDP, both of which require some configuration and broadcasts. Also, both schemes use indirection, which involves caching of information for speedup. So, when the cache is invalid, the central controller/fabric manager has to take care of things and this is tedious.
4) How does it account for middleboxes (MBs) in a data center? What if an MB is filtering on dst MAC or something?
5) Cannot be implemented in conjunction with existing data center topology because legacy switches don't understand MAC prefixes or LDP. Even existing Openflow switches don't understand LDPs so these packets will have to be processed by the controller. Not backwards compatible.


To Read

Thursday, May 17, 2012

How does the whole business of DNS and domain names work?

How do domain names actually work?

Force ssh to not use public key


Force ssh to not authenticate using a public key:
ssh -o "PubkeyAuthentication no" user@server

Wednesday, May 16, 2012

Why does Ethernet have a minimum frame size of 64 bytes?

The IEEE 802.3 standard states that there can be a maximum of five physical segments between any two nodes, with four repeaters between them. When using 10Base5 cabling, each physical segment can be up to 500 meters long. Therefore, an Ethernet network's maximum linear length is 2500 meters.


Speed of light in copper is about 2/3 of the speed of light in vacuum, i.e. roughly 2 x 10^8 m/s. A signal therefore takes about 12.5 microseconds to cross 2500 meters, or about 25 microseconds for the round trip.


In order to detect a collision an adapter must keep transmitting for at least 25 microseconds.


At 10 Mbps:
After the mth collision, choose K randomly from {0, …, 2^m - 1}
and wait for K*512 bit times before trying again.
At 10 Mbps that is K * 51.2 microseconds.


So, for the minimum frame:


12 bytes  - Src+Dst MAC address
2 bytes - Ethernet type
4 bytes - CRC
----------
18 bytes
----------


Preamble = 8 bytes, 7 bytes of 10101010 + 1 byte of Start of frame delimiter 10101011
IFG - 12 bytes
Jamming signal - 4 bytes
The minimum frame is 512 bits = 64 bytes; subtracting the 18 bytes above leaves 46 bytes as the minimum payload.


Now IP + TCP header = 40 bytes. Therefore, the OS must pad the data before sending it to the adapter.
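
A quick check of the arithmetic above:

slot_bits = 512                  # minimum transmission = 512 bit times
rate_bps = 10e6                  # 10 Mbps
print(slot_bits / rate_bps)      # 5.12e-05 s = 51.2 microseconds

min_frame = 64                   # 512 bits / 8
header_overhead = 12 + 2 + 4     # src+dst MAC + EtherType + CRC = 18 bytes
min_payload = min_frame - header_overhead
print(min_payload)               # 46 bytes
print(min_payload - 40)          # 6 bytes of padding for an empty TCP/IP segment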

Tuesday, May 15, 2012

Chord: A scalable peer-to-peer lookup service for Internet applications

I have decided to write down reviews of all the papers I read from now on. This is for the sole purpose of improving my understanding of the papers, helping me put my thoughts down in a coherent fashion and improving my writing skills.

This paper was published in SIGCOMM 2001 and has been cited more than 9000 times. I will go over the basic algorithm and system model first and finally note down my evaluation.

Let's get some definitions down first:


successor(k) - the successor node of key k is the node whose identifier is equal to or follows k in identifier space.
n.successor - successor of a node n is the next node in the identifier space. successor(0) = 1, successor(1) = 3, successor(3) = 0. Similarly we have n.predecessor
In the figure the green dots represent nodes.

m - number of bits in the key/node identifier. Both nodes and keys are mapped to the same identifier space.
Each node maintains m entries in a routing table called finger table.


The ith finger of node n, or the ith entry in the finger table of node n, is denoted by n.finger[i].node
The ith entry in the finger table of node n = s = successor(n + 2^(i-1)) where 1 <= i <= m.

n.successor = n.finger[1].node
n.finger[i].start = (n + 2^(i-1)) mod 2^m
n.finger[i].interval = [n.finger[i].start, n.finger[i+1].start)



In the right figure, m = 3. The finger table of node 0 will have:
1st entry: 0.finger[1].start = (0 + 2^0) mod 2^3 = 1 => successor(1) = 1
2nd entry: 0.finger[2].start = (0 + 2^1) mod 2^3 = 2 => successor(2) = 3
3rd entry: 0.finger[3].start = (0 + 2^2) mod 2^3 = 4 => successor(4) = 0

0.finger[4].start = (0 + 2^3) mod 8 = 0

1st interval: [1, 2)
2nd interval: [2, 4)
3rd interval: [4, 0)
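
A small sketch (my own code) that reproduces the table above for the 3-bit example with nodes {0, 1, 3}:

def successor(k, nodes, m):
    # first node clockwise from k on the identifier circle
    return min(nodes, key=lambda n: (n - k) % (2 ** m))

def finger_table(n, nodes, m):
    table = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % (2 ** m)
        table.append((start, successor(start, nodes, m)))
    return table

print(finger_table(0, [0, 1, 3], 3))   # [(1, 1), (2, 3), (4, 0)]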

See the paper for pseudocode for routing and joining.

The finger table also contains the keys that the node is responsible for storing and the node's successor and predecessor. Each finger table entry includes both the Chord identifier and the IP address and port number of the node.

Here is how the finger table is updated:

Node 6 joins -




Node 1 leaves -


See the stabilization algorithm in this ppt

Here is how I envision Chord will work in a P2P system:
When a new node joins a system it contacts a well-known server to help it find its successor. When the node has been integrated into the system it must allocate some space for the key-value store. If it wants to share something, the cryptographic hash (SHA-1, HMAC) of the file name it wants to share must be stored in the system as a key. The value corresponding to this key will be its IP address and port number. The system should store multiple copies of this (key, value) pair for high availability. This might be done using a different hash function; you can have multiple Chord networks using different hash functions. You can also simply store each (key, value) pair in the designated node and its k successors.
If someone wishes to download this file they must calculate a hash of the search query to get the key and see if it exists in the system. If it does, the value will tell them the node IP address which has the data. If they download the file, their own IP address should get added to all the copies of the (key, value) pair.
I suppose this is how magnet links work too, though BitTorrent is based on a different DHT called Kademlia, which uses a different routing algorithm but is also based on consistent hashing like Chord. I will be posting about it soon.
An obvious disadvantage of this way of storing data is that we cannot perform wildcard searches like *.mp3. I don't know how that might be done.
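
A sketch of the publishing step described above, with SHA-1 truncated to m bits; the helper names and the dict standing in for the DHT are my own.

import hashlib

def chord_key(name, m=160):
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % (2 ** m)

def publish(dht, filename, ip, port, m=160):
    key = chord_key(filename, m)
    dht.setdefault(key, []).append((ip, port))   # keep every peer holding the file
    return key

dht = {}
k = publish(dht, "lecture.mp4", "192.0.2.7", 6881)
print(k)
print(dht[k])   # [('192.0.2.7', 6881)]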

For a system with N nodes Chord proves that it can resolve all lookups via O(log N) messages by maintaining information about O(log N) other nodes (in a finger table). It also requires O(log^2 N) messages to re-establish the routing invariants and finger tables in case any node joins or leaves the system.

P2P systems have very high churn rates. The large number of messages that need to be exchanged and the changes required in the finger tables every time a node joins or leaves the system, in my opinion, make it infeasible in practice. It would be better to use Chord for the ultrapeer network only, since ultrapeers have a low churn rate.