Thursday, December 6, 2012
Symbolic link 101
bash-4.1$ stat sample
File: `sample'
Size: 5 Blocks: 68 IO Block: 65536 regular file
Device: 1bh/27d Inode: 502918152 Links: 1
Access: (0600/-rw-------) Uid: (97399/bruce) Gid: ( 34/ UNKNOWN)
Access: 2013-01-20 18:18:31.624217805 -0500
Modify: 2013-01-20 18:18:31.624217805 -0500
Change: 2013-01-20 18:18:55.152443117 -0500
bash-4.1$ ln -s sample sample1
bash-4.1$ stat sample
File: `sample'
Size: 5 Blocks: 68 IO Block: 65536 regular file
Device: 1bh/27d Inode: 502918152 Links: 1
Access: (0600/-rw-------) Uid: (97399/bruce) Gid: ( 34/ UNKNOWN)
Access: 2013-01-20 18:18:31.624217805 -0500
Modify: 2013-01-20 18:18:31.624217805 -0500
Change: 2013-01-20 18:18:55.152443117 -0500
So, the link count did not increase. If you omit the -s flag and create a hard link instead, the link count increases by 1 on each link operation.
Use the readlink command to read the contents of a symbolic link:
lrwxrwxrwx. 1 foo admin 5 Dec 6 23:21 avg_s.c -> avg.c
$ readlink avg_s.c
avg.c
So, the symbolic link contains the path to the file it is pointing to.
Soft link pros:
1) Can create links to files on other file systems or remote file systems.
2) Can create links to directories. Hard links to directories are not allowed in Unix.
3) Editing a file with a text editor like vim is more reliable. With hard links, edits to the file's inode are done in place, which makes it prone to inconsistency on failures.
Cons:
1) Wastes an inode and takes space on disk.
2) Indirection makes access slower.
Hard link pros:
1) Faster access, as there is no indirection.
2) Protection from accidental deletion.
3) Versioning/backup software can easily make sure that a file is backed up only once.
How to get the contents of a directory?
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <assert.h>

int main(int argc, char *argv[])
{
    DIR *dp = opendir(".");
    assert(dp != NULL);

    struct dirent *d;
    while ((d = readdir(dp)) != NULL) {
        /* entry type, record length, inode number, name */
        printf("%d %hu %ld %s\n",
               (int)d->d_type, d->d_reclen, (long)d->d_ino, d->d_name);
    }
    closedir(dp);
    return 0;
}
Saturday, November 24, 2012
Tar 101
I keep forgetting the options
http://www.cyberciti.biz/faq/how-do-i-compress-a-whole-linux-or-unix-directory/
Thursday, November 22, 2012
Virtual Memory Management in x86
System Developer Manuals:
http://www.intel.com/content/www/us/en/processors/architectures-software-developer-manuals.html
Condensed picture:
http://osr507doc.sco.com/en/OSAdminG/addr_trans.html
Some more explanation:
http://www.cs.cmu.edu/~410/doc/segments/segments.html
When is LDT used?
http://en.wikipedia.org/wiki/Local_Descriptor_Table
http://stackoverflow.com/questions/4575032/reason-for-different-segments-in-linux-on-x86
Friday, November 16, 2012
Programming Interview Questions
1) Given an integer x and a sorted array a of N distinct integers, design a linear-time algorithm to determine whether there exist two distinct indices i and j such that a[i] + a[j] == x
http://stackoverflow.com/questions/11928091/linear-time-algorithm-for-2-sum
2) Find the starting node of the linked list cycle
http://stackoverflow.com/questions/2936213/explain-how-finding-cycle-start-node-in-cycle-linked-list-work
The original diagram is gone, so define: m = number of nodes from the head to the start of the cycle, n = cycle length, k = number of steps from the head to the point where the two pointers meet, x = distance from the meeting point forward to the cycle start, and i = number of steps the slow pointer has taken when they meet (p and q count the loops completed by the slow and fast pointers). Then
i = m + p.n + (k - m)
2i = m + q.n + (k - m)
Eliminating i gives
k = (q - 2p).n, i.e. a multiple of the cycle length
Now, let
k - m + x = n
Then,
k = (q - 2p - 1).n + k - m + x
m = (q - 2p - 1).n + x
So, if two pointers start at the beginning of the list and at the meeting point and move the same number of steps, they will meet at the starting node of the cycle.
3) Given a string find all its permutations:
#include <stdio.h>
#include <string.h>

/* Print every permutation of base.  res[0..pos-1] holds the indices of the
 * characters already placed; each level of recursion picks one unused index. */
void perm(char *base, int *res, int pos)
{
    int i, j;
    int len = strlen(base);
    for (i = 0; i < len; i++) {
        for (j = 0; j < pos; j++)        /* skip indices already used */
            if (i == res[j])
                break;
        if (j == pos) {
            res[pos] = i;
            if (pos + 1 != len)
                perm(base, res, pos + 1);
            else {
                for (j = 0; j < len; j++)
                    printf("%c", base[res[j]]);
                printf("\n");
                return;                  /* only one index was left anyway */
            }
        }
    }
}

int main(void)
{
    char str[10];
    int res[10];
    printf("Enter a string: ");
    scanf("%9s", str);                   /* bound the read: str holds 9 chars + NUL */
    perm(str, res, 0);
    return 0;
}
4) Counting inversions in a array:
http://stackoverflow.com/questions/337664/counting-inversions-in-an-array
5) Find the element appearing more than n/2 times in an array in O(n) time.
http://keithschwarz.com/interesting/code/?dir=majority-element
6) Find the median of two sorted arrays
http://www.youtube.com/watch?v=_H50Ir-Tves
7) Each node of a linked list is of struct data type
struct node {
struct node *next;
struct node *what;
void *data;
};
what is a pointer to a random node in the list. Duplicate the list in O(N) time without using any extra space.
http://stackoverflow.com/questions/1924741/deep-copy-linkedlist-without-destroying-original-list-and-extra-storage-space-u/1924968#1924968
8) Given an array of size N, find the sub-sequence of elements of size L in the array which has the maximum sum in O(N) time.
http://stackoverflow.com/questions/7861387/maximum-contiguous-subsequence-sum-of-at-least-length-l
Tuesday, November 6, 2012
Cache Basics
For uniprocessor:
http://www.ccs.neu.edu/course/com3200/parent/NOTES/cache-basics.html
What constrains the size of the L1 cache?
1) Space constraints - since L1 is located on the CPU chip, it needs to be small.
2) The larger the cache, the higher the access latency.
3) Larger caches have more soft errors. Keeping L1 small lets us remove the ECC bits, which also lowers its latency.
TLB and Page Tables
http://users.dickinson.edu/~braught/courses/cs354f97/Classes/Class17/Class17LN.html
Sunday, November 4, 2012
Mbps vs MBps vs MiBps
I forget what is what every year, so this is for future me:
http://en.wikipedia.org/wiki/Data_rate_units
Thursday, October 25, 2012
Find duplicates in N log k time in k sorted arrays
Construct a min-heap of size k from the first element of each array, then repeatedly call delmin() and push the next element from the array the minimum came from. Consecutive equal minimums are duplicates.
Monday, August 13, 2012
Printing on the network
lpstat -a : Lists all available printers
lpr -P <printer_name> <file_name> : Print file
lpq -P <printer_name> : See printer queue
lprm : To cancel a print job
enscript -P <printer_name> -2rhC -f Courier7 <file_name> : Good layout for printing source code
Wednesday, June 27, 2012
I created a shell script in Windows and ran it on Linux...
and it did not work. Everything seemed to be in order. The problem is that in Windows the EOL marker is \r\n, while in Linux it is \n. You need to install dos2unix to convert your file to Unix line endings.
sudo apt-get install dos2unix
Also see:
http://www.virtualhelp.me/linux/164-dos2unix-missing-ubuntu-1004
Tuesday, June 26, 2012
Learn Python
Awesome videos by Nick Parlante. All righty
1) http://www.youtube.com/watch?v=tKTZoB2Vjuk
2) http://www.youtube.com/watch?v=EPYupizJYQI
3) http://www.youtube.com/watch?v=haycL41dAhg
4) http://www.youtube.com/watch?v=kWyoYtvJpe4
5) http://www.youtube.com/watch?v=uKZ8GBKmeDM
6) http://www.youtube.com/watch?v=Nn2KQmVF5Og
7) http://www.youtube.com/watch?v=IcteAbMC1Ok
Monday, June 25, 2012
Towards a Next Generation Data Center Architecture: Scalability and Commoditization
This paper is the precursor to the VL2 paper. Lots of interesting ideas. My random thoughts:
Preliminaries:
Redundancy models
N+1
N+1 means we have 1 backup device per group, so we can tolerate one failure. Any component can replace
any of the components, but only once. Minimum number of components is N.
1+1
1+1 means that every component has one dedicated backup. Each component can be replaced by the one
and only other backup device. Minimum number of components is 1.
In a 1+1 system the standby server has the ability to maintain the state, e.g. in a phone network the standby server would be able to keep ongoing calls as it constantly monitors the state of all activities in the active server. In an N+1 system there's only one standby for N active servers and the call state is lost in case of a switchover.
Destination NAT/Half-NAT + Direct Server Return + Source NAT/Full NAT
Read http://lbwiki.com/index.php/DSR for understanding DSR. DSR requires the load balancer and server to be in the same Layer 2 domain. Source/Destination NAT require that all packets pass through them in reverse direction as well. Therefore, the servers should set up the load balancer as their default route. This severely limits the number of servers which a load balancer can support.
(Figure illustrating DSR, taken from the excellent book Load Balancing Servers, Firewalls, and Caches.)
- Workload inside a data center changes frequently which leads to sporadic congestion. VLB helps accommodate arbitrary traffic patterns.
- No link oversubscription: Any server can communicate with any other server at full NIC speed (minimum of sender and receiver's NIC, usually 1 Gbps)
- This is probably the earliest paper proposing the idea of software load balancers on commodity servers in data centers. Advantages: scale-out, flexible load balancing algorithms, cheap. Disadvantages: can't forward at line rate.
- Packets coming from external sources are sent to an Ingress Server which performs packet encapsulation after contacting the directory service. This is a hidden hardware cost we must incur. Interestingly, VL2 does not even mention anything about Load Balancers or communication with external hosts.
- Used MAC-in-MAC tunneling instead of IP-in-IP tunneling used in VL2. The switches know about other switches (not hosts) and directly connected servers only.
- OS is changed but not the applications. The network stack performs encapsulation after examining each packet. Would be slower than usual.
- Hose traffic model: No server is asked to send or receive more than what its network interface allows.
- Monsoon proposes that we run IS-IS with the controller, so that it knows where every host is located in order to run the directory service.
Random Facts
- Networking gear consumes 10-20% of the data center's equipment budget.
- An Ethernet port is 10% to 50% the cost of an equivalent Layer 3 port.
- 16K MAC entries can be stored in commodity switches.
DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers
I started reading this paper a month ago but put it down after 2 pages. I couldn't follow it. I tried reading it again a week ago and put it down again. Why is this paper so hard to read? Why am I so stupid? I found two blog posts where students had reproduced the results in the paper and had found it very clear and easy to read.
Read blog post 1 2
So if anyone out there is reading this post, I would really appreciate it if you can provide links to blog posts, videos or ppts which explain this paper in an easy way. I want to understand this paper in depth.
Sunday, June 24, 2012
Hedera: Dynamic Flow Scheduling for Data Center Networks
To repeat the experiments get the code from:
http://reproducingnetworkresearch.wordpress.com/2012/06/09/hedera-2/
or
http://reproducingnetworkresearch.wordpress.com/2012/06/06/hedera/
The paper is straightforward and fairly easy to read.
Things learnt
- VLB does per-flow instead of per-packet spreading of flows to preserve packet ordering. TCP treats unordered packets as a signal of congestion.
- The whole idea of the paper is that Elephant flows must be scheduled/placed carefully to achieve good aggregate bandwidth.
- Global first fit works quite well and there might be no need to run the complicated simulated annealing. The only disadvantage might be the running time of the algorithm.
- NetFPGA Openflow switches have 2 hardware tables: 32 entry TCAM that does matching on variable length prefixes and 32K entry SRAM that matches flow entries with fully qualified 10 tuples
git checkout new branch
I always forget the command for this so I am writing it here:
To view the branches:
git branch -a
To checkout a branch:
git checkout -b branch_name
then do
git pull origin branch_name
Friday, June 22, 2012
Floodless in SEATTLE: A Scalable Ethernet Architecture for Large Enterprises
This is a SIGCOMM 2008 paper:
http://www.cs.princeton.edu/~chkim/Research/SEATTLE/seattle.pdf
Just for fun, you can see network conference acceptance stats here
I had read the paper before but I felt too lazy to write a post about it. I finally got around to it. This paper can be described in two words: Consistent Hashing. Ok maybe three: Consistent Hashing Ethernet. Definitely four: Consistent Hashing Ethernet Scalable. Read my earlier post on Consistent Hashing for refreshing your concepts.
The basic premise of the paper is very simple, but things get convoluted when you introduce caching, timeouts, mobility and failures. I felt the paper could have addressed these scenarios more lucidly. Also, I felt the paper was badly organized, especially in the middle sections. The good thing is it only requires changes to switch architecture and the control plane, not the end hosts; but if a simple protocol like TRILL took, what, about 7 years to get adopted by Cisco, this will take another decade or so. Here is the rundown.
Why is Ethernet good?
- Plug-and-play
- Host mobility
Why is Ethernet bad?
- Source MAC learning requires flooding. DHCP and ARP packets are flooded, which wastes link bandwidth and processing resources.
- Broadcast storms because there is no TTL in the header.
- Spanning tree computation under-utilizes links. Frames are not always sent via shortest paths. Root bridge must be selected carefully as it carries most of the traffic.
- Lots of security vulnerabilities and privacy concerns: snooping on other people's traffic, ARP spoofing, DoS attacks.
L3/L2 architecture: Pools of Ethernet domains connected by IP routers.
- Subnetting wastes IP addresses and increases configuration overhead both of routers and DHCP servers.
- Host mobility is restricted only within the subnet
Enter VLAN
- No restrictions on host mobility. Though we need to trunk the VLAN at all the bridges.
- Scopes broadcasts to small domains
- Each subnet is treated as a VLAN (http://www.youtube.com/watch?v=PpuWBh3EnVM)
- Need to compute multiple spanning trees now and choose root bridges carefully so as to effectively utilize all links and balance load
- Limited number: 4094
- Switches still need to maintain forwarding table data for all hosts in VLANs provisioned to them
The switches run a link-state routing protocol (OSPF) among themselves. This is scalable, as it is used only to maintain the switch-level topology, not to disseminate end-host information. Still, as the network gets bigger it would hit a limit: forwarding tables grow large and LSA broadcasts cause scalability issues. The paper proposes a remedy for that by dividing the switches into regions (very similar to OSPF areas). Since we are running OSPF between switches, if a switch fails all the other switches will eventually know about it. Also, OSPF helps route packets along the shortest path.
We use consistent hashing to evenly distribute and store the (IP,MAC) and (MAC,host_switch_id) key-value pairs among the existing switches. Each switch chooses the MAC address of one of its interfaces as its unique network-wide switch_id. Since we use consistent hashing, the information is spread almost equally across all the switches. So, instead of every switch storing all the information, each now stores only a subset of it. Switches cache information to avoid doing repeated lookups (more below). Powerful switches can claim more nodes in the key space (virtual switches) to hold more information. We can also hash information for services like PRINTER and DHCP_SERVER and store it on switches, or implement anycasting by storing the same IP address with different MAC addresses, or change the MAC address for fail-over. This is a tangent though.
It uses loop-free multicasting to handle general broadcasts and groups instead of VLANs (broadcasting in a group sends unicasts to the group members). This was a little hand-wavy in the paper though.
H - Hash function used by all switches
H(k) = rk resolver switch for key k. If (k,v) = (MAC,host_switch_id), rk is called location resolver
We have two (k,v) stores (IP,MAC) and (MAC,host_switch_id).
How is the packet sent from src to dst? Does the host_switch see the MAC address and tunnel the packet to the appropriate switch, or does it attach a separate header like TRILL? The paper does not say anything about this.
What information does each switch store?
- (MAC,port) key value pairs of directly attached hosts.
- (IP,MAC) and (MAC,host_switch_id) key value pairs of directly attached hosts. Switch must delete these once it detects the host MAC is unreachable. The location resolver informs the switch if the host has migrated to another switch. In this case the switch updates entry and forwards the packets to the new switch and informs other switches of this change, and possibly deletes the entry after timeout.
- (MAC,host_switch_id) key value pairs for the part of key space it is responsible for.
- (MAC,host_switch_id) key value pairs for the MACs to which hosts directly attached to the switch are sending packets. This is not very big because most hosts communicate with a small number of popular hosts. This is removed after timeout.
- (IP,MAC) or (IP,MAC,host_switch_id) translation for the part of key space it is responsible for. This information is used only for replying to ARP requests.
When a host ha joins switch sa
The switch publishes (IPa,MACa) and (MACa,sa) to the resolvers determined by H. It is now responsible for monitoring the liveness of the resolver (how frequently is this done?) and republishing the host information. What if the publisher is the resolver?
The switches also receive the information through OSPF if another switch dies or a new switch enters the network. They go through the cached (k,v) pairs and remove those entries where v = the dead switch. If a new switch has come up and (k,v) pairs have been re-published to it, the old switch still maintains the (k,v) pairs, to answer queries from switches which don't yet know about the new switch, and removes the entries after a fixed timeout. How is this timeout decided?
ha attached to switch sa wants to communicate to hb attached to switch sb
- ha broadcasts an ARP request. sa intercepts it and forwards it to the resolver. The resolver sends back MACb and, if the optimization is enabled, sb. sa sends the ARP reply to ha.
- sa now tunnels all packets destined to MACb to sb.
Without optimization:
sa calculates H(MACb) and tunnels the packet to the resolver (IP tunneling, I guess). The resolver de-tunnels it and finds that the packet must be sent to MACb, which it knows how to reach. So it tunnels the packet to sb and also tells sa, so that sa can cache this information and use the shortest path from the next time. In case a switch comes alive after a reboot, it uses this process to repopulate its cache.
Host Dynamics
3 types of dynamics:
1) Host changes location
The new host switch updates the (MAC,host_switch_id) entry to reflect the changes.
2) Host changes MAC
Update (IP,MAC), delete the old (MAC,host_switch_id) entry and insert a new one. Other hosts might still have cached the old information. So the new switch maintains a MAC revocation list and sends a gratuitous ARP to hosts sending packets to the old MAC. It also informs the sending host's switch about the entry (NewMAC, switch_id). It can still forward packets sent to the old MAC to the host in the meantime.
3) Host changes IP
Host switch deletes old (IP,MAC) and inserts the new one
If both location and MAC change: when the new switch registers (oldIP, newMAC), the address resolver looks up (oldMAC, oldSwitchId) and tells the old switch about the MAC address change. The old switch now maintains a revocation list, sending gratuitous ARPs to senders along with the new location of the host (newMAC, newSwitchId) to the sending switch.
One drawback of SEATTLE is that it does not support ECMP so it does not utilize links efficiently (even if they are idle).
http://www.cs.princeton.edu/~chkim/Research/SEATTLE/seattle.pdf
Just for fun, you can see network conference acceptance stats here
I had read the paper before but I felt too lazy to write a post about it. I finally got around to it. This paper can be described in two words: Consistent Hashing. Ok maybe three: Consistent Hashing Ethernet. Definitely four: Consistent Hashing Ethernet Scalable. Read my earlier post on Consistent Hashing for refreshing your concepts.
The basic premise of the paper is very simple but things gets convoluted when you introduce caching, timeouts, mobility and failures. I felt the paper could have addressed these scenarios lucidly. Also, I felt the paper was badly organized especially in the middle sections. The good thing is it only requires changes to switch architecture and control plane and not the end hosts but if a simple protocol like TRILL took what about 7 years? to get adopted by Cisco this will take another decade or so. Here is the rundown.
Why is Ethernet good?
- Plug-and-play
- Host mobility
Why is Ethernet bad?
- Source MAC learning requires flooding. DHCP, ARP packets are flooded which waste link bandwidth and processing resources.
- Broadcast storms because there is no TTL in the header.
- Spanning tree computation under-utilizes links. Frames are not always sent via shortest paths. Root bridge must be selected carefully as it carries most of the traffic.
- Lot of security vulnerabilities and privacy concerns. Snooping on other people's traffic. ARP spoofing. DOS attacks.
L3/L2 architecture: Pools of Ethernet domains connected by IP routers.
- Subnetting wastes IP addresses and increases configuration overhead both of routers and DHCP servers.
- Host mobility is restricted only within the subnet
Enter VLAN
- No restrictions on host mobility. Though we need to trunk the VLAN at all the bridges.
- Scopes broadcasts to small domains
- Each subnet is treated as a VLAN (http://www.youtube.com/watch?v=PpuWBh3EnVM)
- Need to compute multiple spanning trees now and choose root bridges carefully so as to effectively utilize all links and balance load
- Limited number: 4094
- Switches still need to maintain forwarding table data for all hosts in VLANs provisioned to them
The switches run link-state routing protocol (OSPF) among themselves. This is scalable as it is used to maintain only the switch level topology and not disseminate end-host information. Although, as the network gets bigger it would hit a limit because of large forwarding tables and LSA broadcasts will cause scalability issues. The paper proposes a remedy for that by dividing the switches into regions (very similar to OSPF areas). Since we are running OSPF between switches, if a switch fails all the other switches will eventually know about it. Also, OSPF helps route packets along the shortest path.
We use consistent hashing to evenly distribute and store (IP,MAC) and (MAC,host_switch_id) key value pairs among the existing switches. Each switch chooses the MAC address of one of its interfaces as its unique network wide switch_id. Since, we use consistent hashing the information is almost equally spread across all the switches. So, instead of all switches storing all the information they now store only a subset of the information. Switches cache information to avoid doing repeated lookups (more below). Powerful switches can have more nodes in the key space (virtual switches) to hold more information. We can also hash information for services like PRINTER and DHCP_SERVER and store it on switches or implement anycasting by storing the same IP address with different MAC addresses or change the MAC address for fail-over. This is a tangent though.
It uses loop-free multicasting to handle general broadcasts and groups instead of VLANs (broadcasting in a group sends unicasts to group members). This was little hand wavy in the paper though.
H - Hash function used by all switches
H(k) = rk resolver switch for key k. If (k,v) = (MAC,host_switch_id), rk is called location resolver
We have two (k,v) stores (IP,MAC) and (MAC,host_switch_id).
How is the packet sent from src to dst. Does the host_switch see the MAC address and tunnel the packet to appropriate switch or do they attach a separate header like TRILL? The paper does not say anything about this.
What information does each switch store?
- (MAC,port) key value pairs of directly attached hosts.
- (IP,MAC) and (MAC,host_switch_id) key-value pairs of directly attached hosts. The switch must delete these once it detects that a host MAC is unreachable. The location resolver informs the switch if a host has migrated to another switch; in that case the switch updates the entry, forwards packets to the new switch, informs other switches of the change, and possibly deletes the entry after a timeout.
- (MAC,host_switch_id) key value pairs for the part of key space it is responsible for.
- (MAC,host_switch_id) key value pairs for the MACs to which hosts directly attached to the switch are sending packets. This is not very big because most hosts communicate with a small number of popular hosts. This is removed after timeout.
- (IP,MAC) or (IP,MAC,host_switch_id) translation for the part of key space it is responsible for. This information is used only for replying to ARP requests.
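The four stores above could be modeled roughly like this. All class and field names are invented for illustration; they are not from the paper:

```python
class SeattleSwitch:
    """Toy model of the per-switch state a SEATTLE switch keeps."""
    def __init__(self, switch_id):
        self.switch_id = switch_id
        self.local_ports = {}      # MAC -> port, for directly attached hosts
        self.local_hosts = {}      # MAC -> IP, (re)published to the resolvers
        self.resolver_store = {}   # MAC -> host_switch_id, keys this switch resolves
        self.location_cache = {}   # MAC -> host_switch_id, cached remote destinations

    def attach_host(self, mac, ip, port):
        self.local_ports[mac] = port
        self.local_hosts[mac] = ip

    def next_hop(self, dst_mac):
        """Local delivery, cached location, or None (tunnel to the resolver)."""
        if dst_mac in self.local_ports:
            return self.switch_id
        return self.location_cache.get(dst_mac)

sw = SeattleSwitch("sw-a")
sw.attach_host("mac-1", "10.0.0.1", port=3)
assert sw.next_hop("mac-1") == "sw-a"   # directly attached
assert sw.next_hop("mac-9") is None     # unknown: first packet goes via resolver
sw.location_cache["mac-9"] = "sw-b"
assert sw.next_hop("mac-9") == "sw-b"   # cached shortest path afterwards
```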
When a host ha joins switch sa
The switch publishes (IPa,MACa) and (MACa,sa) to the resolvers determined by H. It is now responsible for monitoring the liveness of the resolvers (how frequently is this done?) and republishing the host information. What if the publisher is the resolver?
The switches also learn through OSPF when another switch dies or a new switch enters the network. They go through their cached (k,v) pairs and remove entries where v is the dead switch. When a new switch has come up and (k,v) pairs have been re-published to it, the old switch still keeps them, to answer queries from switches that don't yet know about the new switch, and removes them after a fixed timeout. How is this timeout decided?
ha attached to switch sa wants to communicate to hb attached to switch sb
- ha broadcasts an ARP request. sa intercepts it and forwards it to the resolver. The resolver sends back MACb, and if the optimization is enabled, sb as well. sa then sends the ARP reply to ha.
- sa now tunnels all packets destined to MACb to sb.
Without optimization:
sa calculates H(MACb) and tunnels the packet to the resolver (IP tunneling, I guess). The resolver de-tunnels it, finds that the packet must be sent to MACb, which it knows how to reach, and tunnels the packet to sb. It also tells sa so that it can cache this information and use the shortest path from then on. When a switch comes alive after a reboot, it uses this process to repopulate its cache.
Host Dynamics
3 types of dynamics:
1) Host changes location
The new host switch updates the (MAC,host_switch_id) entry to reflect the changes.
2) Host changes MAC
The host switch updates (IP,MAC), deletes the old (MAC,host_switch_id) entry and inserts a new one. Other hosts might still have cached the old information, so the new switch maintains a MAC revocation list and sends a gratuitous ARP to hosts sending packets to the old MAC. It also informs the sending host's switch about the new entry (newMAC, switch_id). In the meantime it can still forward packets sent to the old MAC to the host.
3) Host changes IP
Host switch deletes old (IP,MAC) and inserts the new one
If both location and MAC change: when the new switch registers (old_IP, newMAC), the address resolver computes (oldMAC, old_switch_id) and tells the old switch about the MAC address change. The old switch now maintains a revocation list, sends gratuitous ARPs to senders, and sends the host's new location (newMAC, new_switch_id) to the sending switches.
One drawback of SEATTLE is that it does not support ECMP so it does not utilize links efficiently (even if they are idle).
Thursday, June 21, 2012
Data Center TCP (DCTCP)
SIGCOMM 2010 paper:
http://research.microsoft.com/en-us/um/people/padhye/publications/dctcp-sigcomm2010.pdf
This is certainly one of the best-written papers I have read in a long time. It is very easy to read (except the analysis in section 3.3) and has powerful results. The paper brings into focus the bufferbloat problem, which has lately received attention in home networks, and shows that it exists in data center networks as well.
Preliminaries
TCP Incast
Read blog post by Brad Hedlund
The paper analyzes measurements from a 6000-server data center and studies performance anomalies. The network should provide low latency for short message flows, high burst tolerance for incast (query) flows, and high throughput for long-lived bandwidth-hungry background flows. DCTCP is to be used strictly for intra-DC flows. The primary insight of the paper is that the switch buffer occupancy of the large flows should be kept as low as possible without affecting their throughput. This helps in two ways:
1) For delay sensitive short messages this reduces network latency
2) For query traffic, which can cause incast, it helps ensure that sufficient buffer space is available to absorb the micro-bursts and hence avoid packet loss, which causes latency because we have to wait for the TCP re-transmission timeout to expire. Reducing the RTO can fix the incast problem (see the paper) but does not help with point 1 above.
Most commodity switches are shared-memory switches that have a logically common buffer for all switch ports/interfaces (fate sharing). The MMU manages fairness by setting a maximum amount of memory an interface can take. If more packets arrive beyond this limit they are dropped (incast).
Buffer pressure: Since the packet buffer space is shared among all interfaces, long flows on other interfaces can reduce the space available for bursts of short flows on free interfaces leading to loss and timeouts. This does not require the incoming flows to be synchronized.
DCTCP uses the ECN bit, which switches set when their queue length exceeds a threshold K; senders reduce the congestion window in proportion to the extent of congestion (i.e., the fraction of marked packets). The paper gives concrete guidelines for selecting the values of g and K.
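The sender-side control law can be sketched in a few lines. This is a minimal sketch following the paper's parameter names (g is the EWMA gain, alpha the running congestion estimate); g = 1/16 is just a typical value, not a recommendation:

```python
def update_alpha(alpha, marked, total, g=1/16):
    """Once per window: fold the fraction F of ECN-marked packets into alpha."""
    F = marked / total
    return (1 - g) * alpha + g * F

def on_congestion(cwnd, alpha):
    """DCTCP cuts the window in proportion to alpha, not by half like TCP."""
    return cwnd * (1 - alpha / 2)

alpha = 0.0
for _ in range(50):                 # persistent mild congestion: 10% marks
    alpha = update_alpha(alpha, marked=10, total=100)
cwnd = on_congestion(100.0, alpha)
assert 90 < cwnd < 100              # gentle cut, unlike TCP's cwnd/2
```

Note that when every packet is marked (alpha -> 1), the cut degenerates to TCP's halving, so DCTCP is no more aggressive than TCP under heavy congestion.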
Thorough evaluations with key takeaways after each experiment.
Tell me why?
1) Why did the authors not compare the performance of their algorithm to TCP Vegas? It would have been interesting to see how DCTCP fares against it, though I agree with the authors' claim that measuring slight increases in RTT in a data center environment can be difficult.
2) The authors show that jittering reduces the response time at higher percentiles at the cost of increasing median response time. Why is this bad if all the responses meet the application deadline?
3) Why do the authors compare against the instantaneous queue length at threshold K instead of the average queue length? Don't they want to avoid the effects of instantaneous packet bursts?
4) The authors say that using Delayed ACKs is important, however, another paper identified them as a serious problem in DCs.
Data center numbers for people who don't have access to data centers
1) The latency targets for data center applications like web search, retail, advertising and recommendation systems are ~10ms to ~100ms. Applications use the Partition/Aggregate workflow pattern. Tasks not completed before the deadline are cancelled, affecting the quality of the final result.
2) 99.91% of traffic is TCP traffic.
3) In absence of queuing, intra-rack RTT = 100μs and inter-rack RTT = 250μs.
4) Each rack in a cluster holds ~40 servers. The data center in this paper has 44 servers per rack. There are typically 20 to 80 VMs per hypervisor.
5) A ToR switch with 48 1 Gbps ports and 2 10 Gbps ports has 4 MB of shared buffer.
6) Median number of concurrent flows active in a 50ms window on each worker node is 36.
7) To avoid TCP incast, developers limit the size of a response to 2 KB and introduce application-level jittering to desynchronize the responses from worker nodes.
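A toy illustration of the jittering idea in point 7: each worker delays its response by a small random offset so the aggregator's switch port is not hit by a synchronized micro-burst. The numbers here are made up; real systems schedule the actual sends:

```python
import random

def jittered_send_times(n_workers, max_jitter_ms=10.0, seed=0):
    """Return the (sorted) send offsets, in ms, for n_workers responses."""
    rng = random.Random(seed)
    return sorted(rng.uniform(0, max_jitter_ms) for _ in range(n_workers))

times = jittered_send_times(40)
# Responses are spread over the jitter window instead of all arriving at t=0,
# at the cost of pushing up the median response time.
assert times[-1] - times[0] > 0
```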
Notes to future me
Read TCP Vegas paper
How Facebook handles scale
Remember TCP Large Send Offload
Random quote:
The biggest challenge is to find a novel solution to a novel problem which people in academia deem novel
Monday, June 18, 2012
Bisection bandwidth of a data center
Read:
http://etherealmind.com/bisectional-bandwidth-l2mp-trill-bridges-design-value/
Assuming all the ports are 1Gbps, from the Fat-tree paper:
So, the architecture provides full bisection bandwidth and there is no oversubscription at the aggregation and edge switches.
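As a back-of-the-envelope check, a sketch assuming 1 Gbps links and the k-port fat-tree from the paper (function names are mine):

```python
def fat_tree_hosts(k):
    """A k-port fat-tree supports k^3/4 hosts."""
    return k**3 // 4

def bisection_bandwidth_gbps(k, link_gbps=1):
    # Full bisection: every host in one half can talk to the other half at
    # line rate, so the bisection carries hosts/2 links' worth of bandwidth.
    return fat_tree_hosts(k) // 2 * link_gbps

assert fat_tree_hosts(48) == 27648
assert bisection_bandwidth_gbps(48) == 13824   # Gbps, with 48-port switches
```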
Sunday, June 17, 2012
Applying NOX to the Datacenter
Get the paper here
The paper demonstrates that NOX can be used to implement various data center architectures (Portland and VL2) and addressing schemes; along with the fact that it scales well, can provide fine-grained QoS, access control and middlebox traversal. The paper has many good insights.
1) Data center networks are a special-case networking paradigm which allows new architectures, specialized addressing schemes and easy modification of the hypervisor and network stack. This paves the way for easier innovation, as shown by recent research work, some of which I have described in this blog. Also, since the data center is a single administrative domain, using a centralized controller is better than using distributed dynamic protocols.
2) The paper describes three types of data centers: Bare-metal (w/o any hypervisor), virtualized (~20VMs per physical server) and virtualized-multitenant (like Amazon EC2)
3) Networking requirements of a data center:
i) Scaling: allocate more servers/VMs effortlessly as demand increases. The data center's architecture and addressing schemes should scale to at least a million VMs or hundreds of thousands of servers. At bigger scale the forwarding tables grow, the switches/routers require more ports (and hence become costlier), and broadcasts become expensive.
ii) VMs should be able to migrate without any service interruption
iii) Fault tolerance to avoid service interruption
4) Next the paper describes VL2 and Portland. Here are the cons:
In Portland every core switch is connected to every pod with a single link: this will not scale well, as it requires too many ports on core switches. VL2 runs OSPF over all its switches to build forwarding tables; this does not scale well because LSAs, which are broadcast, are expensive. VM migration in VL2 is not easy and requires the controller to push the new state to all the switches sending traffic to the VM. In both schemes, ARPs and new flows prompt sending packets to the controller, which is not scalable in my opinion. The controller should be used only for bootstrapping and on link failures; the middleboxes can take care of QoS and load-balancing. Doing fine-grained control is not scalable.
5) NOX can operate in both proactive and reactive mode. In proactive mode the controller computes all the forwarding data beforehand and pushes it onto the switches. This incurs no latency unlike the reactive approach where the first packet of each new flow is sent to the controller for processing.
6) An interesting point in this section: for larger networks, multiple NOX controllers can act in parallel, with each new-flow packet sent to its designated controller in reactive mode. The controllers work independently but must maintain a globally consistent view of the network topology and host address mappings. Achieving consistency will add latency and complexity, although how much depends on the controller application under analysis.
The paper says an individual controller can handle 30K new flows per second with ~10ms latency.
A low cost 1Gbps switch can support 250K source/destination pair flow entries.
How to calculate over-subscription ratio: http://www.cisco.com/en/US/docs/solutions/Enterprise/Data_Center/DC_Infra2_5/DCInfra_3.html#wp1088848
7) In case of a link/switch failure the controller has to install new flow entries at all the switches. The complexity and response time will increase with multiple failures. Also, installing low-priority backup flow entries beforehand will not work in case of multiple failures.
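The over-subscription calculation from the Cisco guide linked above boils down to simple arithmetic: downlink capacity divided by uplink capacity at a switch tier. The port counts below are illustrative:

```python
def oversubscription(downlinks, down_gbps, uplinks, up_gbps):
    """Ratio of offered downlink bandwidth to available uplink bandwidth."""
    return (downlinks * down_gbps) / (uplinks * up_gbps)

# A ToR with 48 x 1 Gbps server ports and 2 x 10 Gbps uplinks:
assert oversubscription(48, 1, 2, 10) == 2.4   # i.e., 2.4:1
```

A ratio of 1.0 means no over-subscription, which is what the fat-tree design aims for at every tier.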
Saturday, June 16, 2012
Simple Expander graph constructions
An expander graph is a sparse graph with strong connectivity properties. I will show code for simple constructions of some families of expander graphs using the Python networkx library:
1) LPS graph
From Michael Nielsen's blog
In this example the family of graphs is indexed by a prime number p. The set of vertices for the graph is just the set of points in Z_p, the field of integers modulo p. We construct a 3-regular graph by connecting each vertex x to x-1, x+1 and its multiplicative inverse x^(-1) mod p. The vertex 0 is connected to p-1 and 1. According to the lecture notes by Linial and Wigderson, this was proved to be a family of expanders by Lubotzky, Phillips and Sarnak in 1988.
Python code
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys
def miller_rabin_pass(a, s, d, n):
a_to_power = pow(a, d, n)
if a_to_power == 1:
return True
for i in xrange(s-1):
if a_to_power == n - 1:
return True
a_to_power = (a_to_power * a_to_power) % n
return a_to_power == n - 1
def miller_rabin(n):
d = n - 1
s = 0
while d % 2 == 0:
d >>= 1
s += 1
for repeat in xrange(20):
a = 0
while a == 0:
a = random.randrange(n)
if not miller_rabin_pass(a, s, d, n):
return False
return True
def ramanujan_graph(p):
if not miller_rabin(p):
print "p is not prime"
return None
g = nx.Graph()
for i in range(p):
g.add_node(i)
for i in range(p):
if i == 0:
g.add_edge(0,p-1)
g.add_edge(0,1)
else:
g.add_edge(i,i-1)
g.add_edge(i,(i+1)%p)
g.add_edge(i,(i**(p-2))%p) # i^(p-2) = i^(-1) mod p by Fermat's little theorem
g.remove_edges_from(g.selfloop_edges()) #remove self loops
return g
g = ramanujan_graph(41)
PL.figure()
nx.draw_circular(g)
plt.show()
2) Paley graph
The Paley graph of order q (where q is a prime or an integer power of a prime) has q nodes, each with degree (q-1)/2. The Paley graph is undirected when q % 4 = 1. The set of vertices for the graph is just the set of points in Z_q. Two vertices a, b are connected if a-b = x^2 % q for some x in [1,q). See also: http://mathworld.wolfram.com/PaleyGraph.html
Python code
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys
def paley_graph(p):
if not p%4 == 1:
return None
square_list = []
for i in range((p-1)/2):
square_list.append((i**2)%p)
g = nx.Graph()
for i in range(p):
g.add_node(i)
for i in range(p):
for j in square_list:
g.add_edge(i,(i+j)%p)
g=nx.Graph(g) #remove parallel edges
g.remove_edges_from(g.selfloop_edges()) #remove self loops
return g
g = paley_graph(17)
print nx.diameter(g)
PL.figure()
nx.draw_circular(g)
plt.show()
3) Margulis construction
import networkx as nx
import matplotlib
import matplotlib.pyplot as plt
import pylab as PL
import random, sys
def margulis_graph(n):
#Graph has n^2 vertices
g = nx.Graph()
for i in range(n):
for j in range(n):
g.add_node((i,j))
for (i,j) in g.nodes():
g.add_edge((i,j),((i+2*j)%n,j))
g.add_edge((i,j),(i,(2*i+j)%n))
g.add_edge((i,j),(i,(2*i+j+1)%n))
g.add_edge((i,j),((i+2*j+1)%n,j))
g=nx.Graph(g) #remove parallel edges
g.remove_edges_from(g.selfloop_edges()) #remove self loops
return g
g = margulis_graph(5)
print nx.diameter(g)
PL.figure()
nx.draw_circular(g)
plt.show()
4) Hypercube graph
See Wikipedia. Use nx.hypercube_graph(n).
5) Superconcentrators
Smaller Explicit Superconcentrators
Acknowledgments
I would like to thank Abhishek Bhowmick and Ankit Garg for helpful discussions.
Saturday, June 9, 2012
Installing ns2 on Ubuntu 11.10
Prerequisites
Install development libraries for X11:
sudo apt-get install libx11-dev libxt-dev g++
Download ns-allinone package from here:
http://www.isi.edu/nsnam/ns/ns-build.html#allinone
Untar the package
tar -xvf ns-allinone-2.35.tar.gz
Install the package:
sudo ./install
Input the required path information in your .bashrc file like this:
export PATH=$PATH:<Place your paths here>
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH: <place the LD_LIBRARY_PATHS>
export TCL_LIBRARY=$TCL_LIBRARY:<place the TCL_LIBRARY>
After these steps, you can now run the ns validation suite with
cd ns-2.35; ./validate
Wednesday, June 6, 2012
RBridges and TRILL
http://www.ieee-infocom.org/2004/Papers/26_1.PDF
http://www.cisco.com/web/about/ac123/ac147/archived_issues/ipj_14-3/143_trill.html
Ivan Pepelnjak on TRILL:
"bridging forwarding paradigms are used to forward host-to-host data traffic, including building MAC address table by listening to data traffic and flooding of packets sent to unknown MAC destination. TRILL therefore retains most of the non-scalable properties of transparent bridging"
Brad Hedlund on TRILL in data center:
http://bradhedlund.com/2010/05/07/setting-the-stage-for-trill/
Cisco FabricPath:
http://www.youtube.com/watch?v=M85WYV8i3jg
Saturday, June 2, 2012
A Scalable, Commodity Data Center Network Architecture
First read this nice article on data center architectures by Brad Hedlund:
http://bradhedlund.com/2011/04/21/data-center-scale-openflow-sdn/
The objective of the paper is the following: achieve full bisection bandwidth (if the network is segmented into two equal parts, this is the bandwidth between the two parts) between all the hosts in a network using commodity Ethernet (incrementally scalable) switches in the face of over-subscription.
Here is a traditional L3/L2 design:
There are typically 20-40 servers per rack. Each server is connected to a ToR switch via a 1 Gbps link. ToR switches have 10 Gbps uplinks to ARs. ToRs have 48 ports, while ARs and CRs have 128 ports; ARs and CRs have only 10 Gbps ports. So the target is a DC with 25,000+ hosts and a bisection bandwidth of 1 Gbps per host.
The paper proposes a Clos Network/Fat-tree network as a new network architecture for achieving this goal. They propose a L3/L2 architecture:
k - Number of ports of the switches used in the fat-tree
There are k pods; each pod has 2 layers of k/2 switches. Each edge switch is connected to k/2 hosts. There are (k/2)^2 CRs. The topology can support (k/2)*(k/2)*k = k^3/4 hosts. Between any 2 hosts there are (k/2)^2 paths. We must spread the traffic over these paths somehow to achieve the bisection bandwidth.
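The counts above can be sanity-checked with a few lines (a sketch; the helper name and the returned dictionary are my own, not from the paper):

```python
def fat_tree(k):
    """Component counts for a fat-tree built from k-port switches (k even)."""
    assert k % 2 == 0
    return {
        "pods": k,
        "edge_switches": k * (k // 2),       # k/2 per pod
        "agg_switches": k * (k // 2),        # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": (k // 2) ** 2 * k,          # k^3 / 4
        "paths_between_hosts": (k // 2) ** 2,
    }

t = fat_tree(4)
assert t["hosts"] == 16 and t["core_switches"] == 4 and t["paths_between_hosts"] == 4
```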
Hosts connected to the same edge switch form a /24 subnet. VMs can migrate only within their own subnet. No talk of VLANs or migration between subnets in the paper.
Addressing (For a 10/8 network)
CR - 10.k.j.i - i and j lie in the range [1,k/2]
Pod switch - 10.pod.switch.1 (switch number lies in [0,k-1])
Host - 10.pod.switch.ID (ID lies in [2,k/2+1]) - This is very wasteful
Routing
Here is what the routing table looks like for a pod switch:
- There are k/2 prefixes for the subnets in the same pod and k/2 suffixes for k/2 hosts. Prefixes are preferred to suffixes.
- For intra-pod traffic the lower level pod switches send the packet to the upper level (evenly split between upper layer switches) which sends it back to the other edge switch. For inter-pod traffic the ARs spread the traffic among the CRs based on the last byte of the IP address. TCP ordering is not broken as packets always follow the same path to CR.
- The CRs aggregate the network prefixes of all the pods (i.e., k /16 networks), which should not take too much space in the forwarding tables. The traffic spreading takes place only on the route from the edge switches up to the CRs.
- The paper advocates the use of a central controller (OpenFlow, anyone?) for generating the routing tables and statically storing them on the switches. Since the routing tables will not change, this seems alright.
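A toy model of the two-level lookup described above, using the 10.pod.switch.ID addressing scheme. All function names and table encodings here are mine, a sketch of the mechanism rather than the paper's implementation:

```python
def make_tables(k, pod):
    """Two-level table for an upper-layer switch in the given pod."""
    # Primary: /24 prefixes for the k/2 subnets inside this pod.
    prefixes = {(pod, s): ("down", s) for s in range(k // 2)}
    # Secondary: host-ID suffixes spread inter-pod traffic over k/2 core links.
    suffixes = {hid: ("up", hid % (k // 2)) for hid in range(2, k // 2 + 2)}
    return prefixes, suffixes

def route(prefixes, suffixes, dst):          # dst = (10, pod, switch, host_id)
    _, pod, switch, hid = dst
    if (pod, switch) in prefixes:            # prefix match is preferred
        return prefixes[(pod, switch)]
    return suffixes[hid]                     # else terminating suffix match

p, s = make_tables(4, pod=2)
assert route(p, s, (10, 2, 1, 3)) == ("down", 1)   # intra-pod: prefix match
assert route(p, s, (10, 0, 1, 3)) == ("up", 1)     # inter-pod: suffix spreads
```

Because the suffix depends only on the destination host ID, all packets of a flow take the same path to the core, preserving TCP ordering.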
Friday, June 1, 2012
VL2: A Scalable and Flexible Data Center Network
The paper also tries to propose an architecture which allows the data center to become a single large Layer-2 domain. I will not go into the advantages of doing so - low cost, zero-configuration (apart from DHCP), make services scalable, easy VM migration etc.
This paper raises some important research questions in the field of data center networking. It combines a measurement study and a new architecture.
Q1. How to provide high capacity throughput to intra-dc flows in face of over-subscription?
Q2. How to isolate the traffic of one service from another? What prevents one service from flooding the network with its own traffic?
The paper does not really solve this problem
Q3. How to make Ethernet broadcasts scalable? VLANs don't scale well.
Only ARP and DHCP are made scalable. No talk about multicast.
System design:
1) Clos topology (links between AR and CR form a complete bipartite graph)
2) Virtual Layer-2 (hence the name VL2)
3) VLB and ECMP
4) Central directory service
Here is the topology:
The CR, AR and ToR use OSPF to learn about the switch-level topology and shortest path routing.
The measurement study done for the data center they were analyzing showed two properties:
1) Traffic pattern inside a DC is very unpredictable and changes rapidly
2) Ratio of traffic volume between servers to traffic volume leaving/entering the DC is 4:1
VL2 routing
- VLB is used to spread traffic across multiple CRs. Take a random path from a source ToR to a CR and from CR to destination ToR switch.
- Use anycast address at the CR and ToR to enable ECMP forwarding.
- Two address spaces - application-specific addresses (AAs) and location-specific addresses (LAs). The directory system maintains a mapping from AA to LA. Every server in the DC has a VL2 agent running in the network stack. The agent traps the ARP request and instead sends a unicast to the directory service and caches its reply. It is also responsible for tunneling (IP in IP) the packet to the ToR switch of the destination host. The agent also rate-limits broadcast traffic to prevent storms.
- DHCP is implemented using relay agents in each ToR and a central DHCP server. The directory service is notified of the LA-AA mapping when a new server joins the network and also after a server migrates to a new rack (by looking for gratuitous ARP).
- The directory service has two components:
1) Read optimized directory servers (store soft state which may be stale)
2) Write optimized strongly consistent RSM servers. They run Paxos for reliability and fault tolerance.
Here is how routing is done -
VL2 utilizes both VLB and ECMP, each complementing the other's weaknesses. If you use only VLB:
Each agent is provided a random intermediate router address. We don't need anycast addresses; each router interface can have a different IP, and OSPF takes care of reachability. We stack up the IP headers at the source and off we go. Now, what happens if the intermediate router fails? We would have to update a huge number of VL2 agents to use a new random router.
If we use ECMP alone: Configure each router's interface to have the same unicast address (10.0.0.1). ECMP makes sure that any one of these can be taken as the next hop. In this case we don't need to do multiple encapsulations. A single encapsulation with H(ft) | dest_ToR_IP is enough. But commodity switches support only 16-way ECMP, so we can't spread traffic across all the intermediate routers.
The solution is to combine VLB and ECMP: we define several anycast addresses, each associated with only as many intermediate switches as ECMP can accommodate. The agent is given one of these anycast addresses at random and performs multiple encapsulations. OSPF and ECMP react to link/switch failures, eliminating the need to notify the agents.
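A sketch of the source-side encapsulation in the combined scheme (the function name and the 16-bit hash width are my assumptions, not the paper's):

```python
import hashlib

def encapsulate(flow_5tuple, intermediate_anycast, dst_tor_la, dst_aa, payload):
    """Source-side VLB + ECMP encapsulation sketch.

    The agent stacks three destinations: the (randomly assigned) anycast
    address of an intermediate switch group, the destination ToR's LA, and
    the destination's AA. H(ft), a hash of the flow's 5-tuple, lets
    ECMP-capable switches pick one concrete path per flow.
    """
    h = int.from_bytes(hashlib.sha1(repr(flow_5tuple).encode()).digest()[:2], "big")
    return {
        "flow_hash": h,  # assumption: a 16-bit hash field
        "headers": [intermediate_anycast, dst_tor_la, dst_aa],
        "payload": payload,
    }
```

Because the hash depends only on the 5-tuple, all packets of one flow follow the same path, which avoids reordering.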
Drawbacks
1) The complete bipartite graph will limit the scalability of this approach.
2) Does not talk about how to handle multiple tenants, each with the same private IP address space.
3) Configuring routers for OSPF is not scalable and is an error-prone process. I am not sure OSPF itself will scale (each router stores all the LSAs and broadcasts HELLO packets), and areas would need to be configured, which creates further configuration overhead.
4) Every outgoing packet needs to be encapsulated (and the agent computes a hash and puts it in each packet). This increases latency and reduces useful bandwidth. If there were multiple layers in the data center hierarchy, I don't think this approach would scale.
5) Hosts which need to communicate with the outside world are given both an LA and an AA. This essentially fixes their position in the topology and further increases the burden on OSPF (these IPs are not aggregated).
6) VL2 has eventual consistency when it comes to AA-LA mappings. This implies that VM migration will have more downtime and packet loss.
Random
- A commodity switch stores only 16K entries in its routing table
- The ToR informs the directory server when it receives a packet for which it has no mapping, which triggers the directory server to update the VL2 agent caches. This could enable DoS attacks on the directory servers.
- The servers had delayed ACKs turned on which should be turned off.
- What is a hose model? It is not really clear from the paper; the cited reference discusses it in terms of VPNs.
Thursday, May 31, 2012
Portland: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
This is an influential paper in the Data Center arena. You can also see a video presentation of the paper here.
The basic idea of the paper is very simple: make Ethernet scalable by making it hierarchical like IP. That's it. I personally think SEATTLE is a more elegant solution to the problem.
The first two sections give the usual rhetoric about how L2 is simple but not scalable and how L3 is scalable but hinders mobility and is prone to configuration errors. The Related Work section is thorough.
Design:
1) Fat-tree topology
2) Fabric manager (centralized directory service)
3) Distributed location discovery protocol (basically send packets like LSPs in OSPF for checking if a switch/host is alive and also so that switches learn their pod and position numbers)
Paper defines a Pseudo MAC address (48 bits) - PMAC - which encodes the location of a host in the topology. It is of type pod.position.port.vmid (hierarchical like IP address)
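The encoding is easy to sketch. The 16/8/8/16-bit field split below follows the paper's layout, but the function names are mine:

```python
def encode_pmac(pod, position, port, vmid):
    """Pack pod.position.port.vmid into a 48-bit PMAC (16/8/8/16-bit fields)."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Recover (pod, position, port, vmid) from a 48-bit PMAC."""
    return pmac >> 32, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF
```

Because the high-order bits are the pod, a core switch can forward on a pod prefix alone, exactly as IP routers forward on network prefixes.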
Let's see how a packet travels from host 1 (10.0.1.2) in Pod 0 to host 2 (10.1.1.2) in Pod 1.
- Host 1 sends an ARP: Who has 10.1.1.2?
- The edge switch intercepts the ARP. If it sees a source MAC it hasn't seen before, it generates a PMAC for it (the edge switch assigns a monotonically increasing vmid to the MACs it sees on a particular port) and creates an entry in its local PMAC table mapping the host's AMAC (actual MAC) and IP to the PMAC. It then communicates the IP-to-PMAC mapping to the fabric manager, which uses it to answer ARP requests from other hosts; the local table is used to translate the PMAC back to the AMAC on packets destined to the host. The edge switch also asks the fabric manager if it knows the PMAC for 10.1.1.2.
If it does:
The edge switch constructs an ARP reply with the PMAC from the fabric manager and sends it to the host. The host learns the PMAC instead of the AMAC, which is why the PMAC is 48 bits long: hosts must be able to treat it as an ordinary MAC address.
If it does not:
The ARP is broadcast as in standard Ethernet. The target host replies with its AMAC, and the target's edge switch rewrites the AMAC to the PMAC before forwarding the reply to the querying host and the fabric manager (this is how the fabric manager gets repopulated after a failure).
- 10.0.1.2 sends the packet to the PMAC it has learnt. The last hop edge switch translates the PMAC to AMAC before sending it to the target host. End hosts don't see their own PMACs but cache PMACs of other hosts.
Forwarding is pretty simple. If the PMAC is in the same pod but a different position, forward it to your aggregation switch; if it is in a different pod, the aggregation switch forwards it to a core switch. Through LDP, each switch learns which switch is connected to it on which port. The forwarding state on a switch is aggregatable like IP, so it does not take up much space.
The PMACs can support expansion of the DC (adding another Pod). The multi-rooted tree topology however should remain fixed which is a fair assumption.
A good note:
When considering a distributed versus centralized protocol there is an inherent trade off between protocol simplicity and system robustness.
The fabric manager contains soft state data of the following type:
Now the fabric manager will contain 1 million entries (corresponding to all the hosts in the DC). Is this scalable? DHT anyone?
The switches send LDMs to each other and the fabric manager also monitors connectivity to each switch module. Is all this broadcast traffic scalable?
Drawbacks:
1) What about traffic coming in from the Internet? How do you scalably handle all that traffic?
2) The paper says that their approach works for multi-rooted trees and they take fat-tree topology as an example but that is easier said than done. If there is one more layer between the aggregation and core layer their LDP protocol cannot work without the help from fabric manager.
3) VL2 runs OSPF and PortLand runs LDP, both of which require some configuration and broadcasts. Also, both schemes use indirection, which involves caching information for speedup. So, when a cache is invalid, the central controller/fabric manager has to take care of things, and this is tedious.
4) How does it account for middleboxes in a data center? What if a middlebox is filtering on destination MAC or something similar?
5) Cannot be implemented alongside existing data center topologies because legacy switches don't understand MAC prefixes or LDP. Even existing OpenFlow switches don't understand LDP, so these packets would have to be processed by the controller. Not backwards compatible.
To Read
http://www.cs.princeton.edu/courses/archive/fall10/cos561/papers/Seattle08.pdf
http://conferences.sigcomm.org/hotnets/2009/papers/hotnets2009-final103.pdf
http://www.cs.uccs.edu/~zbo/teaching/CS522/Projects/SIGCOMM10-DataCenterTCP.pdf
http://bradhedlund.com/2011/04/21/data-center-scale-openflow-sdn/
http://bradhedlund.com/2011/05/01/tcp-incast-and-cloud-application-performance/
http://en.wikipedia.org/wiki/TRILL_(computing)
http://cloudscaling.com/pdf/IaaS_Building_Guide_v1.pdf
Tuesday, May 22, 2012
Thursday, May 17, 2012
Force ssh to not use public key
Force ssh to not authenticate using a public key:
ssh -o "PubkeyAuthentication no" user@server
Wednesday, May 16, 2012
Why does Ethernet have a minimum frame size of 64 bytes?
The IEEE 802.3 standard states that there can be a maximum of five physical segments between any two nodes, with four repeaters between them. When using 10Base5 cabling, each physical segment can be up to 500 meters long, so an Ethernet network's maximum linear length is 2500 meters.
Speed of light in copper is 2/3 of the speed of light in vacuum.
In order to detect a collision, an adapter must keep transmitting for at least the worst-case round-trip propagation time, about 25 microseconds over 2500 meters. At 10 Mbps that is 250 bits; with margin for repeater delays, the standard uses a slot time of 512 bit times (51.2 microseconds, i.e. 64 bytes).
After the mth collision, choose K randomly from {0, …, 2^m − 1} and wait K × 512 bit times before trying again. At 10 Mbps this is K × 51.2 microseconds.
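A sketch of that backoff rule (the cap at exponent 10 and the 16-attempt limit come from the Ethernet standard, not from the text above):

```python
import random

def backoff_bit_times(m):
    """After the m-th collision, wait K slot times (512 bit times each),
    with K drawn uniformly from {0, ..., 2^m - 1}. The standard caps the
    exponent at 10 and gives up entirely after 16 attempts."""
    k = random.randrange(2 ** min(m, 10))
    return k * 512   # at 10 Mbps, 0.1 us per bit, so this is K * 51.2 us
```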
So
12 bytes - Src+Dst MAC address
2 bytes - Ethernet type
4 bytes - CRC
----------
18 bytes
----------
Preamble = 8 bytes, 7 bytes of 10101010 + 1 byte of Start of frame delimiter 10101011
IFG - 12 bytes
Jamming signal - 4 bytes
This leaves 64 − 18 = 46 bytes as the minimum payload. An IP + TCP header alone is 40 bytes, so the OS must pad small segments before sending them to the adapter.
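The arithmetic can be checked directly (a worked version of the numbers above):

```python
# Slot time: 512 bit times at 10 Mbps = 51.2 us, i.e. a 64-byte minimum frame.
min_frame = 512 // 8                  # 64 bytes
overhead = 12 + 2 + 4                 # MACs + EtherType + CRC = 18 bytes
min_payload = min_frame - overhead    # 46 bytes minimum payload
# A bare IP + TCP header is 40 bytes, so the OS pads at least 6 bytes.
padding = min_payload - (20 + 20)     # 6 bytes
```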
Tuesday, May 15, 2012
Chord: A scalable peer-to-peer lookup service for Internet applications
I have decided to write down reviews of all the papers I read from now on. This is for the sole purpose of improving my understanding of the papers, helping me put my thoughts down coherently, and improving my writing skills.
This paper was published in SIGCOMM 2001 and has been cited more than 9000 times. I will go over the basic algorithm and system model first and finally note down my evaluation.
Let's get some definitions down first:
successor(k) - the successor node of key k is the node whose identifier is equal to or follows k in identifier space.
n.successor - the successor of a node n is the next node in the identifier space. In the figure, successor(0) = 1, successor(1) = 3, successor(3) = 0. Similarly we have n.predecessor.
In the figure the green dots represent nodes.
m - number of bits in the key/node identifier. Both nodes and keys are mapped to the same identifier space.
Each node maintains m entries in a routing table called finger table.
The ith finger of node n or the ith entry in the finger table of node n is denoted by n.finger[i].node
The ith entry in the finger table of node n is s = successor((n + 2^(i-1)) mod 2^m), where 1 <= i <= m.
n.successor = n.finger[1].node
n.finger[i].start = (n + 2^(i-1)) mod 2^m
n.finger[i].interval = [n.finger[i].start, n.finger[i+1].start)
In the right figure, m = 3. The finger table of node 0 will have:
1st entry: 0.finger[1].start = (0 + 2^0) mod 8 = 1 => successor(1) = 1
2nd entry: 0.finger[2].start = (0 + 2^1) mod 8 = 2 => successor(2) = 3
3rd entry: 0.finger[3].start = (0 + 2^2) mod 8 = 4 => successor(4) = 0
(A 4th entry would wrap around: (0 + 2^3) mod 8 = 0.)
1st interval: [1, 2)
2nd interval: [2, 4)
3rd interval: [4, 0)
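The table above can be reproduced with a small sketch (the function names are mine; `nodes` is the set of live node identifiers):

```python
def successor(key, nodes, m):
    """First live node clockwise from key on the 2^m identifier circle."""
    space = 2 ** m
    for step in range(space):
        candidate = (key + step) % space
        if candidate in nodes:
            return candidate

def finger_table(n, nodes, m):
    """ith finger of n = successor((n + 2^(i-1)) mod 2^m), for i = 1..m."""
    return [successor((n + 2 ** (i - 1)) % 2 ** m, nodes, m)
            for i in range(1, m + 1)]

print(finger_table(0, {0, 1, 3}, 3))   # -> [1, 3, 0], matching the entries above
```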
See the paper for pseudocode for routing and joining.
In addition to its fingers, each node stores the keys it is responsible for and pointers to its successor and predecessor. Each finger table entry includes both the Chord identifier and the IP address and port number of the corresponding node.
Here is how the finger table is updated:
Node 6 joins -
Node 1 leaves -
See the stabilization algorithm in this ppt
Here is how I envision Chord will work in a P2P system:
When a new node joins the system, it contacts a well-known server to help it find its successor. Once the node has been integrated into the system, it must allocate some space for storing its part of the key-value store. If it wants to share something, the cryptographic hash (e.g., SHA-1) of the file name it wants to share is stored in the system as a key. The value corresponding to this key is its IP address and port number. The system should store multiple copies of this (key, value) pair for high availability. This might be done using a different hash function; you can have multiple Chord networks, each using a different hash function, or simply store each (key, value) pair on the designated node and its k successors.
If someone wishes to download this file, they calculate the hash of their search query to get the key and check whether it exists in the system. If it does, the value tells them the IP address of a node which has the data. When they download the file, their own IP address should be added to all copies of the (key, value) pair.
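The key derivation step both sides perform might look like this (a sketch; the choice of SHA-1 follows the discussion above, the function name is mine):

```python
import hashlib

def chord_key(name, m):
    """Map a file name into the m-bit Chord identifier space via SHA-1."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** m)
```

Because the publisher and the searcher hash the same name, they agree on which node stores the (key, value) pair without any coordination.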
I suppose this is how magnet links work too, though BitTorrent is based on a different DHT design called Kademlia, which uses a different routing algorithm but is also based on consistent hashing like Chord. I will be posting about it soon.
An obvious disadvantage of this way of storing data is that we cannot perform wildcard searches like *.mp3. I don't know how that might be done.
For a system with N nodes Chord proves that it can resolve all lookups via O(log N) messages by maintaining information about O(log N) other nodes (in a finger table). It also requires O(log^2 N) messages to re-establish the routing invariants and finger tables in case any node joins or leaves the system.
P2P systems have very high churn rates. The large number of messages that need to be exchanged and the finger table changes required every time a node joins or leaves make Chord, in my opinion, infeasible in practice. It would be better to use Chord for the Ultrapeer network only, since Ultrapeers have a low churn rate.
Monday, May 14, 2012
Bloom Filters
Concepts:
http://finger-tree.blogspot.com/
Bloom Filter calculator:
http://hur.st/bloomfilter
Python library:
https://github.com/jaybaird/python-bloomfilter/
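For reference, a minimal Bloom filter sketch in Python (parameters m and k as in the calculator linked above; this is my own toy implementation, not taken from those libraries):

```python
import hashlib

class BloomFilter:
    """m-bit Bloom filter with k hash positions per item.
    Membership tests may return false positives but never false negatives;
    deletions are not supported."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = True

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))
```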