
Thursday, May 31, 2012

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

This is an influential paper in the Data Center arena. You can also see a video presentation of the paper here.

The basic idea of the paper is very simple: make Ethernet scalable by making its addressing hierarchical, like IP. That's it. I personally think SEATTLE is a more elegant solution to the problem.

The first two sections give the usual rhetoric about how L2 is simple but not scalable, and how L3 is scalable but hinders mobility and is prone to configuration errors. The Related Work section is thorough.

Design:
1) Fat-tree topology


2) Fabric manager (centralized directory service)

3) Distributed location discovery protocol (switches periodically exchange packets, much like LSPs in OSPF, both to check whether a switch/host is alive and to let switches learn their pod and position numbers)

The paper defines a 48-bit Pseudo MAC address (PMAC) which encodes the location of a host in the topology. It has the form pod.position.port.vmid, hierarchical like an IP address.
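As a concrete illustration, here is a minimal sketch of that layout in Python (the 16/8/8/16-bit field split follows the paper; the helper functions and example values are mine):

# Sketch of the 48-bit PMAC: pod(16) . position(8) . port(8) . vmid(16)

def encode_pmac(pod, position, port, vmid):
    """Pack the four location fields into a 48-bit pseudo MAC."""
    assert pod < 2**16 and position < 2**8 and port < 2**8 and vmid < 2**16
    return (pod << 32) | (position << 24) | (port << 16) | vmid

def decode_pmac(pmac):
    """Unpack a 48-bit PMAC back into (pod, position, port, vmid)."""
    return (pmac >> 32, (pmac >> 24) & 0xFF, (pmac >> 16) & 0xFF, pmac & 0xFFFF)

def pmac_str(pmac):
    """Render as a conventional colon-separated MAC string."""
    return ':'.join('%02x' % ((pmac >> s) & 0xFF) for s in range(40, -1, -8))

# Host on pod 0, position 1, port 2, first VM seen on that port:
p = encode_pmac(pod=0, position=1, port=2, vmid=1)
print(pmac_str(p))     # 00:00:01:02:00:01
print(decode_pmac(p))  # (0, 1, 2, 1)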

Let's see how a packet travels from host 1 (10.0.1.2) in Pod 0 to host 2 (10.1.1.2) in Pod 1:

- Host 1 sends an ARP: Who has 10.1.1.2?
- The edge switch intercepts the ARP. If it sees a source MAC it hasn't seen before, it generates a PMAC for it (the edge switch assigns a monotonically increasing vmid to each MAC it sees on a particular port) and creates an entry in its local PMAC table mapping the host's AMAC (actual MAC) and IP to the PMAC. It then communicates the IP-to-PMAC mapping to the fabric manager; this data is used to reply to ARP requests from other hosts and to translate the PMAC back to the AMAC on packets destined to the host. The edge switch also asks the fabric manager whether it knows the PMAC for 10.1.1.2.
If it does:
The edge switch constructs an ARP reply with the PMAC from the fabric manager and sends it to the host. The host learns the PMAC instead of the AMAC; this is why the PMAC is 48 bits long, so an unmodified host treats it like any other MAC address.
If it does not:
The ARP is broadcast as in standard Ethernet. The target host replies with its AMAC, and the target's edge switch rewrites the AMAC to the PMAC before forwarding the reply to the querying host and the fabric manager (this is how the fabric manager gets repopulated after a failure).
- 10.0.1.2 sends packets to the PMAC it has learnt. The last-hop edge switch translates the PMAC back to the AMAC before delivering to the target host. End hosts never see their own PMACs but do cache the PMACs of other hosts. (A toy model of this edge-switch logic follows the list.)
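Here is that edge-switch behaviour as a toy model (plain Python, all names mine; fabric_manager is just a dict standing in for the central directory):

# Toy model of the edge switch's ARP interception (my naming, not the paper's).
pmac_table = {}        # AMAC -> PMAC, used for last-hop rewriting
fabric_manager = {}    # IP -> PMAC, stand-in for the directory's soft state
next_vmid = {}         # ingress port -> next free vmid on that port

def on_arp_request(src_ip, src_amac, ingress_port, target_ip, my_pod, my_pos):
    # A source MAC we haven't seen: assign a PMAC and register the mapping.
    if src_amac not in pmac_table:
        vmid = next_vmid.get(ingress_port, 1)
        next_vmid[ingress_port] = vmid + 1
        pmac = (my_pod, my_pos, ingress_port, vmid)  # the four PMAC fields
        pmac_table[src_amac] = pmac
        fabric_manager[src_ip] = pmac                # report IP -> PMAC
    # Proxy the ARP if the directory knows the answer, else fall back to flooding.
    if target_ip in fabric_manager:
        return ('arp-reply', fabric_manager[target_ip])
    return ('broadcast', None)

print(on_arp_request('10.0.1.2', 'aa:bb:cc:dd:ee:ff', 2, '10.1.1.2', 0, 1))
# -> ('broadcast', None) the first time; an ARP reply once 10.1.1.2 is known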

Forwarding is pretty simple. If the destination PMAC is in the same pod but a different position, the edge switch forwards the packet up to an aggregation switch; if it is in a different pod, the aggregation switch forwards it up to a core switch. Through LDP, each switch learns which switch is connected to it on which port. Because PMACs aggregate like IP prefixes, the forwarding state on a switch stays small.
PMACs can also accommodate expansion of the DC (adding another pod). The multi-rooted tree topology itself, however, must remain fixed, which is a fair assumption.
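A sketch of that decision, keyed on the pod and position fields of the destination PMAC (my simplification; the real switches do prefix matches on the PMAC in hardware):

# Simplified next-hop choice in a fat tree, based only on the PMAC's
# pod/position fields. The level names are mine.

def next_hop(level, my_pod, my_position, dst_pod, dst_position):
    if level == 'edge':
        if (dst_pod, dst_position) == (my_pod, my_position):
            return 'deliver on the host-facing port'
        return 'send up to an aggregation switch'
    if level == 'aggregation':
        if dst_pod == my_pod:
            return 'send down to edge switch at position %d' % dst_position
        return 'send up to a core switch'
    return 'send down to pod %d' % dst_pod   # core: match on pod only

print(next_hop('aggregation', 0, 0, 1, 1))  # -> 'send up to a core switch'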

A good note:
When considering a distributed versus centralized protocol, there is an inherent trade-off between protocol simplicity and system robustness.

The fabric manager contains soft state: the IP-to-PMAC mappings reported by the edge switches.

Now the fabric manager will contain 1 million entries (one per host in the DC). Is this scalable? DHT, anyone?
The switches send LDMs to each other, and the fabric manager also monitors connectivity to each switch module. Is all this control traffic scalable?

Drawbacks:
1) What about traffic coming in from the Internet? How do you scalably handle all that traffic?
2) The paper says the approach works for multi-rooted trees and takes the fat-tree topology as an example, but that is easier said than done. If there were one more layer between the aggregation and core layers, their LDP could not work without help from the fabric manager.
3) VL2 runs OSPF and PortLand runs LDP, both of which require some configuration and broadcasts. Both schemes also use indirection, which involves caching information for speedup; when a cache entry goes stale, the central controller/fabric manager has to clean up, and that is tedious.
4) How does it account for middleboxes (MBs) in a data center? What if an MB is filtering on destination MAC or something?
5) It cannot be deployed in conjunction with an existing data center topology because legacy switches don't understand MAC prefixes or LDP. Even existing OpenFlow switches don't understand LDP, so those packets would have to be processed by the controller. Not backwards compatible.


To Read

Thursday, May 17, 2012

How does the whole business of DNS and domain names work?

How do domain names actually work?

Force ssh to not use public key


Force ssh to not authenticate using a public key:
ssh -o "PubkeyAuthentication no" user@server

Wednesday, May 16, 2012

Why does Ethernet have a minimum frame size of 64 bytes?

IEEE 802.3 states that there can be at most five physical segments between any two nodes, with four repeaters between them. With 10Base5 cabling, each physical segment can be up to 500 meters long. Therefore, an Ethernet network's maximum linear length is 2500 meters.


The speed of light in copper is about 2/3 of the speed of light in vacuum, i.e. roughly 2 x 10^8 m/s.


A 2500 m round trip at 2 x 10^8 m/s takes 2 x 2500 / (2 x 10^8) = 25 microseconds, so in order to detect a collision an adapter must still be transmitting at least 25 microseconds after it starts. The standard's slot time of 512 bit times (51.2 microseconds at 10 Mbps) adds margin for the repeaters and the jam signal.


The exponential backoff uses the same slot time:
after the m-th collision, choose K randomly from {0, ..., 2^m - 1}
and wait K * 512 bit times before trying again;
at 10 Mbps that is K * 51.2 microseconds.
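A quick sketch of that computation (the cap on the exponent at 10 is from the 802.3 standard; the helper itself is mine):

import random

def backoff_wait_us(m, bit_time_us=0.1):
    """Wait after the m-th collision at 10 Mbps (bit time = 0.1 us)."""
    k = random.randrange(2 ** min(m, 10))  # K in {0, ..., 2^m - 1}, exponent capped at 10
    return k * 512 * bit_time_us           # K * 512 bit times = K * 51.2 us

print(backoff_wait_us(3))  # some multiple of 51.2, between 0 and 7 * 51.2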


So the minimum frame must last one slot time: 512 bits = 64 bytes. Of those 64 bytes, the fixed header overhead is:


12 bytes  - Src+Dst MAC address
2 bytes - Ethernet type
4 bytes - CRC
----------
18 bytes
----------


Preamble = 8 bytes, 7 bytes of 10101010 + 1 byte of Start of frame delimiter 10101011
IFG - 12 bytes
Jamming signal - 4 bytes
This leaves 64 - 18 = 46 bytes as the minimum payload length. (The preamble, IFG, and jam signal are transmitted on top of the frame and do not count toward the 64 bytes.)


Now, IP + TCP headers = 40 bytes. Therefore, the OS must pad the data up to 46 bytes before handing the frame to the adapter.
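Putting the arithmetic in one place, a worked sketch of the numbers in this post:

# The numbers behind the 64-byte minimum frame (10 Mbps Ethernet).
length_m  = 2500      # 5 segments x 500 m
speed_mps = 2e8       # ~2/3 the speed of light in vacuum
bit_rate  = 10e6      # 10 Mbps

print(2 * length_m / speed_mps * 1e6)  # 25.0 us round-trip propagation
print(512 / bit_rate * 1e6)            # 51.2 us slot time (512 bit times)
print(512 // 8)                        # 64-byte minimum frame

header = 12 + 2 + 4                    # MACs + type + CRC
print(64 - header)                     # 46-byte minimum payload
print(46 - (20 + 20))                  # 6 bytes of padding for bare IP + TCP headers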

Tuesday, May 15, 2012

Chord: A scalable peer-to-peer lookup service for Internet applications

I have decided to write down reviews of all the papers I read from now on. This is for the sole purpose of improving my understanding of the papers, helping me put my thoughts down coherently, and improving my writing skills.

This paper was published in SIGCOMM 2001 and has been cited more than 9000 times. I will go over the basic algorithm and system model first and finally note down my evaluation.

Let's get some definitions down first:


successor(k) - the successor node of key k is the node whose identifier is equal to or follows k in the identifier space.
n.successor - the successor of node n is the next node in the identifier space. In the example ring used below (m = 3, nodes 0, 1, and 3, drawn as green dots in the paper's figure): successor(0) = 1, successor(1) = 3, successor(3) = 0. Similarly we have n.predecessor.

m - number of bits in the key/node identifier. Both nodes and keys are mapped to the same identifier space.
Each node maintains m entries in a routing table called the finger table.


The ith finger of node n, i.e. the ith entry in the finger table of node n, is denoted n.finger[i].node.
The ith entry in the finger table of node n is s = successor((n + 2^(i-1)) mod 2^m), where 1 <= i <= m.

n.successor = n.finger[1].node
n.finger[i].start = (n + 2^(i-1)) mod 2^m
n.finger[i].interval = [n.finger[i].start, n.finger[i+1].start)



For m = 3 and nodes {0, 1, 3} (the ring in the paper's figure), the finger table of node 0 has:
1st entry: 0.finger[1].start = (0 + 2^0) mod 2^3 = 1 => successor(1) = 1
2nd entry: 0.finger[2].start = (0 + 2^1) mod 2^3 = 2 => successor(2) = 3
3rd entry: 0.finger[3].start = (0 + 2^2) mod 2^3 = 4 => successor(4) = 0

0.finger[4].start = (0 + 2^3) mod 2^3 = 0 (needed only as the end of the 3rd interval)

1st interval: [1, 2)
2nd interval: [2, 4)
3rd interval: [4, 0)

See the paper for pseudocode for routing and joining.

The node also stores the keys it is responsible for and its successor and predecessor. Each finger table entry includes both the Chord identifier and the IP address and port number of the node.
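The whole construction fits in a few lines; this sketch (my code, following the definitions above) reproduces the node-0 finger table for the m = 3 ring:

# Finger tables for the toy ring: m = 3, nodes {0, 1, 3}.
m = 3
ring = 2 ** m
nodes = sorted([0, 1, 3])

def successor(k):
    """First node whose identifier equals or follows k (mod 2^m)."""
    k %= ring
    for n in nodes:
        if n >= k:
            return n
    return nodes[0]  # wrapped past the largest identifier

for n in nodes:
    print(n, [successor(n + 2 ** (i - 1)) for i in range(1, m + 1)])
# 0 [1, 3, 0]
# 1 [3, 3, 0]
# 3 [0, 0, 0]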

Here is how the finger table is updated (the paper's figures walk through both cases):

Node 6 joins -

Node 1 leaves -

See the stabilization algorithm in this ppt

Here is how I envision Chord working in a P2P system:
When a new node joins the system, it contacts a well-known server to help it find its successor. Once the node has been integrated into the system, it must allocate some space for storing its part of the key-value store. If a node wants to share something, the cryptographic hash (e.g. SHA-1) of the file name it wants to share is stored in the system as a key (sketched below), and the value corresponding to this key is the node's IP address and port number. The system should store multiple copies of each (key, value) pair for high availability. This might be done with a different hash function: you can run multiple Chord networks, each using a different hash function, or simply store each (key, value) pair on the designated node and its k successors.
If someone wishes to download the file, they compute the hash of the search query to get the key and check whether it exists in the system. If it does, the value tells them the IP address of a node that has the data. When they download the file, their own address should be added to all copies of the (key, value) pair.
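A sketch of that key-derivation step (Chord does use SHA-1 and a 160-bit identifier space; the function name and example values are mine):

import hashlib

m = 160  # SHA-1 digest size, so identifiers live in [0, 2^160)

def file_key(name):
    """Map a file name to a Chord key via SHA-1."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, 'big') % (2 ** m)

key = file_key('some-song.mp3')   # hypothetical file name
value = ('203.0.113.7', 6881)     # sharer's IP and port (example values)
print(hex(key), value)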
I suppose this is how magnet links work too, though BitTorrent is based on a different DHT design called Kademlia, which uses a different routing algorithm but, like Chord, is built on consistent hashing. I will be posting about it soon.
An obvious disadvantage of storing data this way is that we cannot perform wildcard searches like *.mp3. I don't know how that might be done.

For a system with N nodes, Chord proves that it can resolve all lookups via O(log N) messages while maintaining information about only O(log N) other nodes (the finger table). It also requires O(log^2 N) messages to re-establish the routing invariants and finger tables when a node joins or leaves the system.

P2P systems have very high churn rates. The large number of messages that must be exchanged and the finger-table changes required every time a node joins or leaves make Chord, in my opinion, infeasible in practice. It would be better to use Chord only for the ultrapeer network, since ultrapeers have a low churn rate.

Sunday, May 13, 2012

JavaScript video tutorials

JavaScript type coercion

Rainbow tables 101

Curl useful options


 -I/--head
              (HTTP/FTP/FILE) Fetch the HTTP-header only! HTTP-servers feature
              the command  HEAD which this uses to get nothing but the header
              of a document. When used on a FTP or FILE  file, curl  displays
              the file size and last modification time only.


$ curl -I www.google.com
HTTP/1.1 200 OK
Date: Sun, 13 May 2012 16:32:51 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=821eaece51276b96:FF=0:TM=1336926771:LM=1336926771:S=cOtp8nzi-o3wIqdf; expires=Tue, 13-May-2014 16:32:51 GMT; path=/; domain=.google.com
Set-Cookie: NID=59=MaFem8YcG2Vg2RmRmOC36pnCSzI3sJ-9T-0x5ueuAZJCRS2PK3poumUU7dYCio0BDBQTdsy2N5TwdF4QqQyrah0_HVDV_WCK0hjzBzOUSYvO-bBgD2aCHKMWOc_fOo1_; expires=Mon, 12-Nov-2012 16:32:51 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Transfer-Encoding: chunked

Free online courses from top US universities

Thursday, May 10, 2012

Smurf Attack

From Wikipedia:
The Smurf attack is a way of generating significant computer network traffic on a victim network. This is a type of denial-of-service attack that floods a system via spoofed broadcast ping messages.
This attack relies on a perpetrator sending a large amount of ICMP echo request (ping) traffic to IP broadcast addresses, all of which have a spoofed source IP address of the intended victim. If the routing device delivering traffic to those broadcast addresses delivers the IP broadcast to all hosts (for example via a layer 2 broadcast), most hosts on that IP network will take the ICMP echo request and reply to it with an echo reply, multiplying the traffic by the number of hosts responding.


To send a broadcast ping yourself, use the -b option:

ping -b 255.255.255.255


There are some special multicast groups:

224.0.0.1 is the all-hosts group. If you ping that group, all multicast-capable hosts on the network should answer, as every multicast-capable host must join that group at start-up on all its multicast-capable interfaces.
224.0.0.2 is the all-routers group. All multicast routers must join that group on all their multicast-capable interfaces.
224.0.0.4 is all DVMRP routers, 224.0.0.5 all OSPF routers, 224.0.0.13 all PIM routers, etc.
All these special multicast groups are regularly published in the "Assigned Numbers" RFC.


If you don't receive a reply, it may be because the machines are ignoring broadcast pings. On Linux you can change this setting (0 means broadcast pings are answered) by typing

echo "0" > /proc/sys/net/ipv4/icmp_echo_ignore_broadcasts


Tuesday, May 8, 2012

Send mail from Gmail using command line

I tested it on Windows.

What you need:
stunnel 
telnet

Install stunnel. Run it.

Right click on the icon in the System Tray and click Edit stunnel.conf. Uncomment the lines corresponding to [gmail-smtp] and make these changes:


[gmail-smtp]
client = yes
accept = 127.0.0.1:2301
connect = smtp.gmail.com:465

Right click on the icon in the System Tray and click Reload stunnel.conf.

Now from the command prompt type: telnet localhost 2301
If telnet is not available see the next post on how to enable it.

If all goes well you should see the following in the telnet window:

220 mx.google.com ESMTP

Type:
HELO google
250 mx.google.com at your service

Type:
AUTH LOGIN
334 VXNlcm5hbWU6
This is Base64 for:
334 Username:

Type your username in Base64. Use this website for performing the conversion.
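Any base64 tool works instead of a website; for example, in Python:

from base64 import b64encode
print(b64encode(b'username@gmail.com').decode())  # paste at the Username: prompt
print(b64encode(b'password').decode())            # paste at the Password: prompt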

You'll now see
334 UGFzc3dvcmQ6
which is Base64 for (you guessed it!) Password:

If the username/password is accepted you'll see:
xxxxxx 2.7.0 Accepted

You're in!!

Type
MAIL FROM:<username@gmail.com>
250 2.1.0 OK 

The angle brackets are important.

RCPT TO:<friend@school.edu>
250 2.1.5 OK

DATA
354  Go ahead
From: User <username@gmail.com>
To: Friend <friend@school.edu>
Subject: Hello!
Body
.
250 2.0.0 OK

End the body with a '.' on a line by itself.

QUIT
221 2.0.0 closing connection


Pretty cool right?

EDIT:
For doing  AUTH PLAIN use:
Python code:
from base64 import b64encode
# AUTH PLAIN token: authorization-id NUL authentication-id NUL password
b64encode(b'authorization-id' + b'\0' + b'authentication-id' + b'\0' + b'passwd')

The client may leave the authorization-id empty to indicate that it is the same as the authentication identity.

For doing AUTH CRAM-MD5 use the following code:
http://www.net-track.ch/opensource/cmd5/

How to enable telnet on Windows 7?

If you get the following error on running telnet from command prompt:

C:\Users\xxxx>telnet
'telnet' is not recognized as an internal or external command,
operable program or batch file.

You need to enable the Telnet client: open Control Panel > Programs and Features, click Turn Windows features on or off, tick Telnet Client, and click OK.



Have you downloaded something illegal using torrents recently?

They know about it!

http://www.youhavedownloaded.com/

Monday, May 7, 2012

Simple multicast client and server

Multicast Server/Listener

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h> /* close */

#define SERVER_PORT 1500
#define MAX_MSG 100

int main(int argc, char *argv[]) {

  int sd, rc, n;
  socklen_t cliLen;    /* recvfrom() expects a socklen_t length */
  struct ip_mreq mreq;
  struct sockaddr_in cliAddr, servAddr;
  struct in_addr mcastAddr;
  struct hostent *h;
  char msg[MAX_MSG];

  if(argc!=2) {
    printf("usage : %s <mcast address>\n",argv[0]);
    exit(0);
  }

  /* get mcast address to listen to */
  h=gethostbyname(argv[1]);
  if(h==NULL) {
    printf("%s : unknown group '%s'\n",argv[0],argv[1]);
    exit(1);
  }

  memcpy(&mcastAddr, h->h_addr_list[0],h->h_length);

  /* check given address is multicast */
  if(!IN_MULTICAST(ntohl(mcastAddr.s_addr))) {
    printf("%s : given address '%s' is not multicast\n",argv[0],
           inet_ntoa(mcastAddr));
    exit(1);
  }

  /* create UDP socket */
  sd = socket(AF_INET,SOCK_DGRAM,0);
  if(sd<0) {
    printf("%s : cannot create socket\n",argv[0]);
    exit(1);
  }

  /* bind port */
  servAddr.sin_family=AF_INET;
  servAddr.sin_addr.s_addr=htonl(INADDR_ANY);
  servAddr.sin_port=htons(SERVER_PORT);
  if(bind(sd,(struct sockaddr *) &servAddr,sizeof(servAddr))<0)  
  {
    printf("%s : cannot bind port %d \n",argv[0],SERVER_PORT);
    exit(1);
  }

  /* join multicast group */
  mreq.imr_multiaddr.s_addr=mcastAddr.s_addr;
  mreq.imr_interface.s_addr=htonl(INADDR_ANY);

  rc = setsockopt(sd,IPPROTO_IP,IP_ADD_MEMBERSHIP,
                  (void *) &mreq, sizeof(mreq));
  if(rc<0) {
    printf("%s : cannot join multicast group '%s'",argv[0],
           inet_ntoa(mcastAddr));
    exit(1);
  }
  else {
    printf("%s : listening to mgroup %s:%d\n",
           argv[0],inet_ntoa(mcastAddr), SERVER_PORT);


    /* infinite server loop */
    while(1) {
      cliLen=sizeof(cliAddr);
      n = recvfrom(sd,msg,MAX_MSG,0,(struct sockaddr *) &cliAddr,&cliLen);
      if(n<0) {
        printf("%s : cannot receive data\n",argv[0]);
        continue;
      }

      printf("%s : from %s:%d on %s : %s\n",argv[0],
             inet_ntoa(cliAddr.sin_addr),ntohs(cliAddr.sin_port),
             argv[1],
             msg);
    }/* end of infinite server loop */

  }

return 0;

}


Multicast Client/Sender

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h> /* close */

#define SERVER_PORT 1500
#define MAX_MSG 100

int main(int argc, char *argv[]) {

  int sd, rc, i;
  unsigned char ttl = 1;
  struct sockaddr_in cliAddr, servAddr;
  struct hostent *h;

  if(argc<3) {
    printf("usage %s <mgroup> <data1> <data2> ... <dataN>\n",argv[0]);
    exit(1);
  }

  h = gethostbyname(argv[1]);
  if(h==NULL) {
    printf("%s : unknown host '%s'\n",argv[0],argv[2]);
    exit(1);
  }

  servAddr.sin_family = h->h_addrtype;
  memcpy((char *) &servAddr.sin_addr.s_addr, h->h_addr_list[0],h->h_length);
  servAddr.sin_port = htons(SERVER_PORT);

  /* check if dest address is multicast */
  if(!IN_MULTICAST(ntohl(servAddr.sin_addr.s_addr))) {
    printf("%s : given address '%s' is not multicast \n",argv[0],
           inet_ntoa(servAddr.sin_addr));
    exit(1);
  }

  /* create socket */
  sd = socket(AF_INET,SOCK_DGRAM,0);
  if (sd<0) {
    printf("%s : cannot open socket\n",argv[0]);
    exit(1);
  }
  /* bind any port number */
  cliAddr.sin_family = AF_INET;
  cliAddr.sin_addr.s_addr = htonl(INADDR_ANY);
  cliAddr.sin_port = htons(0);
  if(bind(sd,(struct sockaddr *) &cliAddr,sizeof(cliAddr))<0) {
    perror("bind");
    exit(1);
  }

  if(setsockopt(sd,IPPROTO_IP,IP_MULTICAST_TTL, &ttl,sizeof(ttl))<0) {
    printf("%s : cannot set ttl = %d \n",argv[0],ttl);
    exit(1);
  }

  printf("%s : sending data on multicast group '%s' (%s)\n",argv[0],
         h->h_name,inet_ntoa(*(struct in_addr *) h->h_addr_list[0]));


  /* send data */
  for(i=2;i<argc;i++) {
    rc = sendto(sd,argv[i],strlen(argv[i])+1,0,
                (struct sockaddr *) &servAddr, sizeof(servAddr));

    if (rc<0) {
      printf("%s : cannot send data %d\n",argv[0],i-1);
      close(sd);
      exit(1);
    }

  }/* end for */

  /* close socket and exit */
  close(sd);
  exit(0);
}
 


$ ./mcastserver 224.0.0.200
./mcastserver : listening to mgroup 224.0.0.200:1500
./mcastserver : from 192.168.10.52:51654 on 224.0.0.200 : hello
./mcastserver : from 192.168.10.52:51654 on 224.0.0.200 : bye

$ ./mcastclient 224.0.0.200 hello bye
./mcastclient : sending data on multicast group '224.0.0.200' (224.0.0.200)



The multicast router (the IGMP querier) periodically sends Membership Query (type 0x11) messages to the all-hosts multicast address 224.0.0.1.
When a host joins a new multicast group it sends a Membership Report (type 0x16) addressed to the group it is joining.
When a host leaves a group it sends a Leave Group (type 0x17) message to the all-routers multicast address 224.0.0.2.

Also read:
http://ntrg.cs.tcd.ie/undergrad/4ba2/multicast/antony/
http://www.ibiblio.org/pub/Linux/docs/HOWTO/other-formats/html_single/Multicast-HOWTO.html


Why is a log-log plot any day better than a linear plot?