Pages

Sunday, October 6, 2013

C pointer to const and const pointer

const int * ptr1 --> pointer to a constant. You cannot change the value of the integer ptr1 points to, but you can change ptr1 to point to something else.
int const * ptr2 --> The same thing (the const binds to int either way).
int * const ptr3 --> const pointer. You can change the value of the integer ptr3 points to, but you cannot change ptr3 to point to something else.
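
A minimal sketch illustrating the three declarations (the commented-out assignments are the ones the compiler rejects):

#include <stdio.h>

int main(void) {
    int a = 1, b = 2;

    const int *ptr1 = &a;   /* pointer to const */
    /* *ptr1 = 5; */        /* error: cannot modify the pointed-to int */
    ptr1 = &b;              /* OK: the pointer itself can change */

    int const *ptr2 = &a;   /* same as ptr1 */

    int *const ptr3 = &a;   /* const pointer */
    *ptr3 = 5;              /* OK: the pointed-to int can change */
    /* ptr3 = &b; */        /* error: cannot repoint a const pointer */

    printf("%d %d %d\n", *ptr1, *ptr2, *ptr3);
    return 0;
}
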
http://duramecho.com/ComputerInformation/WhyHowCppConst.html

Friday, August 2, 2013

Vim cheatsheet

* - search word under cursor
:!time - execute shell command time
~ - change case of character under cursor
0 - goto start of line
$ - goto end of line
Ctrl+f - Page Down
Ctrl+b - Page up
e - forward to the end of the next word
b - back to the start of the previous word
o - Open a blank line below the current line and go to insert mode
O - Open a blank line above the current line and go to insert mode
V1< or V3> - Outdent or indent a selected region (the count is the number of shiftwidths)
. - Repeat the previous action
df<key> - Delete from the cursor up to and including the next <key> on the line
D - Delete from the cursor to the end of the line

When you don't know how many lines you wish to yank/cut use these commands:
mk - mark line
y'k - yank (copy)
d'k - cut

:e <filename> - open file for editing
:sp - split window horizontally
Ctrl+wv or :vs - split window vertically
Ctrl+ww - switch window
Ctrl+w = - Resize split windows to adjust to screen resolution
Ctrl+w r - Rotate windows in a row

Commenting multiple lines:
Ctrl+v to go into visual mode and select the lines you wish to comment. Pressing : will give you the
:'<,'> prompt, which you should extend to :'<,'>s/^/#/
Uncommenting lines:
Ctrl+v to go into visual mode and select the commented characters. Press x to delete them

http://vim.wikia.com/wiki/Cut/copy_and_paste_using_visual_selection
http://vim.wikia.com/wiki/Using_tab_pages

Monday, April 22, 2013

SPAIN: COTS Data-Center Ethernet for Multipathing over Arbitrary topologies

Drawbacks:
Not scalable
Requires end host modifications
Does not handle failures well

Wednesday, April 10, 2013

Java array creation running time


I created integer arrays of different sizes and measured the running time of array creation (e.g. int[] x = new int[67108864])

Array size    Time
 67108864     0.052
134217728     0.102
268435456     0.206
536870912     0.41

The running time is linear in the array size.
To be fair, this depends on the JVM version (I have java version "1.6.0_27"), and there are algorithms out there which can give you constant time but take more memory (http://eli.thegreenplace.net/2008/08/23/initializing-an-array-in-constant-time/)


Monday, April 8, 2013

PAST: Scalable Ethernet for Data Centers

Here is the paper
PAST stands for Per-Address Spanning Tree routing algorithm which is the core idea behind the paper. Much of the introduction is similar to SEATTLE.

Pure Ethernet
Pros:
1) Self-configuration (plug-and-play)
2) Allows host mobility without hassles.
Cons:
1) Not scalable. Flooding-based learning does not scale beyond about 1000 hosts, and the MAC table becomes huge because each switch has to know how to divert traffic to all hosts and MAC addresses don't aggregate. We can limit broadcasts by using VLANs, but VLAN IDs are limited in number (4094).
2) Spanning trees limit the available bandwidth by removing links to avoid forwarding loops.

IP
Pros:
1) Allows ECMP forwarding (limited to layer 3 networks currently). More efficient use of available bandwidth.
2) IP addresses are aggregatable and hence scalable.
Cons:
1) Difficult to configure and manage (router config, DHCP server config --> error prone)
2) Host mobility is an issue. Live Migration is needed for fault tolerance and to achieve high host utilization.


Ethernet + IP
Group Ethernet LANs via IP (run OSPF among IP routers). Layer 3 routers allow for ECMP routing and provide scalability.
Cons:
Subnetting limits host mobility to within a single LAN (subnet). Scalability becomes an issue as tenants grow. Subnetting wastes IP addresses and increases the configuration overhead of both routers and DHCP servers.



Problem statement
We want the ease and flexibility of Ethernet and the scalability and performance of IP, while using inexpensive commodity hardware (proprietary hardware is harder to deploy and loses the advantage of economies of scale), and it should work with arbitrary topologies.

PAST relies solely on destination MAC addresses and VLAN tags to forward packets. So it can use the larger MAC tables (SRAM) instead of TCAMs (which have higher area and power per entry than SRAM).

The paper provides a good overview of the Trident switch chip and a useful table of table sizes for some of the commercially available 10Gbps switches:

http://www.intel.com/content/www/us/en/switch-silicon/ethernet-switch-fm6000-series.html
The Intel FM6000 supports 64K IPv4 routes as well.


Lack of information about switch internals makes it difficult for networking researchers to consider the constraints of real hardware. This paper fills this gap (for the Trident switch chip).


The L2 table performs an exact-match lookup on VLAN ID and destination MAC address. This table is much larger than the TCAM tables. The output of the table is either an output port or a group.

TCAMs can wildcard-match on most packet header fields. The rewrite TCAM supports output actions that modify packet headers. The forwarding TCAM is used to choose an output port or group. Trident can support 1K ECMP groups, each with a maximum of 4 output ports, in the ECMP Group Table.

Switches contain a control plane processor that is responsible for programming the switch chip, listening to controller messages and participating in control plane protocols like spanning tree, OSPF etc.

The IBM RackSwitch G8264 has the unique capability to install Openflow rules that exact-match on only destination MAC and VLAN ID in the L2 table. The switch can install 700-1600 rules/s, and each rule installation takes 2-12ms. Each time a packet matches no rule it is sent to the controller, and this path is limited to 200 packets/s. Therefore reactive forwarding will not provide acceptable performance.

Given the small size of TCAMs, any routing mechanism that requires the flexibility of a TCAM must aggregate routes. The paper argues that given the large size of L2 tables we can fit one entry per routable address in every switch. I don't believe this argument at all. Portland started out with the same argument and basically said: let's bring the benefits of aggregation to MAC addresses. But by doing so you again limit host placement (and in Portland's case the topology also). To get around this, layer 2 and layer 3 schemes make use of indirection to separate location addresses from application addresses, which leads to a familiar set of inefficiencies.
Spanning-tree Ethernet is not able to exploit the multiple paths available in topologies like Fat-Tree and Jellyfish, because STP constructs a single tree along which to forward packets to avoid forwarding loops; SPAIN works around this with multiple spanning trees mapped to VLANs.
TRILL runs IS-IS to build shortest path routes between switches but it uses broadcast for address resolution, limiting its scalability.

Multipath routing and Valiant load balancing (VLB) are used to fully exploit the available bandwidth in a network.
If there are multiple shortest paths between two hosts we can use ECMP to increase path diversity and avoid hotspots and flow collisions. But ECMP is applicable only in architectures that compute shortest paths, like IP routing and TRILL (IS-IS).
VLB first forwards traffic along a minimal path to a random intermediate switch, after which the traffic follows a minimal path to its destination.

Core idea
Many possible spanning trees can be built for an address if the topology has high path diversity. If we make the individual per-address trees as disjoint as possible, we can improve aggregate network utilization.

Baseline: Destination-rooted shortest path trees.
Build a BFS spanning tree for every address in the network. This spanning tree is rooted at the destination host, so it provides a minimum hop-count path from any point in the network to that destination. No links are ever disabled in the topology. Every switch forwards traffic to the host along its path in that host's tree. Forward and reverse paths can be asymmetric (since a separate tree is computed for each host).
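
Here is a minimal sketch of the baseline construction on a toy topology (my own illustration, not the paper's code): run BFS from the destination's switch and record each switch's BFS parent; installing {destination MAC -> parent} at every switch realizes the tree.

#include <stdio.h>

#define MAX_SWITCHES 16

/* Adjacency matrix of the switch topology: adj[u][v] != 0 means a link. */
static int adj[MAX_SWITCHES][MAX_SWITCHES];

/*
 * Build a BFS spanning tree rooted at the destination's switch.
 * next_hop[u] is the neighbor that switch u forwards to in order to
 * reach the root; installing {destination MAC -> next_hop[u]} at every
 * switch u realizes the per-address tree.
 */
static void build_tree(int n, int root, int next_hop[])
{
    int queue[MAX_SWITCHES];
    int visited[MAX_SWITCHES] = {0};
    int head = 0, tail = 0;

    for (int u = 0; u < n; u++)
        next_hop[u] = -1;
    visited[root] = 1;
    queue[tail++] = root;

    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < n; v++) {
            if (adj[u][v] && !visited[v]) {
                visited[v] = 1;
                next_hop[v] = u;   /* v reaches the root via u */
                queue[tail++] = v;
            }
        }
    }
}

int main(void)
{
    /* Toy 4-switch diamond: 0-1, 0-2, 1-3, 2-3. */
    adj[0][1] = adj[1][0] = 1;
    adj[0][2] = adj[2][0] = 1;
    adj[1][3] = adj[3][1] = 1;
    adj[2][3] = adj[3][2] = 1;

    int next_hop[MAX_SWITCHES];
    build_tree(4, 0, next_hop);   /* destination attached to switch 0 */
    for (int u = 1; u < 4; u++)
        printf("switch %d forwards to switch %d\n", u, next_hop[u]);
    return 0;
}

The variants below change only how the parent is chosen: instead of taking the first equal-depth neighbor that discovers a switch, pick among the candidates uniformly at random (PAST-R) or by weight (PAST-W).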

Three variants
PAST-R
Pick a uniformly random next hop in BFS. Intuitively we are spreading the traffic across all the links since each destination will have a random spanning tree.

PAST-W
Use weighted randomization in BFS: weight the next-hop selection by how many children the next-hop switch has.

The above two variants are similar to ECMP. But unlike ECMP, which balances load at per-flow granularity, PAST balances load at per-destination granularity.

NM-PAST
This is similar to VLB and achieves high performance under adversarial workloads. For each host h, select a random intermediate switch i and use it as the root of the BFS spanning tree. The switches along the tree path from h to i are then updated to route traffic towards h rather than i, making h the sink of the tree.

Evaluations found no significant difference between PAST-R and PAST-W. NM-PAST performs better than ECMP and VLB.


Implementation

PAST uses the Openflow controller / dumb-switch architecture. All switches forward ARP messages to the controller, and the controller sends back ARP replies. Ditto for DHCP. Switches run LLDP, so the controller knows the entire topology. PAST is topology independent.


What happens when?
1) A host joins/migrates: the migrated host sends a gratuitous ARP, which is forwarded to the controller. The controller calculates the new tree for the host and updates the rules on the switches.
2) A link/switch fails: all hell breaks loose. We need to recompute all affected trees (if the switch is in the core layer this can be very costly). This is the biggest drawback of the paper. The paper says the algorithm can compute trees for all hosts in a network with 8,000 hosts in 300ms, and with 100,000 hosts in 40 seconds. Installation of a single rule on a switch takes up to 12ms. They say it takes up to a minute for them to recover from a single failure.



Drawbacks

1) The authors say their design does not need special hardware, but they depend on the IBM G8264, which is currently the only switch that supports installing Openflow rules in the Ethernet forwarding tables.
2) Installing a recomputed tree can lead to routing loops. This can be prevented by first removing the rules corresponding to that tree from the switches and then using a barrier command to make sure they are gone before installing the new tree's rules. This wastes time.
3) PAST requires one Ethernet table entry per routable address, so the scalability of the architecture is limited by the Ethernet table size (around 100K entries at most). The paper hopes that PAST will scale well in future networks because it uses SRAM-based Ethernet tables.
SPAIN, another paper similar to PAST, requires a switch to store all possible (VLAN, destination MAC) pairs, again limiting scalability. The paper says that the switches resort to flooding in case of table overflow, which is not acceptable in a big data center.


Wednesday, April 3, 2013

CloudNaas: A Cloud Networking Platform for Enterprise Applications

Here is the paper
The paper is not well written. It is all over the place and is missing many details. The introduction moves in circles, telling the same thing again and again, and leaves you confused about what problem the paper is trying to solve. It is only on page 4, when the paper talks about the implementation, that you understand what the paper is about.

I have rewritten the premise of the paper in my own words:

A network admin has an enterprise network. He wants to migrate it to the cloud, but he is reluctant because clouds don't support MBs and private IPs (customers get little or no control to configure the network). The cloud provider needs to reduce the need to rewrite applications when moving them to the cloud. We should allow customers to keep the private IP addresses they were using in their enterprise networks. The paper says another key issue is to allow broadcasts - I don't think this is an issue, as broadcasts are generally bad and should be avoided unless absolutely necessary. There are ways to get around ARP and DHCP broadcasts by making them unicasts to the network controller/directory service. Also, you can replace broadcasts with IP multicasts.

Now CloudNaaS comes into the picture: it asks the admin for the network policy and tries to satisfy the needs of this logical topology using the cloud's physical resources.


CloudNaaS is an "under the cloud" implementation, i.e., it is implemented by the cloud provider.
It allows customers to specify network requirements (nothing new):
1) Tenants specify bandwidth requirements for applications hosted in the cloud.
2) Tenants specify MB traversal.

Components:

Cloud controller
Takes the user's network policy
Network controller
Monitors and configures the network devices
Decides the optimal placement of VMs in the cloud to satisfy network policies; monitors and adjusts to keep meeting them

Drawbacks
1) Multiple VMs can have the same private IP address, so how do you distinguish between them? Each VM has a public and a private IP address, and the software switch inside the hypervisor rewrites the IP addresses of incoming and outgoing packets (how does this interact with tunneling?). On migration these rules are updated, so communication still happens using the public IP address. To allow MB traversal, CloudNaaS installs rules in the software switches of the source VM and subsequent MBs which simply tunnel the packet along the policy path. This works, but now you need policy rules at each hop, which means changing policies is inflexible, you use more switch memory, and installing certain policies is infeasible. You still have to identify the previous hop somehow, and the paper does not talk about it.
2) This assumes the MB is a layer 3 device. How do I handle transparent firewalls/NATs? These devices will now have to be installed at choke points within the data center network (the paper says they provide the NAT service at the cloud gateway), which has the disadvantages explained in the PLayer paper.
3) The software switch attached to an MB has to know the private IP <--> public IP translation for all the hosts of the tenant. Migration is still not seamless. How does an LB work? It can't use DSR/DNAT/tunneling.
4) Bandwidth reservation schemes are not very useful in a cloud with a high churn rate. They are either overly conservative, leading to low network utilization, or overly lenient, resulting in bad isolation.
5) The paper is not clear about the communication matrix and how VM placement is decided using it.
6) The paper does not talk about how the core network topology is set up; it implicitly assumes it somehow works in a scalable manner. The policies are pushed to the edge switches in the hypervisors. The paper suggests the optimization of breaking up the address space and allocating each subnet to the servers attached to a ToR. This is exactly what a traditional hierarchical data center looks like (and it doesn't allow seamless VM migration). Also, migration changes the public IP address of a host, and this requires major rewriting of rules on all switches.

Things learnt from the paper
Hardware Openflow switches use expensive and power-hungry TCAMs, while software switches use DRAM. As a result a software switch can store many more rules. So, try to push more intelligence to the edge of the network while treating the core as a simple, dumb packet pusher.


Tuesday, April 2, 2013

Paxos demystified

A Policy-aware Switching Layer for Data Centers

A good paper with good ideas about the role and position of middleboxes in data center topology. The premise of the paper is simple: middleboxes cause a lot of agony in data centers. 78% of data center downtime is caused by misconfiguration, because humans are involved in implementing the middlebox policies. Typical methods used to ensure correct MB traversal are:
1) Overloading path selection mechanisms (adjusting weights in spanning tree protocol)
2) Put MBs at choke points
3) Place MBs at all possible choke points/ incorporate MB functionality in switches - costly.

Problems
1) Ad-hoc practices like overloading existing path selection mechanisms are hard to configure and maintain (think link/switch failure). Other practices: removing links to create choke points is complex and loses fault tolerance; separate VLANs with MBs at interconnection points violate efficiency and rule out seamless VM migration.
2) Inefficiency - packets should not traverse MBs they are not supposed to - this wastes resources and can also cause incorrect behavior.
3) Inflexibility - new instances of an MB may be deployed to balance load (how do we make this automatic?), or policies may change in the future. Doing this requires human intervention, which leads to errors.
4) If an MB fails, the network partitions (since MBs are placed at choke points).

What we need
1) Correctness - Traffic should traverse middleboxes in the sequence specified by the network administrator under all network conditions.
2) Flexibility - The sequences of middleboxes should be easily reconfigured as application requirements change.
3) Efficiency - Traffic should not traverse unnecessary middleboxes

Solution
1) Don't place MBs at choke points; instead attach them to pswitches and configure the pswitches to implement MB policies, i.e., take MBs off the physical network path. Data centers have low network latency, so sending packets to off-path MBs is not costly.
2) Separate the logical topology from the physical topology, or in other words separate policy from reachability.

Architecture
Consists of Policy controller, Middlebox controller, and pswitches

The policy controller accepts policies (from the network admin) and converts them into pswitch rules.

Policies are of the form:
[Start Location, Traffic selector (5-tuple)] --> sequence of MB types

Pswitch rules are of the form:
[Previous Hop, Traffic selector (5-tuple)] --> Next Hop
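
A sketch of how the policy controller might unroll a policy into per-hop pswitch rules (my own illustration; the struct fields and ids are made up, not the paper's):

#include <stdio.h>

struct five_tuple {
    unsigned int src_ip, dst_ip;
    unsigned short src_port, dst_port;
    unsigned char proto;
};

/* [Start Location, Traffic selector] --> sequence of MB types */
struct policy {
    int start_location;
    struct five_tuple selector;
    int mb_sequence[8];      /* ordered middlebox types */
    int mb_count;
    int destination;         /* final server */
};

/* [Previous Hop, Traffic selector] --> Next Hop */
struct pswitch_rule {
    int prev_hop;
    struct five_tuple selector;
    int next_hop;
};

/* Unroll one policy into per-hop pswitch rules; returns the rule count. */
static int compile_policy(const struct policy *p, struct pswitch_rule rules[])
{
    int hop = p->start_location;
    int n = 0;

    for (int i = 0; i < p->mb_count; i++) {
        rules[n].prev_hop = hop;
        rules[n].selector = p->selector;
        rules[n].next_hop = p->mb_sequence[i];
        hop = p->mb_sequence[i];
        n++;
    }
    /* Last hop: from the final MB to the destination server. */
    rules[n].prev_hop = hop;
    rules[n].selector = p->selector;
    rules[n].next_hop = p->destination;
    return n + 1;
}

int main(void)
{
    /* Policy: traffic entering at location 0 traverses MB types
       101 (firewall) then 102 (load balancer) before server 200. */
    struct policy p = { .start_location = 0,
                        .mb_sequence = { 101, 102 }, .mb_count = 2,
                        .destination = 200 };
    struct pswitch_rule rules[9];
    int n = compile_policy(&p, rules);

    for (int i = 0; i < n; i++)
        printf("[prev hop %d, selector] --> next hop %d\n",
               rules[i].prev_hop, rules[i].next_hop);
    return 0;
}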

The middlebox controller monitors the liveness of MBs and informs the pswitches about the addition or failure of MBs.

Pswitches
perform three operations:
1) Identify the previous hop traversed by the frame - based on its source MAC address (this can cause problems if the MB modifies the source MAC address) or its incoming interface.
2) Determine the next hop to be traversed by the frame
3) Forward the frame to its next hop - using L2 encapsulation. A redirected frame is encapsulated in a new Ethernet frame identified by a new EtherType code. The dst MAC is set to the next MB or server, and the src MAC is set to that of the original frame or of the last MB instance traversed. Preserving the original MAC address is required for correctness by some MBs.
Forwarding also allows balancing load across multiple instances of the same MB type (specify multiple next hops and use 5-tuple consistent hashing to distribute traffic among them), which is resilient to MB failure; see the sketch after this list.
If the next hop is a transparent device (say a firewall), we need to identify it somehow using a MAC address and set the dst MAC of packets going to it accordingly. Also, the next hop after the firewall will need to identify the previous hop as the firewall somehow. The paper gets around this by giving such a device a fake MAC address, which is registered with the MB controller when the device comes up for the first time. If the device is attached to a non-pswitch, however, we then need a SrcMacRewriter between the MB and the switch: a stateless device that inserts a special source MAC address that uniquely identifies the MB.
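
Here is a sketch of the per-flow hashing used to pick among instances (my own illustration, using rendezvous a.k.a. highest-random-weight hashing as one simple way to get consistent-hashing behavior; the FNV-1a hash and the instance ids are my choices, not the paper's):

#include <stdio.h>
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* FNV-1a, folded over an arbitrary byte range. */
static uint64_t fnv1a(const void *data, unsigned len, uint64_t h)
{
    const unsigned char *p = data;
    while (len--) {
        h ^= *p++;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hash the 5-tuple together with a candidate instance id. */
static uint64_t weight(const struct five_tuple *t, int instance)
{
    uint64_t h = 1469598103934665603ULL;
    h = fnv1a(&t->src_ip,   sizeof t->src_ip,   h);
    h = fnv1a(&t->dst_ip,   sizeof t->dst_ip,   h);
    h = fnv1a(&t->src_port, sizeof t->src_port, h);
    h = fnv1a(&t->dst_port, sizeof t->dst_port, h);
    h = fnv1a(&t->proto,    sizeof t->proto,    h);
    h = fnv1a(&instance,    sizeof instance,    h);
    return h;
}

/*
 * Pick the live MB instance with the highest weight for this flow.
 * All packets of a flow hash to the same instance, and removing an
 * instance only remaps the flows that were pinned to it.
 */
static int pick_instance(const struct five_tuple *t, const int live[], int n)
{
    int best = live[0];
    uint64_t best_w = weight(t, live[0]);

    for (int i = 1; i < n; i++) {
        uint64_t w = weight(t, live[i]);
        if (w > best_w) {
            best_w = w;
            best = live[i];
        }
    }
    return best;
}

int main(void)
{
    struct five_tuple flow = { 0x0a000001, 0x0a000002, 12345, 80, 6 };
    int firewalls[] = { 101, 102, 103 };  /* live ids from the MB controller */

    printf("flow -> firewall instance %d\n",
           pick_instance(&flow, firewalls, 3));
    return 0;
}

Note that adding a new instance still remaps whichever existing flows the new instance now wins; that is exactly drawback 8 below.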

Drawbacks of the paper
1) The number of rules stored in pswitches does not scale very well. Each pswitch essentially stores rules for all the policies implemented in the data center. This will not scale if we move to a public cloud with multiple tenants.
2) Leaves MBs unmodified but modifies switches (so it cannot be adopted in its current form).
3) The paper does not talk much about how pswitches fit into a traditional data center topology (it briefly says that they'll replace layer-2 switches). The examples it gives are focused on simple line topologies. There is a disconnect here.
4) Uses a flooding-based L2-learning switch, which does not scale well. Broadcasts become a problem.
5) Extra SrcMacRewriter boxes for transparent MBs.
6) Some MBs are stateful (firewalls), so we need packets in both directions (forward and reverse) to go through the same MB instance. The pswitches must make sure this condition is met.
7) If the policy changes in the future, ensuring that all pswitches switch to the new policy simultaneously is not possible*. Some frames will violate the middlebox traversal guarantee (i.e., they might traverse neither the previous nor the new policy fully). This is an eventual consistency model and has security vulnerabilities. An end-to-end model would be better here: we just change the policy at the ends and get stronger consistency guarantees.
8) Since pswitches use consistent hashing, adding new MB instances of the same type causes re-assignment of some existing flows. This is bad for stateful MBs.

* It is possible to get the consistency guarantees even in this case but it requires essentially doubling the number of rules installed on the switches. (http://conferences.sigcomm.org/sigcomm/2012/paper/sigcomm/p323.pdf)

Tuesday, March 5, 2013

Frequently used svn commands

svn checkout URL

svn co --username <name> URL

svn status 

svn add

svn commit

See log/comments from a previous version

svn log -v -r 40

export EDITOR=vim

Tuesday, February 5, 2013

Testing LVS-TUN using VMWare Player


Here is the setup I used:
4 VMs, 1 client, 1 director and 2 Realservers
client has one NIC in NAT mode
director has 1 NICs in NAT mode
Realservers have one NIC each in NAT mode
The Director and Realservers need not be in the same Layer 2 domain. Install arptable using apt-get on realservers.

Director

eth0      Link encap:Ethernet  HWaddr 00:0c:29:07:96:cf  
          inet addr:192.168.25.135  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe07:96cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3379 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2201 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1295133 (1.2 MB)  TX bytes:233436 (233.4 KB)
          Interrupt:19 Base address:0x2000 

eth0:110  Link encap:Ethernet  HWaddr 00:0c:29:07:96:cf  
          inet addr:192.168.25.110  Bcast:192.168.25.110  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:19 Base address:0x2000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:168 (168.0 B)  TX bytes:168 (168.0 B)


Install ipvsadm using apt-get. Restart director.
sudo bash -c 'echo 0 > /proc/sys/net/ipv4/ip_forward'
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/conf/all/send_redirects'
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/conf/default/send_redirects'
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/conf/eth0/send_redirects'
sudo /sbin/ifconfig eth0:110 192.168.25.110 broadcast 192.168.25.110 netmask 255.255.255.255
sudo /sbin/route add -host 192.168.25.110 dev eth0:110

Then set up the load balancer
sudo /sbin/ipvsadm -C
sudo /sbin/ipvsadm -A -t 192.168.25.110:8080 -s rr
sudo /sbin/ipvsadm -a -t 192.168.25.110:8080 -r 192.168.25.131:8080 -i -w 1
sudo /sbin/ipvsadm -a -t 192.168.25.110:8080 -r 192.168.25.140:8080 -i -w 1

$ sudo /sbin/ipvsadm -l -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.25.110:8080 rr
  -> 192.168.25.131:8080          Tunnel   1      0          0         
  -> 192.168.25.140:8080          Tunnel   1      0          0         


Client 
eth0      Link encap:Ethernet  HWaddr 00:0c:29:d0:bc:7f  
          inet addr:192.168.25.128  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fed0:bc7f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2245 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1104 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1327858 (1.3 MB)  TX bytes:100896 (100.8 KB)
          Interrupt:19 Base address:0x2000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)


RealServer 1

eth0      Link encap:Ethernet  HWaddr 00:0c:29:ae:54:3c  
          inet addr:192.168.25.140  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:543c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2064 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2620 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:259049 (259.0 KB)  TX bytes:238162 (238.1 KB)
          Interrupt:19 Base address:0x2024 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tun1      Link encap:IPIP Tunnel  HWaddr   
          inet addr:192.168.25.110  P-t-P:192.168.25.110  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP  MTU:1480  Metric:1
          RX packets:48 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3352 (3.3 KB)  TX bytes:0 (0.0 B)


RealServer 2
eth0      Link encap:Ethernet  HWaddr 00:0c:29:b9:69:38  
          inet addr:192.168.25.131  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feb9:6938/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:299 errors:0 dropped:0 overruns:0 frame:0
          TX packets:226 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:27935 (27.9 KB)  TX bytes:31852 (31.8 KB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

tun1      Link encap:IPIP Tunnel  HWaddr   
          inet addr:192.168.25.110  P-t-P:192.168.25.110  Mask:255.255.255.255
          UP POINTOPOINT RUNNING NOARP  MTU:1480  Metric:1
          RX packets:48 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:3352 (3.3 KB)  TX bytes:0 (0.0 B)

Then configure the Realservers for tunneling to work properly
RealServer 1

$ sudo apt-get install ipip
$ sudo modprobe ipip
$ modprobe tun
$ sudo ip tunnel add tun1 mode ipip local 192.168.25.140
$ sudo ifconfig tun1 192.168.25.110 broadcast 192.168.25.110 netmask 255.255.255.255
$ sudo ifconfig tun1 up
$ sudo route add -host 192.168.25.110/32 dev tun1


Reverse path filtering (RPF) was introduced to support the Strong Send and Receive model, which most operating systems now enable by default. With RPF, the kernel drops an incoming packet if the route back to its source does not go out the interface on which the packet arrived. With tunneling, packets are received on the tunnel interface but replies leave via a different interface, so if RPF is enabled on the back-end servers, responses will never be delivered to clients. Hence, reverse path filtering must be disabled on the Realservers.
This is very important. Things will not work and you will pull your hair out for two days unless you do the following:

$ sudo bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
$ sudo bash -c 'echo 0 > /proc/sys/net/ipv4/conf/tun1/rp_filter'

$ sudo bash -c 'echo 0 > /proc/sys/net/ipv4/conf/default/rp_filter'
$ sudo bash -c 'echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter'


Handle ARP problem
$ sudo arptables -F
$ sudo arptables -A INPUT -d 192.168.25.110 -j DROP
$ sudo arptables -L -n
Chain INPUT (policy ACCEPT)
-j DROP -d 192.168.25.110 

Chain OUTPUT (policy ACCEPT)

Chain FORWARD (policy ACCEPT)

Do likewise for RealServer 2

Then start the web server on RealServer 1 and 2
bruce@ubuntu:~/webserver$ cat index.html 
<html>
<head>
<meta http-equiv="Pragma" content="no-cache">
<!-- Pragma content set to no-cache tells the browser not to cache the page
This may or may not work in IE -->

<meta http-equiv="expires" content="0">
<!-- Setting the page to expire at 0 means the page is immediately expired
Any values less than one will set the page to expire some time in the past and
not be cached. This may not work with Navigator -->
</head>
<title>Fake WWW server 1</title>
<body>
This is fake WWW server 1
</body>
</html>
bruce@ubuntu:~/webserver$ python -m SimpleHTTPServer 8080
or install a telnet server and use that instead.
$ sudo apt-get install telnetd
$ sudo /etc/init.d/openbsd-inetd restart

Now connect to the director from the client. I use lynx:

openflow@mininet-vm:~$ sudo ip neigh flush all
openflow@mininet-vm:~$ lynx -dump http://192.168.25.110:8080/
   This is fake WWW server 1

openflow@mininet-vm:~$ lynx -dump http://192.168.25.110:8080/
   This is fake WWW server 2


Wireshark capture on RealServer 1 shows IP Tunneling in progress

Capture on tun1

Capture on eth0



Sunday, February 3, 2013

Testing LVS-DR using VMWare Player

Here is the setup I used:
4 VMs: 1 client, 1 director and 2 Realservers
client has one NIC in NAT mode
director has one NIC in NAT mode
Realservers have one NIC each in NAT mode
The Director and Realservers need to be in the same Layer 2 domain. Install arptables using apt-get on the Realservers.

Director


eth0      Link encap:Ethernet  HWaddr 00:0c:29:07:96:cf  
          inet addr:192.168.25.135  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe07:96cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:3379 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2201 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1295133 (1.2 MB)  TX bytes:233436 (233.4 KB)
          Interrupt:19 Base address:0x2000 

eth0:110  Link encap:Ethernet  HWaddr 00:0c:29:07:96:cf  
          inet addr:192.168.25.110  Bcast:192.168.25.110  Mask:255.255.255.255
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          Interrupt:19 Base address:0x2000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:2 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:168 (168.0 B)  TX bytes:168 (168.0 B)


Install ipvsadm using apt-get. Restart director.
sudo bash -c 'echo 0 > /proc/sys/net/ipv4/ip_forward'
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/conf/all/send_redirects'
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/conf/default/send_redirects'
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/conf/eth0/send_redirects'
sudo /sbin/ifconfig eth0:110 192.168.25.110 broadcast 192.168.25.110 netmask 255.255.255.255
sudo /sbin/route add -host 192.168.25.110 dev eth0:110

Then set up the load balancer
sudo /sbin/ipvsadm -C
sudo /sbin/ipvsadm -A -t 192.168.25.110:8080 -s rr
sudo /sbin/ipvsadm -a -t 192.168.25.110:8080 -r 192.168.25.131:8080 -g -w 1
sudo /sbin/ipvsadm -a -t 192.168.25.110:8080 -r 192.168.25.140:8080 -g -w 1

$ sudo /sbin/ipvsadm -l -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.25.110:8080 rr
  -> 192.168.25.131:8080          Route   1      0          0         
  -> 192.168.25.140:8080          Route   1      0          0         


Client 
eth0      Link encap:Ethernet  HWaddr 00:0c:29:d0:bc:7f  
          inet addr:192.168.25.128  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fed0:bc7f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2245 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1104 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1327858 (1.3 MB)  TX bytes:100896 (100.8 KB)
          Interrupt:19 Base address:0x2000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)


RealServer 1
eth0      Link encap:Ethernet  HWaddr 00:0c:29:ae:54:3c  
          inet addr:192.168.25.140  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:543c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:223 errors:0 dropped:0 overruns:0 frame:0
          TX packets:350 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:31804 (31.8 KB)  TX bytes:36847 (36.8 KB)
          Interrupt:19 Base address:0x2024 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:16 errors:0 dropped:0 overruns:0 frame:0
          TX packets:16 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:1296 (1.2 KB)  TX bytes:1296 (1.2 KB)

lo:110    Link encap:Local Loopback  
          inet addr:192.168.25.110  Mask:255.255.255.255
          UP LOOPBACK RUNNING  MTU:16436  Metric:1


RealServer 2
eth0      Link encap:Ethernet  HWaddr 00:0c:29:b9:69:38  
          inet addr:192.168.25.131  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feb9:6938/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:299 errors:0 dropped:0 overruns:0 frame:0
          TX packets:226 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:27935 (27.9 KB)  TX bytes:31852 (31.8 KB)

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo:110    Link encap:Local Loopback  
          inet addr:192.168.25.110  Mask:255.255.255.255
          UP LOOPBACK RUNNING  MTU:16436  Metric:1


Then configure the Realservers for Direct Server Return to work properly
RealServer 1
$ sudo bash -c 'echo 0 > /proc/sys/net/ipv4/ip_forward'
$ sudo /sbin/ifconfig lo:110 192.168.25.110 broadcast 192.168.25.110 netmask 255.255.255.255 up
$ sudo route add -host 192.168.25.110 dev lo:110
$ sudo arptables -F
$ sudo arptables -A INPUT -d 192.168.25.110 -j DROP
$ sudo arptables -L -n
Chain INPUT (policy ACCEPT)
-j DROP -d 192.168.25.110 

Chain OUTPUT (policy ACCEPT)

Chain FORWARD (policy ACCEPT)

Do likewise for RealServer 2

Then start the web server on RealServer 1 and 2
bruce@ubuntu:~/webserver$ cat index.html 
<html>
<head>
<meta http-equiv="Pragma" content="no-cache">
<!-- Pragma content set to no-cache tells the browser not to cache the page
This may or may not work in IE -->

<meta http-equiv="expires" content="0">
<!-- Setting the page to expire at 0 means the page is immediately expired
Any values less than one will set the page to expire some time in the past and
not be cached. This may not work with Navigator -->
</head>
<title>Fake WWW server 1</title>
<body>
This is fake WWW server 1
</body>
</html>
bruce@ubuntu:~/webserver$ python -m SimpleHTTPServer 8080

Now connect to the director from the client. I use lynx:

openflow@mininet-vm:~$ sudo ip neigh flush all
openflow@mininet-vm:~$ lynx -dump http://192.168.25.110:8080/
   This is fake WWW server 1

openflow@mininet-vm:~$ lynx -dump http://192.168.25.110:8080/
   This is fake WWW server 2


Wireshark capture on RealServer 1 shows Direct Server Return in progress



Saturday, February 2, 2013

Testing LVS-NAT using VMWare Player

Here is the setup I used:
4 VMs: 1 client, 1 director and 2 Realservers
client has one NIC in NAT mode
director has 2 NICs - NAT and Host-only mode
Realservers have one NIC each in Host-only mode

Director

eth0      Link encap:Ethernet  HWaddr 00:0c:29:07:96:cf
          inet addr:192.168.25.135  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe07:96cf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:45 errors:0 dropped:0 overruns:0 frame:0
          TX packets:145 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5654 (5.6 KB)  TX bytes:19814 (19.8 KB)
          Interrupt:19 Base address:0x2000

eth1      Link encap:Ethernet  HWaddr 00:0c:29:07:96:d9
          inet addr:192.168.149.140  Bcast:192.168.149.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fe07:96d9/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:241 errors:0 dropped:0 overruns:0 frame:0
          TX packets:414 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:34918 (34.9 KB)  TX bytes:46641 (46.6 KB)
          Interrupt:19 Base address:0x2080

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

Install ipvsadm using apt-get. Restart director.
Then set up the load balancer
sudo bash -c 'echo 1 > /proc/sys/net/ipv4/ip_forward'
sudo bash -c 'echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects'
sudo bash -c 'echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects'
sudo bash -c 'echo 0 > /proc/sys/net/ipv4/conf/eth1/send_redirects'
sudo /sbin/ipvsadm -C
sudo /sbin/ipvsadm -A -t 192.168.25.135:8080 -s rr
sudo /sbin/ipvsadm -a -t 192.168.25.135:8080 -r 192.168.149.139:8080 -m -w 1
sudo /sbin/ipvsadm -a -t 192.168.25.135:8080 -r 192.168.149.138:8080 -m -w 1

Client 
eth0      Link encap:Ethernet  HWaddr 00:0c:29:d0:bc:7f  
          inet addr:192.168.25.128  Bcast:192.168.25.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:fed0:bc7f/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2245 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1104 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1327858 (1.3 MB)  TX bytes:100896 (100.8 KB)
          Interrupt:19 Base address:0x2000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)


RealServer 1
eth0      Link encap:Ethernet  HWaddr 00:0c:29:ae:54:3c  
          inet addr:192.168.149.138  Bcast:192.168.149.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:543c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:226 errors:0 dropped:0 overruns:0 frame:0
          TX packets:130 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:25900 (25.9 KB)  TX bytes:16943 (16.9 KB)
          Interrupt:19 Base address:0x2024 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:10 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:954 (954.0 B)  TX bytes:954 (954.0 B)

RealServer 2
eth0      Link encap:Ethernet  HWaddr 00:0c:29:ae:54:3c  
          inet addr:192.168.149.139  Bcast:192.168.149.255  Mask:255.255.255.0
          inet6 addr: fe80::20c:29ff:feae:543c/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:226 errors:0 dropped:0 overruns:0 frame:0
          TX packets:130 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:25900 (25.9 KB)  TX bytes:16943 (16.9 KB)
          Interrupt:19 Base address:0x2024 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:10 errors:0 dropped:0 overruns:0 frame:0
          TX packets:10 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:954 (954.0 B)  TX bytes:954 (954.0 B)

Then set up the default route on the Realservers for NAT to work properly
RealServer 1
bruce@ubuntu:~$ sudo /sbin/route add default gw 192.168.149.140
bruce@ubuntu:~$ route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.149.140 0.0.0.0         UG    0      0        0 eth0
192.168.149.0   0.0.0.0         255.255.255.0   U     1      0        0 eth0

bruce@ubuntu:~$ ping -c 1 192.168.149.140
PING 192.168.149.140 (192.168.149.140) 56(84) bytes of data.
64 bytes from 192.168.149.140: icmp_req=1 ttl=64 time=3.15 ms

--- 192.168.149.140 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 3.155/3.155/3.155/0.000 ms
bruce@ubuntu:~$ ping -c 1 192.168.25.135
PING 192.168.25.135 (192.168.25.135) 56(84) bytes of data.
64 bytes from 192.168.25.135: icmp_req=1 ttl=64 time=0.568 ms

--- 192.168.25.135 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.568/0.568/0.568/0.000 ms

dushyant@ubuntu:~$ sudo bash -c 'echo 0 > /proc/sys/net/ipv4/ip_forward'
dushyant@ubuntu:~$ cat  /proc/sys/net/ipv4/ip_forward
0

Do likewise for RealServer 2
Then start the web server on RealServer 1 and 2
bruce@ubuntu:~/webserver$ cat index.html 
<html>
<head>
<meta http-equiv="Pragma" content="no-cache">
<!-- Pragma content set to no-cache tells the browser not to cache the page
This may or may not work in IE -->

<meta http-equiv="expires" content="0">
<!-- Setting the page to expire at 0 means the page is immediately expired
Any values less than one will set the page to expire some time in the past and
not be cached. This may not work with Navigator -->
</head>
<title>Fake WWW server 1</title>
<body>
This is fake WWW server 1
</body>
</html>
bruce@ubuntu:~/webserver$ python -m SimpleHTTPServer 8080
OR
bruce@ubuntu:~/webserver$ while true ; do nc -l 8080  < index.html ; done

Now connect to the director from the client. I use lynx:

$ lynx -dump http://192.168.25.135:8080/
   This is fake WWW server 1

$ lynx -dump http://192.168.25.135:8080/
   This is fake WWW server 2

See on director:
bruce@ubuntu:~$ sudo /sbin/ipvsadm -l --stats
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port               Conns   InPkts  OutPkts  InBytes OutBytes
  -> RemoteAddress:Port
TCP  ubuntu-2.local:http-alt             4       24       20     2084     3828
  -> ubuntu.local:http-alt               2       12       10     1042     1914
  -> 192.168.149.139:http-alt            2       12       10     1042     1914

bruce@ubuntu:~$ sudo /sbin/ipvsadm -l --rate
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port                 CPS    InPPS   OutPPS    InBPS   OutBPS
  -> RemoteAddress:Port
TCP  ubuntu-2.local:http-alt             0        0        0        2        7
  -> ubuntu.local:http-alt               0        0        0        0        1
  -> 192.168.149.139:http-alt            0        0        0        2        6


Wireshark capture on RealServer 1 shows that the director uses destination NAT.