Building a Xen Cluster

The goal: Make a manageable cluster of machines work together to provide 99.999% availability for a set of virtual machines in the fastest way possible with current cheap commodity hardware. To this end, I've put a bit of energy into building a simple Xen cluster. This whitepaper is an attempt to document the effort.

Xen is a hypervisor. Think of it as a microkernel done right. There are Linux, NetBSD, and even OpenSolaris ports that run under the Xen hypervisor. The "host" machine is Domain 0 (Dom0), which is responsible for talking to the hardware on the box and for configuring and booting the Domain User (DomU) slices. Don't be confused by Dom0, however; the Xen hypervisor is the magician behind the scenes making this possible.

Xen 3.0 has migration features: you can move a Xen DomU instance between physical Xen servers. To do this, however, you need a shared storage system, or some method of NAS/SAN visible to all nodes in the cluster.

RedHat has a wonderful clustering platform with native clustered support for LVM2. Instead of GNBD, however, I've decided to use ATA-over-Ethernet for simplicity and speed. With this, we have a clusterable group of machines that share a common storage namespace (and can access each other's storage directly via the network), permitting native Xen domain migration.

The following guides formed the basis of the above decision:

- [RedHat Clustering Infrastructure](http://gfs.wikidev.net/Installation)
- [Debian and Xen](http://www.cl.cam.ac.uk/Research/SRG/netos/xen/readmes/user/user.html#SECTION03500000000000000000)
- [Xen on Sarge HOWTO](http://www.xmlvalidation.com/xen_howto_sarge.0.html)
- [Debian and OpenSSI under Xen](http://openssi.org/cgi-bin/view?page=docs2/1.9/debian/xen-howto.txt)
- [Xen + OpenSSI == XXen](http://openssi.org/cgi-bin/view?page=docs2/1.9/debian/xen-howto.txt)

Step 0 - Hardware and Network

When building any Linux cluster, the first step is laying out the topology and shared storage.
To keep things simple, fast, and cheap, [ATA over Ethernet (AoE)](http://www.coraid.com/support/linux/) is really the best solution available at the moment. For simplicity, each server in the cluster will be given two network interfaces: an "internal" protected storage network, and an "external" firewalled public network.

Step 1 - Building the base systems

I use a Debian-based distro that I maintain in-house with an extensive hand-maintained repository of backports. The auto-install platform is roughly based on the [SystemImager](http://www.systemimager.org/) package, only heavily hacked to simplify maintenance and unify the install script across all of our builds in a flexible way (I hope to open-source it here some day).

I strongly recommend that you keep root (/), /usr, and /var on filesystems that are _NOT_ encapsulated with LVM. You will understand why later. This would be a slightly different layout than our standard NKS setup:

    /dev/md0 - RAID1  - root (/) (1G)
    /dev/md1 - RAID10 - /usr (4G)
    /dev/md2 - RAID10 - /var (16G)
    /dev/md3 - RAID10 - everything else

You can do the following manually with a [Knoppix](http://www.knoppix.org) CD if you really want to. On a 4 drive Parallel ATA (PATA) setup you can generate the above using:

    $ cat - <

    grub> root (hd0,0)
    grub> setup (hd0)
    grub> setup (hd1)
    grub> setup (hd2)
    grub> setup (hd3)

Edit your /target/boot/grub/menu.lst so that it points to the kernel. Now you should have a bootable system. Unmount the filesystems mounted under /target and reboot.

You should now be running a base install of a Linux distribution on your server that boots with grub and has an unused md storage device that spans the majority of free space (/dev/md3). Xen requires the former, and lvm2/aoe will require the latter.
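The four-array layout from Step 1 can be sketched as a dry run that only prints commands. This is a minimal sketch under loud assumptions, not the original install script: it assumes the arrays are built with mdadm, that each drive carries four partitions (partition N of every drive joins md(N-1)), and that the four PATA drives are named hda, hdc, hde, and hdg. The function name plan_md_layout is hypothetical.

```sh
#!/bin/sh
# Dry run only: print mdadm commands for the md0..md3 layout; nothing is
# written to disk. ASSUMPTIONS: mdadm is the tool used, and partition N
# of every drive joins array md(N-1). Drive names are arguments.
plan_md_layout() {
    n=$#
    for idx in 0 1 2 3; do
        part=$((idx + 1))
        # md0 (root) is RAID1, the others RAID10, as in the layout above.
        if [ "$idx" -eq 0 ]; then level=1; else level=10; fi
        members=""
        for drive in "$@"; do
            members="$members /dev/${drive}${part}"
        done
        echo "mdadm --create /dev/md${idx} --level=${level} --raid-devices=${n}${members}"
    done
}

# Hypothetical drive names for a 4 drive PATA box.
plan_md_layout hda hdc hde hdg
```

Reviewing the printed commands before running any of them keeps a typo in a drive name from destroying the wrong disk.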
Step 2 - Grab the Xen and Cluster components

To build the source below, we will need a compiler:

    # apt-get install gcc-3.4-dev libc6-dev

Xen has a few dependencies:

    # apt-get install bridge-utils hotplug iproute python2.3-dev zlib1g-dev

If you want to build the documentation as well, you'll need a few more (tetex, "ps2pdf" from gs-common, "fig2dev" from transfig, and a recent version of perl with a pod2man that supports the --name option):

    # apt-get install tetex gs-common transfig perl

Now, grab the Xen "unstable" release and extract it. This includes a 2.6.12 kernel, which is required by the RedHat cluster tools (which we will discuss below).

    # wget http://www.cl.cam.ac.uk/Research/SRG/netos/xen/downloads/xen-unstable-src.tgz
    # tar xvzf xen-unstable-src.tgz

Now, build the userspace Xen tools and an initial Dom0 kernel (we will rebuild it in the next step, so don't worry too much about the .config file right now):

    # cd xen-unstable
    # make dist
    (everything builds)
    # ./install.sh
    Installing Xen from './dist/install' to '/'...
    All done.
    Checking to see whether prerequisite tools are installed...
    Xen CHECK-INSTALL Wed Nov 23 22:46:09 EST 2005
    Checking check_brctl: OK
    Checking check_hotplug: OK
    Checking check_iproute: OK
    Checking check_python: OK
    Checking check_zlib_lib: OK
    All done.
    # make install

Other bits that probably aren't required anymore:

Now we're done with the Xen kernel and userspace tools. Let's move on to the RedHat cluster tools, which we will build against the Xen Dom0 kernel.

The stable RedHat cluster tools can be grabbed via CVS:

    # cvs -d :pserver:cvs@sources.redhat.com:/cvs/cluster login cvs
    Password: {enter "cvs"}
    # cvs -d :pserver:cvs@sources.redhat.com:/cvs/cluster checkout -r STABLE cluster

When we build the cluster tools, we want to point the build at the source tree for the Xen Dom0 kernel so that it builds the appropriate kernel modules.
First, some dependencies:

    # apt-get install libxml2-dev

Then a small fix to get around the fact that glibc 2.2 doesn't have an ifaddrs.h or getifaddrs()/freeifaddrs():

    # cat > /usr/include/ifaddrs.h <

Then we build:

    # cd cluster
    # ./configure --kernel_src=`pwd`/../xen-unstable/linux-2.6.12-xen0
    # make install

Now the software is ready. Both the Xen tools and the RedHat cluster tools are installed, and the Xen hypervisor and Dom0 kernel are built with the RedHat cluster kernel modules.

Step 3 - Reconfigure and rebuild your kernel.

You will need to change the Xen0 2.6.12 kernel so that it builds with devmapper (dm) support and ATA over Ethernet (AoE):

    # cd xen-unstable/linux-2.6.12-xen0
    # make ARCH=xen menuconfig clean bzImage modules
    # cp -f arch/i386/boot/bzImage /boot/vmlinuz-2.6.12.6-xen0
    # cp -f System.map /boot/System.map-2.6.12.6-xen0
    # cp -f .config /boot/config-2.6.12.6-xen0

Once you're done rebuilding and preparing to install your kernel, you will also need to rebuild the "dlm" and "cman" kernel modules. From the directory that holds both source trees:

    # (cd cluster ; ./configure --kernel_src=`pwd`/../xen-unstable/linux-2.6.12-xen0)
    # make -C cluster/ install

You will also need to add a boot menu option for this Xen kernel using the Xen 3.0 hypervisor:

    # vi /boot/grub/menu.lst

Add a section like so:

    title Xen 3.0 / XenLinux 2.6.12.6
    kernel /boot/xen-3.0.gz dom0_mem=256000 console=vga apic_verbosity=verbose noapic
    module /boot/vmlinuz-2.6.12.6-xen0 root=/dev/md0 noapic ro console=tty0

Note: this is why we don't use lilo. Getting lilo to work with command line arguments for both kernel (append=) and module (initrd=) is only the beginning of the pain. Use grub. Be happy.

You are now ready to reboot with a cluster-ready Xen kernel.

Step 4 - Build aoe-tools and vblade.

Step 5 - Start a vblade on each server.

On a cluster server, the goal is to share storage with other nodes in the cluster. Each cluster server node is going to share the entire /dev/md3 stripe as a single large block device to the other clvm'ed nodes.
Each "shared" cluster stripe will be defined as an AoE shelf/slot.

    # vblade 0 0 eth1 /dev/md3

This exports /dev/md3 over the eth1 network interface as the device "/dev/etherd/e0.0", visible to the cluster nodes on the shared private storage network. Only the other nodes will see this device; locally, you must continue to reference it as /dev/md3. LVM2 will automagically scan this device and include it when re-assembling the cluster volume group on boot.

For production use, since vblade doesn't fork, the easiest way to keep vblade running is to add it to inittab with the respawn action.

On node0:

    # echo "e0:2:respawn:/usr/sbin/vblade 0 0 eth1 /dev/md3" >> /etc/inittab
    # init q

On node1:

    # echo "e1:2:respawn:/usr/sbin/vblade 0 1 eth1 /dev/md3" >> /etc/inittab
    # init q

You should see output from the vblade starting appear in /var/log/daemon. On the other node, aoe-discover and aoe-stat should show the device:

    # aoe-interfaces eth1
    # aoe-discover
    # aoe-stat
    e0.0 306.440GB eth1 up

Note: as this is at the end of /etc/inittab, and running in runlevel 2, the rc2 script will need to finish before init starts respawning vblade. To expose the AoE device to the network before this point (if you really must), just put this line _before_ the rc2 line in /etc/inittab.

Step 7 - Get the cluster running.

Step 7a - Create the cluster config:

    # vi /etc/cluster/cluster.conf

This is an example 2 node configuration, with manual fencing:

Step 7b - Start ccsd

The ccs daemon keeps the configuration in sync between cluster nodes.

    # /etc/init.d/ccsd start

Step 7c - Join the cluster with cman

The cman kernel module is the cluster manager. It uses dlm locking and a heartbeat thread to form a quorum of nodes that are part of the cluster.

    # cman_tool join

This will join, or create, a cluster.

Step 8 - Get lvm running in a cluster configuration.

lvm2 is an entirely userspace abstraction that uses the devmapper kernel module to present volumes carved out of physical block device space.
lvm2 has a cluster manager called "clvmd" that registers with cman and communicates with the other cluster nodes so that lvm2 can operate in a cluster configuration. With clvmd, lvm2 becomes a cluster-wide naming system for volumes carved up out of network exposed block devices, and a locking engine for the same.

    # apt-get install lvm2

Or build from CVS:

    # cvs -d :pserver:cvs@sources.redhat.com:/cvs/lvm2 login cvs
    # cvs -d :pserver:cvs@sources.redhat.com:/cvs/lvm2 checkout LVM2
    # cd LVM2 ; ./configure --with-clvmd=cman --with-confdir=/etc/lvm --prefix=/usr && make && make install

After the cluster is configured and running ("ccsd" and "cman"), and lvm2 is installed, we need to edit /etc/lvm/lvm.conf to make this a cluster-aware setup.

    # vi /etc/lvm/lvm.conf

In devices {}, add:

    filter = [ "a|/dev/etherd/*|" ]
    types = [ "aoe", 1024 ]
    sysfs_scan = 0

In global {}, comment out:

    # locking_type = 1

Just below that, in global {}, uncomment or add:

    locking_library = "liblvm2clusterlock.so"
    locking_type = 2
    library_dir = "/lib/lvm2"

Then save, and start up clvmd (make sure cman is running first, and the node is part of the cluster):

    # clvmd &

You can now scan for volume groups:

    # vgscan

NOTE: lvm2 does _not_ scan AoE devices by default. In fact, if you have sysfs scanning enabled it will _not_ find AoE devices at all, even if you add a filter that matches them. Moreover, lvm2 will only find AoE devices with the major number listed in /proc/devices:

    # grep aoe /proc/devices
    152 aoechr
    152 aoe

This means that _all_ of the AoE devices you wish to scan must have a major number of 152. If you look at /dev/etherd, you will see 16 "partition" devices for each shelf/slot device by default. With 16 partitions per device, and because AoE assigns minor numbers linearly, the crossover to major 153 happens just after "e1.5p14". This means that lvm2 can really only see all of one shelf and part of a second (a maximum of 16 devices, which is not good for a large cluster of more than 16 nodes).
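The arithmetic behind the 16-device limit can be sketched in a few lines of shell. This is only an illustration under two assumptions that are not taken from the driver source: 256 minor numbers per major, and 10 slots per shelf. Both were chosen because they reproduce the crossover inside shelf 1 described above. The function names are hypothetical.

```sh
#!/bin/sh
# ASSUMPTIONS (not read from the aoe driver): 256 minors per major and
# 10 slots per shelf. They reproduce the limit described in the text.
MINORS_PER_MAJOR=256
SLOTS_PER_SHELF=10

# Print how many shelf/slot devices fit under one major number when each
# device reserves $1 minor numbers (one per "partition" node).
devices_per_major() {
    echo $(( MINORS_PER_MAJOR / $1 ))
}

# Print the first minor number of shelf $1, slot $2, with $3 partitions
# reserved per device (minors are assigned linearly).
first_minor() {
    echo $(( ($1 * SLOTS_PER_SHELF + $2) * $3 ))
}

devices_per_major 16   # prints 16
devices_per_major 1    # prints 256
first_minor 1 5 16     # prints 240: e1.5 still fits below 256
first_minor 1 6 16     # prints 256: e1.6 no longer fits under the same major
```

Under these assumptions, reserving fewer minors per device is the only way to fit more shelf/slot devices under the single major number that lvm2 scans.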
The "fix" is to edit drivers/block/aoe/aoe.h in your kernel source and replace "AOE_PARTITIONS 16" with "AOE_PARTITIONS 1":

    # perl -pi -e 's/(AOE_PARTITIONS 1)6/$1/g' drivers/block/aoe/aoe.h

Alternatively, set AOE_PARTITIONS=1 when building your kernel:

    # make ARCH=xen AOE_PARTITIONS=1 oldconfig clean bzImage modules modules_install

Rebuild your kernel, then re-generate your /dev/etherd devices using the n_partitions variable:

    # n_partitions=1 aoe-mkdevs /dev/etherd

This really fixes the problem, and lvm2 can scan all of the AoE shelf/slot devices!

Step 9 - Creating Physical Volumes, Volume Groups, and Volumes

First, we create a Physical Volume for the local RAID10 stripe, then for the remote RAID10 stripe via AoE:

    # pvcreate /dev/md3
    # pvcreate /dev/etherd/e0.1

Next, we create a Volume Group that contains both Physical Volumes:

    # vgcreate vg /dev/md3 /dev/etherd/e0.1

Step 10 - Set up some GFS volumes - http://gfs.wikidev.net/Installation

Step 10a - Get fenced running

    # fence_tool join

Step 10b - Create the gfs filesystem:

    # gfs_mkfs -p lock_dlm -t {ClusterName}:{FSName} -j {Journals} {BlockDevice}

    {ClusterName} must match the cluster name used in the CCS config
    {FSName}      is a unique name chosen now to distinguish this fs from others
    {Journals}    is the number of journals in the fs, one for each node that will mount it
    {BlockDevice} is a block device

On smart:

    # lvcreate -n shared_smart -L 10G vg /dev/md3
    # lvcreate -n shared_stupid -L 10G vg /dev/etherd/e0.1
    # gfs_mkfs -p lock_dlm -t blenke:shared_smart -j 2 /dev/vg/shared_smart
    # gfs_mkfs -p lock_dlm -t blenke:shared_stupid -j 2 /dev/vg/shared_stupid

On both:

    # mkdir -p /shared/smart /shared/stupid
    # mount -t gfs /dev/vg/shared_smart /shared/smart
    # mount -t gfs /dev/vg/shared_stupid /shared/stupid

Remember: GFS filesystems, while accessible by both nodes, ARE NOT MIRRORED. You create the GFS filesystem on a shared block device. If the block device happens to live on one server or the other, then when that server is rebooted, the other nodes will be unable to access that filesystem. At the moment, there is no solution to this.
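The gfs_mkfs invocation from Step 10b can be wrapped in a small helper that makes the two rules explicit: the lock table is the cluster name joined to a filesystem name with a colon, and there is one journal per node that will mount the filesystem. This is a minimal sketch that only prints the command; the function name gfs_mkfs_cmd and the filesystem name shared_example are hypothetical, and only flags already used in this document appear.

```sh
#!/bin/sh
# Print (do not run) a gfs_mkfs command line for a lock_dlm filesystem.
# Arguments: cluster name, filesystem name, number of nodes that will
# mount it, block device.
gfs_mkfs_cmd() {
    cluster=$1
    fsname=$2
    nodes=$3
    device=$4
    # Lock table is cluster:fsname; one journal per mounting node.
    echo "gfs_mkfs -p lock_dlm -t ${cluster}:${fsname} -j ${nodes} ${device}"
}

# Hypothetical filesystem on the "vg" volume group, mounted by 2 nodes.
gfs_mkfs_cmd blenke shared_example 2 /dev/vg/shared_example
```

If a third node joins the cluster later, the journal count is the number that has to change, which is why it is passed as the node count here.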
For cluster mirroring, look for dm-mirror and the lvcreate -m option. The dm-mirror kernel module is made up of dm-raid1 and dm-log, and is being worked on by RedHat right now for RHEL4; see [LVM2 Mirroring](http://developer.osdl.org/dev/clusters/docs/cluster_summit_mirror_paper.pdf). Currently only pvmove and lvcreate -m use this kernel module (if you have a recent lvm2 build), and you're really on your own.

If you have a cluster of 3 or more nodes (3 or more PVs in the cluster VG), you can create a mirrored volume. One PV will get one half of the mirror, one PV will get the other half of the mirror, and one PV will get the mirror log volume.

    # lvcreate -m 1 -n mirror1 --alloc anywhere -L 4G vg
    Logical volume "mirror1" created
    # lvscan
    ACTIVE '/dev/vg/mirror1' [4.00 GB] anywhere
    ACTIVE '/dev/vg/mirror1_mlog' [4.00 MB] anywhere
    ACTIVE '/dev/vg/mirror1_mimage_0' [4.00 GB] inherit
    ACTIVE '/dev/vg/mirror1_mimage_1' [4.00 GB] inherit

Step 11: Create a XenU domain.

Copy on Write (CoW)

First off, let's decide how we're going to build our filesystems. While there is Copy on Write (CoW) support (LVM writable persistent snapshots), it isn't 100% reliable yet, and doesn't handle out-of-space conditions very well. For that reason, I am going to avoid using it here, but this is what it would look like.

Creating the "virgin" backing store volume:

    # lvcreate -n virgin -L 4G vg
    # mkfs -t xfs /dev/vg/virgin
    # mount /dev/vg/virgin /mnt
    # debootstrap sarge /mnt http://source.rfc822.org/debian
    # vi /mnt/etc/fstab
    # umount /mnt

Creating a clone filesystem:

    # lvcreate -s -n myclonedisk1 -L 1G /dev/vg/virgin

This new volume ("myclonedisk1") can handle up to 1G of "block differences" before it runs out of space. You will therefore need to periodically grow the block device depending on the space remaining:

    # lvextend -L +1G /dev/vg/myclonedisk1

Can you see the danger here?
For each clone disk snapshot, you will need to monitor the space used to see if enough space remains, and grow it whenever the space used approaches some kind of threshold. If something goes crazy and rapidly makes changes to a filesystem, a monitoring script in dom0 may not catch the change in time, and you may end up with a fatally corrupted volume. For this reason, I am avoiding it.

XenU RAID1 vs dm-mirror

Rather than use the somewhat experimental dm-mirror support for mirrored volumes, we're going to leave the mirroring up to the XenU domains to do themselves. Let's create a domain that runs on "smart", the first cluster node:

Step 11a: Create some volumes.

    # lvcreate -n blenke-web-00_mirror0 -L 4G vg /dev/md3
    # lvcreate -n blenke-web-00_mirror1 -L 4G vg /dev/etherd/e0.1

Step 11b: Fill the primary volume:

    # mkfs -t xfs /dev/vg/blenke-web-00_mirror0
    # mount /dev/vg/blenke-web-00_mirror0 /mnt
    # debootstrap sarge /mnt http://source.rfc822.org/debian
    # vi /mnt/etc/fstab
    # echo blenke-web-00 > /mnt/etc/hostname
    # echo blenke-web-00 > /mnt/etc/mailname
    # umount /mnt

Step 11c: Create the xen config:

    # cat - > /etc/xen/auto/blenke-web-00 <<EOF
    kernel = "/boot/vmlinuz-2.6-xenU"
    memory = 64
    cpu = -1    # Xen should allocate a proc to run on.
    vcpus = 1   # We only want 1 CPU for this domain (Xen 3.0 SMP!)
    name = "blenke-web-00"
    nics = 1
    vif = [ 'mac=aa:00:0a:00:00:0a, bridge=xenbr0' ]
    ip = "10.0.0.10"
    disk = [ 'phy:vg/blenke-web-00_mirror0,sda1,w', 'phy:vg/blenke-web-00_mirror1,sda2,w' ]
    root = "/dev/md0 ro"
    EOF

(MORE STUFF HERE)

Step 13: Write operational scripts to maintain your cluster:

Xen Daemon start

    # xend start

Xen Daemon stop

    # xend stop

Xen Daemon restart

    # xend restart

Xen Daemon status

    # xend status

Xen help

    # xm help

Xen list

    # xm list

Xen list consoles

    # xm consoles

Xen attach to console

    # xm console mydomain

Xen suspend-to-file

    # xm save mydomain mydomain.xen

Xen restore-from-file

    # xm restore mydomain mydomain.xen

Xen live migration

    # xm migrate --live mydomain destination.ournetwork.com

Xen memory resize

    # xm set-mem mydomain 32

(MORE STUFF HERE)
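As one example of such an operational script, the live migration command above can be combined with "xm list" to plan the evacuation of a node before it is rebooted. This is a minimal sketch, not a tested production tool: it assumes that the first line of "xm list" output is a header and that the first column is the domain name. The function name plan_evacuation and the sample listing are hypothetical. It only prints commands.

```sh
#!/bin/sh
# Read `xm list` output on stdin and print one live-migration command per
# guest domain. ASSUMPTIONS: line 1 is a header and column 1 is the
# domain name. Domain-0 is the host itself and is skipped.
plan_evacuation() {
    dest=$1
    awk -v dest="$dest" 'NR > 1 && NF > 0 && $1 != "Domain-0" {
        print "xm migrate --live " $1 " " dest
    }'
}

# Dry run against a hypothetical captured listing.
plan_evacuation destination.ournetwork.com <<'EOF'
Name            ID  Mem  VCPUs  State  Time
Domain-0         0  251      1  r----  345.6
blenke-web-00    3   64      1  -b---   12.3
EOF
```

On a real Dom0 the listing would come from "xm list" piped into plan_evacuation, and the printed commands can be reviewed before being fed to sh.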