The Nimbus boot process can be divided broadly into three stages. The first stage varies by the type of system, while stages two and three are the same for all Nimbus clusters. The stages are reviewed here in order.
Stage 1 begins when the node is powered on. In systems with a standard BIOS, stage 1 is a PXE boot; in LinuxBIOS systems, Etherboot is used instead.
PXE is a standardized network boot method supported by many modern motherboards with built-in Ethernet, as well as by a number of higher-end Ethernet cards through their option ROMs. Since this support is generally limited to Fast Ethernet, Nimbus clusters that combine Gigabit Ethernet with PXE boot generally use a 'boot net' for PXE booting, consisting of the nodes' built-in Fast Ethernet ports and a hub.
PXE boot begins by performing DHCP. The master runs a DHCP server daemon which will provide a temporary IP address on the boot net and instructions to load the pxelinux.0 binary boot loader. Please see IETF RFC 2131 for details.
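For illustration, a minimal dhcpd.conf fragment on the master might look like the following; the subnet, address range, and server address shown here are examples only, not the stock Nimbus values:

    subnet 192.168.2.0 netmask 255.255.255.0 {
        range 192.168.2.100 192.168.2.200;    # temporary addresses handed out for booting
        next-server 192.168.2.1;              # TFTP server address (the master)
        filename "pxelinux.0";                # boot loader to fetch, relative to the TFTP root
    }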
The boot ROM uses TFTP to fetch /tftpboot/pxelinux.0 from the master, then transfers control to it. Please see IETF RFC 1350 for details.
pxelinux first fetches its configuration from the /tftpboot/pxelinux.cfg directory on the master. It begins by attempting to load a file named for the assigned IP address expressed in hexadecimal (e.g. for 192.168.2.10 it tries C0A8020A). If that does not exist, it tries progressively less specific names by removing one hexadecimal digit at a time from the end (C0A8020, then C0A802, and so on down to C). If all of these fail, it finally loads a configuration file named default. In all Nimbus clusters to date, only default is used.
The configuration file specifies a kernel image to load, an initial ramdisk, and a command line for the kernel. In a Nimbus cluster, the kernel is /tftpboot/slave/bzImage and the ramdisk is /tftpboot/slave/initrd.
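A /tftpboot/pxelinux.cfg/default along these lines would match the description above. Paths are relative to the TFTP root (/tftpboot), and the APPEND line is only a placeholder for whatever kernel command line the slaves actually need:

    DEFAULT slave
    LABEL slave
        KERNEL slave/bzImage
        APPEND initrd=slave/initrd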
Finally, pxelinux uses TFTP to fetch the specified kernel and ramdisk, loads them into memory, and transfers control to the kernel.
In a LinuxBIOS system, Etherboot is built into the LinuxBIOS image on the flash chip. Being more specialized, Etherboot follows a somewhat simpler boot procedure. In addition, Etherboot support has been added for all Gigabit Ethernet cards used by LinuxLabs in Nimbus systems, so a separate boot net is unnecessary for Gigabit Ethernet clusters.
Like PXE, Etherboot begins by sending out a DHCP request, which is answered by dhcpd on the master. A temporary IP address in the 192.168.2.0/24 network is assigned for booting. In Gigabit systems, the master's Gigabit NIC is assigned 192.168.2.1 as an alias address. This is done for greater consistency between clusters and to keep the address space of the bproc/cluster system separate from the address space used for booting. One advantage is that a specialized dhcpd which understands node assignments is not needed.
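For example, the alias can be added with a standard ifconfig alias command; the interface name eth1 is an assumption and should be whichever interface carries the cluster network:

    ifconfig eth1:1 192.168.2.1 netmask 255.255.255.0 up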
In the Etherboot case, the DHCP response will also include instructions to load and execute /tftpboot/slave.elf. slave.elf is a combination of kernel and ramdisk packaged as a standalone ELF executable binary. This image is built with mkelfimage.
Etherboot uses TFTP to retrieve the ELF image from the master, loads it as specified in the binary headers, and transfers control to the kernel contained in the ELF.
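The load information Etherboot acts on is carried in the ELF program headers, which can be inspected on the master with readelf, for example:

    readelf -l /tftpboot/slave.elf    # list the entry point and program headers (load addresses and sizes)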
Stage 2 begins with normal kernel initialization and completes with the beoboot process, which runs in place of the /sbin/init of a typical UNIX system and performs its work in a series of steps.
The kernel decompresses and mounts the initial ramdisk as / in the usual manner.
The kernel runs /sbin/init from the ramdisk as process 1, as is typical of a UNIX system. The init provided to a Nimbus slave is actually the beoboot program from the Clustermatic system, an all-in-one program that performs several steps in order to make the node part of the cluster.
The ramdisk contains the file /config.boot. This file lists all included drivers and the PCI IDs they are associated with. It also contains explicit insmod commands. beoboot first performs any explicit insmods, then enumerates the PCI bus(es), insmodding any driver matching the PCI IDs found. These drivers must include any needed network drivers that are not built into the kernel. In Nimbus, Ethernet drivers are modularized so that a small variety of supported hardware can be handled without wasting memory on unneeded drivers.
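The matching step amounts to comparing the IDs listed in config.boot against the IDs the kernel reports for the devices it actually found. A minimal shell sketch of that logic, assuming a hypothetical pre-extracted "vendor device module" table (this is not the real config.boot syntax):

    # Rough sketch of the PCI matching logic only.
    while read vendor device module; do
        # /proc/bus/pci/devices reports each device's vendor and device IDs
        # as a single 8-hex-digit value in its second column.
        if awk '{print $2}' /proc/bus/pci/devices | grep -qi "^${vendor}${device}$"; then
            insmod "${module}"
        fi
    done < /tmp/pci-driver-table    # hypothetical table derived from /config.boot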
Once all drivers are loaded, init sends out RARP requests on all available Ethernet interfaces. bpmaster on the master responds to the RARP request on the primary network only (Gigabit or Fast Ethernet). It assigns an IP address based on a list of MAC addresses in /etc/beowulf/config. However, if the node is new to the cluster, bpmaster will not find it in the lookup and so will not respond.
New nodes are handled by rarp_catcher. rarp_catcher will see that the node's MAC address has no entry in the config and will add it to the end of the list. It then causes bpmaster to reload its configuration (by calling /etc/init.d/beowulf reload). Once the configuration is reloaded, bpmaster responds as usual.
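The MAC list in /etc/beowulf/config is simply a series of per-node entries, and rarp_catcher appends one for each new node. An illustrative fragment follows; the 'node' keyword and the addresses shown are assumptions, not taken from a real cluster:

    node 00:50:45:5C:00:01
    node 00:50:45:5C:00:02
    node 00:50:45:5C:00:03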
Once an IP address is established and assigned to the appropriate Ethernet interface, init forks and runs mon. mon is the local daemon portion of the Supermon system and is responsible for reporting the node's status when requested. Currently, this is the only helper daemon loaded.
Next, init loads the vmadump and bproc modules. These provide the kernel-side support for the bproc system.
init then runs bpslave, the user-space portion of the bproc system that runs on the slave nodes. bpslave is responsible for fulfilling requests for processes to migrate to the node, for maintaining the redirection of stdin, stdout, and stderr back to the master, and for maintaining the small flow of status information required by bpmaster (primarily responding to periodic heartbeat packets). bpslave then forms a TCP connection to bpmaster on the master node. Once that connection is established, the slave's status changes from down to booting and the boot moves to Stage 3.
Stage 3 is the final stage of a slave node boot. At this point, the master node is in control of the slave.
Once the status changes to booting, the master executes /usr/lib/beoboot/bin/node_up $NODE, where $NODE is the node number of the node that just connected. node_up is a shell script.
node_up is responsible for performing all of the steps needed to prepare the slave node for use. A fatal error in any of these steps stops the boot process and the node transitions to the error state. The steps are as follows.
Set the clock.
node_up runs /usr/lib/beoboot/bin/bdate $NODE.
Loopback
The loopback network interface (127.0.0.1) is configured.
proc
If /proc is not already mounted, it is mounted now.
Additional modules
Any additional modules that were not required to reach this point but will be needed on the running slave are loaded. These may include specialized hardware drivers, the real-time clock driver, filesystem drivers not needed for the initrd, and so on.
fstab
All default filesystems listed in /etc/beowulf/fstab are now mounted. The specified root filesystem is mounted at /rootfs, and all others are mounted beneath it. Mount points are created if they do not already exist.
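For illustration only, the file follows the familiar fstab(5) layout; the devices, filesystem types, and options below are assumptions rather than stock Nimbus entries:

    # device      mountpoint    type     options           dump pass
    /dev/ram1     /             ext2     defaults          0    0
    none          /proc         proc     defaults          0    0
    none          /dev/pts      devpts   gid=5,mode=620    0    0

The entry for / is the one that ends up mounted at /rootfs on the slave.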
Basic libraries
The basic libraries that will be needed on the slave are now copied over from the master.
Device special nodes
Any devices that will be needed are created in /rootfs/dev now using bpsh $NODE mknod /rootfs/dev....
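For example, a minimal set of device nodes could be created as follows; the particular devices chosen here are illustrative, with the standard Linux major/minor numbers:

    bpsh $NODE mknod /rootfs/dev/console c 5 1
    bpsh $NODE mknod /rootfs/dev/null    c 1 3
    bpsh $NODE mknod /rootfs/dev/zero    c 1 5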
/etc
/rootfs/etc is populated with timezone information, nsswitch.conf, and any other basic configuration files that will be used by the system libraries.
chroot
If all has gone well, the master commands the slave's bpslave daemon to chroot to /rootfs and continue normal operation from there. Once the chroot is performed, the initrd is discarded. Any processes already running are hidden from the process table (bpslave, mon, and other helpers become untouchable and do not show up in ps).
rcS
node_up now runs /usr/lib/beoboot/init.d/rcS $NODE. rcS is very much like the /etc/rc script of a normal system, except that it expects the node number as a parameter and runs most commands on the slave. rcS runs each setup script in /usr/lib/beoboot/init.d in order, passing it the parameters '$NODE start'. These scripts complete the node's setup.
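Conceptually, rcS reduces to a loop along these lines (a simplification, not the actual script):

    for script in /usr/lib/beoboot/init.d/*; do
        [ "${script##*/}" = "rcS" ] && continue   # skip rcS itself
        ${script} ${NODE} start                   # each setup script gets the node number and 'start'
    done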
Finished
Once all of this completes successfully, the node's status will transition from booting to up. The node is now ready for user programs.
Once booting completes or fails (i.e. the node is in the up or error state), a log of the boot process is available on the master in /var/log/beowulf/node.<n>, where <n> is the node number. This is the first place to look when diagnosing a node in the error state.
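For example, from the master (the node number 5 is arbitrary; bpstat is the standard bproc status utility):

    bpstat                          # show the current state of each node
    less /var/log/beowulf/node.5    # read the boot log for node 5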