Linux Labs
Beowulf Distribution
Codename "Nimbus"
Cluster Management Overview
Last revision: Wednesday, August 21, 2002, TRUSTY
- Architecture Overview
- Bpsh
- Boot
- Wakinyan Monitor
- Advice on Booting Behavior
- Other Important Configuration Files
- Other Information
- Architecture of Nimbus (vis-a-vis Scyld)
- The Beostat daemon has been replaced with the supermon utilities from LANL. This is a very lightweight, /proc-based system that uses virtually no system resources, in contrast to its rather resource-hungry predecessor.
- Wakinyan Monitor: A graphical monitor that both saves screen space and displays ambient temperature.
- 2.4.19 Linux kernel for current stability fixes and the latest feature set (e.g. hyper-threading support for Xeon-based clusters).
- bproc has been updated to the advanced LANL version with the following features:
- The unified process ID (PID) space is more complete; bproc system daemons are fully hidden once a node boots.
- OLD: when a process spawned on a slave node, it initialized first and was then issued a new PID
- NEW: all system processes disappear from view, and PIDs are global across all nodes
- Access control is now available on a node-by-node basis:
- User / Group / Other (i.e. chmod-style ugo permissions) on the slave nodes themselves
- A node's permissions are checked to decide whether a given user is eligible to run jobs on it.
- This is useful in a shared cluster where not everyone is allowed to use all the nodes (a hedged sketch follows below)
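- A sketch of inspecting and setting per-node access. The -u/-g/-m flags shown are an assumption about the LANL bpctl, and the mode format may differ; confirm against bpctl -h or the man page before relying on them:
    bpstat                                  # node status; ownership/permission columns, if present, show up here
    bpctl -S 5 -u alice -g hpcusers -m 110  # hypothetical: restrict node 5 to user alice and group hpcusers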
- rarpcatcher replaces beosetup
- The status info is put into /etc/beowulf/config
- This process runs on startup
- It always runs as a daemon, so when nodes are added on the fly it HUPs (restarts) the beowulf system and brings the new nodes in.
- NEW: node data is lost only if your filesystem is completely trashed, thanks to the ext3 filesystem. We have experienced ZERO corruption in extensive testing. This is in contrast to older versions of the software, which took a rather cavalier attitude toward node filesystem data.
- ALSO: together with the points above, this makes boot much cleaner
- NOTE: If one or more of your nodes has important data, issue a sync command before you power cycle (see the example below)
- REMEMBER: the only non-persistent data stored on nodes are the libraries and system files that are copied to the node at boot time.
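- A minimal example of flushing a node's disks before a power cycle (node number 4 is illustrative; the -a "all up nodes" flag is an assumption to confirm with bpsh -h):
    bpsh 4 sync     # flush filesystem buffers on slave node 4
    bpsh -a sync    # or flush every node that is currently up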
- Important Cluster utilities
- All commands accept a Node specification syntax
- bpsh is the primary user interface into bproc. It is a remote-shell style command, much like rsh, that allows you to issue commands across all nodes on the network, or to selected nodes, as described below (see the usage sketch after this list):
- bpsh <nodespec> command
- bpsh -h (help)
- bpsh -n : take stdin from /dev/null (no stdin forwarding), like rsh -n.
- bpsh accepts all rsh syntax; for example, to convert inscript, an rsh-based script, to bpsh, you could issue this and expect everything to work: sed -e "s/rsh/bpsh/g" < inscript > outscript
- man page
- bpsh run environment
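- A brief usage sketch (node numbers are illustrative; the -a flag and the full nodespec forms should be confirmed with bpsh -h or the man page):
    bpsh 0 uname -r                                # run a command on slave node 0
    bpsh 1 df -h /                                 # another single-node example
    bpsh -a date                                   # assumed flag for "all nodes that are up"
    sed -e "s/rsh/bpsh/g" < inscript > outscript   # convert an rsh-based script, as noted above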
- bpcp is a bproc equivalent to rcp.
- bpstat: display node status.
- bpctl: change node status (see the sketch below).
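- A short sketch of these three utilities (node numbers, paths, and the -S/-s usage are illustrative and should be checked against the man pages):
    bpstat                            # list each node and its current status
    bpcp /etc/hosts 2:/tmp/hosts      # copy a file to node 2, rcp-style
    bpctl -S 2 -s reboot              # ask node 2 to change state (here, reboot)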
- The master boots like a typical Red Hat system
- Slave Booting procedure and sequence.
- Supermon System
- Communicates over TCP/IP
- mon daemon
- supermon daemon
- lightweight
- Data format
- Lisp-like (see the illustration after this list)
- Human readable
- Extensible
- Kernel modules
- supermon_proc
- sensors
- mon embedded in beoboot
- libsexpr
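- Purely for illustration, the Lisp-like, human-readable format looks roughly like the s-expression below; the actual tags and fields emitted by mon/supermon will differ:
    (mon
      (cpuinfo (user 123) (system 45) (idle 9876))
      (meminfo (total 515604) (free 402112)))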
- Wakinyan monitor
- This program lives in /usr/bin/wakinyanmon (a launch example follows this list)
- Part of Supermon system.
- A GTK application.
- Node Status Display
- A horizontal yellow line means the node is down
- A diagonal yellow line means the node is booting
- A green check means the node is up
- A red X means the node has an error condition
- CPU load
- Disk load
- Memory used
- Swap
- Net
- Temperatures
- CPU 0
- CPU 1
- Northbridge
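- To launch it, run it from an X session on the master; no command-line options are assumed here:
    /usr/bin/wakinyanmon &    # requires a working DISPLAY on the master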
- Advice on booting behavior
- Are all the ports flashing on your GigE switch (if you have one)? This is GOOD! It means the ARPs are working.
- (NOTE: the most common error is a failed attempt to mount an unavailable NFS share)
- IMPORTANT NOTE: Booting a cluster always seems to take longer than it actually does. Don't despair! Just stand by a bit. Wait a minute. Get a cup of coffee. All is well, 99% of the time!
- Would you like to watch a node boot? This is also good for debugging nodes. You are going to monitor the Serial Console!
- Minicom on the master is ready to go. Run it (see the example after this list).
- The settings should already be ttyS0, 115200, 8N1, VT100
- Find your null modem cable. A null modem cable is shipped with every cluster.
- The leftmost serial port on the master plugs into the leftmost serial port on the target slave.
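- A minimal walk-through, assuming the pre-configured minicom profile described above:
    minicom                   # run on the master, attached via the null modem cable
    # power-cycle the target slave and watch its kernel and boot messages scroll by
    # exit minicom with Ctrl-A, then X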
- Other important Config files:
- /etc/beowulf/config.boot is the file of last resort; it gives a list of the PCI IDs and driver names
- Command "beoboot -p" this program grabs the kernel from /etc/beowulf/config and creates new images in the /tftpboot/slave boot directory
- If you have problems that cannot be solved with reboot or halt, toggle the power off and on manually.
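- A minimal sketch of regenerating the slave boot images after a kernel or config change (only the -p option comes from these notes; check the beoboot man page for anything else):
    beoboot -p      # rebuild the slave boot images from the kernel named in /etc/beowulf/config
    ls /tftpboot    # the fresh images land under the slave boot directory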
- Other info
- bpsh resolves the working directory for command execution strictly by canonical directory name; the rules are as follows (a worked example appears after this list):
- Are you in /home/sysadmin on the master?
- Does this exist on the slave?
- Then the process you are executing runs in this current working directory (cwd).
- Are you in /home? /home always exists as an NFS mount.
- Even when the directory is not an NFS mount, you will be in the same directory as long as it exists on the slave. For example, if you are in /scratch on the master, you will execute in /scratch on the slave (/scratch exists on all machines).
- If the slave has no directory with the same canonical path, your working directory on the slave will be /.
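- A worked example (node 0 and the /root/master-only path are illustrative):
    cd /home/sysadmin && bpsh 0 pwd      # prints /home/sysadmin (NFS-mounted on the slaves)
    cd /scratch && bpsh 0 pwd            # prints /scratch (local, but present on every machine)
    cd /root/master-only && bpsh 0 pwd   # directory absent on the slave, so the cwd falls back to /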
- Mirroring the Master to the secondary master.
- Failover procedure for secondary masters:
- Connect any RAID devices to the secondary master.
- Connect the external net connection of the master to eth0 on the secondary.
- Connect eth1 to the booting switch network (plus monitor, keyboard).
- Reboot.
- If necessary, hit the spacebar to skip PXE boot errors in this procedure.
- Some BIOSes require hitting F2 to turn off PXE in the BIOS boot menu and to make the hard disk the primary boot device.
- Want to run PVM? Simply run start-pvm, which launches PVM on all nodes for legacy apps.
- Want to run MPI ?
- The newer MPI (1.5) uses all_cpus=1 rather than "MPI=" for using all CPUs.
- Example: all_cpus=1 progname params (where progname is linked against MPI; see the sketch below)
- Note that the master is node -1
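- A minimal sketch (the program name and its argument are hypothetical; only the all_cpus=1 form comes from these notes):
    all_cpus=1 ./my_mpi_app input.dat    # run an MPI-linked binary on every CPU in the cluster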
- Want to run Lahey FORTRAN compiler?
- Lahey resides in /usr/local/lf95
- PGI is in /usr/pgi (pgcc, pgf90, etc.); FLEXlm licensing and environment variables are set by default to just work. See the docs for more info.
- Partitioning of nodes (see the check commands after this list):
- slave nodes: /dev/hda1 is a single filesystem
- /dev/hda2 is swap
- primary and secondary masters: /dev/hda3 (/), /dev/hda5 (/var), /dev/hda6 (/usr)
- Depending on your cluster specification, the primary is pre-setup to also be a slave node.
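- A quick way to confirm the layout above (node 0 is illustrative):
    bpsh 0 df -h /            # the single filesystem on a slave's /dev/hda1
    bpsh 0 cat /proc/swaps    # the swap partition on /dev/hda2
    df -h / /var /usr         # the three filesystems on a master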
- Rebuilding from source RPMs
- Supplemental materials:
- External resources