
Re: thanks and a long question



Greetings,

The reboot of the master is not actually necessary. Instead, you can just
do:

/etc/init.d/beowulf restart 
on the master and reboot the slaves. Note that the restart command will
crash any running jobs on the cluster (of course, so does rebooting the
master :-)

For item 4, it may be the size of the library at issue, or it may be
confused by the number of library paths. I have seen that before (in
particular w/ the Intel compiler libraries). It may be that I will need to
modify the node_up script to preload /usr/root/PRO/lib. I will be happy to
take care of that. 
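
To get a feel for how much the libraries list is asking each node to take, something like the following can be run on the master. It is only a sketch: config_library_dirs is a hypothetical helper name, and it assumes the libraries entries in /etc/beowulf/config are whitespace-separated paths on lines that begin with the word "libraries", as in the quoted message below.

```shell
#!/bin/sh
# Sketch: print every path named on a "libraries" line of the config file.
# config_library_dirs is a hypothetical helper, not a beowulf tool.
# Usage: config_library_dirs [config_file]
config_library_dirs() {
    config=${1:-/etc/beowulf/config}
    # Strip comments, then print each path that follows the "libraries" tag.
    sed -e 's/#.*//' "$config" |
        awk '$1 == "libraries" { for (i = 2; i <= NF; i++) print $i }'
}
```

Feeding each printed path to "du -sk" on the master then gives a rough size per directory, which may help tell a size problem from a path-count problem.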

Alternatively, placing the attached scripts into
/usr/lib/beoboot/bin (make sure to chmod +x the scripts) should cause the
nodes to preload the needed library and make sure they can find them.
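
The installation step might look like the sketch below. It assumes the two attachments were saved as setup_libs and node_up in the current directory. install_beoboot_scripts is a hypothetical helper name, and the .orig backup it keeps is an added precaution rather than a required part of the procedure.

```shell
#!/bin/sh
# Sketch: install the attached scripts where beoboot looks for them.
# install_beoboot_scripts is a hypothetical helper, not part of beoboot.
# Usage: install_beoboot_scripts [destdir] script...
install_beoboot_scripts() {
    dest=${1:-/usr/lib/beoboot/bin}
    [ "$#" -gt 0 ] && shift
    for script in "$@"; do
        name=`basename "$script"`
        # Keep the existing version around in case it needs to go back.
        if [ -f "$dest/$name" ] && [ ! -f "$dest/$name.orig" ]; then
            cp -p "$dest/$name" "$dest/$name.orig" || return 1
        fi
        cp "$script" "$dest/$name" || return 1
        chmod +x "$dest/$name" || return 1
    done
}
```

On the master this would be called as "install_beoboot_scripts /usr/lib/beoboot/bin setup_libs node_up", followed by the restart described above.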

The instructions for running X should not be necessary. I suppose since
the X libs are linked against, they get loaded even when the command
options say don't use X.
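
One way to see what the dynamic linker will ask for, whatever the -b option says, is to look at the ldd output for the binary on the master. The sketch below reads ldd output on standard input and prints each directory a library resolves from, which can then be compared against the libraries line in /etc/beowulf/config. lib_dirs is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: list the directories that a binary's shared libraries come from.
# lib_dirs is a hypothetical helper; it only parses ldd-style text on stdin.
lib_dirs() {
    # Lines of interest look like:
    #   libXpm.so.4 => /usr/X11R6/lib/libXpm.so.4 (0x...)
    # Unresolved entries ("=> not found") have no absolute path and are skipped.
    awk '$2 == "=>" && $3 ~ /^\// {
             path = $3
             sub("/[^/]*$", "", path)    # drop the file name, keep the directory
             if (path == "") path = "/"
             print path
         }' | sort -u
}
```

For the binary named in the error message quoted below, that would be "ldd /usr/root/PRO/bin/root.exe | lib_dirs".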

Hope the evening beer was good (he says over the half-pot sized cup of
morning coffee).

G'day,
sjames



On Wed, 6 Nov 2002, Gerard P. Gilfoyle wrote:

> hi steven,
> 
>    here is the latest. 
> 
> 1. i added the /usr/X11R6/lib to the libraries list in 
> /etc/beowulf/config so it looks like this.
> 
> libraries /lib /usr/lib /usr/X11R6/lib 
> 
> i then powered down the slaves, rebooted the master (is this necessary?
> i thought this might be overkill), and powered up slaves 0-1.
> 
> slave 0 seemed to be acting fine so i tried to run root using the
> following command.
> 
> pscm1:gilfoyle> bpsh 0 root -b -q /scratch/gilfoyle/e5/24023/run_eod3.C
> 
> and got
> 
> /usr/root/PRO/bin/root.exe: error while loading shared libraries:
> libCore.so: cannot open shared object file: No such file or directory
> 
> 2. ok. this is a library in /usr/root/PRO/lib so i just add that to the
> library list which now looks like this.
> 
> libraries /lib /usr/lib /usr/X11R6/lib /usr/root/PRO/lib
> 
> i power down slaves 0-1, reboot the master, and power up slaves 0-1. i
> have set the root environment variable LD_LIBRARY_PATH to
> /usr/root/PRO/lib. i now get the following.
> 
> pscm1:gilfoyle>  bpsh 0 root -b -q /scratch/gilfoyle/e5/24023/run_eod3.C
> root: error while loading shared libraries: libXpm.so.4: cannot open
> shared object file: No such file or directory
> 
> this was the original Xlib that couldn't be found even though its area 
> (/usr/X11R6/lib) is now in the libraries list.
> 
> 3. ok, can i try doing this with the LD_LIBRARY_PATH variable? i set it
> the following way in my .cshrc file.
> 
> setenv LD_LIBRARY_PATH ${ROOTSYS}/lib  <-- ROOTSYS is the root area
> setenv LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:/usr/X11R6/lib 
> 
> i power down slaves 0-1, reboot the master (i'm getting faster with
> this), and power back up slaves 0-1.
> 
> now i try to run root with
> 
> bpsh 0 root -b -q /scratch/gilfoyle/e5/24023/run_eod3.C
> 
> and it hangs and i can no longer communicate with the slave. i checked
> before running root that i could run bpsh commands on the slave and it
> worked fine.
> 
> 4. a wild guess. is there a limit on the number of libraries i can add
> in /etc/beowulf/config? my next idea was to copy all the libraries into
> /usr/lib. do you think this would work? i have to go home now and put
> our 4-year-old to bed.
> 
> 5. your last message described how to use X on the slave. i don't think
> we need to do this. the '-b' option in root is meant to run root without
> any graphics and i have done this many times even on the cluster before
> the upgrade. i guess they build one version of root and run it in
> graphics and non-graphics modes. if you think we need to do the things
> you describe, let me know.
> 
> let me know what you think.
> 
> time for a beer.
> 
> jerry
> 
> 
> steven james wrote:
> > 
> > Greetings,
> > 
> > I am happy I could help.
> > 
> > A long question deserves a long answer, so here goes :-)
> > 
> > You were on the right track by putting the Xlib in with the regular libs.
> > 
> > The issue is that slave nodes receive their files from the master's /lib
> > and /usr/lib directories.
> > 
> > This is a configuration option in /etc/beowulf/config.
> > 
> > Adding /usr/X11R6/lib there, then restarting the cluster should make it
> > find the library.
> > 
> > It is normally not included since it's unusual for a cluster app to want
> > to use X (other than the root process of a parallel visualization
> > app, that is).
> > 
> > This is not necessarily a problem, just an unusual situation that needs
> > configuring. Depending on exactly how it does its thing, you may also
> > need to set the DISPLAY environment variable explicitly to 192.168.1.1:0
> > 
> > It may also be necessary to use xhost to permit the node to use the
> > Xservices on the master. For your example of node 3, you would want
> > xhost +n3
> > 
> > before running root. (A useful note: the nss libs are patched so that
> > n<node_number> will correctly resolve to the node's IP address.)
> > 
> > If you are wanting to have the X connection re-directed to a workstation
> > somewhere, we'll need to set the master up to forward outgoing connections
> > from compute nodes so the X connection can get through.
> > 
> > G'day,
> > sjames
> > 
> > 
> > On Wed, 6 Nov 2002, Gerard P. Gilfoyle wrote:
> > 
> > > Hi Steven,
> > >
> > >    Thanks for all your help on Monday with the upgrade of the Richmond
> > > cluster. I have spent yesterday and today getting all our software
> > > tools up and running and I have run into a problem. We use a code
> > > called root to analyze our physics data both interactively and in
> > > batch. It was written at CERN (a large, international particle physics
> > > lab in Europe). I can run root on the master (pscm1) in interactive
> > > mode and in batch with no problems. However, when I try to run it in
> > > batch on a cluster node it can't find a library. The commands and
> > > error message are below.
> > >
> > > running root in batch on master:
> > >
> > >        root -b -q run_eod3.C  <-- this works
> > >
> > > The '-b' means batch and '-q' means the next thing is a file
> > > containing commands for the data analysis.
> > >
> > > running root in batch on a slave 3:
> > >
> > >        bpsh 3 root -b -q /scratch/gilfoyle/e5/24028/run_eod3.C
> > >
> > > error message from the previous command:
> > >
> > > root: error while loading shared libraries: libXpm.so.4: cannot open
> > > shared object file: No such file or directory
> > >
> > > The library libXpm.so.4 is located in /usr/X11R6/lib/ on pscm1 so
> > > presumably this is an environment variable problem. I have tried
> > > various fixes, but all have failed. Some of the things I tried are
> > > listed below.
> > >
> > > 1. root uses a library whose location is defined by the environment
> > > variable LD_LIBRARY_PATH which will point to an area like
> > > /usr/root/lib/. I have tried adding /usr/X11R6/lib/ to this path and
> > > even putting libXpm.so.4 in with the normal root libraries, but I get
> > > the same failure.
> > >
> > > 2. After the upgrade on Monday, we created user directories and
> > > accounts in the /home area, but I realized later the disk partition
> > > containing /home was too small. I moved the home directories to
> > > /usr/home. I speculated that the slave was not finding the correct
> > > .cshrc file so I created a temporary /home/gilfoyle
> > > area, put all the files in there (including the .cshrc file), and
> > > tried running root on the slave from that new directory. I get the
> > > same error message.
> > >
> > > Do you have any thoughts on what a solution could be???
> > >
> > > I will also contact the root developers to see if they have run into
> > > this problem.
> > >
> > > thanks-in-advance,
> > >
> > > jerry
> > >
> > >
> > 
> > --
> > -------------------------steven james, director of research, linux labs
> > ... ........ ..... ....                     230 peachtree st nw ste 701
> > the original linux labs                             atlanta.ga.us 30303
> >       -since 1995                              http://www.linuxlabs.com
> >                                    office 404.577.7747 fax 404.577.7743
> > -----------------------------------------------------------------------
> 
> 

-- 
-------------------------steven james, director of research, linux labs
... ........ ..... ....                     230 peachtree st nw ste 701
the original linux labs                             atlanta.ga.us 30303
      -since 1995                              http://www.linuxlabs.com
                                   office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------

#!/bin/sh
#
# Erik Hendriks <hendriks@lanl.gov>
#
# $Id: setup_libs,v 1.3 2001/10/15 22:06:47 hendriks Exp $
#
# This is a very simple script to copy shared libraries to nodes in a
# cluster.  This is broken out like this so that it can easily be run
# if libraries are updated after a node is booted.
#
# Possible future features:
#  * take -a to run on all nodes.
#  * take some argument to redo the library list and then update
#    all the nodes.

cd /
# Argument sanity checking
if [ "$1" = "" ] ; then
    echo "Usage: setup_libs <nodenumber> [rootfs]"
    exit 1
fi

NODE=$1
ROOTFS=$2
PATH=/sbin:/usr/sbin:$PATH

if [ -z "$ROOTFS" ] ; then ROOTFS=/ ; fi

echo "setup_libs: Copying libraries to node $NODE..."
if ! bplib -l | sed -e 's!^/!!' | tar cf - -T - | \
    bpsh $NODE tar -C $ROOTFS -xf - ; then
    echo 1>&2 "Library copy to node $NODE failed.  (rootfs=$ROOTFS)"
    exit 1
fi

echo "setup_libs: Copying ld.so.conf..."
bpcp /etc/ld.so.conf $NODE:/etc

echo "setup_libs: Running ldconfig on node $NODE..."
if ! bpsh $NODE ldconfig -r $ROOTFS ; then
    echo 1>&2 "Running ldconfig on $NODE failed. (rootfs=$ROOTFS)"
    exit 1
fi

# Transfer library list to the remote node.
echo "setup_libs: Transferring in-kernel library list to node $NODE..."
if ! bplib -l | bpsh $NODE bplib -a - ; then
    echo 1>&2 "Failed to setup library list on $NODE."
    exit 1
fi

exit 0
#!/bin/sh
#---------------------------------------------------------------------
# Erik Arjan Hendriks <hendriks@lanl.gov>
# Copyright (C) 2000 Scyld Computing Corporation
# 
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
# 
# $Id: node_up,v 1.17 2002/01/04 00:39:59 hendriks Exp $
#---------------------------------------------------------------------
umask 022			# Default umask for this stuff.
cd /

# Argument sanity checking
if [ "$1" = "" ] ; then
    echo "Usage: node_up <nodenumber>"
    exit 1
fi

NODE=$1
CONFIG=/etc/beowulf/config
BINDIR=/usr/lib/beoboot/bin

# Usage: beoconfig tag [config_file]
beoconfig() {
    local FILE=$2
    if [ -z "$FILE" ] ; then FILE=${CONFIG} ; fi
    if [ ! -f ${FILE} ] ; then
        echo "Warning: ${FILE} file not found." >&2
	return
    fi
    # These sed bits:
    #  - strip comments
    #  - strip leading + trailing space
    #  - if line starts with $1, strip off $1 and print it.
    sed -ne "s/#.*//" < ${FILE} \
	 -e "s/^[[:space:]]\+//;s/[[:space:]]\+\$//" \
         -e "/^$1[[:space:]]/{s/^$1[[:space:]]\+//;p;}"
}

die() {
    if [ -n "$1" ] ; then
        echo 1>&2 "$1"
    fi
    if [ -n "$2" ] ; then
        echo 1>&2 "Fatal error performing: $*"
    fi
    if [ -n "$MOUNTED" ] ; then
        umount $INITRD_BUILD
        rmdir  $INITRD_BUILD
    fi
    exit 1
}

run_cmd() {
    eval "$*" || die "" "$*"
}

# A message for the console on the remote end.
bpsh $NODE --stdout /dev/console \
  echo -e "node_up: This is node $NODE.\nnode_up: boot log available in /var/log/beowulf/node.$NODE on the master."

#---------------------------------------------------------------------
# First things first... set the system clock
echo "node_up: Setting system clock."
run_cmd $BINDIR/bdate $NODE

# mapping of ram devices at this point.
# /dev/ram0 <- initrd goes here

#run_cmd bpsh $NODE mount -nt proc none /proc

# XXX We need a way to figure out what interface is up at this point
# so that we know which one to slap a netmask onto.
echo "node_up: TODO set interface netmask."

# ... and kick on that loop back interface
echo "node_up: Configuring loopback interface."
run_cmd bpsh $NODE ifconfig lo 127.0.0.1 netmask 255.0.0.0
run_cmd bpsh $NODE route add -net 127.0.0.0 netmask 255.0.0.0 lo

#---------------------------------------------------------------------
# Kernel Modules
#
# We should probably pay attention to "insmod" lines in the config
# file here...
KVER=`bpsh $NODE uname -r`	# Make note of the remote kernel version
for module in `$BINDIR/pcilookup $NODE`; do
    modprobe --node $NODE $module
done

#---------------------------------------------------------------------
# File Systems
#

# We need a way for setup_fs to let us know where the root filesystem
# is mounted... 
$BINDIR/setup_fs $NODE || exit 1

# Populate it ?
# Setup scratch and tmp space...
run_cmd bpsh $NODE mkdir -p /rootfs/{tmp,scratch}
run_cmd bpsh $NODE chmod 1777 /rootfs/{tmp,scratch}

bplib -l | bpsh $NODE bplib -a -
$BINDIR/setup_libs $NODE /rootfs || exit 1

# Copy over device nodes from the front end.
echo "node_up: populating /dev and /etc"
run_cmd bpsh $NODE mkdir -p /rootfs/{dev,etc}

echo "node_up: Copying over device nodes."
run_cmd bpsh $NODE mkdir -p /rootfs/dev
#find /dev -mount -type b -o -type c | \
#    sed -e 's!^/!!' | tar cf - -T - | bpsh $NODE tar -C /rootfs -xf -
DEVLIST="console zero null"
tar -C /dev -cf - $DEVLIST | bpsh $NODE tar -C /rootfs/dev -xf -
[ "$?" = "0" ] || die "" "copying device nodes"

echo "node_up: Copying over time zone info."
run_cmd bpcp /etc/localtime $NODE:/rootfs/etc/localtime

echo "node_up: Copy over nsswitch info."
run_cmd cat << EOF | bpsh $NODE --stdout /rootfs/etc/nsswitch.conf cat
passwd: bproc
hosts: bproc
EOF

# nss_bproc is optional equipment so ignore errors....
#echo "node_up: Copying over bproc nss library."
#bpcp /lib/libnss_bproc.so.2 $NODE:/rootfs/lib

#---------------------------------------------------------------------
# Finish up...

#run_cmd bpsh $NODE umount -n /proc

run_cmd bpctl -S $NODE -r /rootfs

#if needed for locking NFS
#run_cmd bpsh $NODE portmap

# This is a hack to make the dynamic linker work for things which are
# exec'ed remotely.
run_cmd bpsh -N $NODE /sbin/ldconfig -l /lib/ld-*

run_cmd bpsh -N $NODE hostname n$NODE

run_cmd $BINDIR/nodeinfo $NODE			# Update node information DB

if [ -x /usr/lib/beoboot/init.d/rcS ]; then /usr/lib/beoboot/init.d/rcS $NODE
fi

#--- A message for the log file and node's console.
echo "node_up: Node setup finished."
bpsh $NODE --stdout /dev/console echo "node_up: Node setup finished."
exit 0