[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: status of the Richmond cluster



hi steven,

   responses below.

jerry

steven james wrote:
> 
> Greetings,
> 
> The message about not finding ld.so.conf was a problem (fixed). The rest
> is just a result of the caching system, and will not affect the systems
> when they actually run.

good.

> 
> I am concerned about the master freezing up. Do you know if it displayed
> any sort of OOPS on the console monitor when it hung?

i ran it from jlab. tomorrow i will be in richmond and can try it
from there.

> 
> There are two tactics to get this nailed down. One is to have the slaves
> copy their data directly from their /data? fileserver mounts. Currently,
> the data is being double copied it looks like.

i can try that tomorrow.

> 
> The other is to stagger the start times of the analysis runs at .5 to 1
> minute intervals (perhaps a sleep 30 in a script?).

i'll put it in. however, the data copying often takes more than a minute
so i don't know how much this will help.

> 
> This will help to narrow things down for a final resolution.
> 
> G'day,
> sjames
> 
> On Tue, 26 Nov 2002, gilfoyle wrote:
> 
> > hi steven,
> >
> >    here's the latest.
> >
> > 1. i restored root. this was done by deleting the old directory and
> > untarring the file from cern containing the libraries and the binaries.
> > i used the version for redhat 7.2 and gcc 2.96. the file is
> >
> > /usr/root/root_v3.03.09.Linux.RH7.2.gcc296.tar
> >
> > 2. i executed the 'bpsh -d allup /sbin/ldconfig -v >setuplog-02-nov-25'
> > and got the following.
> >
> > [root@pscm1]# bpsh -d allup /sbin/ldconfig -v > setuplog-02-nov-25
> > /sbin/ldconfig: Can't open configuration file /etc/ld.so.conf: No such
> > file or directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_compat.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_dns.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_files.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_nis.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_compat.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_dns.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_files.so: No such file or
> > directory
> > /sbin/ldconfig: Cannot stat /usr/lib/libnss1_nis.so: No such file or
> > directory
> >
> > ...  lots more.
> >
> > i have attached the log file i created during this process. i didn't
> > know
> > if the messages above were a problem or not so i trudged on.
> >
> > 3. i tried running root from pscm1 and it worked beautifully.
> >
> > 4. i ran my scripts for submitting jobs to the cluster using four
> > analysis runs. this also ran beautifully. it produced output files
> > in the correct place that looked like things had worked. i was
> > very happy. i ran this script using slaves 0-3. i had not been able
> > to use slaves 2-3 before.
> >
> > 5. i ran my scripts using 12 analysis runs next. things started out
> > fine. i was monitoring the number of jobs running on the slaves. at that
> > point the jobs were either transferring data over to the slaves' disk
> > or starting the analysis. sometime during this process, the master
> > (pscm1) hung and i could get no response. this is similar to what we
> > saw a couple of weeks ago. i waited for quite some time as you
> > suggested in one of your emails, but i never got a response so i went
> > home.
> >
> > 6. after i came in this morning i could get no response from pscm1 so i
> > rebooted the master (actually my secretary did it. i am at JLab today).
> > root still runs fine on the master. i will ask sasko (our linux person)
> > to reboot the cluster today.
> >
> > let me know what you think.
> >
> > jerry
> >
> >
> >
> > steven james wrote:
> > >
> > > Greetings,
> > >
> > > Actually running root has been most informative. There appears to be a
> > > problem with the installation of root PRO. Looking at the dmesg output of
> > > node 3, root.exe is getting a segv when it tries to run. ldd shows that
> > > it's library requirements are unsatisfied (on the master as well!).
> > >
> > > When I add /usr/root/PRO/lib to /etc/ld/so.conf and run ldconfig, it tells
> > > me that several libraries in that directory are truncated. The real
> > > surprise is that it could run on node 0 at all. Possably the libs somehow
> > > got cached there and were later damaged.
> > >
> > > The best way to proceed would be to restore /usr/root/PRO/lib from backup
> > > or re-install the package, then bpsh -d allup /sbin/ldconfig -v >setuplog
> > >
> > > setuplog should show no problems then.
> > >
> > > I have set up the ld.so.conf on all nodes but 8 in advance to be ready for
> > > this operation.
> > >
> > > G'day,
> > > sjames
> > >
> > > On Mon, 18 Nov 2002, gilfoyle wrote:
> > >
> > > > hi steven,
> > > >
> > > >    i tried running things yesterday and got the following.
> > > >
> > > > 1. i tried running my perl scripts on slaves 10-11 (i.e. analyze two
> > > > runs) and root did not run. the other tasks in the perl script were
> > > > done correctly.
> > > >
> > > > 2. i tried running root with the bpsh command from pscm1. i executed the
> > > > command in the area /scratch/gilfoyle/e5/24023 which is the area on the
> > > > slave. what is the jargon for this? mirror/ghost directory? it did not
> > > > run correctly or produce any output. however, there is a core file in
> > > > the
> > > > /scratch/gilfoyle/e5/24023 area on slave 10.
> > > >
> > > > 3. i tried running my perl scripts on slaves 0-1 since they worked
> > > > before.
> > > > they worked!! root ran and produced output files with filled histograms
> > > > and all the good stuff.
> > > >
> > > > 4. i tried running root on pscm1 (to look at the results of step 3) and
> > > > it did not run! it flashes its little greeting (which is an X-window
> > > > function) and then crashes. the core file is in
> > > > /home/gilfoyle/eod/run/results/.
> > > >
> > > > if you want to run this yourself the commands are the following.
> > > >
> > > > 1. to run root:
> > > >
> > > > root<cr>
> > > >
> > > > if you want to do more than that, let me know and i can give you a
> > > > quick how-to for looking at data.
> > > >
> > > > 2. to run root on slave 10:
> > > >
> > > > bpsh 10 root -b -q /scratch/gilfoyle/e5/24023/run_eod3.C
> > > >
> > > > the data files are already on the slave. usually i would delete them
> > > > after an analysis run, but i have them left them on the disk now for
> > > > testing.
> > > >
> > > > 3. to submit a job to the cluster.
> > > >
> > > > a. go to /home/gilfoyle/eod/run.
> > > > b. execute submit_eod3c.pl<cr>
> > > >
> > > > the scripts are submit_eod3c.pl and run_root_on_node3.pl. the main
> > > > input file is /home/gilfoyle/eod/run/E5_run_numbers.inp which
> > > > determines which runs to analyze. right now it only lists 2 runs so
> > > > only two runs will get analyzed when you run submit_eod3c.pl. the
> > > > script submit_eod3c.pl sets some parameters including which slaves
> > > > to run the analysis on. for example, see the parameter first_node
> > > > in submit_eod3c.p.
> > > >
> > > > let me know if there is more that will help. i'm starting to get a bit
> > > > desperate to get this thing working.
> > > >
> > > > jerry
> > > >
> > > >
> > > >
> > > >
> > > >
> > > > steven james wrote:
> > > > >
> > > > > Greetings,
> > > > >
> > > > > I believe I have all of the library issues dealt with.
> > > > >
> > > > > I noticed a possably confusing behaviour that might have been the root of
> > > > > some of this.
> > > > >
> > > > > Perl depends on several libraries in /lib to run. Unlike those in
> > > > > /usr/lib, they were being managed by caching rather than just being
> > > > > available from NFS. It can take about a minute for the libs to be fetched
> > > > > from the master. During that time, the app will appear hung, but will
> > > > > eventually start.
> > > > >
> > > > > I have pre-cached the files onto the node's local drive to try to avoid
> > > > > that delay.
> > > > >
> > > > > Since the libs are cached, once that startup penelty is paid, it doesn't
> > > > > happen again for those libs on that node until reboot.
> > > > >
> > > > > You can see this happen using tcpdump (I have a binary of it in my home
> > > > > directory). The libs are transferred as a stream of multicast packets.
> > > > >
> > > > > Please let me know if this gets it going. If problems remain, a good
> > > > > approach might be for me to make a copy of your test data and try the runs
> > > > > myself until the expected results come up.
> > > > >
> > > > > G'day,
> > > > > sjames
> > > > >
> > > > > On Thu, 14 Nov 2002, gilfoyle wrote:
> > > > >
> > > > > > hi steven,
> > > > > >
> > > > > >    i'm checking in (when there is no beam) to find out the
> > > > > > status of the cluster. have the library issues been resolved?
> > > > > > if so, what was the solution? i'm itching to let this thing
> > > > > > get cooking.
> > > > > >
> > > > > > jerry
> > > > > >
> > > > > >
> > > > >
> > > > > --
> > > > > -------------------------steven james, director of research, linux labs
> > > > > ... ........ ..... ....                     230 peachtree st nw ste 701
> > > > > the original linux labs                             atlanta.ga.us 30303
> > > > >       -since 1995                              http://www.linuxlabs.com
> > > > >                                    office 404.577.7747 fax 404.577.7743
> > > > > -----------------------------------------------------------------------
> > > >
> > > >
> > >
> > > --
> > > -------------------------steven james, director of research, linux labs
> > > ... ........ ..... ....                     230 peachtree st nw ste 701
> > > the original linux labs                             atlanta.ga.us 30303
> > >       -since 1995                              http://www.linuxlabs.com
> > >                                    office 404.577.7747 fax 404.577.7743
> > > -----------------------------------------------------------------------
> >
> >
> 
> --
> -------------------------steven james, director of research, linux labs
> ... ........ ..... ....                     230 peachtree st nw ste 701
> the original linux labs                             atlanta.ga.us 30303
>       -since 1995                              http://www.linuxlabs.com
>                                    office 404.577.7747 fax 404.577.7743
> -----------------------------------------------------------------------

-- 
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482