
Re: status of the Richmond cluster



Greetings,

Actually running root has been most informative. There appears to be a
problem with the installation of root PRO. Looking at the dmesg output of
node 3, root.exe is getting a segv when it tries to run. ldd shows that
its library requirements are unsatisfied (on the master as well!).
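
A quick way to list only the unresolved libraries is to filter the ldd
output. This is a minimal sketch: the helper name missing_libs is made up
for illustration, and the path to root.exe in the usage comment is an
assumption, since the actual location is not given here.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin, so it works with any command that produces it.
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage (the binary path is an assumption; adjust to where root.exe lives):
#   ldd /usr/root/PRO/bin/root.exe | missing_libs
```

An empty result means every library ldd looked for was resolved.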

When I add /usr/root/PRO/lib to /etc/ld.so.conf and run ldconfig, it tells
me that several libraries in that directory are truncated. The real
surprise is that root could run on node 0 at all. Possibly the libs somehow
got cached there and were later damaged.
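
For reference, the ld.so.conf change can be made repeatable with a small
helper that appends the directory only if it is not already listed. This is
a sketch, not the exact procedure used on the cluster; add_lib_dir is a
hypothetical name.

```shell
# Append a directory to an ld.so.conf-style file unless an identical
# line is already present (-x: whole line, -F: fixed string, -q: quiet).
add_lib_dir() {
    dir=$1
    conf=$2
    grep -qxF "$dir" "$conf" 2>/dev/null || printf '%s\n' "$dir" >> "$conf"
}

# Usage, as root on the master:
#   add_lib_dir /usr/root/PRO/lib /etc/ld.so.conf && /sbin/ldconfig
```

Running it twice leaves a single entry, so it is safe to repeat on nodes
that were already set up.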

The best way to proceed would be to restore /usr/root/PRO/lib from backup
or re-install the package, then run:

    bpsh -d allup /sbin/ldconfig -v >setuplog

setuplog should then show no problems.
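
One way to check setuplog afterwards is to pull out any files that ldconfig
flags as truncated. This sketch assumes the warning lines look like
"file <path> is truncated" and that they actually end up in the log;
ldconfig may write its warnings to stderr, in which case the redirect would
need 2>&1 as well. truncated_libs is a hypothetical helper name.

```shell
# List the distinct files that an ldconfig log flags as truncated.
# Assumes warning lines of the form "... file <path> is truncated".
truncated_libs() {
    sed -n 's/.*file \(.*\) is truncated.*/\1/p' "$1" | sort -u
}

# Usage:
#   truncated_libs setuplog
```

No output means no truncation warnings were found in the log.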

I have set up the ld.so.conf on all nodes but 8 in advance to be ready for
this operation.

G'day,
sjames



On Mon, 18 Nov 2002, gilfoyle wrote:

> hi steven,
> 
>    i tried running things yesterday and got the following.
> 
> 1. i tried running my perl scripts on slaves 10-11 (i.e. analyze two
> runs) and root did not run. the other tasks in the perl script were
> done correctly.
> 
> 2. i tried running root with the bpsh command from pscm1. i executed the
> command in the area /scratch/gilfoyle/e5/24023 which is the area on the
> slave. what is the jargon for this? mirror/ghost directory? it did not
> run correctly or produce any output. however, there is a core file in
> the /scratch/gilfoyle/e5/24023 area on slave 10.
> 
> 3. i tried running my perl scripts on slaves 0-1 since they worked
> before. they worked!! root ran and produced output files with filled
> histograms and all the good stuff.
> 
> 4. i tried running root on pscm1 (to look at the results of step 3) and 
> it did not run! it flashes its little greeting (which is an X-window
> function) and then crashes. the core file is in 
> /home/gilfoyle/eod/run/results/.
> 
> if you want to run this yourself the commands are the following.
> 
> 1. to run root: 
> 
> root<cr>
> 
> if you want to do more than that, let me know and i can give you a
> quick how-to for looking at data.
> 
> 2. to run root on slave 10: 
> 
> bpsh 10 root -b -q /scratch/gilfoyle/e5/24023/run_eod3.C
> 
> the data files are already on the slave. usually i would delete them
> after an analysis run, but i have left them on the disk for now for
> testing.
> 
> 3. to submit a job to the cluster.
> 
> a. go to /home/gilfoyle/eod/run.
> b. execute submit_eod3c.pl<cr>
> 
> the scripts are submit_eod3c.pl and run_root_on_node3.pl. the main 
> input file is /home/gilfoyle/eod/run/E5_run_numbers.inp which 
> determines which runs to analyze. right now it only lists 2 runs so
> only two runs will get analyzed when you run submit_eod3c.pl. the
> script submit_eod3c.pl sets some parameters including which slaves
> to run the analysis on. for example, see the parameter first_node
> in submit_eod3c.p.
> 
> let me know if there is more that will help. i'm starting to get a bit
> desperate to get this thing working.
> 
> jerry
> 
> 
> 
> 
> 
> steven james wrote:
> > 
> > Greetings,
> > 
> > I believe I have all of the library issues dealt with.
> > 
> > I noticed a possibly confusing behaviour that might have been the root of
> > some of this.
> > 
> > Perl depends on several libraries in /lib to run. Unlike those in
> > /usr/lib, they were being managed by caching rather than just being
> > available from NFS. It can take about a minute for the libs to be fetched
> > from the master. During that time, the app will appear hung, but will
> > eventually start.
> > 
> > I have pre-cached the files onto the node's local drive to try to avoid
> > that delay.
> > 
> > Since the libs are cached, once that startup penalty is paid, it doesn't
> > happen again for those libs on that node until reboot.
> > 
> > You can see this happen using tcpdump (I have a binary of it in my home
> > directory). The libs are transferred as a stream of multicast packets.
> > 
> > Please let me know if this gets it going. If problems remain, a good
> > approach might be for me to make a copy of your test data and try the runs
> > myself until the expected results come up.
> > 
> > G'day,
> > sjames
> > 
> > On Thu, 14 Nov 2002, gilfoyle wrote:
> > 
> > > hi steven,
> > >
> > >    i'm checking in (when there is no beam) to find out the
> > > status of the cluster. have the library issues been resolved?
> > > if so, what was the solution? i'm itching to let this thing
> > > get cooking.
> > >
> > > jerry
> > >
> > >
> > 
> > --
> > -------------------------steven james, director of research, linux labs
> > ... ........ ..... ....                     230 peachtree st nw ste 701
> > the original linux labs                             atlanta.ga.us 30303
> >       -since 1995                              http://www.linuxlabs.com
> >                                    office 404.577.7747 fax 404.577.7743
> > -----------------------------------------------------------------------
> 
> 

-- 
-------------------------steven james, director of research, linux labs
... ........ ..... ....                     230 peachtree st nw ste 701
the original linux labs                             atlanta.ga.us 30303
      -since 1995                              http://www.linuxlabs.com
                                   office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------