
Re: fixing the Richmond cluster



hi steven,

   thanks for the response. we have a new linux support person and
this seems like an appropriate task for him. i'm reasonably sure that
we will need to consult with you about some things so i have to
check our grant to see how much money we have. have you found out
what your rate is these days?

jerry

steven james wrote:
> 
> Greetings,
> 
> I was the one who worked on the fileserver. I just had a look and it
> appears to be running the correct kernel.
> 
> WRT compiling the kernel:
> The config I used was left in /usr/src/linux. You can change the limit,
> make clean, and then make the new kernel with the same configs.
> You may also need a new ethernet module (bcm5700). Its source is in
> /usr/src
> 
> Finally, the bproc modules (bproc, vmadump, and ksyscall) are in
> /usr/src/redhat/SOURCES/bproc-2.2-pyro1
> 
> I can certainly do the kernel mod. Oddly enough, I don't know how much I
> cost these days :-). I will inquire w/ Markus on that and any other
> options available.
> 
> G'day,
> sjames
> 
> On Fri, 18 Oct 2002, gilfoyle wrote:
> 
> > Hi Steven,
> >
> >    Thank you for the response to my questions. I have been at meetings
> > all this week and just got back to doing real work. I would like to
> > update you on the status of the Richmond cluster and raise some
> > questions/proposals for dealing with the problem. First, the status.
> >
> > 1. Last Tuesday (10/8/02) I submitted a large number of batch jobs
> > (about 150) to the cluster and found that only about 1/3 of them ran
> > and we had exhausted the number of available processes on the master.
> > We had to reboot to get things going again.
> >
> > 2. I was able to run with a reduced number of jobs (about 40) at a
> > time and things seemed to work. However, this means we are not making
> > full use of the cluster (actually not even half use).
> >
> > 3. I have tried some other approaches since then including raising the
> > limiting load factor from its compile-time choice of 0.8 (see the man
> > page for the 'atd' command), but keep running into similar
> > problems.
> >
> > Next, some questions.
> >
> > Last spring we had lots of problems with the fileserver getting hung
> > during data transfer. This was later determined to be caused by cables
> > that were too long. However, in the course of debugging this problem
> > we (actually I think this was you, Steve, but I'm not sure) tried a
> > number of different kernels. Are we now sure that we are using the
> > correct, optimized kernel? How can we check?
> >
> > Your proposals, with questions.
> >
> > 1. We could recompile the kernel with the limit on the number of
> > processes raised. I agree this would be the least disruptive option.
> > We could ask our linux person here at Richmond to work on this.
> > However, do we have (or can we get) some listing of all the modules
> > and parameters that were used to build the current kernel? My
> > experience has been that this information is important so we don't
> > spend days trying to figure out what is in the current kernel. Steve,
> > can we get help from you on this? How much would it cost?
> >
> > 2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to
> > Markus Geiger about this option and it would cost us about $5000
> > ($50/cpu). I hesitate to do this since it seems like a lot of money to
> > spend to upgrade what is essentially a brand new cluster. I am also
> > unsure if this option will fix our problem. If we go down this path,
> > how quickly could the upgrade be done?
> >
> > Mike and I would really like to see this problem resolved as soon as
> > possible. I am considering asking for funds to add slave nodes to the
> > cluster and I want to be sure that we can make full use of them.
> >
> > Let me know what you think.
> >
> > Jerry Gilfoyle
> >
> >
> >
> 
> --
> -------------------------steven james, director of research, linux labs
> ... ........ ..... ....                     230 peachtree st nw ste 701
> the original linux labs                             atlanta.ga.us 30303
>       -since 1995                              http://www.linuxlabs.com
>                                    office 404.577.7747 fax 404.577.7743
> -----------------------------------------------------------------------

-- 
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482