
Re: fixing the Richmond cluster



hi steven,

   that's good news. i think we would like to go with the nimbus upgrade
while we're still within our support time. when would you be able to
do this? i want to coordinate our activities here to minimize the down
time of the cluster. let me know what you think.

cheers,

jerry

steven james wrote:
> 
> Greetings,
> 
> I pulled up records, and I think there may be some confusion. You are not
> required to pay $5000 for a Nimbus upgrade. As you are still within your
> support time, the upgrade is free. The price given to Mike was for a new
> install on a new cluster.
> 
> The procedure would be for someone on your side to make backups and
> install RedHat 7.2 with specified options on the master, and make it
> available on the net. I will transfer the RPMs for the Nimbus system over,
> and perform the installation and configuration.
> 
> You still have the option of a re-compile of your current kernel with the
> process limit increased if you prefer. IMHO, you will like the new
> features of Nimbus (that would, of course, have nothing to do with the
> fact that I built Nimbus :-)
> 
> G'day,
> sjames
> 
> On Mon, 21 Oct 2002, gilfoyle wrote:
> 
> > hi steven,
> >
> >    thanks for the response. we have a new linux support person and
> > this seems like an appropriate task for him. i'm reasonably sure that
> > we will need to consult with you about some things so i have to
> > > check our grant to see how much money we have. have you found out
> > > what your rate is these days?
> >
> > jerry
> >
> > steven james wrote:
> > >
> > > Greetings,
> > >
> > > I was the one who worked on the fileserver. I just had a look and it
> > > appears to be running the correct kernel.
> > >
> > > WRT compiling the kernel:
> > > The config I used was left in /usr/src/linux. You can change the limit,
> > > make clean, and then make the new kernel with the same configs.
> > > > You may also need a new ethernet module (bcm5700). Its source is in
> > > /usr/src
> > >
> > > > Finally, the bproc modules (bproc, vmadump, and ksyscall) are in
> > > /usr/src/redhat/SOURCES/bproc-2.2-pyro1
> > >
> > > I can certainly do the kernel mod. Oddly enough, I don't know how much I
> > > > cost these days :-). I will inquire w/ Markus on that and any other
> > > options available.
> > >
> > > G'day,
> > > sjames
> > >
> > > On Fri, 18 Oct 2002, gilfoyle wrote:
> > >
> > > > Hi Steven,
> > > >
> > > >    Thank you for the response to my questions. I have been at meetings
> > > > all this week and just got back to doing real work. I would like to
> > > > update you on the status of the Richmond cluster and raise some
> > > > questions/proposals for dealing with the problem. First, the status.
> > > >
> > > > > 1. Last Tuesday (10/8/02) I submitted a large number of batch jobs
> > > > (about 150) to the cluster and found that only about 1/3 of them ran
> > > > and we had exhausted the number of available processes on the master.
> > > > We had to reboot to get things going again.
> > > >
> > > > > 2. I was able to run with a reduced number of jobs (about 40) at a
> > > > time and things seemed to work. However, this means we are not making
> > > > full use of the cluster (actually not even half use).
> > > >
> > > > 3. I have tried some other approaches since then including raising the
> > > > limiting load factor from its compile-time choice of 0.8 (see the man
> > > > page for the 'atd' command), but keep running into similar
> > > > problems.
> > > >
> > > > Next, some questions.
> > > >
> > > > Last spring we had lots of problems with the fileserver getting hung
> > > > during data transfer. This was later determined to be caused by cables
> > > > that were too long. However, in the course of debugging this problem
> > > > > we (actually I think this was you, Steve, but I'm not sure) tried a
> > > > number of different kernels. Are we now sure that we are using the
> > > > correct, optimized kernel? How can we check?
> > > >
> > > > > Your proposals, with questions.
> > > >
> > > > 1. We could recompile the kernel with the limit on the number of
> > > > processes raised. I agree this would be the least disruptive option.
> > > > We could ask our linux person here at Richmond to work on this.
> > > > > However, do we have (or can we get) some listing of all the modules
> > > > > and parameters that were used to build the current kernel? My
> > > > experience has been that this information is important so we don't
> > > > spend days trying to figure out what is in the current kernel. Steve,
> > > > can we get help from you on this? How much would it cost?
> > > >
> > > > 2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to
> > > > Markus Geiger about this option and it would cost us about $5000
> > > > ($50/cpu). I hesitate to do this since it seems like a lot of money to
> > > > spend to upgrade what is essentially a brand new cluster. I am also
> > > > unsure if this option will fix our problem. If we go down this path,
> > > > how quickly could the upgrade be done?
> > > >
> > > > Mike and I would really like to see this problem resolved as soon as
> > > > possible. I am considering asking for funds to add slave nodes to the
> > > > cluster and I want to be sure that we can make full use of them.
> > > >
> > > > Let me know what you think.
> > > >
> > > > Jerry Gilfoyle
> > > >
> > > >
> > > >
> > >
> > > --
> > > -------------------------steven james, director of research, linux labs
> > > ... ........ ..... ....                     230 peachtree st nw ste 701
> > > the original linux labs                             atlanta.ga.us 30303
> > >       -since 1995                              http://www.linuxlabs.com
> > >                                    office 404.577.7747 fax 404.577.7743
> > > -----------------------------------------------------------------------
> >
> >
> 

-- 
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482
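
Note on the workaround in item 2 of the status list (running about 40
jobs at a time): the thread does not say how the batch jobs were
submitted, so the following is only a minimal sketch of one way to cap
the number of simultaneously running jobs from the master. The function
name run_throttled and its max_concurrent parameter are hypothetical and
are not part of the cluster's software; the default of 40 is simply the
figure quoted in the thread.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_throttled(commands, max_concurrent=40):
    """Run each command, keeping at most max_concurrent alive at once.

    commands is a list of argument lists (as accepted by subprocess.run).
    Returns the exit codes in the same order as the input commands.
    """
    def run_one(cmd):
        # Blocks until this one job exits, so each worker thread
        # corresponds to at most one live child process.
        return subprocess.run(cmd).returncode

    # The pool size is the cap on concurrently running jobs.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, commands))


if __name__ == "__main__":
    # Stand-in jobs (assumption): six trivial Python processes, two at a time.
    jobs = [[sys.executable, "-c", "pass"] for _ in range(6)]
    print(run_throttled(jobs, max_concurrent=2))
```

This only limits how many jobs run at once. It does not raise the
kernel's process limit, which is what the recompile or the Nimbus
upgrade discussed above would address.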