Re: fixing the Richmond cluster
Greetings,
I pulled up records, and I think there may be some confusion. You are not
required to pay $5000 for a Nimbus upgrade. As you are still within your
support time, the upgrade is free. The price given to Mike was for a new
install on a new cluster.
The procedure would be for someone on your side to make backups and
install RedHat 7.2 with specified options on the master, and make it
available on the net. I will transfer the RPMs for the Nimbus system over,
and perform the installation and configuration.
You still have the option of a re-compile of your current kernel with the
process limit increased if you prefer. IMHO, you will like the new
features of Nimbus (that would, of course, have nothing to do with the
fact that I built Nimbus :-)
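
If you do stay with the current kernel for now, it may help to watch how
close the master gets to the process limit while a batch is running. Here
is a minimal sketch in Python, assuming the master exposes the usual Linux
/proc layout. The function names and the limit you pass in are placeholders
of mine, not values taken from your kernel config:

```python
import os


def count_processes(proc_root="/proc"):
    """Count processes by counting the all-numeric entries under proc_root.

    On Linux each running process appears as a directory named after its PID.
    """
    return sum(1 for name in os.listdir(proc_root) if name.isdigit())


def headroom(limit, proc_root="/proc"):
    """Return how many more processes could start before reaching `limit`.

    `limit` is whatever process limit the kernel was built with; it is
    supplied by the caller because this sketch does not try to discover it.
    """
    return limit - count_processes(proc_root)
```

Calling headroom(limit) every minute or so during a large submission
should show whether the 150-job case really runs the count up to the
ceiling.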
G'day,
sjames
On Mon, 21 Oct 2002, gilfoyle wrote:
> hi steven,
>
> thanks for the response. we have a new linux support person and
> this seems like an appropriate task for him. i'm reasonably sure that
> we will need to consult with you about some things so i have to
> check our grant to see how much money we have. have you found out
> what your rate is these days?
>
> jerry
>
> steven james wrote:
> >
> > Greetings,
> >
> > I was the one who worked on the fileserver. I just had a look and it
> > appears to be running the correct kernel.
> >
> > WRT compiling the kernel:
> > The config I used was left in /usr/src/linux. You can change the limit,
> > make clean, and then make the new kernel with the same configs.
> > You may also need a new ethernet module (bcm5700). Its source is in
> > /usr/src
> >
> > Finally, the bproc modules (bproc, vmadump, and ksyscall) are in
> > /usr/src/redhat/SOURCES/bproc-2.2-pyro1
> >
> > I can certainly do the kernel mod. Oddly enough, I don't know how much I
> > cost these days :-). I will inquire w/ Markus on that and any other
> > options available.
> >
> > G'day,
> > sjames
> >
> > On Fri, 18 Oct 2002, gilfoyle wrote:
> >
> > > Hi Steven,
> > >
> > > Thank you for the response to my questions. I have been at meetings
> > > all this week and just got back to doing real work. I would like to
> > > update you on the status of the Richmond cluster and raise some
> > > questions/proposals for dealing with the problem. First, the status.
> > >
> > > > 1. Last Tuesday (10/8/02) I submitted a large number of batch jobs
> > > (about 150) to the cluster and found that only about 1/3 of them ran
> > > and we had exhausted the number of available processes on the master.
> > > We had to reboot to get things going again.
> > >
> > > 2. I was able to run with a reduced number of jobs (about 40) at a
> > > time and things seemed to work. However, this means we are not making
> > > full use of the cluster (actually not even half use).
> > >
> > > 3. I have tried some other approaches since then including raising the
> > > limiting load factor from its compile-time choice of 0.8 (see the man
> > > page for the 'atd' command), but keep running into similar
> > > problems.
> > >
> > > Next, some questions.
> > >
> > > Last spring we had lots of problems with the fileserver getting hung
> > > during data transfer. This was later determined to be caused by cables
> > > that were too long. However, in the course of debugging this problem
> > > we (actually I think this was you Steve, but I'm not sure) tried a
> > > number of different kernels. Are we now sure that we are using the
> > > correct, optimized kernel? How can we check?
> > >
> > > Your proposals with questions.
> > >
> > > 1. We could recompile the kernel with the limit on the number of
> > > processes raised. I agree this would be the least disruptive option.
> > > We could ask our linux person here at Richmond to work on this.
> > > > However, do we have (or can we get) some listing of all the modules
> > > > and parameters that were used to build the current kernel? My
> > > experience has been that this information is important so we don't
> > > spend days trying to figure out what is in the current kernel. Steve,
> > > can we get help from you on this? How much would it cost?
> > >
> > > 2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to
> > > Markus Geiger about this option and it would cost us about $5000
> > > ($50/cpu). I hesitate to do this since it seems like a lot of money to
> > > spend to upgrade what is essentially a brand new cluster. I am also
> > > unsure if this option will fix our problem. If we go down this path,
> > > how quickly could the upgrade be done?
> > >
> > > Mike and I would really like to see this problem resolved as soon as
> > > possible. I am considering asking for funds to add slave nodes to the
> > > cluster and I want to be sure that we can make full use of them.
> > >
> > > Let me know what you think.
> > >
> > > Jerry Gilfoyle
> > >
> > >
> > >
> >
> > --
> > -------------------------steven james, director of research, linux labs
> > ... ........ ..... .... 230 peachtree st nw ste 701
> > the original linux labs atlanta.ga.us 30303
> > -since 1995 http://www.linuxlabs.com
> > office 404.577.7747 fax 404.577.7743
> > -----------------------------------------------------------------------
>
>
--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------