Re: fixing the Richmond cluster
Greetings,
I was the one who worked on the fileserver. I just had a look and it
appears to be running the correct kernel.
WRT compiling the kernel:
The config I used was left in /usr/src/linux. You can change the limit,
run make clean, and then build the new kernel with the same config.
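The rebuild described above can be sketched roughly as follows. This is a sketch, not a tested procedure: it assumes a 2.2-era tree sitting in /usr/src/linux with the saved config already in place as .config, and that the process limit is the compile-time NR_TASKS constant in include/linux/tasks.h (which is where 2.2-series kernels keep it).

```shell
# Hedged sketch: assumes a 2.2-era kernel tree in /usr/src/linux with
# the saved configuration already present as .config. On 2.2 kernels
# the process limit is NR_TASKS in include/linux/tasks.h -- raise it
# there before running this.
rebuild_kernel() {
    cd /usr/src/linux || return 1
    make oldconfig \
      && make dep clean \
      && make bzImage modules \
      && make modules_install
}
# Run on the master once NR_TASKS has been edited:
# rebuild_kernel
```

The function is only defined here, not invoked, since the actual run needs the source tree on the master.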
You may also need a new ethernet module (bcm5700). Its source is in
/usr/src
Finally, the bproc modules (bproc, vmadump, and ksyscall) are in
/usr/src/redhat/SOURCES/bproc-2.2-pyro1
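For the out-of-tree pieces above, something like the following helper would cover the usual rebuild cycle. The helper name is hypothetical and the exact make targets are an assumption; check each package's Makefile or README for the real ones.

```shell
# Hypothetical helper (build_module is my name, not anything shipped)
# for rebuilding the out-of-tree pieces -- bcm5700 under /usr/src and
# bproc under /usr/src/redhat/SOURCES/bproc-2.2-pyro1 -- against the
# new kernel. The clean/make/install targets are assumed; consult each
# package's own Makefile.
build_module() {
    cd "$1" || return 1
    make clean && make && make install
}
# e.g.: build_module /usr/src/redhat/SOURCES/bproc-2.2-pyro1
```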
I can certainly do the kernel mod. Oddly enough, I don't know how much I
cost these days :-); I'll check with Markus on that and on any other
options available.
G'day,
sjames
On Fri, 18 Oct 2002, gilfoyle wrote:
> Hi Steven,
>
> Thank you for the response to my questions. I have been at meetings
> all this week and just got back to doing real work. I would like to
> update you on the status of the Richmond cluster and raise some
> questions/proposals for dealing with the problem. First, the status.
>
> 1. Last Tuesday (10/8/02) I submitted a large number of batch jobs
> (about 150) to the cluster and found that only about 1/3 of them ran
> and we had exhausted the number of available processes on the master.
> We had to reboot to get things going again.
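A quick way to watch how close the master is getting to its process limit, before the point where a reboot is the only option (a sketch; `threads-max` only exists on 2.4-series kernels, while 2.2 kernels fix the limit at compile time):

```shell
# Count live processes by their /proc entries (one numeric directory
# per pid), then show the kernel's runtime limit if it has one.
ls /proc | grep -c '^[0-9]'
cat /proc/sys/kernel/threads-max 2>/dev/null \
  || echo "limit is compile-time only (2.2-era kernel)"
```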
>
> 2. I was able to run with a reduced number of jobs (about 40) at a
> time and things seemed to work. However, this means we are not making
> full use of the cluster (actually not even half use).
>
> 3. I have tried some other approaches since then including raising the
> limiting load factor from its compile-time choice of 0.8 (see the man
> page for the 'atd' command), but keep running into similar
> problems.
>
> Next, some questions.
>
> Last spring we had lots of problems with the fileserver getting hung
> during data transfer. This was later determined to be caused by cables
> that were too long. However, in the course of debugging this problem
> we (actually I think this was you, Steve, but I'm not sure) tried a
> number of different kernels. Are we now sure that we are using the
> correct, optimized kernel? How can we check?
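One low-effort check along these lines: compare what the running kernel reports about itself against the tree it was supposedly built from. The /usr/src/linux path is the one mentioned earlier in this thread; adjust if the tree lives elsewhere.

```shell
# What the master is actually running:
uname -r      # kernel version string
uname -v      # build number and build date
# The version a given source tree would produce is stated near the top
# of its Makefile (VERSION/PATCHLEVEL/SUBLEVEL/EXTRAVERSION), e.g.:
# head -4 /usr/src/linux/Makefile
```

If the version and build date match the last deliberate build, the right kernel is in place; a mismatch means a stale image is still being booted.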
>
> Next, your proposals, with my questions.
>
> 1. We could recompile the kernel with the limit on the number of
> processes raised. I agree this would be the least disruptive option.
> We could ask our linux person here at Richmond to work on this.
> However, do we have (or can we get) a listing of all the modules
> and parameters that were used to build the current kernel? My
> experience has been that this information is important so we don't
> spend days trying to figure out what is in the current kernel. Steve,
> can we get help from you on this? How much would it cost?
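Much of that listing can be recovered from the running system itself, without anyone's memory of the build. A sketch (assumes the build tree, if it still exists, is the /usr/src/linux one from earlier in the thread):

```shell
# Build stamp of the running kernel, loaded modules, and -- if the
# original tree survives -- the full compile-time option list.
uname -a                                  # version and build stamp
lsmod 2>/dev/null || cat /proc/modules    # modules currently loaded
# Every compile-time option is recorded in the tree's .config:
# grep -v '^#' /usr/src/linux/.config
```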
>
> 2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to
> Markus Geiger about this option and it would cost us about $5000
> ($50/cpu). I hesitate to do this since it seems like a lot of money to
> spend to upgrade what is essentially a brand new cluster. I am also
> unsure if this option will fix our problem. If we go down this path,
> how quickly could the upgrade be done?
>
> Mike and I would really like to see this problem resolved as soon as
> possible. I am considering asking for funds to add slave nodes to the
> cluster and I want to be sure that we can make full use of them.
>
> Let me know what you think.
>
> Jerry Gilfoyle
>
--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------