
Re: fixing the Richmond cluster



Greetings,

I was the one who worked on the fileserver. I just had a look and it
appears to be running the correct kernel.

WRT compiling the kernel:
The config I used was left in /usr/src/linux. You can change the limit,
make clean, and then make the new kernel with the same config.
You may also need a new ethernet module (bcm5700). Its source is in
/usr/src.
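
Just so your linux person has a starting point, the rebuild goes roughly
as follows. Take it as a sketch, not gospel: it assumes the saved config
is /usr/src/linux/.config and that the limit in question is NR_TASKS in
include/linux/tasks.h (where it lives on 2.2-series kernels); adjust if
the tree turns out to be something else.

  cd /usr/src/linux
  vi include/linux/tasks.h   # raise NR_TASKS (the process-table size)
  make oldconfig             # reuse the existing .config unchanged
  make dep                   # regenerate dependency info
  make clean                 # clear out the old objects
  make bzImage modules       # build the kernel image and the modules
  make modules_install       # install modules under /lib/modules/<version>
  cp arch/i386/boot/bzImage /boot/vmlinuz-nrtasks
                             # name is arbitrary; add an entry to lilo.conf
                             # and rerun lilo before rebooting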

Finally, the bproc modules (bproc, vmadump and ksyscall) are in
/usr/src/redhat/SOURCES/bproc-2.2-pyro1.
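
Those three will most likely need a rebuild against the new kernel
before the nodes will come up on it. I don't have the box in front of
me, so take this as a sketch and check the Makefile in that directory
for where it expects to find the kernel source:

  cd /usr/src/redhat/SOURCES/bproc-2.2-pyro1
  make clean
  make           # build the module objects against /usr/src/linux
  make install   # put them where the node boot setup expects them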

I can certainly do the kernel mod. Oddly enough, I don't know how much I
cost these days :-). I will inquire w/ Markus on that and any other
options available.

G'day,
sjames


On Fri, 18 Oct 2002, gilfoyle wrote:

> Hi Steven,
> 
>    Thank you for the response to my questions. I have been at meetings 
> all this week and just got back to doing real work. I would like to 
> update you on the status of the Richmond cluster and raise some 
> questions/proposals for dealing with the problem. First, the status.
> 
> 1. Last Tuesday (10/8/02) I submitted a large number of batch jobs 
> (about 150) to the cluster and found that only about 1/3 of them ran 
> and we had exhausted the number of available processes on the master. 
> We had to reboot to get things going again.
> 
> 2. I was able to run with a reduced number of jobs (about 40) at a 
> time and things seemed to work. However, this means we are not making 
> full use of the cluster (actually not even half use).
> 
> 3. I have tried some other approaches since then including raising the 
> limiting load factor from its compile-time choice of 0.8 (see the man 
> page for the 'atd' command), but keep running into similar
> problems.
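
A couple of quick checks on point 1 are worth doing before anything gets
rebuilt, just to confirm it really is the process table filling up. The
/proc entry below only exists on some kernel versions, so don't worry if
it isn't there:

  ps ax | wc -l                      # processes alive on the master right now
  ulimit -u                          # per-user process limit in this shell
  cat /proc/sys/kernel/threads-max   # system-wide task limit, where exposed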
> 
> Next, some questions. 
> 
> Last spring we had lots of problems with the fileserver getting hung 
> during data transfer. This was later determined to be caused by cables 
> that were too long. However, in the course of debugging this problem 
> we (actually I think this was you Steve, but I'm not sure) tried a 
> number of different kernels. Are we now sure that we are using the 
> correct, optimized kernel? How can we check?
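
For what it's worth, here is how I just checked that the running kernel
matches the tree I built from (assuming that tree is still
/usr/src/linux):

  uname -r                         # version of the kernel actually running
  head -4 /usr/src/linux/Makefile  # VERSION/PATCHLEVEL/SUBLEVEL/EXTRAVERSION
                                   # of the source tree
  ls /lib/modules/                 # installed module sets; one should match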
> 
> Your proposals, with my questions.
> 
> 1. We could recompile the kernel with the limit on the number of 
> processes raised. I agree this would be the least disruptive option. 
> We could ask our linux person here at Richmond to work on this. 
> However, do we have (or can we get) a listing of all the modules 
> and parameters that were used to build the current kernel? My 
> experience has been that this information is important so we don't 
> spend days trying to figure out what is in the current kernel. Steve, 
> can we get help from you on this? How much would it cost?
> 
> 2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to 
> Markus Geiger about this option and it would cost us about $5000 
> ($50/cpu). I hesitate to do this since it seems like a lot of money to 
> spend to upgrade what is essentially a brand new cluster. I am also 
> unsure if this option will fix our problem. If we go down this path, 
> how quickly could the upgrade be done?
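
On the listing question in point 1: nothing needs to be reverse
engineered, it should all still be on the machine (again assuming
/usr/src/linux is the tree the current kernel came from):

  less /usr/src/linux/.config    # every option the kernel was built with
  lsmod                          # modules currently loaded on the master
  ls /lib/modules/`uname -r`/    # all modules installed for that kernel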
> 
> Mike and I would really like to see this problem resolved as soon as 
> possible. I am considering asking for funds to add slave nodes to the 
> cluster and I want to be sure that we can make full use of them.
> 
> Let me know what you think.
> 
> Jerry Gilfoyle
> 
> 
> 

-- 
-------------------------steven james, director of research, linux labs
... ........ ..... ....                     230 peachtree st nw ste 701
the original linux labs                             atlanta.ga.us 30303
      -since 1995                              http://www.linuxlabs.com
                                   office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------