
fixing the Richmond cluster

Hi Steven,

   Thank you for the response to my questions. I have been at meetings 
all this week and just got back to doing real work. I would like to 
update you on the status of the Richmond cluster and raise some 
questions/proposals for dealing with the problem. First, the status.

1. Last Tuesday (10/8/02) I submitted a large number of batch jobs 
(about 150) to the cluster and found that only about a third of them 
ran before we exhausted the number of available processes on the 
master. We had to reboot to get things going again. (Some commands 
for checking the process situation are sketched after this list.)

2. I was able to run with a reduced number of jobs (about 40) at a 
time and things seemed to work. However, this means we are not making 
full use of the cluster (actually not even half use).

3. I have tried some other approaches since then, including raising 
the limiting load factor from its compile-time choice of 0.8 (see the 
man page for the 'atd' command), but I keep running into similar 
problems. (The sort of change I tried is sketched below.)
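
For reference, here is roughly how the process situation on the master 
can be checked (a minimal sketch; the exact limits depend on which 
kernel and shell we are running):

    # Count the processes currently on the master
    ps ax | wc -l

    # Per-user process limit imposed by the shell, if any (bash)
    ulimit -u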
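
And this is the sort of change I made to the load factor (assuming the 
batch jobs go through atd; 8.0 is just an example value):

    # Restart the at/batch daemon with a higher load limit than the
    # compile-time default of 0.8 (as root)
    killall atd
    atd -l 8.0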

Next, some questions. 

Last spring we had lots of problems with the fileserver getting hung 
during data transfer. This was later determined to be caused by cables 
that were too long. However, in the course of debugging this problem 
we (actually I think this was you, Steve, but I'm not sure) tried a 
number of different kernels. Are we now sure that we are using the 
correct, optimized kernel? How can we check? One place to start is 
sketched below.
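
For example (the exact file names depend on how the kernel was 
installed and which boot loader we use, so treat this as a sketch):

    # Version and build date of the kernel we are actually running
    uname -a

    # Which kernel image the boot loader is configured to load
    cat /etc/lilo.conf          # or /boot/grub/grub.conf

    # The kernel images (and any saved build configs) on the machine
    ls /boot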

Finally, your proposals, with some questions.

1. We could recompile the kernel with the limit on the number of 
processes raised. I agree this would be the least disruptive option. 
We could ask our Linux person here at Richmond to work on this. 
However, do we have (or can we get) a listing of all the modules and 
parameters that were used to build the current kernel? My experience 
has been that this information is important so we don't spend days 
trying to figure out what is in the current kernel. (A sketch of where 
that information may already live follows option 2.) Steve, can we get 
help from you on this? How much would it cost?

2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to 
Markus Geiger about this option and it would cost us about $5000 
($50/cpu). I hesitate to do this since it seems like a lot of money to 
spend to upgrade what is essentially a brand new cluster. I am also 
unsure if this option will fix our problem. If we go down this path, 
how quickly could the upgrade be done?
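
On option 1, the build configuration may already be sitting on the 
master, and if we are running a 2.4-series kernel the process limit 
may even be tunable at run time without a recompile. I am not certain 
which kernel series we have, so treat this as a guess:

    # Module and parameter choices from the build, if the config was saved
    less /boot/config-$(uname -r)
    ls /lib/modules/$(uname -r)

    # On 2.4 kernels the system-wide task limit is a run-time tunable;
    # raising it takes effect immediately (as root; 16384 is an example)
    cat /proc/sys/kernel/threads-max
    echo 16384 > /proc/sys/kernel/threads-max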

Mike and I would really like to see this problem resolved as soon as 
possible. I am considering asking for funds to add slave nodes to the 
cluster and I want to be sure that we can make full use of them.

Let me know what you think.

Jerry Gilfoyle


-- 
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482