fixing the Richmond cluster
Hi Steven,
Thank you for the response to my questions. I have been at meetings
all this week and just got back to doing real work. I would like to
update you on the status of the Richmond cluster and raise some
questions/proposals for dealing with the problem. First, the status.
1. Last Tuesday (10/8/02) I submitted a large number of batch jobs
(about 150) to the cluster and found that only about a third of them
ran before we exhausted the available processes on the master. We had
to reboot to get things going again.
2. I was able to run with a reduced number of jobs (about 40 at a
time) and things seemed to work. However, this means we are not making
full use of the cluster (actually not even half of it).
3. I have tried some other approaches since then, including raising
atd's limiting load factor above its compile-time default of 0.8 (see
the man page for the 'atd' command), but I keep running into similar
problems. The short sketch after this list reads the relevant limits
and the current load on the master.
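Here is that sketch, in Python. It assumes a Linux master with /proc
mounted; the paths and the choice of limits are my guesses at the
relevant ones, so treat the output as a starting point rather than a
diagnosis.

    #!/usr/bin/env python
    # Minimal sketch: report the limits that could starve batch jobs
    # on the master. Assumes a Linux node with /proc mounted; the
    # paths below are standard but should be verified on our kernel.
    import os
    import resource

    # Per-user process ceiling; submissions fail once this is used up.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("max user processes (soft/hard): %s / %s" % (soft, hard))

    # System-wide task limit (2.4-era kernels expose threads-max).
    try:
        f = open("/proc/sys/kernel/threads-max")
        print("kernel threads-max: %s" % f.read().strip())
        f.close()
    except IOError:
        print("threads-max not exposed on this kernel")

    # atd refuses new batch jobs while the load average sits above
    # its limiting load factor (0.8 unless overridden with 'atd -l').
    one, five, fifteen = os.getloadavg()
    print("load averages: %.2f %.2f %.2f" % (one, five, fifteen))

If the per-user process ceiling turns out to be close to the number of
jobs we lose, that would point at that limit rather than at atd's load
factor, but I can't tell without running something like this on the
master.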
Next, some questions.
Last spring we had lots of problems with the fileserver hanging during
data transfers. This was eventually traced to cables that were too
long. However, in the course of debugging this problem we (actually I
think this was you, Steve, but I'm not sure) tried a number of
different kernels. Are we now sure that we are using the correct,
optimized kernel? How can we check?
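One possible check, assuming the nodes run Linux with /proc mounted:
the Python sketch below prints the release and build string of the
kernel that is actually booted, which we could compare against a
record of whichever build we settled on last spring.

    #!/usr/bin/env python
    # Sketch: identify the kernel actually running so we can compare
    # it against the build we chose after the fileserver debugging.
    import os

    # Kernel release string, e.g. "2.4.18-smp" (illustrative only).
    print("running kernel release: %s" % os.uname()[2])

    # /proc/version also records the compiler and build date, which
    # distinguishes two builds of the same release.
    f = open("/proc/version")
    print(f.read().strip())
    f.close()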
Finally, your proposals, with some questions.
1. We could recompile the kernel with the limit on the number of
processes raised. I agree this would be the least disruptive option,
and we could ask our Linux person here at Richmond to work on it.
However, do we have (or can we get) a listing of all the modules and
parameters that were used to build the current kernel? (See the sketch
after this list for one way to recover most of this from the master
itself.) My experience has been that this information is important so
we don't spend days trying to figure out what is in the current
kernel. Steve, can we get help from you on this? How much would it
cost?
2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to
Markus Geiger about this option and it would cost us about $5000
($50/cpu). I hesitate to do this since it seems like a lot of money to
spend to upgrade what is essentially a brand-new cluster. I am also
unsure whether this option would fix our problem. If we go down this path,
how quickly could the upgrade be done?
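On the kernel listing in proposal 1: something like the Python sketch
below could recover most of it from the master itself. It assumes the
builder saved the configuration in one of the standard places
(/boot/config-<release>, or /proc/config.gz on kernels built with that
support); if neither exists, we will need the .config file from
whoever built the kernel.

    #!/usr/bin/env python
    # Sketch: collect the running kernel's loaded modules and, if it
    # was saved anywhere standard, its build-time configuration.
    import os

    release = os.uname()[2]

    # Loaded modules (name, size, use count) straight from the kernel.
    f = open("/proc/modules")
    print(f.read())
    f.close()

    # Build configuration, if the builder left it in a standard spot.
    found = None
    for path in ("/boot/config-" + release, "/proc/config.gz"):
        if os.path.exists(path):
            found = path
            break
    if found:
        print("kernel config saved at %s" % found)
    else:
        print("no saved config found; ask whoever built the kernel")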
Mike and I would really like to see this problem resolved as soon as
possible. I am considering asking for funds to add slave nodes to the
cluster and I want to be sure that we can make full use of them.
Let me know what you think.
Jerry Gilfoyle
--
Dr. Gerard P. Gilfoyle
Physics Department                   e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173     phone:  804-289-8255
USA                                  fax:    804-289-8482