
even more questions about the Richmond cluster



Hi Steven,

   Happy New Year and yet another question about the Richmond cluster. 
I have been experimenting with different ways of running the cluster 
and I have run into a problem with the batch system. I'm submitting 
jobs in two different ways; one uses the beomap command to allocate 
slave nodes and the other just picks the slave nodes `by hand'. I'm 
using this second method because the limiting factor now is the 
ability to transfer the data files to the slave nodes. I was thinking 
that I could transfer the data on the first pass and leave it there 
for later passes to speed things up. The problem now is that after 
many jobs are submitted (60-100 or so), the remaining jobs get 
sent to the `b' batch queue and never run. This has happened even when 
the /var/spool area is not full.
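
For reference, here is roughly how I'm submitting things (the script 
name, node number, and /scratch path below are just placeholders, and 
I'm writing the beomap options from memory):

    # method 1: let beomap pick the slave nodes for the run
    NODES=`beomap --np 4`
    echo "./analyze --map $NODES run1.dat" | batch

    # method 2: pick a slave node by hand and stage the data there first
    bpcp run1.dat 5:/scratch/run1.dat
    echo "bpsh 5 ./analyze /scratch/run1.dat" | batch

My thoughts are the following.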

1. Can we raise the load-average limit that atd uses (the 'atd -l' 
option) so the batch jobs start running again? I've tried this and it 
seems to have little effect.
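My understanding is that jobs in the 'b' queue only start when the 
load average drops below atd's limit. Specifically, what I tried was 
restarting atd with a bigger limit, something like this (the value 8 
is just a guess):

    killall atd      # stop the running atd daemon
    atd -l 8         # restart it with a higher load-average limit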

2. Can we restart the jobs that are stuck in the queue? Right now they 
just sit there and never get started.
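Is poking at them by hand the right approach? Something like this (the 
job number is made up):

    atq -q b              # list the jobs stuck in the b queue
    at -c 123 > job.sh    # dump job 123's script
    atrm 123              # drop it from the queue
    batch < job.sh        # resubmit it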

3. In some of the recent analysis runs, the /var/spool/mqueue area has 
filled up and hung things up. Before that it was the /var/spool/at or 
/var/spool/mail areas. Do you have any idea what would cause that? 
Should we make /var/spool/mqueue a link to one of the RAID disks so 
there is plenty of space?
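What I had in mind is something like this, with sendmail stopped first 
(/raid0 is just a stand-in for wherever the RAID is mounted):

    /etc/init.d/sendmail stop
    mv /var/spool/mqueue /raid0/mqueue
    ln -s /raid0/mqueue /var/spool/mqueue
    /etc/init.d/sendmail start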

Let me know what you think.

Thanks-in-advance,

jerry

-- 
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482