even more questions about the Richmond cluster
Hi Steven,
Happy New Year and yet another question about the Richmond cluster.
I have been experimenting with different ways of running the cluster
and I have run into a problem with the batch system. I'm submitting
jobs in two different ways: one uses the beomap command to allocate
slave nodes and the other just picks the slave nodes `by hand'. I'm
using this second method because the limiting factor now is the
ability to transfer the data files to the slave nodes. I was thinking
that I could transfer the data on the first pass and leave it there
for later passes to speed things up (a rough sketch of the two
submission styles is below).
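To make sure we are talking about the same thing, here is roughly
what the two submission styles look like. This is only a sketch: the
analysis program, data file names, and node numbers are placeholders,
and I'm assuming beomap prints a colon-separated node list and that
bpsh/bpcp take the node number as in `bpsh 5 cmd' and `bpcp file
5:/path'.

    #!/bin/sh
    # Method 1: let beomap choose the slave node for this pass
    NODE=`beomap | cut -d: -f1`     # first node in the map
    echo "bpsh $NODE /home/jerry/analyze run01.dat" | batch

    # Method 2: pick the slave node by hand and stage the data file
    # there once, so later passes reuse the local copy instead of
    # copying it over again
    NODE=5
    bpcp run01.dat $NODE:/tmp/run01.dat
    echo "bpsh $NODE /home/jerry/analyze /tmp/run01.dat" | batch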
The problem now is that after many jobs are submitted (60-100 or so)
the remaining jobs get sent to the `b' batch queue and never run.
This has happened even when the /var/spool area is not full. My
thoughts are the following.
1. Can we raise the load-average limit that atd uses for batch jobs
with the 'atd -l' option? I've tried this and it seems to have little
effect (first sketch after this list).
2. Can we restart the jobs in the queue? Right now they just sit
there and never get started.
3. In some of the recent analysis runs, the /var/spool/mqueue area
has filled up and hung things up. Before it was the /var/spool/at or
/var/spool/mail areas. Do you have any idea what would cause that?
Should we make a link for /var/spool/mqueue to one of the RAID disks
so there is plenty of space (second sketch after this list)?
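On points 1 and 2, this is the kind of thing I have been trying
(again just a sketch; the load limit of 8 is arbitrary and the job
number comes from whatever atq reports):

    # See what is sitting in the queues; the stuck jobs show up in
    # the `b' (batch) queue
    atq

    # Look at what a stuck job would actually run
    at -c 123               # 123 = a job number from atq

    # Restart atd with a higher load-average limit so queued batch
    # jobs are allowed to start even when the node reports some load
    killall atd
    atd -l 8                # the default limit is 0.8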
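And on point 3, this is what I had in mind for the mqueue link.
/raid1 is just a stand-in for wherever we would put it, and I'm
assuming the usual Red Hat init script for sendmail:

    # Stop sendmail, move the mail queue onto the RAID, and leave a
    # symlink behind so sendmail still finds it at the old path
    /etc/rc.d/init.d/sendmail stop
    mv /var/spool/mqueue /raid1/mqueue
    ln -s /raid1/mqueue /var/spool/mqueue
    /etc/rc.d/init.d/sendmail start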
Let me know what you think.
Thanks-in-advance,
jerry
--
Dr. Gerard P. Gilfoyle
Physics Department e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173 phone: 804-289-8255
USA fax: 804-289-8482