latest on the Richmond cluster
hi steven,
here is the latest.
the good news is the cluster seems to be working at some level. some
of the behavior is significantly different from before the upgrade
so we may still have some issues to resolve.
when i last left you i had submitted a large number of jobs (about
115) to run and left for chicago for thanksgiving. things had looked
good for smaller numbers of jobs. when i returned only the first 55
jobs had been run. the remaining ones were sitting in the batch queue
(i used the bbq and atq commands to see this). even stranger was
that the first 55 jobs that got submitted were still running root
after 3 days!! i killed those jobs by hand (kill -9), the perl scripts
finished up, and the jobs in the batch queue remained there and never
got started. my questions are the following.
1. is the apparent limit of 55 jobs fixed? can we raise it? it seems
reasonable to run two jobs per machine (one per cpu). the 'atd -l'
option looks like it should work (according to the man page).
2. after i killed the long-running root executables, i thought the
queued up jobs would get submitted, but they didn't. do you have any
idea why?
3. i noticed that root found no good events even when it ran
successfully with a smaller number of submitted jobs. this is
mysterious since this code and these scripts worked before the
upgrade. i will investigate this problem this week. if you have any
ideas, please let me know.
4. the last problem i'm having is that i submitted some jobs tonight
(sunday) and they immediately go into the 'b' queue and don't get
submitted. they are listed under the atq command, but do not appear
when i execute the bbq command. this i don't understand.
let me know what you think.
jerry
--
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482