Hi Steven,

I am one of the physicists from the University of Richmond (along with Mike Vineyard) who are using the cluster you delivered to us earlier this year. We have recently run into a problem that is limiting our ability to make full use of the cluster.

The problem is the following. Until a couple of days ago we had never tried to run multiple jobs on each slave node across the entire cluster. On Tuesday, for the first time, I submitted 148 jobs evenly distributed among the 48 slave nodes. After about 4-5 hours no more jobs were running, but I noticed that only about 1/3 of the submitted jobs had produced any usable output. Today I was running another large set of jobs and found I could no longer start any new processes, even from the command line of a shell. For example, I would type 'ls' and get back 'no more processes'.

It appears there is an upper limit on the number of processes that can be run on the master. Once you exceed that limit, it looks like any new attempts to start a process are essentially ignored. When I submitted the full set of 148 jobs, many were not run because they would have exceeded this upper limit on the allowed number of processes. Right now I can run no more than about 40 jobs on the cluster without encountering this problem. This is fewer than one job per slave node. Each job I submit starts three separate processes, so I am starting 120 processes.

Searching the web, I found discussions of this limitation and a solution (which involves building a new kernel). The URLs are below. I have also attached the scripts I am using to do the data analysis (one shell script and one Perl script).

Any help you can provide would be greatly appreciated.

Thanks in advance,

Jerry Gilfoyle

http://www.ltsp.org/documentation/lts_ig_v2.4/lts_ig_v2.4-14.html
http://www.geocrawler.com/archives/3/61/1998/10/0/2207294/

--
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone: 804-289-8255
USA                               fax: 804-289-8482
Attachments:
submit_eod3b.pl (binary data)
run_root_on_node2.sh (removed: files of this type are not accepted for delivery by the email gateway)
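
For reference, the process arithmetic described in the message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the figure of three processes per job comes from the message, while the helper names (`processes_needed`, `max_jobs_within`, `current_process_limit`) are hypothetical. The message does not say which limit is being hit. The sketch reads only the per-user soft limit (`RLIMIT_NPROC`), so a compile-time kernel limit of the kind the linked discussions appear to describe may not show up here.

```python
import resource

# From the message: each submitted job starts three separate processes.
PROCESSES_PER_JOB = 3


def processes_needed(jobs, per_job=PROCESSES_PER_JOB):
    """Total number of processes started by `jobs` submissions."""
    return jobs * per_job


def max_jobs_within(limit, already_running=0, per_job=PROCESSES_PER_JOB):
    """Largest number of jobs whose processes still fit under `limit`.

    `already_running` is a hypothetical allowance for processes that
    exist before any job is submitted (shells, daemons, and so on).
    """
    return max(0, (limit - already_running) // per_job)


def current_process_limit():
    """Return the soft per-user process limit, or None if unlimited.

    Assumption: the limit of interest is RLIMIT_NPROC. A system-wide
    kernel limit would not be reported by this call.
    """
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft


if __name__ == "__main__":
    print("40 jobs start", processes_needed(40), "processes")
    print("148 jobs start", processes_needed(148), "processes")
    print("soft RLIMIT_NPROC:", current_process_limit())
```

Comparing the count for the full submission with whatever limit applies on the master gives a rough idea of how many jobs can be submitted at once before new processes are refused.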