
[Fwd: fixing the Richmond cluster]



hi sasko,

   i am forwarding the message i received from linuxlabs with
some guidance for recompiling the kernel on the cluster. it
lists where the configuration files are located, which is
important for us if we want to do this successfully. i've
already done a couple of small things.

1. created on-disk backups of the existing kernel versions in
the /boot/ area. look for files ending in '-gpg'.

2. modified /etc/lilo.conf to add a boot option 'URlinux' that
points to one of the backups i made above. the full file is
included at the bottom of this message; a quick check for both
items is sketched just below.
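
this is just a sketch (the exact backup file names on the master
may differ):

   ls -l /boot/*-gpg      # confirm the backup kernel images are there
   /sbin/lilo -t -v       # test-parse /etc/lilo.conf without writing the boot sector
   /sbin/lilo -v          # when it looks right, write the new boot map

one thing to remember: editing /etc/lilo.conf by itself does
nothing until /sbin/lilo is re-run, so the 'URlinux' entry won't
show up at the boot prompt until that last step.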

i am glad to have you work on this, but i would like to be
around when you do it so i can learn more about it. would
Thursday morning be a good time?

i had the following thoughts on the plan for recompiling the
kernel. 

1. we should make a boot floppy in case disaster strikes.

2. we should recompile and test the kernel with NO changes just
to make sure the configuration files and such are accurate (a
rough command sequence for steps 1 and 2 is sketched below,
after the references).

3. we should increase the parameter NR_TASKS to something like
3000 and MAX_TASKS_PER_USER to 1000 in
/usr/src/linux/include/tasks.h according to the references below.

http://www.ltsp.org/documentation/lts_ig_v2.4/lts_ig_v2.4-14.html

http://www.geocrawler.com/archives/3/61/1998/10/0/2207294/

oops! i found tasks.h in /usr/src/linux/include/linux instead of
the area described in the documentation. the file tasks.h is
shown below.
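
here is roughly what i have in mind for steps 1 and 2. this is
just a sketch of the standard 2.2.x build procedure; the kernel
version string and the name of the test image are guesses on my
part, so we should confirm them on thursday.

   cd /usr/src/linux
   make oldconfig          # reuse the .config that linuxlabs left in place
   make dep                # 2.2 kernels need 'make dep' before building
   make clean
   make bzImage            # build the kernel image
   make modules            # build the modules
   make modules_install    # install modules under /lib/modules/<version>
   cp arch/i386/boot/bzImage /boot/bzImage-2.2.20p7-test   # name is just a placeholder
   # add a stanza for the new image to /etc/lilo.conf, then re-run:
   /sbin/lilo -v
   # step 1, the boot floppy (if mkbootdisk is installed):
   mkbootdisk --device /dev/fd0 `uname -r`

if mkbootdisk isn't on the machine, dd'ing the bzImage straight
onto a floppy and setting the root device with rdev should also
work with a 2.2 kernel.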

let me know what you think.

jerry


tasks.h ----------------------------------------

#ifndef _LINUX_TASKS_H
#define _LINUX_TASKS_H

/*
 * This is the maximum nr of tasks - change it if you need to
 */
 
#ifdef __SMP__
#define NR_CPUS 32              /* Max processors that can be running in SMP */
#else
#define NR_CPUS 1
#endif

#define NR_TASKS        512     /* On x86 Max about 4000 */  <-- change to 3000.

#define MAX_TASKS_PER_USER (NR_TASKS/2) <---- change to 1000.
#define MIN_TASKS_LEFT_FOR_ROOT 4


/*
 * This controls the maximum pid allocated to a process
 */
#define PID_MAX 0x8000

#endif
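
for reference, after the change in step 3 those two lines would
read as below. note that the explicit 1000 replaces the
NR_TASKS/2 default, which would otherwise become 1500 once
NR_TASKS is 3000.

#define NR_TASKS        3000    /* On x86 Max about 4000 */
#define MAX_TASKS_PER_USER 1000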


lilo.conf ----------------------------------------

boot=/dev/hda
map=/boot/map
install=/boot/boot.b
prompt
timeout=50
linear
default=linux

image=/boot/bzImage-2.2.20p7-pyro1-scyld-dolphin
        label=linux
        read-only
        root=/dev/hda1
        append="hdd=ide-scsi"

image=/boot/bzImage-2.2.17-lila.beosmp
        label=2-2-17
        append="mem=1024M"
        read-only
        root=/dev/hda1

image=/boot/bzImage-2.2.20p7-pyro1-scyld-dolphin-02-oct-21-gpg   <-- modified from here
        label=URlinux
        read-only
        root=/dev/hda1
        append="hdd=ide-scsi"


-- 
Dr. Gerard P. Gilfoyle
Physics Department                e-mail: ggilfoyl@richmond.edu
University of Richmond, VA 23173  phone:  804-289-8255
USA                               fax:    804-289-8482
--- Begin Message ---
Greetings,

I was the one who worked on the fileserver. I just had a look and it
appears to be running the correct kernel.

WRT compiling the kernel:
The config I used was left in /usr/src/linux. You can change the limit,
make clean, and then make the new kernel with the same configs.
You may also need the new ethernet module (bcm5700). Its source is in
/usr/src.

Finally, the bproc modules (bproc, vmadump, and ksyscall) are in
/usr/src/redhat/SOURCES/bproc-2.2-pyro1.

I can certainly do the kernel mod. Oddly enough, I don't know how much I
cost these days :-). I will inquire w/ Markus on that and any other
options available.

G'day,
sjames


On Fri, 18 Oct 2002, gilfoyle wrote:

> Hi Steven,
> 
>    Thank you for the response to my questions. I have been at meetings 
> all this week and just got back to doing real work. I would like to 
> update you on the status of the Richmond cluster and raise some 
> questions/proposals for dealing with the problem. First, the status.
> 
> 1. Last Tuesday (10/8/02) I submitted a large number of batch jobs 
> (about 150) to the cluster and found that only about 1/3 of them ran 
> and we had exhausted the number of available processes on the master. 
> We had to reboot to get things going again.
> 
> 2. I was able to run with a reduced number of jobs  (about 40) at a 
> time and things seemed to work. However, this means we are not making 
> full use of the cluster (actually not even half use).
> 
> 3. I have tried some other approaches since then including raising the 
> limiting load factor from its compile-time choice of 0.8 (see the man 
> page for the 'atd' command), but keep running into similar
> problems.
> 
> Next, some questions. 
> 
> Last spring we had lots of problems with the fileserver getting hung 
> during data transfer. This was later determined to be caused by cables 
> that were too long. However, in the course of debugging this problem 
> we (actually I think this was you Steve, but I'm not sure) tried a 
> number of different kernels. Are we now sure that we are using the 
> correct, optimized kernel? How can we check?
> 
> Your proposals, with my questions.
> 
> 1. We could recompile the kernel with the limit on the number of 
> processes raised. I agree this would be the least disruptive option. 
> We could ask our linux person here at Richmond to work on this. 
> However, do we have (or can we get) some listing of all the modules 
> and parameters that were used to build the current kernel? My 
> experience has been that this information is important so we don't 
> spend days trying to figure out what is in the current kernel. Steve, 
> can we get help from you on this? How much would it cost?
> 
> 2. Upgrade to your new Nimbus distribution. Mike Vineyard talked to 
> Markus Geiger about this option and it would cost us about $5000 
> ($50/cpu). I hesitate to do this since it seems like a lot of money to 
> spend to upgrade what is essentially a brand new cluster. I am also 
> unsure if this option will fix our problem. If we go down this path, 
> how quickly could the upgrade be done?
> 
> Mike and I would really like to see this problem resolved as soon as 
> possible. I am considering asking for funds to add slave nodes to the 
> cluster and I want to be sure that we can make full use of them.
> 
> Let me know what you think.
> 
> Jerry Gilfoyle
> 
> 
> 

-- 
-------------------------steven james, director of research, linux labs
... ........ ..... ....                     230 peachtree st nw ste 701
the original linux labs                             atlanta.ga.us 30303
      -since 1995                              http://www.linuxlabs.com
                                   office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------


--- End Message ---