[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: the Richmond saga continues
Greetings,
It looks like I will need to make a few additional adjustments. To be
fully effective, I need the whole cluster booted up if possible. That way,
I can look for any oddities or excettions and make it just work.
G'day,
sjames
On Mon, 11 Nov 2002, gilfoyle wrote:
> Hi Steven,
>
> The saga continues. After you made your changes last Friday I was
> able to run root on the slaves 0-2. I could execute it from the master
> using the following command.
>
> bpsh 0 root -b -q /scratch/gilfoyle/e5/24023/run_eod3.C
>
> I was also able to run my scripts for just those two nodes. On Sunday,
> I rebooted the remaining nodes (3-48), removed the /home area and put
> in a link home->/usr/home. I then started to run ten jobs which would
> run on nodes 0-9. The master hung: wouldn't budge. I rebooted the
> master and brought up slaves 0-5 and tried again and got the same
> results. After rebooting the master and slaves 0-5 this is what I have
> noticed.
>
> 1. I ran my scripts without running root and they appeared to work!
>
> 2. There are two sub-directories on slave 0, /include and /cint that
> are not visible on any of the other slaves. These two subdirectories
> are needed by root. This would seem to be a smoking gun for the
> problem except for one thing. Slave 1 seemed to run root successfully
> even though those areas are not visible to it.
>
> 3. I can run root on slaves 3-5 from the master using the bpsh
> command. The master only gets hung when I am running my script. I am
> using perl for these scripts and I have attached them to this message.
> Perhaps there is some library that perl needs??
>
> 4. The problem seems to be with the nodes that I rebooted on Sunday
> and not the ones you worked on last Friday. Did I reboot them
> incorrectly? I checked some of the permissions of directories on the
> slaves and they all appear to be the same.
>
> I have rebooted the master and nodes 0-5. I am at JLab this week so I
> can only work on this sporadically, but I will try to get as much
> done as I can.
>
> Let me know what you think.
>
> Jerry
>
> p.s. description of perl scripts:
>
> submit_eod3c.pl - main script, does some housekeeping and generates the
> input file for the batch command.
>
> run_root_on_node2.pl - copies files over to the slave, runs root, and
> cleans up.
>
>
>
>
>
>
--
-------------------------steven james, director of research, linux labs
... ........ ..... .... 230 peachtree st nw ste 701
the original linux labs atlanta.ga.us 30303
-since 1995 http://www.linuxlabs.com
office 404.577.7747 fax 404.577.7743
-----------------------------------------------------------------------