List:       openmosix-devel
Subject:    Re: [Openmosix-devel] Problem with process migration?
From:       Paul Millar <paulm () astro ! gla ! ac ! uk>
Date:       2003-02-03 15:09:33

Hi Moshe,

Sorry for the delay in replying.  After installing some more memory on the
nodes, I started to get some weird errors, random kernel oopses, ...  It
turns out some of the memory on one of the nodes was bad (while using distcc
for kernel compilation, ouch!)

So I've run all of the memtest86 tests on all nodes, then gone back and
verified the previous results, which took a bit of time ...

On Wed, 22 Jan 2003, Moshe Bar wrote:
> Do you get interrupt overrun messages in your log files? You might have
> lost some interrupts and therefore the protocol gets all confused, but
> your ifconfig wouldn't show errors just because it doesn't know about
> missed interrupts.

I don't see any mention of them in the syslog or dmesg.  They could be
occurring and just not reported, but that seems unlikely.  I've also tried
2.4.20-2, but that has the same problems.
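
For what it's worth, here is a rough sketch of how the per-interface
receive counters could be read directly rather than through ifconfig.  It
parses one line in the /proc/net/dev layout; the struct and function names
are made up for illustration, and I'm assuming the usual column order
(bytes, packets, errs, drop, fifo, frame), where the fifo column is, as far
as I know, what ifconfig reports as overruns.  As Moshe says, missed
interrupts may well not show up in these counters at all.

```c
#include <assert.h>
#include <stdio.h>

/* Receive-side error counters from one /proc/net/dev line.
 * (Hypothetical helper type, not part of openMosix.) */
struct rx_counters {
        unsigned long errs;   /* total receive errors */
        unsigned long drop;   /* packets dropped */
        unsigned long fifo;   /* FIFO overruns */
        unsigned long frame;  /* framing errors */
};

/* Parse a line such as
 *   "  eth0: 1500 10 0 0 0 0 0 0 900 8 0 0 0 0 0 0"
 * Returns 1 on success, 0 if the line does not match (e.g. a header). */
int parse_netdev_line(const char *line, char iface[16], struct rx_counters *rx)
{
        unsigned long bytes, packets;

        assert(line != NULL && iface != NULL && rx != NULL);
        return sscanf(line, " %15[^:]: %lu %lu %lu %lu %lu %lu",
                      iface, &bytes, &packets,
                      &rx->errs, &rx->drop, &rx->fifo, &rx->frame) == 7;
}
```

Feeding each line of /proc/net/dev through this before and after a failed
migration would show whether any of the four counters moved.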

I've started to narrow down the problem.  It's occurring in
deputy_main_loop() (hpc/deputy.c, line 215) because comm_recv() is
failing:

                p->mosix.dflags |= DSYNC;
                if(delay_sigs)
                        evaluate_pending_signals_in_mosix_context();
                if((type = comm_recv(&head, &hlen)) < 0)
                        deputy_die_on_communication();
                if(type & ANYTIME)
                {
                        if(deputy_handle_interim_request(type, head, hlen))
                                deputy_die_on_communication();
                }

I haven't found out why comm_recv() is failing yet; that's next on the todo
list.
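
A minimal user-space sketch of how the call site could be instrumented to
record why the receive failed before the deputy gives up.  Everything here
is hypothetical: fake_comm_recv() merely stands in for comm_recv(), and I'm
assuming (without having checked the source) that failures come back as a
negative errno-style code.  In the kernel the message would go through
printk() rather than into a buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for comm_recv(): returns a message type (>= 0)
 * on success or a negative errno-style code on failure.  The real
 * function's failure values are an assumption here. */
int fake_comm_recv(int simulated_result)
{
        return simulated_result;
}

/* Format a diagnostic for a failed receive into buf.
 * Returns 1 if the result was a failure, 0 otherwise. */
int describe_recv_failure(int type, char *buf, size_t len)
{
        assert(buf != NULL && len > 0);
        if (type >= 0) {
                buf[0] = '\0';  /* nothing to report */
                return 0;
        }
        snprintf(buf, len, "comm_recv failed: %d (%s)", type, strerror(-type));
        return 1;
}
```

Calling describe_recv_failure() on the return value just before
deputy_die_on_communication() would at least tell a timeout apart from a
reset connection, assuming the error code survives that far up the call
chain.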

Any ideas appreciated :)

Cheers,

Paul.


> On Wednesday, Jan 22, 2003, at 09:57 US/Eastern, Paul Millar wrote:
> 
> > On Tue, 14 Jan 2003, Mirko Caserta wrote:
> >> Try compiling with CONFIG_MOSIX_PIPE_EXCEPTIONS set. It should help.
> >>
> >> Also try a newer kernel (2.4.20) and patch against that, then let us
> >> know.
> >
> > Ok, I've tried 2.4.19-7 and 2.4.20-1 (both with and without
> > CONFIG_MOSIX_PIPE_EXCEPTIONS set).  All combinations have the same
> > problem: OM kills off processes with messages like
> >> Process 24613(make), uid=501, killed because it lost communication
> >> with the remote site where it was running
> >
> > From watching this happening, subjectively there's a complete loss of
> > activity, although the kernel seems to be functioning fine.  Then, after
> > a short delay (a few minutes), the kernel kills off the process.
> >
> > It's as if the network has dropped a packet.  Yet after this happens,
> > ifconfig doesn't report any lost packets on any of the nodes (OM uses TCP
> > though, so this shouldn't matter, right?).  So I suspect the problem isn't
> > with the network cards or the switch.
> >
> > As no one else is getting this error and the machines are quite slow (1x
> > P-200 & 3x P-166), it looks to me like there's a race condition within the
> > comms section of OM -- admittedly, I haven't looked at the source code
> > yet.
> >
> > Does this sound at all likely to anyone?  Any ideas how to go about
> > isolating the bug?
> >
> > Cheers,
> >
> > Paul.
> >
> 

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 
Particle Physics (Theory & Experimental) Groups                Dr Paul Millar 
Department of Physics and Astronomy                     paulm@astro.gla.ac.uk
University of Glasgow                                 paulm@physics.gla.ac.uk
Glasgow, G12 8QQ, Scotland             http://www.astro.gla.ac.uk/users/paulm 
+44 (0)141 330 4717        A54C A9FC 6A77 1664 2E4E  90E3 FFD2 704B BF0F 03E9
-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- 




_______________________________________________
Openmosix-devel mailing list
Openmosix-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openmosix-devel