
List:       openmosix-devel
Subject:    [Openmosix-devel] New Cluster-Mask Feature
From:       Moshe Bar <moshe () moelabs ! com>
Date:       2003-02-04 2:41:27

Hi folks

Several people have asked for a feature in openMosix which allows you to
specify the nodes to which a given process and its children can migrate
and the nodes to which they cannot.

Simone Ettore has just committed a new patch to the CVS which allows 
you to do just that.

Here is how it works:

/proc/[pid]/migfilter enables or disables migration filtering for the process.
/proc/[pid]/mignodes is a bitmask of nodes. The bit for a node is 2^(PE-1),
where PE is the node number.
/proc/[pid]/migpolicy is the filtering policy:
0=DENY: the process can migrate to any node except those whose bit in
mignodes is 1
1=ALLOW: the process can migrate only to nodes whose bit in mignodes is 1
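
For example, here is a minimal user-space sketch of how one might restrict
the calling process to nodes 1 and 3 (untested; it assumes the three files
accept decimal values written through a plain open/write):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Write a small string value into /proc/<pid>/<file>. */
    static int write_proc(pid_t pid, const char *file, const char *val)
    {
            char path[64];
            FILE *f;

            snprintf(path, sizeof(path), "/proc/%d/%s", (int)pid, file);
            f = fopen(path, "w");
            if (!f)
                    return -1;
            fprintf(f, "%s\n", val);
            return fclose(f);
    }

    int main(void)
    {
            pid_t self = getpid();

            /* Nodes 1 and 3: 2^(1-1) + 2^(3-1) = 1 + 4 = 5 */
            write_proc(self, "migpolicy", "1");  /* 1 = ALLOW listed nodes only */
            write_proc(self, "mignodes",  "5");  /* bitmask for nodes 1 and 3   */
            write_proc(self, "migfilter", "1");  /* turn filtering on           */
            return 0;
    }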

We will shortly also release a simple user-land tool to set the node
mask, but I would like you guys to give it a try ASAP before we release
it as openMosix 2.4.20-3.

Kind regards and many thanks to Ettore.

Moshe


On Monday, Feb 3, 2003, at 07:09 US/Pacific, Paul Millar wrote:

> Hi Moshe,
>
> Sorry for the delay in replying.  After installing some more memory on
> the nodes, I started to get some weird errors, random kernel oops, ...
> Turns out some of the memory on one of the nodes was bad (using distcc
> for kernel compilation, ouch!)
>
> So I've run all of the memtest86 tests on all nodes and went back and
> verified the previous results, which took a bit of time ...
>
> On Wed, 22 Jan 2003, Moshe Bar wrote:
>> Do you get interrupt overrun messages in your log files? You might have
>> lost some interrupts and therefore the protocol gets all confused, but
>> your ifconfig wouldn't show errors just because it doesn't know about
>> missed interrupts.
>
> I don't see any mention of them in the syslog or dmesg.  They could be
> occurring and just not reported, but that seems unlikely.  I've also
> tried 2.4.20-2, but that has the same problems.
>
> I've started to narrow down the problem.  It's occurring in
> deputy_main_loop() (in hpc/deputy.c, line 215) because comm_recv() is
> failing:
>
>                 p->mosix.dflags |= DSYNC;
>                 if(delay_sigs)
>                         evaluate_pending_signals_in_mosix_context();
>                 if((type = comm_recv(&head, &hlen)) < 0)
>                         deputy_die_on_communication();
>                 if(type & ANYTIME)
>                 {
>                         if(deputy_handle_interim_request(type, head, hlen))
>                                 deputy_die_on_communication();
>                 }
>
> I haven't found out why comm_recv() is failing; that's next on the todo
> list.
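
One quick way to see which error comm_recv() actually returns would be a
temporary printk around the failing call -- an untested sketch, assuming
comm_recv() reports failure as a negative errno-style value, as the excerpt
above implies:

        /* Sketch: log why comm_recv() failed before the deputy gives up. */
        if((type = comm_recv(&head, &hlen)) < 0)
        {
                printk(KERN_ERR "deputy_main_loop: comm_recv() failed "
                       "for pid %d with %d\n", current->pid, type);
                deputy_die_on_communication();
        }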
>
> Any ideas appreciated :)
>
> Cheers,
>
> Paul.
>
>
>> On Wednesday, Jan 22, 2003, at 09:57 US/Eastern, Paul Millar wrote:
>>
>>> On Tue, 14 Jan 2003, Mirko Caserta wrote:
>>>> Try compiling with CONFIG_MOSIX_PIPE_EXCEPTIONS set. It should help.
>>>>
>>>> Also try the newest kernel (2.4.20) and patch against that, then let
>>>> us know.
>>>
>>> Ok, I've tried 2.4.19-7 and 2.4.20-1 (both with and without
>>> CONFIG_MOSIX_PIPE_EXCEPTIONS set).  All combinations have the same
>>> problem: OM kills off processes with messages like
>>>> Process 24613(make), uid=501, killed because it lost communication
>>>> with the remote site where it was running
>>>
>>> From watching this happening, subjectively there's a complete loss of
>>> activity, although the kernel seems to be functioning fine.  Then,
>>> after a short delay (a few minutes) the kernel kills off the process.
>>>
>>> It's as if the network has dropped a packet.  Yet after this happens,
>>> ifconfig doesn't report any lost packets on any of the nodes (OM uses
>>> TCP though, so this shouldn't matter, right?).  So I suspect the
>>> problem isn't with the network cards or the switch.
>>>
>>> As no one else is getting this error and the machines are quite slow
>>> (1x P-200 & 3x P-166), it looks to me like there's a race condition
>>> within the comms section of OM -- admittedly, I haven't looked at the
>>> source code yet.
>>>
>>> Does this sound at all likely to anyone?  Any ideas how to go about
>>> isolating the bug?
>>>
>>> Cheers,
>>>
>>> Paul.
>>>
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> Particle Physics (Theory & Experimental) Groups          Dr Paul Millar
> Department of Physics and Astronomy                      paulm@astro.gla.ac.uk
> University of Glasgow                                    paulm@physics.gla.ac.uk
> Glasgow, G12 8QQ, Scotland                http://www.astro.gla.ac.uk/users/paulm
> +44 (0)141 330 4717        A54C A9FC 6A77 1664 2E4E  90E3 FFD2 704B BF0F 03E9
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
>



_______________________________________________
Openmosix-devel mailing list
Openmosix-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/openmosix-devel
