List: openmosix-devel
Subject: [Openmosix-devel] New Cluster-Mask Feature
From: Moshe Bar <moshe () moelabs ! com>
Date: 2003-02-04 2:41:27
Hi folks
Several people have asked for a feature in openMosix which allows you to
specify to which nodes a given process and its children can migrate
and to which nodes they cannot.
Simone Ettore has just committed a new patch to the CVS which allows
you to do just that.
Here is how it works:
/proc/[pid]/migfilter enables/disables migration filtering.
/proc/[pid]/mignodes is a bit-list of nodes. Node PE (where PE is the
node number) corresponds to the bit with value 2^(PE-1), i.e. bit
position PE-1.
/proc/[pid]/migpolicy is the filtering policy:
0=DENY: the process can migrate to all nodes except those whose bit
in mignodes is 1
1=ALLOW: the process can migrate only to nodes whose bit in
mignodes is 1
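The mask arithmetic and the two policies above can be sketched in C. This is a minimal illustration, not openMosix code: the names `node_bit`, `mask_for_nodes`, `may_migrate` and `enum mig_policy` are hypothetical, and the sketch assumes node numbers start at 1 and do not exceed the number of bits in an `unsigned long`.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy values, mirroring the migpolicy description. */
enum mig_policy { MIG_DENY = 0, MIG_ALLOW = 1 };

/* Node PE maps to the bit with value 2^(PE-1), i.e. bit position PE-1.
 * Assumes 1 <= pe <= number of bits in unsigned long. */
static unsigned long node_bit(unsigned int pe)
{
    return 1UL << (pe - 1);
}

/* Build a mignodes-style mask from a list of node numbers. */
static unsigned long mask_for_nodes(const unsigned int *pes, size_t n)
{
    unsigned long mask = 0;
    for (size_t i = 0; i < n; i++)
        mask |= node_bit(pes[i]);
    return mask;
}

/* Decide whether migration to node pe is permitted:
 *   filter disabled -> every node is allowed
 *   DENY            -> every node except those whose bit is set
 *   ALLOW           -> only nodes whose bit is set */
static bool may_migrate(bool filter_enabled, enum mig_policy policy,
                        unsigned long mignodes, unsigned int pe)
{
    if (!filter_enabled)
        return true;
    bool bit_set = (mignodes & node_bit(pe)) != 0;
    return policy == MIG_ALLOW ? bit_set : !bit_set;
}
```

For example, nodes 1, 3 and 4 give a mask of 1 + 4 + 8 = 13. With ALLOW the process may go only to those three nodes; with DENY it may go anywhere except them.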
We will also shortly release a simple user-land tool to set the node
mask, but I would like you guys to give it a try as soon as possible
before we release it as openMosix 2.4.20-3.
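Until that tool is out, the following is only a hypothetical sketch of setting the mask from C. It assumes each /proc entry accepts a plain decimal number followed by a newline (what `echo 13 > /proc/[pid]/mignodes` would write) and that writing 1 to migfilter enables filtering; neither format is stated in this announcement. `write_proc_value` and `set_migration_filter` are invented names.

```c
#include <stdio.h>

/* Write one unsigned value as ASCII decimal to <dir>/<name>.
 * ASSUMPTION: the openMosix /proc entries accept this format.
 * Returns 0 on success, -1 on failure. */
static int write_proc_value(const char *dir, const char *name,
                            unsigned long value)
{
    char path[256];
    FILE *f;
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);

    if (n < 0 || (size_t)n >= sizeof path)
        return -1;                      /* path did not fit */
    f = fopen(path, "w");
    if (f == NULL)
        return -1;
    if (fprintf(f, "%lu\n", value) < 0) {
        fclose(f);
        return -1;
    }
    return fclose(f) == 0 ? 0 : -1;     /* fclose flushes the write */
}

/* Hypothetical helper: configure migration filtering for the process
 * whose /proc directory is procdir (e.g. "/proc/1234").  The mask and
 * policy are written before filtering is switched on, so the filter is
 * never enabled with a stale mask. */
static int set_migration_filter(const char *procdir, unsigned long mignodes,
                                unsigned long policy)
{
    if (write_proc_value(procdir, "mignodes", mignodes) != 0)
        return -1;
    if (write_proc_value(procdir, "migpolicy", policy) != 0)
        return -1;
    return write_proc_value(procdir, "migfilter", 1UL);
}
```

Under these assumptions, `set_migration_filter("/proc/1234", 13, 1)` would restrict process 1234 to nodes 1, 3 and 4 using the ALLOW policy.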
Kind regards and many thanks to Ettore.
Moshe
On Monday, Feb 3, 2003, at 07:09 US/Pacific, Paul Millar wrote:
> Hi Moshe,
>
> Sorry for the delay in replying. After installing some more memory on
> the nodes, I started to get some weird errors, random kernel oops, ...
> It turns out some of the memory on one of the nodes was bad (using
> distcc for kernel compilation, ouch!)
>
> So I ran all of the memtest86 tests on all nodes and went back to
> verify the previous results, which took a bit of time ...
>
> On Wed, 22 Jan 2003, Moshe Bar wrote:
>> Do you get interrupt overrun messages in your log files? You might
>> have lost some interrupts and therefore the protocol gets confused,
>> but your ifconfig wouldn't show errors just because it doesn't know
>> about missed interrupts.
>
> I don't see any mention of them in the syslog or dmesg. They could be
> occurring and just not reported, but that seems unlikely. I've also
> tried 2.4.20-2, but that has the same problems.
>
> I've started to narrow down the problem. It's occurring in
> deputy_main_loop() (hpc/deputy.c, line 215) because comm_recv() is
> failing:
>
>     p->mosix.dflags |= DSYNC;
>     if (delay_sigs)
>         evaluate_pending_signals_in_mosix_context();
>     /* a negative return from comm_recv() means the receive failed */
>     if ((type = comm_recv(&head, &hlen)) < 0)
>         deputy_die_on_communication();
>     /* messages flagged ANYTIME go to the interim-request handler */
>     if (type & ANYTIME)
>     {
>         if (deputy_handle_interim_request(type, head, hlen))
>             deputy_die_on_communication();
>     }
>
> I haven't found out why comm_recv() is failing yet; that's next on my
> todo list.
>
> Any ideas appreciated :)
>
> Cheers,
>
> Paul.
>
>
>> On Wednesday, Jan 22, 2003, at 09:57 US/Eastern, Paul Millar wrote:
>>
>>> On Tue, 14 Jan 2003, Mirko Caserta wrote:
>>>> Try compiling with CONFIG_MOSIX_PIPE_EXCEPTIONS set. It should help.
>>>>
>>>> Also try the newest kernel (2.4.20) and patch against that, then let
>>>> us know.
>>>
>>> Ok, I've tried 2.4.19-7 and 2.4.20-1 (both with and without
>>> CONFIG_MOSIX_PIPE_EXCEPTIONS set). All combinations have the same
>>> problem: OM kills off processes with messages like
>>>> Process 24613(make), uid=501, killed because it lost communication
>>>> with the remote site where it was running
>>>
>>> From watching this happen, subjectively there's a complete loss of
>>> activity, although the kernel seems to be functioning fine. Then,
>>> after a short delay (a few minutes), the kernel kills off the process.
>>>
>>> It's as if the network has dropped a packet. Yet after this happens,
>>> ifconfig doesn't report any lost packets on any of the nodes (OM uses
>>> TCP though, so this shouldn't matter, right?). So I suspect the
>>> problem isn't with the network cards or the switch.
>>>
>>> As no one else is getting this error and the machines are quite slow
>>> (1x P-200 & 3x P-166), it looks to me like there's a race condition
>>> within the comms section of OM -- admittedly, I haven't looked at the
>>> source code yet.
>>>
>>> Does this sound at all likely to anyone? Any ideas how to go about
>>> isolating the bug?
>>>
>>> Cheers,
>>>
>>> Paul.
>>>
>>> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
>>> Particle Physics (Theory & Experimental) Groups    Dr Paul Millar
>>> Department of Physics and Astronomy                paulm@astro.gla.ac.uk
>>> University of Glasgow                              paulm@physics.gla.ac.uk
>>> Glasgow, G12 8QQ, Scotland                         http://www.astro.gla.ac.uk/users/paulm
>>> +44 (0)141 330 4717    A54C A9FC 6A77 1664 2E4E 90E3 FFD2 704B BF0F 03E9
>>> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
>>>
>>>
>>>
>>>
>>> -------------------------------------------------------
>>> This SF.net email is sponsored by: Scholarships for Techies!
>>> Can't afford IT training? All 2003 ictp students receive
>>> scholarships.
>>> Get hands-on training in Microsoft, Cisco, Sun, Linux/UNIX, and more.
>>> www.ictp.com/training/sourceforge.asp
>>> _______________________________________________
>>> Openmosix-devel mailing list
>>> Openmosix-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/openmosix-devel
>>>