
List:       ssic-linux-devel
Subject:    [SSI-devel] [ ssic-linux-Bugs-2000692 ] Attempting to build glibc
From:       "SourceForge.net" <noreply () sourceforge ! net>
Date:       2008-06-26 8:59:42
Message-ID: E1KBnKQ-00068G-SV () 565xhf1 ! ch3 ! sourceforge ! com

Bugs item #2000692, was opened at 2008-06-23 07:03
Message generated for change (Comment added) made by rogertsang
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=2000692&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Process Management
Group: v2.0.0pre1
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Roger Tsang (rogertsang)
Summary: Attempting to build glibc reliably crashes node

Initial Comment:
Attempting to build glibc on latest (20/6/2008) kernel from CVS crashes the node:

kernel BUG at fs/exec.c:820!
invalid operand: 0000 [#1]
SMP 
Modules linked in: parport_pc parport floppy uhci_hcd ohci_hcd ehci_hcd ide_scsi \
                scsi_mod i2c_piix4 i2c_core dm_snapshot dm_mirror dm_mod ext3 jbd \
                ne2k_pci 8390
CPU:    0
EIP:    0060:[<c01796f4>]    Not tainted VLI
EFLAGS: 00000202   (2.6.11-ssi-8k-686-smp) 
EIP is at flush_old_exec+0x714/0x760
eax: 00000001   ebx: 00000010   ecx: cf726c00   edx: cf726c18
esi: c4b89290   edi: cf71e680   ebp: c5133e4c   esp: c5133df4
ds: 007b   es: 007b   ss: 0068
Process ld-linux.so.2 (pid: 87475, threadinfo=c5132000 task=c530ccb0)
Stack: c4b89290 00000011 00000000 00000000 cf669a44 c4b8974c 00000001 00000080 
       c5132000 00000000 00000000 00000000 cc632170 cf669540 ce97a580 cf732e00 
       00000080 c5133e3c 00000080 ce359aa0 ce359ad8 c0498392 c5133efc c019c116 
Call Trace:
 [<c01067af>] show_stack+0x7f/0xa0
 [<c0106954>] show_registers+0x164/0x230
 [<c0106d04>] die+0xf4/0x1c0
 [<c0106e56>] do_trap+0x86/0xd0
 [<c0107148>] do_invalid_op+0xb8/0xd0
 [<c0106413>] error_code+0x2b/0x30
 [<c019c116>] load_elf_binary+0x356/0xc70
 [<c0179a3e>] search_binary_handler+0x8e/0x260
 [<c0179e8a>] ssi_do_execve+0x23a/0x350
 [<c0179c3f>] do_execve+0x2f/0x40
 [<c0102cd2>] sys_execve+0x42/0xc0
 [<c01058ab>] syscall_call+0x7/0xb
Code: c7 49 c0 e8 ef e8 fa ff e9 d5 fd ff ff c7 04 24 48 c7 49 c0 e8 de e8 fa ff e9 \
99 fd ff ff 0f 0b 94 02 41 ae 49 c0 e9 d5 fe ff ff <0f> 0b 34 03 41 ae 49 c0 e9 5a f9 \
ff ff 89 34 24 e8 47 6f fb ff   
Entering kdb (current=0xc530ccb0, pid 87475) on processor 0 Oops: invalid operand
due to oops @ 0xc01796f4
eax = 0x00000001 ebx = 0x00000010 ecx = 0xcf726c00 edx = 0xcf726c18 
esi = 0xc4b89290 edi = 0xcf71e680 esp = 0xc5133df4 eip = 0xc01796f4 
ebp = 0xc5133e4c xss = 0x00000068 xcs = 0x00000060 eflags = 0x00000202 
xds = 0x0000007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xc5133dc0
[0]kdb> 


----------------------------------------------------------------------

> Comment By: Roger Tsang (rogertsang)
Date: 2008-06-26 04:59

Message:
Logged In: YES 
user_id=1246761
Originator: NO

File Added: ssic-linux-bugs-2000692_2.patch

----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-06-26 04:42

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Try attached patch.
- Fix stale vproc hash chain list after switching pids.
- de_thread() wait for asynchronous release_task() to complete.

# ./a.out /bin/echo
Timed out: killed the child process
File Added: ssic_linux_bugs.2000692-1.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-25 06:47

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's how things work on the standard 2.6.11 kernel.

We call exec from a thread, so we get to de_thread:

   de_thread (pid=1946, tgid=1945, count = 2)
   
   Note that sig->count is 2, i.e. the signal_struct is shared between the
two threads.

   Here's how we got to de_thread (inlined into flush_old_exec):

 [<c016f8c7>] flush_old_exec+0xb7/0x830
 [<c018fc36>] load_elf_binary+0x356/0xc10
 [<c017034e>] search_binary_handler+0x8e/0x250
 [<c0170697>] do_execve+0x187/0x220
 [<c0102c42>] sys_execve+0x42/0xc0
 [<c01040d7>] syscall_call+0x7/0xb

de_thread calls zap_other_threads to zap the other threads. 
zap_other_threads kills the process thread leader:

   zap_other_threads: send SIGKILL to 1945

de_thread now waits for the non-leader threads to finish with the
signal_struct:

   de_thread: Wait for count (2) to fall to 2

Then, since it's not running in the leader thread it waits for the leader
thread to become a zombie (EXIT_ZOMBIE = 16):

   de_thread: wait for 1945, state 0 to go to state 16

As the state is not yet zombie, it calls yield, which lets the leader get
to exit_notify and set its state to EXIT_ZOMBIE.

    exit_notify: set 1945 to EXIT_ZOMBIE

Here's how we get to exit_notify:

 [<c0125a33>] exit_notify+0x473/0x910
 [<c01260a1>] do_exit+0x1d1/0x330
 [<c0126279>] do_group_exit+0x39/0xc0
 [<c012fdb5>] get_signal_to_deliver+0x215/0x330
 [<c0103ee0>] do_signal+0x70/0x130
 [<c0103fdb>] do_notify_resume+0x3b/0x40
 [<c0104122>] work_notifysig+0x13/0x15

Then we get to __exit_signal for the thread leader to decrement the
sig->count:

   __exit_signal: pid=1946 count=2

Here's how we get to __exit_signal:

 [<c012dc8b>] __exit_signal+0x4b/0x180
 [<c01249e3>] release_task+0x63/0x140
 [<c016fbf8>] flush_old_exec+0x3e8/0x830
 [<c018fc36>] load_elf_binary+0x356/0xc10
 [<c017034e>] search_binary_handler+0x8e/0x250
 [<c0170697>] do_execve+0x187/0x220
 [<c0102c42>] sys_execve+0x42/0xc0
 [<c01040d7>] syscall_call+0x7/0xb

Aha, we get to __exit_signal from release_task, called from de_thread
after the leader has become a zombie.

What is different in OpenSSI?  This is:

#ifdef CONFIG_VPROC
                do_notify_parent(leader, SIGCHLD, 0, 0);
#else
                release_task(leader);
#endif

So on OpenSSI we don't call release_task, which would do the necessary
cleanup, we notify the parent.

Isn't this just wrong?  An overdone search & replace maybe?


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-24 10:23

Message:
Logged In: YES 
user_id=166336
Originator: YES

So leader->exit_state is getting set to EXIT_ZOMBIE but
__exit_signal(leader) is not being called.

leader->exit_state is being set to EXIT_ZOMBIE in do_exit (kernel/exit.c)

fastcall NORET_TYPE void do_exit(long code)
{
[...]
        tsk->exit_code = code;
        exit_notify(tsk);
#ifdef CONFIG_VPROC
        tsk->exit_state = EXIT_ZOMBIE;
        /* Let pproc_reap know we're done with exit processing */
        SIGNAL_EVENT(&(tsk->p_exiting));
#endif

On a non-SSI system leader->exit_state gets set to EXIT_ZOMBIE in
exit_notify.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-24 07:28

Message:
Logged In: YES 
user_id=166336
Originator: YES

So the bug is simply that if a thread that is not the group leader calls
exec, the node crashes, because the code that's supposed to clean up all
running threads doesn't work when called from something that's not the
group leader.

Attached a very simple test program.

$ cc -pthread exec-crash.c -o exec-crash
$ ./exec-crash

File Added: exec-crash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-24 04:51

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code in fs/exec.c where it complains:

/*
 * This function makes sure the current process has its own signal table,
 * so that flush_signal_handlers can later reset the handlers without
 * disturbing other processes.  (Other processes might share the signal
 * table via the CLONE_SIGHAND option to clone().)
 */
static inline int de_thread(struct task_struct *tsk)
{
[...]
     
no_thread_group:
        BUG_ON(atomic_read(&sig->count) != 1);
        exit_itimers(sig);

So it's unhappy that sig->count is not 1, i.e. that the signal struct is
still shared.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-23 12:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, it was tst-exec4, not tst-exec3.

To demonstrate the crash, compile the attached tst-exec4.c file with
pthreads then run with some random binary as its argument:

$ cc -pthread tst-exec4.c
$ ./a.out /bin/echo


File Added: tst-exec4.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-23 08:52

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, just running that test crashes the system.  I'll try and package up a
neat testcase.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-23 08:47

Message:
Logged In: YES 
user_id=166336
Originator: YES

The crash seems to happen as the glibc build process is testing its
interface to exec:

GCONV_PATH=/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/iconvdata
LC_ALL=C  
/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/elf/ld-linux.so.2
--library-path
/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl:/usr/local/src/debian/glibc \
-2.3.6.ds1/build-tree/i386-nptl/math:/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/ \
i386-nptl/elf:/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/dlfcn:/usr/lo \
cal/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/nss:/usr/local/src/debian/glibc-2. \
3.6.ds1/build-tree/i386-nptl/nis:/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386 \
-nptl/rt:/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/resolv:/usr/local/ \
src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/crypt:/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/nptl
                
/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/nptl/tst-exec3 
> 
/usr/local/src/debian/glibc-2.3.6.ds1/build-tree/i386-nptl/nptl/tst-exec3.out

Or something like that.  :-)



----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-23 08:00

Message:
Logged In: YES 
user_id=166336
Originator: YES

For interest I've tried this with the old 2.6.10 based kernel, and the
same bug happens (also a "sleeping function called from invalid context"
but I think Roger has fixed that one for 2.6.11).


----------------------------------------------------------------------
