BusyBox Bug and Patch Tracking

ID: 0001013
Category: [uClibc] Posix Threads
Severity: major
Reproducibility: always
Date Submitted: 08-31-06 09:04
Last Update: 10-11-08 03:07
Reporter: jalaber
View Status: public
Assigned To: khem
Priority: normal
Resolution: open
Status: assigned
Product Version:
Summary: 0001013: pthread_cancel/pthread_join sequence hangs when using select in another thread
Description:
Hello,

I have found a very strange bug in uClibc using pthread_cancel/pthread_join.

My test program launches one thread, which basically makes a select call with a struct timeval set to 600 ms. The main thread then calls pthread_cancel and pthread_join, followed by a printf. The program hangs.
However, if you remove the printf call, the program terminates normally. I have tried replacing the select call with a sem_wait call, and everything works fine with or without the printf. So the problem seems to happen only with select.

I use buildroot with kernel 2.4.28 and uClibc 0.9.28. I have attached a program that reproduces the problem. If you comment out the printf("join OK\n") call, it works for me.

Thank you for your time and help,
Philippe.
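
For readers without access to the attachment, the scenario described above can be sketched roughly as follows. This is a reconstruction from the description only, not the attached pthread_join_test.c: the function names select_worker and run_cancel_join_test, the loop around select(), and the 100 ms settle delay are all assumptions.

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/select.h>
#include <sys/time.h>

/* Worker (hypothetical name): sleeps in select() with a 600 ms timeout.
 * On a libc where select() is a cancellation point, a pending
 * pthread_cancel() terminates the thread inside this call. */
static void *select_worker(void *arg)
{
    (void)arg;
    for (;;) {
        struct timeval tv = { .tv_sec = 0, .tv_usec = 600 * 1000 };
        select(0, NULL, NULL, NULL, &tv);
    }
}

/* Main-thread sequence from the report: cancel, join, then printf.
 * Returns 0 if the worker was canceled and joined, -1 otherwise.
 * On an affected libc this call would be expected to hang instead. */
static int run_cancel_join_test(void)
{
    pthread_t th;
    void *ret = NULL;
    struct timeval settle = { .tv_sec = 0, .tv_usec = 100 * 1000 };

    if (pthread_create(&th, NULL, select_worker, NULL) != 0)
        return -1;

    /* Assumed delay so the worker is likely blocked in select(). */
    select(0, NULL, NULL, NULL, &settle);

    if (pthread_cancel(th) != 0)
        return -1;
    if (pthread_join(th, &ret) != 0)
        return -1;

    printf("join OK\n");
    return ret == PTHREAD_CANCELED ? 0 : -1;
}
```

Calling run_cancel_join_test() from main() and linking with -lpthread should be enough to try it; on a libc where select() is a cancellation point it is expected to return 0.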
Additional Information
Attached Files:
 pthread_join_test.c (921 bytes) 08-31-06 09:04
 uClibc-select-cancellation-point.patch (603 bytes) 06-30-07 14:18

- Notes
(0001715)
dwagner
10-23-06 11:43

I think this issue is the reason the LIRC driver of directfb RC1 does not terminate. The driver uses select() and pthread_cancel()/pthread_join().

Please fix that.
 
(0001744)
vapier
11-16-06 23:07

here's a tip ... saying things like "Please fix that." makes people think "Fix it your goddamn self."

the hang may be because of the IO mutex being held by the canceled thread ... if you turn on PDEBUG in libpthread/linuxthreads/debug.h, that may give you helpful output
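
To illustrate the general hazard behind that guess (a lock still held by a thread at the moment it is canceled), here is a minimal sketch using an ordinary user mutex rather than the libc IO mutex. The names demo_lock, unlock_on_cancel, locked_worker and cancel_then_lock are hypothetical, and nothing here confirms that this is what actually happens inside uClibc.

```c
#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/* Cleanup handler: runs when the thread is canceled while it holds the lock. */
static void unlock_on_cancel(void *arg)
{
    pthread_mutex_unlock((pthread_mutex_t *)arg);
}

static void *locked_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&demo_lock);
    pthread_cleanup_push(unlock_on_cancel, &demo_lock);
    for (;;)
        sleep(1);               /* sleep() is a cancellation point */
    pthread_cleanup_pop(1);     /* not reached; pairs with the push above */
    return NULL;
}

/* Returns 0 if the lock is free again after the worker was canceled.
 * Without the cleanup handler the mutex would stay locked forever and
 * a later pthread_mutex_lock() in the main thread would block. */
static int cancel_then_lock(void)
{
    pthread_t th;

    if (pthread_create(&th, NULL, locked_worker, NULL) != 0)
        return -1;
    if (pthread_cancel(th) != 0)
        return -1;
    if (pthread_join(th, NULL) != 0)
        return -1;
    if (pthread_mutex_trylock(&demo_lock) != 0)
        return -1;
    pthread_mutex_unlock(&demo_lock);
    return 0;
}
```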
 
(0002526)
chombourger
06-27-07 12:58

I tried to reproduce this issue on uClibc 0.9.29 running on a PC with a 2.6 Linux kernel, and your program is still running as I am typing these lines!
Is that what you are getting? Running the same program on the host (compiled and linked against glibc) worked. I will now try to enable the debug traces to see if that helps.
 
(0002527)
chombourger
06-27-07 13:15

traces with debug enabled in linuxthreads.old:

26294 : __pthread_initialize_manager: manager stack: size=8160, bos=0x804a150, tos=0x804c130
26294 : __pthread_initialize_manager: send REQ_DEBUG to manager thread
26294 : pthread_create: write REQ_CREATE to manager thread
26294 : pthread_create: before suspend(self)
26295 : __pthread_manager: before poll
26295 : __pthread_manager: after poll
26295 : __pthread_manager: before __libc_read
26295 : __pthread_manager: after __libc_read, n=148
26295 : __pthread_manager: got REQ_CREATE
26295 : pthread_handle_create: cloning new_thread = 0xbf1ffe20
26295 : pthread_handle_create: new thread pid = 26296
26295 : __pthread_manager: restarting -1208466944
26294 : pthread_create: after suspend(self)
26295 : __pthread_manager: before poll
26296 : pthread_start_thread:
step 0
step 1
step 2
26295 : __pthread_manager: after poll
26295 : __pthread_manager: before poll
step 3
cancel th...
26294 : pthread_cancel: sending cancel signal to 26296
26294 : pthread_cancel: kill returned 0
 
(0002528)
chombourger
06-27-07 13:26
edited on: 06-30-07 14:22

It seems that the created thread has no jmpbuf when pthread_handle_sigcancel() is called in it; the signal handler then returns without rerouting the thread.

select() does not behave like a cancellation point (although it should). Could it be because select() is simply a syscall5 and we therefore never reach the sigwait() function of the pthread library?

If I modify the select() call as follows, the thread is indeed canceled:

r = select(0, NULL, NULL, NULL, &tv);
if ((r == -1) && (errno == EINTR)) pthread_testcancel();

I eventually found where linuxthreads.old defines cancellable system calls (wrapsyscall.c) and added an entry for select(2).
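
As a rough illustration of the idea, a cancellable wrapper can switch the calling thread to asynchronous cancellation around the blocking call and restore the previous type afterwards. The sketch below does this at user level; it is not the attached patch, and cancellable_select, wrapped_worker and cancel_wrapped_worker are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>

/* Hypothetical user-level wrapper: makes the blocking select() call
 * cancellable by enabling asynchronous cancellation around it. */
static int cancellable_select(int nfds, fd_set *rd, fd_set *wr, fd_set *ex,
                              struct timeval *tv)
{
    int oldtype, ignored, r;

    pthread_setcanceltype(PTHREAD_CANCEL_ASYNCHRONOUS, &oldtype);
    r = select(nfds, rd, wr, ex, tv);
    pthread_setcanceltype(oldtype, &ignored);
    return r;
}

/* Worker that only ever blocks inside the wrapper. */
static void *wrapped_worker(void *arg)
{
    (void)arg;
    for (;;) {
        struct timeval tv = { .tv_sec = 0, .tv_usec = 600 * 1000 };
        cancellable_select(0, NULL, NULL, NULL, &tv);
    }
}

/* Returns 0 if the worker could be canceled and joined, -1 otherwise. */
static int cancel_wrapped_worker(void)
{
    pthread_t th;
    void *ret = NULL;

    if (pthread_create(&th, NULL, wrapped_worker, NULL) != 0)
        return -1;
    if (pthread_cancel(th) != 0)
        return -1;
    if (pthread_join(th, &ret) != 0)
        return -1;
    return ret == PTHREAD_CANCELED ? 0 : -1;
}
```

Unlike the pthread_testcancel() workaround, a wrapper of this kind does not rely on the call returning EINTR first, but the caller must not hold resources across it that an asynchronous cancel would leak.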

Note: select() was previously listed as a cancellation point, but it was removed by Ulrich Drepper and I don't know why:

CVSROOT: /cvs/glibc
Module name: libc
Changes by: drepper@sourceware.org 2002-12-15 13:43:25

Modified files:
    linuxthreads : wrapsyscall.c

Log message:
    Remove creat, poll, pselect, readv, select, sigpause, sigsuspend,
    sigwaitinfo, waitid, and writev wrappers.

I have attached to this report a patch re-introducing select(2).

 
(0002712)
hmoffatt
09-04-07 21:25

I have an application which is hanging with uClibc 0.9.29. The main process is regularly calling fork() and exec(). There is also a thread which is doing a select() with a 1ms timeout in an endless loop. The application sometimes hangs just after the fork/exec; the exec has happened (there is a zombie process left around).

I tried the patch in this bug report; now the program segfaults instead of hanging. So I don't think the patch is the correct solution.
 
(0002713)
chombourger
09-05-07 00:30

Three questions:

   (a) have you tried running this program on a glibc system and does it work?
   (b) is your app. making any use of pthread_cancel() and pthread_join()?
   (c) can you provide us a stripped down version of your application so that we can reproduce the bug/segfault with it?
 
(0002714)
hmoffatt
09-05-07 02:58

My application is in Python. At the time my hang occurs I am not using pthread_join or cancel; a finite set of threads have been created some time earlier and should continue to exist for the life of the program.

Hence the original problem in this report doesn't describe my situation. Nonetheless the patch had some impact which suggests it may not be right.

My original process (not one of the threads) is regularly calling fork() and exec() (via python wrappers). It appears that the process hangs somewhere after exec(), before returning to my interpreted program. strace shows that the program did get SIGCHLD as the last thing that happened, meaning that the child has exited. Then it seems to be sleeping waiting for something to happen.

There is another thread which is calling select() with a 1ms sleep indefinitely. I think it hangs also though I will have to retest to be sure.

The thread manager thread seems to be still running ok, calling poll() with a 1 second timeout. strace shows it is still running.

When I put in the patch from this report, the select() thread dies with SIGSEGV.

I am trying to build glibc for the embedded system now. I will also try to run it on my desktop with glibc and reproduce it.
 
(0002721)
hmoffatt
09-05-07 21:58

I have now tested this with glibc 2.3.6 with linuxthreads (not NPTL). It segfaults just the same as with your patch for uClibc.

I guess that could be considered a positive sign.. probably meaning that my bug is somewhere else.
 
(0002722)
chombourger
09-06-07 01:14

OK, interesting point. Do you have a backtrace from the crash? I am wondering whether the segfault occurs within libc.
 
(0002723)
hmoffatt
09-06-07 07:45

My cross-gdb insists on trying to load the host libraries rather than the target ones so I can't get a meaningful back trace even though I have the build and a core file. :(

Looks like it is reading the absolute paths from the core file; I really need to be able to prepend a path to those when loading into the cross-gdb. Is that possible?

I have had even less success doing a live cross-gdb against gdbserver.
 
(0002724)
chombourger
09-06-07 07:51

You could use the gdb setting 'set solib-absolute-prefix PATH'
to tell gdb the base prefix of your target file-system

That used to work for me. Hope this helps!
 
(0002725)
hmoffatt
09-06-07 16:27

Great, solib-absolute-prefix was exactly what I needed.

I got the following when tracing against glibc; I don't have a build with the latest uclibc ready to test at the moment.

Core was generated by `/usr/bin/python /opt/calyptech/lib/webserver/server.py'.
Program terminated with signal 11, Segmentation fault.
#0  0x4021b8ec in sem_wait () from /home/hamish/work/robots/glibc-romfs/lib/libpthread.so.0
(gdb)
(gdb) bt
#0  0x4021b8ec in sem_wait () from /home/hamish/work/robots/glibc-romfs/lib/libpthread.so.0
#1  0xbe5ff54c in ?? ()

gdb insists that my core file does not match the python binary, though, so it may be confused. I'm sure the two do match (I copied the binary back out of the embedded system). I'm not sure whether this is related to threads or not.
 
(0009924)
thomask
07-23-08 06:32

Is there any update on this bug? Looks like it's still there.
 

- Issue History
Date Modified Username Field Change
08-31-06 09:04 jalaber New Issue
08-31-06 09:04 jalaber Status new => assigned
08-31-06 09:04 jalaber Assigned To  => uClibc
08-31-06 09:04 jalaber File Added: pthread_join_test.c
10-23-06 11:41 dwagner Issue Monitored: dwagner
10-23-06 11:43 dwagner Note Added: 0001715
11-16-06 23:07 vapier Note Added: 0001744
06-27-07 12:58 chombourger Note Added: 0002526
06-27-07 13:15 chombourger Note Added: 0002527
06-27-07 13:26 chombourger Note Added: 0002528
06-27-07 13:52 chombourger Note Edited: 0002528
06-30-07 12:49 chombourger Note Edited: 0002528
06-30-07 14:18 chombourger File Added: uClibc-select-cancellation-point.patch
06-30-07 14:22 chombourger Note Edited: 0002528
07-01-07 22:26 chombourger Issue Monitored: chombourger
09-04-07 21:25 hmoffatt Note Added: 0002712
09-05-07 00:30 chombourger Note Added: 0002713
09-05-07 02:58 hmoffatt Note Added: 0002714
09-05-07 21:58 hmoffatt Note Added: 0002721
09-06-07 01:14 chombourger Note Added: 0002722
09-06-07 07:45 hmoffatt Note Added: 0002723
09-06-07 07:51 chombourger Note Added: 0002724
09-06-07 16:27 hmoffatt Note Added: 0002725
07-23-08 06:32 thomask Note Added: 0009924
10-11-08 03:07 khem Assigned To uClibc => khem

