I know that this question may sound quite vague but I don't know how else to address it. I have a process with several worker threads. When the process is closing I need to gracefully shut it down. I use a well established and tried mechanism of a signaling "stop event" to let all worker threads know that it's time to quit, while at the same time waiting for all thread handles to become signaled.
Here's the code for the waiting part:
//Signal threads that it's time to stop
SetEvent(hEventStopNow);
//And wait for all threads to be signaled
DWORD dwR = WaitForMultipleObjects(nThreadCount, pThreadHandles, TRUE, 5000);
Most times the waiting API above returns almost immediately, but maybe 1% of the times it does not and times out with error code 258. The situation is made worse by the following factors:
1. The process in question is a screen saver
2. And it is running on a client's machine that I don't have access to.
The worker threads are:
A. One that does graphics rendering using GDI+
B. Then another thread that preps graphics for the thread A
(The access to these objects are synchronized between threads using mutexes)
OK, now my question. Is there any way to find out where a thread is locked up when or after a waiting API times out, or which API was called in it last?
PS. Please note that I'm not asking to check if my process of synchronization between threads is correct, I've done that many times myself.
Igor Vartanov
November 6th, 2010, 06:25 PM
Is there any way to find out where a thread is locked up when or after a waiting API times out, or which API was called in it last?Exhaustive logging operations may help, but strictly saying, the answer is 'no' (just because logging itself interferes with thread timings and ultimately distorts the picture).
PS. Another approach would be running the app in remote debugging session. But this surely cannot/mustn't be practiced on client site.
PPS. The situation looks like threads deadlock each other, so you should review your code for proper resource freeing order after signaling shutdown.
PPPS. Oh, just crossed my mind, you could snap a minidump and analyze it in the lab for thread states as well.
I've done that many times myself. Never trusted statements like this. You know why? This is you who did that many times, yeah, but this is the same you who still have troubles with what you did. Please note, this is not a criticism, just a little side note. ;)
Arjay
November 6th, 2010, 06:48 PM
Let's start in reverse order.
PS. Please note that I'm not asking to check if my process of synchronization between threads is correct, I've done that many times myself.Why put restrictions on what we can address to help you? You say your synchronization is correct, but how do you know (after all, if everything was working you wouldn't be posting)?
How about posting a snippet of the shared resources and how you are using the mutext to protect it? Btw, why are you using a mutex anyway?
The situation is made worse by the following factors:
...
2. And it is running on a client's machine that I don't have access to.
Consider logging some trace messages to a file or to the event log for the thread signalling/shutdown operations.
OK, now my question. Is there any way to find out where a thread is locked up when or after a waiting API times out, or which API was called in it last? When you receive a WAIT_TIMEOUT (258), call ExitCodeThread on each thread and retrieve the return code or STILL_ACTIVE (259).
Most times the waiting API above returns almost immediately, but maybe 1% of the times it does not and times out with error code 258. Post the thread code so we can see how you are waiting for the stop event.
As an idea of what can go wrong, consider the following code:
while( TRUE )
{
switch( WaitForMultipleEvents( .... ) )
case WAIT_OBJECT_0 + 0: // Shutdown event
return 1;
break;
case WAIT_OBJECT_0 + 1:
DoGraphicsWork( );
break;
... more switch cases
}
Keep in mind that in the above code the thread is only going to return immediately when the stop event is set if the thread is waiting in the WaitFor call.
If the thread is doing some work (like in the DoGraphicsWork call), it has to return out of that call before it will be able to respond to the stop event.
If DoGraphics accesses some shared resource that isn't synchronized properly, another thread may not be releasing the lock (perhaps when it receives its shutdown event) causing this thread to wait indefinitely.
Showing some code of the shared resource synchronization, and the thread procs will help us help you.
Alex F
November 7th, 2010, 07:26 AM
Some general points.
Does this happen only when stop event is set? Without setting this event, both threads are running smoothly, without deadlocks? In this case chevk both threads for the case when the only thread is running, and another thread already exited.
hEventStopNow should be manual reset event, I guess it is.
Every wait operation in both threads must include hEventStopNow. If there are some input operations, they should be overlapped.
Zaccheus
November 8th, 2010, 11:55 AM
The worker threads are:
A. One that does graphics rendering using GDI+
B. Then another thread that preps graphics for the thread A
(The access to these objects are synchronized between threads using mutexes)
Could using GDI+ from a worker thread be causing problems?
ahmd
November 8th, 2010, 03:23 PM
Thank you all, guys. And I apologize for this delay. Let me address your points.
Exhaustive logging operations may help...
As you also pointed out, Igor, they can also interfere with the thread timings. I don't know if I mentioned this but I could not reproduce this behavior on my system. The only indication that this happens comes from event logs from a client. And as you also said remote debugging will be quite a cumbersome and probably futile way to address this issue since it will also introduce its own delay. As for minidumps I always had a hard time analyzing them.
Never trusted statements like this. You know why?I agree. And I came off too pompous in it. I put it there so that people weren't asking me to check the synchronization, which was quite simple and shouldn't have caused any issues by itself. [PS. I'm sure this phrase doesn't help the original statement either :)]
How about posting a snippet of the shared resources and how you are using the mutext to protect it? Btw, why are you using a mutex anyway?
I was originally using a critical section but then I had some doubts that maybe it was locking up with a possible repeated call or something, thus I reworked a synchronization object to use the wait on a mutex and a stop event together. Here's a code snippet for a synchronized object:
private:
HANDLE hCSMutex; //Synchronization mutex
HANDLE hSynchObject[CAE_COUNT]; //Synchronization objects
int nCntCSObjs; //Number of valid objects in 'hSynchObject'
int nV1; //Object being synch'ed
//Copy constructor and assignments are NOT available!
//(Create new object instances instead)
SYNCHED_V1_DATA(const SYNCHED_V1_DATA& s)
{
}
SYNCHED_V1_DATA& operator = (const SYNCHED_V1_DATA& s)
{
}
};
As an idea of what can go wrong, consider the following code:
while( TRUE )
{
switch( WaitForMultipleEvents( .... ) )
case WAIT_OBJECT_0 + 0: // Shutdown event
return 1;
break;
case WAIT_OBJECT_0 + 1:
DoGraphicsWork( );
break;
... more switch cases
}
Yes, that is exactly how it goes inside each worker thread, and I understand that if execution locks somewhere within DoGraphicsWork( ); it will pose a problem. But since it's a graphics rendering thread everything should be running (fairly) quickly within the DoGraphicsWork( ); method, that should definitely return within 5 second margin.
Does this happen only when stop event is set?
Yes.
hEventStopNow should be manual reset eventYes, of course.
Every wait operation in both threads must include hEventStopNowYes, of course.
Could using GDI+ from a worker thread be causing problems?
Thank you for bringing it up because that was my main concern. In the back of my head I thought, "what if GDI+ does not support multi-threading like I do it here?" I still don't know the answer, by the way.
But I think I found a possible culprit for this issue (which is less mundane). I found one SendMessage in the worker thread that is calling to the main thread and that is a red flag #1. So I replaced it with a PostMessage and now have to wait if the glitch repeats again. (Which is quite painful by itself.)
On a side note, I want to ask you. This was always confusing for me, since some system APIs can call SendMessage themselves internally. One of my worker threads may call GetWindowLongPtr(GWL_STYLE), SetWindowLongPtr(GWL_STYLE) and SetWindowPos(). Can those APIs call SendMessage internally?
Igor Vartanov
November 8th, 2010, 04:07 PM
SendMessage can cause a problem only when message handler gets blocked internally. And SendMessage itself is harmless most of the times calling main thread which typically is the latest thread freeing its resources and quitting the process. Of course, the main rule of the main thread is: no waiting in main thread.
Arjay
November 8th, 2010, 04:47 PM
I'm not what you posted is pseudo code or is close to the real code. Consider:
BOOL Get_nV1(int& v1)
{
//"GET" method
//RETURN: TRUE if stop event is set (thus, need to exit ASAP w/o any further processing)
BOOL bR = ::WaitForMultipleObjects(nCntCSObjs, hSynchObject, FALSE, INFINITE) == (WAIT_OBJECT_0 + CAE_STOP_EVENT);
If the real code is close to what you have posted, then you might have a problem.
The reason being is that although you are checking if the stop event has been set, you aren't doing anything different if it has been set. Usually you'll exit a function when stop event has been set and skip any subsequent operations.
If the line V1 = nV1; does anything substantial (and waits for other locks to be released) you could have an issue.
Btw, if V1 = nV1 isn't pseudo-code and is in fact a simple integer assignment, then consider getting rid of the mutex synchronization code and use the InterlockedExchange api.
ahmd
November 8th, 2010, 09:37 PM
And SendMessage itself is harmless most of the times calling main thread which typically is the latest thread freeing its resources and quitting the process...
Igor, the main thread is where the waiting API is called from. (I probably didn't mention that.)
Usually you'll exit a function when stop event has been set and skip any subsequent operations.
The real code has two int's, one CString and COleDateTime object, can they lock internally? I doubt that it would happen from me not returning in time. I put those API calls that way because synchronization comes in its own struct and calls are made in that succession.
Arjay
November 8th, 2010, 11:26 PM
The real code has two int's, one CString and COleDateTime object, can they lock internally? I doubt that it would happen from me not returning in time. I put those API calls that way because synchronization comes in its own struct and calls are made in that succession.The point of having a stop event is to force a thread to cleanly exit early. If you don't exit from the work method before doing the work, what's the point of having a stop event?
Sure, it may seem trivial now, but what if you add something later that does depend on a shared resource? Why not write robust code that would handle changes?
I'm actually surprised to see you not using RAII for thread synchronization after all the times that folks here have suggested that you use it.
With regard to RAII, I'm not sure I've ever seen threading problem posted on this forum where the OP has used RAII for thread sync. On the other hand, I've seem many of the same types of posts where folks run into threading problems when not using RAII.
I'm not sure, but I think there is a lesson in there somewhere.
Lastly, I suspect newer parallel libraries will be popular because they hide the details of thread synchronization. The interesting thing is using RAII with a couple of design patterns nearly make threading as easy as a parallel library. I guess the trick for either is to get people to use them.
ahmd
November 8th, 2010, 11:40 PM
Arjay, I think I made it clear that I took the WaitForMultipleObjects() and ReleaseMutex() calls from a struct that is written with the RAII principles in mind. I put them this way in the example above to make it more readable. That is actually why ReleaseMutex() is called last (because it is called from a destructor in a real example.)
superbonzo
November 9th, 2010, 03:43 AM
if the stop event and the mutex become signaled during the same wait call then WaitForMultipleObjects will return WAIT_OBJECT_0 + CAE_MUTEX and not WAIT_OBJECT_0 + CAE_STOP_EVENT; from msdn:
If bWaitAll is FALSE, the return value minus WAIT_OBJECT_0 indicates the lpHandles array index of the object that satisfied the wait. If more than one object became signaled during the call, this is the array index of the signaled object with the smallest index value of all the signaled objects.
therefore, you should change your enum to something like
enum CS_SYNCH_OBJECTS{
CAE_STOP_EVENT, //Stop event - MUST BE FIRST!
CAE_MUTEX, //Synchronization mutex
CAE_COUNT //MUST BE LAST!
};
ahmd
November 9th, 2010, 04:43 AM
Good point, thanks.
Still, can someone share their views on this:
This was always confusing for me, since some system APIs can call SendMessage themselves internally. One of my worker threads may call GetWindowLongPtr(GWL_STYLE), SetWindowLongPtr(GWL_STYLE) and SetWindowPos(). Can those APIs call SendMessage internally?
Alex F
November 9th, 2010, 07:12 AM
If API uses SendMessage internally, this is written in MSDN, for example, see SetWindowText.
Arjay
November 9th, 2010, 09:02 AM
Arjay, I think I made it clear that I took the WaitForMultipleObjects() and ReleaseMutex() calls from a struct that is written with the RAII principles in mind. I put them this way in the example above to make it more readable. That is actually why ReleaseMutex() is called last (because it is called from a destructor in a real example.)No you didn't make that clear at all.
Next time, how about writing the pseudo code as
void fn( int nV1 )
{
// RAII lock
V1 = nV1;
}
As it stands, I'm not really sure that you understand RAII as it applies to thread sync as I have yet to see an RAII implementation from you.
I'm not trying to pick on you, but often your posts tend to go round and round because details such as these are left out.
No worries though, if you understood this, I guess you wouldn't be posting. :)
ahmd
November 9th, 2010, 12:26 PM
I was just trying to make it more readable for people, that's it. If I put it as you did, people would start asking me, "What's in RAII lock?" when it's clearly not the issue here. I knew right off the get-go that someone would pick on the synchronization part and that is why I added this line in my original statement, "PS. Please note that I'm not asking to check if my process of synchronization between threads is correct."
Also what not to know about RAII -- you simply utilize a language destructors to ensure that operations that may require release/deallocation are performed. This simply rids a sloppy programmer of a possible trouble of creating a memory leak or not releasing a lock, or missing it because of an exception. That's it. This is not a panacea since many computer languages don't even support such concept. And speaking of exceptions, the most common one that can make APIs fail is a low memory exception, but guess what, if it's thrown in one place, it is most certainly guaranteed to cascade down the program in other places so it is practically impossible for a program to get out of it gracefully. Try it on some older system, the Windows will lock up and your program's behavior won't even matter.
Arjay
November 9th, 2010, 02:11 PM
With respect to RAII, you seem misunderstand its benefit otherwise you wouldn't say it's for sloppy programmers only (thanks for the dig btw. :)).
The whole point of RAII is so the programmer doesn't have to check every single code path and/or exception to ensure objects are getting unlocked.
As programs increase in complexity, it becomes more and more difficult to sync without using RAII - it's not about sloppy programming, it's about smart programming. So why not use a technique that ensures that all locks are released under any condition?
As far as your not wanting to discuss synchronization, I don't understand why not.
Issues with multithreading almost always boil down to improper synchronization. So understanding how you are synchonizing is always relevant (even if you think you don't have an issue).
Another area that the code can be improved (over manual locking/unlocking) is to use encapsulation as I've mentioned several times before.
Simply putting all the thread synchronization code into a single class and exposing thread-safe setters and getters will eliminate the majority of theading issues.
Now since this isn't the first time you've posted about having trouble with threading, I have to wonder why the advice of RAII/encapsulation isn't being followed (or perhaps you misunderstand and think it's being followed).
ahmd
November 9th, 2010, 03:20 PM
Arjay, come on. I am following everything you just said. I don't know why you keep repeating it? That sample I posted above is employing your principle of encapsulation. Did I not call those methods a "getter" and a "setter"? I also keep repeating for you that I removed WaitForMultipleObjects() and ReleaseMutex() APIs from a RAII struct (or a class if you'd like to call it that way) for readability here and you keep telling me that I don't understand it :)
I agree with most of what you said but unfortunately the resolution to my problem seems to be a call to SendMessage that was locking one of the threads. (Or at least I hope so. I keep my fingers crossed, since I still haven't heard of the problem repeating after I put PostMessage instead.)
Arjay
November 9th, 2010, 05:19 PM
GetWindowLongPtr(GWL_STYLE), SetWindowLongPtr(GWL_STYLE) and SetWindowPos use SendMessage internally. Even if it did, it wouldn't be posting to your apps msg queue anyway.
In my experience, there's never a need to use SendMessage when working with threads. I know that some developers use it to pass data between threads, but I don't like to involve windows messaging with threads because message processing is dependent on the types of message being sent and other messages can be sent in a higher priority than a user message (and therefore interrupt when a user message gets processed).
For threading, I only will ever post a message strickly to notify a UI thread that some state has changed. The data is only ever passed via a controller class that offers thread safe data accessor methods.
ahmd
November 9th, 2010, 05:45 PM
GetWindowLongPtr(GWL_STYLE), SetWindowLongPtr(GWL_STYLE) and SetWindowPos use SendMessage internally. Even if it did, it wouldn't be posting to your apps msg queue anyway.
I'm sorry, did you mean to say that they do, or they don't?
As for my use of Post/SendMessage then yes, this is to notify a GUI element in the main thread. I don't pass it on a regular basis during a normal operation, but only when a thread terminates. It is used to hide a common control element.
Arjay
November 9th, 2010, 08:18 PM
They don't.
ahmd
November 10th, 2010, 11:23 PM
They don't.
Arjay, sorry to contradict you. I just got to the part of the code that locks up. (Took me a while to replicate it in a debugger, so it's 100%) Here it is:
SetWindowLongPtr(GWL_STYLE);
My only explanation why this happens is because it calls SendMessage to the main GUI thread internally.
Igor Vartanov
November 11th, 2010, 12:33 AM
It's very and very unlikely. SetWindowLong must talk directly to kernel, and I cannot imagine even a single reason why it should use messaging.
ahmd
November 11th, 2010, 12:53 AM
It's very and very unlikely. SetWindowLong must talk directly to kernel, and I cannot imagine even a single reason why it should use messaging.
Igor, I'm 100% positive. I've tried it several times with a debugger and that particular line locks up the worker thread it's called from. In it I'm enabling the WS_VISIBLE flag for a UI control from the main thread. Whatever it does internally is similar to the SendMessage behavior.
Igor Vartanov
November 11th, 2010, 01:03 AM
and that particular line locks up the worker thread it's called from. In it I'm enabling the WS_VISIBLE flag for a UI control from the main thread.So this naturally brings us to the idea of doing it in main thread. :) I mean, after posting custom message of course.
ahmd
November 11th, 2010, 01:15 AM
So this naturally brings us to the idea of doing it in main thread. :) I mean, after posting custom message of course.
Yeah, this seems like what I'll end up doing. It's just so much rewriting that I'll have to subject myself to... Plus posting messages back and forth doesn't really go well (or fast) with the GUI stuff.
So, the bottom line is that it's generally not a good practice to call any API that may be dealing with a UI element from a thread other than the main, is that correct?
PS. Now I don't know what other APIs may also lock up, since there's clearly not a word about it in the documentation.
Arjay
November 11th, 2010, 07:28 AM
So, the bottom line is that it's generally not a good practice to call any API that may be dealing with a UI element from a thread other than the main, is that correct?
Wow, you're just now figuring out that you shouldn't access UI components from worker threads? This surprises me since you've been told this many times in many different posts where you've had threading issues.
posting messages back and forth doesn't really go well (or fast) with the GUI stuff.Not sure where this comes from. My experience tells me opposite. :)
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.