void threadfunc(int threadnum) {
    g_threadnum = threadnum;
    test();
}
What I am trying to do is "mark" each thread with a variable so that each thread can instantly know "I am thread number X". These checks will be performed frequently, so I'm looking for the fastest method.
I was previously passing threadnum on into all subroutines of the thread function, but after reading about TLS I realize there may be more efficient ways of doing this.
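A minimal sketch of the idea in portable C++11, assuming the standard `thread_local` keyword stands in for the Microsoft-specific `__declspec(thread)`. The `results` array and the `run_workers` helper are hypothetical names used only for illustration:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// One instance of this variable exists per thread.
thread_local int g_threadnum = -1;

// Hypothetical per-thread result slots, indexed by thread number.
static std::vector<int> results;

void test() {
    // Any routine called by the thread can read its own number
    // without receiving it as a parameter.
    results[g_threadnum] = g_threadnum * 10;
}

void threadfunc(int threadnum) {
    g_threadnum = threadnum;  // mark this thread once, at startup
    test();
}

// Starts `count` threads and returns the slots they filled in.
std::vector<int> run_workers(int count) {
    assert(count > 0);
    results.assign(count, 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < count; ++i)
        threads.emplace_back(threadfunc, i);
    for (auto& t : threads)
        t.join();
    return results;
}
```

Each thread writes only to its own slot, so this sketch needs no lock around `results`.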
Arjay
February 19th, 2008, 12:22 PM
Could we put aside speed considerations for a moment and consider thread safety?
The code you posted is not thread safe because there could be a conflict when two threads access the same global (if you are running two threads simultaneously).
Can you give an idea of why you need to know what thread you are in? I ask because there might be a way to structure the code that is threadsafe without the need to track which thread you are in.
zerver
February 20th, 2008, 03:38 AM
The code you posted is not thread safe because there could be a conflict when two threads access the same global (if you are running two threads simultaneously).
Thanks for your reply but I am confused.
You mean that the __declspec(thread) in front of the global variable has no effect at all?
That does not make sense because when I run this code, my printf debugging tells me that g_threadnum has a different value, depending on which thread reads the value.
I need the thread number for many purposes. Threads store their results in arrays that are indexed using the thread number. Threads use arrays of synchronization objects that are also indexed by thread number.
E.g. myLocks[g_threadnum].Lock();
Arjay
February 20th, 2008, 07:00 AM
Thanks for your reply but I am confused.
You mean that the __declspec(thread) in front of the global variable has no effect at all?
That does not make sense because when I run this code, my printf debugging tells me that g_threadnum has a different value, depending on which thread reads the value.
I need the thread number for many purposes. Threads store their results in arrays that are indexed using the thread number. Threads use arrays of synchronization objects that are also indexed by thread number.
E.g. myLocks[g_threadnum].Lock();
You are correct in terms of __declspec(thread). I misread that as a DLL export.
I prefer to take a more OO approach to a problem like this and not have global variables or global arrays accessed by multiple threads.
I don't know if you are performing synchronization tasks in the thread itself, but my preference is to create a class that provides thread-safe wrappers, and to pass a pointer to this class to the thread during thread creation. Then the thread just calls the accessor methods and doesn't need to worry about synchronization. This class pointer can be passed directly during thread creation or be a member of another class or struct. For example, you may have a struct that contains the pointer to the class, and an ID that is used to identify the thread. Inside the thread proc, the ID is passed to the accessor functions so that the data for that thread can be accessed.
Another approach is to create a class with its own static thread proc and its own data. When a class instance is created, the thread is started (passing in the this pointer) and the thread works on the data within the class instance. This approach works only for situations where there are relatively few threads (< 25).
For a greater number of threads, you can leverage the QueueUserWorkItem() API. Here you create a class that contains a static thread proc, but instead of creating a thread for each class instance, you pass the thread proc to QueueUserWorkItem() and the system will create and maintain a thread pool for you.
These are just different approaches for solving the same type of problem.
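As a rough sketch of the "class with its own static thread proc" approach: this version assumes std::thread in place of the Win32 thread creation calls, and the `Worker` class with its members is hypothetical, not code from this thread.

```cpp
#include <cassert>
#include <thread>

// Hypothetical worker: each instance owns its data and its thread.
class Worker {
public:
    // The thread starts as soon as the instance is created.
    explicit Worker(int input)
        : input_(input), result_(0), thread_(&Worker::ThreadProc, this) {}

    ~Worker() {
        if (thread_.joinable()) thread_.join();
    }

    Worker(const Worker&) = delete;
    Worker& operator=(const Worker&) = delete;

    // Waits for the thread to finish and returns what it computed.
    int Wait() {
        if (thread_.joinable()) thread_.join();
        return result_;
    }

private:
    // Static thread proc: receives the this pointer and forwards to the instance.
    static void ThreadProc(Worker* self) {
        assert(self != nullptr);
        self->Run();
    }

    // Works only on data inside this instance, so no thread number is needed.
    void Run() { result_ = input_ * input_; }

    int input_;
    int result_;
    std::thread thread_;  // declared last so the data is initialized before the thread starts
};
```

A caller would create one `Worker` per task and call `Wait()` on each to collect its result.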
zerver
February 20th, 2008, 09:45 AM
Thank you. Well, global variables are the only way for me. What I'm doing is OpenGL multithreading.
If we return to my original question regarding performance of accessing the TLS variables:
If it is a "real" variable (and it seems so) the system must dynamically allocate a new code segment for each thread that wants to execute a function that uses the TLS variable. It must then perform some kind of relocation so that each thread in fact uses a different variable.
Am I right in my thinking?
If so, the memory usage would increase, but performance would not suffer, unless the system is "stupid" and performs allocation/relocation each time the function is executed...
exterminator
February 20th, 2008, 10:14 AM
Thank you. Well, global variables are the only way for me. What I'm doing is OpenGL multithreading.
Is OpenGL multi-threading API different from win32 or MFC threading APIs? What MT APIs are you using?
If it is a "real" variable (and it seems so) the system must dynamically allocate a new code segment for each thread that wants to execute a function that uses the TLS variable. It must then perform some kind of relocation so that each thread in fact uses a different variable.
Am I right in my thinking?
Yes, you are right in your thinking, but you cannot know precisely what happens behind the scenes, and it is probably not worth finding out. The underlying data structure could be anything: a map, a hash map, or something completely different.
If so, the memory usage would increase, but performance would not suffer, unless the system is "stupid" and performs allocation/relocation each time the function is executed...
Not every time a function is executed. If you declare the thread-local variable with the __declspec(thread) declarator, the variable is created once when a thread is created and is probably destroyed as soon as the thread is destroyed. That should not happen each time a function in a thread gets called.
What I do not understand is what you are calling a global variable. A simple standard global variable is not sufficient to hold a number for all threads because it is just one variable and any read or write would have to be synchronized. Synchronizing it would not help either, because a single shared variable doesn't fit your purpose in the first place. Is that what you are calling a global variable? A thread-local variable in TLS is effectively a global variable, but it is thread specific. There are as many instances of that same name as there are threads, and you can use them without any interference among threads unless you actually share a pointer to that variable with another thread.
By the way, with pthreads, there is a function called pthread_self() which gives back the opaque thread ID of the calling thread. That is the way to actually identify threads. I suppose there should be some similar way to identify your threads rather than having to create a TLS variable. With the win32 threading APIs, you have the GetCurrentThreadId() function. That should suffice if you are using win32 threads. Otherwise you will have to tell us which APIs you are using.
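For comparison, a small sketch of the same idea using the portable std::this_thread::get_id() instead of pthread_self() or GetCurrentThreadId(). The `collect_thread_ids` helper is a hypothetical name for illustration:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Starts `count` threads and collects the ID each one reports for itself.
std::set<std::thread::id> collect_thread_ids(int count) {
    assert(count > 0);
    std::set<std::thread::id> ids;
    std::mutex m;
    std::condition_variable all_registered;
    int registered = 0;

    std::vector<std::thread> threads;
    for (int i = 0; i < count; ++i) {
        threads.emplace_back([&] {
            std::unique_lock<std::mutex> lock(m);
            ids.insert(std::this_thread::get_id());
            if (++registered == count) all_registered.notify_all();
            // Stay alive until every thread has registered,
            // so no ID can be reused by a later thread.
            all_registered.wait(lock, [&] { return registered == count; });
        });
    }
    for (auto& t : threads) t.join();
    return ids;
}
```

Note that such an ID is opaque. Unlike a thread number assigned by the program, it cannot be used directly as an index into an array.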
exterminator
February 20th, 2008, 10:22 AM
I am not sure if I have answered your question and hence you would have to explain a bit more. Is it a thread pool that you are planning to make? Why do you need to have code that does something different depending upon different thread IDs?
zerver
February 20th, 2008, 10:29 AM
Thanks. Yeah, what I meant was "global TLS variable".
Hash map is unlikely because you can actually take the address of a TLS variable and use it, as long as you are inside a specific thread.
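A short sketch of that observation, assuming the standard `thread_local` keyword rather than `__declspec(thread)`; the helper names are hypothetical. Each thread that takes the address of the variable gets the address of its own instance:

```cpp
#include <cassert>
#include <cstdint>
#include <thread>

thread_local int tls_value = 0;

// Address of the calling thread's instance of tls_value, as an integer.
std::uintptr_t address_in_this_thread() {
    int* p = &tls_value;  // taking the address is allowed
    *p = 42;              // and the pointer is usable inside this thread
    assert(tls_value == 42);
    return reinterpret_cast<std::uintptr_t>(p);
}

// Address of tls_value as seen by a freshly started thread.
std::uintptr_t address_in_new_thread() {
    std::uintptr_t addr = 0;
    std::thread t([&] { addr = address_in_this_thread(); });
    t.join();
    return addr;
}
```

The two functions return different values when called from the same thread, because the second one reports the instance belonging to the new thread.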
Looks like I'll have to do some testing to see if reading/writing TLS is slower than a standard global variable.
TheCPUWizard
February 20th, 2008, 10:58 AM
TLS (thread local storage) IS what is used when you use __declspec(thread) (http://msdn2.microsoft.com/en-us/library/9w1sdazb(VS.80).aspx). Be aware that this is a Microsoft extension, and will not port to other compilers or platforms. Also, the items must be simple POD (not classes!).
The actual access is done by a set of helper routines (http://msdn2.microsoft.com/en-us/library/6yh4a9k1(VS.80).aspx)
There are restrictions on access, including taking the address of a TLS variable (http://msdn2.microsoft.com/en-us/library/2s9wt68x(VS.80).aspx)
In almost all cases, accessing a TLS variable will be slower than accessing a "true" global or static variable, and is typically equivalent to a stack or heap access.
It is critical to remember that performance is always a secondary concern to reliability and maintainability. Only if it can be determined [via the use of some type of profiling tool] that a specific portion of code is having a measurable impact should performance become the focus.
This is true for EVERY type of application, ranging from games to missile guidance ;)
I am also VERY skeptical of ANY design that has any of the following characteristics:
1) Dependencies on static or global data.
2) Requires knowledge of system context (e.g. thread, process, processor)
3) Heavily depends on proprietary extensions (they can usually be encapsulated)
The "real world" issues that will usually impact an application are most usually issues that arise from outside the application itself. A dramatic example was SQLServer-2005 vs. SQLServer-2000. All of the "benchmarks" indicated a general improvement in performance (some cases dramatic, others minimal). But when deployed to one (very large) client base, the results were that SQL2005 was orders of magnitude SLOWER than SQL2000. It turned out that there was one specific model Pentium (which this company had used in over 5,000 systems) which had a slight difference in the L1/L2 caching.
The code in SQL2005 had grown by a few bytes for a key algorithm deep in the heart of SQL Server. For all other processors the code would either: Not be held in the cache (older smaller slower processors) or Would be held in the cache (all recent processors) for both the 2000 and 2005 versions.
But for the one specific processor, the SQL2000 implementation DID fit in the cache, but the SQL2005 version did NOT fit. This resulted in faults all the way back to main RAM (and potentially the disk resident page file). :eek: :eek: :eek: :eek:
zerver
February 20th, 2008, 11:08 AM
"Win32 and the Visual C++ compiler now support statically bound (load-time) per-thread data in addition to the existing API implementation."
Doesn't this mean that it also supports "real" TLS variables?
I am not sure if I have answered your question and hence you would have to explain a bit more. Is it a thread pool that you are planning to make? Why do you need to have code that does something different depending upon different thread IDs?
You could call it a thread pool.
I already said this, but it uses the thread number to index arrays.
results[g_threadnum]=...;
locks[g_threadnum].Lock();
So, if g_threadnum is not a "real" variable (a disguised function call) it will be too slow for my requirements.
I may choose to put all the thread data in a thread class to reduce the frequent array indexing, but the access to g_threadnum must still be fast.
TheCPUWizard
February 20th, 2008, 11:49 AM
"Win32 and the Visual C++ compiler now support statically bound (load-time) per-thread data in addition to the existing API implementation."
Doesn't this mean that it also supports "real" TLS variables?
I would be interested in seeing the source of that reference. It is impossible to allocate and reference the space at process load time. There is no way of knowing if there are going to be 1 or 10000 threads running at a given time. The last time I actually looked at the disassembly the size of a TLS block was known, and dynamically allocated at thread creation, then all access was via an indexed offset.
You could call it a thread pool.
It either IS a thread pool, or it is not. I can call a wild lion a kitty....
I already said this, but it uses the thread number to index arrays.
results[g_threadnum]=...;
locks[g_threadnum].Lock();
So, if g_threadnum is not a "real" variable (a disguised function call) it will be too slow for my requirements.
I may choose to put all the thread data in a thread class to reduce the frequent array indexing, but the access to g_threadnum must still be fast.
1) Again I question WHY the need. Your posted sample makes NO sense. If there is going to be one lock per thread, then the lock will NEVER get invoked!
2) Something being implemented as a function is not a performance issue per se. The function could be inlined, and then optimized with the surrounding code.
3) "It will be too slow for my requirements". Exactly how many nano or micro seconds must the specific operation complete in? How will you handle system interruptions (e.g. interrupts, task switches)? How will you deal with hard page faults? etc....
Arjay
February 20th, 2008, 03:19 PM
Whenever possible, in multithreading you want to control the access of shared data in an encapsulated way.
Code such as this can be extremely problematic in a multithreaded environment.
locks[g_threadnum].UnLock(); // I assume you need to unlock here
}
It's problematic because you allow external control of the locking and unlocking and this makes it difficult to track down synchronization issues.
Now consider the design I was referring to earlier. In this design, during thread creation a struct is used to pass in a thread index and a pointer to a class that provides thread safe access to shared data.
Notice in the above snippet that the thread doesn't need to worry about synchronization as the sync code is encapsulated within the CDataManager class.
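A minimal sketch of a design along these lines, assuming std::thread and std::mutex rather than the Win32/MFC primitives. Only the CDataManager name comes from this post; its methods, the `ThreadParams` struct and the `RunThreads` helper are hypothetical:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical manager: all locking lives inside this class.
class CDataManager {
public:
    explicit CDataManager(int threadCount) : results_(threadCount, 0) {}

    void SetResult(int index, int value) {
        std::lock_guard<std::mutex> lock(mutex_);
        results_.at(index) = value;
    }

    int GetResult(int index) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return results_.at(index);
    }

private:
    mutable std::mutex mutex_;
    std::vector<int> results_;
};

// Passed to each thread at creation: its index plus the shared manager.
struct ThreadParams {
    int index;
    CDataManager* manager;
};

void ThreadProc(ThreadParams params) {
    // The thread never calls Lock()/UnLock() itself.
    params.manager->SetResult(params.index, params.index + 100);
}

// Starts `count` threads against one manager and waits for them.
void RunThreads(CDataManager& manager, int count) {
    assert(count > 0);
    std::vector<std::thread> threads;
    for (int i = 0; i < count; ++i)
        threads.emplace_back(ThreadProc, ThreadParams{i, &manager});
    for (auto& t : threads) t.join();
}
```

With this layout the thread index travels in the struct handed to the thread, so no global or TLS variable is involved.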
Many of the perceived threading issues can be reduced following an approach such as this. IMO, developers get into trouble when they don't compartmentalize the synchronization chores within a controlling class.
TheCPUWizard
February 20th, 2008, 09:39 PM
IMO, developers get into trouble when they don't compartmentalize the synchronization chores within a controlling class.
:thumb: (see point #2 in reply #11!!!!!) :D
zerver
February 25th, 2008, 07:33 AM
I would be interested in seeing the source of that reference. It is impossible to allocate and reference the space at process load time. There is no way of knowing if there are going to be 1 or 10000 threads running at a given time. The last time I actually looked at the disassembly the size of a TLS block was known, and dynamically allocated at thread creation, then all access was via an indexed offset
1) Again I question WHY the need. Your posted sample makes NO sense. If there is going to be one lock per thread, then the lock will NEVER get invoked!
The source of the reference is MSDN :D Actually, it is one of the links provided by you.
http://msdn2.microsoft.com/en-us/library/6yh4a9k1.aspx
1. There is one lock per producer thread. Each producer thread has two vectors where it posts the results. When a vector is full, it switches to the other vector.
There is also a consumer thread that checks all producer threads for vectors containing data. Producer/consumer must of course lock a vector to start producing/consuming.
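Roughly, the per-producer state could look like the sketch below. It assumes std::mutex in place of my own lock class, gives each of the two vectors its own lock, and uses hypothetical names (`ProducerSlot`, `Post`, `Drain`, `kCapacity`) purely for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical state for one producer: two vectors, each with its own lock.
struct ProducerSlot {
    static constexpr std::size_t kCapacity = 4;

    std::mutex locks[2];
    std::vector<int> buffers[2];
    int active = 0;  // vector currently being filled; touched only by the producer

    // Called by the producer thread. Returns false if both vectors are full.
    bool Post(int value) {
        for (int attempt = 0; attempt < 2; ++attempt) {
            std::lock_guard<std::mutex> lock(locks[active]);
            if (buffers[active].size() < kCapacity) {
                buffers[active].push_back(value);
                return true;
            }
            active = 1 - active;  // this vector is full: switch to the other one
        }
        return false;  // the consumer has fallen behind
    }

    // Called by the consumer thread: takes whatever data either vector holds.
    void Drain(std::vector<int>& out) {
        for (int i = 0; i < 2; ++i) {
            std::lock_guard<std::mutex> lock(locks[i]);
            out.insert(out.end(), buffers[i].begin(), buffers[i].end());
            buffers[i].clear();
        }
    }
};
```

The consumer would hold an array of these slots, one per producer thread number, and call `Drain` on each in turn.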
I really appreciate your help. However, this discussion has gone a little out of control. If you don't have anything to say regarding how (the fastest form of) TLS is implemented, please do not post. If no one replies I will simply analyze the performance myself. It should not take long to find out whether it is slower than ordinary variable accesses.
zerver
February 26th, 2008, 11:25 AM
I found an interesting article about this, so I think I can say: problem solved.
The conclusion of the article is that TLS is slower than ordinary variable access.
Each access translates into at least three instructions.
I guess it would be possible to make TLS faster by copying relevant parts of the code segment each time a thread is started and then relocating certain variables.
However, this could cause a tremendous overhead for starting a thread and also dramatically increase the memory usage.
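For anyone who wants to check this on their own machine, here is a rough timing sketch. It assumes the standard `thread_local` keyword instead of `__declspec(thread)`, and the function names are hypothetical. It makes no claim about the numbers: an optimizing compiler may compute the TLS address once outside the loop, which would hide most of the per-access cost.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// volatile forces a real load and store on every iteration.
volatile std::uint64_t global_counter = 0;
thread_local volatile std::uint64_t tls_counter = 0;

// Increments the ordinary global n times; returns elapsed nanoseconds.
long long time_global(std::uint64_t n) {
    assert(n > 0);
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < n; ++i)
        global_counter = global_counter + 1;
    auto stop = std::chrono::steady_clock::now();
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}

// Same loop, but on the calling thread's instance of the TLS variable.
long long time_tls(std::uint64_t n) {
    assert(n > 0);
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < n; ++i)
        tls_counter = tls_counter + 1;
    auto stop = std::chrono::steady_clock::now();
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}
```

Calling both with a large iteration count and comparing the returned times gives a first impression. A profiler or a look at the generated assembly is more reliable.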
codeguru.com
Copyright Internet.com Inc., All Rights Reserved.