20 Comments
IntelUser2000 - Saturday, June 21, 2008 - link
Well for one thing, Intel has far more documents than AMD/ATI/Nvidia in documenting various parts of their hardware. It's almost impossible to find exact current usage and TDP on chipsets for ATI/Nvidia.
To all its own.
Denis Riedijk - Thursday, June 19, 2008 - link
There are indeed 8192 (or 16384) registers per multiprocessor, so they are shared by all the warps running on that multiprocessor. Now comes an interesting part. Say you can have 10 warps (320 threads) running on a multiprocessor, and you have 2 warps per block (64 threads). Then you have 5 blocks per MP. And when a block is finished, it gets quickly replaced by the next block that needs to run.
So the scheduler is constantly juggling threads around to keep the ALUs busy, and when a set of warps is done, it quickly fetches the next set to keep everything nice & warm.
Some tests done by people on the CUDA forums have indicated that this bringing in of new blocks is happening very fast indeed.
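The arithmetic in the comment above can be sketched in a few lines of Python. This is only an illustration: the function name `resident_blocks` is made up here, and the model assumes warp slots are the only limit, ignoring the register and shared-memory limits that come up elsewhere in the thread.

```python
import math

def resident_blocks(warp_slots, threads_per_block, warp_size=32):
    """Blocks that fit on one multiprocessor if warp slots are the only limit."""
    # A block always occupies whole warps, so round the warp count up.
    warps_per_block = math.ceil(threads_per_block / warp_size)
    return warp_slots // warps_per_block

# The example from the comment: 10 warps (320 threads), 64 threads per block.
print(resident_blocks(10, 64))  # 5
```

When a block finishes, its warp slots are freed, and the next waiting block can take them.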
soloman02 - Wednesday, June 18, 2008 - link
As an ASEET degree holder and a BSEE student, I love these technical articles, even if programming isn't my thing. I used to go to Tom's Hardware, but their articles no longer include the technical stuff that makes us EEs, CEs, and other geeks all warm and fuzzy inside. So I come to Anand now.
So keep up the good work, and don't sell out like Tom's Hardware did.
Ztx - Wednesday, June 18, 2008 - link
"Some of us enjoy the technical stuff even if we don't fully understand it,"
Yup, and we sometimes learn something geeky from them :)
----
^
I agree with them, keep writing the articles Derek they are VERY informative! :D
Aileur - Wednesday, June 18, 2008 - link
I'm not so sure where you get your register count from.
It is stated explicitly in the CUDA programming guide as 8192 per multiprocessor.
As for the last comment about nvidia opening up, pretty much all the info needed to make the most out of the hardware is present in the programming guide.
nVidia also has a visual profiler that runs your code and profiles your occupancy and memory transactions (which are the bottleneck in kernels most of the time).
Aileur - Wednesday, June 18, 2008 - link
Oh and the way to hide latency is not to use more registers (as you seem to have hinted at), but to use fewer. Since the number of registers is fixed per MP, the fewer you use per thread, the more blocks can run on a given MP.
When you have more blocks running, you can hide the latency better since you have a bigger pool of blocks to pick from.
Or maybe we don't have the same definition of "register space". You might be referring to occupancy, or number of active warps.
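To make the "fewer registers, more blocks" point concrete, here is a minimal Python sketch. It assumes the 8192-register figure quoted in this thread and a hypothetical helper `blocks_by_registers`. It treats the register file as the only limit and ignores allocation granularity and the hardware caps on active blocks and warps, so real numbers may differ.

```python
def blocks_by_registers(regs_per_thread, threads_per_block, regs_per_mp=8192):
    """Blocks per multiprocessor if the register file is the only limit.

    Simplified model: every thread of every resident block holds its
    registers for the whole lifetime of the block.
    """
    return regs_per_mp // (regs_per_thread * threads_per_block)

# 64-thread blocks with increasing register pressure per thread.
for regs in (10, 16, 32, 42):
    print(regs, "regs/thread ->", blocks_by_registers(regs, 64), "blocks")
```

Fewer registers per thread leave room for more resident blocks, which gives the scheduler a larger pool of warps to switch to while others wait on memory.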
DerekWilson - Wednesday, June 18, 2008 - link
yeah, sorry, the register space bit was something i forgot to put in originally and the update was a text message i sent to anand -- we got our lines crossed and it should have read as it reads now.
which is to say that using 25% or less of your register space will help hide latency.
...
on your original comment, the 8k registers are not physically available hardware resources. developers can use that many in software, but i can guarantee that they'll be optimized out in compilers/assemblers and swapped into and out of memory when physical register space runs low.
the comment in the thread really does suggest that the 42 registers make up 25% of the physical register file on G80. i suppose i could have misunderstood or harris could have been representing things wrong ...
Aileur - Wednesday, June 18, 2008 - link
I don't know... it was always my understanding (from developing CUDA software) that there were 8192 registers per MP. That does sound like a ridiculously huge number of registers though.
That number is the base from which the maximum number of threads in a thread block can be calculated. The nvcc compiler can be asked to create a "cubin" file in which the number of registers needed by a kernel (per thread) is displayed. 8192 / that number = the maximum number of threads that can be in a thread block. Exceed that number and the kernel will not launch, and a CUDA exception "invalid launch parameters" will be raised.
Page 63 of the CUDA programming guide for CUDA beta 2.0 gives a similar equation.
Maybe you're right and there is some swapping magic occurring down the line, but it is not how I understood it.
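The rule of thumb in this comment (8192 divided by the per-thread register count) can be written as a small pre-launch check. The Python below is a sketch under that stated rule only. The names `max_threads_per_block` and `check_launch` are hypothetical, and the `ValueError` is a stand-in for whatever error the CUDA runtime actually reports.

```python
REGS_PER_MP = 8192  # figure quoted in the comment above

def max_threads_per_block(regs_per_thread, regs_per_mp=REGS_PER_MP):
    """Largest block for which every thread still gets its registers."""
    return regs_per_mp // regs_per_thread

def check_launch(threads_per_block, regs_per_thread):
    """Return the limit, or raise if the block needs too many registers."""
    limit = max_threads_per_block(regs_per_thread)
    if threads_per_block > limit:
        raise ValueError(
            f"{threads_per_block} threads x {regs_per_thread} registers "
            f"exceeds {REGS_PER_MP} registers (limit: {limit} threads)"
        )
    return limit

print(check_launch(128, 42))  # 195
```

With 42 registers per thread, a 128-thread block passes, while `check_launch(256, 42)` raises because 256 x 42 is more than 8192.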
Aileur - Wednesday, June 18, 2008 - link
Sorry for replying to myself again!
in http://forums.nvidia.com/index.php?showtopic=66238...
if, as he says, you use 64 threads in a block, with 42 regs per thread, and this number represents 25% of the total register space, that amounts to (64*42)/0.25 = 10752.
not 8192, but still of the same magnitude.
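For anyone who wants to check that estimate, the same arithmetic is spelled out below in Python. The 64 threads, 42 registers and 25% figures are the ones quoted in the comment. The comparison against 8192 and 16384 is added here only for illustration, using the register-file sizes mentioned elsewhere in this thread.

```python
threads_per_block = 64
regs_per_thread = 42

# Registers one such block needs in total.
regs_per_block = threads_per_block * regs_per_thread   # 2688

# If that were 25% of the register file, the whole file would hold:
implied_file = regs_per_block / 0.25                   # 10752.0

# Share of the documented register-file sizes that one block would use.
for size in (8192, 16384):
    print(size, f"{regs_per_block / size:.1%}")
```

One such block would use about a third of an 8192-entry register file, or about a sixth of a 16384-entry one.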
DerekWilson - Wednesday, June 18, 2008 - link
no problem at all ... reply to yourself all you want :-)
and that's an interesting point ... i was thinking register space per thread, but i was even going on about how context is per warp myself, which would put register space defined per warp rather than per thread anyway -- it makes sense that threads in a warp would share register space.
if you multiply my number by 64 you get yours ... which makes sense as he was talking about 64 thread blocks ...
and super insane numbers of registers do make sense when realizing that register space is defined per warp too ...
my numbers should still be right on a per thread basis though ...
i have to finish reading through the cuda manuals and guides and see if i can't start talking to nvidia tech support rather than PR :-)
Aileur - Wednesday, June 18, 2008 - link
I have found the CUDA forums to be a great place to learn.
Many of the contributors that wrote the programs in the SDK participate on the forums and I'd like to think they know their stuff!
As for registers per thread: if we accept there are 8192 registers available per multiprocessor, and if we want to run at least one full warp of 32 threads, that would put the maximum number of registers per thread at 256. I guess we could run only 1 thread and give a full 8192 registers to that thread, but that would obviously be completely useless.
I guess what I'm saying is that I don't think there is a "registers per thread" value. There is a fixed (per card) registers-per-multiprocessor value, and your launch configuration decides how many registers a kernel can hope to be able to use. On the other hand, a given kernel knows how many registers it needs (and unlike on general purpose CPUs, it NEEDS those registers as there is no caching mechanism), so you have to generate a launch configuration that agrees with this value.
Hope to see you on the cuda forums soon!
jibbo79 - Thursday, June 19, 2008 - link
Anyone with interest in these specs should read the CUDA Programming Guide doc.
For devices with compute capability 1.0 (eg GeForce 8800)
- The maximum number of threads per block is 512
- The number of registers per multiprocessor is 8192
- The maximum number of active blocks per multiprocessor is 8
- The maximum number of active warps per multiprocessor is 24
- The maximum number of active threads per multiprocessor is 768
For devices with compute capability 1.2 (eg GeForce GTX 280/260)
- The maximum number of threads per block is 512
- The number of registers per multiprocessor is 16384
- The maximum number of active blocks per multiprocessor is 8
- The maximum number of active warps per multiprocessor is 32
- The maximum number of active threads per multiprocessor is 1024
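The limits listed above can be combined into a rough occupancy estimate. The Python below is a sketch, not NVIDIA's occupancy calculator: `LIMITS`, `active_warps` and `occupancy` are names invented for this example. It ignores shared memory and register allocation granularity, and the "1.2" key simply follows the label used in the list (the replies note the GTX 260/280 actually report 1.3, with the same numbers).

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 512

# Per-multiprocessor limits, copied from the list above.
LIMITS = {
    "1.0": {"registers": 8192, "max_blocks": 8, "max_warps": 24},
    "1.2": {"registers": 16384, "max_blocks": 8, "max_warps": 32},
}

def active_warps(cc, threads_per_block, regs_per_thread):
    """Warps resident on one multiprocessor under the listed limits."""
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        raise ValueError("block exceeds the 512-thread limit")
    lim = LIMITS[cc]
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    blocks = min(
        lim["max_blocks"],                                  # block cap
        lim["max_warps"] // warps_per_block,                # warp cap
        lim["registers"] // (regs_per_thread * threads_per_block),  # registers
    )
    return blocks * warps_per_block

def occupancy(cc, threads_per_block, regs_per_thread):
    """Active warps as a fraction of the maximum for that device class."""
    return active_warps(cc, threads_per_block, regs_per_thread) / LIMITS[cc]["max_warps"]

# 64-thread blocks using 42 registers per thread, as discussed in the thread.
print(occupancy("1.0", 64, 42))  # 0.25
print(occupancy("1.2", 64, 42))  # 0.375
```

Under this simplified model, the 64-thread, 42-register case is limited by registers on both device classes: 3 blocks fit in 8192 registers and 6 blocks fit in 16384.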
Denis Riedijk - Thursday, June 19, 2008 - link
GTX260 & 280 are compute capability 1.3 actually, but the numbers are correct.
jibbo79 - Thursday, June 19, 2008 - link
Yes, but 1.3 only adds double precision and is completely unrelated to register counts.
Denis Riedijk - Friday, June 20, 2008 - link
When receiving the card my first impression was that they doubled the register count because of double support, since it takes 2 registers per double. But since there was apparently an (internal?) separate compute capability, it might indeed be unrelated.
Zak - Wednesday, June 18, 2008 - link
"Some of us enjoy the technical stuff even if we don't fully understand it,"
Yup, and we sometimes learn something geeky from them :)
Z.
SiliconDoc - Monday, July 28, 2008 - link
Yes, and some of us can't help thinking with a bad attitude, "The b****rds, they're always holding back, making it all harder than it should be, the proprietary/patent prenatalist protectors".
Gannon - Wednesday, June 18, 2008 - link
Some of us enjoy the technical stuff even if we don't fully understand it. I think it would be great if one could link to articles / books / references on the web that would enable one to look into it on one's own time and understand it.
I know that reading your articles I come across terms and I think "If only I had a link to look further into this".
No doubt on a GPU most people are interested in gaming performance and whether it's worth their $; that's what the majority of the market wants to know.
Most people do not have an interest in technical minutiae; they care as much about GPU design or architecture as they do about what kind of butter knife they use. They don't care about how the knife was made. All they want to know is: does it get the job done at a price that is affordable?
skiboysteve - Wednesday, June 18, 2008 - link
good book over the stuff from a microprocessor architecture class
http://www.amazon.com/Logic-Computer-Design-Fundam...
DerekWilson - Wednesday, June 18, 2008 - link
the book ...
http://www.amazon.com/Computer-Architecture-Fourth...
that's where it's at baby. hennessy and patterson really need to tackle GPU architecture, but if you start with CPUs you'll definitely be in a position to understand GPUs as well.
i'd say if you want to learn more, check out the above book and look into graphics programming introductions. i prefer opengl, but to be fair i haven't done anything with dx10 yet.
i would love to link concepts to things ... but that'd generate quite a bit of traffic to wikipedia (since it'd take a significant amount of time for us to do it all ourselves), and they really aren't even the best source for people who want to learn and don't already mostly understand what's happening ...