  • BikeAR - Monday, October 2, 2006 - link

    I may be living in a vacuum these days, but did anyone notice the following comment on the F@H site's "Help" page?...

    Intel has been helping support our project (Stanford/Intel Alzheimer's Research Program), but has announced that it is ending their contribution to distributed computing in general and no longer supports any distributed computing clients, including F@H.

    What is up with this?
  • Staples - Monday, October 2, 2006 - link

    I'd love to know how a fully loaded X1900 folding would compare to an E6600 Core 2 Duo. These GPUs have always been said to be hundreds of times faster than CPUs at what they are designed to do, so I'd love to see if it is really true or not. If not, it looks like we may have been lied to for so many years.
  • Ryan Smith - Monday, October 2, 2006 - link

    Unfortunately it looks like they'll be using larger units. We thought we'd be able to use the same units for both the normal and GPU-accelerated clients, but this appears not to be the case. There's no direct way to compare the clients, then; the closest we could get is comparing the number of points awarded per work unit.
  • peternelson - Sunday, October 1, 2006 - link

    "So far, the only types of programs that have effectively tapped this power other than applications and games requiring 3D rendering have also been video related, such as video decoders, encoders, and video effect processors. In short, the GPU has been underutilized, as there are many tasks that are floating-point hungry while not visual in nature, and these programs have not used the GPU to any large degree so far."

    Erm, not so! Try looking at GPGPU.org

    Also see the books GPU Gems and GPU Gems 2.
  • mostlyprudent - Sunday, October 1, 2006 - link

    BTW, a typo in the 1st paragraph of the article: "...manipulate data in ways that goes far beyond" -- should read "...in ways that GO far beyond...".

    I knew about SETI, but was completely unaware of F@H. Thanks. I will look to get involved!
  • imaheadcase - Sunday, October 1, 2006 - link

    Did AnandTech just post an article on the weekend when most PC users are at home so they can read it? Amazing! :P
  • CupCak3 - Sunday, October 1, 2006 - link

    When loading the client, our team number is 198. :)

    If anyone has any questions and/or would like to join the TeAm, come and visit us here: http://forums.anandtech.com/categories.aspx?catid=... We have many people who would be more than happy to answer any questions.


    I'll try to answer some of the questions and comments which have been posted thus far:

    NVIDIA support may come later. This is a BETA right now, so only a small number of devices will be supported. The supported ATI line will expand, and when NVIDIA gets the kinks worked out on using GPGPU processing for their cards, I'm sure the Pande Group would be glad to pick them up :)

    No one knows about Crossfire support yet. We'll know more about this tomorrow.

    Same goes for using CPU + GPU.

    Scalability: I've heard a multithreaded client is in the works, but I'm sure it's on hiatus with the GPU client coming into BETA.

    It is correct that for each core, a separate Folding@Home client must be loaded. Right now the maximum amount of RAM one core will use is between 100 and 120 MB. Other work units use around 5 or 10 MB. The client will only load work units with this amount of RAM usage if you have the resources to spare (so not if you just have one 256 MB stick in your XP box and would be using 110 MB of that for folding). We do not yet know the RAM requirements for the video cards.

    The Pande Group has posted that 1600 and 1800 series cards will be the next ones supported :) (if all goes well of course)

    Linux and Mac OS X clients are also available for those wondering.


    I hope this helps!
  • Messire - Sunday, October 1, 2006 - link

    Hi folks

    It is surprising to me that this Stanford project works only on the most modern and powerful ATI GPUs, because I know of another Stanford project called BrookGPU. Here is the URL: http://graphics.stanford.edu/projects/brookgpu/

    It is only in beta now and seems to have been abandoned a little bit, but they've made a very usable GENERAL PURPOSE streaming programming language which works on much older GPUs as well. And it is very fast...


    Messire



  • lopri - Sunday, October 1, 2006 - link

    I know this question is somewhat off the discussion at the moment, but I can't help but ask. Would this crank up GPU usage to 100% as it does the CPU? All the time? Then it could be a problem for average users, because the X1900 runs hot as it is, even at idle.
  • GhandiInstinct - Sunday, October 1, 2006 - link

    Don't get me wrong, I'd love to contribute, as I have with SETI, but what evidence is there that this will help anything at all?

    I mean, we've had supercomputers working in science for a while now and I haven't heard of any major breakthroughs because of it, and if my computer is going to exhaust a little extra heat, I want some numbers worth crunching before I do so.

    That's all.
  • photoguy99 - Sunday, October 1, 2006 - link

    Basic research is like that - it may take many years to benefit from it.

    Look at Einstein, his work was fundamental research but the benefits are still being realized 100 years later.

    So even if they have major breakthroughs they may be at such a foundational level that the actual cure for Alz. comes 25 years later.

    Nature of the beast.
  • JarredWalton - Sunday, October 1, 2006 - link

    Published results: http://folding.stanford.edu/results.html

    Current research includes:
    Alzheimer's, cancer, Huntington's disease, osteogenesis imperfecta, Parkinson's disease, ribosomes and antibiotics: http://folding.stanford.edu/FAQ-diseases.html#AD

    And of course, there's always the Folding@Home FAQ: http://folding.stanford.edu/faq.html

    Do they know in advance that all of these issues are related to protein folding? No, but I'd assume they have good cause to suspect it. The problem is that it takes time; breakthrough results might not materialize soon, next year, or even for 5-10 years. Should research halt just because the task is difficult? Personally, I think FAH has a far greater chance of impacting the world during my lifetime than SETI@Home.

    Cheers!
  • Baked - Sunday, October 1, 2006 - link

    I wonder if an X1600 card will work. I've tried both the graphical and command line versions of F@H on my new system, but both had problems connecting to the F@H server. Hopefully this new F@H version will work.
  • JarredWalton - Sunday, October 1, 2006 - link

    The next step is to extend to the X1800 and probably from there to the X1600. Beyond that, my guess would be that the G70 chips are next up.
  • smitty3268 - Sunday, October 1, 2006 - link

    I assume the new client uses the CPU + GPU, and not just the GPU? Also, it would be nice to have some sort of explanation for the poor NVIDIA performance in the next article. Is it just their architecture, or has Folding@Home been getting assistance from ATI and not NVIDIA?

    This doesn't make much sense:
    quote:

    Additionally, as processors have recently hit a cap in terms of total speed in megahertz, AMD and Intel have been moving to multiple-core designs, which introduce scaling problems for the Folding@Home design and is not as effective as increasing clockspeeds.


    The Folding@Home design is quite obviously a massively parallel design, as shown by the fact that hundreds of thousands of computers are all working on the same problem. Therefore, doubling the number of cores should double the amount of work being done, and this seems to be happening faster than the old incremental speed bumps.

    Otherwise, it was a good article.
  • z3R0C00L - Monday, October 2, 2006 - link

    It's simple: F@H uses dynamic branching calculations. nVIDIA GPUs are technologically inferior to ATi VPUs when it comes to shading performance and branching performance.

    As such, nVIDIA's mighty GeForce 7950GX2 would perform much like an ATi Radeon X1600 XT. In other words, too slow.
  • tygrus - Monday, October 9, 2006 - link

    The Nvidia FP hardware is fast enough, but the overall design doesn't fit well with the software (task) design of F@H. For other tasks the Nvidia GPUs may be very fast. The next Nvidia GPU & API will hopefully be better.

    The CPU handles the data transformation and setup before sending to the GPU for the accelerated portion. Then the CPU oversees the return of data from the GPU. The CPU also looks after the log, text console, disk reads/writes, internet uploads/downloads, and other system overheads.

    More information is available from ???

    I just found a really great article which covers the public release: http://techreport.com/onearticle.x/10907


    quote:

    --------------------------------------------------------------------------------
    Talk w/Vijay Pande

    ATI is currently 8X faster than Nvidia. Nvidia has our code, running it internally, hope we can close the gap. But even 4X difference is large, and ATI is getting faster all of the time.

    Lot of work goes into qualifying GPUs internally so they can run.

    Making apps like this run on a GPU requires a lot of development work. Currently, science is best served by using ATI chips. Nv may come in future.

    ---------

    The CPU has to poll the GPU to find out if it has finished a block and needs help (data from GPU->CPU, etc.). This takes a context switch and CPU time to wait for the reply (a wait in nanoseconds, not a fixed number of cycles). Any actual work is done in the remaining time slice or more. The faster the GPU, the more it demands of the CPU. Slow the CPU by half and you may be slowing the GPU by up to half.
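
    A rough sketch of that polling pattern (my own Python illustration, not the actual client's code; gpu_block_done and service_gpu_results are hypothetical stand-ins for the driver query and the CPU-side bookkeeping):

        import time

        def gpu_block_done():
            # Hypothetical stand-in for asking the driver whether the GPU
            # has finished its current block of work.
            return time.monotonic() % 0.05 < 0.005

        def service_gpu_results():
            # Stand-in for the CPU-side work: copying data back, writing the
            # log/checkpoint, and preparing the next block for the GPU.
            time.sleep(0.001)

        for _ in range(1000):              # the real client loops for the whole WU
            if gpu_block_done():
                service_gpu_results()      # a slower CPU services the GPU less often,
            else:                          # so a fast GPU ends up waiting on it
                time.sleep(0.0005)         # each poll also costs a context switch

    The only point of the sketch is that the GPU can't start its next block until the CPU notices the last one finished, which is why halving CPU speed can drag GPU throughput down with it.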
  • Ryan Smith - Sunday, October 1, 2006 - link

    The new client "uses" the CPU like all applications do, but the core is GPU-based, so it won't be pushing the CPU like it does on the CPU-only client, I don't know to what level that means however.

    As for the Nvidia stuff, we only know what the Folding team tells us. They made it clear that the Nvidia cards do not show the massive gains that ATI's cards do when they try to implement their GPU code on Nvidia's cards. Folding@Home has been getting assistance from Nvidia, but they also made it clear that this is something they can do without help, so the problem is in the design of the G7x in executing their code.

    As for the core stuff, this is something the Folding team explicitly brought up with us. The analogy they used is trying to bring together 2000 grad students to write a PhD thesis in one day; it doesn't scale like that. They can add cores to a certain point, but the returns are diminishing versus faster methods of processing. This is directly a problem for Folding@Home, which is why they are putting effort into stream processing, which can offer the gains they need.
  • smitty3268 - Sunday, October 1, 2006 - link

    Does the G7x have as much support for 32-bit floats as ATI does? It seems like I read somewhere that one of the two had moved to 32-bit exclusively while the other was still much faster at 16/24-bit FP. Could that be why they aren't seeing the same performance from NVIDIA?
  • Clauzii - Monday, October 2, 2006 - link

    Probably that, and the fact that the big ATI models contain 48 shaders - that really beefs the calculations up!
  • photoguy99 - Sunday, October 1, 2006 - link

    The folding team just hasn't designed their architecture efficiently for parallelism within a system.

    No doubt they are brilliant computational biologists, but it's simply a contradiction to claim a system can scale well across thousands of systems but not across the cores within those systems - nonsense.

    In fact I challenge anyone from their coding team to explain this contradiction.

    Now if they say look, we're busy, we just haven't had time to optimize the architecture for multi-core yet, then that makes perfect sense. But to say inherently the problem doesn't lend itself to that is not right.

  • JarredWalton - Sunday, October 1, 2006 - link

    Not at all true! See the above comments, but data dependency is key. They know the starting point, but beyond that they don't know anything. So they might generate 100,000 (or more) starting points. There are 100K WUs out there. They can't even start the second segment of any of those points until the first segment is complete.

    Think of it within a core: they can split a task into several (or hundreds of) pieces only if each piece is fully independent. It's not like searching for primes, where scanning from 2^100000 to 2^100001 is totally unrelated to what happened from 2^99999 to 2^100000. Here, what happens at stage x of Project 2126 (Run 51, Clone 9, Gen 7) absolutely determines where stage x+1 of Project 2126 (Run 51, Clone 9, Gen 7) begins. A separate task from Project 2126 (Run 51, Clone 9, Gen 6) or whatever can be running, but the results there have nothing to do with Project 2126 (Run 51, Clone 9, Gen 7).
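
    To illustrate that dependency, here's a toy Python sketch of my own (not Pande Group code; simulate_step is a made-up stand-in for one trajectory segment):

        from concurrent.futures import ProcessPoolExecutor

        def simulate_step(state):
            # Stand-in for one segment of a trajectory: the next state depends
            # entirely on the previous one, so segments cannot run out of order.
            return state + 1

        def run_trajectory(start, n_steps=1000):
            state = start
            for _ in range(n_steps):       # strictly sequential in "time"
                state = simulate_step(state)
            return state

        if __name__ == "__main__":
            starting_points = range(8)     # independent Runs/Clones
            with ProcessPoolExecutor() as pool:
                finals = list(pool.map(run_trajectory, starting_points))
            print(finals)

        # Extra workers let you run more independent trajectories at once
        # (throughput), but they cannot make any single trajectory finish
        # sooner (latency), because each step needs the previous step's result.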
  • photoguy99 - Monday, October 2, 2006 - link

    Jarred, I respectfully submit that you are not correct.

    Think of it this way - what is the algorithmic difference between submitting jobs to distributed PCs vs. distributed processes within a PC?

    Multiple processes within a PC could operate independently and easily take advantage of multi-core parallelism. A master UI process could manage the sub-processes on the machine so that no special setup would be required of the user.

    I'm telling you, the problem with leveraging multi-core is not inherent to the folding problem; it's just a limitation of how they've designed their architecture.

    Again not to take away credit from all the goodness they have achieved, but if you think about it this is really indisputable. I'm sure their developers would agree.
  • JarredWalton - Monday, October 2, 2006 - link

    Are we talking about *can* they get some advantage from multiple cores with different code, or are we talking about gaining a nearly 2X performance boost? I would agree that there is room for them to use more than one core, but I would guess the benefit will be more like a 50% speedup.

    Right now, running two instances of FAH nearly doubles throughput, but no individual piece is completed faster. They could build in support for using multiple cores without user intervention, but that's not a big deal, since you can already do that on your own. Their UI could definitely be improved. The difficulty is that they aren't able to crank out individual pieces faster; they can get more pieces done, but if there's a time-sensitive project they can't explore it faster. For example, what if they come upon a particular folding sequence that seems promising, and they'd like to investigate it further with 100K slices covering several seconds (or whatever)? If piece 1 determines piece 2, and 2 determines 3... well, they're stuck with a total time to calculate 100K segments that would be in the range of hundreds of years (assuming a day or two per piece).

    Anyway, there are tasks which are extremely difficult to thread, though I wouldn't expect this to be one of them. Threading and threading really well aren't the same, though. Four years from now, if they get eight-core CPUs, that increases the total amount of work people can process, but they wouldn't be able to look at any longer sequences than today if CPUs are still at the same clock speed. (GPUs being 40X faster means they could look at 40X more length/complexity.)

    Anyway, without low-level access to their code and an understanding of the algorithms they're using, the simple truth is that neither of us can say for sure what they can or can't get from multithreading. Then there's the whole manpower problem - is it more beneficial to work on multithreading, or to work on something else? Obviously, so far they have done "something else". :)
  • smitty3268 - Monday, October 2, 2006 - link

    Looking at their website, they are working on a multithreaded core which would take advantage of SMP systems. Regardless of how well that turns out, a 40x increase is not going to happen until we get more than 40 cores in a CPU, so this GPU client is still a very big deal.

    I understand what you mean about data dependence and not being able to move on to more involved simulations due to time factors of individual work units, but it seems like this would be fairly easy to solve by simply splitting the work units in half or in quarters, etc. This could definitely be difficult to do, though, depending on how their software has been designed. Perhaps they would have to completely rewrite their software and it isn't worth the trouble.
  • JarredWalton - Tuesday, October 3, 2006 - link

    I don't think they can split a WU in half, though, or whatever. The best they could do would be to split off the computation so that, e.g., atoms 1-5000 are solved at each stage on core 1 and atoms 5001-10000 on core 2. You still come back to the determination of the "trajectory". If you start at A and you want to know where you end up, the only way to know is to compute each point on the path. You can't just break that calculation into A-->C and C-->B, with C being the halfway point.
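
    Here's a toy Python sketch of my own (not their code) of what that kind of split looks like: atoms are shared across workers within a timestep, but the timesteps themselves still run one after another.

        from concurrent.futures import ThreadPoolExecutor

        N_ATOMS, N_WORKERS = 10000, 2

        def advance_chunk(positions, lo, hi):
            # Stand-in for the per-atom force/position update of one timestep.
            return [p + 0.1 for p in positions[lo:hi]]

        def timestep(positions, pool):
            chunk = len(positions) // N_WORKERS
            ranges = [(i * chunk, (i + 1) * chunk) for i in range(N_WORKERS)]
            parts = pool.map(lambda r: advance_chunk(positions, *r), ranges)
            return [p for part in parts for p in part]

        if __name__ == "__main__":
            positions = [0.0] * N_ATOMS
            with ThreadPoolExecutor(max_workers=N_WORKERS) as pool:
                for _ in range(100):       # timesteps stay strictly sequential
                    positions = timestep(positions, pool)
            print(positions[0])

    Splitting atoms this way can speed up each individual timestep somewhat, but it does nothing to let you skip ahead along the trajectory.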

    I know the Pande people are working on a lot of stuff right now, so GPUs, PS3, SMP, etc. are all being explored to varying extents.
  • icarus4586 - Wednesday, October 4, 2006 - link

    The reason that modern GPUs are so powerful is that they have many parallel processing pipelines, which is only a little different from saying that they have many processing cores. Even the diagram given in this article is titled "Modern GPU: 16-48 Multi-threaded cores." If the F@H algorithm can be optimized to use the parallelism that exists within modern GPUs, it should also be optimizable for the parallelism of multi-core CPUs.
  • smitty3268 - Sunday, October 1, 2006 - link

    quote:

    As for the core stuff, this is something the Folding team explicitly brought up with us.


    I still don't really see what the actual problem is, but I'll certainly take their word for it. Maybe if I ever get a degree in biochemistry I'll try and figure out what's going on :)

    Thanks for the info. I think I'll go ahead and install F@H. It's something I've occasionally meant to do but I keep forgetting about it.
  • Furen - Sunday, October 1, 2006 - link

    I think it's about data dependency. Let's say you start 2000 processes on different PCs and run them for 1 unit of time. The result is 2000 processes at 1 unit of time, not 1 process at 2000 units of time, which is probably what you'd prefer. Having a massive speedup on a single node means that node can push a single "calculation" further along. I'd guess that the client itself is not multithreaded because of the threading overhead; it may not be worth the effort to optimize heavily for a dual-core speedup, since the overhead will take a chunk out of that, but a 40x speedup is another thing altogether.
  • JarredWalton - Sunday, October 1, 2006 - link

    The way FAH currently works is that pieces of a simulation are distributed; some will "fail" (i.e. fold improperly or hit a dead end) early, others will go for a long time. So they're trying to simulate the whole folding sequence under a large set of variables (temperature, environment, acid/base, whatever), and some will end earlier than others. Eventually, they reach the stage where most of the sequences are in progress, and new work units are generated as old WUs are returned. That's where the problem comes in.

    If we were still scaling to higher clock speeds, they could increase the size/complexity of simulations and still get WUs back in 1-5 days on average. If you add multiple cores at the same clock speed as earlier CPUs (i.e. an X2 3800+ is the same as two Athlon 64 3200+ CPUs), you can do twice as many WUs at a time, but you're still waiting the same amount of time for results that may be important for future WU creation.

    Basically, Pande Group/Stanford has simulations that they'd like to run that might take months on current high-end CPUs, and then they don't know how fast each person is really crunching away - that's why some WUs have a higher priority. Now they can do those on an X1900 and get the results in a couple days, which makes the work a lot more feasible to conduct.

    That's one scenario, at least.
  • ProviaFan - Sunday, October 1, 2006 - link

    The problem with F@H scaling over multiple cores is not what one might first think. I've run F@H on my Athlon X2 system since I bought it when the X2s became available in mid-2005. Since each F@H client is single-threaded, you simply install a separate command line client for each core (the GUI client can't run more than one instance of itself at once), and once they are installed as Windows services, they distribute nicely over all of the available CPUs. The problem with this is that each client has its own work unit with the requisite memory requirements, which with the larger units can become significant if you must run four copies of the client to keep your quad-core system happy. The scalability issues mentioned actually involve the difficulty of making a single client, with a single unit of work, multithreaded. I'm hoping that the F@H group doesn't give up trying to make this possible, because the memory requirements will become a serious issue with large work units when quad- and octo-core systems become more readily available.
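
    In the spirit of the one-client-per-core setup described above, here's a generic Python sketch of my own (with a placeholder command standing in for the real console client and its flags, which I won't guess at) that launches one independent worker per core, each in its own working directory:

        import os
        import subprocess
        import sys

        N_CORES = os.cpu_count() or 2
        procs = []
        for i in range(N_CORES):
            workdir = f"fah_instance_{i}"    # each instance needs its own directory
            os.makedirs(workdir, exist_ok=True)
            # Placeholder command; the real console client binary and flags go here.
            procs.append(subprocess.Popen([sys.executable, "-c", f"print('worker {i}')"],
                                          cwd=workdir))
        for p in procs:
            p.wait()

    The OS scheduler spreads the independent processes across cores on its own, which is exactly why this approach works today without any multithreading in the client itself.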
  • highlandsun - Sunday, October 1, 2006 - link

    Two thoughts - first, buy more RAM, duh. But that raises a second point - I've got 4GB of RAM in my X2 system. If resource consumption is really such a problem, how is a GPU with a measly 256MB of RAM going to have a chance? How much of the performance comes from having super-fast GDDR available, and what kind of slowdown do they see from having to swap data in and out of system memory?

    As for Crossfire (or SLI, for that matter), why does that matter at all? These things aren't rendering a video display anymore; they don't need to talk to each other at all. You should be able to plug in as many video cards as you have slots for and run them all independently.

    It sounds to me like these programs are compute-bound and not memory-bandwidth-bound, so you could build a machine with 32 PCIe x1 slots, toss 32 GPUs into it, and have a ball.
  • icarus4586 - Tuesday, October 3, 2006 - link

    It depends on the type of core, and the data it's working on. There's an option when you set up the client for whether or not to do "big" WUs. I've found that a "big" WU generally uses somewhere around 500MB of system memory, while "small" ones use 100MB or less. I would assume that they'd target it to graphics card memory sizes. Given that the high-end cards they're targeting have 256MB or 512MB of RAM, this should be doable.
  • gersson - Saturday, September 30, 2006 - link

    I'm sure my PC can do some good... 3.5GHz C2D and X1900 Crossfire.
    I've done some Folding@Home before but could never get into it. I'll give it a spin when the new client comes out.
  • Pastuch - Saturday, September 30, 2006 - link

    I just wanted to say thanks to Anandtech for writing this article. I have been an avid reader for years and an overclocker. People always talk about folding in the OC scene but I never took the time to learn just what folding@home is. I had no idea it was research into Alzheimer's. I'm downloading the client right now.
  • Griswold - Sunday, October 1, 2006 - link

    Unfortunately, many overclocked machines that are stable by their owners' standards don't meet the standards of such projects. The box may run rock stable, but can you vouch for the results being correct given that the system is running outside of its specs?

    If you read the forums of these projects, you will soon see that the people running them aren't too fond of overclocking. I've never seen any figures, but I bet there are many, many work units being discarded (yet you still get credit) because they're useless. However, the benefit still seems to outweigh the damage. There are just so many people contributing to the project because they want to see their name on a ranking list - without caring about the actual background. I guess this can be called a win-win situation.
  • JarredWalton - Sunday, October 1, 2006 - link

    If a WU completes - whether on an OC'ed PC or not - it is almost always a valid result. If a WU is returned as an EUE (or generates another error that apparently stems from OC'ing), then it is reassigned to other users to verify the result. Even non-OC'ed PCs will sometimes have issues with some WUs, and Stanford does try to be thorough - they might even send out all WUs several times just to be safe? Anyway, if you run OC'ed and the PC gets a lot of EUEs (Early Unit Ends), it's a good indication that your OC is not stable. Memory OCs also play a significant role.
  • nomagic - Saturday, September 30, 2006 - link

    I'd like to put some of my GPU power to some use too.
  • Griswold - Sunday, October 1, 2006 - link

    Read the article.
  • Furen - Saturday, September 30, 2006 - link

    Has ATI updated it at all? I don't have an ATI video card around here so I can't go check it out, but from what I've seen it was an extremely barebones application.
  • JarredWalton - Saturday, September 30, 2006 - link

    You mean the converter for AVIVO? It's still pretty basic, but it converts a lot faster than other applications I've used. AVIVO video decoding has definitely improved, though.
  • ViRGE - Saturday, September 30, 2006 - link

    AFAIK encoding still isn't hardware accelerated. It's just fast because it turns out the equivalent of the fast/low-quality mode from other encoders.
  • inoculate86 - Thursday, March 24, 2011 - link

    It would be really cool if some hardware review sites like AnandTech ran Folding@Home benchmarks in their CPU and GPU reviews, so people who are really into folding could more easily pick out the best hardware to buy. These days, it's not just video games people want performance for anymore!
