For approximately the last year, my friend M and I have been trying to troubleshoot his triple GTX580 workstation. We discovered that when the system is in SLI mode, any heavy GPU processing causes the system to shut down. After monitoring the temperature of each component, we found that the graphics cards were reaching their temperature threshold and shutting the system down. This led us through a long and winding round of troubleshooting across software and hardware, including cleaning the GPU blocks, rearranging the cards and updating drivers.
Fast forward to this afternoon, when we had the chance to fully remove each card, reattach the block and test each card individually. What we found points to the Aqua blocks we were using as the source of the failure.
This is a photo of the block immediately after removal. Clearly the connection between block and card was incomplete, to the point that only 10-20% of the surface area was making contact. This results in terrible heat transfer and a serious increase in temperature. The thing is that we installed the block correctly according to the instructions: applying a thin layer of thermal paste and securely fastening each screw.
In order to correct the issue, we had to put a lot of thermal paste onto the chip. This is not optimal.
This is the only way we've been able to get a proper connection, and it results in a substantial bow in the PCB.
It appears to be a manufacturing issue, as installing the block correctly does not result in a proper connection. Is this a design error, a known manufacturing defect or a bad batch? All three blocks we're using exhibit the same problem, which appears to be an incorrect depth of the main chip pad.
EDIT: The end result of tightening the block and adding extra paste is rather efficient heat transfer. The cards are operating at approximately 35C at idle, and rarely go higher than 46C under load.