Earlier this summer, Amazon released its MMO, New World, for beta testing. It didn’t take long for reports of failed RTX 3090s to begin popping up online, with the failures concentrated specifically in EVGA’s cards. Back when Ampere launched, some wondered whether the different capacitors used on some models were a potential source of instability.
The strange part of the story was the fact that software almost never kills hardware. There are two exceptions to this rule: firmware updates (which stretch the definition of “software”) and applications such as Furmark and Prime95 that behave somewhat like power viruses. Even then, well-cooled components with adequate VRM cooling typically do not fail, and the bare handful of applications that can cause component failure almost never do so without contributing factors like the age of the equipment and its operating temperature.
EVGA pledged to replace the affected cards; it analyzed 24 of them and shared the results with PCWorld. According to Gordon Mah Ung: “Under an X-ray analysis, they [the affected RTX 3090s] appear to have ‘poor workmanship’ on soldering around the card’s MOSFET circuits that powered the impacted cards.” The failures were confined to the RTX 3090 family, and EVGA says it shipped new cards to gamers immediately rather than waiting for units to be returned.
MOSFETs (metal-oxide-semiconductor field-effect transistors) are a critical component in power circuitry. Poor soldering around the MOSFETs can raise resistance, and with it temperature, to the point that the hardware fails.
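The mechanism comes down to one line of arithmetic: Joule heating in a conductor is P = I²R, so even a modest rise in joint resistance multiplies the heat dumped into a spot that was never designed to dissipate it. A minimal sketch (the current and resistance figures below are hypothetical illustrations, not EVGA’s measurements):

```python
def joule_heating_w(current_a, resistance_ohm):
    """Power dissipated in a resistive element: P = I^2 * R."""
    return current_a ** 2 * resistance_ohm

# Hypothetical numbers: 30 A flowing through a solder joint.
good_joint = joule_heating_w(30.0, 0.001)  # 1 milliohm joint  -> 0.9 W
bad_joint = joule_heating_w(30.0, 0.010)   # 10 milliohm joint -> 9.0 W
```

Same current, ten times the resistance, ten times the heat, all concentrated in a joint a fraction of a millimeter across.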
Why Did RTX 3090s Fail When Running New World, Specifically?
Here’s where we have to split a hair. The EVGA cards died because they suffered from manufacturing defects, not because of something wrong with New World. But New World exposed the failure point because of design decisions Amazon made.
GPUs are enormous parallel calculation machines, but they don’t have a subjective understanding of “enough.” To a GPU, “enough” is either “the frame rate I’ve been ordered to deliver” or “as fast as I possibly can without violating one of my pre-programmed constraints on temperature, voltage, and current.” There’s nothing in between.
It’s possible that the RTX 3090s died because menu screens require very little data to render. If the rendering workload fits within on-die L2 cache or requires minimal memory access, it can be fed to the execution units as quickly as they can consume it. A simple workload with no bottlenecks is more likely to scale to the hardware’s limits. In a more complicated scenario (i.e., actual gameplay), the GPU’s execution efficiency falls and frame rates drop correspondingly, which limits heat build-up on the chip.
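A rough CPU-side analogy (pure Python, nothing GPU-specific): hand the same uncapped loop a trivial frame and a heavy frame, and count how many of each finish in a fixed wall-clock window. The two workloads are arbitrary stand-ins:

```python
import time

def frames_completed(per_frame_work, window_s=0.25):
    """Count how many 'frames' of a given workload finish in a fixed
    wall-clock window -- a stand-in for an uncapped frame rate."""
    deadline = time.perf_counter() + window_s
    frames = 0
    while time.perf_counter() < deadline:
        per_frame_work()
        frames += 1
    return frames

# Arbitrary stand-in workloads: a near-empty "menu" frame vs. a
# "gameplay" frame doing 1,000x the work per frame.
menu_frames = frames_completed(lambda: sum(range(100)))
gameplay_frames = frames_completed(lambda: sum(range(100_000)))
```

The trivial frame completes orders of magnitude more iterations in the same window; on a GPU, every one of those iterations lights up the whole chip.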
The fact that the overwhelming majority of RTX 3090 cards didn’t die is proof that Amazon’s uncapped frame rate wasn’t enough of a problem to harm most cards. It killed a handful of GPUs that had pre-existing manufacturing defects, however, because doing nearly nothing as quickly as possible likely generated more heat than doing lots of complex things at a much lower frame rate.
I have a theory that this is a larger problem for GPUs than CPUs because of fundamental differences in their design. A GPU is designed to perform one operation across a huge range of cores simultaneously. A CPU is designed to execute complex, branchy code as quickly as possible. If you write a simplistic application on the CPU, what you’ll get is a single-core, single-thread program with SSE2 optimization — or x87, if you really want to get historic. If you want to run that code at peak performance, you have to optimize it for AVX2 or AVX-512. It needs to support multiple cores. The naive implementation on x86 doesn’t lend itself to high execution efficiency across a modern, multi-core chip. Storing the code in L1 or L2 will make it execute more quickly, but without multi-core support, you’ll still only be stressing a single core.
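A sketch of that naive-versus-optimized gap on the CPU side. Vectorization is hard to demonstrate portably, but the multi-core half is easy to show: the naive summation below occupies one core, and only after explicitly restructuring the work does it touch the rest of the chip. The function names and chunking scheme are illustrative, not any particular library’s API:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def naive_sum(n):
    """The 'simplistic application': one loop, one thread, one core."""
    total = 0
    for i in range(n):
        total += i
    return total

def chunk_sum(bounds):
    """Sum one slice of the range; runs in a worker process."""
    lo, hi = bounds
    return sum(range(lo, hi))

def parallel_sum(n, workers=None):
    """Same arithmetic, restructured so every core gets a slice -- the
    extra effort x86 code needs before it can stress the whole chip."""
    workers = workers or os.cpu_count() or 1
    step = max(1, n // workers)
    bounds = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum, bounds))
```

Both functions compute the same answer; the point is that the parallel version had to be deliberately built, while a GPU kernel gets chip-wide fan-out by default.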
This is not to say that one cannot write an x86 power virus, or make coding mistakes that drive up power consumption, but GPUs seem to offer a swifter path to the same outcome. “Perform this minimally bottlenecked operation across the entire chip as quickly as you can” is an easier way to stress a GPU because GPUs are built to run that kind of operation to begin with.
Purely a theory on my part, but it would also explain why specific feature tests and sometimes older titles are a good way to test maximum power consumption on GPUs. Asking the card to do less sometimes encourages it to run hotter than asking it to do more.