14 Aug

Temperatures Rising

I recently purchased an Nvidia RTX 3060 Ti on the second-hand market. The seller assured me the card was in great working order, and that temperatures were good in the small case he was testing it in before shipping. I trusted the seller, and skipped my own due diligence. Per MSI Afterburner and sources online, this card has a rated temperature of 83 degrees. This means that while the card will not break at that temperature, it will start throttling to keep the chip from being damaged. The maximum temperature for this card is listed as 93C, and Nvidia states the card will operate up to that temperature, then shut down if it goes any higher. Some modern cards also have a hot spot temperature reading available, which is the hottest point on the card, “measured from a network of thermal sensors across the GPU die”.
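To make those two thresholds concrete, here is a minimal Python sketch that sorts a core temperature reading into the three bands described above. The 83 and 93 degree figures are the ones quoted in this post; the function name and the band labels are made up for illustration, and the real behaviour lives in the card's firmware and driver, not in anything like this.

```python
# Thresholds as quoted in this post for the RTX 3060 Ti (degrees Celsius).
THROTTLE_TEMP_C = 83.0   # rated temperature: the card starts throttling here
SHUTDOWN_TEMP_C = 93.0   # listed maximum: above this the card shuts down


def classify_gpu_temp(temp_c: float) -> str:
    """Return a rough label for a GPU core temperature (hypothetical helper)."""
    if temp_c > SHUTDOWN_TEMP_C:
        return "shutdown"
    if temp_c >= THROTTLE_TEMP_C:
        return "throttling"
    return "normal"


if __name__ == "__main__":
    # Example readings, typed in by hand from a monitoring tool.
    for reading in (69.0, 83.0, 85.0, 94.0):
        print(f"{reading:5.1f} C -> {classify_gpu_temp(reading)}")
```

Note that this only looks at the core temperature. The hot spot reading is a separate sensor value and I have not found an official limit for it.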

The tl;dr is: If you have an RTX 3060 Ti (I can speak specifically to the Asus 3060 Ti Dual Mini v2) and the hot spot is running at over 100 degrees while playing a 20-year-old, albeit badly optimized, game, you may have issues. If the card is under warranty, get it looked at. If not, and you can afford to risk your card breaking – take it apart at your own peril, clean it and replace the thermal compound. You might be positively surprised by the results. I certainly was.

My only regret is I do not have all the before pictures. I was too stressed at the time to remember those, and my mind was not on this blog post that I had not yet written. Or thought of.

GeForce RTX 3060 Ti on the Windows 11 desktop with some applications open, as seen through the GPU-Z Sensors view. This was taken AFTER the rigamarole below.
MSI Afterburner showing the temp limit at 83 degrees with a 100% power limit.

In the GPU-Z picture above, we can see that the difference between the GPU Temperature and the Hot Spot is about 10-12 degrees. Per several forum threads, this seems to be in line with what it should be. But this isn’t how things started out. I was seeing temperature differentials of over 20 degrees, with the hot spot going as high as 35 C above the GPU temp during gaming or stress testing!
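If you want to check your own card against that rule of thumb, the comparison is simple enough to sketch in Python. This assumes you have copied pairs of (GPU temperature, hot spot) readings out of a monitoring tool by hand. The 15 degree cut-off is my own arbitrary midpoint between the "fine" and "not fine" gaps described above, not an official figure, and the function name is hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed cut-off in degrees Celsius. Not an Nvidia or Asus number: it simply
# sits between the 10-12 degree gap that looks healthy and the 20+ degree gap
# that did not.
MAX_HEALTHY_DELTA_C = 15.0


def suspicious_samples(
    samples: Iterable[Tuple[float, float]],
    max_delta_c: float = MAX_HEALTHY_DELTA_C,
) -> List[Tuple[float, float, float]]:
    """Return (gpu_temp, hot_spot, delta) for samples whose gap exceeds the cut-off."""
    flagged = []
    for gpu_c, hot_spot_c in samples:
        delta = hot_spot_c - gpu_c
        if delta > max_delta_c:
            flagged.append((gpu_c, hot_spot_c, round(delta, 1)))
    return flagged


if __name__ == "__main__":
    # Illustrative (GPU temperature, hot spot) pairs, not a real log.
    readings = [(68.0, 80.0), (84.0, 105.4), (72.0, 84.0)]
    for gpu_c, hot_spot_c, delta in suspicious_samples(readings):
        print(f"GPU {gpu_c} C, hot spot {hot_spot_c} C, gap {delta} C")
```

A large gap on its own does not prove anything, but together with crashes it is a decent hint that the thermal compound is worth a look.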

Signs and portents

But let’s take a step back. When I got the card, I immediately noticed that it had some dust on it, so I proceeded to clean it out with controlled puffs of compressed air, making sure the fans didn’t spin while doing that to prevent any undue stress on the motor/bearings. This should be pretty much standard fare when getting any used part. And the seller was absolutely right, the card was externally in great shape.

I plugged it into the machine, and ran some 3DMark tests. Initial results were not promising: I was about 2000 points under similar builds. I noticed that, like an idiot, I had some background apps on, with Steam downloading something or other at a high rate of speed. I made sure all apps were closed, that I was running in the high performance power mode in Windows, and ran the tests again. Now I was within a few hundred points of similar builds’ results. I’m not (anymore) one to go after the very last bits of performance, like I perhaps once was as a wee lad, so I called it a day.

Next order of business: temperatures and stress testing. The card was running a bit hot: 83-85 degrees during testing, but what really got me worried was the hot spot temperature. 105.4 degrees! Trawling around the interwebs for a while showed me several similar discussions. Some of them calmed my nerves, stating that the hot spot could be as high as 110 degrees – but I was unable to find an official source for this.

Eventually I just wanted to see if the thing was stable. Furmark it is, which, as someone called it, “is good for nothing more than a cooling test”. My computer crashed after just a few minutes of Furmark. The crash made the fans ramp up to 100%, the screens both went black, and nothing I did got me back to Windows. Hard reset was the only way to get back. I dismissed it as Furmark being Furmark – an absolute worst case for your GPU, which I think it even states somewhere. But inside, there was a gnawing sensation, tiny teeth nibbling at the edge of perception.
Something wasn’t right.

The third stage of grief

My mind usually goes into a kind of obsessive loop when I hit an issue like this. It’s not right, and it needs to be fixed. Then I start bargaining with myself. I tell myself, “Self: If this card is broken, it’s your fault somehow, and anyway you can fix it with cash – just buy a new one. Stop worrying.” But I never convince myself, not really. So I shut it all down, and take the card out.

I attempted to take the card apart to see if there was something wrong with the thermal paste. There was, but I wouldn’t find that out – not just yet. The card seemed easy enough to disassemble: just six screws holding the heatsink into place, sandwiching the PCB between the metal back plate and heatsink. Trusty iFixit toolkit in hand, I deftly took those screws out. The heatsink wouldn’t budge though, and being careful not to rip chips out of the PCB, I only gave it a cursory wiggle. As Assembly 2023 was fast approaching, I decided to put things back together instead of risking damage to card or nerves at this point. After all, I was able to play things just fine! The only crash was in Furmark. For now.

Assembly 2023 came and went, and my computer performed admirably during the whole event. I ran my case fans at 12V for the duration, hoping to deliver enough air through the case to get me through the worst temp spikes. Playing for 5 hours straight isn’t really a normal use case for me, and the machine was stable throughout the event. I started to relax slightly, but still that other part of me was googling for thermal pads and thermal compound. Just in case.

The second going

Back home. We were playing WOW Classic, and I had the settings pretty maxed out, with a 1440p resolution. We were riding our elfin mounts through the musky confines of Dustwallow Marsh when my computer crashed. Again. Black screens. Fans at max. I tried disconnecting my screens, keyboard and mouse, and then connecting everything back. No dice. The machine was dead crashed. Event Viewer showed issues with the DWM, and kernel issues referring to an Nvidia driver and nvlddmkm. This thread pretty much describes the same issue I had. The DWM had apparently crashed, then restored, then crashed again a total of three or four times, but then given up the ghost. Pressing Win+Ctrl+Shift+B to restart the graphics driver also did nothing. Hard reset again. If the card crashes while playing the, admittedly not very well optimized, 2005 hit game World of Warcraft, there is an issue. It had to be solved.

Hands on

I took the card out again, removing the six screws. Using a hair dryer, I warmed up the card (actually using the lowest heat and blow setting). After a minute or two, the heatsink came off easily. I instantly saw what the problem was. The thermal compound on the GPU die was dry, flaky and caked. Because of my bewilderment at this at most two-year-old card having dry thermal compound, I did not take a picture of the GPU in this state. I do have pictures of what the compound looked like after I scraped it off:

I used a piece of paper and the plastic “pick” from the iFixit kit, and some pure isopropyl alcohol to clean off the GPU die. The thermal pads looked fine to me, so I left those alone. I didn’t have a replacement for those anyway, and due to their thickness and the components in contact with them, you cannot replace them with thermal compound.

GPU cleaned off. Some residue from the thermal pads can be seen on the memory chips surrounding the GPU chip.
The flipside, yo. Here we can see the cleaned off (mostly) copper pad of the heatsink, and the various strips and bits of thermal pad, which contact the memory chips as well as power delivery components on the PCB.
Another shot of the heatsink showing the surprisingly thick thermal pads! I didn’t have a caliper on hand, but I would estimate that those are at a minimum 3mm thick. But this isn’t an official number; ASUS doesn’t list that, nor is it in the thermal pad size database I linked to above. The pads were soft to the touch, and relatively clean and uniformly colored, so I left them alone.
Another shot of the thermal pads which, with the card assembled, would rest on the power delivery components.

I used a dab of Thermal Grizzly Kryonaut, my go-to crème, and put the card back together, remembering to attach the cable going to the GPU fans. Before that, I also took an old toothbrush to the heatsink. Some of that gunk was really stuck on there, unable to be evicted by compressed air.

The exciting conclusion

This story has a happy ending. The card cooled down by 20-40 degrees depending on use case, and this is looking at the hot spot. Here’s the first 3DMark stress test (20 minute Time Spy thing), showing the highest GPU temps, somewhere between 72 and 75 degrees. Note the hot spot temperature: less than 85! That’s down more than 20 degrees from what I saw in WOW Classic before the new thermal compound!
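For the numerically inclined, the before and after figures can be put side by side with a few lines of Python. This is just arithmetic on numbers quoted in this post: the 83-85 degree core temperature and 105.4 degree hot spot from the earlier stress testing, against the 72-75 and "less than 85" seen here. The workloads were not identical, so treat it as a rough comparison; the helper names are made up.

```python
# Figures quoted in this post (degrees Celsius). The "after" hot spot is an
# upper bound, since all I can say is that it stayed under 85.
BEFORE_GPU_RANGE_C = (83.0, 85.0)
BEFORE_HOT_SPOT_C = 105.4
AFTER_GPU_RANGE_C = (72.0, 75.0)
AFTER_HOT_SPOT_MAX_C = 85.0


def min_hot_spot_drop(before_c: float, after_max_c: float) -> float:
    """Smallest possible hot spot improvement, given an upper bound on the new reading."""
    return round(before_c - after_max_c, 1)


def gpu_drop_range(before_range, after_range):
    """Smallest and largest possible core temperature drop between two (low, high) ranges."""
    smallest = before_range[0] - after_range[1]
    largest = before_range[1] - after_range[0]
    return smallest, largest


if __name__ == "__main__":
    low, high = gpu_drop_range(BEFORE_GPU_RANGE_C, AFTER_GPU_RANGE_C)
    print(f"Core temperature dropped by {low:.0f} to {high:.0f} degrees")
    drop = min_hot_spot_drop(BEFORE_HOT_SPOT_C, AFTER_HOT_SPOT_MAX_C)
    print(f"Hot spot dropped by at least {drop} degrees")
```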

Pardon the quality, I snapped this with my phone because I didn’t want to disturb the stress test. I guess it would have been even more stressful to take a damn screenshot…

Also you can see the card is actually boosting in this screenshot, to 1770 MHz. Something it didn’t often do before. Whatever I’ve done to the card now, the GPU Temperature and Hot Spot temperatures do not diverge by more than 10-13 degrees.

I still run WOW Classic at slightly lower settings, just because most of that stuff is not even noticeable. In the Feralas region, I get GPU temps of 66-69 degrees, with the hot spot reaching 80-84, depending on what I’m doing and where I am looking. I haven’t gone back to Dustwallow Marsh, not yet. Those wounds are still too fresh.
