Opteron Meltdown?

Posted by cwright on 2007.03.20 @ 10:38

For one of Kosada’s consulting clients, we’ve set up this dual-core AMD Opteron server. Normally it runs all happy-like and does what it needs to. We take care of its basic needs, and it takes care of the rest.

Opteron after meltdown

However, Tyan — the motherboard manufacturer — saw fit to include only 1 of the typical 3 heat sink mounting tabs. Somehow, this 1 tab managed to suddenly break on Sunday, March 18, at about 4:45pm.

And now the processor has Opteron Cancer.

Opterons are not designed to run without a heat sink, so of course this led to an almost instantaneous overheat. The BIOS is configured to hard-power-off the machine under high-temperature conditions, so as to avoid thermal damage.

On the plus side, we were able to be on-site within a few hours of when this all happened, so we could assess the situation and make appropriate repairs. We replaced the mounting bracket with a spare, and continued on our way.

As the day progressed, our monitoring software showed some interesting behavior though.

Munin Temperature Graph demonstrating erratic behavior after heatsink break

The CPU temperature was abnormally high for mostly-idle conditions. In fact, it was as hot now as it used to be under full load. To make matters worse, the temperature wasn’t even stable, but instead swung erratically between about 50 and 70 degrees Celsius. We decided to replace the thermal paste, which conducts heat from the CPU to the heat sink, in the hope that the old paste was simply no longer performing well after the heat sink had been removed and replaced a few times.
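For the curious, the two symptoms we saw on the graph (idle temperatures that are too high, and a wide swing between readings) can be sketched as a simple check. This is only a rough illustration, not what our monitoring actually runs: the function name, the 45-degree idle limit, and the 10-degree swing limit are all assumptions picked for the example, and the sample readings are made up to resemble the 50 to 70 degree range described above.

```python
from statistics import mean


def summarize_idle_temps(readings_c, expected_idle_max_c=45.0, max_swing_c=10.0):
    """Summarize idle CPU temperature samples, in degrees Celsius.

    Both thresholds are illustrative assumptions, not vendor limits.
    """
    if not readings_c:
        raise ValueError("need at least one reading")
    avg = mean(readings_c)
    swing = max(readings_c) - min(readings_c)
    return {
        "mean_c": avg,
        "swing_c": swing,
        "too_hot": avg > expected_idle_max_c,  # hotter than idle should be
        "erratic": swing > max_swing_c,        # readings jump around too much
    }


# Made-up idle samples resembling the post-repair graph.
samples = [52, 68, 55, 70, 50, 63]
report = summarize_idle_temps(samples)
print(report["too_hot"], report["erratic"])
```

With a healthy heat sink, idle readings would sit close together and well under the assumed limit, so both flags would come back false.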

This thermal-paste-replacement procedure revealed an even more shocking discovery, however!

On the top of the CPU was some form of malignant cancerous tumor!

Opteron during server installation:


Opteron meltdown, after thermal paste was removed:


This tiny addition to the CPU adversely affects the transfer of thermal energy from the CPU to the heat sink. Because of this interruption, the temperature is now considerably higher, and more erratic. Basic attempts to pry the addition off, as if it were a foreign object, were to no avail.

Wherever this meltdown-bubble came from, and whatever it is, it is here to stay.

At least until we get that replacement CPU.

I have this same board. I found Tyan’s mounting system and AMD’s OEM heatsink retainers with all that plastic to be cheap-ass feeling.

I’ve since switched to the Intel Woodcrest/Conroe design, and never looked back. It’s faster and lower power, and you can actually buy them easily, whereas the Opteron Socket F / 2200 gear is hard to find.

Also, even with boxed processors, AMD sucks at warranty fulfillment. They never seem to have the same model/stepping as the CPU that broke available as a warranty replacement. This is especially bad in the 8xx/8xxxx series, where getting warranty fulfillment for me has been impossible.

I’ve also written off Tyan and completely switched to Supermicro and Intel reference servers.

Also, based on your first picture, that circle-outline makes it look like the thermal paste originally used was making very poor contact with the heat spreader.

When I mounted mine with Arctic Silver 5 (instead of using the Shin-Etsu G751 integrated on the OEM heatsink), I wore plastic gloves (to avoid mixing hand/body oils with the grease/paste), spread a very thin film on both the heat spreader and the sink, and then put a pea-sized amount in the middle.

I’ve been running Mersenne MPRIME 2.53 at torture test #2 (maximum heat) and I can’t break 50C core temp.

Mick, you’re dead on regarding the cheap feel to these parts. After repeated failures, we’ve gained a sure knowledge of that as well.

The circle on the chip appeared after the thing on the chip formed. Obviously, that caused terrible contact with the surface. Prior to that, we had cleaned the CPU and found no evidence of poor contact like that, so the “cancer” was the cause of it. That said, the grease was laid on a bit heavy for my taste.

I personally have never had a problem with model/stepping-level differences, so I wouldn’t consider a match essential. It is, however, bad form if nothing else. It’s probably more of an issue in multi-CPU (not multi-core) setups, which I’ve not dealt with extensively.

When our heat sink properly connects, we also enjoy the cool core temps you mentioned, which is quite nice. Was that for your Woodcrest/Conroe setup, or your AMD gear?