Opteron Meltdown?

For one of Kosada’s consulting clients, we’ve set up this dual-core AMD Opteron server. Normally it runs all happy-like and does what it needs to. We take care of its basic needs, and it takes care of the rest.

Opteron after meltdown

However, Tyan – the motherboard manufacturer – saw fit to include only 1 of the typical 3 heat sink mounting tabs. Somehow, this 1 tab managed to suddenly break on Sunday, March 18, at about 4:45pm.

And now the processor has Opteron Cancer. Opterons are not designed to run without a heat sink, so of course this led to an almost instantaneous overheat. The BIOS is configured to hard-power-off the machine under high-temperature conditions, so as to avoid thermal damage.

On the plus side, we were able to be on-site within a few hours of when this all happened, so we could assess the situation and make appropriate repairs. We replaced the mounting bracket with a spare, and continued on our way.

As the day progressed, though, our monitoring software showed some interesting behavior.

Munin Temperature Graph demonstrating erratic behavior after heatsink break

The CPU temperature was abnormally high for mostly-idle conditions. In fact, it was as hot now as it used to be under full load. To make it worse, the temperature wasn’t even stable: it swung erratically between about 50 and 70 degrees Celsius. We decided to replace the thermal paste, which carries heat from the CPU to the heat sink, in the hope that it had simply stopped working well after the heat sink was removed and reseated a few times.
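For the curious, the two symptoms we were eyeballing on the graph (too hot at idle, and too jumpy) can be sketched as a simple check. This is only an illustration, not what our monitoring actually runs: the function name `looks_unhealthy`, both thresholds, and the sample readings are made up for the example.

```python
from statistics import mean


def looks_unhealthy(samples_c, idle_limit_c=50.0, max_swing_c=10.0):
    """Return True if idle CPU temperatures run hot or swing widely.

    samples_c    -- temperature readings in degrees Celsius
    idle_limit_c -- hypothetical ceiling for the average idle temperature
    max_swing_c  -- hypothetical ceiling for (max - min) across the samples
    """
    if not samples_c:
        raise ValueError("need at least one temperature sample")
    average = mean(samples_c)
    swing = max(samples_c) - min(samples_c)
    return average > idle_limit_c or swing > max_swing_c


# Made-up readings for illustration only, not taken from the real graph.
erratic = [52.0, 68.0, 55.0, 70.0, 51.0, 66.0]
steady = [41.0, 42.0, 41.5, 42.5, 41.0, 42.0]

print(looks_unhealthy(erratic))  # True
print(looks_unhealthy(steady))   # False
```

A check like this only says that something is off. It can't say whether the culprit is the paste, the mounting, or the CPU itself.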

This thermal-paste-replacement procedure revealed an even more shocking discovery, however!

On the top of the CPU was some form of malignant cancerous tumor!

Opteron during server installation:

Opteron meltdown, after thermal paste was removed:

This tiny addition to the CPU gets in the way of heat moving from the CPU to the heat sink. Because of this interruption, the temperature is now considerably higher, and more erratic. Basic attempts to pry the addition off, as if it were foreign matter, were to no avail.

Wherever this meltdown-bubble came from, and whatever it is, it is here to stay.

At least until we get that replacement CPU.