Gen 2 AMD server chips have a crash bug

A minor bug can cause a system crash after 1,044 days of uninterrupted uptime. Be sure to reboot before then.

Semiconductors, especially CPUs, are immensely complex creations, all built at the microscopic level. That there aren’t more bugs, for lack of a better word, is a testament to the effort that chipmakers put into delivering solid products. But occasionally, something slips by.

AMD has issued an alert that an older processor line has a minor error. The problem exists in its Epyc 7002 line, code-named Rome, which was released in 2019. According to the bug report, first noted in a Reddit thread, servers running Rome-era chips can hang after 1,044 days of uptime, or nearly three years.

The only way to recover a hung server is to reboot it. AMD says it will not fix the issue.

"AMD has successfully provided a remedy for an isolated challenge regarding 2nd Gen AMD EPYC processors where for some customers, a core within the processor could hang if running consistently for an extended period of time," a company spokesperson said via email.

The bug is in what’s known as the C6 sleep state. To save energy when the CPU is idle, it can go into a low-power mode. CPUs have several power modes, which are collectively called "C-states" or "C-modes." Intel first introduced them with the 486 processor, so the idea is hardly new.

These C-state modes start at C0, which is the normal CPU operating mode. The higher the C number is, the deeper into sleep mode the CPU goes and the more signals are turned off. The deeper the sleep state, the more time the CPU needs to fully wake up.
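To make the trade-off between sleep depth and wake-up time concrete, here is a minimal sketch that lists the idle states the Linux kernel exposes for one CPU, along with the exit latency it reports for each. It assumes a Linux system with the cpuidle sysfs interface; the helper name `read_idle_states` is hypothetical, and the state names the kernel reports are its own labels, which may not match the hardware's C-state numbering.

```python
from pathlib import Path


def read_idle_states(cpuidle_dir: Path) -> list[tuple[str, int]]:
    """Return (name, exit latency in microseconds) for each idle state.

    Expects a directory laid out like /sys/devices/system/cpu/cpu0/cpuidle,
    with one stateN subdirectory per idle state.
    """
    # Sort numerically so that state10 comes after state9, not after state1.
    state_dirs = sorted(
        cpuidle_dir.glob("state[0-9]*"),
        key=lambda p: int(p.name[len("state"):]),
    )
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        latency_us = int((state_dir / "latency").read_text().strip())
        states.append((name, latency_us))
    return states


if __name__ == "__main__":
    cpu0 = Path("/sys/devices/system/cpu/cpu0/cpuidle")
    if cpu0.is_dir():
        for name, latency_us in read_idle_states(cpu0):
            print(f"{name}: exit latency {latency_us} us")
    else:
        print("cpuidle sysfs directory not found; this sketch assumes Linux")
```

On a typical machine the reported latency grows with the state index, which is the "deeper sleep, slower wake-up" relationship described above.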

With this bug, once a CPU goes into C6 past the 1,044-day mark, it gets stuck and a reboot is required. The fix is either to reboot the server before the 1,044-day mark or to disable the sleep state that causes the bug.
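For admins who take the reboot route, a small check can flag machines that are getting close to the limit. The sketch below assumes a Linux host, where the first field of /proc/uptime is seconds since boot. The 30-day warning margin, the function names, and the message strings are illustrative choices, not anything AMD specifies.

```python
from pathlib import Path

# Threshold reported in the article: a core can hang after 1,044 days of uptime.
HANG_THRESHOLD_DAYS = 1044
SECONDS_PER_DAY = 86_400


def parse_uptime_days(proc_uptime_text: str) -> float:
    """Convert the contents of /proc/uptime to days since boot."""
    uptime_seconds = float(proc_uptime_text.split()[0])
    return uptime_seconds / SECONDS_PER_DAY


def days_until_threshold(uptime_days: float,
                         threshold_days: float = HANG_THRESHOLD_DAYS) -> float:
    """Days left before the threshold; zero or negative once it is reached."""
    return threshold_days - uptime_days


def reboot_advice(uptime_days: float, margin_days: float = 30.0) -> str:
    """Classify a host by how close it is to the threshold (margin is an assumption)."""
    remaining = days_until_threshold(uptime_days)
    if remaining <= 0:
        return "past threshold: reboot now"
    if remaining <= margin_days:
        return "within margin: schedule a reboot"
    return "ok"


if __name__ == "__main__":
    proc_uptime = Path("/proc/uptime")
    if proc_uptime.exists():
        days = parse_uptime_days(proc_uptime.read_text())
        print(f"uptime {days:.1f} days: {reboot_advice(days)}")
    else:
        print("/proc/uptime not found; this sketch assumes Linux")
```

Run from cron or a monitoring agent, a check like this turns the 1,044-day limit into a scheduled maintenance window rather than a surprise hang.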

That this bug even surfaced is testament to the CPU's performance; three years of uninterrupted uptime is remarkable.

You might think server updates would have dictated a reboot along the way, but then again, the Linux kernel can be patched without a reboot.

Significant CPU bugs do happen but not very often, and this certainly isn't one of them.

Copyright © 2023 IDG Communications, Inc.
