AMD EPYC 7002 processors freeze after 1044 days of operation due to a bug

The problem is that the cores do not come out of power saving mode

Recently, information was released about a rather peculiar bug in the AMD EPYC 7002 (“Rome”) series of server processors, which are based on the “Zen 2” microarchitecture and have been shipping since 2019.

The bug in question causes the processor to hang after 1044 days of continuous operation, a situation that is fairly uncommon in practice.

A short notice from AMD indicates that 2nd generation EPYC server processors are affected by an issue that prevents cores from exiting the Core C6 State power saving mode (CC6) after a long period of continuous operation. AMD also points out that 1044 days is not an absolute value: the failure can occur earlier or later, depending on the frequency of REFCLK, the reference clock the processor uses to keep track of time, and on some other factors. The manufacturer does not explain why the failure occurs, so the root cause is still not publicly known.

The failure puts the processor in a "zombie" state, in which it does not accept any commands or external interrupt requests, and it remains in this state until it is restarted.

C-states start at C0, which is the normal operating mode of the CPU. The higher the C number, the deeper the CPU goes into sleep mode and the more signals are turned off. The deeper the sleep state, the longer the CPU needs to fully wake up.

With this bug, once a CPU enters C6 past the 1044-day mark, it gets stuck and requires a reboot. The workaround is to reboot the server before it reaches roughly three years of uptime, or to disable the sleep state that triggers the error.
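As a rough illustration of the "reboot before the limit" workaround, the following sketch compares a machine's uptime with the 1044-day figure. It assumes a Linux host that exposes uptime in /proc/uptime; the function names and the 90-day warning margin are hypothetical choices made for this example and do not come from AMD. Since AMD says 1044 days is not an exact value, the result should be treated as an estimate only.

```python
from pathlib import Path

SECONDS_PER_DAY = 86400
THRESHOLD_DAYS = 1044      # figure quoted by AMD; not an absolute value
WARNING_MARGIN_DAYS = 90   # hypothetical safety margin, chosen for this example


def days_until_threshold(uptime_seconds, threshold_days=THRESHOLD_DAYS):
    """Return the days left before uptime reaches the threshold (negative if past it)."""
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot_soon(uptime_seconds, margin_days=WARNING_MARGIN_DAYS):
    """True when the remaining time is inside the warning margin."""
    return days_until_threshold(uptime_seconds) <= margin_days


def read_uptime_seconds(path="/proc/uptime"):
    """Read seconds since boot; on Linux it is the first field of /proc/uptime."""
    return float(Path(path).read_text().split()[0])


if __name__ == "__main__":
    try:
        uptime = read_uptime_seconds()
    except OSError:
        print("Could not read /proc/uptime (not a Linux host?)")
    else:
        remaining = days_until_threshold(uptime)
        status = "plan a reboot" if needs_reboot_soon(uptime) else "ok"
        print(f"{remaining:.1f} days until the 1044-day mark: {status}")
```

A script like this could be run periodically from a monitoring job so that a maintenance window can be planned well ahead of the limit.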

AMD does not provide a more detailed explanation of the cause of the failure. According to a hypothesis posted on Reddit:

The hang occurs when the counter in the TSC (Time Stamp Counter) register, which counts the number of clock cycles since reset, reaches the value 0x380000000000000 at a frequency of 2800 MHz (2800 * 10**6 cycles per second for 1042.5 days), that is, after 1042 days and 12 hours.
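The arithmetic behind this hypothesis can be checked with a short sketch. The counter value and the 2800 MHz frequency are the ones quoted in the Reddit post; the function names are made up for this example, and the calculation says nothing about whether the hypothesis itself is correct.

```python
SECONDS_PER_DAY = 86400

TSC_THRESHOLD = 0x380000000000000  # counter value quoted in the Reddit post
TSC_HZ = 2800 * 10**6              # 2800 MHz, the frequency assumed in the post


def ticks_to_days(ticks, hz):
    """Convert a tick count at a given frequency (Hz) into days."""
    return ticks / hz / SECONDS_PER_DAY


def split_days_hours(days):
    """Split a fractional number of days into whole days and leftover hours."""
    whole_days = int(days)
    hours = (days - whole_days) * 24
    return whole_days, hours


days = ticks_to_days(TSC_THRESHOLD, TSC_HZ)
whole_days, hours = split_days_hours(days)
print(f"{days:.2f} days, i.e. {whole_days} days and about {hours:.0f} hours")
```

The result is close to 1042.5 days. That is slightly below AMD's 1044-day figure, which fits AMD's remark that the exact time depends on the clock frequency and other factors.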

AMD has also said that it will not release a fix for the bug. The issue went unnoticed for a long time because multi-year uptimes are not typical for servers, which need to be rebooted periodically to install kernel updates or to migrate to a new OS version.

However, the rebootless kernel update methods offered by Linux distributions, combined with long maintenance cycles (Ubuntu, RHEL, and SUSE are supported for 10 years), can result in servers running for very long periods without a reboot.

Company representatives said that there are currently two ways to work around the problem: server owners should either reboot the system before the 1044 days are up, which resets the timer, or completely disable the Core C6 State power saving mode. Both options are likely to be unattractive to owners of these servers. The power saving mode saves a lot of money on power consumption, so few will want to turn it off, and waiting for the error to occur and then rebooting a frozen system is not a convenient solution either, especially when it comes to really important infrastructure components.

It is worth mentioning that errors of this kind are not rare in processors, whether for servers or desktops. Commercial models often ship with many bugs, which manufacturers then try to patch with a new revision or with software and firmware fixes.

Finally, if you are interested in knowing more about it, I invite you to consult the information published by AMD.

