Quanta Winterfell RAM stability and RAPL Mode

As I previously posted, I got my hands on a few inexpensive Quanta Winterfell compute nodes.  The systems have been running well except for occasional RAM errors that I’ve been trying to debug.

I’ve been using PC3-10600R modules in 2GB, 4GB, and 8GB sizes, both single and dual rank.  The system seemed more stable when RAM was only inserted into the first slot of each bank (the ones with the white tabs), and produced more errors once the second slots were populated.  Even this wasn’t 100% consistent, though.

In FreeNAS this would appear as read errors on the console.  Under VMware it would appear as a complete system lockup and/or a pink screen of death with a message about memory errors.

Both systems ran more reliably when I forced the RAM speed to 1066MHz (instead of the default 1333MHz), but this of course reduced system performance and also wasn’t 100% consistent (some RAM combinations would still cause failures).

The BIOS is more of a developer-level BIOS, which means there are lots of settings available to tune all sorts of parameters, including RAM voltage and timing.  At first I tried increasing the timing values, thinking maybe there wasn’t enough timing margin.  This really didn’t help, so then I reduced the clock rate, which improved things but still left the systems flaky.

After some further testing I believe I’ve found a solution in the BIOS that results in stable performance.  Longer-term testing will be necessary, but initial results look good.

There is a setting called RAPL Mode.  RAPL stands for Running Average Power Limit, which is essentially a way to save power by limiting how much power the RAM gets to use.  There are different modes, which appear to be different algorithms for determining how much power the RAM should be using.  Each mode is a newer algorithm which is supposed to result in better performance.  By default the system is set to Mode 1.  My best guess is that RAPL is poorly implemented on these systems, which results in the RAM being starved of power and then getting corrupted.

On one system I have 56GB of RAM and am running FreeNAS.  In this case I could set the RAM speed to 1333MHz and use RAPL Mode 0.  To test, I copied data to the disk, which fills the ARC.  After 24 hours there have been no reported errors on the console (previously they would typically start within a few minutes at 1333MHz).

On another system I have 112GB of RAM and am running VMware ESXi.  With RAPL Mode 0 and 1333MHz the system hung immediately after a VM was migrated to it.  With RAPL disabled and 1333MHz the system has been up with no issues.  To test, I ran a 3D EM field solver that consumes 62GB of RAM when active on a task for about 45 minutes.
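
For anyone who wants a quick way to load RAM from inside a guest OS or VM without a field solver handy, here is a minimal sketch of a fill-and-verify test in Python.  This is not what I ran; the function name and its parameters are made up for illustration, and a script like this only touches whatever memory the OS hands it, so it is no substitute for a dedicated memory tester.

```python
def fill_and_verify(total_mib=64, chunk_mib=16, patterns=(0x00, 0xFF, 0xAA, 0x55)):
    """Fill roughly `total_mib` of RAM with each byte pattern, then read it back.

    Hypothetical helper for illustration only.
    Returns the number of chunks whose contents did not match what was written.
    """
    chunk_size = chunk_mib * 1024 * 1024
    n_chunks = max(1, total_mib // chunk_mib)
    bad_chunks = 0
    for pattern in patterns:
        # Reference copy of the pattern, one chunk long.
        expected = bytes([pattern]) * chunk_size
        # Allocate all chunks before checking any, so the whole region
        # is held in memory at the same time.
        chunks = [bytearray(expected) for _ in range(n_chunks)]
        for chunk in chunks:
            if chunk != expected:
                bad_chunks += 1
        # Release this pass before starting the next pattern.
        del chunks
    return bad_chunks
```

Calling something like fill_and_verify(total_mib=4096, chunk_mib=64) in a loop for a while and watching for a non-zero result (or a hang) is the idea.  Pick a size that leaves enough memory free for the rest of the system.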

So in summary, RAPL might save power, but it could corrupt your data.  I will be disabling this feature on these systems and will continue to monitor their performance.