Quanta Winterfell RAM stability and RAPL Mode

As I previously posted, I got my hands on a few inexpensive Quanta Winterfell compute nodes.  The systems have been running well except for occasional RAM errors that I’ve been trying to debug.

I’ve been using PC3-10600R modules in 2GB, 4GB, and 8GB sizes, both single and dual rank.  The system seemed more stable when RAM was installed only in the first slot of each bank (the slots with the white tabs), and produced more errors once the second slots were populated.  This wasn’t 100% consistent though.

In FreeNAS this would appear as read errors on the console.  Under VMware it would appear as a complete system lockup and/or a purple screen of death with a message about memory errors.

Both systems would run more reliably if I were to force the RAM speed to 1066MHz (instead of the default 1333MHz), but this of course reduced system performance and also wasn’t 100% consistent (some RAM combinations would still cause failures).
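To put a rough number on that performance cost, here is a small sketch of the theoretical peak bandwidth per memory channel at each speed. It assumes the 1333 and 1066 figures are DDR3 transfer rates in MT/s on a standard 64-bit channel; the function name is just for illustration, and real-world throughput will be lower than these peaks.

```python
def peak_bandwidth_mb_s(transfer_rate_mt_s: float, bus_width_bits: int = 64) -> float:
    """Theoretical peak bandwidth of one memory channel in MB/s.

    Assumes one transfer moves bus_width_bits of data (64 bits for a
    standard DDR3 channel, ECC bits not counted).
    """
    return transfer_rate_mt_s * bus_width_bits / 8


default_speed = peak_bandwidth_mb_s(1333)  # BIOS default
reduced_speed = peak_bandwidth_mb_s(1066)  # forced lower speed
loss_pct = 100 * (1 - reduced_speed / default_speed)

print(f"1333: {default_speed:.0f} MB/s per channel")
print(f"1066: {reduced_speed:.0f} MB/s per channel")
print(f"Peak bandwidth lost: {loss_pct:.1f}%")
```

So dropping the speed gives up roughly a fifth of the peak memory bandwidth per channel, which is why it is a workaround rather than a fix.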

The BIOS is more of a developer-level BIOS, which means there are lots of settings available to tune all sorts of parameters, including RAM voltage and timing.  At first I tried increasing the timing values, thinking maybe there wasn’t enough timing margin.  This really didn’t help, so I then reduced the clock rate, which improved things but was still flaky.

After some further testing I believe I’ve found a solution in the BIOS that results in stable performance.  Longer-term testing will be necessary, but initial results look good.

There is a setting called RAPL Mode.  RAPL stands for Running Average Power Limit, which is essentially a way to save power by limiting how much power the RAM gets to use.  There are different modes, which appear to be different algorithms for determining how much power the RAM should be using.  Each mode appears to be a newer algorithm that is supposed to result in better performance.  By default the system is set to Mode 1.  My best guess is that it is poorly implemented on these systems, which results in the RAM being starved of power and then getting corrupted.
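To illustrate the general idea of a running average power limit, here is a minimal sketch. It is not the algorithm used by this BIOS or by any of its RAPL modes (I don’t know what those actually do); the class name, the window size, and the wattage numbers are all made up for illustration.

```python
from collections import deque


class RunningAveragePowerLimiter:
    """Toy model: throttle when the average of recent power samples exceeds a limit."""

    def __init__(self, limit_watts: float, window: int):
        self.limit_watts = limit_watts
        # Only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def average(self) -> float:
        """Average power over the current window (0.0 if no samples yet)."""
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)

    def record(self, watts: float) -> bool:
        """Add a power sample and return True if throttling would be requested."""
        self.samples.append(watts)
        return self.average() > self.limit_watts


# Hypothetical numbers: a 10 W limit averaged over the last 4 samples.
limiter = RunningAveragePowerLimiter(limit_watts=10.0, window=4)
for watts in [6.0, 8.0, 18.0, 16.0, 4.0, 4.0, 4.0]:
    throttle = limiter.record(watts)
    print(f"sample={watts:5.1f} W  avg={limiter.average():5.2f} W  throttle={throttle}")
```

Note that a short burst above the limit is tolerated as long as the average stays under it, and that throttling can persist for a while after the load drops. If the limit or the window is badly chosen, the memory could be throttled at exactly the wrong time, which would be consistent with what I’m guessing is happening here.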

On one system I have 56GB of RAM and am running FreeNAS.  In this case I could set the RAM speed to 1333MHz and use RAPL Mode 0.  To test, I copied data to the disk, which fills the ARC.  After 24 hours there have been no reported errors on the console (previously, errors would typically start within a few minutes at 1333MHz).

On another system I have 112GB of RAM and am running VMware ESXi.  With RAPL Mode 0 and 1333MHz the system hung immediately after a VM was migrated to it.  With RAPL disabled and 1333MHz the system has been up with no issues.  To test, I ran a 3D EM field solver that consumes 62GB of RAM while working on a task for about 45 minutes.
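Both of the tests above rely on a real workload touching a large part of the RAM. For a quick sanity check without such a workload, something like the following sketch could be used: it fills a chunk of memory with a known byte pattern and then verifies it. This is not what I ran, the function names and sizes are hypothetical, and it is no substitute for a proper tool like memtest86+, since it only touches whatever memory the OS hands to the process.

```python
def fill_blocks(total_bytes: int, block_bytes: int = 1 << 20, pattern: int = 0xA5) -> list:
    """Allocate total_bytes of memory in blocks, each filled with the pattern byte."""
    blocks = []
    remaining = total_bytes
    while remaining > 0:
        size = min(block_bytes, remaining)
        blocks.append(bytearray([pattern]) * size)
        remaining -= size
    return blocks


def count_bad_blocks(blocks: list, pattern: int = 0xA5) -> int:
    """Return how many blocks no longer contain only the pattern byte."""
    bad = 0
    for block in blocks:
        expected = bytes([pattern]) * len(block)
        if block != expected:
            bad += 1
    return bad


if __name__ == "__main__":
    # Hypothetical size: 256 MiB. Scale up toward the installed RAM with care.
    blocks = fill_blocks(256 * 1024 * 1024)
    print("bad blocks:", count_bad_blocks(blocks))
```

Running the verify step repeatedly while the system is under load gives the memory more chances to show a corrupted block.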

So in summary, RAPL might save power, but it could corrupt your data.  I will be disabling this feature on these systems and will continue monitoring their performance.

5 thoughts on “Quanta Winterfell RAM stability and RAPL Mode”

    1. Best I could tell from the mechanical drawings is that the provided connector ends up slotting into two large metal bars in the chassis connected to a large power and ground plane. This didn’t seem practical for my needs, so I removed them and soldered extension wires and connected them to the server PSUs. Thanks to Bitcoin mining, it’s easy to get details on modifying a server supply to turn on and which ones are easy to use.

      1. Hi,

        Thanks for your reply. I finally got one and am using a 30A power supply with 4mm single core wires connected directly to the motherboard, and it works great.

        Just another question though. Have you been able to set up the remote access or console to this machine?

        Thanks
        Kamal

    1. I put them on the shelf for now. I had one running as a FreeNAS backup server, and that got replaced with a more reliable Dell R620.
      I actually ran into another issue: the chassis has a hot swap controller whose power limit is below the maximum the system might request. When the limit is exceeded, the system powers off immediately. There is a jumper near the back to set it to a slightly higher limit, but it’s still well below what a fully loaded CPU/RAM configuration might draw.
      At some point I might pull one off the shelf and play with it some more, but I’ve moved on to other things right now.
