Quanta Winterfell RAM stability and RAPL Mode

As I previously posted, I got my hands on a few inexpensive Quanta Winterfell compute nodes.  The systems have been running well except for occasional RAM errors that I’ve been trying to debug.

I’ve been using PC3-10600R of 2GB, 4GB, and 8GB sizes, single and dual rank.  The system seemed more stable when RAM was only inserted into the first slot of each bank (with the white tabs), but produced more errors once the second slots were populated.  This wasn’t 100% consistent either, though.

In FreeNAS this would appear as read errors on the console.  Under VMware it would appear as a complete system lockup and/or pink screen of death with a message about memory errors.

Both systems would run more reliably if I were to force the RAM speed to 1066MHz (instead of the default 1333MHz), but this of course reduced system performance and also wasn’t 100% consistent (some RAM combinations would still cause failures).
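To put a rough number on that performance cost, here is a small sketch of the theoretical peak bandwidth per memory channel at each speed. It assumes the standard 64-bit DDR3 data bus and treats the BIOS "MHz" figures as transfer rates (1333 and 1066 MT/s); the function name is just for illustration, and real workloads will not necessarily slow down by the same ratio.

```python
# Theoretical peak bandwidth for one 64-bit DDR3 channel.
BUS_WIDTH_BYTES = 8  # 64-bit data bus per channel (standard for DDR3)


def peak_bandwidth_gbs(transfer_rate_mts: float) -> float:
    """Return peak bandwidth in GB/s (10^9 bytes/s) for one channel."""
    return transfer_rate_mts * 1e6 * BUS_WIDTH_BYTES / 1e9


fast = peak_bandwidth_gbs(1333.33)  # DDR3-1333, sold as PC3-10600
slow = peak_bandwidth_gbs(1066.67)  # DDR3-1066, sold as PC3-8500
loss_pct = (1 - slow / fast) * 100

print(f"{fast:.2f} GB/s vs {slow:.2f} GB/s per channel, "
      f"{loss_pct:.0f}% lower peak")
```

This is only the ceiling per channel; it says nothing about latency or how memory-bound a given workload is.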

The BIOS is more of a developer-level BIOS, which means there are lots of settings available to tune all sorts of parameters, including changing RAM voltage and timing.  At first I tried increasing the timing values, thinking maybe there wasn’t enough timing margin.  This really didn’t help, so then I reduced the clock rate, which improved things but was still flaky.

After some further testing I believe I’ve found a solution in the BIOS that results in stable performance.  Longer-term testing will be necessary, but initial results look good.

There is a setting called RAPL Mode.  RAPL stands for Running Average Power Limit, which essentially is a way to save power by limiting how much power the RAM gets to use.  There are different modes, which appear to be different algorithms for determining how much power the RAM should be using.  Each successive mode appears to be a newer algorithm which is supposed to result in better performance.  By default the system is set to Mode 1.  My best guess is that it is poorly implemented on these systems, which results in the RAM being starved of power and then getting corrupted.
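To illustrate the general idea behind a running average power limit, here is a toy sketch. It is not the algorithm this BIOS or Intel's memory controller uses (I don't know how the individual modes differ); the class name, the 10 W limit, the window size, and the sample values are all made up for illustration. The point is only that the limit applies to an average over a window, so a brief spike is tolerated while sustained draw triggers throttling.

```python
from collections import deque


class RunningAverageLimiter:
    """Toy model: compare the average of recent power samples to a limit."""

    def __init__(self, limit_watts: float, window: int):
        self.limit_watts = limit_watts
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def record(self, watts: float) -> bool:
        """Add a power sample; return True if throttling would kick in."""
        self.samples.append(watts)
        average = sum(self.samples) / len(self.samples)
        return average > self.limit_watts


# Hypothetical numbers: 10 W limit averaged over the last 4 samples.
limiter = RunningAverageLimiter(limit_watts=10.0, window=4)
for watts in [6.0, 8.0, 14.0, 16.0, 5.0]:
    print(f"{watts:5.1f} W -> throttle: {limiter.record(watts)}")
```

In this model the first 14 W sample does not trigger throttling because the windowed average is still under the limit, and the final 5 W sample still does because the earlier spikes remain in the window.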

On one system I have 56GB of RAM and am running FreeNAS.  In this case I could adjust the RAM speed to 1333MHz and use RAPL Mode 0.  To test I copied data to the disk, which fills the ARC.  After 24hrs there have been no reported errors on the console (previously they would typically start within a few minutes at 1333MHz).

On another system I have 112GB of RAM and am running VMware ESXi.  With RAPL Mode 0 and 1333MHz the system hung immediately after a VM was migrated to it.  With RAPL disabled and 1333MHz the system has been up with no issues.  To test, I ran a 3D EM field solver that consumes 62GB of RAM when active on a task for about 45 minutes.
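For anyone without a large workload handy, a cruder check along the same lines is to fill a big buffer with a known pattern and read it back. The sketch below is only an assumption about how one might do that from user space, not what I ran; the function name and sizes are hypothetical. It cannot control which physical DIMMs get touched and CPU caches may serve some of the reads, so it is no replacement for a dedicated memory tester.

```python
import os


def pattern_check(size_mib: int, passes: int = 2) -> int:
    """Write a random 1 MiB pattern size_mib times, read it back,
    and return the number of 1 MiB blocks that did not match."""
    block = os.urandom(1024 * 1024)  # random 1 MiB reference pattern
    failures = 0
    for _ in range(passes):
        buf = bytearray()
        for _ in range(size_mib):  # write phase
            buf += block
        for i in range(size_mib):  # read-back phase
            start = i * len(block)
            if buf[start:start + len(block)] != block:
                failures += 1
    return failures


if __name__ == "__main__":
    # 64 MiB is only a smoke test; raise size_mib to cover more RAM.
    print("mismatched blocks:", pattern_check(size_mib=64))
```

A result of zero only means no corruption was observed in that run; intermittent errors like the ones described here may need hours of load to show up.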

So in summary, RAPL might save power, but it could corrupt your data.  I will be disabling this feature on these systems and continuing to monitor their performance.

7 thoughts on “Quanta Winterfell RAM stability and RAPL Mode”

    1. Best I could tell from the mechanical drawings is that the provided connector ends up slotting into two large metal bars in the chassis connected to a large power and ground plane. This didn’t seem practical for my needs, so I removed them and soldered extension wires and connected them to the server PSUs. Thanks to Bitcoin mining, it’s easy to get details on modifying a server supply to turn on and which ones are easy to use.

      1. Hi,

        Thanks for your reply. I finally got one and am using a 30A power supply with 4mm single core wires directly to the motherboard and it works great.

        Just another question though. Have you been able to set up the remote access or console to this machine?


    1. I put them on the shelf for now. I had one running as a FreeNAS backup server, and that got replaced with a more reliable Dell R620.
      I actually ran into another issue – the chassis has a hot swap controller that is power limited below what the max system might request. In this case the system will power off immediately. There is a jumper near the back to set it to a slightly higher limit, but it’s still well below what a fully loaded CPU/RAM configuration might be.
      At some point I might pull one off the shelf and play with it some more, but I’ve moved on to other things right now.

  1. Hi, I’m very broke sadly, just getting out of debt. Soon I will be buying 1 or 2 nodes, or maybe just the boards. I want them just for 2D animation, rendering, and ZBrush. I was looking at these Quantas and hoping to buy 2 E5-2690v2s and put one in each node. I have RAM and hard drives. I was also looking at the 4-node Supermicro FatTwin server. Can you help me make the right moves? I would greatly appreciate it. I would like to buy two more CPUs later to add to the boards. What I have learned is that QPI decreases the performance of each CPU when they are working together, so I would render videos faster if I split the workload in half across two separate CPUs.

    1. I don’t have experience with the FatTwins, but if you’re budget constrained I don’t think it’s the best bang for your buck. After working with these Winterfell nodes I’ve found that they feel more like pre-production quality. They do work under typical circumstances, but they are in no way as reliable as a standard server. Because of the environment they were designed for (purpose-built for a specific datacenter) I feel like they are at least one board rev away from being ready for real public consumption.
      Depending on how the FatTwins were designed, they might also not support the full power capabilities of the E5-2690v2. When pushing one of my nodes on a processor-intensive task I hit the power consumption limit of the board’s hotswap controller, which immediately cuts power to the system. I think the limit on 12V was around 40A on my board.
      Also comparing pricing on eBay, I think you’d get much more value from a set of used Dell R620 (or maybe R720) servers. I’ve actually settled on mostly R620s for my production needs. They cost less, don’t have weird quirks like the one mentioned in this article, and are a more full-featured server (but lacking the unique factor).
