All posts by Plocmstart

Quanta Winterfell RAM stability and RAPL Mode

As I previously posted, I got my hands on a few inexpensive Quanta Winterfell compute nodes.  The systems have been running well except for occasional RAM errors that I’ve been trying to debug.

I’ve been using PC3-10600R DIMMs in 2GB, 4GB, and 8GB sizes, both single and dual rank.  The system seemed more stable when RAM was only inserted into the first slot of each bank (the slots with the white tabs), but produced more errors once the second slots were populated.  Even that wasn’t 100% consistent, though.

In FreeNAS these errors would appear as read errors on the console.  Under VMware they would appear as a complete system lockup and/or a pink screen of death with a message about memory errors.

Both systems would run more reliably if I forced the RAM speed down to 1066MHz (instead of the default 1333MHz), but this of course reduced system performance, and even that wasn’t 100% consistent (some RAM combinations would still cause failures).

The BIOS is more of a developer-level BIOS, which means there are lots of settings available to tune all sorts of parameters, including RAM voltage and timing.  At first I tried increasing the timing values, thinking maybe there wasn’t enough timing margin.  That really didn’t help, so I then reduced the clock rate, which improved things but was still flaky.

After some further testing I believe I’ve found a solution in the BIOS that results in stable performance.  Longer-term testing will be necessary, but initial results look good.

There is a setting called RAPL Mode.  RAPL stands for Running Average Power Limit, which is essentially a way to save power by limiting how much power the RAM gets to use.  There are different modes, which appear to be different algorithms for determining how much power the RAM should be using.  My best guess is that RAPL is poorly implemented on these systems, resulting in the RAM being starved of power and the data getting corrupted.  Each successive mode is a newer algorithm that is supposed to result in better performance.  By default the system is set to Mode 1.
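
For the curious, on a Linux box you can watch what RAPL thinks a power domain is drawing through the powercap interface.  This is only a sketch under the assumption that the platform exposes an intel-rapl domain in sysfs (the exact path varies by system, and these nodes may not expose one at all):

```python
import time

# Illustrative path; real systems expose intel-rapl:<n> domains (with a
# "name" file identifying e.g. the dram domain) under /sys/class/powercap.
RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"

def read_energy_uj(path=RAPL_ENERGY):
    """Read the cumulative energy counter, in microjoules."""
    with open(path) as f:
        return int(f.read())

def avg_power_watts(e_start_uj, e_end_uj, seconds, max_energy_uj=2**32):
    """Average power between two counter samples, handling one wraparound."""
    delta_uj = e_end_uj - e_start_uj
    if delta_uj < 0:           # the counter wrapped between samples
        delta_uj += max_energy_uj
    return delta_uj / 1e6 / seconds

def sample_power(seconds=1.0):
    """Sample the counter twice and return the average draw in watts."""
    e1 = read_energy_uj()
    time.sleep(seconds)
    e2 = read_energy_uj()
    return avg_power_watts(e1, e2, seconds)
```

On a system with the powercap driver loaded, `sample_power()` reports the domain’s average draw; whether that number can be trusted on these boards is exactly the question.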

On one system I have 56GB of RAM and am running FreeNAS.  In this case I could set the RAM speed to 1333MHz and use RAPL Mode 0.  To test, I copied data to the disk, which fills the ARC.  After 24hrs there have been no reported errors on the console (previously, errors would typically start within a few minutes at 1333MHz).

On another system I have 112GB of RAM and am running VMware ESXi.  At RAPL Mode 0 and 1333MHz the system hung immediately after a VM was migrated to it.  With RAPL disabled and 1333MHz the system has been up with no issues.  To test, I ran a 3D EM field solver that consumes 62GB of RAM on a task that ran for about 45 minutes.
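
If you want a quick first pass before committing to a full memtest86+ run, a crude userspace pattern test can at least flush out gross corruption.  This is only a sketch, not the method I used (the OS and CPU caches get in the way of a real memory test):

```python
def pattern_check(size_bytes=64 * 1024 * 1024,
                  patterns=(0x00, 0xFF, 0xA5, 0x5A)):
    """Fill a buffer with each pattern, read it back, and count mismatches.
    Returns 0 on a healthy pass."""
    errors = 0
    for p in patterns:
        buf = bytearray([p]) * size_bytes        # write the pattern
        errors += sum(1 for b in buf if b != p)  # verify every byte
        del buf                                  # release before next pass
    return errors
```

Running several of these in parallel, sized to fill most of free RAM, makes it slightly more representative, but a dedicated boot-time tester is still the real tool for this job.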

So in summary: RAPL might save power, but it could corrupt your data.  I will be disabling this feature on these systems and will continue monitoring their performance.

Quanta Winterfell FreeNAS Server

I recently acquired what is known as a Quanta Winterfell Open Compute blade.  Quanta seems to make a number of OEM solutions for large companies.  In this case, Open Compute is a standard for designing no-frills high-density server systems, utilized at least by Facebook.  So what I have here could have been processing my posts, likes, etc.

Quanta Winterfell with “cover” off

No-frills means you get a very basic chassis, which technically doesn’t even have a front, or completely enclose the system!  It looks strange, but when it’s sitting in a datacenter by itself, who is going to care as long as it’s doing its job?

The blade itself takes a 12V input (actually 12.5V nominal) through a large spade connector.  Since I didn’t have the mating part, for now I just jammed some large-gauge wire into the spades and powered it from one of my Agilent system DC power supplies, which also let me monitor current consumption.

Horrible connection for testing (don’t try this at home).
With 2GB RAM, it idles around 50W.

The barebones blade was $90 (free shipping), which included the heatsinks and a 10Gb SFP+ NIC.  The NIC alone runs about $50, so it wasn’t a bad deal overall.

All that was left to add was a CPU, RAM, and a video card.  The system can output the console over a built-in serial port and I believe serial over LAN, but for ease of bringup I opted for the video card for now.

I wanted to see if FreeNAS would boot, so I plugged a bootable USB key into a hidden USB port (there are only 2 total), and used the other port for a keyboard.  By default the hidden USB port is disabled, so after enabling this in the BIOS it booted right up!

FreeNAS boots.

It is actually very quiet too, at least when not heavily loaded.  The fans are large, so there isn’t any of that loud datacenter whirring you would associate with this kind of hardware.

Next steps:

  • Replace Agilent supply with HP server supply (modify connector)
  • Add more RAM
  • Test external SAS card and hard drive shelf
  • Get 10Gb adapter up
  • Investigate headless boot (remove video card)
  • Order second CPU

Converting a NetApp DS4243 drive shelf into a vendor-generic JBOD array

NetApp makes some nice hardware that you can occasionally find for a low price on eBay.  Unfortunately, it is typically hard to reuse, since NetApp tends to require specific firmware on the hard drives in its drive shelves.  You are then locked into their harder-to-find, higher-priced drives.

With a bit of experimenting, I found a method to get around this for at least one family of hardware.

NetApp’s DS4243 is a 24-bay SAS 6Gbps drive shelf.  It is typically configured with a pair of power supplies (it can support up to four) and two IOM3 modules (which only support 3Gbps, but other versions exist).  I managed to pick one of these up off eBay for just under $100 with the pair of supplies and IOM3 modules mentioned.

Note I didn’t even try to use the IOM3 modules.  There might be other ways around the limitations I read about online, but I found a simple and inexpensive option that allows the disk shelf to be used as a generic JBOD array.

I also had a Dell Compellent HB-1235 12-bay SAS 6Gbps drive shelf.  This drive shelf comes with a pair of much-longer-named controllers (HB-SBB2-E601-COMP) that already present the shelf as a JBOD array to FreeNAS.  It turns out these were manufactured by a company called Xyratex, which just happens to also manufacture the NetApp DS4243.

So what would the chance be that a Dell controller would work in the Netapp drive shelf?

I did some research, and the form and fit of the controllers matched perfectly.  The connectors are identical and placed in the same locations.

Front of the modules.
Rear of the modules, showing identical connector types and placement.

Now there is a chance that the pinout could have changed, the power rails could be different, or some other issue might exist, since these parts were never specified to be connected together, but I was willing to take that chance for the sake of research.  Designing hardware in a similar industry, I bet that they were at least close enough not to blow anything up.  The only question for me was how well it would work.

So all there was left to do was to plug it in and power it up!

Status is green and SAS link lights all good to go!
Drives powered up and show activity.
The NetApp drive shelf even identifies properly!

I was curious if possibly the HB-1235 controller would only see half of the NetApp drive shelf, since it was specifically used in a 12-bay drive shelf.  I purposely inserted a 500GB drive in bay 24 to test if it would work, and it identified properly with no issues at all.

So they identified, but would there be any stability issues?  To get at least a first-order estimate, I copied roughly 250GB of data over a 1Gb link to the array of 6x 3TB drives with no issues.  After the copy was complete, a scrub of the volume was also successful.
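
Here the scrub is ZFS doing the verification; on a non-ZFS setup, a checksum-verified copy gives a similar first-order answer.  A minimal sketch (the paths you would pass in are hypothetical):

```python
import hashlib
import shutil

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large files needn't fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digest.update(block)
    return digest.hexdigest()

def copy_and_verify(src, dst):
    """Copy src to dst and confirm both ends hash identically."""
    shutil.copyfile(src, dst)
    return sha256_of(src) == sha256_of(dst)
```

A mismatch doesn’t tell you which link in the chain (RAM, HBA, expander, drive) flipped the bits, but a clean run over a few hundred gigabytes is reasonable evidence the path is healthy.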

The HB-1235 with two modules and two supplies cost me $120, and the NetApp was around $80.  Each unit only needs one module to run (though the HB-1235 seems to want to run the power supply fans on high when only one module is inserted).  A spare module runs about $50, so on a good day on eBay you can have a full 24-bay generic disk shelf for less than $200.

Dell T620 Power Interface Board that won’t power up

I recently acquired a Dell T620 chassis that included everything except a motherboard and power supply.  I had an extra motherboard already, so I installed it only to find that it would not turn on.  The 12V_AUX LED on the motherboard would light up, but when pressing the power button it wouldn’t do anything.

I started debugging by swapping components from another chassis I had, at first thinking it was the front panel, the switch, or maybe a bad cable.  It turned out to be the power interface board (PIB) itself.

The board appeared to be in good shape, with no obvious scratches or parts missing.

It was time to get out the microscope and do a closer inspection.  Since there aren’t a lot of parts on the board, it didn’t take long to find a suspect: one of the parts appeared to have a solder bridge between two of its pins (6 and 7).

A closer look at the pins:

All it took was removing this solder bridge, and the system then powered up without any further problems.

I have no clue how this board would have ever worked in this state, so I’m not sure how it even made it out of Dell’s factory.  It didn’t appear to have been reworked, so this is a very strange escape.

Either way, I was able to rather quickly find the issue and fix it, saving the need to purchase a replacement.

Building an Outdoor Gate Sensor

With wireless door sensors becoming inexpensive, it would be nice to have one that can survive the elements.  I found one that can accept an external dry-contact sensor, but it was still only rated for indoor use.  To deal with this, I built an outdoor enclosure to house the wireless electronics, and used an external magnetic reed sensor on the gate.

For the housing, Lowes carries an outdoor-rated plastic box by Taymac.  The box comes with three openings and a few fittings and covers.  A multi-size pack of cable glands came from Amazon.

In my case, I didn’t want to use any of the included holes, and instead wanted to use a separate gland.  The covers have slots for flathead screwdrivers, so a gland wouldn’t make a good seal against them.  Given the Taymac boxes only come with two covers, I also needed a third cover (also found at Lowes).

I also used a piece of un-slit corrugated tubing to cover the sensor wires.

I then drilled a hole to fit the gland.  

Unfortunately the mounting end of the gland was pretty short and couldn’t be secured with the locking ring, so instead I used epoxy to secure it in place.

The sensor is mounted by drilling a hole through the board close to the gate hinge.  Near the hinge, the gate barely swings relative to the post, so there’s no need to worry about wind causing false triggers.
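
The mechanical placement does the real debouncing here, but if you were rolling your own reed-switch reader (say, on a microcontroller) instead of using an off-the-shelf transmitter, the same idea can be done in software: ignore a change until the raw input has been stable for several consecutive samples.  A hypothetical sketch:

```python
class Debouncer:
    """Only report a state change after the raw input has held the new
    value for `stable_samples` consecutive reads."""

    def __init__(self, initial=False, stable_samples=5):
        self.state = initial
        self.stable_samples = stable_samples
        self._count = 0

    def sample(self, raw):
        if raw == self.state:
            self._count = 0          # glitch ended; reset the counter
        else:
            self._count += 1
            if self._count >= self.stable_samples:
                self.state = raw     # change has persisted; accept it
                self._count = 0
        return self.state
```

Called at, say, 100Hz, five stable samples means a contact bounce or wind-induced flicker shorter than 50ms never makes it out as an event.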

The magnet that activates the switch is mounted to the door.  There are spacers to ensure the magnet is mounted close to the switch when it is closed.

From there, the assembly can be mounted to the fence so the wiring and sealing can be completed.  The box is mounted with the gland higher than the sensor, so water won’t flow down into the box if any manages to get into the tubing.

From there, I attached the sensor to the electronics.  Then I used sealant to cover all the mating interfaces as an extra measure to make sure it is water-tight.  This includes around the tubing/gland interface as well as in the end of the tube where the wires exit.  It looks messy, but it will help seal things.

After the sealant has set, the electronics can be assembled and the door can be tested before placing the cover on the box.

Testing confirms it works well, and doesn’t falsely trigger when I shake the gate.  

Final assembly.  As you can see, it’s next to the wall of my house, so I couldn’t easily hide the box around the corner.  It will need to be accessed occasionally to change the battery.  It’s mounted behind the entrance, so no one should mess with it, though I might still tack down the cable so it isn’t hanging out as much.

My Take on Climate Change

All the news these days is about “denying science”.  Denial of vaccines, denial of climate change, denial that the earth is round!

To you, I say GREAT!  Believe whatever you want, but understand that just because you believe something doesn’t make it true.  If you woke up tomorrow and decided you didn’t believe in gravity, you wouldn’t suddenly float off into the sky.  Go ahead, try it.  I’ll even put money on it (and as an engineer, I’m not much of a betting man).

This thinking is the same as a young child’s, who believes that if they close their eyes they disappear.  Even if you close your eyes to these things, they are still there; you can’t make them go away by ignoring them!  (Note I’m not calling you a young child here, only pointing out that this simple form of thinking isn’t using your true mental abilities.)  So you can be a denier, for whatever that’s worth to you, or you can look around (regardless of your beliefs), notice that things around you are changing, and decide that what YOU DO makes a difference in this world.

Science is objective.  Science does not have feelings.  The scientific method is meant to provide a tool, a mechanism, to determine cause and effect in a way that is reproducible, so that more people can try it and find out whether they get the same result.

Now climate change is tricky, because scientists must build models based on past data, using the most objective methods they can devise.  That past data is limited to only a few decades of the earth’s very long history.  As those decades have gone along, though, the number and quality of measurements have improved.  Computational power has improved vastly, greatly improving the models, which can now include many more variables.  Yet throughout these enhancements, the results of these simulations and studies haven’t changed.  The consensus (97% of scientists from all around the world, of all nationalities, religions, and political views) agrees that climate change is real.  Good luck getting 97% of anyone else to agree on anything!

What makes for a good argument includes being able to see the other side’s point of view.  So let’s run an experiment (yes we’re still using science here, hopefully you can also see my point of view).

Let’s put ourselves in a sealed room.  Inside that room you have two ways of generating power.  There’s a solar panel that’s receiving light from a window, and a small gasoline generator.  Which one would you want to get power from: the solar panel which won’t fill the air with carbon monoxide, or the gasoline generator that will put you to sleep in a few minutes?

I bet you didn’t pull the cord on that generator, at least if you want to continue this conversation.  Now think about this – the earth (round or flat, whatever you believe) is only SO large.  It can’t get bigger, and if you believe NASA and SpaceX are actually putting people in space (I hope you do), you know it’s also very hard and expensive to get people off this planet (and even then, we’ve yet to find anywhere else to go).  Included with that earth is our atmosphere, which, like that room we were standing in a minute ago, is only so large.  Now think about that gas generator again, except now there are millions of them, plus larger ones in our cars and trucks, all around the planet, all running at the same time!  Tell me how putting all that bad stuff in the air isn’t eventually going to cause a problem.

Now on to coal.  Coal is cheap.  Coal is easy to find.  Coal is easy to remove from the earth, process, and burn for energy.  Except coal is dirty: it contains lots of additional materials that don’t burn, or that when they do, make nasty stuff that again ends up in the atmosphere.  So coal is only cheap if you say “I don’t care about all that junk in the air”.  As soon as you have to deal with the extra crap, it gets more expensive to “scrub” the output.  And the more a company has to do to keep the junk in, the less profit it makes, and the less happy it is.

Energy companies look at this and compare to other technologies and say “hey, coal was great when it’s all we had, but there’s cheaper options available now.”  To the coal miners: thank you for your service.  I’m sorry the companies you worked for didn’t set you up for the future, but when your product is no longer needed nothing is going to change that fact.  It’s time to move on.  There are other things in the world than coal, no matter what some people want to tell you.

It’s like expecting someone to still want to buy a 1992 Honda Civic (sorry if you own one).  Unless the car is nostalgic to you, you couldn’t care less about it.  There are so many newer models, with so many additional features and higher efficiency!  Coal is the 1992 Honda Civic of energy.  A few people want to hold onto it, but the world is moving on.

Other clean energy sources are getting cheaper, and others are being invented or made more efficient.  Many countries are now relying on solar, wind, and nuclear power.  I read an article yesterday that even claims that we could now efficiently use the difference in salinity where a river meets the ocean to generate energy!  If nothing else, I’ve learned that energy is everywhere!  Can we pick some sources that we can be in the same room with?  That will at least make me feel better about being on the same planet with them.

Again, I ask: why deny climate change?  How does it benefit you?  Why not say, “sure, climate change could be happening, but I’m just not going to do anything about it”?  At least then you can get behind the 97% of scientists around the world on this topic.  I see the dilemma: at this point, you’re left with either “I hate the planet” or “I hate scientists”.  Without scientists we wouldn’t know anything about this, and we could blissfully continue doing whatever we want without that lingering thought.

So screw you science!  How dare you ruin my plans for blissful life and profit!?

To this science responds, “I don’t care what you think, get on board or beware of the consequences.”

The Earthwise Wood Chipper

I had a huge pile of sticks and brush that I needed to do something with, so I decided to purchase an electric wood chipper.  Even though it had mixed reviews, I decided on the Earthwise GS70015 Chipper/Shredder.  A few reviews mentioned it not turning back on, but I figured those users might have been pushing it too hard, overheating the motor, etc.

Earthwise GS70015 Chipper/Shredder

After chipping two bins I opened up the top to clear a branch, closed it back up, and it wouldn’t turn back on. I checked the breaker, let it sit for a few minutes, and tried again but no luck. I then opened and tried securing the top, again with no luck. I noticed there was an interlock switch that makes sure the top has been secured to allow it to turn on, and given I had just opened it up before it stopped working this was highly suspect.
I decided it was worth a shot to take the interlock switch assembly apart to see if it was the issue, and it turns out this was exactly the source.  The switch itself has a cover over it to prevent liquid, dust, etc. from entering from the top, but underneath this assembly, where the actual switch sits, there is NO protection at all.
When I opened the assembly I found that after only two bins of chipped material this area was already covered with dust.  The switch is a simple sliding plunger that goes into the switch body.  Whenever the switch is actuated, it allows dust to enter the body.  If enough material gets inside (and apparently not very much is needed), the contacts get covered and can no longer make electrical contact.

Interlock switch assembly.
Interlock switch with dust on plunger.

After verifying the issue with a DMM, I bypassed this interlock switch (reducing safety but increasing functionality) and was back in business. It worked for more than 2 more hours after this with no more on/off issues.

Switch bypassed by connecting both wires to the same terminal.

This really seems like a design oversight, unless their switch vendor or design changed recently and the change wasn’t validated.  Aside from this fatal flaw (and the reduced safety of my bypass), it operated very well, and I cleared a large pile of branches rather quickly.

Dell Inspiron with Overstressed Parts

I was given a Dell Inspiron 3451 that wouldn’t turn on.  Some probing showed that power was getting onto the motherboard, but none of the switchers that power the CPU would turn on.  After some time I noticed the board was getting warm near the HDMI connector, so I began closely checking the parts in that area.

What I found was one of the switching power supply controllers had a hole in the package.  This can happen when a part overheats, or has to deal with an electrical overstress event (which also turns into overheating and package cracking/venting).

TPS51225 with a tiny hole in the package.
Circuit with broken IC.

Some more searching found a second suspect part, in this case what appears to be a diode with a pit in the package.

Package with pitting.

From what I can tell, this traces a path back toward the HDMI port itself.

Location of suspect parts.

I haven’t tested all the diodes in that path, but I’m guessing there was an electrical overstress event (ESD or surge) through the HDMI port, which decided to take a path through these parts to GND.

Is this an issue with the design? It’s hard to say for sure, but ideally the energy wouldn’t make its way into major components.  Usually there are protection diodes (possibly all the other parts in the top left) that should provide a fast path to the board’s ground plane, but in this case it didn’t turn out well.

Given the damage done, it’s probably not worth attempting to repair, since I don’t know what other parts might also be damaged and this is a low-cost laptop to begin with.  Best option in this case is to salvage parts to repair other things.

The Dell Venue that Destroys Batteries

I have a Dell Venue tablet with a removable battery.  It’s a good thing the battery is removable, because I’ve had to replace it once a year for the last two years.

Dell battery label.

Each battery is made of two LiPo cells.  Over time the cells begin to puff up.  Since there’s no room for the expanding cells, the battery pushes on the back of the LCD, causing discolored blotches on the screen.  Removing the battery makes the issue go away, but why are the batteries dying so fast?

Bulging battery.
More bulging battery.

I disassembled this battery and examined the monitoring PCB, but didn’t find anything unusual.  The on-board circuit more than likely only protects the battery from over-discharge and overheating (given the attached thermistor).

Dell battery monitor circuit.

The voltage on each cell measured 4.3V.  A LiPo cell is fully charged at 4.2V, which shows the cells are being overcharged.  Overcharging a cell stresses it and can cause it to puff up like mine have.
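
To put numbers on it: 4.3V per cell is past the commonly cited 4.2V full-charge point for LiPo chemistry.  A trivial check like the one below (the 50mV tolerance is my own allowance for meter error, not any Dell spec) flags both cells in this pack:

```python
FULL_CHARGE_V = 4.20   # typical LiPo full-charge voltage
TOLERANCE_V = 0.05     # slop for measurement error (my assumption)

def overcharged_cells(cell_voltages, limit=FULL_CHARGE_V + TOLERANCE_V):
    """Return the per-cell readings that exceed the safe charge limit."""
    return [v for v in cell_voltages if v > limit]
```

Feeding in the two readings from this pack, `overcharged_cells([4.30, 4.30])` flags both cells, while a healthy pack at or below 4.2V per cell would come back clean.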

Unfortunately the problem is with the tablet and not necessarily with the batteries or protection circuit, and that means battery 3 will eventually meet the same fate.  I don’t have a solution for this yet except for attempting to contact Dell.  Since it’s out of warranty they probably won’t care, but it’s unfortunate they’ve released a product that prematurely destroys batteries.