A DRAM Failure

bubblethink · on Oct 14, 2022

I'm in the market for a laptop with ECC memory that runs linux well. I haven't found any good options. Despite AMD not following Intel's insane ECC product segmentation, there are no AMD laptops that support ECC. On the Intel side, there are maybe 3 laptops that support ECC memory: Dell Precision 7670, HP zbook fury g9, Thinkpad P16. They are all quite similar with the major downside that the iGPU is shit for all the Intel CPUs that support ECC. Intel sells only high core count variants (HX) with ECC, which means less silicon for the iGPU. They expect "workstation" users to use a discrete GPU, which is so silly. I have no use for a discrete GPU in a laptop, especially an NVIDIA one. If any of the newer laptop companies (framework, starlabs, etc.) see this, please make a reasonable laptop with ECC memory.

adrian_b · on Oct 14, 2022

There are many Dell Precision models with ECC memory that work very well under Linux, but as you have said, all of them use desktop Intel CPUs packaged in laptop BGAs (which are branded as the HX series since Alder Lake), so they are meant to be used only with a discrete GPU (at Dell the cheapest option is an NVIDIA RTX A1000 at $200).

The H, P and U series of Intel CPUs do not support ECC, so there is no chance of laptops using them with ECC.

As a notable change from the earlier mobile Ryzen CPUs, the Ryzen 6000 series, a.k.a. Rembrandt, support ECC memory (with DDR5 SODIMM; I mean real ECC, not the internal ECC of DDR5).

Unfortunately, I am not aware of any laptop with Ryzen 6000 that takes advantage of this specification change and offers ECC. I doubt that any such laptop will be introduced. Maybe in 2023, with the next generation of mobile CPUs we might finally see some competition for Intel in the mobile workstation market segment.

I am currently using under Linux a rather old Dell Precision with ECC memory (and with an NVIDIA GPU). I would also like an upgrade for it and the current integrated GPUs with 768 FMA units from Intel Alder Lake H or AMD Rembrandt would be good enough to avoid the high power consumption and cost of a discrete GPU, but no such luck yet.

bubblethink · on Oct 14, 2022

The hp zbook has an AMD discrete GPU option, and the thinkpad P16 has an Intel ARC discrete GPU option. These will likely be better for linux than nvidia, but it's still not ideal. Dell also let you configure 7670 without a discrete GPU. This may also be OK for basic stuff, but it's hard to justify the price premium with such a mediocre GPU over a regular laptop.

acomjean · on Oct 14, 2022

Doesn't DDR5 have some ECC built in by default?

I don't think its a good as memory with ECC, but its something.

edit (looking up question...)according to wikipedia, it has some ECC, but I'm failing to see the difference between the explicitly ECC and the regular memory:

"Unlike DDR4, all DDR5 chips have on-die ECC, where errors are detected and corrected before sending data to the CPU. This, however, is not the same as true ECC memory with an extra data correction chip on the memory module. DDR5's on-die error correction is to improve reliability and to allow denser RAM chips which lowers the per-chip defect rate. There still exist non-ECC and ECC DDR5 DIMM variants; the ECC variants have extra data lines to the CPU to send error-detection data, letting the CPU detect and correct errors that occurred in transit."

https://en.wikipedia.org/wiki/DDR5_SDRAM

adrian_b · on Oct 14, 2022

The on-die ECC serves only to make DDR5 as reliable as DDR4, despite the smaller cells and higher speeds.

Real ECC must be end-to-end, from the memory controller inside the CPU to the memory chips and back to the memory controller.

This allows the CPU to be aware of any errors, which is essential e.g. for detecting the memory modules that will soon fail, so they must be replaced, like in this story, and like it also happened in my computers.

Moreover, not only the errors that happen in the memory cells are detected, but also the errors caused by electrical noise on the PCB traces or those caused by poor contacts in the DIMM sockets.

joshspankit · on Oct 15, 2022

And corrects errors from cosmic rays!

shrubble · on Oct 14, 2022

I had a server running Solaris on a Xeon processor and the 1 bank of ECC memory was bad.

Solaris evicted the memory pages from the DIMM or bank, marked it bad, and reduced the amount of memory it 'saw' without crashing or rebooting, since the remedy worked before anything bad happened.

I only noticed it because of top showing less RAM than I knew I had installed...

speed_spread · on Oct 14, 2022

I can not fathom the kind of testing rig that's required to validate that such an incredible mechanism works correctly.

sliken · on Oct 14, 2022

Seems pretty simple to me. Keep in mind it's not analyzing things to see where the error is (address line, I/O pin, which chip, etc). It's just a simple track errors per dimm, then remove the dimm. This works with Linux and several generation old Xeons with no problem. Kinda cool be be able just run free -h and see that a dimm is missing.

ThrowawayTestr · on Oct 14, 2022

LTT did a video on the testing Intel does for new CPUs. Some nice shots of the custom test rigs they use.

https://youtu.be/BtFdraQWVtM

photon-torpedo · on Oct 14, 2022

> The question was: which DIMM should we replace?

On server-class machines, ECC errors often also show up in the system event log, so one can run "ipmitool sel list" and inspect the most recent messages, and they often point to the failing DIMM in a nomenclature that corresponds to how the slots are labelled on the mainboard or in its manual.

In this case, they are using a "gaming" mainboard, so this strategy probably doesn't work (no nice system event log).

c0l0 · on Oct 14, 2022

System firmware can (but not always does) include a mapping between DIMM identifiers as exported by the Linux EDAC subsystem, and DIMM sockets on the mainboard. In absence of such a mapping, you can provide on yourself via `edac-ctl --register-labels`. Of course, someone will have to have figured out what that mapping actually is (but one can do that oneself, given a little patience) first :)

justsomehnguy · on Oct 14, 2022

Most modern system (since 2014-2016?) supports WHEA, which allows the OS to get notification and write it to the OS system log.

Not sure if this would be seen in dmesg.

Firefishy · on Oct 14, 2022

FYI: https://github.com/mchehab/rasdaemon (replaced mcelog) is the daemon for watching for these ECC and other kernel reported "Reliability, Availability and Serviceability" errors.

rasdaemon also attempts to report which physical DIMM / slot triggered the ECC error.

rroot · on Oct 14, 2022

I wonder if there's an bug/scandal unfolding: https://news.ycombinator.com/item?id=33148984

Fendii · on Oct 14, 2022

I'm totally lost on why you would think that?

Memory breaks.

I had my share of broken memory too. In data centers you see this regularly.

Those are just two different stories. Nothing really connection them to make a scandal out of this

KirillPanov · on Oct 14, 2022

... and the Nord Stream pipeline just spontaneously imploded due to natural wear and tear.

Gordonjcp · on Oct 14, 2022

What's that got to do with old parts failing?

_lqaf · on Oct 14, 2022

Think of it is paranoia-pareidolia. People see connections that aren't there and build conspiracy theories on top.

Gordonjcp · on Oct 14, 2022

Oh is that like how they switched on 5G on the mast down the road from me and s couple of days later the propshaft UJs in my old Range Rover have developed a squeak and vibration? It must be the 5G and not the 130,000 miles on the clock, right? Got to be the 5G!

NickRandom · on Oct 14, 2022

Or alternately - people like them spew hot-takes in the desperate hope for an upvote or two.

Fendii · on Oct 14, 2022

There are probably billion dram chips in the world.

And all of them need to be so perfect that a billion single cells are refreshed, read and written every few ns.

It's very normal that sometimes a ram module breaks.

That's the reason why ECC exist.

ECC also exists btw for bit flip from space radiation.

1970-01-01 · on Oct 14, 2022

Didn't Linus S. build this system? https://youtu.be/Kua9cY8q_EI?t=396

"Good enough" is what he was told. It sure was, until it wasn't.

PeterStuer · on Oct 14, 2022

I just had a (Ryzen) Win10 machine with what turned out to be a faulty (non-ECC) DDR4 module fail. The user complained of random reboots every few days. It took me weeks to find the exact cause. Nothing in the logs was directly informative, and the machine passed all tests. Only a prolonged Memtest86 run found the problem.

ksec · on Oct 14, 2022

> It took me weeks to find the exact cause

I guess people are too young to know this? But it was quite common in the old days to first check on faulty memory. And then it would be PSU, and Motherboard capacitor. Both are increasingly rare given how much improvement we have made over the past two decade. But faulty memory is still a thing.

PeterStuer · on Oct 14, 2022

As usual reality is murky.

Hardware checks including shorter memory tests were done, but this computer could not be on overnight. The prolonged test could be run when the user was out for a day.

This wasn't a corporate machine but a private one. As always several OS software configuration and driver issues were incrementally discovered and rectified, and as the crashes only happened very infrequently there was always the (wrong) thought of 'this might have been the problem', only to be refuted several days to a week later.

I'm not claiming I couldn't have done a better job on this one. Low frequency errors without a clear trace will always pose a challenge.

Terr_ · on Oct 14, 2022

Yeah, even today, the first thing I do with new memory is subject it to a few overnight tests when I'm not using the computer.

The testing doesn't need much human interaction or babysitting, and if a problem is found it's usually the kind where you want to know ASAP so you can do an RMA/refund in a timely fashion.

MayeulC · on Oct 14, 2022

You can also "fix" it by ignoring that memory area by adjusting kernel parameters. That's how I do it on one of my old laptops.

I wonder if the kernel couldn't perform ECC itself at a small performance cost by storing CRC bits in padding bits (that would require cooperation from the compiler), or other memory areas, checking them on load, and recomputing them before store operations. It may be easier to achieve on RISC architectures where load and stores are explicit.

I read a paper that seemed related to that some time ago, probably on an embedded microcontroller; the performance penalty was about 20% (you just need a few extra memory operations, performing ECC computation is pretty cheap, and is done 100% within the cache).

Alternatively, the kernel could checksum memory areas, and warn/panic if they change without having written to them. As a bonus, it could help protect against rowhammer attacks. Performance cost might be less.

Of course, ideally, ECC should be a standard feature, as it is on FLASH devices... CPUs should also support it even without explicit support from DRAM... Implementing the mechanisms I described above in hardware should be relatively easy.

kabdib · on Oct 14, 2022

... fond old memories of replacing a dud 4K DRAM on a Vax 11/780, all on my own (the company I was working for was too cheap to pay for a DEC service contract).

Unfond, more recent memories of being asked by the Dell support tech to remove all the DIMMs in the server and start putting them back, one by one.

48 DIMMs. 15 minute BIOS boot time. Didn't even bother doing the math.

It was a better use of our time and money to mention "contractual four hour response time" and "We're about to send email to our corporate lawyer, do you want to reconsider your support response?"

udev · on Oct 14, 2022

The 48 DIMMs problem is not so bad...

Remember those problems where you are given n gold coins (among which one is fake) and a balance, and the task is to find the minimal number of weighting operations to identify the fake coin.

So, you go binary search looking for the borked DIMM, load 24 DIMMs see if it crashes, if no crash -> the borked DIMM is in the other pile, rinse and repeat...

1970-01-01 · on Oct 14, 2022

Since RAM failures are related to solar events, could we log memory failure info and send the data (anonymously) to collectively operate a cosmic ray detector?

jerf · on Oct 14, 2022

It is possible for data sources to be so noisy in such unpredictable and pathological ways that there is no practical way to extract signal. I expect this would be one of those cases.

As a simple for instance, out of what would be thousands of interacting and correlated issues, if some new line of RAM is put out that has more errors than before, your detector would have a hard time not seeing that as an increase in cosmic rays. And the real world deals out correlations that are very difficult to deal with. They could creep out slowly, or, AWS might order a whackload of these over the course of weeks and turn them all on at Tuesday at 9am. Trying to filter all those out is not necessarily mathematically possible.

sliken · on Oct 14, 2022

Dunno. I've heard that phone accelometers can help detect earth quakes early to allow earlier warnings. Certainly that's a noisier signal than ECC errors.

Just aggregate the memory errors so you get a baseline for N machines and if that baseline goes to 10N then you likely had something interesting happen ... hopefully not WW3.

jerf · on Oct 17, 2022

"such unpredictable and pathological ways" is not an extraneous phrase. This is simply a mathematical fact; the perversity of the universe is greater than your mathematical toolkit can handle. There is no guarantee that you can extract signal from a stream, even if you "know" it must be there in some sense.

This is one of the great machine learning delusions.

sliken · on Oct 18, 2022

Right, but ECC errors are rare. If your base rate is 20 per hour for 100,000 devices, and you see 1,000,000 there's likely something going on.

My example was just to show that even MUCH noisier dataset (like phone motion) have been useful for detecting earthquakes that ECC should be MUCH easier.

coryrc · on Oct 14, 2022

Yes-ish, sort of already going. I read an article recently about unknown Firefox crashes being correlated to solar storms. Can't find the article now though; sorry.

metadat · on Oct 14, 2022

> gives us some information about the DIMMs, but unfortunately with no overlap with the information about the ranks. We eventually resorted to pulling a DIMM out and checking which ranks were still present when doing the grep above

Did they pull out the sticks of memory while the system was running? Can you remove and re-insert a stick and have any chance of the system continuing to operate?

I can easily imagine chips getting fried this way, unless they're specifically designed to handle such cases.

Nextgrid · on Oct 15, 2022

DIMM connectors don’t seem to guard against shorts during insertion (it’s very hard to insert a DIMM perfectly straight, usually one side goes first and the misalignment could temporarily short all the pins to their respective neighbours), so it would likely end up in major hardware damage.

accountofme · on Oct 16, 2022

Ram is not hot swappable. Never has been.

dis-sys · on Oct 14, 2022

interesting, so they are actually using Ryzen with ECC RAM (when most people would be using Ryzen with non-ECC RAM) and that saved them from some seriously corrupted data written back to their persistent storage.

wondering is it common for people to specifically monitor their system log for correctable error related messages, do they consider the memory is faulty when there are correctable errors?

photon-torpedo · on Oct 14, 2022

> do they consider the memory is faulty when there are correctable errors?

It depends on the frequency. Occasional CEs are somewhat expected (on a large enough scale) and one can live with them, after all that's what ECC is for. When CEs start happening frequently on one machine, most likely a DIMM is going bad and will worsen over time, so one should replace it.

dis-sys · on Oct 14, 2022

thanks for the info. this is exactly what I am doing. it does provide extra peace in mind knowing that my odds of having silent corruption is further reduced by doing such monitoring.

c0l0 · on Oct 14, 2022

Anyone who uses ECC DIMMs definitely MUST monitor what the memory controllers report to make optimal use of it.

However, you can also set a policy what the Linux kernel will/should do on its own when an ECC error condition has been detected: The `edac_core` module has options such as `edac_mc_panic_on_ue`, which, if set, will trigger a kernel panic upon detecting an Uncorrectable Error in system memory. Depending on your use case, this can be better or worse than just logging it.

account42 · on Oct 14, 2022

I do regularly look at dmesg on my Ryzen Threadripper system with ECC RAM.

Random correctable errors are rare but they do happen - at least if you overclock your RAM ("gaming" RAM often is already pre-overclocked). Might just be confirmation bias but I noticed ECC errors and then later heard there was a solar flare around the time.

I also replaced a DIMM that was starting to get more frequent ECC errors once. As OP found the mapping for consumer boards requires to some trial and error - my motherboard documentation even had a table but the numbering was different from the one used in Linux :/

I don't think I'm ever going to use a non-ECC desktop again, the additional cost is not that high for the extra safety against silent corruption.

c0l0 · on Oct 14, 2022

I have a script to relay new dmesg events into my (xfce) desktop session using libnotify. I figure others may find it useful, too:

https://paste.debian.net/1257030/

It gets started via xdg autostart here, and will tell me about new "stuff" that happens. For it to work, your user will have to have permission to read the kernel event log/debug ringbuffer. I achieve that by setting the appropriate sysctl:

    kernel.dmesg_restrict = 0

account42 · on Oct 14, 2022

I just keep `dmesg -w` running in a terminal window :)

dis-sys · on Oct 14, 2022

> I don't think I'm ever going to use a non-ECC desktop again, the additional cost is not that high for the extra safety against silent corruption.

same here, but sadly you don't get to choose what you get when purchasing laptops, it is simply impossible to get ECC ram if you run mbp.

vladvasiliu · on Oct 14, 2022

Even in PC-land, I don't think there's much choice when it comes to ECC in laptops.

I can only remember one model of Lenovo that had an option to have a Xeon CPU with ECC RAM. I've never seen one with an AMD CPU.

adrian_b · on Oct 14, 2022

There are many Dell, HP and Lenovo laptops with ECC memory, but all are very expensive (e.g. $2500 ... $7500 in a usable configuration, even if the prices may start a little under $2000, but in a useless configuration).

When browsing their Web sites, these models are not obvious, because they are in the section for "enterprise" laptops, listed under "mobile workstations".

Fendii · on Oct 14, 2022

For me as a nerd, yes.

Zfs based Nas with ECC, smart check for HDD, system check including ECC too.

Semaphor · on Oct 14, 2022

> (when most people would be using Ryzen with non-ECC RAM)

Is this true for servers? If I had a Ryzen based server, I’d use ECC RAM.

account42 · on Oct 14, 2022

I think it mainly applies to non-server systems where a) most people don't even know about ECC and b) non-server Ryzens can only use UDIMMs but there are not that many ECC UDIMMs available (probably just because of low demand) so you probably need to make some tradeoffs like paying more (more than +15% markup for the 9th bit) and won't have as fast chips available at the high end.

I think it is also not required for consumer Ryzen mainboards to support ECC but at least for the high end ones many do.

adrian_b · on Oct 14, 2022

Because only some Ryzen motherboards support ECC, one must always read carefully the technical specifications before buying a motherboard.

There are many ASUS and ASRock AM5 (and AM4) motherboards that support ECC, and for those it is typically writen in the memory section "supports ECC & Non-ECC unbuffered DIMMs".

When nothing like this is written, then ECC is not supported.

Moreover, all the motherboards with ECC support must have in the "Advanced" BIOS Setup an option for enabling ECC, which must be used, because the default is always to disable ECC.

With the Ryzen 7000 series there is an improvement over the previous Ryzen series, because in their specification it is written clearly that ECC is supported. Previously, the ECC support was not explicit, even if, unlike Intel they did not disable ECC, so you could hope that it works fine.

Now Intel no longer disables ECC in many Raptor Lake and Alder Lake desktop CPUs, but the motherboards with ECC support for Intel are much harder to find (because they must use a special workstation chipset, while for AMD it is enough to add the PCB traces for the ECC bits).

simoncion · on Oct 14, 2022

> Moreover, all the motherboards with ECC support must have in the "Advanced" BIOS Setup an option for enabling ECC, which must be used, because the default is always to disable ECC.

On both of my Ryzen ASUS motherboards (WS X570-ACE, and ROG STRIX X399-E GAMING) this is not true. I just slapped the DIMMs in there and powered the box up.

dmidecode thinks that the system has ECC enabled:

  dmidecode --type memory | grep -e "Error Correction"
   Error Correction Type: Multi-bit ECC

'amd64_edac' doesn't complain about being loaded on a non-ECC system.

The closed-source version of memtest86 reports that it's running on an ECC-enabled system.

adrian_b · on Oct 14, 2022

This may depend on the BIOS version, even on the same motherboard.

I also have the same ASUS Pro WS X570-ACE (bought in Q4 2019), which I use with ECC DIMMs, and I had to enable in BIOS the support for ECC.

In any case, one should always check for such an option in the BIOS, to avoid surprises.

iszomer · on Oct 14, 2022

I'm surprised no one has mentioned yet what could possibly be causing ECC errors or the trivial troubleshooting pathways you can take before replacing a DIMM or entire banks.

aaaaaaaaaaab · on Oct 14, 2022

What does uncorrectable error mean in RAM? I’ve read somewhere that they use Hamming codes for ECC RAM. Isn’t it the case that too many errors (more than floor((d - 1) / 2)) simply result in an incorrect codeword? Or do they know the error locations a priori? (i.e. erasure coding)

derefr · on Oct 14, 2022

> What does uncorrectable error mean in RAM?

First, let's define a correctable error — that means an error where you have enough additional information (in your Hamming-code stream, in a parity bit, whatever) to repair the error, e.g. a one-bit error when you're using ECC RAM.

An uncorrectable error, then, is one where you do have enough information to detect the corruption, but not enough information to figure out the correct repair for the corruption. (The amount of information required to detect corruption is always less than the amount of information required to correct it.)

With ECC RAM, usually exactly two bit-flips will produce a detected, but uncorrectable, error on read-back.

With non-ECC but "with a parity-bit per word" RAM (which exists, and is a bit cheaper than ECC RAM), you can't correct anything, only detect. (Which is sometimes all you need, if you're willing to do the calculation over again.)

All that being said: completely separately from these hardware-level features, some operating systems (e.g. macOS) compress memory, and generate page-level checksums of memory pages as they're compressed. A checksum failure during memory-page decompression can also trigger the kernel to throw this kind of "uncorrectable error" itself.

There may also exist (i.e. it wouldn't be impossible for there to be) RAM modules that continuously calculate page checksums for each page on each write; and then check the contents of pages against these, perhaps asynchronously, in sort of the same way a ZFS scrub works. I've never heard of this being done, but it feels like the sort of thing you'd implement in hardware for an extremely "ruggedized" system like a Mars rover. If this approach were to be implemented, it would also emit "uncorrectable" errors.

justsomehnguy · on Oct 14, 2022

> With ECC RAM, usually exactly two bit-flips will produce a detected, but uncorrectable, error on read-back.

>> HPE Fast Fault Tolerant (ADDDC)—Enables the system to correct memory errors and continue to operate in cases of multiple DRAM device failures on a DIMM. Provides protection against uncorrectable memory errors beyond what is available with Advanced ECC.

https://techlibrary.hpe.com/docs/iss/proliant-gen10-uefi/s_c...

https://www.hpe.com/us/en/collaterals/collateral.4aa4-3490.M...

hansvm · on Oct 14, 2022

Your intuition is right. Hamming codes work by packing that bit-space full of spheres and assuming the middle of each sphere is the codeword. Then error detection/correction is just mapping to the middle and detecting if that's different than what you started with. Since the space is completely filled, too many errors simply result in an incorrect codeword.

However, you can cheaply extend Hamming codes with, e.g., a parity bit, so that errors in a slightly larger radius are detectable, though for obvious reasons you couldn't correct such an error.

No comment on what sort of algorithm is used for ECC, though it might be worth mentioning that the above is a pretty general feature of error correction, where it's possible to cheaply or even for free be able to detect errors in a larger radius than you're able to correct.