Comparing DDR5 Memory from Micron, Samsung, SK Hynix (eetimes.com)
82 points by JoachimS on Feb 15, 2022 | 56 comments


If anyone thinks that on-die ECC is a good thing as the manufacturers are touting, please go read the discussions on this topic over in the forums at www.realworldtech.com. The goal of on-die ECC is purely to ensure that DRAM manufacturers are able to obtain better yields by reducing the impact of defects, which is not the same as ensuring data integrity. This means that it fails the "trust but verify" tenet. Even worse, some failures may not even get reported to the system, as is the case with ECC implemented in the memory controllers and caches of modern CPUs. The industry is trying to make this look like a good thing, but I'm on the same side as Linus Torvalds: all modern systems should ship with ECC memory. IBM got it right with parity memory in the IBM PC.


It's entirely possible that on-die ECC is still a good thing for the end user. To really judge, you'd need to compare the error rate (and the proportion corrected by ECC) of dies that would have previously failed validation. It may be that it's good for both sides, i.e. more dies can be used (so higher supply and lower prices for the consumer), yet the uncorrected error rate is still lower than that of dies that would have previously passed validation but lack on-die ECC.

I doubt any manufacturer would make that data public, but an estimate could be made if error rates actually start increasing in the real world as a result of DDR5 allowing this.

I agree that end-to-end ECC really should be the default for consumer products these days, but so long as the big players see it as a "Professional User" product differentiation point it'll always be more expensive than it should be.


> so long as the big players see it as a "Professional User" product differentiation point it'll always be more expensive than it should be.

Right. The more important Linus to speak up for ECC isn't Torvalds. It's Linus Sebastian, of Linus Tech Tips. He's made a few videos on ECC targeted towards gamers. Gamers drive the enthusiast PC market and when they start caring, more ECC gets made which will drive the cost down a bit. Last time I bought 32GB DDR4 UDIMM ECC there was literally one SKU. Not manufacturer. Not brand. SKU. One single item in production in the entire world. 16GB wasn't much better off, either.

It's a hard sell, though. Non-ECC will always be cheaper because it costs less to produce. Gamers don't really care that ECC prevents one crash in years because they are used to frequent crashes already. They are largely being fed dogshit from the AAA gaming industry and they have learned to just deal with it. Crashes are just part of being on the bleeding edge of gaming and Nvidia/Radeon drivers. One less crash in a sea of crashes isn't something gamers are lining up for. But a better model GPU or bigger SSD? It's an obvious choice.


> Gamers don't really care that ECC prevents one crash in years because they are used to frequent crashes already.

I work on GPU drivers for one of those companies.

We regularly get reports and backtraces that cannot be reproduced, or "Cannot Happen" without some external factor (e.g. some other bit of code poking around our memory space). Often they're just silently dropped or ignored on the long tail of issues that nobody can get any traction on.

My understanding is that the stats from hyperscalers show ECC correction events happening a lot more often than "Common Knowledge" might imply. I wonder just what proportion of things that are blamed on software may actually be due to hardware issues like this?

Again, without a significant change in the market (i.e. enough gamers start using ECC to be statistically relevant for comparing stability) this cannot really be tested, but I've wondered.


Except that anyone using Intel desktop CPUs pretty much can't use ECC thanks to marketing deciding that ECC is a market segmentation feature.

The real way to make ECC happen industry-wide is for OS vendors like Microsoft to make it a platform requirement. A no-ECC, no-boot policy would change things overnight. Sadly, we can't even get DRAM manufacturers to fix row hammer properly, so the likelihood of this happening is pretty much nil.


If people cared, they would buy ECC-capable chips. In fact my desktop is a Xeon E3-1230v5, which was cheaper and slightly slower (3.4 vs 3.6 GHz or something) than the equivalent i7. It was $50 more for the motherboard and $100 more for the RAM. I'm sure if the market flocked to ECC-capable chips (the silicon is the same) Intel would sell them.

So many people grumble, but I'm not really sure Intel should push ECC if desktop users aren't willing to pay a modest premium for it.

Many cheer AMD, which does not disable ECC on desktop chips, but neither does it promise ECC will actually work. It's a confusing mess between physical capability (RAM increases by 16GB when you add a 16GB DIMM) and actually correcting errors and telling the OS about the event. Only on EPYC does AMD test and certify that ECC will work.


> If people cared, they would buy ECC-capable chips.

People don’t care, because people don’t know. RowHammer-vulnerable RAM is like canned food with botulinum and gasoline with lead. In 100 years, we’ll all be flabbergasted that everyone in the industry wasn’t arrested for letting it happen.


Heh, dunno. The rates I've seen are likely 100x lower (or more) than those of other common causes of mistakes, corruption, and confusion.

Do I want ECC? Sure. Would the bit flips ECC could prevent be anywhere in the top 10 yearly causes of file corruption, application crashes, OS crashes, etc.? Unlikely.


You’re not wrong, but that’s not really the point. The rest of the computer industry needs serious reform, too. That doesn’t take away from needing ECC in “consumer” products.


You don't fix rowhammer with ECC; you fix it by tracking row activation counts and then refreshing nearby rows, without any cheap-outs like small hash tables.



I've read that paper, yes. It works by bypassing target row refresh, because all those chips are trying to be clever about their row tracking and failing.

Track row activation counts for real, no estimates and no hash tables, and you're not vulnerable to this.
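
To sketch what that could look like (illustrative Python; the threshold and the refresh_row hook are hypothetical, and a real implementation lives in controller/DRAM hardware, not software):

    # Minimal sketch: exact per-row activation counting for rowhammer
    # mitigation. The threshold and refresh hook are illustrative only.
    REFRESH_THRESHOLD = 50_000  # assumed max activations per refresh window

    class RowActivationTracker:
        def __init__(self, num_rows):
            # One real counter per row: no hash tables, no sampling, so an
            # attacker cannot slip activations past the tracker.
            self.num_rows = num_rows
            self.counts = [0] * num_rows

        def on_activate(self, row):
            self.counts[row] += 1
            if self.counts[row] >= REFRESH_THRESHOLD:
                # Refresh the physically adjacent victim rows before their
                # cells can be disturbed into flipping.
                for victim in (row - 1, row + 1):
                    if 0 <= victim < self.num_rows:
                        self.refresh_row(victim)
                self.counts[row] = 0

        def on_refresh_window_end(self):
            # Regular auto-refresh restores every cell, so counters reset.
            self.counts = [0] * self.num_rows

        def refresh_row(self, row):
            # Stand-in for the DRAM-internal targeted refresh command.
            print(f"targeted refresh of row {row}")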


> I'm not really sure Intel should push ECC if desktop users aren't willing to pay a modest premium for it.

$50 more for the motherboard and $100 more for the RAM is not a "modest premium" for most of the desktop market. But aside from that, the bigger problem is that consumers (and especially gamers) will always prioritize the advantages that are more easily quantified. When forced to choose between ECC support and a few hundred MHz plus the option to overclock further, consumers will pick the faster processor.

If Intel offered a part at the high end of the product line that had both ECC and overclocking enabled, it would be guaranteed to sell quite a few units, because a lot of consumers will always want to own the top of the line part. And such a part would be an opportunity for a better experiment to determine how much premium users are willing to pay for ECC, if Intel also kept offering the current -K overclockable parts that don't have ECC.


Well, the CPU was $50 cheaper, so that offset part of the ECC premium. I cared about ECC much more than the 166 MHz difference or whatever it was. For people buying an i7 (this was before the i9, I believe) + 16GB of RAM in 2015, $100 is a pretty small premium (less than 10%). Decent mid-range GPUs were $300 (GTX 1070). There were lower-end Xeons equivalent to the i5 as well (lacking SMT/hyperthreading).


Your comments about AMD are not entirely true: AMD officially supports ECC in the Pro variants of their desktop CPUs / APUs. I'm currently using a Ryzen 3800X with ECC memory and EDAC reports that it's working. ASRock lists qualified ECC DIMMs for virtually all of their motherboards. It's more that it's a feature motherboard vendors can support if they want to rather than being a platform requirement. But yes, it is sad that AMD doesn't make it an officially supported feature of the platform.
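
For anyone wanting to check the same thing on their own Linux box, EDAC exposes per-memory-controller error counters through sysfs. A rough sketch of reading them (assumes an EDAC driver such as amd64_edac is loaded; the paths are the standard EDAC sysfs layout):

    # Rough sketch: read the Linux EDAC corrected/uncorrected error counters.
    # Assumes an EDAC driver (e.g. amd64_edac on Ryzen) is loaded.
    from pathlib import Path

    for mc in sorted(Path("/sys/devices/system/edac/mc").glob("mc[0-9]*")):
        ce = (mc / "ce_count").read_text().strip()  # corrected errors
        ue = (mc / "ue_count").read_text().strip()  # uncorrected errors
        print(f"{mc.name}: corrected={ce} uncorrected={ue}")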


Right, but sadly AMD Ryzen Pro CPUs aren't sold directly to consumers (well, unless you buy grey market on eBay), and even if they were, there's no such thing as a Ryzen Pro motherboard or a Ryzen Pro BIOS. Not to mention AMD has been slow to upstream various EDAC-related kernel patches.

It's not really a supported config: there's no guarantee it will work, no promise from AMD (other than that it's not disabled), and you can't return a Ryzen CPU because the ECC doesn't work.

Various tests from various Reddit posts show that some vendors "qualify" ECC DIMMs as working (an inserted DIMM increases the RAM available), some correct bits but don't correctly tell the kernel, and others actually do correct and report. So it's basically just a big mess. I'd buy an EPYC, but sadly, unlike with Intel, the premium for a "server" chip is huge, if you can find them anywhere near MSRP. I looked for an EPYC 7313P near MSRP (announced in March 2021) without luck and finally gave up and bought a Ryzen.

With all that said, ASRock is, from what I can tell, the motherboard company with the best ECC support on AMD. I've got an ASRock X570D4U; other than it refusing to allow sharing an IPMI/BMC interface with the system, I'm pretty happy with it.


> AMD officially supports ECC in the Pro variants of their desktop CPUs / APUs

ECC is only officially supported on the Ryzen Threadripper PRO, and that's largely because it uses registered memory which will practically always have ECC capability.

Regular Ryzen PRO CPUs have the same ECC support as non-PRO Ryzen CPUs: functional, but not validated by AMD.

Ryzen PRO APUs have (non-validated) support for ECC, whereas non-PRO APUs don't support ECC at all.


You can use ECC by buying the Xeon version which is only slightly more expensive.


Though the motherboards and RAM modules are also harder to find, with far less choice (and actually the processors are too, depending on your market).


Yes, sure, motherboards and CPUs have far fewer choices. But generally plenty for any normal use case. Motherboards and CPUs come in a crazy number of flavors, none of which matter one whit for normal use cases. It's gotten so bad that motherboards are differentiating on the manliness of the heat sink design, lighting options, and the color of the motherboard.


> My understanding is that the stats from hyperscalers show ECC correction events happening a lot more often than "Common Knowledge" might imply. I wonder just what proportion of things that are blamed on software may actually be due to hardware issues like this?

What is the order of magnitude of ECC correction events?



Looks like a good paper, but transistors have gotten quite a bit smaller since 2009, and voltages have gotten lower as well, so the energy required to flip a bit has shrunk considerably since then.


Linus[0] already made a video agreeing with Linus[1]'s rant: https://www.youtube.com/watch?v=pPeCNrNTr3k

I can actually give you a GAMING (or at least enthusiast) use case for ECC, too: memory overclock validation. Right now it's kind of a dark art, but ECC failures would absolutely give you a very explicit early warning that you're pushing the CPU IMC too hard (sketch at the end of this comment).

[0] Pronounced LIE-nus

[1] Pronounced lee-nus
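
Here's roughly what that early warning could look like on Linux, assuming EDAC error reporting actually works on your platform: poll the corrected-error counter while a memory stress test runs, and treat any increment as a sign to back off.

    # Sketch: poll the EDAC corrected-error counter during a memory stress
    # test. Any increment means the settings are already marginal, even
    # though nothing has visibly crashed yet. Assumes EDAC reporting works
    # on this platform; the path is the standard EDAC sysfs location.
    import time
    from pathlib import Path

    CE_COUNT = Path("/sys/devices/system/edac/mc/mc0/ce_count")

    baseline = int(CE_COUNT.read_text())
    while True:  # Ctrl-C to stop
        time.sleep(5)
        current = int(CE_COUNT.read_text())
        if current > baseline:
            print(f"{current - baseline} corrected error(s): back off the overclock")
            baseline = current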


I can also tell you that in a 'gamer'-adjacent field, a very large proportion of users have a very loose definition of stability when overclocking. It survived a single short run of a single use case that exercises only one load pattern? Must be perfectly stable. Anything else crashes? It must be the shitty game or GPU driver!

I hope that ECC will give more warning that you're close to the edge (e.g. any ECC correction means it's unstable and you should back off), but I fear people will just keep pushing until the ECC can no longer even correct the errors.

Professionally, I work on GPU drivers for systems that do not allow overclocking, so it shouldn't affect our stats, but it certainly is a mess for other teams. I've noticed it's even become a common suggestion on GPU manufacturers' fan forums to disable any overclock and revalidate when changing vendors or even updating drivers, presumably because systems are so close to the edge that small changes in access patterns will cause them to explode.


I don't see how the on-die error rates could do anything but improve.

A system with no error correction has to be so close to perfect. Even if you put in much worse cells, going from zero error correction to mild error correction should leave you significantly better off.


DDR5 contains two forms of ECC. The first is standard ECC, which is used to correct bit flips in transmission. The second, on-die ECC, is used to correct bit flips on the die, hence the name. The world has already accepted that standard ECC on high-speed interfaces is a good idea, so why would on-die ECC be a bad idea? Yes, they correct different error types, but they both attempt to correct corrupted bits, and they do so in a mathematically similar way.
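
As a toy illustration of what "mathematically similar" means, here is a textbook SECDED (single-error-correct, double-error-detect) Hamming code over 4 data bits in Python. This is not any vendor's actual code (real DRAM ECC covers much wider words), but the mechanics are the same: compute a syndrome, correct a single flipped bit, detect a double flip.

    # Toy SECDED Hamming code: 7 Hamming bits plus one overall parity bit.
    def encode(d):  # d: four bits
        code = [0] * 8  # code[1..7] = Hamming(7,4), code[0] = overall parity
        code[3], code[5], code[6], code[7] = d
        for p in (1, 2, 4):  # parity bit p covers positions whose index has bit p set
            code[p] = 0
            for i in range(1, 8):
                if i != p and (i & p):
                    code[p] ^= code[i]
        code[0] = 0
        for i in range(1, 8):
            code[0] ^= code[i]  # overall parity enables double-error detection
        return code

    def decode(code):
        syndrome = 0
        for i in range(1, 8):
            if code[i]:
                syndrome ^= i  # XOR of set positions points at a single flipped bit
        overall = 0
        for bit in code:
            overall ^= bit
        if syndrome == 0 and overall == 0:
            return "clean"
        if overall == 1:  # odd number of flips: assume one, and fix it
            code[syndrome] ^= 1  # syndrome 0 means the parity bit itself flipped
            return "corrected single-bit error"
        return "double-bit error detected (uncorrectable)"

    word = encode([1, 0, 1, 1])
    word[5] ^= 1                # flip one bit in "storage"
    print(decode(word))         # corrected single-bit error

    word = encode([1, 0, 1, 1])
    word[3] ^= 1; word[6] ^= 1  # flip two bits
    print(decode(word))         # double-bit error detected (uncorrectable)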

All that said, there are still both ECC DIMMs (with extra ECC memory chips) and non-ECC DIMMs for DDR5. So if the on-die ECC is concerning, anyone can still get a DIMM with separate ECC memory. But the ECC happening at the interface between the DIMM and the CPU will always exist, and you will have to trust it.


Again, going back to the discussions over on RWT: some of the less robust forms of ECC that DRAM manufacturers typically implement can end up amplifying the problem by turning double-bit flips into silent multi-bit flips, which makes the memory controller's job much harder. DRAM manufacturing process tech is not optimized for logic the way CPU processes are, and those limitations really do constrain how much logic (or "how good") the ECC implemented on DRAM chips can be. I trust CPU manufacturers to get memory controllers right more than I trust DRAM manufacturers to get ECC right, for one simple reason: row hammer.


The on-die ECC for DDR5 is typically:

* mandatory (a hypothetical DDR5 chip without it could have error rates so high it would basically not work)

* an implementation detail (if the raw error rate were not that high, there would be no on-die ECC)

* not reported to the CPU

It's a completely different beast from real ECC. It's not that it is bad or concerning; it's that it does not provide RAS services and, like ECC-less DDR4, should be reserved for consumer electronics doing basically only tasks like entertainment. Actually, in a better world most consumer electronics would have real ECC (instead of none at all, or on-die ECC as an implementation detail), but sadly, for now, vendors do not do that.


Just out of curiosity: do you have this same natural aversion to hard disk drives and NAND flash? Both of these, and many others, are utterly dependent on math to overcome their naturally terrifying basic error rates.


Heh, not to mention CDs, Blu-rays, HD TV broadcasts, etc.

Although it does complicate things, much like what happened when relatively stupid discs of spinning rust (which might lie about sync) transitioned to SSDs, which are a black box with fancy algorithms (for caching, wear leveling, minimizing write amplification, etc.), hidden storage, RAM (for caching), and even, on some drives, both fast and slow flash. So now you have to be very careful to test long enough to exceed the caches (flash or RAM), and if writing randomly you have to sustain it until you've consumed the entire drive, to ensure the housekeeping is included in your performance numbers.

So this raises the spectre of DIMMs with so many errors that a detectable number of lookups might require ECC corrections and additional latency. So we will need an SLA from the DIMMs, something like: 99.99% of the time with less than X ns of latency.


None of the mainstream DDR DRAM protocols support variable latency, so you effectively have that 100% SLA guarantee. I should qualify that: it may not be true for some of IBM's mainframe systems. IBM has created DIMMs that communicate over high-speed serial protocols (using SERDES) between the memory controller and a buffer chip on the DIMM, mostly to make it easier to scale the amount of memory in these systems beyond what would be possible with a pad-limited protocol like DDR.

Now that memory controllers are on the CPU (well, on the I/O die for AMD), there's actually a whole lot more slack in the timing needed to do ECC. While the I/O pads use big, slow transistors for driving the off-die signals, the logic is all implemented with the same 3-5+ GHz transistors used for the rest of the CPU. This is in contrast to the transistors used in DRAM, which are tuned to minimize leakage and are a heck of a lot slower. Putting complicated logic on the DRAM die is a fundamental mismatch with that goal, which encourages DRAM vendors to put the minimum possible logic on their dies. ECC gets better with more logic, and worse with less. Do you really want your DRAM vendor deciding how much ECC is Good Enough for you? I don't.


Ah, I didn't know about the lack of variable latency, thanks. Does CXL.mem fix that?

For similar reasons, I wish that SSDs, MicroSDs, and similar were just stupid devices with simple read/write logic. Then various filesystems, databases, or key/value stores could compete for popularity on things like wear leveling, minimizing write amplification, error checking, write performance, IOPS, etc.


No vendor wants to sell raw flash, without wear leveling, exposed directly to software. It's a significant liability from a warranty perspective, as the manufacturer potentially has to replace the device when the user makes a mistake or uses poorly written / buggy software that wears the flash out prematurely. The one place where that does make sense is for hardware devices using SoCs that include interfaces to directly attach NAND flash, but in that case it's usually the hardware vendor that provides the software, leaving random end users out of the loop.


Dunno, today's warranties often require a tool to verify the drive, which checks the total writes. So if you exceed the lifetime writes you are out of warranty.

Seems trivial to extend that to writes per cell, so if you screw up you lose the warranty.


Not just IBM. Apple supported ECC RAM in the Power Mac G5, and shipped all "cheese grater" Mac Pros with ECC RAM.


Why is it that LPDDR is recently faster than DDR of the same 'generation'? I thought LPDDR is purely a lower voltage version of DDR, so I naively would have expected worse performance. Is it because it's typically closer (physically) to the CPU?


I believe it's just the advantages you get from very short trace lengths. DIMM slots are usually inches away, so you end up with long traces from CPU to DIMM slot, pay the overhead of the slot connection, and then have more traces within the DIMM.

LPDDR, on the other hand, moves the individual DRAM chips as close as possible to the CPU and doesn't have any connector. This also makes it much easier to have wider memory. A 13" MBP can have a 512-bit-wide memory system with at least 16 channels in a thin and light laptop that is quite power efficient. To get something similar with DIMMs you'd have to buy a dual-socket server motherboard with 8 channels per socket, and you'd be lucky to fit that in an ATX-size motherboard in a 1.75"-thick chassis.
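
Back-of-the-envelope, peak bandwidth is just bus width times transfer rate. A quick illustration (the speed grades are assumptions for the sake of the math):

    # Peak bandwidth = bus width (bytes) x transfer rate. The speed grades
    # below are illustrative assumptions, not measured numbers.
    def peak_gb_per_s(bus_width_bits, megatransfers_per_s):
        return bus_width_bits / 8 * megatransfers_per_s / 1000

    print(peak_gb_per_s(512, 6400))  # 512-bit LPDDR5-6400 package: 409.6 GB/s
    print(peak_gb_per_s(64, 4800))   # one 64-bit DDR5-4800 DIMM: 38.4 GB/s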


DDR is typically a bus with more than 1 DIMM slot per channel. LPDDR is typically point-to-point. Electrically, it's a lot easier to meet signal integrity requirements on a point-to-point trace than it is to make a multi-drop bus work properly.


> I thought LPDDR is purely a lower voltage version of DDR

The only similarity between LPDDR5 and DDR5 is the name. Otherwise you could think of them as completely different technologies, just like HBM and DDR.


DDR4L is the low-voltage version of DDR4. LPDDR4(x) are very different, and have at least as much in common with GDDRx as they do with desktop DDR standards.


More importantly, because of the low-power requirement, LPDDR typically has better-binned dies than DDR.


LPDDR uses a wider bus so, at a similar clock rate, it is faster.


Maybe I am not understanding something, but I thought that total memory bandwidth is critical for deep learning applications. This is where on-die HBM would shine, no? I am deferring the purchase of a new desktop/server until processors with HBM come to market. I think AMD is shipping EPYC engineering samples with some version of this memory, and Intel is slated to release theirs by the end of the year. Am I wrong about this?


The only CPU with HBM is Sapphire Rapids and it may cost $20K; for that money you're probably better off buying an H100.


Looks like the H100 doesn't have a release date, let alone a price. There are also some applications that don't scale well on a GPU (lots of interdependence) which could still benefit greatly from HBM on a CPU, although deep learning typically scales very well.


Future deep learning hardware systems such as Nvidia's Grace system will use CPU memory in a hierarchy [0] (check the illustrations here).

While HBM memory is fast, it is limited in size (e.g. 48GB-96GB), whereas DDRx memory can easily reach TBs, and inexpensively at that. You can think of the HBM-DDRx hierarchy as analogous to the L1-L2-L(n) cache hierarchy of CPUs. IIRC, even the CPU HBM implementations will use this HBM-DDRx hierarchy.

[0] https://www.nextplatform.com/2021/04/12/nvidia-enters-the-ar...


Why not just get a GPU? They seem much closer to the needs of deep learning than trying to shoehorn an HBM memory interface onto a chip that wasn't particularly designed for it.


> DDR5 provides both data and clock rates that double the performance up to at least 7,200 MB/s. Additionally, DDR5 lowers the operating voltage to 1.1V.

hmm? 7GB/s is the performance that modern disks achieve

edit: nvm. When it comes to bandwidth, DIMMs provide 38.4 GB/s or 44.8 GB/s.
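
(Those figures are just the 64-bit DIMM data width times the transfer rate: 4800 MT/s × 8 bytes = 38.4 GB/s, and 5600 MT/s × 8 bytes = 44.8 GB/s.)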

I do wonder when disks will be close to RAM sticks, because the gap seems to be closing.

It'd be nice to not have to buy RAM sticks and just use a small 1TB NVMe M.2 disk as both RAM and disk.


Bandwidth isn't the only constraint. You also have latency, finite write cycles, and a minimum number of bytes that need to be rewritten at once. Also SSDs get those speeds by having onboard DRAM caches. That may not be the speed of the flash memory.


> Also SSDs get those speeds by having onboard DRAM caches.

Usually not. Most SSDs that have DRAM only use it to hold the address mapping tables, not any user data. The DRAM cache for that internal metadata helps with random IO performance (saving an extra flash read before the real IO can be done), but has minimal impact on sequential read throughput, which is what has actually reached 7+ GB/s.


And again: minimal die size reduction, lower yield. It has been like that for the past 10 years, with no significant improvement in the foreseeable future.

It is time software took a new look at memory usage again. RAM isn't cheap or free. Now even 8GB of memory is shared with the GPU. Considering we now have 4x or 5x the pixel count, we actually have less memory to use compared to the same 8GB 10 years ago.


I have some G.Skill DDR5 and cannot get it to run at its 6000 MHz XMP profile. I've tried a Gigabyte Z690 Aorus Xtreme and an Asus ProArt Z690 Creator motherboard. Wouldn't mind trying another brand of memory, but can't find anything in stock.

Looks like a common issue from my research.


Do you have 2 or 4 DIMMs? Generally, if you use 4 DIMMs on 2 memory channels, the speeds are derated.


Article seems to imply that all DDR5 chips have ECC. Is this true?


Yes. As discussed in other threads here, the on-die ECC helps increase chip yields but does not prevent off-chip errors. So it's not equivalent to what people normally mean by ECC memory, which stores extra check bits that will correct single-bit errors and detect double-bit errors anywhere along the path: chip, DIMM, DIMM slot, motherboard, socket, or CPU.



