Huh. My experience has been that AMD wins there unless your application is so small that it fits into Intel's smaller cache. And I'd have thought AMD's new 3D V-Cache parts would make your developers drool, letting them actually inline everything instead of being scared of building apps too big to fit in cache.
Not my experience at all, and I work across different teams that own different latency-sensitive apps. Most of them have unhygienically huge working sets.
Do you know this for a fact? I've done some work in the industry where I needed to make fast software, but never the sub-microsecond tick-to-trade kind of fast, so I really don't know.
There was a great presentation from 2017 about some of Optiver's low-latency techniques[1]. I had assumed they released it because they had since obviated all of them by switching to FPGAs, but I don't know. Either way, the speaker suggested that if you ever needed to ping main memory for anything, you'd already lost. So I wouldn't have thought DDIO plays into their thinking much.
The idea is precisely that you want to avoid pinging main memory at all, which is possible (in the happy case) if you do things correctly with DDIO. Not everything is done in hardware where I am. I am wary of saying much because my employer frowns on it, and admittedly I work on the software more than the hardware, but DDIO is certainly important to us.
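Speaking only in generalities from public material: here's a minimal sketch of what the happy case can look like on the software side, assuming the NIC DMAs packets into a ring in host memory. The slot layout, sequence/epoch protocol, and poll_ring function are illustrative assumptions, not any real NIC's API; the point is just that with DDIO the freshly written lines land in L3, so the polling loads below hit cache rather than DRAM.

    /* Illustrative sketch only: a pinned thread busy-polls a ring that
     * the NIC DMAs into. With DDIO, the DMA'd lines land in L3, so the
     * spin loop hits cache, not DRAM. Slot layout and the seq/epoch
     * protocol are made up for the example, not any real NIC's API. */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    #define RING_SLOTS 1024

    struct slot {
        _Atomic uint32_t seq;   /* published last by the device/driver */
        uint32_t len;
        uint8_t payload[120];
    } __attribute__((aligned(64))); /* 128 bytes: two cache lines per slot */

    static struct slot ring[RING_SLOTS];

    void poll_ring(void (*handle)(const uint8_t *, uint32_t))
    {
        uint32_t expected = 1; /* epoch counter, bumped each ring pass */
        for (size_t i = 0;; i = (i + 1) % RING_SLOTS) {
            /* Spin until the device publishes this slot; in the happy
             * case the load resolves in L3 where DDIO deposited it. */
            while (atomic_load_explicit(&ring[i].seq, memory_order_acquire)
                   != expected)
                ;
            handle(ring[i].payload, ring[i].len);
            if (i == RING_SLOTS - 1)
                expected++; /* epoch advances as the ring wraps */
        }
    }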
DDIO operates mostly transparently to software, with the I/O controller feeding DMAs into a slice of L3. Hardware can opt out by setting PCIe TLP header hints, and you have some system-wide configurability via MSRs, but it's not something a userspace application can take into its own hands.
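To make the MSR part concrete, here's a rough sketch of reading that knob, assuming the IIO_LLC_WAYS register at 0xC8B that public research on Xeon-SP parts commonly cites. That address is an assumption, not architectural; it varies by platform, and reading it needs root plus the msr kernel module, which is exactly why a userspace app can't manage this per-process.

    /* Minimal sketch (root only): read the DDIO LLC-way mask via the
     * msr kernel module. 0xC8B (often called IIO_LLC_WAYS) is an
     * assumed, non-architectural address from public Xeon-SP research;
     * check your platform's docs before trusting it. */
    #include <fcntl.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define IIO_LLC_WAYS 0xC8B /* assumed MSR address; platform-specific */

    int main(void)
    {
        /* /dev/cpu/0/msr is exposed by `modprobe msr`; reads need root. */
        int fd = open("/dev/cpu/0/msr", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        uint64_t mask;
        /* pread's file offset selects which MSR to read. */
        if (pread(fd, &mask, sizeof mask, IIO_LLC_WAYS) != sizeof mask) {
            perror("pread");
            close(fd);
            return 1;
        }
        close(fd);

        /* Each set bit is an L3 way that inbound DMA may allocate into;
         * the default is reportedly just two ways on most parts. */
        printf("IIO LLC way mask: 0x%" PRIx64 "\n", mask);
        return 0;
    }

Writing the mask is the same dance with pwrite, and it's system-wide, which is the point: it's a platform tuning decision, not an application one.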
I don’t know that this definitively answers that question. It’s possible to standardize on a different architecture for cost/performance and keep a small population of Intel machines in service because you want access to their superior PMUs. Most of what you learn on the Intel boxes would still apply to the rest of the fleet.