I don't think flops/$ is enough to really capture the difference here.

You couldn't replicate the scale of compute this allows, no matter how many V100s you had.

A huge amount of the cost here is embodied in networking and memory.

If you only care about flops/$ and ignore interconnect and memory, then the 4090 is a much fairer comparison, and even that card isn't designed to optimise flops/$.