Oh this is fantastic. I've been thinking a bit lately about how to do things other than image generation with these types of models, but you need to be quick to the punch these days. Kudos to these researchers; this is going to open the doors for a lot of applications.
Just a week ago I was wondering about this: whether a diffusion model could be used first to generate a 3D character model, and then another model could be used to describe actions that animate the character. Here are the humble beginnings of such a thing.
I was also wondering about the possibility of generating, say, an anthropomorphized cartoon otter that can be animated using a model trained on both otter and human motion to produce a result that is something in between.
It could reduce the workload for producing animated stories by one or more orders of magnitude, possibly in the not-too-distant future.