This content originally appeared on DEV Community and was authored by Abhishek
Yesterday, I got it running in real time, with ease, on a single RTX 5090.
The current dataset for the public-facing demo is mostly GTA clips, and the model already generalizes to any third-person game.
I’ll release this soon on lucidml.ai.
But I’m not stopping here. With the headroom to spare, I’m training the final model on a much larger dataset, one that lets it generalize to literally any image.
Take a picture of yourself → turn it into a game.
All these clips you see were synthetically generated by our model.
Tech Specs:
The model is a Diffusion Transformer that generates video frame by frame.
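To make "frame by frame" concrete, here is a minimal sketch of what autoregressive diffusion sampling can look like. Everything in it is my assumption for illustration: the `model.denoise` method, the step count, and the latent shapes are hypothetical, not the actual API.

```python
import torch

# Minimal sketch, NOT the actual lucidml.ai code. `model.denoise`,
# the step count, and the latent shapes are hypothetical.
@torch.no_grad()
def generate_video(model, first_frame, num_frames, denoise_steps=20):
    """Sample one frame at a time, each conditioned on all frames so far."""
    frames = [first_frame]                        # list of (C, H, W) latents
    for _ in range(num_frames - 1):
        history = torch.stack(frames, dim=0)      # (T, C, H, W) context
        x = torch.randn_like(first_frame)         # next frame starts as noise
        for t in reversed(range(denoise_steps)):
            # One denoising step of the diffusion transformer,
            # conditioned on the previously generated frames.
            x = model.denoise(x, t, context=history)
        frames.append(x)
    return torch.stack(frames, dim=0)             # (num_frames, C, H, W)
```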
It’s trained end-to-end from scratch and is NOT a WAN fine-tune. 4–8× A100s were used in every training run.
Most runs take many days to converge.
What lets us stream in real time is KV caching (FlashAttention-2’s KV-cache function), which keeps the model coherent over long windows, minutes on end.
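If you haven’t seen KV caching before, the idea is that each new frame’s tokens attend over keys and values that were already computed for past frames, so the history is never re-projected. Here is a rough sketch of the mechanism in plain PyTorch (the post says the real model uses FlashAttention-2’s KV-cache function; all names below are mine):

```python
import torch
import torch.nn.functional as F

# Illustrative only: the real model uses FlashAttention-2's KV-cache
# function; this plain-PyTorch version just shows the mechanism.
class KVCache:
    def __init__(self):
        self.k = None  # (B, heads, T_past, head_dim)
        self.v = None

    def append(self, k_new, v_new):
        # Grow the cache along the sequence axis instead of recomputing
        # keys/values for every past frame on each step.
        self.k = k_new if self.k is None else torch.cat([self.k, k_new], dim=2)
        self.v = v_new if self.v is None else torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

def attend_with_cache(q_new, k_new, v_new, cache):
    # q_new covers only the current frame's tokens, but attention spans
    # the entire cached history, which is what keeps minutes-long
    # windows coherent without re-encoding them.
    k_all, v_all = cache.append(k_new, v_new)
    return F.scaled_dot_product_attention(q_new, k_all, v_all)
```

Per step, you only pay to project the new frame’s tokens; everything older is a cache read.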
You can check out my LinkedIn post for more updates!