Packing Input Frame Context in Next-Frame Prediction Models for Video Generation by GaggiX
- Diffuse thousands of frames at full 30 fps with 13B models using 6GB of laptop GPU memory.
- Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments.
- A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache).
- No timestep distillation.
- Video diffusion, but feels like image diffusion.
A next-frame (or next-frame-section) prediction model looks like this:
So we have many input frames and want to diffuse some new frames.
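Concretely, generation is autoregressive over sections of frames: each step conditions on the frames produced so far and diffuses the next section. Below is a minimal structural sketch of that loop; it is my own illustration, not the FramePack code, and `diffuse_next_section` is a random stand-in for the real conditioned sampler, with an arbitrary latent size.

```python
# Structural sketch of next-frame-section prediction (not the FramePack code).
import torch

def diffuse_next_section(history_latents, num_new_frames, latent_hw=(60, 104)):
    # Placeholder for the real sampler: in practice this would run the 13B
    # video diffusion model conditioned on the (packed) history tokens.
    return torch.randn(num_new_frames, *latent_hw)

# Start from a few encoded input frames (random stand-ins here).
history = torch.randn(4, 60, 104)

# Autoregressive rollout: each step diffuses a short section of new frames
# and appends it to the history used as context for the next step.
for _ in range(10):
    new_section = diffuse_next_section(history, num_new_frames=3)
    history = torch.cat([history, new_section], dim=0)

print(history.shape)  # torch.Size([34, 60, 104])
```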
The idea is that we can encode the input frames to some GPU layout like this:
This chart shows the logical GPU memory layout – frame images are not stitched.
Or, put differently, the context length of each input frame.
Each frame is encoded with a different patchifying kernel to achieve this.
For example, in HunyuanVideo, a 480p frame is typically around 1536 tokens when using a (1, 2, 2) patchifying kernel.
Switching to a (2, 4, 4) patchifying kernel reduces the same frame to 192 tokens.
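As a quick sanity check on those numbers, here is a small sketch (my own illustration, not FramePack code) showing that a frame's token count scales inversely with the volume of its patchifying kernel:

```python
# Token count of a frame scales as 1 / (kernel volume) when the same latent
# is re-patchified with a larger kernel.

def kernel_volume(kernel):
    kt, kh, kw = kernel
    return kt * kh * kw

def rescaled_tokens(base_tokens, base_kernel, new_kernel):
    # Same latent, different patchify kernel: tokens ∝ 1 / kernel volume.
    return base_tokens * kernel_volume(base_kernel) // kernel_volume(new_kernel)

# The HunyuanVideo example from the text: 1536 tokens at (1, 2, 2).
print(rescaled_tokens(1536, (1, 2, 2), (2, 4, 4)))  # 192 tokens
print(rescaled_tokens(1536, (1, 2, 2), (4, 8, 8)))  # 24 tokens, with a hypothetical even coarser kernel
```

This is how different input frames can be given different context lengths in the layout above.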
6 Comments
ZeroCool2u
Wow, the examples are fairly impressive and the resources used to create them are practically trivial. Seems like inference can be run on previous generation consumer hardware. I'd like to see throughput stats for inference on a 5090 too at some point.
Jaxkr
This guy is a genius; for those who don’t know he also brought us ControlNet.
This is the first decent video generation model that runs on consumer hardware. Big deal and I expect ControlNet pose support soon too.
IshKebab
Funny how it really wants people to dance. Even the guy sitting down for an interview just starts dancing sitting down.
fregocap
looks like the only motion it can do…is to dance
WithinReason
Could you do this spatially as well? E.g. generate the image top-down instead of all at once
modeless
Could this be used for video interpolation instead of extrapolation?