Packing Input Frame Context in Next-Frame Prediction Models for Video Generation by GaggiX
- Diffuse thousands of frames at full 30 fps with 13B models using 6GB of laptop GPU memory.
- Finetune a 13B video model at batch size 64 on a single 8xA100/H100 node for personal/lab experiments.
- A personal RTX 4090 generates at 2.5 seconds/frame (unoptimized) or 1.5 seconds/frame (teacache).
- No timestep distillation.
- Video diffusion, but feels like image diffusion.
A next-frame (or next-frame-section) prediction model looks like this:
So we have many input frames and want to diffuse some new frames.
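Concretely, generation is autoregressive over sections of frames: each step conditions on the frames produced so far and diffuses the next section. Below is a minimal structural sketch of that loop; it is my own illustration, not the FramePack code, and `diffuse_next_section` is a random stand-in for the real conditioned sampler, with an arbitrary latent size.

```python
# Structural sketch of next-frame-section prediction (not the FramePack code).
import torch

def diffuse_next_section(history_latents, num_new_frames, latent_hw=(60, 104)):
    # Placeholder for the real sampler: in practice this would run the 13B
    # video diffusion model conditioned on the (packed) history tokens.
    return torch.randn(num_new_frames, *latent_hw)

# Start from a few encoded input frames (random stand-ins here).
history = torch.randn(4, 60, 104)

# Autoregressive rollout: each step diffuses a short section of new frames
# and appends it to the history used as context for the next step.
for _ in range(10):
    new_section = diffuse_next_section(history, num_new_frames=3)
    history = torch.cat([history, new_section], dim=0)

print(history.shape)  # torch.Size([34, 60, 104])
```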
The idea is that we can encode the input frames to some GPU layout like this:
This chart shows the logical GPU memory layout – frame images are not stitched.
Or, put differently, the context length of each input frame.
Each frame is encoded with a different patchifying kernel to achieve this.
For example, in HunyuanVideo, a 480p frame is typically around 1536 tokens when using a (1, 2, 2) patchifying kernel.
Switching to a (2, 4, 4) patchifying kernel reduces the same frame to 192 tokens.
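As a quick sanity check on those numbers, here is a small sketch (my own illustration, not FramePack code) showing that a frame's token count scales inversely with the volume of its patchifying kernel:

```python
# Token count of a frame scales as 1 / (kernel volume) when the same latent
# is re-patchified with a larger kernel.

def kernel_volume(kernel):
    kt, kh, kw = kernel
    return kt * kh * kw

def rescaled_tokens(base_tokens, base_kernel, new_kernel):
    # Same latent, different patchify kernel: tokens ∝ 1 / kernel volume.
    return base_tokens * kernel_volume(base_kernel) // kernel_volume(new_kernel)

# The HunyuanVideo example from the text: 1536 tokens at (1, 2, 2).
print(rescaled_tokens(1536, (1, 2, 2), (2, 4, 4)))  # 192 tokens
print(rescaled_tokens(1536, (1, 2, 2), (4, 8, 8)))  # 24 tokens, with a hypothetical even coarser kernel
```

This is how different input frames can be given different context lengths in the layout above.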
6 Comments
ZeroCool2u
Wow, the examples are fairly impressive and the resources used to create them are practically trivial. Seems like inference can be run on previous generation consumer hardware. I'd like to see throughput stats for inference on a 5090 too at some point.
Jaxkr
This guy is a genius; for those who don’t know he also brought us ControlNet.
This is the first decent video generation model that runs on consumer hardware. Big deal and I expect ControlNet pose support soon too.
IshKebab
Funny how it really wants people to dance. Even the guy sitting down for an interview just starts dancing sitting down.
fregocap
looks like the only motion it can do…is to dance
WithinReason
Could you do this spatially as well? E.g. generate the image top-down instead of all at once
modeless
Could this be used for video interpolation instead of extrapolation?