arxiv:2602.02092

FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

Published on Feb 2 · Submitted by taesiri on Feb 3
Abstract

FSVideo is a fast transformer-based image-to-video diffusion framework that combines a highly compressed video autoencoder, a diffusion transformer architecture with enhanced layer memory, and a multi-resolution generation strategy to achieve high performance at significantly reduced computation time.

AI-generated summary

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. We build our framework on the following key components: (1) a new video autoencoder with a highly compressed latent space (64×64×4 spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design to enhance inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy via a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
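To make the stated 64×64×4 compression concrete, the sketch below computes the latent shape a video would take under that downsampling ratio. This is only back-of-the-envelope arithmetic from the abstract: the latent channel count, the divisibility handling, and the example resolution are all assumptions, not details given in the paper.

```python
# Latent-shape arithmetic implied by a 64x spatial, 4x temporal
# downsampling ratio. The latent channel count (16 here) is a
# placeholder -- the abstract does not specify it.

def latent_shape(num_frames, height, width, channels=16):
    """Map a video's pixel shape to its compressed latent shape,
    assuming 64x spatial and 4x temporal downsampling."""
    assert height % 64 == 0 and width % 64 == 0, "spatial dims must divide by 64"
    assert num_frames % 4 == 0, "frame count must divide by 4"
    return (num_frames // 4, channels, height // 64, width // 64)

# e.g. a hypothetical 64-frame 1280x768 clip
print(latent_shape(64, 768, 1280))  # (16, 16, 12, 20)
```

Since diffusion-transformer cost scales with the number of latent tokens, this aggressive ratio (16,384× fewer spatial-temporal positions than pixel space) is consistent with the order-of-magnitude speedup the abstract claims.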
