V1: Human-Centric Video Foundation Model

This repo contains Diffusers-format model weights for the V1 Text-to-Video, Image-to-Video, and Video-to-Video models. You can find the inference code in our GitHub repository, V1.

Introduction

V1 is an open-source human-centric video foundation model. By fine-tuning HunyuanVideo on O(10M) high-quality film and television clips, V1 offers the following key advantages:

πŸ”‘ Key Features

1. Video-to-Video Generation Pipeline

The V1 model is a hybrid architecture that combines Tencent's HunyuanVideo model with Stability AI's Stable Video Diffusion (SVD). During inference, the model accepts a user prompt and an optional video input, which are processed before generation. For Video-to-Video (V2V) generation, the system uses video interpolation techniques to extract frames from the input video; these frames are ordered by timestamp and fed, together with the user prompt, as image inputs to the SVD model to generate the final video.
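
As a rough sketch of this preprocessing step, the snippet below pulls evenly spaced, timestamp-ordered frames from an input video with OpenCV. The sampling strategy and function name are illustrative assumptions; the repository's actual frame-handling code may differ.

```python
# Illustrative sketch of V2V frame extraction (not the repository's actual code).
# Assumes OpenCV (pip install opencv-python).
import cv2

def extract_frames(video_path: str, num_frames: int = 16):
    """Return (timestamp_seconds, rgb_frame) pairs, evenly spaced and in order."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 24.0
    step = max(total - 1, 1) / max(num_frames - 1, 1)
    frames = []
    for i in range(num_frames):
        idx = int(i * step)
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV decodes to BGR; diffusion pipelines expect RGB.
            frames.append((idx / fps, cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames  # ordered by timestamp, ready to condition SVD
```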

At inference time, the backend dynamically switches between the HunyuanVideo and SVD models based on the input file type. By default, V1 uses a fine-tuned version of HunyuanVideo; when a video file is detected in the user input, the system automatically switches to SVD, enabling the Video-to-Video generation workflow.
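
A minimal sketch of this routing logic, assuming hypothetical run_hunyuan and run_svd entry points (the real dispatch lives in the GitHub repository's inference code):

```python
# Hypothetical sketch of the input-based model dispatch; `run_hunyuan` and
# `run_svd` are placeholders for the repository's actual pipelines.
from pathlib import Path

VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi", ".webm", ".mkv"}

def run_hunyuan(prompt, input_path=None):
    raise NotImplementedError("placeholder for the fine-tuned HunyuanVideo pipeline")

def run_svd(prompt, video_path):
    raise NotImplementedError("placeholder for the SVD V2V pipeline")

def generate(prompt: str, input_path: str | None = None):
    if input_path and Path(input_path).suffix.lower() in VIDEO_EXTENSIONS:
        # A video file was detected: route to the SVD-based V2V workflow.
        return run_svd(prompt, input_path)
    # Default: the fine-tuned HunyuanVideo model (T2V, or I2V for image input).
    return run_hunyuan(prompt, input_path)
```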

2. Advanced Model Capabilities

  1. Open-Source Leadership: The Text-to-Video model achieves state-of-the-art (SOTA) performance among open-source models, comparable to proprietary models like Kling and Hailuo.
  2. Advanced Facial Animation: Captures 33 distinct facial expressions with over 400 natural movement combinations, accurately reflecting human emotions.
  3. Cinematic Lighting and Aesthetics: Trained on high-quality Hollywood-level film and television data, each generated frame exhibits cinematic quality in composition, actor positioning, and camera angles.

3. Self-Developed Data Cleaning and Annotation Pipeline

Our model is built on a self-developed data cleaning and annotation pipeline, creating a vast dataset of high-quality film, television, and documentary content.

  • Expression Classification: Categorizes human facial expressions into 33 distinct types.
  • Character Spatial Awareness: Utilizes 3D human reconstruction technology to understand spatial relationships between multiple people in a video, enabling film-level character positioning.
  • Action Recognition: Constructs over 400 action semantic units to achieve a precise understanding of human actions.
  • Scene Understanding: Conducts cross-modal correlation analysis of clothing, scenes, and plots.

4. Multi-Stage Image-to-Video Pretraining

Our multi-stage pretraining pipeline, inspired by the HunyuanVideo design, consists of the following stages:

  • Stage 1: Model Domain Transfer Pretraining: We use a large dataset of O(10M) film and television clips to adapt the text-to-video model to the human-centric video domain.
  • Stage 2: Image-to-Video Model Pretraining: We convert the text-to-video model from Stage 1 into an image-to-video model by adjusting the conv-in parameters (see the sketch after this list). This new model is then pretrained on the same dataset used in Stage 1.
  • Stage 3: High-Quality Fine-Tuning: We fine-tune the image-to-video model on a high-quality subset of the original dataset, ensuring superior performance and quality.
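
To make the conv-in adjustment in Stage 2 concrete, the sketch below shows a common way to widen a model's input convolution so it accepts extra image-conditioning channels while preserving the pretrained T2V behavior. This is a generic technique, not V1's exact code; the layer type and channel counts are assumptions.

```python
# Generic sketch of a T2V -> I2V conv-in conversion: widen the input layer
# and zero-init the new channels so the model starts out identical to T2V.
import torch
import torch.nn as nn

def expand_conv_in(old_conv: nn.Conv3d, extra_channels: int) -> nn.Conv3d:
    new_conv = nn.Conv3d(
        old_conv.in_channels + extra_channels,
        old_conv.out_channels,
        kernel_size=old_conv.kernel_size,
        stride=old_conv.stride,
        padding=old_conv.padding,
        bias=old_conv.bias is not None,
    )
    with torch.no_grad():
        new_conv.weight.zero_()
        # Copy the pretrained weights into the original input channels;
        # the extra conditioning channels contribute nothing until trained.
        new_conv.weight[:, : old_conv.in_channels] = old_conv.weight
        if old_conv.bias is not None:
            new_conv.bias.copy_(old_conv.bias)
    return new_conv
```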

πŸ“¦ Model Introduction

| Model Name | Resolution | Video Length (frames) | FPS |
|---|---|---|---|
| V1-Hunyuan-I2V | 544px960p | 97 | 24 |
| V1-Hunyuan-T2V | 544px960p | 97 | 24 |
| V1-SVD-V2V | 544px960p | 97 | 24 |

Usage

Note: The V1 model is a hybrid of two models (tencent/HunyuanVideo and stabilityai/stable-video-diffusion-img2vid-xt) and cannot be loaded directly using DiffusionPipeline.from_pretrained("NullVoider/V1"). Instead, you need to clone the model repository locally and use the inference code provided in the associated GitHub repository.

Usage Guide

1. Clone the Model Repository Locally

The model weights are hosted on Hugging Face. Make sure git-lfs is installed (the weight files are stored with Git LFS), then clone the repository to your local machine:

```bash
git clone https://huggingface.co/NullVoider/V1
```
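
If you prefer not to use git, the same weights can be fetched with the huggingface_hub client (the local_dir below is an arbitrary choice):

```python
# Alternative to git clone: fetch the repository with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="NullVoider/V1", local_dir="./V1")
print("Model weights downloaded to:", local_dir)
```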