library_name: diffusers
tags:
  - modular-diffusers
  - diffusers
  - qwenimage-layered
  - text-to-image

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: QwenImageLayeredAutoBlocks

Description: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]

Pipeline Architecture

This modular pipeline is composed of the following blocks:

text_encoder (QwenImageLayeredTextEncoderStep)
- QwenImage-Layered text encoder step that encodes the text prompt. If no prompt is provided, it generates one from the image.
vae_encoder (QwenImageLayeredVaeEncoderStep)
- VAE encoder step that encodes the image inputs into their latent representations.
denoise (QwenImageLayeredCoreDenoiseStep)
- Core denoising workflow for the QwenImage-Layered img2img task.
decode (QwenImageLayeredDecoderStep)
- Decode unpacked latents (B, C, layers+1, H, W) into layer images.
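As a rough illustration of how the four blocks above hand data to one another, the sketch below chains four stand-in functions over a shared state dictionary. It is not the Diffusers modular API: `run_blocks`, the stand-in step functions, and the string placeholders are all hypothetical, and only the block names and their order come from the list above.

```python
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Step = Callable[[State], State]


def text_encoder(state: State) -> State:
    # Stand-in: fall back to an image-derived caption when no prompt is given.
    prompt = state.get("prompt") or "<caption generated from image>"
    return {"prompt_embeds": f"embeds({prompt})"}


def vae_encoder(state: State) -> State:
    # Stand-in: pretend to encode the input image into latents.
    return {"image_latents": f"latents({state['image']})"}


def denoise(state: State) -> State:
    # Stand-in: combine image latents and prompt embeddings.
    steps = state.get("num_inference_steps", 50)
    return {
        "latents": f"denoised({state['image_latents']}, "
                   f"{state['prompt_embeds']}, steps={steps})"
    }


def decode(state: State) -> State:
    # Stand-in: turn the denoised latents into a list of images.
    return {"images": [f"decoded({state['latents']})"]}


# Block names and order mirror the architecture list; the callables are toys.
BLOCKS: List[Tuple[str, Step]] = [
    ("text_encoder", text_encoder),
    ("vae_encoder", vae_encoder),
    ("denoise", denoise),
    ("decode", decode),
]


def run_blocks(blocks: List[Tuple[str, Step]], **inputs: Any) -> Tuple[State, List[str]]:
    """Run each block in order, merging its outputs into the shared state."""
    state: State = dict(inputs)
    executed: List[str] = []
    for name, step in blocks:
        state.update(step(state))
        executed.append(name)
    return state, executed


state, executed = run_blocks(BLOCKS, image="input.png", prompt=None)
print(executed)
print(state["images"])
```

Because each block only reads from and writes to the shared state, swapping one entry in `BLOCKS` for a different callable is enough to customize a single stage in this toy model.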

Model Components

image_resize_processor (VaeImageProcessor)
text_encoder (Qwen2_5_VLForConditionalGeneration)
processor (Qwen2VLProcessor)
tokenizer (Qwen2Tokenizer): The tokenizer to use
guider (ClassifierFreeGuidance)
image_processor (VaeImageProcessor)
vae (AutoencoderKLQwenImage)
pachifier (QwenImageLayeredPachifier)
scheduler (FlowMatchEulerDiscreteScheduler)
transformer (QwenImageTransformer2DModel)

Input/Output Specification

Inputs:

image (Image | list): Reference image(s) for denoising. Can be a single image or list of images.
resolution (int, optional, defaults to 640): Target area to resize the image to. Must be either 640 or 1024.
prompt (str, optional): The prompt or prompts to guide image generation.
use_en_prompt (bool, optional, defaults to False): Whether to use the English prompt template.
negative_prompt (str, optional): The prompt or prompts that should not guide the image generation.
max_sequence_length (int, optional, defaults to 1024): Maximum sequence length for prompt encoding.
generator (Generator, optional): Torch generator for deterministic generation.
num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
latents (Tensor, optional): Pre-generated noisy latents for image generation.
layers (int, optional, defaults to 4): Number of layers to extract from the image.
num_inference_steps (int, optional, defaults to 50): The number of denoising steps.
sigmas (list, optional): Custom sigmas for the denoising process.
attention_kwargs (dict, optional): Additional kwargs for attention processors.
**denoiser_input_fields (None, optional): Conditional model inputs for the denoiser, e.g. prompt_embeds, negative_prompt_embeds.
output_type (str, optional, defaults to pil): Output format: 'pil', 'np', 'pt'.
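The documented defaults can be collected in one place and checked before a call. The sketch below is a minimal, hypothetical helper (`LayeredInputs` and `to_kwargs` are not part of Diffusers) covering a subset of the inputs. The default values and the allowed `resolution` and `output_type` values come from the list above; the requirement that the integer counts be positive is an added assumption.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Optional

ALLOWED_RESOLUTIONS = (640, 1024)        # from the `resolution` entry
ALLOWED_OUTPUT_TYPES = ("pil", "np", "pt")  # from the `output_type` entry


@dataclass
class LayeredInputs:
    """Hypothetical container for a subset of the documented inputs."""

    image: Any
    resolution: int = 640
    prompt: Optional[str] = None
    use_en_prompt: bool = False
    negative_prompt: Optional[str] = None
    max_sequence_length: int = 1024
    num_images_per_prompt: int = 1
    layers: int = 4
    num_inference_steps: int = 50
    sigmas: Optional[List[float]] = None
    output_type: str = "pil"

    def __post_init__(self) -> None:
        if self.image is None:
            raise ValueError("image is required")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
        if self.output_type not in ALLOWED_OUTPUT_TYPES:
            raise ValueError(f"output_type must be one of {ALLOWED_OUTPUT_TYPES}")
        # Assumption (not stated in the card): these counts must be positive.
        for name in ("max_sequence_length", "num_images_per_prompt",
                     "layers", "num_inference_steps"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be >= 1")

    def to_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments, dropping optional inputs left as None."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


inputs = LayeredInputs(image="input.png", layers=3)
print(inputs.to_kwargs())
```

The resulting dictionary could then be unpacked into a pipeline call, assuming the pipeline accepts these names as keyword arguments.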

Outputs:

images (list): Generated images.
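
The decode block is described as operating on unpacked latents of shape (B, C, layers+1, H, W). The sketch below only works through the shape arithmetic of splitting such a tensor along its third axis. `split_layer_frames` is a hypothetical helper, the channel and spatial sizes in the example are illustrative, and how the pipeline maps the resulting frames to the returned images is not specified here.

```python
from typing import List, Sequence, Tuple


def split_layer_frames(latent_shape: Sequence[int]) -> List[Tuple[int, int, int, int]]:
    """Return the per-frame shapes obtained by splitting along axis 2."""
    if len(latent_shape) != 5:
        raise ValueError("expected a 5-D shape (B, C, layers+1, H, W)")
    batch, channels, frames, height, width = latent_shape
    # Each frame keeps the batch, channel, and spatial dimensions.
    return [(batch, channels, height, width) for _ in range(frames)]


# With the default layers=4, the third axis has layers + 1 = 5 entries.
# The values 16 and 80 are placeholders chosen for illustration only.
frame_shapes = split_layer_frames((1, 16, 4 + 1, 80, 80))
print(len(frame_shapes), frame_shapes[0])
```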