YiYiXu HF Staff
metadata
library_name: diffusers
tags:
  - modular-diffusers
  - diffusers
  - qwenimage-layered
  - text-to-image

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

Pipeline Type: QwenImageLayeredAutoBlocks

Description: Auto Modular pipeline for layered denoising tasks using QwenImage-Layered.

This pipeline uses a 4-block architecture that can be customized and extended.

Example Usage

[TODO]

Pipeline Architecture

This modular pipeline is composed of the following blocks:

  1. text_encoder (QwenImageLayeredTextEncoderStep)
    • QwenImage-Layered text encoder step that encodes the text prompt. If no prompt is provided, it generates one from the image.
  2. vae_encoder (QwenImageLayeredVaeEncoderStep)
    • VAE encoder step that encodes the image inputs into their latent representations.
  3. denoise (QwenImageLayeredCoreDenoiseStep)
    • Core denoising workflow for QwenImage-Layered img2img task.
  4. decode (QwenImageLayeredDecoderStep)
    • Decode unpacked latents (B, C, layers+1, H, W) into layer images.
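
The block order above can be pictured as a chain of functions that read from and write to a shared state. The following is a minimal sketch in plain Python, not the Diffusers implementation: the four functions, the `State` dictionary, and `run_blocks` are hypothetical stand-ins that use placeholder strings instead of tensors, and only the block names and their order come from the list above.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared state passed from block to block (not the Diffusers type).
State = Dict[str, object]
Block = Callable[[State], State]


def text_encoder(state: State) -> State:
    # Stand-in: fall back to an image-derived caption when no prompt is given.
    prompt = state.get("prompt") or "<caption generated from image>"
    return {**state, "prompt_embeds": f"embeds({prompt})"}


def vae_encoder(state: State) -> State:
    # Stand-in: encode the input image into a latent placeholder.
    return {**state, "image_latents": f"latents({state['image']})"}


def denoise(state: State) -> State:
    # Needs the outputs of both encoder blocks.
    cond = (state["prompt_embeds"], state["image_latents"])
    return {**state, "latents": f"denoised{cond}"}


def decode(state: State) -> State:
    # Stand-in: turn the denoised latents into output images.
    return {**state, "images": f"decoded({state['latents']})"}


# Same names and order as the block list above.
BLOCKS: List[Tuple[str, Block]] = [
    ("text_encoder", text_encoder),
    ("vae_encoder", vae_encoder),
    ("denoise", denoise),
    ("decode", decode),
]


def run_blocks(blocks: List[Tuple[str, Block]], state: State) -> State:
    """Run each block in order, recording the names that were executed."""
    trace = []
    for name, block in blocks:
        state = block(state)
        trace.append(name)
    return {**state, "trace": trace}


result = run_blocks(BLOCKS, {"image": "input.png", "prompt": None})
print(result["trace"])
```

Because each stand-in only depends on keys written by earlier blocks, replacing or inserting a block amounts to editing the `BLOCKS` list, which is the kind of customization the 4-block architecture is meant to allow.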

Model Components

  1. image_resize_processor (VaeImageProcessor)
  2. text_encoder (Qwen2_5_VLForConditionalGeneration)
  3. processor (Qwen2VLProcessor)
  4. tokenizer (Qwen2Tokenizer): The tokenizer to use
  5. guider (ClassifierFreeGuidance)
  6. image_processor (VaeImageProcessor)
  7. vae (AutoencoderKLQwenImage)
  8. pachifier (QwenImageLayeredPachifier)
  9. scheduler (FlowMatchEulerDiscreteScheduler)
  10. transformer (QwenImageTransformer2DModel)

Input/Output Specification

Inputs:

  • image (Image | list): Reference image(s) for denoising. Can be a single image or list of images.
  • resolution (int, optional, defaults to 640): The target area to resize the image to. Supported values are 640 and 1024.
  • prompt (str, optional): The prompt or prompts to guide image generation.
  • use_en_prompt (bool, optional, defaults to False): Whether to use the English prompt template.
  • negative_prompt (str, optional): The prompt or prompts that should not guide image generation.
  • max_sequence_length (int, optional, defaults to 1024): Maximum sequence length for prompt encoding.
  • generator (Generator, optional): Torch generator for deterministic generation.
  • num_images_per_prompt (int, optional, defaults to 1): The number of images to generate per prompt.
  • latents (Tensor, optional): Pre-generated noisy latents for image generation.
  • layers (int, optional, defaults to 4): Number of layers to extract from the image
  • num_inference_steps (int, optional, defaults to 50): The number of denoising steps.
  • sigmas (list, optional): Custom sigmas for the denoising process.
  • attention_kwargs (dict, optional): Additional kwargs for attention processors.
  • **denoiser_input_fields (None, optional): conditional model inputs for the denoiser: e.g. prompt_embeds, negative_prompt_embeds, etc.
  • output_type (str, optional, defaults to pil): Output format: 'pil', 'np', 'pt'.
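
As a way to keep the defaults above in one place, the documented inputs can be mirrored in a small container that is checked before anything is passed to the pipeline. This is a hedged sketch: `LayeredInputs` and `to_kwargs` are hypothetical helpers, not part of Diffusers, the field names and defaults are copied from the list above, and the validation rules (for example rejecting resolutions other than 640 and 1024) are assumptions based on the descriptions, not confirmed pipeline behavior.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

ALLOWED_RESOLUTIONS = (640, 1024)          # values named in the input list
ALLOWED_OUTPUT_TYPES = ("pil", "np", "pt")  # values named in the input list


@dataclass
class LayeredInputs:
    """Hypothetical container mirroring a subset of the documented inputs."""

    image: Any
    resolution: int = 640
    prompt: Optional[str] = None
    use_en_prompt: bool = False
    negative_prompt: Optional[str] = None
    max_sequence_length: int = 1024
    num_images_per_prompt: int = 1
    layers: int = 4
    num_inference_steps: int = 50
    sigmas: Optional[List[float]] = None
    output_type: str = "pil"

    def __post_init__(self) -> None:
        # Assumed checks; the real pipeline may validate differently.
        if self.image is None:
            raise ValueError("image is required")
        if self.resolution not in ALLOWED_RESOLUTIONS:
            raise ValueError(f"resolution must be one of {ALLOWED_RESOLUTIONS}")
        if self.output_type not in ALLOWED_OUTPUT_TYPES:
            raise ValueError(f"output_type must be one of {ALLOWED_OUTPUT_TYPES}")
        if self.layers < 1 or self.num_inference_steps < 1:
            raise ValueError("layers and num_inference_steps must be >= 1")

    def to_kwargs(self) -> Dict[str, Any]:
        """Return keyword arguments, dropping unset (None) optional fields."""
        return {k: v for k, v in vars(self).items() if v is not None}


example = LayeredInputs(image="input.png", layers=3)
print(example.to_kwargs())
```

The resulting dictionary could then be unpacked into a pipeline call. The call itself is left out here because the example usage for this repository is still marked as a TODO.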

Outputs:

  • images (list): Generated images.