DreamLite
DreamLite is a text-to-image and image-editing model from ByteDance. It pairs a custom 2D U-Net
(DreamLiteUNetModel) with the Qwen3-VL multimodal encoder as its prompt / image-instruction encoder,
and uses an AutoencoderTiny (TAESD-style) VAE for fast latent encode/decode.
Two pipelines are exposed:
| Pipeline | Modes | CFG | Use case |
|---|---|---|---|
| DreamLitePipeline | text-to-image and image-editing (auto-selected by whether image is None) | 3-branch dual CFG (guidance_scale on text branch, image_guidance_scale on image branch, à la InstructPix2Pix) | Highest quality |
| DreamLiteMobilePipeline | text-to-image and image-editing (auto-selected by whether image is None) | None — distilled, single UNet forward per step | On-device / low-latency |
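To make the "3-branch dual CFG" entry in the table more concrete, here is a minimal sketch of the InstructPix2Pix-style combination it refers to. It is an assumption that DreamLite combines its three branches in exactly this way. The function name `combine_dual_cfg` and its three prediction arguments are hypothetical, and plain Python lists stand in for tensors.

```python
from typing import List


def combine_dual_cfg(
    pred_uncond: List[float],
    pred_image: List[float],
    pred_full: List[float],
    guidance_scale: float = 3.5,
    image_guidance_scale: float = 1.5,
) -> List[float]:
    """InstructPix2Pix-style combination of three model predictions.

    pred_uncond: no text and no image conditioning
    pred_image:  image conditioning only
    pred_full:   text and image conditioning
    """
    return [
        # Start from the unconditional branch, push toward the image-only
        # branch, then push from image-only toward the fully conditioned one.
        u + image_guidance_scale * (i - u) + guidance_scale * (f - i)
        for u, i, f in zip(pred_uncond, pred_image, pred_full)
    ]


# With both scales at 1.0 the result collapses to the fully conditioned branch.
print(combine_dual_cfg([0.0], [1.0], [2.0], 1.0, 1.0))  # [2.0]
```

Under this formulation, raising `guidance_scale` strengthens adherence to the instruction text, while raising `image_guidance_scale` keeps the result closer to the source image.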
Official checkpoints:
- Base model: carlofkl/DreamLite-base
- Distilled mobile model: carlofkl/DreamLite-mobile
Both pipelines auto-detect text-to-image vs. image-editing mode from whether the
image argument is provided. There is no separate Img2Img class.
When loading an input image for editing, prefer
diffusers.utils.load_image(...) over raw PIL.Image.open(...). load_image enforces an RGB conversion and applies EXIF orientation, both of which the pipeline assumes. A plain Image.open of an RGBA / palette / EXIF-rotated source will silently produce a different latent conditioning and degrade output quality.
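If an image is already open as a PIL object (for example, received from an upload handler), the same two normalizations can be applied by hand. The sketch below mirrors the behaviour described above using Pillow only. The helper name `normalize_for_pipeline` is hypothetical and is not part of Diffusers.

```python
from PIL import Image, ImageOps


def normalize_for_pipeline(img: Image.Image) -> Image.Image:
    """Apply EXIF orientation, then convert to 3-channel RGB."""
    img = ImageOps.exif_transpose(img)  # honor the EXIF orientation tag, if any
    return img.convert("RGB")  # drop alpha / expand palette to RGB


# An in-memory RGBA image stands in for a file opened with Image.open.
rgba = Image.new("RGBA", (4, 2), (255, 0, 0, 128))
print(normalize_for_pipeline(rgba).mode)  # RGB
```

When the image comes from a path or URL, calling `load_image` directly remains the simpler option.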
Text-to-image (Base)
import torch
from diffusers import DreamLitePipeline

pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a dog running on the grass",
    negative_prompt="",
    height=1024,
    width=1024,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_t2i.png")

Image editing (Base)
Pass an image to enter edit mode. Both guidance_scale (text branch) and image_guidance_scale
(image branch) are active here.
import torch
from diffusers import DreamLitePipeline
from diffusers.utils import load_image

pipe = DreamLitePipeline.from_pretrained("carlofkl/DreamLite-base", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = pipe(
    prompt="turn the cat into a corgi",
    image=source,
    height=1024,
    width=1024,
    num_inference_steps=28,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_edit.png")

Text-to-image (Mobile)
The mobile pipeline is distilled and skips CFG entirely — a single UNet forward per step. It accepts the
same prompt / height / width / num_inference_steps arguments, but ignores guidance_scale and
image_guidance_scale if passed (a warning is logged).
import torch
from diffusers import DreamLiteMobilePipeline

pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

image = pipe(
    prompt="a dog running on the grass",
    height=1024,
    width=1024,
    num_inference_steps=4,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_mobile_t2i.png")

Image editing (Mobile)
import torch
from diffusers import DreamLiteMobilePipeline
from diffusers.utils import load_image

pipe = DreamLiteMobilePipeline.from_pretrained("carlofkl/DreamLite-mobile", revision="diffusers", torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")

source = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
image = pipe(
    prompt="turn the cat into a corgi",
    image=source,
    height=1024,
    width=1024,
    num_inference_steps=4,
    generator=torch.Generator("cpu").manual_seed(42),
).images[0]
image.save("dreamlite_mobile_edit.png")

Notes and limitations
- Both pipelines force batch_size = 1 internally; num_images_per_prompt controls how many samples are drawn from the same prompt rather than parallel batching.
- The prompt encoder is Qwen3-VL, which is a multimodal model. Loading the full pipeline therefore requires sufficient GPU memory for both the U-Net and the Qwen3-VL text encoder (~4 GB + ~0.7 GB in bf16 for the base release).
- The VAE is AutoencoderTiny and exposes encoder_block_out_channels; vae_scale_factor is derived from it at pipeline init time.
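The last note can be illustrated with a small sketch of how a scale factor might be derived from `encoder_block_out_channels`. The rule used here, one 2x downsample per block after the first, is the convention other Diffusers pipelines follow and is assumed rather than confirmed for DreamLite. The channel tuple and `default_sample_size` value are likewise assumptions chosen to reproduce the documented 1024 default, not values read from a checkpoint.

```python
from typing import Sequence


def vae_scale_factor_from_channels(encoder_block_out_channels: Sequence[int]) -> int:
    """Assumed rule: each encoder block after the first halves the resolution."""
    return 2 ** (len(encoder_block_out_channels) - 1)


# Assumed values for illustration only.
channels = (64, 64, 64, 64)
default_sample_size = 128

scale = vae_scale_factor_from_channels(channels)
print(scale, default_sample_size * scale)  # 8 1024
```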
DreamLitePipeline
class diffusers.DreamLitePipeline
( text_encoder: Qwen3VLForConditionalGeneration, tokenizer: AutoTokenizer, processor: Qwen3VLProcessor, vae: AutoencoderTiny, unet: DreamLiteUNetModel, scheduler: FlowMatchEulerDiscreteScheduler )
DreamLite pipeline for text-to-image and instruction-based image editing.
The same pipeline supports both modes; the operating mode is auto-detected from the inputs:
- image is None -> text-to-image (single CFG on text).
- image is not None -> image-to-image / instruction edit (dual CFG: text + image).
Components:
- text_encoder (Qwen3VLForConditionalGeneration): Multimodal text/vision encoder used to produce conditioning embeddings.
- tokenizer (AutoTokenizer): Tokenizer for text-only (generate) mode.
- processor (Qwen3VLProcessor): Multimodal processor for edit mode (text + image template).
- vae (AutoencoderTiny): Mobile-friendly tiny VAE for latent encode/decode.
- unet (DreamLiteUNetModel): DreamLite UNet (GQA + qk_norm + depthwise-separable convs).
- scheduler (FlowMatchEulerDiscreteScheduler): Flow-matching Euler scheduler with dynamic shift.
Note:
batch_size is currently forced to 1; num_images_per_prompt is supported.
__call__
(
    prompt: typing.Optional[str] = None,
    negative_prompt: typing.Optional[str] = None,
    image: typing.Optional[PIL.Image.Image] = None,
    height: typing.Optional[int] = None,
    width: typing.Optional[int] = None,
    guidance_scale: float = 3.5,
    image_guidance_scale: float = 1.5,
    num_inference_steps: int = 30,
    sigmas: typing.Optional[typing.List[float]] = None,
    num_images_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    max_sequence_length: int = 200,
    text_pad_embedding: typing.Optional[torch.Tensor] = None,
)
Parameters
- prompt — Text prompt.
- negative_prompt — Negative text prompt (defaults to empty string).
- image — Optional input image. If provided, the pipeline runs in edit / image-to-image mode with dual classifier-free guidance; otherwise it runs in text-to-image mode.
- height — Output resolution (height). Defaults to default_sample_size * vae_scale_factor (1024). The same default applies in both T2I and I2I; pass an explicit value to override.
- width — Output resolution (width). Defaults to default_sample_size * vae_scale_factor (1024). The same default applies in both T2I and I2I; pass an explicit value to override.
- guidance_scale — CFG scale on the text branch (both modes).
- image_guidance_scale — Additional CFG scale on the image branch (edit mode only).
- num_inference_steps — Number of denoising steps.
- sigmas — Optional explicit FlowMatch sigmas; defaults to a uniform linspace.
- num_images_per_prompt — Output images per prompt (note: batch_size is forced to 1).
- generator — Random generator(s).
- output_type — "pil", "np", "pt" or "latent".
- return_dict — If True, returns a DreamLitePipelineOutput; else a tuple (images,).
- max_sequence_length — Maximum number of user-prompt tokens kept after dropping the chat-template prefix. Only applies to generate mode (the edit mode uses the multimodal processor's native padding).
- text_pad_embedding — Optional learned pad embedding for masked positions.
Run the DreamLite pipeline.
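The sigmas default ("a uniform linspace") and the scheduler's dynamic shift can be sketched in plain Python. The endpoints used here (1.0 down to 1 / num_inference_steps) follow other flow-matching pipelines in Diffusers and are an assumption for DreamLite. The shift formula is the exponential time shift commonly used with flow-matching Euler schedulers, and the `mu` value is purely illustrative. Both function names are hypothetical.

```python
import math
from typing import List


def uniform_sigmas(num_inference_steps: int) -> List[float]:
    """Evenly spaced sigmas from 1.0 down to 1 / num_inference_steps (assumed endpoints)."""
    if num_inference_steps == 1:
        return [1.0]
    end = 1.0 / num_inference_steps
    step = (1.0 - end) / (num_inference_steps - 1)
    return [1.0 - k * step for k in range(num_inference_steps)]


def shift_sigma(sigma: float, mu: float) -> float:
    """Exponential time shift; a larger mu pushes sigmas toward 1.0 (more steps at high noise)."""
    return math.exp(mu) / (math.exp(mu) + (1.0 / sigma - 1.0))


sigmas = uniform_sigmas(4)
print(sigmas)  # [1.0, 0.75, 0.5, 0.25]

# mu=1.0 is an illustrative value, not the one DreamLite computes.
print([round(shift_sigma(s, mu=1.0), 3) for s in sigmas])
```

A list produced this way can be passed through the `sigmas` argument to override the default schedule.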
DreamLiteMobilePipeline
class diffusers.DreamLiteMobilePipeline
( text_encoder: Qwen3VLForConditionalGeneration, tokenizer: AutoTokenizer, processor: Qwen3VLProcessor, vae: AutoencoderTiny, unet: DreamLiteUNetModel, scheduler: FlowMatchEulerDiscreteScheduler )
DreamLite Mobile pipeline: a distilled, classifier-free-guidance-free variant of DreamLitePipeline for fast few-step inference (default 4 steps).
The operating mode is auto-detected from inputs (same as the base pipeline):
- image is None -> text-to-image.
- image is not None -> image-to-image / instruction edit.
Because classifier-free guidance is distilled away, guidance_scale and image_guidance_scale are
accepted for API parity with DreamLitePipeline but are ignored in the denoising loop. negative_prompt
is intentionally absent.
Components (identical to the base pipeline):
- text_encoder (Qwen3VLForConditionalGeneration): Multimodal text/vision encoder.
- tokenizer (AutoTokenizer): Tokenizer for text-only (generate) mode.
- processor (Qwen3VLProcessor): Multimodal processor for edit mode.
- vae (AutoencoderTiny): Mobile-friendly tiny VAE.
- unet (DreamLiteUNetModel): DreamLite UNet.
- scheduler (FlowMatchEulerDiscreteScheduler): Flow-matching Euler scheduler with dynamic shift.
Note:
batch_size is currently forced to 1; num_images_per_prompt is supported.
__call__
(
    prompt: typing.Union[str, typing.List[str]] = None,
    image: typing.Optional[PIL.Image.Image] = None,
    height: typing.Optional[int] = None,
    width: typing.Optional[int] = None,
    num_inference_steps: int = 4,
    guidance_scale: typing.Optional[float] = None,
    image_guidance_scale: typing.Optional[float] = None,
    sigmas: typing.Optional[typing.List[float]] = None,
    num_images_per_prompt: typing.Optional[int] = 1,
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None,
    output_type: typing.Optional[str] = 'pil',
    return_dict: bool = True,
    max_sequence_length: int = 200,
    text_pad_embedding: typing.Optional[torch.Tensor] = None,
)
Parameters
- prompt — Text prompt.
- image — Optional input image. If provided, runs in edit / image-to-image mode; otherwise runs in text-to-image mode.
- height — Output resolution (height). Defaults to default_sample_size * vae_scale_factor (1024).
- width — Output resolution (width). Defaults to default_sample_size * vae_scale_factor (1024).
- num_inference_steps — Number of denoising steps. Defaults to 4 (distilled).
- guidance_scale — Accepted for API parity with DreamLitePipeline; ignored because CFG was distilled away.
- image_guidance_scale — Accepted for API parity with DreamLitePipeline; ignored because CFG was distilled away.
- sigmas — Optional explicit FlowMatch sigmas; defaults to a uniform linspace.
- num_images_per_prompt — Output images per prompt (note: batch_size is forced to 1).
- generator — Random generator(s).
- output_type — "pil", "np", "pt" or "latent".
- return_dict — If True, returns a DreamLitePipelineOutput; else (images,).
- max_sequence_length — Maximum number of user-prompt tokens kept after dropping the chat-template prefix. Only applies to generate mode (the edit mode uses the multimodal processor's native padding).
- text_pad_embedding — Optional learned pad embedding for masked positions.
Run the distilled DreamLite Mobile pipeline.
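The max_sequence_length description (drop the chat-template prefix, then cap the user-prompt tokens) can be pictured with a short sketch. How the pipeline locates the prefix is not documented here, so `prefix_len` and the helper name `keep_user_tokens` are hypothetical, and a list of integers stands in for token ids or embeddings.

```python
from typing import List, Sequence


def keep_user_tokens(
    token_ids: Sequence[int],
    prefix_len: int,
    max_sequence_length: int = 200,
) -> List[int]:
    """Drop the first prefix_len template tokens, then keep at most max_sequence_length."""
    user_tokens = list(token_ids[prefix_len:])
    return user_tokens[:max_sequence_length]


# Ten fake token ids, a 3-token template prefix, and a cap of 4 user tokens.
print(keep_user_tokens(list(range(10)), prefix_len=3, max_sequence_length=4))  # [3, 4, 5, 6]
```

Under this reading, prompts longer than the cap are truncated silently, so long instructions may lose their tail in generate mode.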
DreamLitePipelineOutput
class diffusers.DreamLitePipelineOutput
( images: typing.Union[typing.List[PIL.Image.Image], numpy.ndarray] )
Output class for DreamLite pipelines.
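Because return_dict switches between an output object and a plain tuple, calling code may want to handle both shapes. The sketch below uses a hypothetical stand-in dataclass, `StandInOutput`, rather than the real DreamLitePipelineOutput, so it runs without loading a model. The helper `unpack_images` is likewise illustrative.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple, Union


@dataclass
class StandInOutput:
    """Hypothetical stand-in for DreamLitePipelineOutput, for illustration only."""

    images: List[Any]


def unpack_images(result: Union[StandInOutput, Tuple[List[Any]]]) -> List[Any]:
    """Return the image list from either return shape."""
    if isinstance(result, tuple):  # shape produced with return_dict=False
        return result[0]
    return result.images  # shape produced with return_dict=True


print(unpack_images(StandInOutput(images=["img0"])))  # ['img0']
print(unpack_images((["img0"],)))  # ['img0']
```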