# BLIP-Diffusion

BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://huggingface.co/papers/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation.

The abstract from the paper is:

*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).*

The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization.

`BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/).

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]
#### diffusers.BlipDiffusionPipeline[[diffusers.BlipDiffusionPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.36.0/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L80)

Pipeline for zero-shot subject-driven generation using BLIP-Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.36.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).

#### __call__[[diffusers.BlipDiffusionPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/v0.36.0/src/diffusers/pipelines/blip_diffusion/pipeline_blip_diffusion.py#L192)

`__call__(prompt: List[str], reference_image: PIL.Image.Image, source_subject_category: List[str], target_subject_category: List[str], latents: Optional[torch.Tensor] = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: Union[torch.Generator, List[torch.Generator], None] = None, neg_prompt: Optional[str] = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: Optional[str] = 'pil', return_dict: bool = True)`

- **prompt** (`List[str]`) --
  The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
  The reference image to condition the generation on.
- **source_subject_category** (`List[str]`) --
  The source subject category.
- **target_subject_category** (`List[str]`) --
  The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` corresponds to `w` in equation 2
  of the [Imagen paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
  The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
  The width of the generated image.
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to `""`) --
  The prompt describing what the image generation should steer away from. Ignored when not using
  guidance (i.e., ignored if `guidance_scale` is not greater than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
  The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
  repeated to amplify it.
- **prompt_reps** (`int`, *optional*, defaults to 20) --
  The number of times the prompt is repeated. Combined with `prompt_strength` to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
  (`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return an [ImagePipelineOutput](/docs/diffusers/v0.36.0/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) instead of a plain tuple.
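
To make the role of `guidance_scale` concrete, here is a minimal sketch of the classifier-free guidance combination, assuming the pipeline follows the standard formulation from the papers linked above. The helper `apply_guidance` and the toy values are hypothetical and not part of the library; real pipelines operate on tensors rather than Python lists.

```python
from typing import List, Sequence


def apply_guidance(
    noise_uncond: Sequence[float],
    noise_text: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and text-conditioned noise predictions.

    Computes eps_uncond + w * (eps_text - eps_uncond), which is algebraically
    the same as w * eps_text + (1 - w) * eps_uncond.
    """
    if len(noise_uncond) != len(noise_text):
        raise ValueError("both predictions must have the same length")
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]


# Toy values standing in for flattened model outputs (illustrative only).
uncond = [0.0, 1.0, -1.0]
text = [1.0, 1.0, 0.0]

# With w = 1 the result equals the text-conditioned prediction (no extra guidance).
no_guidance = apply_guidance(uncond, text, guidance_scale=1.0)

# With w > 1 the result is pushed further along the (text - uncond) direction.
guided = apply_guidance(uncond, text, guidance_scale=7.5)
```

In this formulation the unconditional branch is typically where a negative prompt such as `neg_prompt` enters, which is why it has no effect when guidance is disabled.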

Function invoked when calling the pipeline for generation.

Examples:
```py
>>> from diffusers.pipelines import BlipDiffusionPipeline
>>> from diffusers.utils import load_image
>>> import torch

>>> blip_diffusion_pipe = BlipDiffusionPipeline.from_pretrained(
...     "Salesforce/blipdiffusion", torch_dtype=torch.float16
... ).to("cuda")

>>> cond_subject = "dog"
>>> tgt_subject = "dog"
>>> text_prompt_input = "swimming underwater"

>>> cond_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/dog.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 25
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

>>> output = blip_diffusion_pipe(
...     text_prompt_input,
...     cond_image,
...     cond_subject,
...     tgt_subject,
...     guidance_scale=guidance_scale,
...     num_inference_steps=num_inference_steps,
...     neg_prompt=negative_prompt,
...     height=512,
...     width=512,
... ).images
>>> output[0].save("image.png")
```
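
The example above passes the target subject (`"dog"`) and a short text prompt separately. The sketch below shows one plausible way `prompt_strength` and `prompt_reps` could turn these into the amplified prompt. It is an assumption for illustration: the helper `build_prompt`, the `"a {subject} {prompt}"` template, and the use of the product `prompt_strength * prompt_reps` as the repetition count are not confirmed by this page.

```python
from typing import List


def build_prompt(
    prompts: List[str],
    target_subjects: List[str],
    prompt_strength: float = 1.0,
    prompt_reps: int = 20,
) -> List[str]:
    """Prefix each prompt with its subject and repeat it to amplify it."""
    amplified = []
    for prompt, subject in zip(prompts, target_subjects):
        # Assumed template: the target subject is placed in front of the prompt.
        full_prompt = f"a {subject} {prompt.strip()}"
        # Assumed rule: repetition count is the truncated product of both values.
        repeats = int(prompt_strength * prompt_reps)
        amplified.append(", ".join([full_prompt] * repeats))
    return amplified


# Two repetitions keep the output short enough to read.
short = build_prompt(["swimming underwater"], ["dog"], prompt_strength=1.0, prompt_reps=2)

# The documented defaults (1.0 and 20) yield 20 repetitions under this assumption.
default = build_prompt(["swimming underwater"], ["dog"])
```

Under these assumptions, lowering `prompt_strength` below `1.0` reduces the number of repetitions without changing `prompt_reps`.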

**Parameters:**

tokenizer (`CLIPTokenizer`) : Tokenizer for the text encoder

text_encoder (`ContextCLIPTextModel`) : Text encoder to encode the text prompt

vae ([AutoencoderKL](/docs/diffusers/v0.36.0/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : VAE model to map the latents to the image

unet ([UNet2DConditionModel](/docs/diffusers/v0.36.0/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : Conditional U-Net architecture to denoise the image embedding.

scheduler ([PNDMScheduler](/docs/diffusers/v0.36.0/en/api/schedulers/pndm#diffusers.PNDMScheduler)) : A scheduler to be used in combination with `unet` to generate image latents.

qformer (`Blip2QFormerModel`) : QFormer model to get multi-modal embeddings from the text and image.

image_processor (`BlipImageProcessor`) : Image Processor to preprocess and postprocess the image.

ctx_begin_pos (`int`, *optional*, defaults to 2) : Position of the context token in the text encoder.

**Returns:**

[ImagePipelineOutput](/docs/diffusers/v0.36.0/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) or `tuple`

## BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]
#### diffusers.BlipDiffusionControlNetPipeline[[diffusers.BlipDiffusionControlNetPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.36.0/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L87)

Pipeline for Canny-edge-based controlled subject-driven generation using BLIP-Diffusion.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.36.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the
library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.).

#### __call__[[diffusers.BlipDiffusionControlNetPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/v0.36.0/src/diffusers/pipelines/controlnet/pipeline_controlnet_blip_diffusion.py#L240)

`__call__(prompt: List[str], reference_image: PIL.Image.Image, condtioning_image: PIL.Image.Image, source_subject_category: List[str], target_subject_category: List[str], latents: Optional[torch.Tensor] = None, guidance_scale: float = 7.5, height: int = 512, width: int = 512, num_inference_steps: int = 50, generator: Union[torch.Generator, List[torch.Generator], None] = None, neg_prompt: Optional[str] = '', prompt_strength: float = 1.0, prompt_reps: int = 20, output_type: Optional[str] = 'pil', return_dict: bool = True)`

- **prompt** (`List[str]`) --
  The prompt or prompts to guide the image generation.
- **reference_image** (`PIL.Image.Image`) --
  The reference image to condition the generation on.
- **condtioning_image** (`PIL.Image.Image`) --
  The conditioning canny edge image to condition the generation on.
- **source_subject_category** (`List[str]`) --
  The source subject category.
- **target_subject_category** (`List[str]`) --
  The target subject category.
- **latents** (`torch.Tensor`, *optional*) --
  Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
  generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
  tensor will be generated by random sampling.
- **guidance_scale** (`float`, *optional*, defaults to 7.5) --
  Guidance scale as defined in [Classifier-Free Diffusion
  Guidance](https://huggingface.co/papers/2207.12598). `guidance_scale` corresponds to `w` in equation 2
  of the [Imagen paper](https://huggingface.co/papers/2205.11487). Guidance is enabled by setting
  `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely
  linked to the text `prompt`, usually at the expense of lower image quality.
- **height** (`int`, *optional*, defaults to 512) --
  The height of the generated image.
- **width** (`int`, *optional*, defaults to 512) --
  The width of the generated image.
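- **seed** (`int`, *optional*, defaults to 42) --
  The seed to use for random generation. Note that `seed` is not part of the `__call__` signature;
  pass a seeded `generator` to make generation deterministic.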
- **num_inference_steps** (`int`, *optional*, defaults to 50) --
  The number of denoising steps. More denoising steps usually lead to a higher quality image at the
  expense of slower inference.
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*) --
  One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
  to make generation deterministic.
- **neg_prompt** (`str`, *optional*, defaults to `""`) --
  The prompt describing what the image generation should steer away from. Ignored when not using
  guidance (i.e., ignored if `guidance_scale` is not greater than `1`).
- **prompt_strength** (`float`, *optional*, defaults to 1.0) --
  The strength of the prompt. Together with `prompt_reps`, it determines how many times the prompt is
  repeated to amplify it.
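- **prompt_reps** (`int`, *optional*, defaults to 20) --
  The number of times the prompt is repeated. Combined with `prompt_strength` to amplify the prompt.
- **output_type** (`str`, *optional*, defaults to `"pil"`) --
  The output format of the generated image. Choose between: `"pil"` (`PIL.Image.Image`), `"np"`
  (`np.array`) or `"pt"` (`torch.Tensor`).
- **return_dict** (`bool`, *optional*, defaults to `True`) --
  Whether or not to return an [ImagePipelineOutput](/docs/diffusers/v0.36.0/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) instead of a plain tuple.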
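
When supplying your own `latents`, the tensor must match the latent resolution implied by `height` and `width`. The sketch below estimates that shape under stated assumptions: four latent channels and a spatial downscale factor derived from a four-stage U-Net configuration, as in Stable Diffusion v1-style models. The helper `latent_shape` and its default values are hypothetical; check the loaded pipeline's `unet.config` for the actual values.

```python
from typing import Sequence, Tuple


def latent_shape(
    height: int = 512,
    width: int = 512,
    batch_size: int = 1,
    latent_channels: int = 4,
    block_out_channels: Sequence[int] = (320, 640, 1280, 1280),
) -> Tuple[int, int, int, int]:
    """Return the assumed (batch, channels, height, width) of the latents."""
    # Assumed downscale between pixel space and latent space:
    # 2 ** (number of stages - 1), i.e. 8 for a four-stage configuration.
    scale = 2 ** (len(block_out_channels) - 1)
    if height % scale or width % scale:
        raise ValueError(f"height and width must be divisible by {scale}")
    return (batch_size, latent_channels, height // scale, width // scale)


# The documented defaults of 512 x 512 map to 64 x 64 latents under these assumptions.
shape = latent_shape()
```

A tensor of this shape, sampled from a Gaussian distribution, could then be passed as `latents` to reuse the same starting noise across different prompts.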

Function invoked when calling the pipeline for generation.

Examples:
```py
>>> from diffusers.pipelines import BlipDiffusionControlNetPipeline
>>> from diffusers.utils import load_image
>>> from controlnet_aux import CannyDetector
>>> import torch

>>> blip_diffusion_pipe = BlipDiffusionControlNetPipeline.from_pretrained(
...     "Salesforce/blipdiffusion-controlnet", torch_dtype=torch.float16
... ).to("cuda")

>>> style_subject = "flower"
>>> tgt_subject = "teapot"
>>> text_prompt = "on a marble table"

>>> cldm_cond_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/kettle.jpg"
... ).resize((512, 512))
>>> canny = CannyDetector()
>>> cldm_cond_image = canny(cldm_cond_image, 30, 70, output_type="pil")
>>> style_image = load_image(
...     "https://huggingface.co/datasets/ayushtues/blipdiffusion_images/resolve/main/flower.jpg"
... )
>>> guidance_scale = 7.5
>>> num_inference_steps = 50
>>> negative_prompt = "over-exposure, under-exposure, saturated, duplicate, out of frame, lowres, cropped, worst quality, low quality, jpeg artifacts, morbid, mutilated, out of frame, ugly, bad anatomy, bad proportions, deformed, blurry, duplicate"

>>> output = blip_diffusion_pipe(
...     text_prompt,
...     style_image,
...     cldm_cond_image,
...     style_subject,
...     tgt_subject,
...     guidance_scale=guidance_scale,
...     num_inference_steps=num_inference_steps,
...     neg_prompt=negative_prompt,
...     height=512,
...     width=512,
... ).images
>>> output[0].save("image.png")
```

**Parameters:**

tokenizer (`CLIPTokenizer`) : Tokenizer for the text encoder

text_encoder (`ContextCLIPTextModel`) : Text encoder to encode the text prompt

vae ([AutoencoderKL](/docs/diffusers/v0.36.0/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : VAE model to map the latents to the image

unet ([UNet2DConditionModel](/docs/diffusers/v0.36.0/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : Conditional U-Net architecture to denoise the image embedding.

scheduler ([PNDMScheduler](/docs/diffusers/v0.36.0/en/api/schedulers/pndm#diffusers.PNDMScheduler)) : A scheduler to be used in combination with `unet` to generate image latents.

qformer (`Blip2QFormerModel`) : QFormer model to get multi-modal embeddings from the text and image.

controlnet ([ControlNetModel](/docs/diffusers/v0.36.0/en/api/models/controlnet#diffusers.ControlNetModel)) : ControlNet model to get the conditioning image embedding.

image_processor (`BlipImageProcessor`) : Image Processor to preprocess and postprocess the image.

ctx_begin_pos (`int`, *optional*, defaults to 2) : Position of the context token in the text encoder.

**Returns:**

[ImagePipelineOutput](/docs/diffusers/v0.36.0/en/api/pipelines/ddim#diffusers.ImagePipelineOutput) or `tuple`