Model Card for Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis

This repository hosts Prompt-CAM checkpoints trained with ViT-B DINO and DINOv2 backbones on fine-grained image classification datasets (CUB-200-2011, Oxford-IIIT Pet, Stanford Cars, Stanford Dogs, Birds-525). The checkpoints can be used to produce per-class attention maps for exploring the fine-grained traits that distinguish specified species or categories.

Model Details

Model Description

Prompt-CAM is a simple yet effective interpretable transformer that requires no architectural modifications to pretrained ViTs. It injects class-specific prompts into any ViT to make attention maps interpretable for fine-grained analysis. The prompts act as class queries, and the attention between each prompt and the image patches produces human-interpretable heatmaps highlighting the visual traits the model uses to distinguish each class.

Developed by: Arpita Chowdhury, Dipanjyoti Paul, Zheda Mai, Jianyang Gu, Ziheng Zhang, Kazi Sajeed Mehrab, Elizabeth G. Campolongo, Daniel Rubenstein, Charles V. Stewart, Anuj Karpatne, Tanya Berger-Wolf, Yu Su, and Wei-Lun Chao
Model type: Vision Transformer with class-specific prompt injection
License: MIT
Fine-tuned from model: ViT-B DINO and ViT-B DINOv2

Model Sources

Repository: Imageomics/Prompt_CAM
Paper: Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (CVPR 2025), Open-Access
Demo: Interactive Colab demo and local demo.ipynb

Uses

Direct Use

Prompt-CAM can be used directly for:

Fine-grained image classification — predicting the species/category of an image among a large set of visually similar classes.
Visual interpretability — generating per-class attention heatmaps that highlight which image regions and traits the model uses for each class, supporting scientific understanding of what distinguishes species or categories.

Downstream Use

Prompt-CAM can be extended to new fine-grained datasets by following the extension instructions in the repository. It is well-suited for biological image datasets where understanding discriminative traits (e.g., plumage patterns, markings) is as important as classification accuracy.

Out-of-Scope Use

The model is not designed for general-purpose object detection or segmentation.
Performance may degrade significantly on image domains far from the training distribution (e.g., applying a bird-trained model to medical images).

Bias, Risks, and Limitations

Prompt-CAM inherits any biases present in the pretrained ViT backbone (DINO / DINOv2) and in the fine-tuning datasets (e.g., geographic or photographic biases in CUB-200-2011).
Classification performance is tied to image quality; low-resolution or heavily occluded images may yield less reliable predictions and attention maps.

Recommendations

Users should treat attention heatmaps as model explanations to be verified with domain expertise, not as ground-truth biological annotations.

How to Get Started with the Model

Set up the environment (using env_setup.sh):

conda create -n prompt_cam python=3.10
conda activate prompt_cam
source env_setup.sh

Download a checkpoint from this repository (see the Training Data table below) and save it as checkpoints/{backbone}/{dataset}/model.pt. Then visualize the attention maps for a chosen class (class 23 in this example) by running:

CUDA_VISIBLE_DEVICES=0 python visualize.py \
    --config ./experiment/config/prompt_cam/dino/cub/args.yaml \
    --checkpoint ./checkpoints/dino/cub/model.pt \
    --vis_cls 23

Output heatmaps are saved to visualization/dino/cub/class_23/.

For an interactive experience, see the Colab demo or demo.ipynb.
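
The checkpoint placement step above can be scripted. The following is a minimal sketch, not part of the repository: `place_checkpoint` is a hypothetical helper that copies a downloaded file to the checkpoints/{backbone}/{dataset}/model.pt location that the visualization command expects.

```python
import shutil
from pathlib import Path


def place_checkpoint(downloaded: Path, backbone: str, dataset: str,
                     root: Path = Path("checkpoints")) -> Path:
    """Copy a downloaded checkpoint to {root}/{backbone}/{dataset}/model.pt.

    Hypothetical helper for illustration; it is not provided by the repository.
    """
    target = root / backbone / dataset / "model.pt"
    target.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    shutil.copy2(downloaded, target)
    return target


if __name__ == "__main__":
    src = Path("Prompt_CAM_checkpoint_dino_cub.pt")
    if src.exists():
        print(place_checkpoint(src, "dino", "cub"))
```

Copying rather than moving keeps the original download intact, which may be convenient when the same file is reused across experiments.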

Training Details

Training Data

Each checkpoint is trained on the official training split of its respective dataset.

Backbone	Dataset	Checkpoint
DINO (ViT-B/16)	CUB-200-2011	Prompt_CAM_checkpoint_dino_cub.pt
DINO (ViT-B/16)	Stanford Cars	Coming soon
DINO (ViT-B/16)	Stanford Dogs	Coming soon
DINO (ViT-B/16)	Oxford-IIIT Pet	Coming soon
DINO (ViT-B/16)	Birds-525	Coming soon
DINOv2 (ViT-B/14)	CUB-200-2011	Coming soon
DINOv2 (ViT-B/14)	Stanford Dogs	Coming soon
DINOv2 (ViT-B/14)	Oxford-IIIT Pet	Coming soon

Training Procedure

Only the class-specific prompt tokens are trained; the pretrained ViT backbone weights are kept frozen. The number of prompt tokens equals the number of classes in the dataset.

Dataset images are organized as:

dataset_name/
├── train/
│   ├── 001.ClassName/
│   │   ├── img1.jpg
│   │   └── ...
│   └── ...
└── val/
    ├── 001.ClassName/
    │   ├── img2.jpg
    │   └── ...
    └── ...

Please see the Data Preparation section of our GitHub repository for more details on training and validation setup, including preprocessing scripts.
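
To sanity-check a dataset folder against the layout above, something like the following sketch may help. It assumes that class labels are assigned by sorted folder name, starting at 0 (the same convention torchvision's ImageFolder uses); the repository's own loader may index classes differently. `index_split` is a hypothetical name.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def index_split(split_dir: Path) -> tuple[dict[str, int], list[tuple[Path, int]]]:
    """Map class folders to integer labels and list (image, label) pairs.

    Assumption: labels follow the sorted order of the class folder names.
    """
    class_dirs = sorted(p for p in split_dir.iterdir() if p.is_dir())
    class_to_idx = {p.name: i for i, p in enumerate(class_dirs)}
    samples = [
        (img, class_to_idx[d.name])
        for d in class_dirs
        for img in sorted(d.iterdir())
        if img.suffix.lower() in IMAGE_SUFFIXES  # skip non-image files
    ]
    return class_to_idx, samples
```

Calling `index_split(Path("dataset_name/train"))` and comparing the number of classes with the number of prompt tokens is a quick way to catch a misplaced folder before training.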

Preprocessing

Step	Train	Val
Resize	240 × 240	224 × 224
Crop	RandomCrop 224 × 224	—
Flip	RandomHorizontalFlip	—
Normalize	ImageNet Inception mean/std	ImageNet Inception mean/std
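
As a rough illustration of two rows in the table, the sketch below shows the arithmetic of the normalization and the geometry of the training crop in plain Python. It assumes that "ImageNet Inception mean/std" refers to timm's IMAGENET_INCEPTION_MEAN and IMAGENET_INCEPTION_STD constants, both (0.5, 0.5, 0.5); the function names are hypothetical and the repository's actual transform pipeline may differ.

```python
import random

# Assumption: "ImageNet Inception mean/std" means timm's
# IMAGENET_INCEPTION_MEAN / IMAGENET_INCEPTION_STD, both (0.5, 0.5, 0.5).
MEAN = (0.5, 0.5, 0.5)
STD = (0.5, 0.5, 0.5)


def normalize_pixel(rgb):
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple((c - m) / s for c, m, s in zip(rgb, MEAN, STD))


def random_crop_box(resized=240, crop=224, rng=random):
    """Sample the (left, top, right, bottom) box of a random square crop."""
    left = rng.randint(0, resized - crop)  # randint is inclusive on both ends
    top = rng.randint(0, resized - crop)
    return left, top, left + crop, top + crop
```

Under this assumption, pixel values in [0, 1] map to [-1, 1], and the 240 to 224 crop can shift the image by at most 16 pixels in each direction.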

Training Hyperparameters

Hyperparameter	Value
Optimizer	SGD
Learning rate	0.001 – 0.005 (dataset-dependent)
Min LR	1e-6
Momentum	0.9
Weight decay	0.001
Epochs	100 – 130
Warmup epochs	20
Warmup LR init	1e-6
Batch size	16
Drop path rate	0.0
VPT dropout	0.0
Precision	fp32
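
The warmup and minimum learning-rate entries in the table imply a schedule, but the card does not name its shape. The sketch below assumes linear warmup followed by cosine decay purely for illustration; `lr_at_epoch` is a hypothetical function, and the defaults use the low end of the stated ranges.

```python
import math


def lr_at_epoch(epoch, base_lr=0.001, total_epochs=100,
                warmup_epochs=20, warmup_lr_init=1e-6, min_lr=1e-6):
    """Learning rate for a 0-indexed epoch under linear warmup + cosine decay.

    Assumption: the schedule shape is not stated in the card; cosine decay
    is used here only as a common choice.
    """
    if epoch < warmup_epochs:
        # Linear ramp from warmup_lr_init up to base_lr.
        frac = epoch / warmup_epochs
        return warmup_lr_init + frac * (base_lr - warmup_lr_init)
    # Cosine decay from base_lr down to min_lr over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

With these defaults the rate starts at 1e-6, reaches the base rate at epoch 20, and returns to 1e-6 by the final epoch.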

Speeds, Sizes, Times

Hardware: NVIDIA RTX A6000
Training time: ≤ 1 hour per checkpoint
Checkpoint size: ~350 MB (ViT-B backbone + prompt tokens)
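
The stated checkpoint size is consistent with a back-of-the-envelope estimate. The sketch below assumes roughly 86 million fp32 parameters for ViT-B (the commonly quoted figure), one 768-dimensional prompt vector per class, and no extra state stored in the file; actual checkpoints may contain additional entries.

```python
BYTES_PER_FP32 = 4
VIT_B_PARAMS = 86_000_000   # approximate, commonly quoted ViT-B size (assumption)
EMBED_DIM = 768             # ViT-B hidden width


def checkpoint_mb(num_classes):
    """Rough fp32 size in MB (10^6 bytes) of backbone weights plus class prompts."""
    prompt_params = num_classes * EMBED_DIM  # assumes one vector per class
    return (VIT_B_PARAMS + prompt_params) * BYTES_PER_FP32 / 1e6


print(round(checkpoint_mb(200), 1))  # CUB-200-2011 has 200 classes
```

Under these assumptions the prompts contribute well under 1 MB, so the file size is dominated by the frozen backbone.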

Evaluation

To evaluate a checkpoint on the test split, run:

CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 main.py \
    --config ./experiment/config/prompt_cam/dino/cub/args.yaml \
    --gpu_num 4

Testing Data, Factors & Metrics

Testing Data

Each model is evaluated on the official test split of its training dataset, which is stored under val/ in the dataset layout described in Training Procedure.

Backbone	Dataset	Checkpoint
DINO (ViT-B/16)	CUB-200-2011	Prompt_CAM_checkpoint_dino_cub.pt
DINO (ViT-B/16)	Stanford Cars	Coming soon
DINO (ViT-B/16)	Stanford Dogs	Coming soon
DINO (ViT-B/16)	Oxford-IIIT Pet	Coming soon
DINO (ViT-B/16)	Birds-525	Coming soon
DINOv2 (ViT-B/14)	CUB-200-2011	Coming soon
DINOv2 (ViT-B/14)	Stanford Dogs	Coming soon
DINOv2 (ViT-B/14)	Oxford-IIIT Pet	Coming soon

Metrics

Top-1 classification accuracy on the official test split.
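
For reference, top-1 accuracy can be computed from per-class scores as in the minimal sketch below. `top1_accuracy` is a hypothetical helper written in plain Python; the repository's evaluation code may compute the metric differently (for example, in batches on the GPU).

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the label.

    scores: one list of per-class scores per sample.
    labels: one integer class index per sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    correct = sum(
        max(range(len(row)), key=row.__getitem__) == label  # argmax of the row
        for row, label in zip(scores, labels)
    )
    return 100.0 * correct / len(labels)
```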

Results

Backbone	Dataset	acc@1
DINO (ViT-B/16)	CUB-200-2011	73.2
DINO (ViT-B/16)	Stanford Cars	83.2
DINO (ViT-B/16)	Stanford Dogs	81.1
DINO (ViT-B/16)	Oxford-IIIT Pet	91.3
DINO (ViT-B/16)	Birds-525	98.8
DINOv2 (ViT-B/14)	CUB-200-2011	74.1
DINOv2 (ViT-B/14)	Stanford Dogs	81.3
DINOv2 (ViT-B/14)	Oxford-IIIT Pet	92.7

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

Hardware Type: NVIDIA RTX A6000
Hours used: ≤ 1 hour per checkpoint
Cloud Provider: N/A (local cluster)
Compute Region: United States
Carbon Emitted: ~0.13 kg CO₂ eq. per checkpoint
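
The order of magnitude of this estimate can be reproduced with simple arithmetic. The sketch below assumes the GPU draws its full 300 W board power for one hour and uses an illustrative grid carbon intensity of 0.43 kg CO₂ eq. per kWh; neither input is stated in this card, so the actual inputs to the calculator may differ.

```python
def co2_kg(power_watts, hours, kg_co2_per_kwh):
    """Estimated emissions: energy drawn (kWh) times grid carbon intensity."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


# Assumptions: 300 W sustained draw for one hour, and an illustrative
# grid intensity of 0.43 kg CO2 eq. per kWh.
print(round(co2_kg(300, 1.0, 0.43), 2))
```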

Technical Specifications

Model Architecture and Objective

Prompt-CAM adds a set of learnable class-specific prompt tokens to the input sequence of a frozen pretrained ViT. Each prompt token corresponds to one class. During the forward pass, the self-attention between each class prompt and the image patch tokens produces a spatial attention map that reveals which patches are most relevant for that class. Only the prompt tokens are optimized during training; all ViT parameters remain frozen.
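
The way a class prompt yields a spatial map can be illustrated with a toy, single-head sketch in plain Python. It assumes the prompt queries and patch keys have already been projected, and it normalizes over patch tokens only; in a real ViT the softmax also covers the other tokens in the sequence, and attention is multi-head. The function names are hypothetical and this is not the repository's implementation.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def class_attention_maps(prompt_queries, patch_keys, grid):
    """Return one grid x grid attention map per class prompt.

    prompt_queries: one query vector per class prompt.
    patch_keys: grid * grid key vectors, one per image patch (row-major).
    """
    dim = len(patch_keys[0])
    maps = []
    for q in prompt_queries:
        # Scaled dot-product scores between this prompt and every patch.
        logits = [dot(q, k) / math.sqrt(dim) for k in patch_keys]
        weights = softmax(logits)  # simplification: patch tokens only
        maps.append([weights[r * grid:(r + 1) * grid] for r in range(grid)])
    return maps


# Toy usage with random vectors: 3 class prompts, a 2 x 2 patch grid, dim 8.
rng = random.Random(0)
queries = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(3)]
keys = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
maps = class_attention_maps(queries, keys, grid=2)
```

Each entry of `maps` is the heatmap for one class; upsampling it to the image resolution gives an overlay of the kind described in this card.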

Compute Infrastructure

Hardware

NVIDIA RTX A6000 GPU.

Software

Python 3.10
PyTorch
timm 1.0.24

See env_setup.sh for the full environment.

Citation

BibTeX:

If you find our work helpful, please consider citing our paper:

@inproceedings{Chowdhury_2025_CVPR,
    author    = {Chowdhury, Arpita and Paul, Dipanjyoti and Mai, Zheda and Gu, Jianyang and Zhang, Ziheng and Mehrab, Kazi Sajeed and Campolongo, Elizabeth G. and Rubenstein, Daniel and Stewart, Charles V. and Karpatne, Anuj and Berger-Wolf, Tanya and Su, Yu and Chao, Wei-Lun},
    title     = {Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {4375--4385},
    doi       = {10.1109/CVPR52734.2025.00413}
}

Model Citation:

@software{Chowdhury_Prompt_CAM_2025,
    author    = {Chowdhury, Arpita and Paul, Dipanjyoti and Mai, Zheda and Gu, Jianyang and Zhang, Ziheng and Mehrab, Kazi Sajeed and Campolongo, Elizabeth G. and Rubenstein, Daniel and Stewart, Charles V. and Karpatne, Anuj and Berger-Wolf, Tanya and Su, Yu and Chao, Wei-Lun},
    license   = {MIT},
    title     = {{Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis}},
    doi       = {<doi once generated>},
    url       = {https://huggingface.co/imageomics/Prompt-CAM},
    version   = {1.0.0},
    month     = {June},
    year      = {2026}
}

APA:

Paper:

Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K. S., Campolongo, E. G., Rubenstein, D., Stewart, C. V., Karpatne, A., Berger-Wolf, T., Su, Y., & Chao, W.-L. (2025). Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis. 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 4375–4385. doi:10.1109/CVPR52734.2025.00413

Model Citation:

Chowdhury, A., Paul, D., Mai, Z., Gu, J., Zhang, Z., Mehrab, K. S., Campolongo, E. G., Rubenstein, D., Stewart, C. V., Karpatne, A., Berger-Wolf, T., Su, Y., & Chao, W.-L. (2025). Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (Version 1.0.0). https://huggingface.co/imageomics/Prompt-CAM

Acknowledgements

Our model builds on pretrained DINO and DINOv2 ViT backbones. We thank the authors for their excellent work.

We also acknowledge:

This work was supported by the Imageomics Institute, which is funded by the US National Science Foundation's Harnessing the Data Revolution (HDR) program under Award #2118240 (Imageomics: A New Frontier of Biological Information Powered by Knowledge-Guided Machine Learning). Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Model Card Authors

Arpita Chowdhury

Model Card Contact

Arpita Chowdhury (GitHub Issues; email: arpitachowdhurytonney@gmail.com)

Paper: Prompt-CAM: Making Vision Transformers Interpretable for Fine-Grained Analysis (arXiv:2501.09333, published Jan 16, 2025)