
MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation

GitHub · Paper · License

MultiDiffSense is a ControlNet-based diffusion model that generates realistic, physically grounded tactile sensor images. It is dual-conditioned on rendered depth maps of 3D objects and structured text prompts, and a single model produces outputs for three tactile sensor modalities.

Model Details

| Property | Value |
|---|---|
| Architecture | ControlNet built on Stable Diffusion 1.5 |
| Task | Depth map + text prompt → multi-modal tactile sensor image generation |
| Input | 512×512 depth map (viridis colourmap) + text prompt |
| Output | 512×512 tactile sensor image |
| Training | ~150 epochs, frozen SD backbone, lr = 1e-5, batch size 8 |
| Parameters | ~860M (SD 1.5) + ~360M (ControlNet copy) |
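The conditioning input is a 512×512 depth map rendered with the viridis colourmap. A minimal numpy-only sketch of that preprocessing, assuming a raw single-channel depth array; the helper name and the coarse five-point colormap approximation are illustrative, not taken from the repository (which presumably uses matplotlib's full viridis map):

```python
import numpy as np

# A few anchor colours of matplotlib's viridis colormap (coarse
# approximation so this sketch needs only numpy).
_VIRIDIS = np.array(
    [[68, 1, 84], [59, 82, 139], [33, 145, 140], [94, 201, 98], [253, 231, 37]],
    dtype=np.float32,
)

def depth_to_viridis(depth, size=512):
    """Normalise a raw depth map, resize it to size x size, and colour it
    with an approximate viridis map (hypothetical helper, not repo code)."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (np.ptp(d) + 1e-8)               # normalise to [0, 1]
    ys = np.linspace(0, d.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, d.shape[1] - 1, size).astype(int)
    d = d[np.ix_(ys, xs)]                                # nearest-neighbour resize
    t = np.linspace(0, 1, len(_VIRIDIS))
    rgb = np.stack([np.interp(d, t, _VIRIDIS[:, c]) for c in range(3)], axis=-1)
    return rgb.astype(np.uint8)                          # (size, size, 3) image

cond = depth_to_viridis(np.random.rand(64, 80))
print(cond.shape, cond.dtype)  # (512, 512, 3) uint8
```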

Supported Tactile Sensor Modalities

| Sensor | Description |
|---|---|
| TacTip | Optical tactile sensor with pin-based deformation markers |
| ViTac | Vision-based tactile sensor (no markers) |
| ViTacTip | Combined vision-tactile sensor |

Files

| File | Description |
|---|---|
| `multidiffsense.ckpt` | Trained checkpoint (trained on short prompts + depth maps) |

Usage

Clone the GitHub repository and follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt

# Single depth map:
python multidiffsense/controlnet/generate.py \
    --source_image path/to/depth_map.png \
    --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'

# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
    --dataset_dir datasets \
    --prompt_json datasets/test/prompt_ViTacTip.json
```
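The `--prompt` argument takes a JSON string. A short Python sketch for building it programmatically; the field names (`sensor_context`, `object_pose`) are copied from the example above, but the authoritative schema is defined in the repository:

```python
import json
import shlex

# Structured prompt matching the single-depth-map example above.
prompt = {
    "sensor_context": "captured by a high-resolution vision only sensor ViTac.",
    "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0},
}
prompt_arg = json.dumps(prompt)

# shlex.quote keeps the JSON intact when the command is run through a shell.
cmd = (
    "python multidiffsense/controlnet/generate.py "
    "--source_image path/to/depth_map.png "
    f"--prompt {shlex.quote(prompt_arg)}"
)
print(cmd)
```

Serialising with `json.dumps` and quoting with `shlex.quote` avoids hand-escaping the nested quotes in the pose dictionary.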

See the GitHub repository for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.

Citation

@inproceedings{multidiffsense2026,
    title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
    author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
    booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
    year      = {2026},
    url       = {https://arxiv.org/abs/2602.19348}
}

License

MIT
