
MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation

GitHub · Paper · License

MultiDiffSense is a ControlNet-based diffusion model that generates realistic, physically grounded tactile sensor images. It is dual-conditioned on rendered depth maps of 3D objects and structured text prompts, and a single model produces outputs for three tactile sensor modalities.

Model Details

| Property | Value |
|---|---|
| Architecture | ControlNet built on Stable Diffusion 1.5 |
| Task | Depth map + text prompt → multi-modal tactile sensor image generation |
| Input | 512×512 depth map (viridis colourmap) + text prompt |
| Output | 512×512 tactile sensor image |
| Training | ~150 epochs, frozen SD backbone, lr = 1e-5, batch size 8 |
| Parameters | ~860M (SD 1.5) + ~360M (ControlNet copy) |
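The conditioning input is a 512×512 depth map rendered with the viridis colourmap. A minimal numpy-only sketch of that preprocessing, assuming a raw single-channel depth array; the helper name and the coarse five-point colormap approximation are illustrative, not taken from the repository (which presumably uses matplotlib's full viridis map):

```python
import numpy as np

# A few anchor colours of matplotlib's viridis colormap (coarse
# approximation so this sketch needs only numpy).
_VIRIDIS = np.array(
    [[68, 1, 84], [59, 82, 139], [33, 145, 140], [94, 201, 98], [253, 231, 37]],
    dtype=np.float32,
)

def depth_to_viridis(depth, size=512):
    """Normalise a raw depth map, resize it to size x size, and colour it
    with an approximate viridis map (hypothetical helper, not repo code)."""
    d = depth.astype(np.float32)
    d = (d - d.min()) / (np.ptp(d) + 1e-8)               # normalise to [0, 1]
    ys = np.linspace(0, d.shape[0] - 1, size).astype(int)
    xs = np.linspace(0, d.shape[1] - 1, size).astype(int)
    d = d[np.ix_(ys, xs)]                                # nearest-neighbour resize
    t = np.linspace(0, 1, len(_VIRIDIS))
    rgb = np.stack([np.interp(d, t, _VIRIDIS[:, c]) for c in range(3)], axis=-1)
    return rgb.astype(np.uint8)                          # (size, size, 3) image

cond = depth_to_viridis(np.random.rand(64, 80))
print(cond.shape, cond.dtype)  # (512, 512, 3) uint8
```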

Supported Tactile Sensor Modalities

| Sensor | Description |
|---|---|
| TacTip | Optical tactile sensor with pin-based deformation markers |
| ViTac | Vision-based tactile sensor (no markers) |
| ViTacTip | Combined vision-tactile sensor |

Files

| File | Description |
|---|---|
| `multidiffsense.ckpt` | Trained checkpoint (trained on short prompts + depth maps) |

Usage

Clone the GitHub repository and follow the installation instructions, then run inference. The checkpoint is downloaded automatically on first run:

```bash
git clone https://github.com/sirine-b/MultiDiffSense.git
cd MultiDiffSense
pip install -r requirements.txt

# Single depth map:
python multidiffsense/controlnet/generate.py \
    --source_image path/to/depth_map.png \
    --prompt '{"sensor_context": "captured by a high-resolution vision only sensor ViTac.", "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0}}'

# Batch generation from a prompt file:
python multidiffsense/controlnet/generate.py \
    --dataset_dir datasets \
    --prompt_json datasets/test/prompt_ViTacTip.json
```
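The `--prompt` argument takes a JSON string. A short Python sketch for building it programmatically; the field names (`sensor_context`, `object_pose`) are copied from the example above, but the authoritative schema is defined in the repository:

```python
import json
import shlex

# Structured prompt matching the single-depth-map example above.
prompt = {
    "sensor_context": "captured by a high-resolution vision only sensor ViTac.",
    "object_pose": {"x": 0.12, "y": -0.34, "z": 1.5, "yaw": 15.0},
}
prompt_arg = json.dumps(prompt)

# shlex.quote keeps the JSON intact when the command is run through a shell.
cmd = (
    "python multidiffsense/controlnet/generate.py "
    "--source_image path/to/depth_map.png "
    f"--prompt {shlex.quote(prompt_arg)}"
)
print(cmd)
```

Serialising with `json.dumps` and quoting with `shlex.quote` avoids hand-escaping the nested quotes in the pose dictionary.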

See the GitHub repository for full documentation on dataset preparation, training from scratch, evaluation, and ablation studies.

Citation

@inproceedings{multidiffsense2026,
    title     = {MultiDiffSense: Diffusion-Based Multi-Modal Visuo-Tactile Image Generation Conditioned on Object Shape and Contact Pose},
    author    = {Sirine Bhouri and Lan Wei and Jian-Qing Zheng and Dandan Zhang},
    booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
    year      = {2026},
    url       = {https://arxiv.org/abs/2602.19348}
}

License

MIT
