SpiritSight Agent: Advanced GUI Agent with One Look

📄 Paper • 🤖 Models • 🌐 Project Page • 📚 Datasets

Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.

Models

We recommend fine-tuning the base model on custom data.

Model	Checkpoint	Size	License
SpiritSight-Agent-2B-base	🤗 HF Link	2B	InternVL
SpiritSight-Agent-8B-base	🤗 HF Link	8B	InternVL
SpiritSight-Agent-26B-base	🤗 HF Link	26B	InternVL

Datasets

Coming soon.

Inference

conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

python infer_SSAgent-2B.py
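
If the inference script fails at import time, one quick check is whether the active environment matches the setup commands above. The following is a minimal sketch, not part of this repo: the function names `installed_version` and `check_environment` are hypothetical, and the expected versions are copied from the commands above (Python 3.9, flash-attn 2.3.6). It only inspects installed package metadata and does not load any model.

```python
import sys
from importlib import metadata

# Assumed targets, copied from the setup commands in this README.
EXPECTED_PYTHON = (3, 9)
EXPECTED_FLASH_ATTN = "2.3.6"


def installed_version(package):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment():
    """Return a list of human-readable problems; an empty list means no mismatch."""
    problems = []

    # Compare only major.minor, since the conda command pins python=3.9.
    current = sys.version_info[:2]
    if current != EXPECTED_PYTHON:
        problems.append(
            "Python %d.%d found, expected %d.%d" % (current + EXPECTED_PYTHON)
        )

    flash = installed_version("flash-attn")
    if flash is None:
        problems.append("flash-attn is not installed")
    elif flash.split("+")[0] != EXPECTED_FLASH_ATTN:
        # Ignore any local build suffix such as "+cu118" when comparing.
        problems.append(
            "flash-attn %s found, expected %s" % (flash, EXPECTED_FLASH_ATTN)
        )

    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("WARNING:", issue)
    if not issues:
        print("Environment matches the documented setup.")
```

Saving this as a standalone file and running it inside the `spiritsight-agent` environment prints one warning per mismatch. A mismatch is not necessarily fatal; it only indicates that the environment differs from the one described here.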

Citation

If you find this repo useful for your research, please cite our paper:

@misc{huang2025spiritsightagentadvancedgui,
      title={SpiritSight Agent: Advanced GUI Agent with One Look}, 
      author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
      year={2025},
      eprint={2503.03196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03196},
}