When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems
Abstract
Hybrid multi-agent systems combining large and small language models offer flexible inference trade-offs, but optimal architecture depends heavily on specific tasks and performance metrics.
The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud, which offer strong performance across a wide range of tasks at substantial cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) that combine on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled. In the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.
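To make the notion of a Pareto frontier over accuracy, cost, and energy concrete, here is a minimal sketch of how non-dominated operating points could be selected. The `OperatingPoint` type, the configuration names, and every number are hypothetical placeholders chosen for illustration; they are not results from the paper, and the paper may compute its frontier differently.

```python
from typing import List, NamedTuple


class OperatingPoint(NamedTuple):
    """One hypothetical system configuration and its measured trade-offs."""
    name: str
    accuracy: float   # task accuracy, higher is better
    cost_usd: float   # monetary (API) cost, lower is better
    energy_j: float   # edge energy consumption, lower is better


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is at least as good as b on every axis and strictly better on one."""
    no_worse = (a.accuracy >= b.accuracy
                and a.cost_usd <= b.cost_usd
                and a.energy_j <= b.energy_j)
    strictly_better = (a.accuracy > b.accuracy
                       or a.cost_usd < b.cost_usd
                       or a.energy_j < b.energy_j)
    return no_worse and strictly_better


def pareto_front(points: List[OperatingPoint]) -> List[OperatingPoint]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up numbers, for illustration only.
points = [
    OperatingPoint("edge-only", 0.40, 0.00, 50.0),
    OperatingPoint("cloud-only", 0.80, 1.00, 5.0),
    OperatingPoint("hybrid-a", 0.70, 0.30, 40.0),
    OperatingPoint("hybrid-b", 0.65, 0.35, 45.0),
]
front = [p.name for p in pareto_front(points)]
print(front)
```

With these placeholder values, `hybrid-b` drops out because `hybrid-a` is better on all three axes, while the other three configurations each win on at least one axis and therefore stay on the frontier.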
Community
If you use an edge-device-sized or self-hosted LM to power your agentic system, you will usually observe subpar performance; cloud-based frontier models, on the other hand, can deliver satisfactory performance but come with potentially high API costs.
In this paper, we explore how this dilemma can be worked around by putting a multi-agent spin on the idea of Hybrid AI. In our system, an Executor agent living on the device receives periodic assistance from a Supervisor agent living in the cloud. We explore the design space of such a system and make some non-trivial observations: edge-sized Executors can indeed benefit from assistance from the cloud, achieving better performance than an edge-only setup at lower API cost than a cloud-only setup; the best-performing multi-agent architecture depends on the nature of the task; and our Hybrid MAS is fundamentally different from a routing system.
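The Executor/Supervisor arrangement described above could look roughly like the following sketch. Everything here is an assumption made for illustration: the `Model` callable type, the `run_hybrid` function, the fixed `assist_every` schedule, the `DONE` stop convention, and the prompt format are all hypothetical, and the paper's actual architectures and assistance policy may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: a model is any function from a prompt to a reply.
Model = Callable[[str], str]


@dataclass
class HybridRun:
    transcript: List[str] = field(default_factory=list)
    cloud_calls: int = 0  # proxy for API cost


def run_hybrid(task: str, executor: Model, supervisor: Model,
               max_steps: int = 6, assist_every: int = 3) -> HybridRun:
    """The on-device Executor acts every step; the cloud Supervisor only
    weighs in every `assist_every` steps (an assumed, fixed schedule)."""
    run = HybridRun()
    hint = ""
    for step in range(1, max_steps + 1):
        prompt = f"Task: {task}\nHint: {hint}\nHistory: {run.transcript}"
        action = executor(prompt)          # cheap, local inference
        run.transcript.append(action)
        if action.startswith("DONE"):      # assumed stop convention
            break
        if step % assist_every == 0:
            # Periodic, paid call to the cloud model for guidance.
            hint = supervisor(f"Task: {task}\nHistory: {run.transcript}")
            run.cloud_calls += 1
    return run


# Stub models standing in for a real SLM and LLM.
def stub_executor(prompt: str) -> str:
    return "DONE: answer" if "Hint: try" in prompt else "thinking"


def stub_supervisor(prompt: str) -> str:
    return "try a different approach"


result = run_hybrid("demo", stub_executor, stub_supervisor)
print(result.cloud_calls, result.transcript)
```

In this sketch the Executor handles every step itself and the Supervisor only injects guidance into an ongoing trajectory, which is one way to read the contrast with a router that would instead send each whole request to a single model.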