SemBridge: Language Transfer in Sparse Encoders via Multilingual Semantic Bridges
Abstract
SemBridge enhances cross-lingual sparse encoder adaptation by using multilingual bridge models to establish semantic alignments and improve retrieval performance across multiple languages.
Sparse encoders offer high-precision retrieval by representing term importance within a vocabulary space, but their English-centric structures critically impede transfer to non-English languages. To overcome this structural limitation, we propose SemBridge, a novel embedding initialization method for cross-lingual adaptation of sparse encoders that leverages multilingual bridge models. SemBridge establishes semantic alignments between source and target vocabularies using multilingual dense embeddings as a bridge. Rather than relying on all source tokens, SemBridge selects a small set of semantically related source-language tokens and uses them to initialize each target-language token, filtering out semantic noise and reconstructing each target token as a precise linear combination of core synonyms. This initialization accelerates convergence during fine-tuning and improves training efficiency. Extensive experiments across five languages and four sparse architectures show that, compared to existing baselines, SemBridge achieves superior zero-shot retrieval performance and consistently improves retrieval performance after fine-tuning. These results validate SemBridge as a practical solution for deploying high-performance sparse retrieval systems in diverse linguistic environments.
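The initialization described above can be illustrated with a small NumPy sketch. It is not the authors' implementation. The abstract says only that each target token is built as a linear combination of a few related source tokens; the entmax sparsifier used here is taken from the community discussion on this page. Everything else is an assumption made for illustration: the function names `entmax_bisect` and `init_target_embeddings`, the bisection solver, the cosine similarity, and the `temperature` parameter. With `alpha=2.0` entmax coincides with sparsemax, and larger values of `alpha` give sparser weights.

```python
import numpy as np


def entmax_bisect(scores, alpha=2.0, n_iter=60):
    """alpha-entmax over the last axis, solved by bisection (requires alpha > 1).

    Hypothetical helper: finds the threshold tau such that
    sum(max((alpha - 1) * scores - tau, 0) ** (1 / (alpha - 1))) == 1.
    """
    x = (alpha - 1.0) * np.asarray(scores, dtype=np.float64)
    max_x = x.max(axis=-1, keepdims=True)
    lo = max_x - 1.0  # at this threshold the weights sum to at least 1
    hi = max_x        # at this threshold the weights sum to 0
    for _ in range(n_iter):
        tau = (lo + hi) / 2.0
        p = np.clip(x - tau, 0.0, None) ** (1.0 / (alpha - 1.0))
        too_big = p.sum(axis=-1, keepdims=True) >= 1.0
        lo = np.where(too_big, tau, lo)
        hi = np.where(too_big, hi, tau)
    p = np.clip(x - lo, 0.0, None) ** (1.0 / (alpha - 1.0))
    return p / p.sum(axis=-1, keepdims=True)  # remove residual bisection error


def init_target_embeddings(bridge_src, bridge_tgt, src_emb, alpha=2.0, temperature=0.1):
    """Initialize target-token embeddings from source-token embeddings.

    bridge_src: (n_src, d_bridge) bridge-model embeddings of source tokens
    bridge_tgt: (n_tgt, d_bridge) bridge-model embeddings of target tokens
    src_emb:    (n_src, d_model)  source embeddings of the sparse encoder
    temperature is an assumed knob that sharpens the similarities.
    """
    s = bridge_src / np.linalg.norm(bridge_src, axis=1, keepdims=True)
    t = bridge_tgt / np.linalg.norm(bridge_tgt, axis=1, keepdims=True)
    sim = t @ s.T  # cosine similarity, shape (n_tgt, n_src)
    weights = entmax_bisect(sim / temperature, alpha=alpha)  # sparse rows summing to 1
    return weights @ src_emb, weights


# Toy usage: 3 source tokens, 1 target token, random encoder embeddings.
rng = np.random.default_rng(0)
bridge_src = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
bridge_tgt = np.array([[1.0, 0.1]])
src_emb = rng.normal(size=(3, 4))
tgt_emb, weights = init_target_embeddings(bridge_src, bridge_tgt, src_emb)
```

Because each row of `weights` contains exact zeros, unrelated source tokens contribute nothing to the initialized target embedding, which is the noise-filtering behavior the abstract describes.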
Community
the entmax-based sparsification of bridge similarities to form a sparse weight vector and then reconstruct target embeddings from source ones is a clean, clever way to inject cross-language signal without blowing up the sparse encoder. my main question is how sensitive this is to alpha beyond the two checkpoints they show (2 vs 4) and whether an adaptive per-token alpha could curb semantic drift for rare or domain-specific terms. the arxivlens breakdown helped me parse the method details, it’s a nice quick read that clarifies the bridge step and the reconstruction flavor. if they can show stable transfer under skewed vocab distributions, SemBridge could be a really practical plug-in for multilingual sparse retrieval.