CodeCompass-Embed

CodeCompass-Embed is a code embedding model fine-tuned from Qwen2.5-Coder-0.5B for semantic code search and retrieval tasks.

Model Highlights

  • 🏆 #1 on CodeTrans-DL (code translation between frameworks)
  • 🏅 #4 on CodeSearchNet-Python (natural language to code search)
  • ⚡ 494M parameters, 896-dim embeddings
  • 🔄 Bidirectional attention (converted from causal LLM)
  • 🎯 Mean pooling with L2 normalization
  • 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

Benchmark Results (CoIR)

Evaluated on the CoIR benchmark (NDCG@10; higher is better). SO-QA = StackOverflow-QA, CF-ST = CodeFeedback-ST. Rows are sorted by CSN-Python (CodeSearchNet-Python).

| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|---|---|---|---|---|---|---|---|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| CodeCompass-Embed | 494M | 0.9228 | 0.3305 | 0.5673 | 0.6480 | 0.4080 | 0.1277 |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |

Among the models evaluated here, CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python, while trailing the strongest retrievers on the Q&A-style tasks (SO-QA, CF-ST).
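For context, NDCG@10 scores a ranking by how early the relevant items appear in the top 10 results, normalized by the best possible ordering. The snippet below is a minimal illustration of the metric on a single query, not the CoIR evaluation harness.

import math

def ndcg_at_10(ranked_relevances, ideal_relevances):
    # DCG of the produced ranking divided by DCG of the ideal ranking.
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(ideal_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A single relevant document retrieved at rank 3 scores 1/log2(4) = 0.5.
print(ndcg_at_10([0, 0, 1, 0], [1]))  # 0.5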

Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# Enable bidirectional attention
for layer in model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:
Query: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
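        # Mean-pool the last hidden states over non-padding tokens, then L2-normalize (matches the Model Details above).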
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

query_emb = encode(["sort a list"], is_query=True)
code_embs = encode(["def sort(lst): return sorted(lst)"])
similarity = (query_emb @ code_embs.T).item()
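For retrieval over multiple candidates, rank the corpus by dot product (the embeddings are already L2-normalized, so this equals cosine similarity). A small sketch reusing the encode helper above; the corpus snippets are made-up examples:

corpus = [
    "def sort(lst): return sorted(lst)",
    "def load_json(path):\n    import json\n    return json.load(open(path))",
    "SELECT name FROM users WHERE active = 1",
]
corpus_embs = encode(corpus)                       # (3, 896), unit-norm rows
query_embs = encode(["parse a json file"], is_query=True)

scores = query_embs @ corpus_embs.T                # cosine similarities, shape (1, 3)
top = torch.topk(scores, k=2, dim=-1)
for score, idx in zip(top.values[0].tolist(), top.indices[0].tolist()):
    print(f"{score:.3f}  {corpus[idx]!r}")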

Instruction Templates

| Task | Template |
|---|---|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |

The instruction line and the query are separated by a newline (`\n`). Documents do not need instruction prefixes.
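One way to wire these templates into the encode helper above is a small lookup. This is an illustrative sketch; the task keys and the build_query helper are not part of the released code.

# Hypothetical helper mapping task names to query templates (not part of the model repo).
TEMPLATES = {
    "nl2code":   "Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}",
    "qa":        "Instruct: Find the most relevant answer given the following question:\nQuery: {q}",
    "text2sql":  "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}",
}

def build_query(task, text):
    return TEMPLATES[task].format(q=text)

# Pre-format the query, then embed with is_query=False so the default NL-to-code
# prefix in encode() is not applied a second time.
sql_query = build_query("text2sql", "How many active users signed up last month?")
sql_emb = encode([sql_query])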

Training

  • Data: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
  • Loss: InfoNCE (τ=0.05) with 7 hard negatives per sample (sketched after this list)
  • Batch Size: 1024 (via GradCache)
  • Steps: 950
  • Hardware: NVIDIA H100
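
A minimal sketch of this objective, assuming the standard formulation in which each query is contrasted against its positive, its 7 hard negatives, and the other positives in the batch (whether in-batch negatives were used is an assumption). The actual training code (GradCache accumulation, data pipeline) is not part of this repository.

import torch
import torch.nn.functional as F

def infonce_loss(q, pos, negs, tau=0.05):
    # q: (B, D) query embeddings, pos: (B, D) positives, negs: (B, K, D) hard negatives.
    # All embeddings are assumed to be L2-normalized.
    B = q.size(0)
    pos_sim = q @ pos.T                                  # (B, B): diagonal = true pairs, rest = in-batch negatives
    neg_sim = torch.einsum("bd,bkd->bk", q, negs)        # (B, K): each query's own hard negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau  # temperature-scaled similarities
    labels = torch.arange(B, device=q.device)            # the matching positive sits in column i
    return F.cross_entropy(logits, labels)

# Toy example: B=4 queries, K=7 hard negatives, 896-dim embeddings.
q    = F.normalize(torch.randn(4, 896), dim=-1)
pos  = F.normalize(torch.randn(4, 896), dim=-1)
negs = F.normalize(torch.randn(4, 7, 896), dim=-1)
print(infonce_loss(q, pos, negs))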

Limitations

  • Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
  • Trained on Python/JavaScript/Java/Go/PHP/Ruby

Citation

@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}

License

Apache 2.0
