CodeCompass-Embed

CodeCompass-Embed is a code embedding model fine-tuned from Qwen2.5-Coder-0.5B for semantic code search and retrieval tasks.

Model Highlights

  • 🏆 #1 on CodeTrans-DL (code translation between frameworks)
  • 🏅 #4 on CodeSearchNet-Python (natural language to code search)
  • ⚡ 494M parameters, 896-dim embeddings
  • 🔄 Bidirectional attention (converted from causal LLM)
  • 🎯 Mean pooling with L2 normalization
  • 📏 Trained at 512 tokens, extrapolates to longer sequences via RoPE

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen2.5-Coder-0.5B |
| Parameters | 494M |
| Embedding Dimension | 896 |
| Max Sequence Length | 512 (training) / 32K (inference) |
| Pooling | Mean |
| Normalization | L2 |
| Attention | Bidirectional (all 24 layers) |

Benchmark Results (CoIR)

Evaluated on the CoIR benchmark (NDCG@10; higher is better). SO-QA = StackOverflow-QA, CF-ST = CodeFeedback-ST. Rows are sorted by CSN-Python (CodeSearchNet-Python).

| Model | Params | CSN-Python | CodeTrans-DL | Text2SQL | SO-QA | CF-ST | Apps |
|---|---|---|---|---|---|---|---|
| SFR-Embedding-Code | 400M | 0.9505 | 0.2683 | 0.9949 | 0.9107 | 0.7258 | 0.2212 |
| Jina-Code-v2 | 161M | 0.9439 | 0.2739 | 0.5169 | 0.8874 | 0.6975 | 0.1538 |
| CodeRankEmbed | 137M | 0.9378 | 0.2604 | 0.7686 | 0.8990 | 0.7166 | 0.1993 |
| CodeCompass-Embed | 494M | 0.9228 | 0.3305 | 0.5673 | 0.6480 | 0.4080 | 0.1277 |
| Snowflake-Arctic-Embed-L | 568M | 0.9146 | 0.1958 | 0.5401 | 0.8718 | 0.6503 | 0.1435 |
| BGE-M3 | 568M | 0.8976 | 0.2194 | 0.5728 | 0.8501 | 0.6437 | 0.1445 |
| BGE-Base-en-v1.5 | 109M | 0.8944 | 0.2125 | 0.5265 | 0.8581 | 0.6423 | 0.1415 |
| CodeT5+-110M | 110M | 0.8702 | 0.1794 | 0.3275 | 0.8147 | 0.5804 | 0.1179 |

Among the models evaluated here, CodeCompass-Embed ranks #1 on CodeTrans-DL and #4 on CSN-Python, while trailing the strongest retrievers on the Q&A-style tasks (SO-QA, CF-ST).
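For context, NDCG@10 scores a ranking by how early the relevant items appear in the top 10 results, normalized by the best possible ordering. The snippet below is a minimal illustration of the metric on a single query, not the CoIR evaluation harness.

import math

def ndcg_at_10(ranked_relevances, ideal_relevances):
    # DCG of the produced ranking divided by DCG of the ideal ranking.
    def dcg(rels):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(ideal_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

# A single relevant document retrieved at rank 3 scores 1/log2(4) = 0.5.
print(ndcg_at_10([0, 0, 1, 0], [1]))  # 0.5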

Usage

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("faisalmumtaz/codecompass-embed", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("faisalmumtaz/codecompass-embed")

# Enable bidirectional attention
for layer in model.layers:
    layer.self_attn.is_causal = False

model.eval()

def encode(texts, is_query=False):
    if is_query:
        texts = [f"Instruct: Find the most relevant code snippet given the following query:
Query: {t}" for t in texts]
    
    inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
    
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
        hidden = outputs.hidden_states[-1]
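        # Mean-pool the last hidden states over non-padding tokens, then L2-normalize (matches the Model Details above).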
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        embeddings = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        embeddings = F.normalize(embeddings, p=2, dim=-1)
    
    return embeddings

query_emb = encode(["sort a list"], is_query=True)
code_embs = encode(["def sort(lst): return sorted(lst)"])
similarity = (query_emb @ code_embs.T).item()
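For retrieval over multiple candidates, rank the corpus by dot product (the embeddings are already L2-normalized, so this equals cosine similarity). A small sketch reusing the encode helper above; the corpus snippets are made-up examples:

corpus = [
    "def sort(lst): return sorted(lst)",
    "def load_json(path):\n    import json\n    return json.load(open(path))",
    "SELECT name FROM users WHERE active = 1",
]
corpus_embs = encode(corpus)                       # (3, 896), unit-norm rows
query_embs = encode(["parse a json file"], is_query=True)

scores = query_embs @ corpus_embs.T                # cosine similarities, shape (1, 3)
top = torch.topk(scores, k=2, dim=-1)
for score, idx in zip(top.values[0].tolist(), top.indices[0].tolist()):
    print(f"{score:.3f}  {corpus[idx]!r}")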

Instruction Templates

| Task | Template |
|---|---|
| NL to Code | `Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}` |
| Code to Code | `Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}` |
| Tech Q&A | `Instruct: Find the most relevant answer given the following question:\nQuery: {q}` |
| Text to SQL | `Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}` |

The instruction line and the query are separated by a newline (`\n`). Documents do not need instruction prefixes.
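One way to wire these templates into the encode helper above is a small lookup. This is an illustrative sketch; the task keys and the build_query helper are not part of the released code.

# Hypothetical helper mapping task names to query templates (not part of the model repo).
TEMPLATES = {
    "nl2code":   "Instruct: Find the most relevant code snippet given the following query:\nQuery: {q}",
    "code2code": "Instruct: Find an equivalent code snippet given the following code snippet:\nQuery: {q}",
    "qa":        "Instruct: Find the most relevant answer given the following question:\nQuery: {q}",
    "text2sql":  "Instruct: Given a natural language question and schema, find the corresponding SQL query:\nQuery: {q}",
}

def build_query(task, text):
    return TEMPLATES[task].format(q=text)

# Pre-format the query, then embed with is_query=False so the default NL-to-code
# prefix in encode() is not applied a second time.
sql_query = build_query("text2sql", "How many active users signed up last month?")
sql_emb = encode([sql_query])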

Training

  • Data: 8.8M samples from CoRNStack, StackOverflow, CodeSearchNet
  • Loss: InfoNCE (τ=0.05) with 7 hard negatives per sample (sketched after this list)
  • Batch Size: 1024 (via GradCache)
  • Steps: 950
  • Hardware: NVIDIA H100
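
A minimal sketch of this objective, assuming the standard formulation in which each query is contrasted against its positive, its 7 hard negatives, and the other positives in the batch (whether in-batch negatives were used is an assumption). The actual training code (GradCache accumulation, data pipeline) is not part of this repository.

import torch
import torch.nn.functional as F

def infonce_loss(q, pos, negs, tau=0.05):
    # q: (B, D) query embeddings, pos: (B, D) positives, negs: (B, K, D) hard negatives.
    # All embeddings are assumed to be L2-normalized.
    B = q.size(0)
    pos_sim = q @ pos.T                                  # (B, B): diagonal = true pairs, rest = in-batch negatives
    neg_sim = torch.einsum("bd,bkd->bk", q, negs)        # (B, K): each query's own hard negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau  # temperature-scaled similarities
    labels = torch.arange(B, device=q.device)            # the matching positive sits in column i
    return F.cross_entropy(logits, labels)

# Toy example: B=4 queries, K=7 hard negatives, 896-dim embeddings.
q    = F.normalize(torch.randn(4, 896), dim=-1)
pos  = F.normalize(torch.randn(4, 896), dim=-1)
negs = F.normalize(torch.randn(4, 7, 896), dim=-1)
print(infonce_loss(q, pos, negs))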

Limitations

  • Weaker on Q&A style tasks (StackOverflow-QA, CodeFeedback)
  • Trained on Python/JavaScript/Java/Go/PHP/Ruby

Citation

@misc{codecompass2026,
  author = {Faisal Mumtaz},
  title = {CodeCompass-Embed: A Code Embedding Model for Semantic Code Search},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/faisalmumtaz/codecompass-embed}
}

License

Apache 2.0
