# CWE Classifier (RoBERTa-base)
A fine-tuned RoBERTa-base model that maps CVE (Common Vulnerabilities and Exposures) descriptions to CWE (Common Weakness Enumeration) categories. 125M parameters, 205 CWE classes.
## Performance

### Internal Test Set (27,780 agreement-filtered samples)
| Metric | Score |
|---|---|
| Top-1 Accuracy | 87.4% |
| Top-3 Accuracy | 94.7% |
| Macro F1 | 0.607 |
| Weighted F1 | 0.872 |
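The gap between Macro F1 (0.607) and Weighted F1 (0.872) is a class-imbalance effect: macro averaging gives every CWE class equal weight, so a miss on a rare class costs as much as a miss on a common one. A pure-Python toy illustration (hypothetical labels, not model outputs):

```python
from collections import Counter

def f1_per_class(y_true, y_pred, cls):
    """Per-class F1 from true/false positives and false negatives."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

y_true = ["a"] * 9 + ["b"]   # class "b" is rare
y_pred = ["a"] * 10          # the rare class is always missed

classes = sorted(set(y_true))
f1s = {c: f1_per_class(y_true, y_pred, c) for c in classes}
support = Counter(y_true)

macro = sum(f1s.values()) / len(classes)                          # each class weighted equally
weighted = sum(f1s[c] * support[c] for c in classes) / len(y_true)  # weighted by support
print(f"macro={macro:.3f} weighted={weighted:.3f}")  # macro=0.474 weighted=0.853
```

Missing the one rare class halves the macro score while barely denting the weighted score, which is exactly the regime a 205-class CWE label space with a long tail sits in.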
### CTI-Bench External Benchmark (NeurIPS 2024, zero training overlap)
| Benchmark | Strict Top-1 | Hierarchy-aware Top-1 |
|---|---|---|
| cti-rcm (2023-2024 CVEs) | 75.6% | 86.5% |
| cti-rcm-2021 (2011-2021 CVEs) | 71.8% | 82.8% |
### Comparison on CTI-Bench cti-rcm (strict exact match)
All scores below use the official CTI-Bench evaluation protocol: strict exact CWE ID match.
| Model | Params | Type | Top-1 Accuracy | Source |
|---|---|---|---|---|
| Sec-Gemini v1 (Google)* | — | closed | ~86% | Google Security Blog |
| SecLM (Google)* | — | closed | ~85% | Google Cloud Blog |
| This model | 125M | open | 75.6% | — |
| Foundation-Sec-8B-Reasoning (Cisco) | 8B | open | 75.3% | arXiv 2601.21051 |
| GPT-4 | ~1.7T | closed | 72.0% | CTI-Bench paper |
| Foundation-Sec-8B (Cisco) | 8B | open | 72.0% (±1.7%) | arXiv 2504.21039 |
| WhiteRabbitNeo-V2-70B | 70B | open | 71.1% | arXiv 2504.21039 |
| Foundation-Sec-8B-Instruct (Cisco) | 8B | open | 70.4% | arXiv 2601.21051 |
| Llama-Primus (Trend Micro) | 8B | open | 67.8% | HuggingFace |
| GPT-3.5 | ~175B | closed | 67.2% | CTI-Bench paper |
| Gemini 1.5 | — | closed | 66.6% | CTI-Bench paper |
| LLaMA3-70B | 70B | open | 65.9% | CTI-Bench paper |
| LLaMA3-8B | 8B | open | 44.7% | CTI-Bench paper |
*Sec-Gemini and SecLM scores are approximate, estimated from published comparison charts. Exact values were not reported.
This model is competitive with the best open-weight models at 64x fewer parameters (125M vs 8B). Note: the 0.3pp difference vs Cisco Foundation-Sec-8B-Reasoning is not statistically significant (95% CIs overlap at n=1000). The Cisco models are general-purpose LLMs; this is a task-specific encoder.
### TF-IDF baseline comparison
A TF-IDF + Logistic Regression baseline reaches 84.9% top-1 on the same test set, but only 45.2% Macro F1 vs our 60.7% — a +15.5pp Macro F1 gap showing the model's advantage on rare CWE classes that keyword matching cannot handle.
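A minimal sketch of such a baseline with scikit-learn, on hypothetical toy data (the real baseline is fit on the full training set; vectorizer settings here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy CVE-style descriptions and CWE labels (hypothetical).
texts = [
    "SQL injection in login form via username parameter",
    "SQL injection in search endpoint allows arbitrary queries",
    "Stack buffer overflow in PNG parser via crafted image",
    "Heap buffer overflow when parsing oversized TIFF header",
]
labels = ["CWE-89", "CWE-89", "CWE-121", "CWE-122"]

# TF-IDF features feeding a linear classifier: strong on frequent,
# keyword-heavy classes, weak on rare classes with varied phrasing.
baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
baseline.fit(texts, labels)
print(baseline.predict(["SQL injection via the id parameter"]))
```

Keyword overlap carries the common classes (hence the strong 84.9% top-1), but the low Macro F1 shows this breaks down on rare CWEs whose descriptions do not repeat a fixed vocabulary.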
### Hierarchy-aware evaluation (supplementary)
This model predicts specific child CWEs (e.g., CWE-121 Stack Buffer Overflow) while CTI-Bench ground truth often uses generic parent CWEs (e.g., CWE-119 Buffer Overflow). When parent↔child equivalences are counted as correct:
| Benchmark | Strict Top-1 | Hierarchy-aware Top-1 |
|---|---|---|
| cti-rcm (2023-2024 CVEs) | 75.6% | 86.5% (+10.9pp) |
| cti-rcm-2021 (2011-2021 CVEs) | 71.8% | 82.8% (+11.0pp) |
Note: Other models in the table above were evaluated with strict matching only. Hierarchy-aware scores are not directly comparable and are shown separately for transparency.
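The hierarchy-aware scoring can be sketched as follows. The parent map here is a tiny hypothetical excerpt; the actual evaluation uses the full CWE parent/child graph:

```python
# Hypothetical excerpt of the CWE child -> parent relation.
CWE_PARENT = {
    "CWE-121": "CWE-119",  # Stack-based Buffer Overflow -> memory buffer parent
    "CWE-122": "CWE-119",  # Heap-based Buffer Overflow  -> memory buffer parent
    "CWE-564": "CWE-89",   # Hibernate SQL Injection     -> SQL Injection
}

def hierarchy_match(pred: str, gold: str) -> bool:
    """Correct on exact match or on a parent<->child equivalence in either direction."""
    if pred == gold:
        return True
    return CWE_PARENT.get(pred) == gold or CWE_PARENT.get(gold) == pred

print(hierarchy_match("CWE-121", "CWE-119"))  # child predicted, parent in ground truth
print(hierarchy_match("CWE-121", "CWE-89"))   # unrelated classes stay wrong
```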
## Usage

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="xamxte/cwe-classifier-roberta-base",
    top_k=3,
)
result = classifier(
    "A SQL injection vulnerability in the login page allows remote attackers "
    "to execute arbitrary SQL commands via the username parameter."
)
print(result)
# [[{'label': 'CWE-89', 'score': 0.95}, {'label': 'CWE-564', 'score': 0.02}, ...]]
```
### Manual inference

```python
import json

import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "xamxte/cwe-classifier-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Load the label map shipped with the model repo and invert it (class index -> CWE ID)
label_map_path = hf_hub_download(repo_id=model_name, filename="cwe_label_map.json")
with open(label_map_path) as f:
    label_map = json.load(f)
id_to_label = {v: k for k, v in label_map.items()}

# Predict: softmax the logits so the printed scores are probabilities
text = "CVE Description: A buffer overflow in the PNG parser allows remote code execution via crafted image files."
inputs = tokenizer(text, return_tensors="pt", max_length=384, truncation=True, padding=True)
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
top3 = torch.topk(probs, 3)
for score, idx in zip(top3.values[0], top3.indices[0]):
    print(f"{id_to_label[idx.item()]}: {score.item():.3f}")
```
## Training

- Base model: FacebookAI/roberta-base (125M params)
- Dataset: xamxte/cve-to-cwe — 234,770 training samples with Claude Sonnet 4.6 refined labels
- Training method: two-phase fine-tuning
  - Phase 1: freeze the first 8 of 12 transformer layers, train the classifier head (4 epochs, lr=1e-4)
  - Phase 2: unfreeze all layers, full fine-tuning (9 epochs, lr=2e-5)
- Key hyperparameters: max_length=384, batch_size=32, label_smoothing=0.1, cosine scheduler, bf16
- Hardware: NVIDIA RTX 5080 (16GB), ~4 hours total
- Framework: HuggingFace Transformers + PyTorch
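The Phase 1 freezing step can be sketched as below. To stay self-contained this builds a randomly initialized model from a config rather than downloading the checkpoint; the real run starts from the pretrained FacebookAI/roberta-base weights:

```python
from transformers import RobertaConfig, RobertaForSequenceClassification

# roberta-base geometry (12 layers, 768 hidden) with a 205-class head.
config = RobertaConfig(num_labels=205)
model = RobertaForSequenceClassification(config)

# Phase 1: freeze the first 8 of 12 encoder layers; the top layers
# and the classifier head keep requires_grad=True and are trained.
for layer in model.roberta.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")
```

Phase 2 then flips `requires_grad` back on everywhere and continues at the lower learning rate.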
## Label Quality
Training labels were refined using Claude Sonnet 4.6 via the Anthropic Batch API (~$395 total cost). The test/validation sets contain only agreement-filtered samples where NVD and Sonnet labels agree (73.1% exact match; 84.5% with hierarchy-aware matching). This biases evaluation toward unambiguous cases — real-world accuracy on arbitrary NVD entries will be lower. See the dataset card for details.
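A minimal sketch of the agreement filter described above (field names are hypothetical; the 84.5% figure additionally counts parent/child equivalences as agreement):

```python
# Hypothetical sample records with both label sources attached.
samples = [
    {"cve": "CVE-2024-0001", "nvd_cwe": "CWE-89",  "sonnet_cwe": "CWE-89"},
    {"cve": "CVE-2024-0002", "nvd_cwe": "CWE-119", "sonnet_cwe": "CWE-121"},
    {"cve": "CVE-2024-0003", "nvd_cwe": "CWE-79",  "sonnet_cwe": "CWE-79"},
]

# Keep only samples where the NVD label and the Sonnet-refined label agree exactly;
# disagreeing samples are excluded from the test/validation sets.
agreement_filtered = [s for s in samples if s["nvd_cwe"] == s["sonnet_cwe"]]
print([s["cve"] for s in agreement_filtered])  # ['CVE-2024-0001', 'CVE-2024-0003']
```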
## CWE Hierarchy
This model predicts specific (child) CWE categories where possible. For example, buffer overflows are classified as CWE-121 (Stack) or CWE-122 (Heap) rather than the generic CWE-119. This provides more actionable information for vulnerability triage, but means strict accuracy on benchmarks using parent CWEs appears lower than actual performance.
## Limitations
- 205 CWE classes only: Covers the most common CWEs in NVD. Rare CWEs not in the training set will be mapped to the closest known class.
- English only: Trained on English CVE descriptions from NVD.
- Description-based: Uses only the text description, not CVSS scores, CPE, or other metadata.
- Single-label: Predicts one CWE per CVE, though some vulnerabilities may involve multiple weakness types.
## Citation

```bibtex
@misc{cve_to_cwe_classifier_2025,
  title={CWE Classifier (RoBERTa-base)},
  year={2025},
  url={https://huggingface.co/xamxte/cwe-classifier-roberta-base}
}
```