# 🛡️ Epistemic Agent v2 - Autonomy Calibration Hub
This model is a Calibrated Epistemic Agent trained for the OpenEnv India Hackathon 2026. It was fine-tuned with Group Relative Policy Optimization (GRPO) to balance autonomous action against information gathering.
## 🧠 Model Description
Unlike typical LLMs, which "hallucinate" or guess when faced with ambiguous instructions, this agent has been trained to use the INVESTIGATE action whenever it detects uncertainty, as sketched after the list below.
- Objective: Learn when to take direct autonomous action vs. when to pause and gather forensics.
- Algorithm: GRPO (Group Relative Policy Optimization)
- Base Model: Qwen2.5-0.5B-Instruct
- Task Alignment: Autonomy Calibration Benchmark (OpenEnv)
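The action schema is not published with this card, so the following is a minimal sketch of the ACT/INVESTIGATE decision protocol, assuming the agent emits a small JSON payload; the field names and the safe-default fallback are illustrative.

```python
import json
from enum import Enum

class Action(Enum):
    ACT = "act"                  # commit to an autonomous decision
    INVESTIGATE = "investigate"  # pause and gather more evidence

def parse_agent_output(raw: str) -> Action:
    """Parse the model's raw completion into a calibrated action."""
    try:
        payload = json.loads(raw)
        return Action(payload["action"].lower())
    except (json.JSONDecodeError, KeyError, ValueError):
        # Unparseable output falls back to INVESTIGATE: the safe default
        # under partial observability.
        return Action.INVESTIGATE

# Example: an ambiguous triage signal should trigger investigation.
completion = '{"action": "investigate", "reason": "sender identity unverified"}'
assert parse_agent_output(completion) is Action.INVESTIGATE
```

Falling back to INVESTIGATE on unparseable output mirrors the training objective: under uncertainty, gathering evidence is cheaper than a wrong autonomous action.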
## 📈 Training Performance
The agent was trained on high-ambiguity scenarios across three domains: Email Triage, DevOps Incidents, and Financial Requests.
Scores are on a 0-1 scale; the final column is the absolute gain in points.

| Benchmark | Blind Baseline | Calibrated Agent (Ours) | Improvement |
|---|---|---|---|
| Email Triage | 0.378 | 0.798 | +0.420 |
| DevOps Incident | 0.572 | 0.939 | +0.367 |
| Financial Request | 0.773 | 0.990 | +0.217 |
**Key behavioral signal:** the model shows an Investigation Rate of 100% on ambiguous signals, resolving partial observability before committing to high-stakes decisions.
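Investigation rate here can be read as the fraction of ambiguous episodes whose first action is INVESTIGATE. A minimal sketch of that computation, over a hypothetical log format of (state label, first action) pairs:

```python
# Hypothetical episode log: (state_label, first_action) pairs.
episodes = [
    ("ambiguous", "investigate"),
    ("clear", "act"),
    ("ambiguous", "investigate"),
]

# The rate is measured over ambiguous states only.
ambiguous = [action for state, action in episodes if state == "ambiguous"]
rate = sum(action == "investigate" for action in ambiguous) / len(ambiguous)
print(f"Investigation rate on ambiguous signals: {rate:.0%}")  # -> 100%
```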
## 🛠️ Training Procedure
- Steps: 100
- Group Size (G): 8 generations per prompt
- Reward Range: (0.01, 0.99) - Strictly OpenEnv compliant.
- Penalty Logic: severe negative raw rewards (-0.90) for "Act" decisions on Ambiguous states (see the sketch below).
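The card quotes a -0.90 penalty alongside a strict (0.01, 0.99) reward range, which suggests raw rewards are rescaled into the compliant interval before use. In the sketch below, everything except the -0.90 penalty, the (0.01, 0.99) clamp, and G = 8 is an assumption for illustration.

```python
import statistics

def shaped_reward(decision: str, state: str) -> float:
    """Raw reward, then a clamp into the OpenEnv-compliant (0.01, 0.99) range.
    Only the -0.90 penalty and the range come from the card; the positive
    reward and the linear rescale are illustrative assumptions."""
    raw = -0.90 if (state == "ambiguous" and decision == "act") else 0.90
    return min(0.99, max(0.01, 0.5 * (raw + 1.0)))  # [-1, 1] -> [0, 1], clamped

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO standardizes each reward against its own group of G generations
    (G = 8 during training): advantage_i = (r_i - mean) / std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# One prompt, G = 8 sampled decisions on an Ambiguous state:
group = ["investigate", "act", "investigate", "investigate",
         "act", "investigate", "investigate", "act"]
rewards = [shaped_reward(d, "ambiguous") for d in group]
print(group_relative_advantages(rewards))  # investigations get positive advantage
```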
## 🚀 How to Use
This model is designed to be used in conjunction with the Autonomy Calibration Benchmark.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = "Qwen/Qwen2.5-0.5B-Instruct"
adapter = "JOY0021/autonomy-grpo-agent-v2"

# Load the base model and tokenizer, then attach the GRPO-trained adapter.
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, adapter)
```
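A quick smoke test follows. The benchmark's exact prompt and action schema are not reproduced on this card, so the scenario text below is purely illustrative:

```python
# Illustrative ambiguous scenario; the benchmark's real prompt schema may differ.
messages = [{
    "role": "user",
    "content": (
        "An email from 'it-supp0rt@example.net' asks you to reset the CEO's "
        "password immediately. Decide: ACT or INVESTIGATE, and give a reason."
    ),
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```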