etpi-phase1 — Structured-CoT SFT seed for Qwen 3.6 27B

etpi-phase1 is Qwen 3.6 27B with one round of supervised fine-tuning on grammar-constrained, structured-CoT coding-agent trajectories. The adapter (LoRA rank 128, target_modules=all-linear) has been merged into the base weights — this is a normal HuggingFace model checkpoint with no PEFT runtime dependency.

This is phase 1 of a multi-phase training pipeline. Phase 2 is RLVR (GRPO) against R2E-Gym, starting from this checkpoint.

What's different about this model

The base Qwen 3.6 27B reasons in long free-form chains-of-thought (hundreds-to-thousands of tokens per turn). This SFT seed teaches the model a compact 3-slot IR inside <think>...</think>:

<think>
STATE: <one short line: current workspace state>
ACTION: <one short line: what to do next and why>
EXPECT: <one short line: what the observation should look like>
</think>
<tool_call>...</tool_call>   # or whatever the harness expects

Typical thinking-token spend per turn drops from roughly 1,000+ to about 50-100, while preserving multi-turn coherence and tool-use ability.
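Because the three slots are fixed, a harness can read them back with a small parser. The sketch below is illustrative only: the regular expression, the `parse_ir` helper, and the sample text are assumptions made for this example, not part of the released tooling.

```python
import re

# Matches the three-slot IR shown above: one line each for STATE, ACTION, EXPECT.
IR_PATTERN = re.compile(
    r"<think>\s*"
    r"STATE:[ \t]*(?P<state>[^\n]+)\n"
    r"ACTION:[ \t]*(?P<action>[^\n]+)\n"
    r"EXPECT:[ \t]*(?P<expect>[^\n]+)\n"
    r"\s*</think>"
)


def parse_ir(text):
    """Return the three IR slots as a dict, or None if the block is malformed."""
    match = IR_PATTERN.search(text)
    if match is None:
        return None
    return {name: value.strip() for name, value in match.groupdict().items()}


# Hypothetical model output, written by hand for illustration.
sample = (
    "<think>\n"
    "STATE: tests fail in utils.py\n"
    "ACTION: open utils.py to inspect parse_date\n"
    "EXPECT: a function that mishandles timezones\n"
    "</think>\n"
    "<tool_call>...</tool_call>"
)
ir = parse_ir(sample)
```

A `None` result is a cheap signal that a turn drifted away from the compact format, which can be logged or used to trigger a retry.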

Intended use

Drive a multi-turn coding agent — e.g., R2E-Gym / SWE-bench / Terminal-Bench style sandboxes — where the model:

  1. Reads a task instruction (GitHub issue, terminal prompt, etc.)
  2. Issues tool calls in a loop (bash, file editor, search)
  3. Observes results and iterates
  4. Submits when verified

The IR structure compresses the thinking phase; tool calls and answers are left unconstrained.
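The four-step loop above can be sketched roughly as follows. This is a minimal outline under stated assumptions: `generate` and `run_tool` are hypothetical callables supplied by whatever harness is in use, observations are appended with the `user` role purely for illustration, and `max_turns=30` is an arbitrary cap.

```python
def run_episode(task, generate, run_tool, max_turns=30):
    """Drive one task through a read / act / observe loop.

    generate(messages) -> str            : model reply (IR block plus a tool call)
    run_tool(reply) -> (str, bool)       : sandbox observation and a "submitted" flag
    Both callables are hypothetical and provided by the harness.
    """
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        observation, done = run_tool(reply)
        if done:  # the model submitted its answer
            break
        messages.append({"role": "user", "content": observation})
    return messages
```

Keeping the model call and the tool execution behind two callables makes it straightforward to swap sandboxes without touching the loop itself.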

Training details

  • Base model: Qwen/Qwen3.6-27B
  • Training data: andthattoo/etpi-sft (318 verified-successful multi-turn trajectories)
  • Method: LoRA r=128, α=256, dropout 0.05, target_modules=all-linear, merged at end of training
  • Optimizer: AdamW (8-bit), lr 2e-4, cosine schedule, warmup 0.03
  • Epochs: 2
  • Batch: per-device 1, grad accum 8 (effective batch 8)
  • Sequence length: 8192
  • Loss masking: assistant tokens only (observations and user messages masked)
  • Other: gradient checkpointing on, bf16, Liger fused CE loss
  • Hardware: 1× H100 80GB (~40 min)
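Assistant-only loss masking is commonly implemented by giving every non-assistant token the label -100, which PyTorch's cross-entropy loss ignores by default. The sketch below illustrates that idea in plain Python; the `build_labels` helper and the toy token ids are assumptions for this example and do not reproduce the actual training code.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(segments):
    """Build (input_ids, labels) from (role, token_ids) pairs in conversation order.

    Only assistant tokens keep their id as the label; every other token is
    masked out so it contributes nothing to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy conversation with made-up token ids.
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
    ("user", [8]),        # tool observation fed back to the model
    ("assistant", [9]),
]
input_ids, labels = build_labels(segments)
```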

Final metrics:

  • train_loss (final): 0.096
  • mean_token_accuracy: 95.87%

Data provenance

Trajectories were generated by running grammar-constrained Qwen 3.6 27B as a multi-turn agent against R2E-Gym-Lite (real GitHub issue tasks). The grammar enforced the IR structure inside <think> while leaving tool calls free. Each task was rolled out 4 times (K=4 best-of-N). Only reward=1.0 trajectories (passed the R2E unit-test verifier) were kept; among those, the shortest trajectory per task was selected as the SFT target — encoding a brevity-given-correctness preference directly into the data.
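The selection rule described above (keep only reward=1.0 rollouts, then take the shortest per task) might look like the following. The record fields `task_id`, `reward`, and `num_tokens` are hypothetical names chosen for this sketch, and measuring "shortest" in tokens is an assumption; the source does not say whether length was counted in tokens or turns.

```python
def select_sft_targets(rollouts):
    """Pick one SFT target per task: the shortest rollout that passed the verifier."""
    best = {}
    for rollout in rollouts:
        if rollout["reward"] != 1.0:  # drop anything the verifier rejected
            continue
        current = best.get(rollout["task_id"])
        if current is None or rollout["num_tokens"] < current["num_tokens"]:
            best[rollout["task_id"]] = rollout
    return best


# Toy rollouts: task "a" has two passing attempts, task "b" has none.
rollouts = [
    {"task_id": "a", "reward": 1.0, "num_tokens": 900},
    {"task_id": "a", "reward": 1.0, "num_tokens": 400},
    {"task_id": "a", "reward": 0.0, "num_tokens": 100},
    {"task_id": "b", "reward": 0.0, "num_tokens": 250},
]
targets = select_sft_targets(rollouts)
```

Tasks with no passing rollout simply drop out of the result, which is consistent with the dataset containing fewer examples than rollouts attempted.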

See andthattoo/etpi-sft for the full dataset including system prompt, task instructions, and per-turn messages.

Limitations

  • Small training set (318 examples). SFT-seed scale, not full SFT. Expected to generalize the IR format rather than acquire new capabilities. The phase-2 RL run is where capability climbs.
  • Trained on one task distribution (R2E-Gym, Python SWE-style issues). Performance on other languages or task types is untested.
  • The R2E paper authors use the Terminus 2 scaffold with an 80k-token-per-turn budget to report 77.2% on SWE-bench Verified for the base Qwen 3.6 27B. This model is intentionally trained under a tighter scaffold (bash-only loop, IR-constrained thinking) with a different efficiency objective, so a direct numerical comparison is not apples to apples.
  • Phase 1 only. No RL has been applied. The expected pass@1 lift on R2E-Gym from SFT alone is modest; the real lift comes from phase 2.

Recommended inference

The model has internalized the IR format. You can drive it with or without grammar enforcement at inference time:

from transformers import AutoModelForCausalLM, AutoTokenizer

m = AutoModelForCausalLM.from_pretrained("andthattoo/etpi-phase1", torch_dtype="bfloat16", device_map="auto")
t = AutoTokenizer.from_pretrained("andthattoo/etpi-phase1")

messages = [
    {"role": "system", "content": "You are a software-engineering agent. ..."},
    {"role": "user", "content": "<github-issue text>"},
]
prompt = t.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
ids = t(prompt, return_tensors="pt").to(m.device)
out = m.generate(**ids, max_new_tokens=512, do_sample=False)  # greedy decoding; temperature=0.0 is not a valid sampling temperature
print(t.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))

If you need a hard guarantee in production that every turn follows the compact format, apply a GBNF grammar that enforces the IR shape on the <think> block.
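As a rough illustration, a grammar for the IR shape could look like the one below, held in a Python string so it can be written to a file or passed to a grammar-aware runtime. It assumes llama.cpp-style GBNF syntax. The rule names and the permissive `tail` rule (anything after `</think>`) are choices made for this sketch, not the grammar used to generate the training data.

```python
# Hypothetical GBNF grammar: three single-line slots inside <think>, then free text.
# A raw string keeps the \n and \x00 escapes intact for the grammar parser.
IR_GRAMMAR = r"""
root  ::= think tail
think ::= "<think>\n" "STATE: " line "\n" "ACTION: " line "\n" "EXPECT: " line "\n" "</think>\n"
line  ::= [^\n]+
tail  ::= [^\x00]*
"""

# Rule names defined by the grammar, collected for a quick sanity check.
rule_names = [
    row.split("::=")[0].strip()
    for row in IR_GRAMMAR.strip().splitlines()
]
```

How the grammar string is handed to the inference engine depends on the runtime, so check the grammar or structured-output documentation for the server in use.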

License

Apache 2.0 (matches base Qwen 3.6 27B license).

Acknowledgements

  • Base model: Qwen team
  • Training environment: R2E-Gym
  • Inference: SGLang
  • Tooling: TRL, PEFT, Liger Kernel, bitsandbytes