Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR • Paper • 2509.02522 • Published Sep 2, 2025 • 26
Self-Improving Language Models with Bidirectional Evolutionary Search • Paper • 2605.28814 • Published 6 days ago • 56
DVAO: Dynamic Variance-adaptive Advantage Optimization for Multi-reward Reinforcement Learning • Paper • 2605.25604 • Published 8 days ago • 133
SkillOpt: Executive Strategy for Self-Evolving Agent Skills • Paper • 2605.23904 • Published 11 days ago • 214
Anti-Self-Distillation for Reasoning RL via Pointwise Mutual Information • Paper • 2605.11609 • Published 21 days ago • 195
RLPR: Extrapolating RLVR to General Domains without Verifiers • Paper • 2506.18254 • Published Jun 23, 2025 • 35
Reinforcement-aware Knowledge Distillation for LLM Reasoning • Paper • 2602.22495 • Published Feb 26 • 5
Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning • Paper • 2602.01058 • Published Feb 1 • 45
LLM Embeddings Explained: A Visual and Intuitive Guide • Running • 346 • How Language Models Turn Text into Meaning, From Traditional