- Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
  Paper • 2502.04322 • Published • 3
- narutatsuri/evaluation-actionable
  Text Classification • 8B • Updated • 165
- narutatsuri/evaluation-informative
  Text Classification • 8B • Updated • 130
- narutatsuri/response_selection_model-actionable
  Text Classification • 8B • Updated • 4
Narutatsu Ri
narutatsuri
AI & ML interests
None yet
Recent Activity
updated a dataset 5 days ago
narutatsuri/lrm_safety-artifacts
published a dataset 6 days ago
narutatsuri/lrm_safety-artifacts
authored a paper 2 months ago
Self-Distillation Zero: Self-Revision Turns Binary Rewards into Dense Supervision