ICD-10 subgroup classifier — group R (Russian)

Multi-label classifier over 3-character ICD-10 subgroups inside chapter R.
Fine-tuned from ai-forever/ruBert-base on Russian clinical text.

Intended use

  • Decision-support signal for suggesting candidate 3-character ICD-10 subgroup codes from Russian clinical notes. Not a substitute for clinician judgment; not validated for autonomous diagnosis and not intended for autonomous clinical decisions.

Training data

  • Source CSV: datasets/subgroups/group_R.csv
  • SHA-256: 21a418494d3ff6317ac92d9c923ef993540245007a47edd48c4a1bd8a78d86a9
  • Produced by ml/build_subgroup_datasets.ipynb (iterative multi-label stratification by parse_id).
  • Splits: train=187 · val=36 · test=36
  • Labels: 18 (ordered, includes R_OTHER for rare codes collapsed during dataset build).
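Before retraining or auditing, the CSV can be checked against the digest listed above. The following is a minimal sketch using only the standard library; the helper name `sha256_of` is illustrative, and the relative path assumes the script runs from the repository root.

```python
import hashlib
from pathlib import Path

# Digest and path exactly as listed in this card.
EXPECTED_SHA256 = "21a418494d3ff6317ac92d9c923ef993540245007a47edd48c4a1bd8a78d86a9"
CSV_PATH = Path("datasets/subgroups/group_R.csv")


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if CSV_PATH.exists():
    actual = sha256_of(CSV_PATH)
    print("match" if actual == EXPECTED_SHA256 else f"mismatch: {actual}")
else:
    # Assumption: the working directory is the repository root.
    print(f"{CSV_PATH} not found; run from the repository root")
```

A mismatch suggests the dataset was rebuilt or edited after this model was trained, in which case the splits and metrics reported here may no longer correspond to the file on disk.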

Metrics (test split)

metric            value
macro_f1          0.4026
micro_f1          0.6400
weighted_f1       0.6784
subset_accuracy   0.5556
hit@1             0.8056
hit@3             0.8333
recall@3          0.8333
mrr               0.8395

Full per-label breakdown in metrics.json.
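This card does not spell out how the ranking metrics are defined. The sketch below shows one common reading of hit@k, recall@k and MRR for a single example; these definitions are an assumption and may differ from the evaluation code that produced the numbers above. All function names are illustrative, and every example is assumed to have at least one true label.

```python
def ranked_labels(probs):
    """Label indices sorted by descending probability."""
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def hit_at_k(probs, true_ids, k):
    """1.0 if any true label is among the k highest-scoring labels, else 0.0."""
    return float(any(i in true_ids for i in ranked_labels(probs)[:k]))


def recall_at_k(probs, true_ids, k):
    """Fraction of the true labels found among the k highest-scoring labels."""
    top = set(ranked_labels(probs)[:k])
    return len(top & set(true_ids)) / len(true_ids)


def reciprocal_rank(probs, true_ids):
    """1 / rank of the best-ranked true label (ranks start at 1)."""
    for rank, i in enumerate(ranked_labels(probs), start=1):
        if i in true_ids:
            return 1.0 / rank
    return 0.0


# Toy example with 4 labels; the true labels are 1 and 3.
probs = [0.10, 0.70, 0.05, 0.20]
true_ids = {1, 3}
print(hit_at_k(probs, true_ids, 1))      # 1.0
print(recall_at_k(probs, true_ids, 1))   # 0.5
print(reciprocal_rank(probs, true_ids))  # 1.0
```

Averaging each function over the test examples gives the split-level figure. Under these definitions hit@k and recall@k coincide whenever every example has exactly one true label.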

Limitations

  • Russian only; heavy reliance on clinical abbreviations (АД, ТТГ, УЗИ, etc.).
  • Training text had PII redacted (*ДАТА*, *ГОРОД*, ...); model may behave differently on non-redacted input.
  • Small chapters (train rows < 250) were trained with heavy regularization. This chapter qualifies with 187 train rows, so some labels may have low support.
  • Rare labels without positives in train are kept in the label map (see label_map.json → rare_label_ids) for interface stability but will effectively never fire.
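Because the training text carried redaction placeholders, it may help to mask obvious PII in new input the same way before inference. The sketch below is an assumption, not the pipeline's actual redaction step: it only rewrites numeric dates such as 12.03.2021 to the `*ДАТА*` placeholder mentioned above. The regex and the function name `mask_dates` are illustrative, and city names would need a gazetteer or NER step that is not shown.

```python
import re

# Numeric dates like 12.03.2021, 1/2/21 or 05-11-2019 (day, month, year).
# Illustrative pattern only; the original redaction rules are not documented here.
DATE_RE = re.compile(r"\b\d{1,2}[./-]\d{1,2}[./-](?:\d{4}|\d{2})\b")


def mask_dates(text):
    """Replace numeric dates with the *ДАТА* placeholder seen in training text."""
    return DATE_RE.sub("*ДАТА*", text)


print(mask_dates("Осмотр 12.03.2021, жалобы на головную боль."))
```

Two-part values such as a blood pressure reading (120/80) are left untouched because the pattern requires three numeric groups.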

Inference

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

repo = "Dmitry43243242/icd10-ru-subgroup-r"
tok = AutoTokenizer.from_pretrained(repo)
mdl = AutoModelForSequenceClassification.from_pretrained(repo)
mdl.eval()  # disable dropout for inference

text = "жалобы пациента..."
# Inputs longer than 512 tokens are truncated.
inp = tok(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    # Multi-label head: an independent sigmoid per label, not a softmax.
    probs = torch.sigmoid(mdl(**inp).logits)[0].tolist()

id2label = mdl.config.id2label
# Labels whose probability reaches the 0.5 threshold (the list may be empty).
preds = [id2label[i] for i, p in enumerate(probs) if p >= 0.5]
# Three highest-scoring labels regardless of the threshold.
top3 = sorted(
    ((id2label[i], p) for i, p in enumerate(probs)),
    key=lambda x: x[1],
    reverse=True,
)[:3]
print(preds, top3)

Citation

Built as part of the ai-app ICD-10 classification pipeline. Upstream model: ai-forever/ruBert-base.
