LAnoBERT checkpoints (BGL / HDFS / Thunderbird)

From-scratch, custom-vocabulary BERT encoders trained with a masked-language-modeling objective on normal system logs only (no next-sentence prediction), following the LAnoBERT log anomaly detection method. There is one checkpoint per dataset, each stored as a subfolder of this repo.

Code: https://github.com/yukyunglee/LAnoBERT
Paper: Yukyung Lee, Jina Kim, Pilsung Kang. LAnoBERT: System log anomaly detection based on BERT masked language model. Applied Soft Computing, Vol. 146, 2023, 110689. https://doi.org/10.1016/j.asoc.2023.110689

subfolder	dataset	vocab	batch	AUROC / best-F1 (`error_mean`)
`bgl`	BGL	1000	32	1.000 / 1.000
`hdfs`	HDFS	200	32	0.997 / 0.969
`thunderbird`	Thunderbird	10000	32	1.000 / 1.000
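
The card does not spell out how the two metrics in the table are computed. As a rough sketch, assuming AUROC is the usual rank-based statistic and "best-F1" means the highest F1 over all score thresholds, they could be obtained from per-line anomaly scores and binary labels (1 = anomalous) as below. The function names `auroc` and `best_f1` are illustrative and are not taken from the LAnoBERT repository.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random anomalous line outscores a random normal one.

    Ties count as half a win. Quadratic in the number of lines, so this is
    meant for illustration rather than large test sets.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal line")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_f1(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Highest F1 over all thresholds; a line is flagged when score >= threshold."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best
```

Because the threshold is chosen on the evaluated data, a best-F1 figure computed this way is an upper bound on what a fixed, pre-selected threshold would achieve.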

Usage

from transformers import AutoModelForMaskedLM, AutoTokenizer

sub = "bgl"  # or "hdfs" / "thunderbird"
tok = AutoTokenizer.from_pretrained("yukyung/LAnoBERT", subfolder=sub)
model = AutoModelForMaskedLM.from_pretrained("yukyung/LAnoBERT", subfolder=sub)

Scoring

The anomaly score for a log line is the mean per-word cross-entropy over that line (`error_mean`). Because it is an average rather than a sum, the score is length-adaptive, and it is balanced across datasets. See the code repository for the full inference pipeline.
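
The scoring rule can be sketched in a few lines. This is a minimal illustration, not the repository's pipeline: it assumes each position is masked one at a time, that the cross-entropy is taken against the original token at the masked position, and that tokens stand in for words. The helper `predict_logits` is a hypothetical callable that returns vocabulary logits for one masked position.

```python
import math
from typing import Callable, List, Sequence


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def error_mean(
    token_ids: Sequence[int],
    mask_id: int,
    predict_logits: Callable[[List[int], int], Sequence[float]],
) -> float:
    """Mean cross-entropy over a log line, masking one position at a time.

    `predict_logits(masked_ids, pos)` is a hypothetical hook that must return
    the vocabulary logits the model produces at position `pos`.
    """
    if not token_ids:
        raise ValueError("empty log line")
    losses = []
    for pos, true_id in enumerate(token_ids):
        masked = list(token_ids)
        masked[pos] = mask_id  # hide the token the model has to recover
        losses.append(cross_entropy(predict_logits(masked, pos), true_id))
    return sum(losses) / len(losses)
```

With the checkpoint loaded as in Usage, `mask_id` would be `tok.mask_token_id`, and `predict_logits` could wrap a forward pass and read `model(input_ids=...).logits[0, pos]`. Whether special tokens are scored and how sub-word pieces map to words are left to the code repository.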

Citation

@article{lee2023lanobert,
  title   = {LAnoBERT: System log anomaly detection based on BERT masked language model},
  author  = {Lee, Yukyung and Kim, Jina and Kang, Pilsung},
  journal = {Applied Soft Computing},
  volume  = {146},
  pages   = {110689},
  year    = {2023},
  issn    = {1568-4946},
  doi     = {10.1016/j.asoc.2023.110689}
}
