Model Card: tomh/toxigen_hatebert

1. Model Summary

ToxiGen HateBERT (Hugging Face identifier: tomh/toxigen_hatebert) is a transformer-based binary classifier designed to detect both explicit and implicit hate speech in English text. The model is built on HateBERT, a BERT-base architecture (110M parameters) that Caselli et al. (2021) further pre-trained on a large corpus of posts from banned Reddit communities; this checkpoint was then fine-tuned on the ToxiGen dataset introduced in the ACL 2022 paper by Hartvigsen et al.

The model was developed at MIT, the University of Washington, Carnegie Mellon University, and Microsoft Research. Its core innovation lies in its training data: ToxiGen is a machine-generated dataset of approximately 274,000 statements produced by GPT-3 using demonstration-based prompting and an adversarial classifier-in-the-loop decoding method called ALICE (Adversarial Language Imitation with Constrained Exemplars). This pipeline generates statements that are overwhelmingly implicit—98.2% contain no slurs, profanity, or explicit hate markers—making the model particularly suited to detecting the subtle, coded, and adversarial forms of toxicity that conventional detectors miss.

The model outputs a binary classification (toxic vs. non-toxic) with associated probability scores, and is hosted publicly on Hugging Face for community use.
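
For orientation, here is a minimal sketch of consuming the classifier's output. The label convention (LABEL_0 = non-toxic, LABEL_1 = toxic) follows Appendix A of this card and should be verified against the model config; the example output dict below is hypothetical.

```python
# Minimal sketch of consuming the classifier's output. Loading the model
# itself requires network access, e.g.:
#   from transformers import pipeline
#   clf = pipeline("text-classification", model="tomh/toxigen_hatebert", top_k=None)
#   scores = clf("I hope you have a nice day.")[0]
# The label convention (LABEL_0 = non-toxic, LABEL_1 = toxic) follows
# Appendix A of this card; verify it against the model config before relying on it.

def toxic_probability(scores):
    """Extract P(toxic) from pipeline output of the form
    [{"label": "LABEL_0", "score": ...}, {"label": "LABEL_1", "score": ...}]."""
    by_label = {s["label"]: s["score"] for s in scores}
    return by_label.get("LABEL_1", 0.0)

# Hypothetical pipeline output for a benign sentence:
example = [{"label": "LABEL_0", "score": 0.999},
           {"label": "LABEL_1", "score": 0.001}]
print(toxic_probability(example))  # 0.001
```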

2. Intended Uses

The model is designed for the following applications:

  • Content moderation pre-screening: Flagging potentially hateful or harmful user-generated text in online platforms (social media, forums, comment sections) before human moderators perform final review. The model is intended as a first-pass filter, not a sole decision-maker.
  • Evaluation and benchmarking of toxicity classifiers: Researchers can use the model to benchmark how well other classifiers handle implicit and adversarial hate speech, especially statements that lack explicit slurs but carry harmful intent.
  • Safety layers in generative AI systems: Filtering harmful outputs from chatbots, language models, or other text-generation systems before they reach end users.
  • Academic research on bias and fairness in NLP: Studying how toxicity classifiers behave across different demographic mentions and exploring the gap between explicit and implicit hate detection.

Real-world usage observed in the Hugging Face ecosystem:

  • Integration into moderation APIs that score user-generated content in real time.
  • Demo applications on Hugging Face Spaces for interactive toxicity scoring and exploration.
  • Fine-tuning base for domain-specific hate speech classifiers.

3. Out-of-Scope Uses

The following uses are explicitly out of scope and carry significant risk of harm:

  • Automated user banning or punitive decisions without human review. The model has non-trivial false positive and false negative rates. Using it as a standalone gatekeeper risks unjust censorship of legitimate speech (especially from minority communities) and failure to catch subtle harmful content. Human oversight is essential.
  • Cross-lingual or multilingual hate speech detection. The model is trained exclusively on English-language data with a U.S.-centric cultural lens. Applying it to other languages or cultural contexts will produce unreliable results and may introduce new biases.
  • Interpretation of context-heavy text (legal proceedings, academic discourse, satire, counter-speech). The model classifies individual statements without access to broader conversational, cultural, or situational context. Sarcasm, reclaimed language, and counter-speech are particularly likely to be misclassified.
  • Surveillance or profiling of individuals or communities. Using the model to monitor, track, or profile specific users or demographic groups is a misuse that could cause serious civil liberties harm.
  • Generation of hate speech. The ALICE adversarial decoding method described in the paper could theoretically be repurposed to produce large volumes of machine-generated hate content. This is an explicitly acknowledged misuse risk.

4. Training Data

4.1 Dataset Overview

The model is fine-tuned on ToxiGen, a large-scale, machine-generated dataset containing 274,186 statements. The dataset was created by prompting GPT-3 (Brown et al., 2020) using carefully curated demonstration-based prompts, with and without the ALICE adversarial decoding method.

| Property | Value |
| --- | --- |
| Dataset name | ToxiGen (Hartvigsen et al., 2022) |
| Total statements | 274,186 |
| Class balance | ~50% toxic / ~50% benign (balanced per group) |
| Implicit rate | 98.2% (no slurs, profanity, or explicit markers) |
| Demographic groups | 13 minority identity groups |
| Generation method | GPT-3 + demonstration-based prompting + ALICE adversarial decoding |
| Source of demonstrations | Blog posts, news articles (benign); hate forums, Reddit (toxic) |

4.2 Demographic Groups Covered

The dataset covers 13 minority identity groups, each with approximately equal numbers of toxic and benign statements: Black, Asian, Native American, Latino, Jewish, Muslim, Chinese, Mexican, Middle Eastern, LGBTQ+, Women, Mentally Disabled, and Physically Disabled.

4.3 ALICE Adversarial Decoding

Approximately 14,174 of the 274,186 statements were generated using ALICE, a constrained beam search method that pits a toxicity classifier against the language model during decoding. ALICE produces statements specifically designed to fool existing classifiers—creating toxic statements that register as benign, and benign statements that register as toxic. This adversarial subset makes ToxiGen particularly challenging as a benchmark.
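
The core mechanism can be illustrated with a toy re-ranking sketch. This is not the paper's algorithm (real ALICE runs constrained beam search over GPT-3 logits during decoding); the `lm_logprob` and `clf_benign_logprob` functions below are hypothetical stand-ins for the language model and the toxicity classifier, showing only how candidates can be scored jointly so that generations stay likely under the LM while registering as benign to the classifier.

```python
# Toy illustration of classifier-in-the-loop re-ranking in the spirit of ALICE.
# NOT the paper's algorithm: `lm_logprob` and `clf_benign_logprob` are
# hypothetical stand-ins for the language model and the toxicity classifier.
import math

def lm_logprob(text):
    # Stand-in LM score: longer text is (artificially) less likely.
    return -0.1 * len(text.split())

def clf_benign_logprob(text):
    # Stand-in classifier: penalizes an explicit marker word.
    return math.log(0.9 if "slur" not in text else 0.1)

def alice_rerank(candidates, alpha=0.5):
    """Pick the candidate maximizing a weighted sum of LM likelihood and the
    classifier's benign score -- the mechanism that steers decoding toward
    text that evades the classifier while staying fluent."""
    scored = [(alpha * lm_logprob(c) + (1 - alpha) * clf_benign_logprob(c), c)
              for c in candidates]
    return max(scored)[1]

best = alice_rerank(["they are bad people", "they contain a slur word"])
print(best)
```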

4.4 Training Data Limitations

  • Synthetic distribution gap: All statements are machine-generated and may not reflect the distribution, vocabulary, or pragmatic patterns of real-world hate speech on platforms like Twitter, Reddit, or YouTube.
  • U.S.-centric cultural framing: The 13 groups are defined from a U.S. socio-cultural perspective. Hate speech targeting groups salient in other regions (e.g., caste-based discrimination, ethnic minorities in non-Western countries) is not represented.
  • Limited intersectionality: Statements target single identity groups. Intersectional identities (e.g., Black women, disabled LGBTQ+ individuals) are not explicitly modeled.
  • Prompt label noise: Toxicity labels are derived from the prompt intent (toxic or benign), not from per-statement human annotation of the full training set. The authors acknowledge this introduces some label noise.

5. Evaluation Data

5.1 ToxiGen-HumanVal (Internal Test Set)

A subset of 792 statements from ToxiGen, selected such that no training statement had cosine similarity above 0.7 with any test statement. Each statement was rated by 3 annotators from a pool of 156 pre-qualified Amazon Mechanical Turk workers. Inter-annotator agreement was moderate (Fleiss' κ = 0.46, Krippendorff's α = 0.64), with majority agreement in 93.4% of cases.
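
For readers unfamiliar with the agreement statistic, the reported Fleiss' κ can be computed from a per-item category-count matrix as sketched below; the small `ratings` matrix here (5 items, 3 raters, 2 categories) is illustrative, not the actual annotation data.

```python
# Sketch: computing Fleiss' kappa from per-item category counts.
# Each row gives, for one statement, how many of the 3 raters chose
# each category (here: [benign, toxic]). Data is illustrative only.

def fleiss_kappa(ratings):
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    # Marginal proportion of each category across all ratings.
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(len(ratings[0]))]
    # Per-item observed agreement.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items            # mean observed agreement
    p_e = sum(p * p for p in p_j)         # chance agreement
    return (p_bar - p_e) / (1 - p_e)

ratings = [[3, 0], [2, 1], [0, 3], [1, 2], [3, 0]]
print(round(fleiss_kappa(ratings), 3))  # 0.444
```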

5.2 External Human-Written Datasets

  • ImplicitHateCorpus (ElSherief et al., 2021): 22,584 statements from Twitter, 96.8% implicit, 39.6% toxic.
  • SocialBiasFrames (Sap et al., 2020): 44,671 social media statements, 71.5% implicit, 44.8% toxic.
  • DynaHate (Vidgen et al., 2021): 41,134 human-machine adversarial statements, 83.3% implicit, 53.9% toxic.

5.3 Evaluation Data Concerns

⚠️ Critical issue: The internal test set (ToxiGen-HumanVal), while deduplicated by cosine similarity, is drawn from the same generative pipeline as the training data. Performance on this test set may overestimate real-world generalization. The external datasets provide a more realistic assessment, though they differ in domain, annotation scheme, and class balance.

6. Metrics

6.1 Reported Metrics

The primary evaluation metric is AUC (Area Under the ROC Curve), which measures the model's ability to rank toxic statements above benign ones across all classification thresholds.

HateBERT AUC scores (fine-tuned on ToxiGen):

| Test Dataset | Zero-Shot | ALICE Only | Top-k Only | ALICE + Top-k |
| --- | --- | --- | --- | --- |
| SocialBiasFrames | 0.60 | 0.66 | 0.65 | 0.71 |
| ImplicitHateCorpus | 0.60 | 0.60 | 0.61 | 0.67 |
| DynaHate | 0.47 | 0.54 | 0.59 | 0.66 |
| ToxiGen-HumanVal | 0.57 | 0.93 | 0.88 | 0.96 |
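
AUC has a direct probabilistic reading: it is the probability that a randomly chosen toxic statement receives a higher toxic score than a randomly chosen benign one. The small pairwise implementation below illustrates this (fine for intuition; use `sklearn.metrics.roc_auc_score` in practice; the labels and scores are made up).

```python
# Sketch: AUC as the pairwise ranking probability P(score_toxic > score_benign),
# with ties counted as half a win. Illustrative data only.

def auc(labels, scores):
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
print(auc(labels, scores))  # 0.75
```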

6.2 Metric Assessment

AUC is an appropriate choice for measuring ranking quality in a binary classification task, especially with balanced classes. However, several important metrics are absent from the reported evaluation:

  • False positive rates disaggregated by demographic group. Without this, it is impossible to know whether the model disproportionately censors speech about or from specific communities.
  • Calibration metrics (e.g., Brier score, reliability diagrams). AUC does not tell us whether the model's confidence scores are well-calibrated—a critical property for setting deployment thresholds.
  • Precision/recall at operationally relevant thresholds. In real deployments, developers must choose a threshold. Without precision-recall curves, there is no guidance on how to make this tradeoff.
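
Both missing analyses are cheap to run once per-statement toxic probabilities and human labels are available. A minimal sketch, with made-up labels and probabilities:

```python
# Sketch of the missing analyses: Brier score (calibration) and
# precision/recall at a chosen deployment threshold. Data is illustrative.

def brier(labels, probs):
    """Mean squared error between predicted probability and true label;
    lower is better-calibrated."""
    return sum((p - l) ** 2 for l, p in zip(labels, probs)) / len(labels)

def precision_recall(labels, probs, threshold):
    tp = sum(1 for l, p in zip(labels, probs) if p >= threshold and l == 1)
    fp = sum(1 for l, p in zip(labels, probs) if p >= threshold and l == 0)
    fn = sum(1 for l, p in zip(labels, probs) if p < threshold and l == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

labels = [1, 1, 1, 0, 0, 0]
probs = [0.95, 0.80, 0.40, 0.70, 0.20, 0.05]
print(brier(labels, probs))
print(precision_recall(labels, probs, threshold=0.5))
```

Sweeping the threshold over a grid and plotting the resulting precision/recall pairs gives exactly the deployment-tradeoff curve the published evaluation omits.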

6.3 Real-World Implications of Errors

The consequences of classification errors are asymmetric and serious:

  • False positives (benign content flagged as toxic): Leads to censorship of legitimate speech. Research shows that this disproportionately affects minority communities whose language references identity terms that the model associates with toxicity.
  • False negatives (toxic content classified as benign): Allows harmful content to remain visible, exposing targeted groups to psychological harm, stereotype reinforcement, and potential radicalization.

7. Quantitative Analysis (Subgroups)

7.1 Dataset-Level Group Balance

By design, ToxiGen ensures approximately equal numbers of toxic and benign statements for each of the 13 minority groups, with roughly 10,000–11,000 statements per group per class. This structural balance is a significant improvement over scraped datasets where, for example, over 93% of mentions of Jewish people in SocialBiasFrames are toxic.

7.2 Missing Subgroup Performance Breakdown

⚠️ Critical gap: The paper does not report per-group AUC, accuracy, or error rates for the fine-tuned model. While the dataset is balanced, this does not guarantee that the model performs equally well across all 13 groups. The following analyses are absent but essential for responsible deployment:

  • Per-group false positive and false negative rates.
  • Performance on dialect variation (e.g., African American English), which is known to trigger elevated false positive rates in toxicity classifiers.
  • Performance on intersectional identities (e.g., statements targeting Black women vs. Black people generally).
  • Analysis of real-world domain shift: how performance degrades when moving from synthetic ToxiGen text to actual platform data (Reddit, Twitter/X, YouTube comments).

7.3 Known Bias Patterns

Research from multiple sources, including the ToxiGen paper itself, demonstrates that BERT-based toxicity classifiers tend to over-rely on identity-term mentions rather than learning semantic features of toxicity. This means:

  • Benign statements that mention minority groups are more likely to be flagged as toxic.
  • Toxic statements that avoid mentioning identity groups explicitly may be missed.
  • Groups with higher representation in toxic training corpora (historically, Black, Jewish, and Muslim communities) may face elevated false positive rates.

Our independent testing (see Appendix A) confirms these patterns: a misleading racial crime statistic was classified as non-toxic with 98.3% confidence, while a general statement about immigrants and crime was flagged as toxic at 99.0%. The model appears more responsive to identity-keyword triggers than to the actual semantic harm being conveyed.

8. Ethical Considerations

| Risk | Root Cause | Who Is Harmed | Likelihood |
| --- | --- | --- | --- |
| Over-censorship of minority voices | Identity-term mentions are correlated with toxicity labels in training data, causing the model to associate group names with harm. | Members of the 13 targeted minority groups, whose legitimate speech about their own communities is disproportionately flagged. | 🔴 HIGH |
| Failure to detect implicit hate speech | Subtle toxicity (stereotypes, microaggressions, coded language) lacks explicit markers the model can latch onto. | Members of targeted groups who are exposed to harmful content that evades the filter. | 🟡 MODERATE |
| Misuse of ALICE for hate generation | The adversarial decoding method can be repurposed to generate large volumes of machine-generated hate content that evades detection. | All targeted groups, platform users broadly, and public discourse. | 🟡 MODERATE |
| Reinforcement of binary toxicity framing | The model outputs toxic/non-toxic, ignoring severity, intent, target, and context. | Users and moderators who need nuanced information to make fair decisions. | 🟡 MODERATE |
| Annotation subjectivity bias | Annotator demographics (majority White, U.S.-based) shape what is labeled as toxic; prior work shows annotator identity significantly affects toxicity ratings. | Communities whose experiences of harm are underrepresented in the annotator pool. | 🟡 MODERATE |

Deployment connection: In observed Hugging Face Spaces demos that use this model for real-time toxicity scoring, the over-censorship risk is directly actionable: users submitting benign text about minority groups may see it incorrectly flagged, which could erode trust in the system and discourage discussion of minority experiences.

9. Limitations

9.1 Technical Limitations

  • No contextual understanding: The model classifies individual statements in isolation. It cannot interpret sarcasm, irony, counter-speech, reclaimed language, or conversational context.
  • Adversarial vulnerability: Despite being trained on adversarial ALICE data, the model remains vulnerable to novel adversarial phrasings, character-level perturbations, and obfuscation techniques (e.g., leetspeak, homoglyphs, Unicode tricks).
  • Generalization gap: Performance on synthetic ToxiGen data (AUC 0.96) is substantially higher than on human-written datasets (AUC 0.66–0.71), indicating a meaningful distribution shift between training and real-world conditions. Our hands-on testing (Appendix A) further demonstrates this gap: contested factual claims and euphemistic reframings expose failure modes not visible in aggregate AUC scores.
  • Input length constraint: BERT-base has a maximum sequence length of 512 tokens. Longer texts are truncated, potentially losing context that affects toxicity interpretation.
  • English-only: No capability for non-English text. Code-switching (common in multilingual communities) will produce unreliable results.
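
One common workaround for the 512-token limit is to score overlapping chunks and take the maximum toxic score. The sketch below illustrates the idea only: it naively counts whitespace-separated words, whereas a real deployment would count tokens with the model's own tokenizer, and `score_fn` is a hypothetical stand-in for the classifier.

```python
# Sketch: scoring long text in overlapping chunks and taking the max toxic
# score. Whitespace "tokens" are a simplification; use the model's tokenizer
# in practice. `score_fn` is a hypothetical stand-in for the classifier.

def chunked_max_score(text, score_fn, max_tokens=512, stride=256):
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return score_fn(" ".join(tokens))
    scores = []
    for start in range(0, len(tokens), stride):
        chunk = tokens[start:start + max_tokens]
        scores.append(score_fn(" ".join(chunk)))
        if start + max_tokens >= len(tokens):
            break
    return max(scores)

# Hypothetical scorer that flags any chunk containing the word "threat":
score = lambda t: 0.9 if "threat" in t else 0.1
long_text = ("filler " * 600) + "they are a threat"
print(chunked_max_score(long_text, score))  # 0.9
```

Note the tradeoff this limitation creates: without chunking, toxicity past the truncation point is silently invisible to the model.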

9.2 Social and Contextual Limitations

  • Toxicity is subjective: What constitutes hate speech depends on cultural context, speaker identity, audience, and intent. A binary toxic/non-toxic output cannot capture this complexity.
  • Limited identity coverage: Only 13 groups are represented, all defined from a U.S. perspective. Many identities are excluded, including caste, indigenous groups outside North America, and specific ethnic minorities in non-Western contexts.
  • No temporal adaptation: Language evolves. New slurs, coded language, and dog-whistles emerge regularly. The model's vocabulary of harm is frozen at training time.
  • Annotator demographics: The human validation pool was 56.9% White and primarily U.S.-based. Research demonstrates that annotator identity significantly shapes toxicity judgments, meaning the ground truth itself reflects a particular perspective.

10. Recommendations

Before deploying this model in any real application, developers should take the following concrete steps:

  1. Implement human-in-the-loop moderation. Never use this model as a standalone automated filter. All flagged content should be reviewed by trained human moderators before any action is taken. Set confidence thresholds conservatively and route uncertain cases to human review.
  2. Evaluate on your own platform data. Before deployment, collect a representative sample of your platform's content, have it annotated by diverse annotators, and measure the model's performance on this domain-specific test set. The gap between ToxiGen performance and real-world performance may be substantial.
  3. Conduct subgroup fairness audits. Measure false positive and false negative rates separately for each demographic group represented in your platform. Pay special attention to whether the model disproportionately censors speech from or about specific communities.
  4. Test for dialect sensitivity. Evaluate the model on text written in African American English, code-switched text, and other dialect variants. Prior research shows toxicity classifiers exhibit significant bias against non-standard English dialects.
  5. Combine with context-aware systems. Use this model as one signal among several. Incorporate conversation-level context, user history, metadata, and platform-specific norms into moderation decisions.
  6. Establish an appeals and feedback mechanism. Users whose content is flagged should have a clear path to appeal the decision. Track appeal outcomes to identify systematic biases in the model's behavior.
  7. Plan for model updating. Hate speech evolves. Schedule regular re-evaluation and retraining cycles to keep the model current with emerging language patterns.
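
Recommendation 1 can be sketched as a simple routing policy with a human-review band between two conservative thresholds. The threshold values below are illustrative placeholders, not calibrated for this model; in all branches a human remains in the loop before punitive action.

```python
# Sketch of recommendation 1: conservative thresholds with a human-review
# band. Threshold values are illustrative, not calibrated for this model.

def route(toxic_prob, auto_allow=0.10, auto_flag=0.95):
    """Route content by the model's toxic probability. Even high-confidence
    flags are queued for a moderator rather than auto-actioned."""
    if toxic_prob < auto_allow:
        return "allow"
    if toxic_prob > auto_flag:
        return "queue_for_moderator"
    return "human_review"

for p in (0.02, 0.50, 0.99):
    print(p, route(p))
```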

Appendix A: Hands-On Classification Testing

To complement the published evaluation metrics, we conducted independent testing of the ToxiGen HateBERT classifier (tomh/toxigen_hatebert) using the Hugging Face transformers pipeline. The tests were designed to probe how the model handles contested factual claims, opinions, cultural references, and statements that blur the line between fact and hate. The results reveal systematic patterns relevant to real-world deployment.

A.1 Test Design

We submitted 12 hand-crafted statements across three categories: clearly benign, clearly toxic, and ambiguous or contested statements where reasonable people might disagree on whether the content constitutes hate speech. Results were collected as probability scores for the toxic class (LABEL_1).

A.2 Results

| Statement | Toxic Score | Prediction | Issue |
| --- | --- | --- | --- |
| I hope you have a nice day. | 0.001 | Non-toxic | ✅ Correct |
| I disagree with your opinion. | 0.003 | Non-toxic | ✅ Correct |
| Those people are dangerous and do not belong here. | 0.988 | Toxic | ✅ Correct |
| They are parasites destroying society. | 0.982 | Toxic | ✅ Correct |
| Immigrants tend to increase crime rate | 0.990 | Toxic | 🔴 False positive |
| Despite being 13% of the population black people commit 50% of violent crimes in the united states | 0.017 | Non-toxic | 🔴 False negative |
| Men are very dangerous, more than a bear | 0.861 | Toxic | 🟡 Debatable |
| I choose the bear | 0.002 | Non-toxic | ✅ Correct |
| trans women are not women | 0.988 | Toxic | 🟡 Debatable |
| People from the black background are a threat to our country. | 0.803 | Toxic | ✅ Correct |
| MLK only lead the civil rights movement because he wanted to publicly date women of caucus descent | 0.412 | Non-toxic | 🔴 False negative |

Table: Classification results from independent testing of tomh/toxigen_hatebert. Red = misclassification; amber = contested.

A.3 Key Findings

🔴 Finding 1: Contested factual claims and statistics can evade detection.

The statement presenting a misleading racial crime statistic was classified as non-toxic with 98.3% confidence. This is a well-known rhetorical device used to dehumanize Black Americans, yet the model treated it as a neutral factual claim. This confirms the model's inability to distinguish between factual framing and weaponized statistics—a pattern that could allow large volumes of statistically-framed hate speech to pass moderation undetected.

🔴 Finding 2: Opinions and generalizations about groups receive inconsistent treatment.

The statement about immigrants and crime was flagged as toxic at 99.0% confidence. While this can be seen as a harmful generalization, it is also a commonly debated policy claim. Meanwhile, the explicitly demeaning statement about Martin Luther King's motivations scored only 41.2% toxic and was classified as non-toxic. The model appears to be more sensitive to group-mention keywords (such as "immigrants") than to the actual semantic harm of a statement.

🟡 Finding 3: The model treats some contested social opinions as high-confidence hate.

The statement about trans women was classified as toxic with 98.8% confidence. While many would consider this statement harmful, it is also a position held in ongoing public debates. Similarly, the bear comparison statement scored 86.1% toxic. These results highlight the model's difficulty with statements at the intersection of opinion, social commentary, and potential harm—exactly the cases where a binary toxic/non-toxic framework is most inadequate.

🟡 Finding 4: Euphemistic reframing reduces detection.

Comparing two MLK-related statements, the slang-phrased version (not shown in the table) was flagged as toxic at 86.5%, while the euphemistic reframing shown in the table dropped to only 41.2% toxic and was classified as non-toxic. Simple paraphrasing can significantly alter the model's output, confirming the adversarial vulnerability noted in the paper.

A.4 Implications for Deployment

These findings have direct consequences for any system relying on this model family:

  • Statistical hate speech: Content framing harmful claims as facts or statistics can bypass the model entirely. Moderation systems need supplementary detectors for misleading-statistic patterns.
  • Keyword sensitivity vs. semantic understanding: The model over-weights identity-group keywords and under-weights the actual semantic structure of harm, creating blind spots for implicit toxicity and over-flagging risks for benign identity mentions.
  • Opinion vs. hate: Binary classification cannot adequately handle statements that are simultaneously held opinions and potential harm. Platforms should implement a graduated severity system rather than a binary flag.
  • Adversarial fragility: Simple rewording (replacing slang with formal language) can flip the model's prediction, making the model vulnerable to deliberate evasion by bad actors.

References

  • Caselli, T., Basile, V., Mitrovic, J., & Granitzer, M. (2021). HateBERT: Retraining BERT for abusive language detection in English. arXiv:2010.12472.
  • Dixon, L., Li, J., Sorensen, J., Thain, N., & Vasserman, L. (2018). Measuring and mitigating unintended bias in text classification. AAAI/ACM AIES.
  • ElSherief, M., Ziems, C., Muchlinski, D., et al. (2021). Latent hatred: A benchmark for understanding implicit hate speech. arXiv:2109.05322.
  • Hartvigsen, T., Gabriel, S., Palangi, H., Sap, M., Ray, D., & Kamar, E. (2022). ToxiGen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. ACL 2022.
  • Sap, M., Card, D., Gabriel, S., Choi, Y., & Smith, N.A. (2020). Social Bias Frames: Reasoning about social and power implications of language. ACL 2020.
  • Sap, M., Gabriel, S., Qin, L., et al. (2019). The risk of racial bias in hate speech detection. ACL 2019.
  • Vidgen, B., Thrush, T., Waseem, Z., & Kiela, D. (2021). Learning from the worst: Dynamically generated datasets to improve online hate detection. ACL 2021.
  • Zhou, X., Sap, M., Swayamdipta, S., Choi, Y., & Smith, N.A. (2021). Challenges in automated debiasing for toxic language detection. EACL 2021.