	---
	language:
	- en
	license: cc-by-4.0
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- emotion-recognition
	- bayesian-deep-learning
	- mc-dropout
	- uncertainty-quantification
	- multi-label-classification
	datasets:
	- google-research-datasets/go_emotions
	- Skylion007/openwebtext
	- allenai/c4
	- wikimedia/wikipedia
	metrics:
	- precision
	- recall
	- f1
	model-index:
	- name: EmCoder
	  results:
	  - task:
	      type: text-classification
	      name: Multi-label Emotion Classification
	    dataset:
	      name: GoEmotions
	      type: go_emotions
	      split: test
	    metrics:
	    - name: Macro F1
	      type: f1
	      value: 0.488
	    - name: Macro Precision
	      type: precision
	      value: 0.503
	    - name: Macro Recall
	      type: recall
	      value: 0.503
	---

	# EmCoder
	<blockquote>
	<b>Probabilistic Emotion Recognition & Uncertainty Quantification</b><br>
	<b>28-emotion multi-label Transformer classifier</b><br>
	<b>Live Demo & API Service:</b> <a href="https://yezdata-emcoder-api-ui.hf.space">Try EmCoder on Hugging Face Spaces</a>
	</blockquote>


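	Unlike standard classifiers, which return only a point estimate, EmCoder uses Monte Carlo (MC) Dropout to quantify what it doesn't know: every predicted probability comes with an uncertainty estimate, which makes the model suitable for high-stakes AI pipelines where unreliable predictions need to be flagged.<br>
	EmCoder is optimized for MC Dropout inference, and because it uses rotary position embeddings (RoPE), its architecture imposes no hard limit on input length.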


	## SOTA benchmark
	### Evaluation on the GoEmotions test split (macro avg metrics)
	EmCoder reaches a Macro F1 of 0.488, ahead of the original Google BERT baseline (0.460) and RoBERTa-base (0.450) and behind ModernBERT-base (0.550), while being ~35% smaller than RoBERTa-base and ~45% smaller than ModernBERT-base. It also provides per-class epistemic uncertainty quantification. The F1/M column reports Macro F1 per million parameters.

	| Model | Precision | Recall | F1-Score | Params | F1/M |
	| :--- | :--- | :--- | :--- | :--- | :--- |
	| EmCoder | 0.503 | 0.503 | 0.488 | 81.8M | 0.0060 |
	| Google BERT (Original) | 0.400 | 0.630 | 0.460 | 110M | 0.0042 |
	| RoBERTa-base | 0.575 | 0.396 | 0.450 | 125M | 0.0036 |
	| ModernBERT-base | 0.583 | 0.535 | 0.550 | 149M | 0.0037 |
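
The size and efficiency figures can be rechecked from the table itself. The sketch below recomputes F1/M (Macro F1 divided by the parameter count in millions) and the relative size reductions; the dictionary layout and the helper names are illustrative only and are not part of the EmCoder code.

```python
# Macro F1 and parameter counts (in millions) copied from the benchmark table.
MODELS = {
    "EmCoder": (0.488, 81.8),
    "Google BERT (Original)": (0.460, 110.0),
    "RoBERTa-base": (0.450, 125.0),
    "ModernBERT-base": (0.550, 149.0),
}


def f1_per_million(f1: float, params_m: float) -> float:
    """Macro F1 divided by the parameter count in millions."""
    return f1 / params_m


def size_reduction(small_m: float, large_m: float) -> float:
    """Fraction by which the smaller model undercuts the larger one."""
    return 1.0 - small_m / large_m


for name, (f1, params_m) in MODELS.items():
    print(f"{name:<24} F1/M = {f1_per_million(f1, params_m):.4f}")

emcoder_params = MODELS["EmCoder"][1]
print(f"vs RoBERTa-base:    {size_reduction(emcoder_params, 125.0):.1%} smaller")
print(f"vs ModernBERT-base: {size_reduction(emcoder_params, 149.0):.1%} smaller")
```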


	## How to use
	### 1. Setup & Tokenization
	EmCoder uses the `ModernBERT` tokenizer, so load that tokenizer to keep token IDs aligned with the model's embedding table.
	EmCoder is a custom architecture, so pass `trust_remote_code=True` when loading the model to allow its modeling code to run.
	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	repo_id = "yezdata/EmCoder"

	# Load the same tokenizer used during training
	tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

	# Initialize with same config as training
	model = AutoModelForSequenceClassification.from_pretrained(repo_id, trust_remote_code=True)
	```
	### 2. Bayesian inference
	To obtain probabilistic outputs and uncertainty metrics, use the `mc_forward` method:
	```python
	# Perform 50 stochastic passes
	N_SAMPLES = 50
	MAX_BATCH_SIZE = 10 # optional sub-batching of N_SAMPLES

	inputs = tokenizer("I am so happy you are here!", return_tensors="pt")

	model.eval()
	with torch.no_grad():
	    # Automatically keeps Dropout active, even when in model.eval
	    outputs = model.mc_forward(
	        **inputs,
	        n_samples=N_SAMPLES,
	        max_batch_size=MAX_BATCH_SIZE
	    )

	# Bayesian post-processing
	mc_logits = outputs.logits
	all_probs = torch.sigmoid(mc_logits)  # (n_samples, B, 28)

	mean_probs = all_probs.mean(dim=0)  # mean predicted probability per class
	# Standard deviation across passes: a basic estimate of epistemic uncertainty
	uncertainty = all_probs.std(dim=0)

	# Formatted output (drops the batch dimension of the single input)
	m_probs = mean_probs.squeeze(0)
	u_vals = uncertainty.squeeze(0)

	print(f"{'Emotion':<15} | {'Prob':<10} | {'Uncertainty':<10}")
	print("-" * 40)

	sorted_indices = torch.argsort(m_probs, descending=True)

	for idx in sorted_indices:
	    prob, unc = m_probs[idx].item(), u_vals[idx].item()
	    label = model.config.id2label[idx.item()]

	    if prob > 0.05:  # Print only emotions with prob > 5%
	        print(f"{label:<15} | {prob:>8.2%} | ±{unc:>8.4f}")
	```


	## Model Architecture
	![EmCoder Architecture](outputs/architecture.png)


	### Optimization
	The model is trained with a weighted binary cross-entropy loss.
	The class weights w_c are computed from the ratio of negative to positive examples of each class c on a logarithmic class-balancing scale, clipped to the range [0.1, 20], to handle extreme label imbalance:

	$$
	w_{c} = \max\left( 0.1, \min\left( 20, 1 + \ln \left( \frac{N_{neg,c} + \epsilon}{N_{pos,c} + \epsilon} \right) \right) \right)
	$$
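
As an illustration of the weighting formula, the sketch below evaluates it in plain Python. The value of `eps`, the function name, and the label counts are assumptions made for this example; the README does not state them, nor how the weight enters the loss (for instance as a positive-class weight).

```python
import math


def class_weight(n_pos: int, n_neg: int, eps: float = 1e-6,
                 w_min: float = 0.1, w_max: float = 20.0) -> float:
    """w_c = max(w_min, min(w_max, 1 + ln((n_neg + eps) / (n_pos + eps))))."""
    raw = 1.0 + math.log((n_neg + eps) / (n_pos + eps))
    return max(w_min, min(w_max, raw))


# Hypothetical label counts, for illustration only (not the real GoEmotions statistics).
counts = {"common": (5000, 5000), "rare": (50, 9950), "unseen": (0, 10000)}
weights = {label: class_weight(n_pos, n_neg) for label, (n_pos, n_neg) in counts.items()}
print(weights)
```

A balanced class gets a weight of about 1, rarer classes get logarithmically larger weights, and the clipping keeps a class with no positives from dominating the loss.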



	## Performance on test set
	Predictions are binarized with the probability thresholds stored in `thresholds.json`, which were optimized on the validation set.

	| | precision | recall | f1-score | support |
	|:---------------|----------:|---------:|---------:|----------:|
	| micro avg | 0.524 | 0.635 | 0.574 | 6329 |
	| macro avg | 0.503 | 0.503 | 0.488 | 6329 |
	| weighted avg | 0.537 | 0.635 | 0.573 | 6329 |
	| samples avg | 0.562 | 0.661 | 0.584 | 6329 |
	|----------------|-----------|----------|----------|-----------|
	| admiration | 0.642 | 0.681 | 0.661 | 504 |
	| amusement | 0.731 | 0.898 | 0.806 | 264 |
	| anger | 0.491 | 0.434 | 0.461 | 198 |
	| annoyance | 0.352 | 0.316 | 0.333 | 320 |
	| approval | 0.273 | 0.501 | 0.354 | 351 |
	| caring | 0.271 | 0.415 | 0.327 | 135 |
	| confusion | 0.377 | 0.392 | 0.385 | 153 |
	| curiosity | 0.496 | 0.648 | 0.562 | 284 |
	| desire | 0.525 | 0.373 | 0.437 | 83 |
	| disappointment | 0.272 | 0.305 | 0.288 | 151 |
	| disapproval | 0.333 | 0.461 | 0.387 | 267 |
	| disgust | 0.422 | 0.528 | 0.469 | 123 |
	| embarrassment | 0.545 | 0.324 | 0.407 | 37 |
	| excitement | 0.467 | 0.340 | 0.393 | 103 |
	| fear | 0.565 | 0.667 | 0.612 | 78 |
	| gratitude | 0.946 | 0.889 | 0.917 | 352 |
	| grief | 0.667 | 0.333 | 0.444 | 6 |
	| joy | 0.603 | 0.584 | 0.593 | 161 |
	| love | 0.809 | 0.782 | 0.795 | 238 |
	| nervousness | 0.500 | 0.174 | 0.258 | 23 |
	| optimism | 0.614 | 0.478 | 0.538 | 186 |
	| pride | 0.583 | 0.438 | 0.500 | 16 |
	| realization | 0.270 | 0.214 | 0.238 | 145 |
	| relief | 0.118 | 0.364 | 0.178 | 11 |
	| remorse | 0.551 | 0.768 | 0.642 | 56 |
	| sadness | 0.576 | 0.462 | 0.512 | 156 |
	| surprise | 0.511 | 0.482 | 0.496 | 141 |
	| neutral | 0.564 | 0.838 | 0.674 | 1787 |
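
To turn mean probabilities into the binary predictions scored above, the thresholds have to be applied per label. A minimal sketch follows, assuming `thresholds.json` maps each label name to a float and that a prediction is positive when the probability reaches the threshold; neither the file layout nor the `>=` convention is confirmed by this README, and all numbers below are made up.

```python
import json


def binarize(mean_probs: dict, thresholds: dict, default: float = 0.5) -> dict:
    """Return 1 for every label whose mean probability reaches its threshold."""
    return {
        label: int(prob >= thresholds.get(label, default))
        for label, prob in mean_probs.items()
    }


# Assumed file layout: {"gratitude": 0.40, "joy": 0.35, ...}.
# These values are invented; the real ones live in thresholds.json.
thresholds = json.loads('{"gratitude": 0.40, "joy": 0.35, "relief": 0.20}')

mean_probs = {"gratitude": 0.91, "joy": 0.30, "relief": 0.22, "neutral": 0.48}
print(binarize(mean_probs, thresholds))
```

Labels missing from the threshold mapping fall back to `default`, which is a choice made for this sketch rather than documented behavior.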



	### Entropy-based Uncertainty Decomposition
	EmCoder computes probabilistic uncertainty using information-theoretic metrics over N stochastic forward passes.
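
The exact formulas are not given here. A common entropy-based decomposition for MC Dropout, assumed in this sketch rather than confirmed as EmCoder's implementation, treats each of the 28 outputs as a Bernoulli variable: total uncertainty is the entropy of the mean probability, aleatoric uncertainty is the mean entropy of the individual passes, and epistemic uncertainty is their difference (the mutual information). The function names are hypothetical.

```python
import math


def binary_entropy(p: float, eps: float = 1e-12) -> float:
    """Entropy (in nats) of a Bernoulli variable with success probability p."""
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def decompose_uncertainty(probs: list) -> tuple:
    """Split one class's uncertainty into (total, aleatoric, epistemic).

    probs holds the sigmoid outputs of a single class over N stochastic passes.
    """
    mean_p = sum(probs) / len(probs)
    total = binary_entropy(mean_p)                                  # H[E[p]]
    aleatoric = sum(binary_entropy(p) for p in probs) / len(probs)  # E[H[p]]
    epistemic = total - aleatoric                                   # mutual information
    return total, aleatoric, epistemic


# Passes that agree on an ambiguous input: the uncertainty is aleatoric.
print(decompose_uncertainty([0.5, 0.5, 0.5, 0.5]))
# Passes that disagree strongly: the uncertainty is mostly epistemic.
print(decompose_uncertainty([0.01, 0.99, 0.01, 0.99]))
```

The same computation can be applied to one column of the `(n_samples, B, 28)` probability tensor at a time, or vectorized over the first dimension.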

	Demonstration of model uncertainty utilization
	To validate the uncertainty estimates, the top X% most uncertain classifications (ranked by epistemic uncertainty) are rejected and the metrics are recomputed on the rest. Macro F1 rises from 0.488 to above 0.70, which indicates that the model's self-reported uncertainty is strongly correlated with its actual error rate.
	![F1 Rejection curve](outputs/f1_rejection_epistemic.png)
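
One way a point on such a rejection curve could be computed is sketched below for a single class: sort the classifications by uncertainty, drop the most uncertain fraction, and score the remainder. This is an illustration under assumptions (toy data, hypothetical function names, single-class F1), not the evaluation code behind the figure.

```python
def f1_score(y_true: list, y_pred: list) -> float:
    """Binary F1; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_after_rejection(y_true: list, y_pred: list, uncertainty: list,
                       reject_fraction: float) -> float:
    """F1 on the classifications left after dropping the most uncertain ones."""
    order = sorted(range(len(y_true)), key=lambda i: uncertainty[i])
    n_keep = len(order) - int(reject_fraction * len(order))
    kept = order[:n_keep]
    return f1_score([y_true[i] for i in kept], [y_pred[i] for i in kept])


# Toy data: the two wrong predictions carry the highest uncertainty.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1]
uncertainty = [0.01, 0.02, 0.01, 0.03, 0.30, 0.40]

for fraction in (0.0, 0.25, 0.5):
    score = f1_after_rejection(y_true, y_pred, uncertainty, fraction)
    print(f"reject {fraction:.0%}: F1 = {score:.3f}")
```

If uncertainty tracks errors, the score rises as the rejected fraction grows; if it does not, the curve stays flat.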


	Uncertainty quantification on GoEmotions test set for selected emotions
	- `admiration`: moderately frequent class (support 504)
	- `fear`: minority class (support 78)
	- `neutral`: the most frequent class (support 1787)

	| Admiration | Fear |
	| :---: | :---: |
	| ![Admiration Scatter](outputs/admiration_scatters.png) | ![Fear Scatter](outputs/fear_scatters.png) |

	Neutral
	![Neutral Scatter](outputs/neutral_scatters.png)




	Emotion uncertainty distribution
	| Epistemic | Aleatoric |
	| :---: | :---: |
	| ![Epistemic Ridge](outputs/ridge_epistemic.png) | ![Aleatoric Ridge](outputs/ridge_aleatoric.png) |

	Co-occurrence Confusion Matrix (normalized to Recall %)
	![Confusion Matrix](outputs/confusion_matrix.png)


	## Workflow
	![EmCoder Workflow](outputs/workflow.png)


	## Concrete Dropout Experiment
	An experimental branch of EmCoder integrated Concrete Dropout (Gal et al., 2017) to learn the dropout probabilities during training instead of fixing them. This marginally sharpened the isolation of extreme edge cases: with an optimized p=0.15, the first part of the F1-Rejection curve was slightly steeper. However, the heavier regularization constrained the capacity of the compact EmCoder and slightly degraded the standard macro metrics. Consequently, the production EmCoder model uses a fixed p=0.1.
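
For readers unfamiliar with Concrete Dropout, the sketch below shows the relaxation it relies on: the hard Bernoulli mask is replaced by a sigmoid of logistic noise, which is differentiable with respect to p and lets p be trained by gradient descent. The temperature of 0.1 and the function name are assumptions for illustration; this is not the code of the experimental branch.

```python
import math
import random


def concrete_keep_mask(p: float, u: float, temperature: float = 0.1) -> float:
    """Relaxed keep-mask value in (0, 1) for dropout probability p and noise u."""
    u = min(max(u, 1e-7), 1.0 - 1e-7)  # keep the logs finite
    logit = (math.log(p) - math.log(1.0 - p)
             + math.log(u) - math.log(1.0 - u)) / temperature
    drop = 1.0 / (1.0 + math.exp(-logit))  # soft "drop this unit" indicator
    return 1.0 - drop


random.seed(0)
p = 0.1  # the fixed dropout probability of the production model
masks = [concrete_keep_mask(p, random.random()) for _ in range(10_000)]
drop_rate = sum(m < 0.5 for m in masks) / len(masks)
print(f"empirical drop rate: {drop_rate:.3f}")
```

At a low temperature the mask values sit close to 0 or 1, and the share of dropped units approaches p, so the relaxation behaves like ordinary dropout while remaining trainable.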


	## Note
	This model was trained on the GoEmotions dataset (social-network domain) and may not generalize well to other domains.


	## Citation
	If you use this model, please cite it as follows:

	```bibtex
	@misc{jez2026emcoder,
	  author = {Václav Jež},
	  title = {EmCoder},
	  year = {2026},
	  publisher = {Hugging Face},
	  howpublished = {\url{https://huggingface.co/yezdata/EmCoder}},
	  version = {1.0.0}
	}
	```