AMIS Commodity Classifier
This model repository contains artifacts from an AMIS commodity relevance classifier training run. It includes the Transformer model, any configured TF-IDF or sentence-embedding baselines, prediction files, and the training report.
- Dataset: `faodl/amis-agri-stocks`
- Dataset subset: ``
- Dataset revision: `main`
- Text column: `chunk_text`
- Label column: `label`
- Transformer: `FacebookAI/xlm-roberta-base`
- Generated at: `2026-06-01T11:29:07.060277+00:00`
Dataset Summary
| Split | Rows | Label 0 | Label 1 | Unique groups | Mean text length |
|---|---|---|---|---|---|
| train | 4861 | 4443 | 418 | 2257 | 702.2 |
| validation | 1012 | 932 | 80 | 484 | 700.2 |
| test | 1093 | 1010 | 83 | 484 | 700.9 |
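The label counts above imply a strong class imbalance, which matters when reading the accuracy columns later in this report. As a minimal sketch (the dictionary layout and the `positive_rate` helper are illustrative, not part of the training code), the positive rate per split and the accuracy of always predicting label 0 can be computed like this:

```python
# Row counts copied from the Dataset Summary table.
splits = {
    "train": {"label_0": 4443, "label_1": 418},
    "validation": {"label_0": 932, "label_1": 80},
    "test": {"label_0": 1010, "label_1": 83},
}


def positive_rate(counts):
    """Share of rows with label 1 in a split (hypothetical helper)."""
    total = counts["label_0"] + counts["label_1"]
    return counts["label_1"] / total


rates = {name: positive_rate(counts) for name, counts in splits.items()}
for name, rate in rates.items():
    # A classifier that always predicts label 0 reaches accuracy 1 - rate.
    print(f"{name}: {rate:.1%} positive, majority-class accuracy {1 - rate:.1%}")
```

Because roughly 92% of rows carry label 0, a trivial majority-class predictor already scores about 0.92 accuracy, so precision, recall, F1, and average precision are the more informative columns.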
Threshold Comparison on Validation Split
Validation metrics document threshold selection and tuning behavior; test metrics remain the primary estimate of out-of-sample performance.
| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
|---|---|---|---|---|---|---|---|
| logistic_tfidf | 0.500 | 0.950 | 0.671 | 0.713 | 0.691 | 0.926 | 0.718 |
| logistic_tfidf | 0.501 | 0.950 | 0.671 | 0.713 | 0.691 | 0.926 | 0.718 |
| xgboost_tfidf | 0.500 | 0.955 | 0.854 | 0.512 | 0.641 | 0.955 | 0.771 |
| xgboost_tfidf | 0.117 | 0.949 | 0.646 | 0.775 | 0.705 | 0.955 | 0.771 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.883 | 0.396 | 0.900 | 0.550 | 0.932 | 0.641 |
| embedding-logistic_sentence_embeddings | 0.756 | 0.936 | 0.570 | 0.762 | 0.652 | 0.932 | 0.641 |
| embedding-svm_sentence_embeddings | 0.500 | 0.940 | 0.732 | 0.375 | 0.496 | 0.926 | 0.652 |
| embedding-svm_sentence_embeddings | 0.332 | 0.940 | 0.604 | 0.688 | 0.643 | 0.926 | 0.652 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.945 | 0.707 | 0.512 | 0.594 | 0.940 | 0.673 |
| embedding-lightgbm_sentence_embeddings | 0.214 | 0.947 | 0.671 | 0.637 | 0.654 | 0.940 | 0.673 |
| transformer | 0.500 | 0.965 | 0.778 | 0.787 | 0.783 | 0.968 | 0.851 |
| transformer | 0.978 | 0.969 | 0.855 | 0.738 | 0.792 | 0.968 | 0.851 |
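The tuned thresholds in this table are selected on the validation split. The report does not spell out the selection procedure, so the following is only a sketch under the assumption that the threshold with the highest validation F1 is kept; the helper names and the toy labels and probabilities are hypothetical.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 for the positive class when predicting 1 if probability >= threshold."""
    predictions = [int(p >= threshold) for p in probabilities]
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(labels, probabilities):
    """Sweep the observed probabilities and keep the threshold with the best F1.

    Starts from the default 0.5 so a tuned value is returned only if it is
    strictly better on the data provided.
    """
    best_threshold = 0.5
    best_f1 = f1_at_threshold(labels, probabilities, best_threshold)
    for candidate in sorted(set(probabilities)):
        score = f1_at_threshold(labels, probabilities, candidate)
        if score > best_f1:
            best_threshold, best_f1 = candidate, score
    return best_threshold, best_f1


# Toy validation data (hypothetical, not taken from this training run).
labels = [0, 0, 0, 0, 1, 0, 1, 1]
probabilities = [0.05, 0.10, 0.20, 0.60, 0.55, 0.70, 0.80, 0.90]

threshold, f1 = best_f1_threshold(labels, probabilities)
print({"threshold": threshold, "validation_f1": round(f1, 3)})
```

A threshold chosen this way is fitted to the validation split, which is why the test-split table is the fairer check of whether the tuned value generalizes.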
Threshold Comparison on Test Split
| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
|---|---|---|---|---|---|---|---|
| logistic_tfidf | 0.500 | 0.952 | 0.679 | 0.687 | 0.683 | 0.890 | 0.713 |
| logistic_tfidf | 0.501 | 0.952 | 0.679 | 0.687 | 0.683 | 0.890 | 0.713 |
| xgboost_tfidf | 0.500 | 0.958 | 0.803 | 0.590 | 0.681 | 0.914 | 0.700 |
| xgboost_tfidf | 0.117 | 0.927 | 0.514 | 0.663 | 0.579 | 0.914 | 0.700 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.873 | 0.359 | 0.855 | 0.505 | 0.951 | 0.626 |
| embedding-logistic_sentence_embeddings | 0.756 | 0.928 | 0.516 | 0.783 | 0.622 | 0.951 | 0.626 |
| embedding-svm_sentence_embeddings | 0.500 | 0.952 | 0.772 | 0.530 | 0.629 | 0.950 | 0.646 |
| embedding-svm_sentence_embeddings | 0.332 | 0.939 | 0.578 | 0.711 | 0.638 | 0.950 | 0.646 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.954 | 0.739 | 0.614 | 0.671 | 0.944 | 0.715 |
| embedding-lightgbm_sentence_embeddings | 0.214 | 0.948 | 0.671 | 0.614 | 0.642 | 0.944 | 0.715 |
| transformer | 0.500 | 0.949 | 0.636 | 0.759 | 0.692 | 0.958 | 0.783 |
| transformer | 0.978 | 0.959 | 0.744 | 0.699 | 0.720 | 0.958 | 0.783 |
Confusion Matrices on Test Split
Rows are true labels and columns are predicted labels.
logistic_tfidf at threshold 0.500
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 983 | 27 |
| RELEVANT | 26 | 57 |
logistic_tfidf at threshold 0.501
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 983 | 27 |
| RELEVANT | 26 | 57 |
xgboost_tfidf at threshold 0.500
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 998 | 12 |
| RELEVANT | 34 | 49 |
xgboost_tfidf at threshold 0.117
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 958 | 52 |
| RELEVANT | 28 | 55 |
embedding-logistic_sentence_embeddings at threshold 0.500
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 883 | 127 |
| RELEVANT | 12 | 71 |
embedding-logistic_sentence_embeddings at threshold 0.756
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 949 | 61 |
| RELEVANT | 18 | 65 |
embedding-svm_sentence_embeddings at threshold 0.500
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 997 | 13 |
| RELEVANT | 39 | 44 |
embedding-svm_sentence_embeddings at threshold 0.332
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 967 | 43 |
| RELEVANT | 24 | 59 |
embedding-lightgbm_sentence_embeddings at threshold 0.500
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 992 | 18 |
| RELEVANT | 32 | 51 |
embedding-lightgbm_sentence_embeddings at threshold 0.214
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 985 | 25 |
| RELEVANT | 32 | 51 |
transformer at threshold 0.500
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 974 | 36 |
| RELEVANT | 20 | 63 |
transformer at threshold 0.978
| True / Predicted | NOT_RELEVANT | RELEVANT |
|---|---|---|
| NOT_RELEVANT | 990 | 20 |
| RELEVANT | 25 | 58 |
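The confusion matrices and the test-split metrics describe the same predictions. As a worked check (the `metrics_from_confusion` helper is illustrative), the sketch below recomputes accuracy, precision, recall, and F1 for the transformer at threshold 0.978 from its four cell counts, treating RELEVANT as the positive class:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the headline metrics from the four confusion-matrix cells."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


# Cells from the "transformer at threshold 0.978" matrix:
# row NOT_RELEVANT -> (tn, fp), row RELEVANT -> (fn, tp).
metrics = metrics_from_confusion(tn=990, fp=20, fn=25, tp=58)
print({name: round(value, 3) for name, value in metrics.items()})
```

Rounded to three decimals, these values agree with the transformer row at threshold 0.978 in the test-split table (accuracy 0.959, precision 0.744, recall 0.699, F1 0.720).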
Validation-Tuned Thresholds
- `logistic_tfidf`: threshold `0.501` (validation F1 `0.691`); test F1 change vs 0.5: `+0.000`.
- `xgboost_tfidf`: threshold `0.117` (validation F1 `0.705`); test F1 change vs 0.5: `-0.102`.
- `embedding-logistic_sentence_embeddings`: threshold `0.756` (validation F1 `0.652`); test F1 change vs 0.5: `+0.117`.
- `embedding-svm_sentence_embeddings`: threshold `0.332` (validation F1 `0.643`); test F1 change vs 0.5: `+0.009`.
- `embedding-lightgbm_sentence_embeddings`: threshold `0.214` (validation F1 `0.654`); test F1 change vs 0.5: `-0.030`.
- `transformer`: threshold `0.978` (validation F1 `0.792`); test F1 change vs 0.5: `+0.028`.
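The test F1 changes listed above can be approximately recomputed from the rounded F1 values in the test-split table. The sketch below does that; the dictionary layout is illustrative, and because it works from three-decimal table values, its results can differ from the reported changes in the last digit.

```python
# Test-split F1 per model: (F1 at threshold 0.5, F1 at the validation-tuned threshold),
# copied from the "Threshold Comparison on Test Split" table.
test_f1 = {
    "logistic_tfidf": (0.683, 0.683),
    "xgboost_tfidf": (0.681, 0.579),
    "embedding-logistic_sentence_embeddings": (0.505, 0.622),
    "embedding-svm_sentence_embeddings": (0.629, 0.638),
    "embedding-lightgbm_sentence_embeddings": (0.671, 0.642),
    "transformer": (0.692, 0.720),
}

# Positive values mean the tuned threshold improved test F1 over 0.5.
deltas = {
    name: round(tuned - default, 3)
    for name, (default, tuned) in test_f1.items()
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.3f}")
```

Tuning helps the transformer and the embedding-logistic baseline on the test split, while the thresholds tuned for `xgboost_tfidf` and `embedding-lightgbm_sentence_embeddings` lower test F1, a sign that those thresholds fit the validation split more closely than they generalize.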
Artifacts
- `logistic_tfidf`: `/content/agri-stocks-classifier/baselines/logistic`
- `xgboost_tfidf`: `/content/agri-stocks-classifier/baselines/xgboost`
- `embedding-logistic_sentence_embeddings`: `/content/agri-stocks-classifier/baselines/embedding-logistic`
- `embedding-svm_sentence_embeddings`: `/content/agri-stocks-classifier/baselines/embedding-svm`
- `embedding-lightgbm_sentence_embeddings`: `/content/agri-stocks-classifier/baselines/embedding-lightgbm`
- `transformer`: `/content/agri-stocks-classifier/transformer`
Inference
Install the runtime dependencies:

    pip install transformers torch huggingface_hub pandas joblib scikit-learn xgboost lightgbm

Transformer

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Repository ID of this model; the Transformer artifacts live in "transformer/".
    MODEL_ID = "faodl/agri-stocks-classifier"

    texts = [
        "Rice export prices increased after new procurement rules were announced.",
        "The finance ministry released its monthly fuel tax bulletin.",
    ]

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="transformer")
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, subfolder="transformer")

    # Use the threshold stored in the model config, falling back to 0.5 if absent.
    threshold = float(getattr(model.config, "threshold", 0.5))

    encoded = tokenizer(
        texts,
        truncation=True,
        padding=True,
        max_length=256,
        return_tensors="pt",
    )

    with torch.no_grad():
        logits = model(**encoded).logits

    # Probability of the positive class (index 1).
    probabilities = torch.softmax(logits, dim=-1)[:, 1].tolist()

    for text, probability in zip(texts, probabilities):
        label = model.config.id2label[int(probability >= threshold)]
        print({"text": text, "probability_positive": probability, "label": label})

TF-IDF Baselines
Available baseline names in this run: "logistic", "xgboost".

    import json

    import joblib
    from huggingface_hub import hf_hub_download

    # Repository ID of this model.
    MODEL_ID = "faodl/agri-stocks-classifier"
    BASELINE = "logistic"  # or "xgboost"

    texts = [
        "Maize production forecasts were revised after delayed rains.",
        "The central bank published new exchange rate statistics.",
    ]

    # Download the fitted scikit-learn pipeline and the machine-readable report.
    model_path = hf_hub_download(
        repo_id=MODEL_ID,
        repo_type="model",
        filename=f"baselines/{BASELINE}/{BASELINE}_tfidf.joblib",
    )
    report_path = hf_hub_download(
        repo_id=MODEL_ID,
        repo_type="model",
        filename="report.json",
    )

    pipeline = joblib.load(model_path)
    with open(report_path, encoding="utf-8") as handle:
        report = json.load(handle)

    # Look up the validation-tuned threshold for the selected baseline.
    threshold = next(
        result["validation_best_threshold"]["threshold"]
        for result in report["results"]
        if result["model_type"] == f"{BASELINE}_tfidf"
    )

    probabilities = pipeline.predict_proba(texts)[:, 1]

    for text, probability in zip(texts, probabilities):
        label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
        print({"text": text, "probability_positive": float(probability), "label": label})

Sentence-Embedding Baselines
Available embedding baseline names in this run: "embedding-logistic", "embedding-svm", "embedding-lightgbm".

    import joblib
    import torch
    from huggingface_hub import hf_hub_download
    from transformers import AutoModel, AutoTokenizer

    # Repository ID of this model.
    MODEL_ID = "faodl/agri-stocks-classifier"
    BASELINE = "embedding-logistic"  # or "embedding-svm", "embedding-lightgbm"

    texts = [
        "Wheat export inspections rose as demand from importers increased.",
        "The sports ministry announced a new stadium renovation plan.",
    ]

    model_path = hf_hub_download(
        repo_id=MODEL_ID,
        repo_type="model",
        filename=f"baselines/{BASELINE}/{BASELINE}.joblib",
    )
    # The artifact bundles the classifier with its embedding settings and threshold.
    artifact = joblib.load(model_path)

    tokenizer = AutoTokenizer.from_pretrained(artifact["embedding_model_name"])
    encoder = AutoModel.from_pretrained(artifact["embedding_model_name"])
    encoder.eval()

    encoded_batches = []
    batch_size = artifact.get("embedding_batch_size", 64)

    for start in range(0, len(texts), batch_size):
        batch_texts = texts[start : start + batch_size]
        inputs = tokenizer(
            batch_texts,
            padding=True,
            truncation=True,
            max_length=artifact.get("embedding_max_length", 256),
            return_tensors="pt",
        )
        with torch.no_grad():
            outputs = encoder(**inputs)

        # Mean-pool token embeddings, ignoring padding positions.
        token_embeddings = outputs.last_hidden_state
        attention_mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
        embeddings = (token_embeddings * attention_mask).sum(dim=1)
        embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)

        if artifact.get("normalize_embeddings", True):
            embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
        encoded_batches.append(embeddings)

    embeddings = torch.cat(encoded_batches).numpy()
    probabilities = artifact["classifier"].predict_proba(embeddings)[:, 1]
    threshold = artifact["validation_best_threshold"]["threshold"]

    for text, probability in zip(texts, probabilities):
        label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
        print({"text": text, "probability_positive": float(probability), "label": label})

Files
- `REPORT.md`: Markdown report for this training run.
- `report.json`: Machine-readable report containing metrics and thresholds.
- `transformer/`: Fine-tuned Transformer artifacts, when Transformer training is enabled.
- `baselines/`: TF-IDF and sentence-embedding baseline artifacts, when baseline training is enabled.
- `*/validation_predictions.csv` and `*/test_predictions.csv`: Split-level predictions.