Model Card for Mentalii/sentiment-tweets-pos-neg-epoch3

This is a bert-base-uncased model fine-tuned for binary sentiment analysis of English tweets and short texts (e.g., product reviews). It predicts the sentiment of a given text as Positive or Negative.

This model is intended for direct use as a sentiment classifier for English tweets and short texts, or as a base for further fine-tuning on related sentiment analysis tasks.
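For direct use, a minimal inference sketch with the transformers pipeline API. The model id is the one this card describes; since the card does not state custom label names, the output labels are assumed to be the default LABEL_0/LABEL_1 (negative/positive):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint; weights are downloaded on first run.
classifier = pipeline(
    "sentiment-analysis",
    model="Mentalii/sentiment-tweets-pos-neg-epoch3",
)

result = classifier("I absolutely love this product!")
print(result)  # e.g. [{'label': ..., 'score': ...}]
```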

Model Description

  • Model type: NLP sentiment classification
  • Language(s) (NLP): English
  • License: Apache 2.0
  • Finetuned from model: bert-base-uncased

Trained exclusively on synthetic English data generated by advanced LLMs. Dataset used: https://www.kaggle.com/datasets/zphudzz/tweets-clean-posneg-v1/data

Uses

  • Social media analysis
  • Customer feedback analysis
  • Product reviews classification

Training Details

The original merged dataset contains roughly 1.6 million rows. We downsampled it to 200k rows before preprocessing. After preprocessing, the dataset was reduced to about 162.5k rows, near-evenly distributed across labels: Negative (81,781) and Positive (80,733).

Training Data

The training data is a merged and cleaned combination of the sentiment140, twitter-tweets, and twitter-sentiment datasets.

Figure: distribution of word counts in the Text column; most tweets have fewer than 30 words.

Training Procedure

📦 Model & Tokenizer

Model: BertForSequenceClassification with num_labels=2

Tokenizer: BertTokenizer with max length of 128

Dropout: hidden_dropout_prob=0.3 for regularization

Fine-tuned for up to 4 epochs, with the best checkpoint (epoch 3) retained via load_best_model_at_end. Achieved a train_acc_off_by_one of approximately 0.83 on the validation set; accuracy on the test set: 0.84 (84%).


Preprocessing and Dataset

Preprocessing prepares a large-scale tweet dataset for binary sentiment classification by applying a robust, multi-stage cleaning pipeline. It ensures linguistic consistency, removes noise, balances the class distribution, and filters for high-confidence English text, all in preparation for fine-tuning a BERT-based model.

📂 Dataset Source

Origin: Kaggle Dataset

File Used: final_clean_no_neutral_no_duplicates.csv

Initial Size: ~1.6M labeled tweets

🔄 Label Normalization

Original labels: 0 (negative), 4 (positive)

Converted to: 0 (negative), 1 (positive)
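The 4 → 1 relabeling amounts to a simple mapping onto the contiguous labels a two-class classification head expects (the helper name here is illustrative):

```python
# Map the original Sentiment140-style labels {0: negative, 4: positive}
# onto contiguous binary labels {0: negative, 1: positive}.
LABEL_MAP = {0: 0, 4: 1}

def normalize_labels(labels):
    """Convert raw labels (0/4) to binary labels (0/1)."""
    return [LABEL_MAP[label] for label in labels]

print(normalize_labels([0, 4, 4, 0]))  # -> [0, 1, 1, 0]
```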

βš–οΈ Class Balancing Sampled 100,000 tweets from each class

Final balanced dataset: 200,000 rows

Shuffled and reset index for training consistency
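A stdlib-only sketch of the balancing and shuffling steps (the card samples 100,000 per class; the sample sizes, seed, and row format below are illustrative):

```python
import random

def balance_classes(rows, per_class, seed=42):
    """Sample `per_class` rows from each label, then shuffle the result.

    `rows` is a list of (text, label) tuples with labels 0 and 1.
    """
    rng = random.Random(seed)
    negatives = [r for r in rows if r[1] == 0]
    positives = [r for r in rows if r[1] == 1]
    sample = rng.sample(negatives, per_class) + rng.sample(positives, per_class)
    rng.shuffle(sample)  # mirrors the shuffle-and-reset-index step
    return sample

rows = [("neg tweet %d" % i, 0) for i in range(500)] + \
       [("pos tweet %d" % i, 1) for i in range(500)]
balanced = balance_classes(rows, per_class=100)
print(len(balanced))  # -> 200
```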

🧹 Text Cleaning Steps

  • Chat abbreviation expansion: replaces common acronyms (e.g., BRB → Be Right Back, CEO → Chief Executive Officer) using a custom dictionary.
  • Punctuation removal: uses str.translate with string.punctuation for fast, vectorized stripping.
  • Lowercasing: converts all text to lowercase for uniform tokenization.
  • Emoji & non-ASCII removal: regex-based emoji stripping, plus .encode('ascii', 'ignore') to drop remaining non-ASCII characters.
  • URL, email, mention, hashtag, and HTML tag removal: regex-based filters to eliminate noisy tokens and metadata.
  • Whitespace normalization: strips leading/trailing spaces and collapses repeated spaces.
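A stdlib-only sketch of the cleaning steps above. The abbreviation dictionary is a tiny illustrative slice, and this sketch strips punctuation before expanding abbreviations so tokens like "BRB," still match the dictionary; the exact step order is an assumption, since the card only lists the operations:

```python
import re
import string

# Tiny illustrative slice of the custom abbreviation dictionary.
ABBREVIATIONS = {"brb": "be right back", "ceo": "chief executive officer"}

def clean_tweet(text: str) -> str:
    text = text.lower()                                   # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)    # URLs
    text = re.sub(r"\S+@\S+", " ", text)                  # emails
    text = re.sub(r"[@#]\w+", " ", text)                  # mentions / hashtags
    text = re.sub(r"<[^>]+>", " ", text)                  # HTML tags
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = " ".join(ABBREVIATIONS.get(w, w) for w in text.split())    # expand acronyms
    text = text.encode("ascii", "ignore").decode()        # emoji / non-ASCII
    return re.sub(r"\s+", " ", text).strip()              # whitespace

print(clean_tweet("BRB, check https://t.co/x @user #hype 😀"))
# -> be right back check
```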

🌐 Language Filtering

Uses langdetect with detect_langs() for probabilistic language detection

Keeps only rows with ≥90% confidence of English

Parallelized with pandarallel for speed

Final Size After Filtering:

English rows kept: 162,514

Non-English rows removed: 37,481

Class distribution:

0: 81,781

1: 80,733

Dataset Preparation

  • Converting the preprocessed df_balanced into a Hugging Face Dataset

  • Applying static padding (max_length=128) for faster throughput compared to dynamic padding

  • Splitting into train (70%), validation (15%), and test (15%) sets

  • Removing the raw text column after tokenization to reduce memory usage
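The 70/15/15 split and static padding can be sketched stdlib-only. Real tokenization uses BertTokenizer; the toy token ids, pad id 0, and function names below are assumptions for illustration:

```python
import random

def train_val_test_split(items, seed=42):
    """Shuffle, then split into 70% train, 15% validation, 15% test."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_val = int(n * 0.70), int(n * 0.15)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])

def pad_to_length(token_ids, max_length=128, pad_id=0):
    """Static padding: every example becomes exactly `max_length` ids."""
    return (token_ids + [pad_id] * max_length)[:max_length]

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))       # -> 700 150 150
print(len(pad_to_length([101, 2023, 102])))  # -> 128
```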

Training Hyperparameters

  • Training regime: fp16 mixed precision

  • TrainingArguments(
        num_train_epochs=4,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_steps=500,
        label_smoothing_factor=0.1,
        fp16=True,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
        save_total_limit=2,
        logging_steps=50,
        eval_strategy='epoch',
        save_strategy='epoch',
    )

Evaluation

Final Evaluation Test Accuracy: 83.54%

Label-wise Accuracy:

Negative (0): 86.07%

Positive (1): 81.01%

Confusion Matrix (rows = true label 0/1, columns = predicted label 0/1):

[[10512  1702]
 [ 2310  9853]]

Classification Report:

Precision: ~83.6%

Recall: ~83.5%

F1-score: ~83.5%
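The reported figures can be reproduced directly from the confusion matrix above (rows are true labels 0/1, columns are predicted labels 0/1):

```python
cm = [[10512, 1702],   # true negatives: 10,512 correct, 1,702 misclassified
      [2310, 9853]]    # true positives: 2,310 misclassified, 9,853 correct

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
neg_recall = cm[0][0] / sum(cm[0])  # label-wise accuracy, class 0
pos_recall = cm[1][1] / sum(cm[1])  # label-wise accuracy, class 1

print(f"{accuracy:.2%} {neg_recall:.2%} {pos_recall:.2%}")
# -> 83.54% 86.07% 81.01%
```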

Metrics

  • Accuracy
  • Precision
  • Recall
  • F1-score (macro average)


Hardware

NVIDIA T4 GPU (on Colab) or 2× T4 (on Kaggle)
