Model Card for Mentalii/sentiment-tweets-pos-neg-epoch3
This is a bert-base-uncased model fine-tuned for sentiment analysis on English tweets and product reviews. It predicts the sentiment of a given text as Positive or Negative.
This model is intended for direct use as a sentiment classifier for English tweets and short texts, or for further fine-tuning on related sentiment analysis tasks.
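For direct use, a minimal inference sketch with the transformers `pipeline` API (the model ID is this card's; the exact label strings returned depend on the `id2label` mapping saved with the checkpoint):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="Mentalii/sentiment-tweets-pos-neg-epoch3",
)

print(classifier("I absolutely love this product!"))
print(classifier("Worst purchase I have ever made."))
```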
Model Description
- Model type: NLP sentiment classification
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: bert-base-uncased
Trained exclusively on synthetic English data generated by advanced LLMs. Dataset used: https://www.kaggle.com/datasets/zphudzz/tweets-clean-posneg-v1/data
Uses
- Social media analysis
- Customer feedback analysis
- Product reviews classification
Training Details
The original dataset contains roughly 1.6 million rows. We downsampled it to 200,000 rows before preprocessing.
After preprocessing, the dataset was reduced to ~162.5k rows, almost evenly distributed between the labels: Negative (81,781) and Positive (80,733).
Training Data
The data is a merged and cleaned dataset built from sentiment140, twitter-tweets, and twitter-sentiment.
Counting the words in the Text column shows that most tweets have fewer than 30 words.
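The word-count statistic can be reproduced with pandas (the column name `Text` is taken from this card; the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "great phone love it",
    "battery died after two days terrible",
]})

# Words per tweet: split on whitespace and count the tokens
word_counts = df["Text"].str.split().str.len()
print(word_counts.tolist())            # [4, 6]
print((word_counts < 30).mean())       # fraction of tweets under 30 words: 1.0
```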
Training Procedure
Model & Tokenizer
- Model: BertForSequenceClassification with num_labels=2
- Tokenizer: BertTokenizer with max length of 128
- Dropout: hidden_dropout_prob=0.3 for regularization
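A sketch of this setup, instantiated offline from a config (weights here are random; the actual training starts from the pretrained bert-base-uncased checkpoint via `from_pretrained`):

```python
from transformers import BertConfig, BertForSequenceClassification

# Mirror the settings from this card: 2 labels, extra dropout for regularization
config = BertConfig(
    num_labels=2,
    hidden_dropout_prob=0.3,
)
model = BertForSequenceClassification(config)

print(model.config.num_labels)           # 2
print(model.config.hidden_dropout_prob)  # 0.3
```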
Fine-tuned for 3 epochs. Achieved approximately 0.83 accuracy (the train_acc_off_by_one metric) on the validation set, and 0.84 (84%) accuracy on the test set.
Preprocessing and Dataset
Preprocessing prepares a large-scale tweet dataset for binary sentiment classification by applying a robust, multi-stage cleaning pipeline. It ensures linguistic consistency, removes noise, balances the class distribution, and filters for high-confidence English text, all in preparation for fine-tuning a BERT-based model.
Dataset Source
Origin: Kaggle Dataset
File Used: final_clean_no_neutral_no_duplicates.csv
Initial Size: ~1.6M labeled tweets
Label Normalization
Original labels: 0 (negative), 4 (positive)
Converted to: 0 (negative), 1 (positive)
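A minimal pandas sketch of the 0/4 to 0/1 relabeling (the column name `label` is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"label": [0, 4, 4, 0]})

# Map the original sentiment140-style labels to a contiguous 0/1 scheme
df["label"] = df["label"].map({0: 0, 4: 1})
print(df["label"].tolist())  # [0, 1, 1, 0]
```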
Class Balancing
Sampled 100,000 tweets from each class
Final balanced dataset: 200,000 rows
Shuffled and reset index for training consistency
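The balancing step can be sketched as per-class sampling followed by a shuffle (100,000 per class in the actual pipeline; tiny numbers here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "label": [0] * 6 + [1] * 4,
})

n_per_class = 3  # 100_000 in the actual pipeline
balanced = df.groupby("label").sample(n=n_per_class, random_state=42)

# Shuffle and reset the index for training consistency
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["label"].value_counts().to_dict())  # {0: 3, 1: 3} in some order
```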
Text Cleaning Steps
Chat Abbreviation Expansion: replaces common acronyms (e.g., BRB → Be Right Back, CEO → Chief Executive Officer) using a custom dictionary.
Punctuation Removal: uses str.translate with string.punctuation for fast stripping.
Lowercasing: converts all text to lowercase for uniform tokenization.
Emoji & Non-ASCII Removal:
- Regex-based emoji stripping
- .encode('ascii', 'ignore') to drop non-ASCII characters
URL, Email, Mention, Hashtag, HTML Tag Removal: regex-based filters to eliminate noisy tokens and metadata.
Whitespace Normalization: strips leading/trailing spaces and collapses multiple spaces.
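The cleaning steps above can be sketched as a single function (the abbreviation dictionary here is a tiny illustrative subset of the custom dictionary the card mentions, and the exact regexes are assumptions):

```python
import re
import string

# Illustrative subset of the chat-abbreviation dictionary
ABBREVIATIONS = {"brb": "be right back", "ceo": "chief executive officer"}

def clean_tweet(text: str) -> str:
    text = text.lower()
    # Remove URLs, emails, @mentions, #hashtags, and HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    # Expand chat abbreviations word by word
    text = " ".join(ABBREVIATIONS.get(w, w) for w in text.split())
    # Strip punctuation, then drop emoji / non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.encode("ascii", "ignore").decode()
    # Normalize whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("BRB see you <b>soon</b> https://example.com @user 😀"))
# be right back see you soon
```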
Language Filtering
- Uses langdetect's detect_langs() for probabilistic language detection
- Keeps only rows with ≥90% confidence of English
- Parallelized with pandarallel for speed
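The confidence filter itself can be sketched independently of langdetect (the detector is injected here and returns language/probability pairs; the card's pipeline uses langdetect's detect_langs, parallelized with pandarallel):

```python
def is_confident_english(text, detect_langs, threshold=0.90):
    """Keep a row only if the top detected language is English with >= threshold probability."""
    try:
        langs = detect_langs(text)  # list of (lang, prob) pairs, best first
    except Exception:
        return False  # undetectable text is dropped
    if not langs:
        return False
    lang, prob = langs[0]
    return lang == "en" and prob >= threshold

# Stub detector standing in for langdetect.detect_langs
def fake_detect(text):
    return [("en", 0.99)] if "the" in text else [("fr", 0.95)]

rows = ["the cat sat", "le chat noir"]
kept = [t for t in rows if is_confident_english(t, fake_detect)]
print(kept)  # ['the cat sat']
```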
Final Size After Filtering:
English rows kept: 162,514
Non-English rows removed: 37,481
Class distribution:
0: 81,781
1: 80,733
Dataset Preparation
- Convert the preprocessed df_balanced into a Hugging Face Dataset
- Apply static padding (max_length=128) for speed (compared to dynamic padding)
- Split into train (70%), validation (15%), and test (15%)
- Remove the raw text column after tokenization to save memory
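The 70/15/15 split can be sketched with plain pandas (in the actual pipeline this is done on the Hugging Face Dataset object):

```python
import pandas as pd

df = pd.DataFrame({"text": [f"tweet {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})

# Shuffle once, then carve out 70% train, 15% validation, 15% test
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(df)
train = df.iloc[: int(0.70 * n)]
val   = df.iloc[int(0.70 * n): int(0.85 * n)]
test  = df.iloc[int(0.85 * n):]

print(len(train), len(val), len(test))  # 70 15 15
```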
Training Hyperparameters
Training regime: fp16 mixed precision
    TrainingArguments(
        num_train_epochs=4,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_steps=500,
        label_smoothing_factor=0.1,
        fp16=True,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
        save_total_limit=2,
        logging_steps=50,
        eval_strategy='epoch',
        save_strategy='epoch',
    )
Evaluation
Final Evaluation
Test Accuracy: 83.54%
Label-wise Accuracy:
Negative (0): 86.07%
Positive (1): 81.01%
Confusion Matrix:
[[10512  1702]
 [ 2310  9853]]
Classification Report:
Precision: ~83.6%
Recall: ~83.5%
F1-score: ~83.5%
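These numbers are internally consistent; they can be recomputed from the confusion matrix (rows = true labels, columns = predictions):

```python
cm = [[10512, 1702],   # true negatives: correct, misclassified
      [2310, 9853]]    # true positives: misclassified, correct

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
neg_acc = cm[0][0] / sum(cm[0])   # accuracy (recall) on the negative class
pos_acc = cm[1][1] / sum(cm[1])   # accuracy (recall) on the positive class

print(f"{accuracy:.2%} {neg_acc:.2%} {pos_acc:.2%}")  # 83.54% 86.07% 81.01%
```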
Metrics
Accuracy
Precision
Recall
F1-score (macro average)
Results
Hardware
GPU: T4 (on Colab) or T4 x2 (on Kaggle)
Model tree for Mentalii/sentiment-tweets-pos-neg-epoch3
Base model
google-bert/bert-base-uncased


