Model Card for Mentalii/sentiment-tweets-pos-neg-epoch3
This is a bert-base-uncased model fine-tuned for sentiment analysis on English tweets and product reviews. It predicts the sentiment of a given text as Positive or Negative.
This model is intended for direct use as a sentiment classifier for English tweets and short texts, or for further fine-tuning on related sentiment analysis tasks.
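For direct use, a minimal inference sketch with the transformers `pipeline` API (the model ID is this card's; the exact label strings returned depend on the `id2label` mapping saved with the checkpoint):

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub
classifier = pipeline(
    "text-classification",
    model="Mentalii/sentiment-tweets-pos-neg-epoch3",
)

print(classifier("I absolutely love this product!"))
print(classifier("Worst purchase I have ever made."))
```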
Model Description
- Model type: NLP sentiment classification
- Language(s) (NLP): English
- License: Apache 2.0
- Finetuned from model: bert-base-uncased
Trained exclusively on synthetic English data generated by advanced LLMs. Dataset used: https://www.kaggle.com/datasets/zphudzz/tweets-clean-posneg-v1/data
Uses
- Social media analysis
- Customer feedback analysis
- Product reviews classification
Training Details
The original dataset contains roughly 1.6 million rows. We downsampled it to 200,000 rows before preprocessing.
After preprocessing, the dataset was reduced to ~162.5k rows, almost evenly distributed between the labels: Negative (81,781) and Positive (80,733).
Training Data
The data is a merged and cleaned dataset built from sentiment140, twitter-tweets, and twitter-sentiment.
Counting the words in the Text column shows that most tweets have fewer than 30 words.
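The word-count statistic can be reproduced with pandas (the column name `Text` is taken from this card; the data here is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "great phone love it",
    "battery died after two days terrible",
]})

# Words per tweet: split on whitespace and count the tokens
word_counts = df["Text"].str.split().str.len()
print(word_counts.tolist())            # [4, 6]
print((word_counts < 30).mean())       # fraction of tweets under 30 words: 1.0
```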
Training Procedure
Model & Tokenizer
- Model: BertForSequenceClassification with num_labels=2
- Tokenizer: BertTokenizer with max length of 128
- Dropout: hidden_dropout_prob=0.3 for regularization
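A sketch of this setup, instantiated offline from a config (weights here are random; the actual training starts from the pretrained bert-base-uncased checkpoint via `from_pretrained`):

```python
from transformers import BertConfig, BertForSequenceClassification

# Mirror the settings from this card: 2 labels, extra dropout for regularization
config = BertConfig(
    num_labels=2,
    hidden_dropout_prob=0.3,
)
model = BertForSequenceClassification(config)

print(model.config.num_labels)           # 2
print(model.config.hidden_dropout_prob)  # 0.3
```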
Fine-tuned for 3 epochs. Achieved approximately 0.83 accuracy (the train_acc_off_by_one metric) on the validation set, and 0.84 (84%) accuracy on the test set.
Preprocessing and Dataset
Preprocessing prepares a large-scale tweet dataset for binary sentiment classification by applying a robust, multi-stage cleaning pipeline. It ensures linguistic consistency, removes noise, balances the class distribution, and filters for high-confidence English text, all in preparation for fine-tuning a BERT-based model.
Dataset Source
Origin: Kaggle Dataset
File Used: final_clean_no_neutral_no_duplicates.csv
Initial Size: ~1.6M labeled tweets
Label Normalization
Original labels: 0 (negative), 4 (positive)
Converted to: 0 (negative), 1 (positive)
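A minimal pandas sketch of the 0/4 to 0/1 relabeling (the column name `label` is an assumption):

```python
import pandas as pd

df = pd.DataFrame({"label": [0, 4, 4, 0]})

# Map the original sentiment140-style labels to a contiguous 0/1 scheme
df["label"] = df["label"].map({0: 0, 4: 1})
print(df["label"].tolist())  # [0, 1, 1, 0]
```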
Class Balancing
Sampled 100,000 tweets from each class
Final balanced dataset: 200,000 rows
Shuffled and reset index for training consistency
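The balancing step can be sketched as per-class sampling followed by a shuffle (100,000 per class in the actual pipeline; tiny numbers here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "label": [0] * 6 + [1] * 4,
})

n_per_class = 3  # 100_000 in the actual pipeline
balanced = df.groupby("label").sample(n=n_per_class, random_state=42)

# Shuffle and reset the index for training consistency
balanced = balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print(balanced["label"].value_counts().to_dict())  # {0: 3, 1: 3} in some order
```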
Text Cleaning Steps
Chat Abbreviation Expansion: replaces common acronyms (e.g., BRB → Be Right Back, CEO → Chief Executive Officer) using a custom dictionary.
Punctuation Removal: uses str.translate with string.punctuation for fast stripping.
Lowercasing: converts all text to lowercase for uniform tokenization.
Emoji & Non-ASCII Removal:
- Regex-based emoji stripping
- .encode('ascii', 'ignore') to drop non-ASCII characters
URL, Email, Mention, Hashtag, HTML Tag Removal: regex-based filters to eliminate noisy tokens and metadata.
Whitespace Normalization: strips leading/trailing spaces and collapses multiple spaces.
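The cleaning steps above can be sketched as a single function (the abbreviation dictionary here is a tiny illustrative subset of the custom dictionary the card mentions, and the exact regexes are assumptions):

```python
import re
import string

# Illustrative subset of the chat-abbreviation dictionary
ABBREVIATIONS = {"brb": "be right back", "ceo": "chief executive officer"}

def clean_tweet(text: str) -> str:
    text = text.lower()
    # Remove URLs, emails, @mentions, #hashtags, and HTML tags
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"[@#]\w+", " ", text)
    text = re.sub(r"<[^>]+>", " ", text)
    # Expand chat abbreviations word by word
    text = " ".join(ABBREVIATIONS.get(w, w) for w in text.split())
    # Strip punctuation, then drop emoji / non-ASCII characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.encode("ascii", "ignore").decode()
    # Normalize whitespace
    return re.sub(r"\s+", " ", text).strip()

print(clean_tweet("BRB see you <b>soon</b> https://example.com @user 😀"))
# be right back see you soon
```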
Language Filtering
- Uses langdetect's detect_langs() for probabilistic language detection
- Keeps only rows with ≥90% confidence of English
- Parallelized with pandarallel for speed
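The confidence filter itself can be sketched independently of langdetect (the detector is injected here and returns language/probability pairs; the card's pipeline uses langdetect's detect_langs, parallelized with pandarallel):

```python
def is_confident_english(text, detect_langs, threshold=0.90):
    """Keep a row only if the top detected language is English with >= threshold probability."""
    try:
        langs = detect_langs(text)  # list of (lang, prob) pairs, best first
    except Exception:
        return False  # undetectable text is dropped
    if not langs:
        return False
    lang, prob = langs[0]
    return lang == "en" and prob >= threshold

# Stub detector standing in for langdetect.detect_langs
def fake_detect(text):
    return [("en", 0.99)] if "the" in text else [("fr", 0.95)]

rows = ["the cat sat", "le chat noir"]
kept = [t for t in rows if is_confident_english(t, fake_detect)]
print(kept)  # ['the cat sat']
```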
Final Size After Filtering:
English rows kept: 162,514
Non-English rows removed: 37,481
Class distribution:
0: 81,781
1: 80,733
Dataset Preparation
- Convert the preprocessed df_balanced into a Hugging Face Dataset
- Apply static padding (max_length=128) for speed (compared to dynamic padding)
- Split into train (70%), validation (15%), and test (15%)
- Remove the raw text column after tokenization to save memory
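The 70/15/15 split can be sketched with plain pandas (in the actual pipeline this is done on the Hugging Face Dataset object):

```python
import pandas as pd

df = pd.DataFrame({"text": [f"tweet {i}" for i in range(100)],
                   "label": [i % 2 for i in range(100)]})

# Shuffle once, then carve out 70% train, 15% validation, 15% test
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
n = len(df)
train = df.iloc[: int(0.70 * n)]
val   = df.iloc[int(0.70 * n): int(0.85 * n)]
test  = df.iloc[int(0.85 * n):]

print(len(train), len(val), len(test))  # 70 15 15
```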
Training Hyperparameters
Training regime: fp16 mixed precision
    TrainingArguments(
        num_train_epochs=4,
        per_device_train_batch_size=8,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=2,
        learning_rate=2e-5,
        weight_decay=0.01,
        warmup_steps=500,
        label_smoothing_factor=0.1,
        fp16=True,
        load_best_model_at_end=True,
        metric_for_best_model='accuracy',
        save_total_limit=2,
        logging_steps=50,
        eval_strategy='epoch',
        save_strategy='epoch',
    )
Evaluation
Final Evaluation
Test Accuracy: 83.54%
Label-wise Accuracy:
Negative (0): 86.07%
Positive (1): 81.01%
Confusion Matrix:
[[10512  1702]
 [ 2310  9853]]
Classification Report:
Precision: ~83.6%
Recall: ~83.5%
F1-score: ~83.5%
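These numbers are internally consistent; they can be recomputed from the confusion matrix (rows = true labels, columns = predictions):

```python
cm = [[10512, 1702],   # true negatives: correct, misclassified
      [2310, 9853]]    # true positives: misclassified, correct

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
neg_acc = cm[0][0] / sum(cm[0])   # accuracy (recall) on the negative class
pos_acc = cm[1][1] / sum(cm[1])   # accuracy (recall) on the positive class

print(f"{accuracy:.2%} {neg_acc:.2%} {pos_acc:.2%}")  # 83.54% 86.07% 81.01%
```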
Metrics
Accuracy
Precision
Recall
F1-score (macro average)
Results
Hardware
GPU: T4 (on Colab) or T4 x2 (on Kaggle)
Model tree for Mentalii/sentiment-tweets-pos-neg-epoch3
Base model
google-bert/bert-base-uncased


