Experiment Timeline

The primary objective of this project is to systematically explore different approaches to programming language classification, ranging from traditional machine learning methods to modern transformer architectures.

Rather than immediately training a large neural network, the project follows a progressive benchmarking strategy. Each model serves as a baseline for the next stage, allowing direct comparison of accuracy, model size, training cost, inference speed, and deployment complexity.

The experiments are designed to answer several questions:

How far can classical machine learning be pushed on source code classification?
How much improvement does FastText provide over linear models?
How much additional performance can transformer architectures achieve?
What is the optimal trade-off between accuracy and model size?
Can large transformer models later be distilled into smaller deployable models?

Phase 1 — SGD Logistic Regression Baseline

Motivation

The first goal was to establish a strong classical machine learning baseline.

Programming languages contain many distinctive lexical and syntactic patterns:

#include
public class
def
fn
let
import

Character n-gram models are known to perform surprisingly well for language identification tasks because they capture these patterns directly without requiring deep semantic understanding.

Because of this, a linear classifier using hashed character n-gram features was selected as the initial benchmark.

Architecture

Feature Extraction

HashingVectorizer
Character-level features
Character n-grams: (2, 6)
131,072 hashed dimensions
No vocabulary storage
Constant-memory feature extraction
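
To make the "no vocabulary storage" point concrete, here is a minimal pure-Python sketch of the hashing trick applied to character n-grams. It is an illustration only: the helper names are hypothetical, and zlib.crc32 stands in for whatever hash function the library uses internally, so the bucket indices will not match HashingVectorizer output.

```python
import zlib

N_FEATURES = 131_072  # 2**17 hashed dimensions, as listed above
NGRAM_RANGE = (2, 6)  # character n-gram lengths, as listed above


def char_ngrams(text, ngram_range=NGRAM_RANGE):
    """Yield every character n-gram whose length lies within ngram_range."""
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for start in range(len(text) - n + 1):
            yield text[start:start + n]


def hashed_counts(text, n_features=N_FEATURES):
    """Map n-grams to bucket counts without ever storing a vocabulary."""
    counts = {}
    for gram in char_ngrams(text):
        # Assumption: crc32 is an illustrative stand-in for the real hash.
        bucket = zlib.crc32(gram.encode("utf-8")) % n_features
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts


features = hashed_counts("def main():")
```

Because each n-gram is mapped straight to a bucket index, memory use depends only on the number of buckets, not on how many distinct n-grams the corpus contains.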

Classifier

SGDClassifier
Logistic Regression objective (log_loss)
Incremental training using partial_fit
Streaming JSONL training pipeline

Training Strategy

The entire dataset was streamed from disk in batches.

Benefits:

Constant RAM usage
Scalable to millions of samples
No need to load the entire dataset into memory
Fast experimentation

The classifier was trained for multiple epochs while evaluating both validation and test performance after every epoch.
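
The batching step described above might look like the following sketch. The function name and the "text" and "label" field names are assumptions, since the JSONL schema is not given here. It accepts any iterable of lines, so an open file handle can be passed directly and only one batch is held in memory at a time.

```python
import json


def iter_batches(lines, batch_size=1000, text_key="text", label_key="label"):
    """Yield (texts, labels) lists of at most batch_size from JSONL lines."""
    texts, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        texts.append(record[text_key])
        labels.append(record[label_key])
        if len(texts) == batch_size:
            yield texts, labels
            texts, labels = [], []
    if texts:
        yield texts, labels  # final, possibly smaller, batch
```

A training epoch would then open the file, loop over iter_batches(handle), and pass each batch to the vectorizer and to partial_fit.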

Results

Test Accuracy

~91.1%

Observations

At roughly 91.1% test accuracy, the model performed better than expected for such a simple architecture.

Strengths

Extremely fast training
Fast inference
Simple implementation
Excellent scalability

Weaknesses

Difficulty separating structurally similar languages
Limited contextual understanding
Large sparse parameter matrix
Performance ceiling reached relatively quickly

Common Confusion Pairs

C ↔ C++
JavaScript ↔ TypeScript
HTML ↔ Markdown
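
One way to surface confusion pairs like these is to count mismatched (true, predicted) label pairs. The sketch below uses a hypothetical helper and made-up toy predictions, not the project's evaluation output.

```python
from collections import Counter


def top_confusions(true_labels, predicted_labels, k=3):
    """Return the k most frequent unordered pairs of confused labels."""
    pairs = Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth != guess:
            # Sort so that C -> C++ and C++ -> C count as the same pair.
            pairs[tuple(sorted((truth, guess)))] += 1
    return pairs.most_common(k)


# Toy data for illustration only.
y_true = ["C", "C++", "C", "TypeScript", "Python"]
y_pred = ["C++", "C", "C", "JavaScript", "Python"]
confusions = top_confusions(y_true, y_pred)
```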

Phase 2 — FastText

Motivation

After establishing the linear baseline, the next objective was to evaluate FastText.

FastText occupies an interesting position between classical machine learning and neural networks.

It introduces:

Learned embeddings
Character-level subword information
Efficient training
Low inference latency

while remaining dramatically smaller and faster than transformer models.

Data Preparation

FastText's supervised mode expects a plain-text format with one example per line, each beginning with a __label__-prefixed class name:

__label__Python print("hello")

A dedicated conversion pipeline was created to transform JSONL datasets into FastText format.

Preventing Label Leakage

During preprocessing, special care was taken to prevent accidental label leakage.

Source code occasionally contained the token:

__label__

which FastText reads as a label prefix, so any token beginning with it was treated as an additional training label.

To prevent this issue:

__label__ → __lbl__

was applied during dataset conversion.

This eliminated spurious classes and ensured correct training.
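
A minimal sketch of the conversion and the leakage fix together is shown below. The function name and the "text" and "label" keys are assumptions, and collapsing whitespace so that each sample fits on one line is an assumed preprocessing choice rather than something stated above.

```python
LABEL_PREFIX = "__label__"
SAFE_PREFIX = "__lbl__"


def to_fasttext_line(record, text_key="text", label_key="label"):
    """Convert one parsed JSONL record into one FastText training line."""
    # Neutralise label-like tokens that occur inside the source code itself.
    text = record[text_key].replace(LABEL_PREFIX, SAFE_PREFIX)
    # FastText reads one example per line, so collapse all whitespace runs.
    text = " ".join(text.split())
    return f"{LABEL_PREFIX}{record[label_key]} {text}"


record = {"label": "Python", "text": 'x = "__label__Java"\nprint(x)'}
line = to_fasttext_line(record)
```

Only the leading label keeps the real prefix; every occurrence inside the code body is rewritten before the line is emitted. Label names containing spaces would need extra handling, which this sketch omits.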

Architecture

Configuration

dim = 50
wordNgrams = 3
minn = 2
maxn = 5
minCount = 100
bucket = 50000
loss = softmax
epoch = 25
lr = 0.7
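
For reference, the configuration above could be turned into a fastText command line as sketched below. The file names train.txt and model are placeholders, the helper is hypothetical, and the learning rate is passed under the name lr, which is the option name fastText uses.

```python
FASTTEXT_PARAMS = {
    "dim": 50,
    "wordNgrams": 3,
    "minn": 2,
    "maxn": 5,
    "minCount": 100,
    "bucket": 50000,
    "loss": "softmax",
    "epoch": 25,
    "lr": 0.7,
}


def build_command(train_path, model_prefix, params=FASTTEXT_PARAMS):
    """Build the argument list for `fasttext supervised` from a params dict."""
    command = ["fasttext", "supervised",
               "-input", train_path, "-output", model_prefix]
    for name, value in params.items():
        command += [f"-{name}", str(value)]
    return command


# Placeholder paths; substitute the converted training file and output prefix.
command = build_command("train.txt", "model")
```

The resulting list can be handed to subprocess.run, or the same dictionary can be unpacked as keyword arguments to the Python binding's train_supervised function.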

Hyperparameter Exploration

A significant amount of experimentation was performed around:

Embedding dimension
Character subword lengths
Vocabulary size
Bucket size
Epoch count
Learning rate
Model size reduction

The goal was not merely to maximize accuracy, but also to produce a compact deployable model.

Results

Test Accuracy

~95.5%

Improvement Over SGD

+4.4 percentage points

Observations

FastText substantially outperformed the linear baseline.

Key Findings

Character subwords are extremely powerful for source code.
Many language-specific keywords are captured effectively.
FastText dramatically reduced confusion between related languages.
Training remained relatively fast despite the dataset scale.

FastText proved to be one of the strongest accuracy-to-compute trade-offs observed during the project.

Phase 3 — ModernBERT

Motivation

After achieving strong results with FastText, the next stage of the project explored whether transformer architectures could further improve programming language classification performance.

Unlike FastText, transformer models can learn:

Long-range dependencies
Global context
Structural relationships
Context-aware representations

The goal was to determine whether additional model capacity translates into meaningful real-world gains for source code language identification.

Architecture

Model

ModernBERT-base

Task

Sequence Classification

Results

Approximate Test Accuracy

~97–98%

Improvement Over FastText

~1.5–2.5 percentage points

Observations

ModernBERT achieved the highest overall accuracy among all models tested.

However, experimentation revealed that the improvement over FastText was relatively small considering the large increase in computational requirements.

Compared with FastText:

Training time increased dramatically
GPU memory usage increased significantly
Inference became substantially slower
Model size increased considerably
Deployment became more complex

Although ModernBERT achieved higher accuracy, the gain remained limited relative to the increase in compute.

Key Finding

For programming language classification specifically:

Transformer-based neural networks do not appear to be the most efficient solution for this task.

Programming languages contain strong lexical and structural signals that can already be captured extremely effectively using lightweight approaches.

FastText achieved performance surprisingly close to ModernBERT while requiring only a fraction of:

Compute
Training time
Memory
Storage
Inference cost

Current Benchmark Summary

Model	Test Accuracy	Relative Compute
SGD Logistic Regression	~91.1%	Very Low
FastText	~95.5%	Low
ModernBERT-base	~97–98%	Extremely High
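
The accuracy figures in the table can also be read as error rates. The sketch below uses hypothetical helper names and the approximate numbers from the table to compute the relative share of errors each step removes; treat the outputs as rough, since the inputs are themselves approximate.

```python
def error_rate(accuracy_pct):
    """Convert an accuracy percentage into an error percentage."""
    return 100.0 - accuracy_pct


def relative_error_reduction(old_acc, new_acc):
    """Fraction of the old model's errors that the new model removes."""
    old_err = error_rate(old_acc)
    return (old_err - error_rate(new_acc)) / old_err


sgd_to_fasttext = relative_error_reduction(91.1, 95.5)
fasttext_to_bert_low = relative_error_reduction(95.5, 97.0)
fasttext_to_bert_high = relative_error_reduction(95.5, 98.0)
```

On these figures, FastText removes about half of the SGD baseline's errors, and ModernBERT removes between about a third and a little over half of FastText's, at a far higher compute cost.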

Current Conclusions

1. Classical machine learning remains surprisingly competitive

Character-level linear models establish a strong baseline even at large scale.

2. FastText provides the strongest accuracy-to-compute ratio

Current experiments indicate FastText delivers the best balance of:

Accuracy
Training speed
Inference speed
Memory efficiency
Deployment simplicity

while remaining within only a few percentage points of transformer performance.


Dataset used to train kaushik-harsh-99/Code-Lang-Classifier

Evaluation results (self-reported)

SGD Test Accuracy on Code Language Classification Dataset: 91.1
FastText Test Accuracy on Code Language Classification Dataset: 95.5