I spent a week optimizing my 15M French LLM. Not one line of new architecture. And that was the whole point.
After building it from scratch (custom crawler, BPE, LLaMA-style arch, 3-phase trainer), the model wrote perfect French but hallucinated facts and drifted off-topic. So I went hunting for the bottleneck, convinced it was the architecture.
It wasn't. It never is.
The wins came from boring places: a data pipeline that cut documents mid-sentence, two special tokens silently sabotaging generation, and one decoding hyperparameter that doubled coherence (38 → 76 tokens before drift). The flashy research (contrastive decoding, DoLa) gave the smallest gains. One of them was even a false negative caused by my own buggy eval harness.
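To make the first of those concrete, here is a minimal sketch of sentence-aware chunking. It is not the pipeline from the write-up: the function name `chunk_on_sentences`, the `max_chars` budget, and the regex-based sentence splitter are all assumptions for illustration. The idea is only that a chunk boundary should land on sentence-ending punctuation rather than at a fixed character offset.

```python
import re

# Assumed splitter: break after ., !, ? or … when followed by whitespace.
# This is a rough heuristic; abbreviations such as "M. Dupont" will be split wrongly.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def chunk_on_sentences(text, max_chars):
    """Pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole (over budget)
    instead of being cut in the middle.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first sentence of a new chunk.
            current = candidate
        else:
            # Adding the sentence would overflow: close the chunk first.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


sample = "Paris est la capitale. Elle est grande! Vraiment? Oui."
for chunk in chunk_on_sentences(sample, max_chars=40):
    print(repr(chunk))
```

Every chunk here ends on a complete sentence, which a fixed-width slice such as `text[i:i + max_chars]` does not guarantee. A real pipeline would likely budget in tokens rather than characters and use a proper French sentence segmenter.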
The real lesson isn't about French LLMs:
Architecture is a threshold, not a lever. Once you clear it, the bottleneck is everywhere except the architecture. Measure first. Read your own data. Verify your code before you trust your conclusion.
The model was never the problem.
Full write-up here:
https://huggingface.co/blog/RDTvlokip/what-i-learned-optimizing-a-15m-french