RDTvlokip PRO

https://rdtvlokip.fr

Posts 2

I spent a week optimizing my 15M-parameter French LLM. Not one line of new architecture. And that was the whole point.

After building it from scratch (custom crawler, BPE, LLaMA-style arch, 3-phase trainer), the model wrote perfect French but hallucinated facts and drifted off-topic. So I went hunting for the bottleneck, convinced it was the architecture.

It wasn't. It never is.

The wins came from boring places: a data pipeline that cut documents mid-sentence, two special tokens silently sabotaging generation, and one decoding hyperparameter that doubled coherence (38 → 76 tokens before drift). The flashy research techniques (contrastive decoding, DoLa) gave the smallest gains. One of them even looked like a failure at first, a false negative caused by my own buggy eval harness.
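
To make the first of those fixes concrete, here is a minimal sketch of chunking text at sentence boundaries instead of cutting at a fixed length. It is an illustration only, not the pipeline described in the write-up: the function name `chunk_by_sentence`, the `max_chars` budget, and the simple punctuation-based sentence splitter are all assumptions, and a real pipeline would more likely budget in tokens than in characters.

```python
import re

# Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
# This is a crude splitter (it will mis-split abbreviations), used only
# to illustrate the idea.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def chunk_by_sentence(text: str, max_chars: int = 200) -> list[str]:
    """Pack whole sentences into chunks of roughly max_chars characters.

    A chunk never ends mid-sentence. A single sentence longer than
    max_chars is kept whole, so that chunk may exceed the budget.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk here,
            # at a sentence boundary, and start a new one.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "Paris est la capitale. Lyon est une ville. Nice est au sud."
    for chunk in chunk_by_sentence(doc, max_chars=45):
        print(repr(chunk))
```

The design choice is to let a chunk run slightly short (or, for one very long sentence, slightly long) instead of truncating, so every training example ends on a complete sentence.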

The real lesson isn't about French LLMs:

Architecture is a threshold, not a lever. Once you clear it, the bottleneck is everywhere except the architecture. Measure first. Read your own data. Verify your code before you trust your conclusion.

The model was never the problem.

Full write-up here 👇

🔗 https://huggingface.co/blog/RDTvlokip/what-i-learned-optimizing-a-15m-french

Articles 89

🔧 Architecture is a threshold, not a lever: what I learned optimizing a 15M-parameter French LLM 🇫🇷

Spaces 2

AG BPE

AG-BPE (Attention-Guided Byte-Pair Encoding)

InifiniGPT

Models 1

RDTvlokip/POGAT

Updated Mar 10, 2025

Datasets 1

RDTvlokip/InfiniQA

Viewer • Updated Jun 26, 2025 • 101k • 47 • 2