👋 Open to Work

RDTvlokip PRO

RDTvlokip

https://rdtvlokip.fr

AI & ML interests

None yet

Recent Activity

replied to their post 1 day ago

The runner-up at 0.39 is the tell. A 0.6 pick reads decisive until you see the model was one sample away from a completely different sentence. The chosen-prob view collapses that fork into a single number; the full distribution is where the hesitation actually lives. We agree there.
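A minimal sketch of reading that fork off the distribution instead of off the chosen probability alone. The three-way distribution below is hypothetical and only reuses the 0.6 and 0.39 figures from the discussion; `top2_margin` is an illustrative helper, not something from the original tooling.

```python
def top2_margin(probs):
    """Return (top1_index, top1_prob, top2_index, top2_prob, margin).

    `probs` is one next-token distribution as a list of probabilities.
    """
    if len(probs) < 2:
        raise ValueError("need at least two candidates to have a runner-up")
    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    first, second = ranked[0], ranked[1]
    return first, probs[first], second, probs[second], probs[first] - probs[second]


# Hypothetical distribution: a 0.60 pick with a 0.39 runner-up.
probs = [0.60, 0.39, 0.01]
first, p1, second, p2, margin = top2_margin(probs)
print(f"pick={first} p={p1:.2f} runner_up={second} p={p2:.2f} margin={margin:.2f}")
```

The chosen-prob view keeps only `p1`; the margin is the number that shows how close the generation came to taking the other branch.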

Your cross-turn rung is the one I underweighted, and you're right. In a single generation the worst case is a bad sentence. Across turns, the bug is in the gap, the state you assumed propagated versus what actually did. The failure isn't in any frame, it's in the cut between two.

To your real question, does any layer never lie? My honest answer: no single layer is fully honest, but the disagreement between two adjacent layers is. A clean rendered string over an exploded distribution. A confident chosen-prob over a runner-up that's nearly tied. Carried state that doesn't match assumed state. Every bug I've actually caught lived in a mismatch between two representations, never in one read alone.

So I've stopped looking for the truthful layer and started diffing adjacent ones. The signal isn't in any rung of the ladder, it's in the rungs not agreeing. The lie is always at a seam.

That's literally how I caught a decoding setting gaming my own metric: coherence looked great, but the model's self-perplexity on its own output had jumped. Neither number was "the truth." The gap between them was.
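A rough sketch of the self-perplexity side of that comparison, assuming you already have the probability the raw model (no sampling tweaks applied) assigns to each token of its own output. The probability lists are made-up illustration values, not measurements from the post.

```python
import math


def perplexity(token_probs):
    """Perplexity of a sequence from the per-token probabilities of its tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Mean negative log-likelihood, then exponentiate.
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical rescoring of two generations by the same model.
baseline_probs = [0.5, 0.5, 0.5, 0.5]      # output under the old decoding config
tuned_probs = [0.25, 0.25, 0.25, 0.25]     # output under the new decoding config

print(perplexity(baseline_probs))
print(perplexity(tuned_probs))
```

If a coherence metric improves under the new config while this number rises, the two views disagree, and that disagreement is the thing worth investigating.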

replied to their post 1 day ago

For me it's the per-token probabilities of the generation, not the tokens themselves.

When an eval looks clean but feels off, the text reads fine and the ids look fine, so the bug isn't what was generated, it's how confidently. I pull the prob of each chosen token. A model that's quietly broken (or being pushed by a bad sampling setting) shows it there first: long stretches of very low-confidence picks the surface text hides, or suspicious spikes where it's locked onto one path.

That's actually how I caught one of my decoding configs gaming a metric: coherence looked great, but the self-perplexity of the model on its own output had jumped. The rendered text was smooth, but the model itself "disagreed" with what it had written. The confidence signal exposed it before any human read-through would have.

So my order is: rendered text → raw ids → per-token confidence. Each layer is less lossy than the one above it. Your truncated-JSON case is the same shape, the rendered transcript is the most lossy view of all, and the thing that actually fired lives one layer down.
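A minimal sketch of that third layer, assuming you can dump the logits for each generation step alongside the chosen token ids. The function names, the 0.1 threshold and the minimum run length of 3 are all illustrative assumptions, not values from the post.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one step's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def chosen_probs(step_logits, chosen_ids):
    """Probability the model gave to each token it actually emitted."""
    return [softmax(logits)[tok] for logits, tok in zip(step_logits, chosen_ids)]


def low_confidence_runs(probs, threshold=0.1, min_len=3):
    """Return (start, end) index pairs for runs of picks below `threshold`.

    `end` is exclusive. Runs shorter than `min_len` are ignored.
    """
    runs, start = [], None
    for i, p in enumerate(probs):
        if p < threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sequence.
    if start is not None and len(probs) - start >= min_len:
        runs.append((start, len(probs)))
    return runs


# Hypothetical per-token confidences for one generation.
probs = [0.9, 0.05, 0.04, 0.03, 0.8, 0.02]
print(low_confidence_runs(probs))
```

The same list can be scanned the other way (long runs above, say, 0.99) to look for the "locked onto one path" spikes.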

What about you, when the id stream looks clean too, do you go to logits, or somewhere else entirely?

replied to their post 1 day ago

Diffing the real input_ids against what you think you sent is the move. That's literally how I found it: I printed the actual ids and saw a 3 (my <eos>) sitting at the end, where I expected my last content token. Two tokens I never typed, exactly.

To your question: I log raw ids on every eval run now, not just when something looks off. Cheap insurance. The whole reason this bug survived so long is that the symptom (drift) and the trigger (<eos>) live in different representations, one in the decoded text, one in the id stream. If you only inspect the layer where the symptom shows up, you never see the cause. Logging both by default means the next invisible-token bug shows up in the diff before I waste a day blaming sampling.

The rule I took from it: never debug a generation issue from the decoded string alone. The string is lossy: skip_special_tokens=True hides the exact thing that's breaking you.
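A small sketch of that diff, using only the standard library. The content ids (17, 204, 56) are hypothetical; the trailing 3 mirrors the id mentioned above. `diff_ids` is an illustrative helper name.

```python
import difflib


def diff_ids(expected, actual):
    """Return the non-matching spans between two token-id lists.

    Each entry is (tag, expected_slice, actual_slice), where tag is one of
    'insert', 'delete' or 'replace' as reported by difflib.
    """
    matcher = difflib.SequenceMatcher(a=expected, b=actual, autojunk=False)
    return [
        (tag, expected[i1:i2], actual[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


expected = [17, 204, 56]        # what you think you sent (hypothetical ids)
actual = [17, 204, 56, 3]       # what the encoder actually produced
print(diff_ids(expected, actual))
```

Logging the result on every eval run costs almost nothing, and an empty list is a quick confirmation that the prompt went in as intended.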

replied to their post 1 day ago

Exactly. Once the architecture clears the threshold, it stops being the lever and we keep tuning it out of habit because it's the part we can see.

The one that surprised me most wasn't even a hyperparameter. It was a single trailing <eos> token on the prompt.

My model kept "drifting": prompt about one town, output about a completely different one. I spent ages blaming the # heading token, the sampling, the data. Turns out encode(add_special_tokens=True) was appending <eos> to the end of the prompt. The model, trained on packed documents, read that as "this document is finished" and helpfully started a brand new one. The <eos> was invisible in the decoded output, so I never saw the actual trigger.

Strip the trailing <eos>, and the "drift" was just... gone. No architecture, no retraining. One token.

Your deepcopy-in-the-hot-path story is the same shape, the bug hides in the layer nobody inspects because it "obviously can't be that simple." It always can.
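The trailing-<eos> fix described above can be sketched as a guard applied to the prompt ids before generation. This is an illustration under assumptions, not the actual patch: `strip_trailing_eos` is a hypothetical helper, the content ids are made up, and the eos id is passed in explicitly because it depends on the tokenizer.

```python
def strip_trailing_eos(ids, eos_id):
    """Drop any <eos> ids at the end of a prompt, leaving interior ones alone."""
    end = len(ids)
    while end > 0 and ids[end - 1] == eos_id:
        end -= 1
    return ids[:end]


EOS_ID = 3  # assumption: the eos id for this particular tokenizer

prompt_ids = [17, 204, 56, EOS_ID]   # hypothetical encoded prompt
clean_ids = strip_trailing_eos(prompt_ids, EOS_ID)
print(clean_ids)
```

Encoding the prompt without special tokens in the first place is the other obvious option, if the tokenizer exposes that switch; the guard is useful when the ids come from code you don't control.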

posted an update 2 days ago

I spent a week optimizing my 15M French LLM. Not one line of new architecture. And that was the whole point.

After building it from scratch (custom crawler, BPE, LLaMA-style arch, 3-phase trainer), the model wrote perfect French but hallucinated facts and drifted off-topic. So I went hunting for the bottleneck, convinced it was the architecture.

It wasn't. It never is.

The wins came from boring places: a data pipeline that cut documents mid-sentence, two special tokens silently sabotaging generation, and one decoding hyperparameter that doubled coherence (38 → 76 tokens before drift). The flashy research, contrastive decoding, DoLa, gave the smallest gains. One of them was even a false negative caused by my own buggy eval harness.

The real lesson isn't about French LLMs:

Architecture is a threshold, not a lever. Once you clear it, the bottleneck is everywhere except the architecture. Measure first. Read your own data. Verify your code before you trust your conclusion.

The model was never the problem.

Full write-up here 👇

🔗 https://huggingface.co/blog/RDTvlokip/what-i-learned-optimizing-a-15m-french

8 replies

reacted to Aurelien-Morgan's post with 🔥 about 2 months ago

@retrain-pipelines v0.2.0 is out !
I'm at Station F, at my booth with GOSIM Paris 2026, today & tomorrow.
Come meet me for a live in-person demo and a chat !

1 reply

posted an update about 2 months ago

🧠 I trained a French LLM from scratch. Alone. On a 1080 Ti. And honestly… it was a lot.

4 months building the dataset before even touching the model. Custom crawler, custom extractor, custom BPE tokenizer, everything from zero. Then the architecture — RoPE, RMSNorm, SwiGLU, Flash Attention. Then a 3-phase trainer. Then debugging a causal mask bug that made the model generate "ïsïsïs" for hours.

Then the power went out at epoch 10/18.

The checkpoint survived. The model learned form perfectly — grammar, markdown, structure. Substance? Still working on it. Honest conclusion.

Full write-up here 👇

🔗 https://huggingface.co/blog/RDTvlokip/i-trained-my-own-french-llm-from-scratch

reacted to hba123's post with 🔥 9 months ago

🤖 What if building your own robot arm costs less than £220?

For years, robotics has been locked behind high prices and complex systems.
So we decided to change that.

Today, we’re open-sourcing Ark-Bot — a fully 3D-printed, 6-DOF robot arm that works seamlessly with our Python robotics library, Ark.

And yes… It’s only £215.86 to build.

🧠ArkBot Specs 🧠

1️⃣ Reach: 1 meter
2️⃣ Weight: 2.6 kg
3️⃣ Payload: 1.8 kg 💪
4️⃣ DOF: 6
5️⃣ Input Voltage: DC 12V

🤟Fully 3D-printable & open-source
🤟Integrated with Ark — no ROS required

📹 We’ve also released a video showing the full assembly process — because robotics should be something everyone can learn, build, and improve on.

👩‍🎓 With Ark-Bot, anyone — from students to AI researchers — can experiment with embodied AI, robot learning, and control algorithms on real hardware, affordably.

If you could control a 1-meter robot arm from your laptop for under £220…
👉 What would you build first?

🔗https://github.com/Robotics-Ark/ark_bot
🎥 https://www.youtube.com/watch?v=Kuk4pC0EaEw&feature=youtu.be

2 replies
