Jim Lai
Fascinating model
Faithful decensor
OpenClaw commoditizes the agentic orchestration layer, but at the expense of offloading security to users, who are mostly unprepared. Right now it's got vibes, but also technical debt, as security wasn't baked in from conception, so I wouldn't be surprised if it ends up like LangChain in a year: initially popular, but less so as its limitations become visible. Most people are using APIs to frontier models, so only the agents are running locally, not the brains of the AI per se. Enthusiasm precedes awareness.
The main negative I see is that a lot of newbies are going to learn hard-won security lessons the hard way. Vibe coding generally doesn't bake security in from the ground up, and few vibe coders likely know enough to prompt for it.
I gather a lot of instances aren't even using local AI, making them agentic extensions of frontier models. Of course a lot of people will be impressed at what modern agentic AI can do.
google/gemma-scope-2-12b-it
Given scale, it also means contamination with meme culture, adding an unserious element to things. It was therefore stochastically predictable that we would see some meme tropes amplified.
We've known since 2023 that simulating multiple agents leads to emergent community behavior. Perhaps their GitHub repo should be revisited, as they ran a 25-agent sim.
https://hai.stanford.edu/news/computational-agents-exhibit-believable-humanlike-behavior
We already have tools which enable group chats of multiple personae at small scale, so seeing emergent behavior isn't unprecedented.
What we don't have in this live experiment is an assurance of integrity; e.g., that human prompt injection isn't being used to tamper with results for clout. Alternatively, having agents read human-authored content at large on the Internet results in contamination, invalidating any claims of emergence without human input.
And the tradeoff is having to allocate more memory to track magnitude and direction separately. Please keep me apprised about how this goes.
The YAML included was accurate at the time. Layer 27 was from an early attempt. The viability of applying refusal measurements to chunks of layers suggests that a signal processing view involving key layers could be a useful framing. Applying the refusal direction on a per-layer basis underperformed in my experiments.
I expect the deccp dataset is only useful against a subset of refusals, though I didn't test that edge case, as it was inherited from the codebase I started from. Validating that the entries are refused by a particular Chinese model and culling those that pass would be a more targeted approach, as non-refusals would dilute the refusal direction.
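The culling idea above could be sketched roughly like this: run each prompt through the target model and keep only the entries it actually refuses. The `generate` callable and the refusal markers are placeholders for illustration, not part of any real codebase.

```python
# Hedged sketch: keep only dataset entries the model refuses, so
# non-refusals don't dilute the measured refusal direction.
# The marker list is a naive stand-in for real refusal detection.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword check for refusal-style responses."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def cull_dataset(prompts, generate):
    """Keep prompts the model refuses; drop the ones it answers.
    `generate` is a placeholder: prompt -> model response string."""
    return [p for p in prompts if is_refusal(generate(p))]
```

A real implementation would want a stronger refusal classifier than keyword matching, but the filtering logic is the same.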
Fine-tuning is a well-established way to smooth over damage resulting from ablation. I'm curious why you picked DoRA.
I should get around to documenting my layer selection choice on the relevant model card, which was admittedly empirical and bespoke.
I should have taken better notes on my final Gemma 3 12B work, but it appears I took the measurement from layer 29 (which looked good in charting) and ablated it across layers 11-41 at scale 1 throughout; I threw in sparsity 0.001 for layers 35-41, but that may not have been necessary. Geometric preservation allowed the model to retain most of its knowledge despite the extent of intervention.
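For concreteness, those Gemma 3 12B settings might look something like the following in an intervention config. The field names here are my guess at a generic schema, not the actual YML format used:

```yaml
# Hypothetical sketch of the settings described above;
# field names are illustrative, not the real schema.
measurement_layer: 29        # direction measured here (looked good in charting)
interventions:
  - layers: [11, 34]         # inclusive range, scale 1 throughout
    scale: 1.0
  - layers: [35, 41]         # sparsity added here; possibly unnecessary
    scale: 1.0
    sparsity: 0.001
```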
Let me know whenever you make your paper available. I'd be interested to see your findings!
Activations are measured for all layers in one pass, as the cost is only a bit more RAM to hold the results; there's no significant cost in inference time. This is done for measuring both compliance and refusal activations. The directional difference is computed within each layer.
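The per-layer computation described above can be sketched as a difference of mean activations between refused and complied prompts, normalized to a unit direction per layer. This is my own minimal reconstruction of the standard difference-of-means approach, not the actual codebase; the list-of-lists layout is an assumption for illustration.

```python
# Sketch: one unit refusal direction per layer, from activations
# captured for every layer in a single forward pass.
import math

def mean_vectors(vecs):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def refusal_directions(refused_acts, complied_acts):
    """refused_acts / complied_acts: [prompt][layer][dim] activations.
    Returns one normalized direction per layer."""
    n_layers = len(refused_acts[0])
    dirs = []
    for layer in range(n_layers):
        r = mean_vectors([p[layer] for p in refused_acts])
        c = mean_vectors([p[layer] for p in complied_acts])
        diff = [a - b for a, b in zip(r, c)]
        norm = math.sqrt(sum(x * x for x in diff)) or 1.0
        dirs.append([x / norm for x in diff])
    return dirs
```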
For intervention/ablation, the YML file allows an N-to-M mapping. I can pick 3-4 (notionally high-relevance) layer measurements to apply to sequential chunks, with the heuristic that keeping the source measurement layer close to the target intervention layers will hopefully limit unwanted side effects. One could apply each refusal measurement to the same layer, but that approach doesn't provide the most effective ablation in my experience. There's something deeper going on which I've not yet been able to characterize.
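A minimal sketch of that N-to-M idea, under my own assumptions about the mapping (nearest measurement layer wins) and using standard directional ablation (projecting the unit direction out of the hidden state); function names are hypothetical:

```python
# Sketch: map each intervention layer to its closest measurement
# layer, then project the refusal direction out of hidden states.
def chunk_mapping(measurement_layers, target_layers):
    """Assign each target layer the nearest measurement layer,
    so sequential chunks share a nearby source measurement."""
    return {t: min(measurement_layers, key=lambda m: abs(m - t))
            for t in target_layers}

def ablate(hidden, direction, scale=1.0):
    """Remove the (unit) refusal direction from one hidden state:
    h' = h - scale * (h . d) * d"""
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - scale * dot * d for h, d in zip(hidden, direction)]
```

With, say, measurements at layers 11, 20, and 29, targets 11-15 would draw from layer 11, 16-24 from layer 20, and 25 onward from layer 29, which matches the heuristic of keeping source and target close.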