Jim Lai
Fascinating model
Faithful decensor
OpenClaw commoditizes the agentic orchestration layer, but at the expense of offloading security to users, who are mostly unprepared. Right now it's got vibes, but also technical debt, as security wasn't baked in from conception, so I wouldn't be surprised if it ends up like LangChain in a year: initially popular, but less so as its limitations become visible. Most people are using APIs to frontier models, so only the agents are running locally, not the brains of the AI per se. Enthusiasm precedes awareness.
The main negative I see is that a lot of newbies are going to learn hard-won security lessons the hard way. Vibe coding generally doesn't bake security in from the ground up, and few vibe coders likely know enough to prompt for it.
I gather a lot of instances aren't even using local AI, making them agentic extensions of frontier models. Of course a lot of people will be impressed at what modern agentic AI can do.
google/gemma-scope-2-12b-it
Given scale, it also means contamination with meme culture, adding an unserious element to things. It was therefore stochastically predictable that we would see some meme tropes amplified.
We've known since 2023 that simulating multiple agents leads to emergent community behavior. Perhaps their GitHub repo should be revisited, as they ran a 25-agent sim.
https://hai.stanford.edu/news/computational-agents-exhibit-believable-humanlike-behavior
We already have tools which enable group chats of multiple personae at small scale, so seeing emergent behavior isn't unprecedented.
What we don't have in this live experiment is an assurance of integrity; e.g., that human prompt injection isn't being used to tamper with results for clout. Alternatively, having agents read human-authored content at large on the Internet results in contamination, invalidating any claims of emergence without human input.
And the tradeoff is having to allocate more memory to track magnitude and direction separately. Please keep me apprised about how this goes.
The YAML included was accurate at the time. Layer 27 was from an early attempt. The viability of applying refusal measurements to chunks of layers suggests that a signal processing view involving key layers could be a useful framing. Applying the refusal direction on a per-layer basis underperformed in my experiments.
I expect the deccp dataset is only useful against a subset of refusals, though I didn't test that edge case, as it was inherited from the codebase I started from. Validating that the entries are refused by a particular Chinese model and culling those that pass would be a more targeted approach, as non-refusals would dilute the refusal direction.
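The culling idea above could be sketched roughly like this: run each prompt through the target model and keep only the entries it actually refuses. The `generate` callable and the refusal markers are placeholders for illustration, not part of any real codebase.

```python
# Hedged sketch: keep only dataset entries the model refuses, so
# non-refusals don't dilute the measured refusal direction.
# The marker list is a naive stand-in for real refusal detection.
REFUSAL_MARKERS = ("i cannot", "i can't", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    """Crude keyword check for refusal-style responses."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def cull_dataset(prompts, generate):
    """Keep prompts the model refuses; drop the ones it answers.
    `generate` is a placeholder: prompt -> model response string."""
    return [p for p in prompts if is_refusal(generate(p))]
```

A real implementation would want a stronger refusal classifier than keyword matching, but the filtering logic is the same.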
Fine-tuning is a well-established way to smooth over damage resulting from ablation. I'm curious why you picked DoRA.
I should get around to documenting my layer selection choice on the relevant model card, which was admittedly empirical and bespoke.
I should have taken better notes on my final Gemma 3 12B work, but it appears I took the measurement from layer 29 (which looked good in charting) and ablated it across layers 11-41 at scale 1 throughout; I threw in sparsity 0.001 for layers 35-41, but that may not have been necessary. Geometric preservation allowed the model to retain most of its knowledge despite the extent of intervention.
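For concreteness, those Gemma 3 12B settings might look something like the following in an intervention config. The field names here are my guess at a generic schema, not the actual YML format used:

```yaml
# Hypothetical sketch of the settings described above;
# field names are illustrative, not the real schema.
measurement_layer: 29        # direction measured here (looked good in charting)
interventions:
  - layers: [11, 34]         # inclusive range, scale 1 throughout
    scale: 1.0
  - layers: [35, 41]         # sparsity added here; possibly unnecessary
    scale: 1.0
    sparsity: 0.001
```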
Let me know whenever you make your paper available. I'd be interested to see your findings!
Activations are measured for all layers in one pass, as the cost is only a bit more RAM to hold the results; there's no significant cost in inference time. This is done for measuring both compliance and refusal activations. The directional difference is computed within each layer.
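The per-layer computation described above can be sketched as a difference of mean activations between refused and complied prompts, normalized to a unit direction per layer. This is my own minimal reconstruction of the standard difference-of-means approach, not the actual codebase; the list-of-lists layout is an assumption for illustration.

```python
# Sketch: one unit refusal direction per layer, from activations
# captured for every layer in a single forward pass.
import math

def mean_vectors(vecs):
    """Elementwise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def refusal_directions(refused_acts, complied_acts):
    """refused_acts / complied_acts: [prompt][layer][dim] activations.
    Returns one normalized direction per layer."""
    n_layers = len(refused_acts[0])
    dirs = []
    for layer in range(n_layers):
        r = mean_vectors([p[layer] for p in refused_acts])
        c = mean_vectors([p[layer] for p in complied_acts])
        diff = [a - b for a, b in zip(r, c)]
        norm = math.sqrt(sum(x * x for x in diff)) or 1.0
        dirs.append([x / norm for x in diff])
    return dirs
```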
For intervention/ablation, the YML file allows an N-to-M mapping. I can pick 3-4 (notionally high-relevance) layer measurements to apply to sequential chunks, with the heuristic that keeping the source measurement layer close to the target intervention layers will hopefully limit unwanted side effects. One could apply each refusal measurement to the same layer, but that approach doesn't provide the most effective ablation in my experience. There's something deeper going on which I've not yet been able to characterize.
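A minimal sketch of that N-to-M idea, under my own assumptions about the mapping (nearest measurement layer wins) and using standard directional ablation (projecting the unit direction out of the hidden state); function names are hypothetical:

```python
# Sketch: map each intervention layer to its closest measurement
# layer, then project the refusal direction out of hidden states.
def chunk_mapping(measurement_layers, target_layers):
    """Assign each target layer the nearest measurement layer,
    so sequential chunks share a nearby source measurement."""
    return {t: min(measurement_layers, key=lambda m: abs(m - t))
            for t in target_layers}

def ablate(hidden, direction, scale=1.0):
    """Remove the (unit) refusal direction from one hidden state:
    h' = h - scale * (h . d) * d"""
    dot = sum(h * d for h, d in zip(hidden, direction))
    return [h - scale * dot * d for h, d in zip(hidden, direction)]
```

With, say, measurements at layers 11, 20, and 29, targets 11-15 would draw from layer 11, 16-24 from layer 20, and 25 onward from layer 29, which matches the heuristic of keeping source and target close.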