Jim Lai

grimjim

AI & ML interests

Experimenting primarily with 7B-12B parameter text completion models. Not all models are intended for direct end use; some are meant for research and/or educational purposes. Recent contributions: stabilized refusal direction ablation via Gram-Schmidt orthonormalization and norm-preserving interventions; confirmed reasoning transfer via model merger.

Recent Activity

commented on their article 2 days ago

Norm-Preserving Biprojected Abliteration

commented on their article 3 days ago

Norm-Preserving Biprojected Abliteration

updated a dataset 19 days ago

grimjim/llm-aes-writing-prompts-deduplicated-0.9-similarity

commented on Norm-Preserving Biprojected Abliteration 2 days ago

The yaml included was accurate then. Layer 27 was from an early attempt. The viability of applying refusal measurements to chunks of layers suggests that a signal processing view involving key layers could be a useful framing. Applying the refusal direction on a per-layer basis underperformed in my experiments.

I expect the deccp dataset is useful only against a subset of refusals, though I didn't test that edge case, as the dataset was inherited from the codebase I started from. Validating that the entries are refused by a particular Chinese model and culling those that pass would be a more targeted approach, as nonrefusals would dilute the refusal direction.
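The culling step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `generate` stands in for whatever call produces a response from the model under test, and the marker list and helper names are hypothetical, not part of any codebase mentioned here. A real check would likely need a more robust refusal classifier.

```python
# Hypothetical, illustrative refusal markers; not an exhaustive or validated list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable")


def looks_like_refusal(response):
    """Crude substring check for refusal phrasing in a model response."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def cull_non_refusals(prompts, generate):
    """Keep only the prompts whose response from `generate` looks like a refusal.

    `generate` is an assumed callable mapping a prompt string to a response string.
    """
    return [p for p in prompts if looks_like_refusal(generate(p))]


# Canned responses stand in for a real model call in this toy example.
canned = {
    "prompt A": "I'm sorry, but I can't help with that.",
    "prompt B": "Sure, here is an overview...",
}
kept = cull_non_refusals(list(canned), canned.get)
```

Only the prompts that are actually refused would then feed the refusal measurement, so compliant responses do not pull the measured direction toward zero.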

Fine-tuning is a well-established way to smooth over damage resulting from ablation. I'm curious why you picked DoRA.

commented on Norm-Preserving Biprojected Abliteration 3 days ago

I should get around to documenting my layer selection choice on the relevant model card, which was admittedly empirical and bespoke.

I should have taken better notes regarding my final Gemma 3 12B work, but it appears that I took the measurement from layer 29 (which looked good in charting) and ablated it from layers 11-41, scale 1 throughout; I threw in sparsity 0.001 for layers 35-41, but that may not have been necessary. Geometric preservation allowed the model to retain most of its knowledge despite the extent of intervention.
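To make the idea of a norm-preserving intervention concrete, here is a minimal sketch on a single vector. It assumes "norm-preserving" means removing the component along the measured direction and then rescaling back to the original length; the function names are hypothetical, and the sketch does not reproduce the actual implementation or its sparsity option.

```python
import math


def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def ablate_preserving_norm(w, direction, scale=1.0):
    """Remove `scale` times the component of w along `direction`,
    then rescale the result back to w's original length."""
    d_norm = norm(direction)
    if d_norm == 0.0:
        return list(w)  # no usable direction; leave w unchanged
    d = [x / d_norm for x in direction]
    proj = sum(a * b for a, b in zip(w, d))
    ablated = [a - scale * proj * b for a, b in zip(w, d)]
    a_norm = norm(ablated)
    if a_norm == 0.0:
        return ablated  # w was parallel to the direction; nothing to rescale
    factor = norm(w) / a_norm
    return [x * factor for x in ablated]


w = [3.0, 4.0]
out = ablate_preserving_norm(w, [1.0, 0.0], scale=1.0)
```

With scale 1 the output has no component left along the direction, while its length matches that of the input vector.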

Let me know whenever you make your paper available. I'd be interested to see your findings!

updated a dataset 19 days ago

grimjim/llm-aes-writing-prompts-deduplicated-0.9-similarity

Viewer • Updated 19 days ago • 81.4k • 16

published a dataset 19 days ago

grimjim/llm-aes-writing-prompts-deduplicated-0.9-similarity

Viewer • Updated 19 days ago • 81.4k • 16

commented on Norm-Preserving Biprojected Abliteration 22 days ago

Activations are measured for all layers in one pass, as the cost is only a bit more RAM to hold the results, with no significant cost in inference time. This is done for both compliance and refusal activations. The directional difference is computed within each layer.
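As a rough sketch of the per-layer computation described above, the snippet below assumes the directional difference is a difference of mean activations between refused and complied prompts, which the comment does not spell out. The data layout (a dict from layer index to a list of activation vectors) and the function names are hypothetical, and plain lists are used only to keep the example dependency-free.

```python
from statistics import fmean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [fmean(col) for col in zip(*rows)]


def refusal_directions(refused, complied):
    """Per-layer difference of mean activations (refused minus complied).

    Both arguments map a layer index to a list of activation vectors,
    one vector per prompt, all collected in a single measurement pass.
    """
    directions = {}
    for layer in refused:
        mu_r = mean_vector(refused[layer])
        mu_c = mean_vector(complied[layer])
        directions[layer] = [r - c for r, c in zip(mu_r, mu_c)]
    return directions


# Toy data: 2 layers, 2 prompts per class, hidden size 3.
refused = {
    0: [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    1: [[0.0, 4.0, 0.0], [0.0, 2.0, 0.0]],
}
complied = {
    0: [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
    1: [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],
}
dirs = refusal_directions(refused, complied)
```

Each layer gets its own direction; no information is mixed across layers at this stage.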

For intervention/ablation, the YML file allows an N-to-M mapping. I can pick 3-4 (notionally high relevance) layer measurements to apply to sequential chunks, on the heuristic that keeping the source measurement layer close to the target intervention layer will hopefully limit unwanted side effects. One could apply each refusal measurement to the same layer it was measured from, but that approach doesn't provide the most effective ablation in my experience. There's something deeper going on which I've not yet been able to characterize.
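One way to picture the N-to-M mapping is to assign each target layer to its nearest measured layer and then collapse the result into sequential chunks. The sketch below is an assumption-laden illustration: the layer numbers are made up, the helper names are hypothetical, and it does not reproduce the actual YML schema or the hand-picked choices described above.

```python
def assign_sources(measured_layers, target_layers):
    """Map each target layer to the nearest measured layer.

    Ties go to the lower-indexed measured layer.
    """
    sources = sorted(measured_layers)
    return {t: min(sources, key=lambda s: (abs(s - t), s)) for t in target_layers}


def as_chunks(mapping):
    """Collapse a per-layer mapping into (source, first_target, last_target) runs."""
    chunks = []
    for target in sorted(mapping):
        src = mapping[target]
        if chunks and chunks[-1][0] == src and chunks[-1][2] == target - 1:
            # Extend the current run of consecutive targets sharing this source.
            chunks[-1] = (src, chunks[-1][1], target)
        else:
            chunks.append((src, target, target))
    return chunks


# Hypothetical example: measurements from layers 10 and 20, applied to layers 12-18.
mapping = assign_sources([10, 20], range(12, 19))
chunks = as_chunks(mapping)
```

Here two measurements cover seven target layers, with each chunk drawing on the measurement taken closest to it.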

updated a model 26 days ago