Deep Dive

Bankai: first post-training adaptation method for true 1-bit LLMs

Original work: Bankai: first post-training adaptation method for true 1-bit LLMs


Why This Matters — Model quantization has long been a game of trading quality for size, but true 1-bit models sit at the most extreme end of that spectrum — weights stored as single bits, enabling models that fit in kilobytes per layer and run on commodity hardware. The problem is that once a model is collapsed to 1 bit, traditional fine-tuning methods like LoRA simply don't work: they assume continuous-valued weight deltas. Bankai is the first post-training adaptation method designed specifically for true 1-bit LLMs, opening a door that was previously sealed shut: targeted behavioral modification of binary-weight models after training, with patches measured in bytes rather than megabytes.

The Problem — True 1-bit models like Bonsai 8B offer extraordinary deployment efficiency, but they're frozen at training time. LoRA and other parameter-efficient fine-tuning methods require continuous weight spaces to compute meaningful gradients and low-rank updates. A 1-bit weight is either 0 (mapped to −scale) or 1 (mapped to +scale) — there's no gradient to follow, no rank to decompose. So when a 1-bit model gets arithmetic wrong or fails on specific tasks, there has been no way to patch those behaviors short of retraining from scratch. The entire adaptation ecosystem that makes full-precision models so useful in practice simply doesn't apply.
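To make the obstacle concrete, here is a minimal sketch (illustrative names, not Bonsai's actual storage format) of how a 1-bit weight row is materialized at inference: each stored bit selects one of exactly two values, so there is no continuum for a gradient step to move through.

```python
import numpy as np

def dequantize_row(bits: np.ndarray, scale: float) -> np.ndarray:
    """Materialize a 1-bit row: stored bit 0 -> -scale, bit 1 -> +scale."""
    return np.where(bits == 1, scale, -scale)

# Toy 4-element row; a real row in this architecture would be 4,096 bits.
row = np.array([1, 0, 0, 1], dtype=np.uint8)
print(dequantize_row(row, 0.02))
```

Because the value set is just {−scale, +scale}, a small continuous update (say, +0.001 to one weight) has nowhere to land; the only meaningful edit is a discrete flip, which is exactly the space Bankai searches.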

Key Innovation — Bankai exploits a fundamental property of binary arithmetic: when weights are single bits, the difference between any two model states is a bitwise XOR. Instead of computing gradient updates, Bankai searches for specific rows in MLP weight tensors where flipping all bits in that row (a 4,096-bit operation) improves targeted behavioral probes without degrading general capabilities. A scale-guided targeting heuristic identifies high-impact rows, achieving 3.88x more behavioral change than random flips. The result is a JSON patch file — literally a list of (layer, projection, row) triples at 12 bytes each — that can be applied or removed in microseconds via XOR's self-inverse property.
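The XOR mechanics above can be sketched in a few lines (the dict layout and function names are illustrative assumptions, not Bankai's actual API): a patch is a list of (layer, projection, row) triples, and because XOR against an all-ones mask is its own inverse, a single function both applies and removes it.

```python
import numpy as np

def apply_patch(weights: dict, patch: list) -> None:
    """Apply (or, called again, remove) a patch of whole-row bit flips."""
    for layer, proj, row in patch:
        weights[(layer, proj)][row] ^= 1  # XOR flips every bit in the row

# Toy model: one 1-bit tensor with 4 rows of 8 bits each.
weights = {(0, "up_proj"): np.random.randint(0, 2, size=(4, 8), dtype=np.uint8)}
original = weights[(0, "up_proj")].copy()

patch = [(0, "up_proj", 2)]   # a one-entry patch: flip row 2
apply_patch(weights, patch)   # apply: row 2 is now inverted
assert not np.array_equal(weights[(0, "up_proj")][2], original[2])
apply_patch(weights, patch)   # XOR is self-inverse: re-applying restores it
assert np.array_equal(weights[(0, "up_proj")], original)
```

Since each triple is just three small integers, serializing the list as JSON yields the bytes-sized patch files the article describes.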

How It Works — The search process iterates over candidate row flips in MLP layers, evaluating each against a set of behavioral probes (logit-gap measurements on targeted prompts) and safety controls (ensuring existing correct behaviors aren't broken). Accepted flips are accumulated into a patch. On Bonsai 8B (36 layers, 4096 hidden dimension, 1-bit group quantization), a typical patch contains 70–93 flips modifying only 0.005–0.007% of total model weights. The math patch (72 flips, 864 bytes) improved arithmetic probe scores; the generalized calculus patch (93 flips, 1.1 KB) achieved a 23.5% fix rate on held-out problems — fixing 4 of 17 failures with zero breakage on 13 existing successes. Notably, even 500,000 random bit flips produce less than 0.08 perplexity change, revealing massive redundancy in binary MLP weights that Bankai strategically exploits. The entire search runs on an Apple M3 with 24 GB RAM in under 70 minutes — no GPU required.

Impact & What's Next — Bankai makes 1-bit models adaptable for the first time. Where LoRA requires 50–200 MB and adds inference-time matrix multiplications, a Bankai patch is ~1 KB with zero runtime overhead — applied once before generation, perfectly reversible. This is particularly compelling for on-device deployment where multiple domain-specific patches could be swapped instantly. Current limitations include row-level granularity (flipping entire 4,096-bit rows rather than individual bits) and a documented probe-to-generation gap where logit improvements don't always translate to correct free-form outputs. Patch composability via sequential XOR is algebraically sound but behaviorally untested for overlapping domains. As true 1-bit model architectures mature, Bankai establishes the adaptation primitive that makes them practically useful — turning frozen binary weights into a patchable, versionable substrate.
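The algebraic soundness of composition is easy to see in code (an illustrative sketch; as noted above, behavioral composition across overlapping domains is untested): flipping the same row twice cancels, so composing two patches reduces to the symmetric difference of their triples.

```python
from collections import Counter

def compose(patch_a: list, patch_b: list) -> list:
    """Compose two XOR patches: triples appearing an even number of times cancel."""
    counts = Counter(patch_a) + Counter(patch_b)
    return [triple for triple, n in counts.items() if n % 2 == 1]

a = [(0, "up_proj", 5), (1, "gate_proj", 9)]
b = [(1, "gate_proj", 9), (3, "down_proj", 2)]
print(compose(a, b))  # the shared flip on (1, "gate_proj", 9) cancels
```

A patch composed with itself is empty, which is just the self-inverse property again at the level of patch files rather than weight rows.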