Seven modular inference-time components that exploit the universal geometric structure of transformer attention heads. No model weights modified. No training. No fine-tuning.
Every tested transformer with 8 KV heads organizes its head interactions into the Fano plane, the unique finite projective plane of order 2 and the multiplication table of the imaginary octonions. This structure is not designed into the models. It emerges from training because it is algebraically optimal.
Each point is a KV attention head. Each line represents a coherent triple — three heads whose Hadamard product coherence exceeds the significance threshold. The arrangement IS the octonion multiplication table.
10,000 random trials: mean coherence 0.230 ± 0.366. Discovered Fano plane at 1.098 is 3.23σ above null.
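A minimal sketch of the null test, assuming coherence is the summed absolute value of the Hadamard (element-wise) product of three heads' Frobenius-normalized projection matrices; the head matrices below are random stand-ins, so the printed numbers will not reproduce the figures above.

```python
import numpy as np

rng = np.random.default_rng(0)

def triple_coherence(a, b, c):
    """Coherence of a head triple: summed |Hadamard product| of Frobenius-normalized
    matrices. The exact metric used in the original analysis is an assumption here."""
    unit = lambda x: x / np.linalg.norm(x)
    return float(np.abs(unit(a) * unit(b) * unit(c)).sum())

# Synthetic stand-ins for 8 KV heads' key projections (d_head x d_model).
# Convention assumed here: head 0 plays the real octonion unit, heads 1-7 the imaginary units.
heads = [rng.standard_normal((64, 512)) for _ in range(8)]

# The 7 lines of the Fano plane over heads 1..7.
fano_lines = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6), (2, 5, 7), (3, 4, 7), (3, 5, 6)]
observed = np.mean([triple_coherence(heads[i], heads[j], heads[k]) for i, j, k in fano_lines])

# Null distribution: coherence of randomly drawn triples.
null = []
for _ in range(10_000):
    i, j, k = rng.choice(8, size=3, replace=False)
    null.append(triple_coherence(heads[i], heads[j], heads[k]))
null = np.array(null)
print(f"observed {observed:.4f}, null {null.mean():.4f} ± {null.std():.4f}, "
      f"z = {(observed - null.mean()) / null.std():.2f}")
```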
Attention heads self-organize into three classes by their entropy growth rate across sequence positions. The growth rates converge to transcendental constants — properties of the geometry, not any specific model. Four models, four companies, agreement within 0.52%.
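A sketch of the three-way classification, assuming a head's growth rate is the slope of its attention entropy against log query position; the cut points and the synthetic attention trace are placeholders, not the reported constants.

```python
import numpy as np

def entropy_growth_rates(attn):
    """attn: (heads, positions, positions) causal attention weights.
    Growth rate (assumed definition): slope of row entropy vs. log query position."""
    n_heads, n_pos, _ = attn.shape
    positions = np.arange(4, n_pos)              # skip early rows where entropy is degenerate
    rates = []
    for h in range(n_heads):
        ent = []
        for q in positions:
            p = attn[h, q, : q + 1]
            ent.append(-(p * np.log(p + 1e-12)).sum())
        rates.append(np.polyfit(np.log(positions), ent, 1)[0])
    return np.array(rates)

def classify(rates, lo=0.3, hi=0.8):
    """Three classes by growth rate. The cut points are illustrative, not the published constants."""
    return np.digitize(rates, [lo, hi])          # 0 = flat, 1 = intermediate, 2 = near log-growth

# Synthetic causal attention as a stand-in for a real model trace.
rng = np.random.default_rng(1)
n_heads, n_pos = 8, 256
logits = rng.standard_normal((n_heads, n_pos, n_pos))
logits = np.where(np.tril(np.ones((n_pos, n_pos), dtype=bool)), logits, -np.inf)
attn = np.exp(logits - logits.max(-1, keepdims=True))
attn /= attn.sum(-1, keepdims=True)

rates = entropy_growth_rates(attn)
print(np.round(rates, 3), classify(rates))
```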
Geometric boundary z-buffer sort. Compute Q_BOS from model weights at load time. Sort KV cache so low-signal keys are contiguous. Flash Attention's block-skip handles the rest. The kernel never sees 75% of keys.
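A sketch of the sort itself, assuming Q_BOS is the query the model produces for the BOS token (precomputable from the weights at load time) and that a key's signal is its dot product with Q_BOS; the block-skip step lives in the Flash Attention kernel and is only hinted at here.

```python
import numpy as np

def zbuffer_sort_kv(keys, values, q_bos, keep_fraction=0.25, block=128):
    """Reorder the KV cache so low-signal keys sit contiguously at the tail.
    keys, values: (seq, d); q_bos: (d,). keep_fraction and block size are illustrative.
    `order` is returned so positional metadata can be permuted alongside the cache."""
    signal = keys @ q_bos                          # assumed signal score: alignment with the BOS query
    order = np.argsort(-signal)                    # high-signal first, low-signal packed at the end
    keys_s, values_s = keys[order], values[order]

    # Blocks made entirely of low-signal keys can be skipped by a block-sparse kernel.
    n_keep = int(len(keys) * keep_fraction)
    n_blocks = (len(keys) + block - 1) // block
    skip_mask = np.arange(n_blocks) * block >= n_keep
    return keys_s, values_s, order, skip_mask

rng = np.random.default_rng(2)
keys, values = rng.standard_normal((1024, 128)), rng.standard_normal((1024, 128))
q_bos = rng.standard_normal(128)                   # in practice computed from model weights at load time
k, v, order, skip = zbuffer_sort_kv(keys, values, q_bos)
print(f"{skip.sum()} of {len(skip)} blocks never touched by the kernel")
```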
Spectrally gated attention function W(s,θ) producing true arithmetic zeros in attention weights. Per-head θ via trifurcated classification. The model can say no.
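A sketch of a gate that yields true arithmetic zeros, assuming W(s, θ) thresholds post-softmax weights at a per-head θ and renormalizes; the θ values below are placeholders for whatever the trifurcated classification would assign.

```python
import numpy as np

def gated_attention_weights(scores, theta):
    """scores: (heads, queries, keys) pre-softmax logits; theta: (heads,) per-head gate.
    Post-softmax weights below a head's theta become exactly 0.0, then rows renormalize.
    A row that zeroes out entirely stays all-zero: the model can say no."""
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    w = np.where(w >= theta[:, None, None], w, 0.0)      # true arithmetic zeros
    norm = w.sum(-1, keepdims=True)
    return np.divide(w, norm, out=np.zeros_like(w), where=norm > 0)

rng = np.random.default_rng(3)
scores = rng.standard_normal((8, 4, 512))
# Placeholder per-head thresholds; in the described system these would come from the
# trifurcated head classification, which is not spelled out here.
theta = np.array([0.002, 0.002, 0.005, 0.005, 0.005, 0.01, 0.01, 0.01])
w = gated_attention_weights(scores, theta)
print(f"{(w == 0).mean():.1%} of attention weights are exact zeros")
```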
Breastplate-classified adaptive KV cache denoising. Low-signal positions identified by real-time attention weights inside the softmax kernel at zero cost. A TBQ round-trip removes noise that F16 preserves.
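TBQ is not expanded here, so the sketch below treats it as a generic coarse quantize-dequantize round-trip applied only to the positions the kernel flags as low-signal; every name and threshold in it is an assumption.

```python
import numpy as np

def denoise_low_signal(keys, attn_weights, threshold=1e-3, levels=16):
    """Round-trip low-signal keys through a coarse quantizer (stand-in for TBQ).
    keys: (seq, d) in float16; attn_weights: (seq,) attention mass observed per position."""
    low = attn_weights < threshold                    # positions flagged as low-signal by the kernel
    k = keys.astype(np.float32)
    scale = np.abs(k[low]).max(axis=-1, keepdims=True) + 1e-12
    q = np.round(k[low] / scale * (levels // 2))      # coarse symmetric quantization
    k[low] = q / (levels // 2) * scale                # dequantize: sub-quantum noise is gone
    return k.astype(np.float16)

rng = np.random.default_rng(4)
keys = (rng.standard_normal((1024, 128)) * 0.02).astype(np.float16)
attn = rng.random(1024) * 0.01
clean = denoise_low_signal(keys, attn)
print(f"{(attn < 1e-3).mean():.1%} of positions round-tripped")
```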
Geometric cache eviction via polynomial eigenvector alignment. Query-independent — eviction decisions don't change based on what's being asked. Stable under sustained context pressure.
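A sketch of the eviction rule under one reading of "polynomial eigenvector alignment": score each cached key by its projection onto the top eigenvectors of the key covariance (a polynomial of the keys) and keep the best-aligned ones. The scoring function and budget are assumptions.

```python
import numpy as np

def evict(keys, values, budget, n_eig=8):
    """Keep the `budget` keys best aligned with the cache's own principal directions.
    Query-independent: the score depends only on the keys, never on the incoming query."""
    k = keys - keys.mean(0)
    _, eigvecs = np.linalg.eigh(k.T @ k / len(k))          # spectrum of the key covariance
    top = eigvecs[:, -n_eig:]                              # dominant eigenvectors
    alignment = np.linalg.norm(k @ top, axis=-1)           # projection of each key onto that subspace
    keep = np.sort(np.argsort(-alignment)[:budget])        # preserve original ordering of survivors
    return keys[keep], values[keep], keep

rng = np.random.default_rng(5)
keys, values = rng.standard_normal((2048, 128)), rng.standard_normal((2048, 128))
k2, v2, kept = evict(keys, values, budget=512)
print(f"kept {len(kept)} of {len(keys)} positions")
```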
Cold tier codepage. Evicted keys are caught, RoPE-stripped, projected to intrinsic dimensionality, compressed. Recalled on attention deficit. Eviction becomes recoverable.
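A sketch of the cold tier, assuming RoPE stripping means rotating each evicted key back by its positional angle and the intrinsic-dimensionality projection is a low-rank SVD basis; the attention-deficit trigger is reduced to a single threshold check. All names are illustrative.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0):
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    return positions[:, None] * inv_freq[None, :]

def strip_rope(keys, positions):
    """Undo the rotary embedding so evicted keys compress in a position-free frame."""
    ang = rope_angles(positions, keys.shape[-1])
    cos, sin = np.cos(ang), np.sin(ang)
    x, y = keys[:, 0::2], keys[:, 1::2]
    out = np.empty_like(keys)
    out[:, 0::2] = x * cos + y * sin          # inverse rotation (rotate by -theta)
    out[:, 1::2] = -x * sin + y * cos
    return out

class ColdTier:
    """Catch evicted keys, project to a low-rank basis, recall when attention falls short."""
    def __init__(self, rank=32):
        self.rank, self.store = rank, []

    def evict(self, keys, positions):
        k = strip_rope(keys, positions)
        _, _, vt = np.linalg.svd(k, full_matrices=False)
        basis = vt[: self.rank]                      # assumed 'intrinsic dimensionality' projection
        self.store.append((positions, k @ basis.T, basis))

    def recall_if_deficit(self, attn_mass, floor=0.9):
        if attn_mass >= floor or not self.store:     # enough attention mass stayed in the hot cache
            return None
        positions, coded, basis = self.store.pop()
        return positions, coded @ basis              # approximate keys; re-rotate before reuse

rng = np.random.default_rng(6)
tier = ColdTier()
tier.evict(rng.standard_normal((256, 128)), np.arange(256).astype(float))
print(tier.recall_if_deficit(attn_mass=0.4)[1].shape)
```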
Transfer matrix boundary evaluation. A document's effect on the cache predicted from its boundary tokens alone. 14% of tokens → 75% of benefit. Matrices compose multiplicatively.
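A sketch of the composition property, assuming each document contributes a transfer matrix estimated from its boundary tokens alone and that reading documents in sequence composes their matrices by multiplication; the outer-product estimator below is a stand-in.

```python
import numpy as np

def boundary_transfer_matrix(keys, n_boundary=8, eps=0.05):
    """Estimate a document's effect on the cache from its boundary tokens alone.
    keys: (seq, d). Uses the first and last n_boundary positions as the boundary."""
    d = keys.shape[-1]
    boundary = np.concatenate([keys[:n_boundary], keys[-n_boundary:]])
    boundary = boundary / np.linalg.norm(boundary, axis=-1, keepdims=True)
    return np.eye(d) + eps * boundary.T @ boundary    # small perturbation of the identity

rng = np.random.default_rng(7)
docs = [rng.standard_normal((200, 64)) for _ in range(3)]
transfers = [boundary_transfer_matrix(k) for k in docs]

# Matrices compose multiplicatively: the predicted effect of reading doc0, then doc1, then doc2.
combined = transfers[2] @ transfers[1] @ transfers[0]
state = rng.standard_normal(64)
print(np.allclose(combined @ state, transfers[2] @ (transfers[1] @ (transfers[0] @ state))))
```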
Two-phase active vocabulary. READ from prefill + WRITE from generation = 500–2,000 domain-relevant tokens. 99% of vocabulary masked. Eliminates out-of-domain hallucination.
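A sketch of the two-phase mask, assuming READ collects token ids from the prefill, WRITE adds whatever the model actually emits, and every inactive logit is forced to negative infinity; tokenizer and model are left abstract.

```python
import numpy as np

class ActiveVocabulary:
    """Two-phase allowlist over the token ids the model may emit."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.active = set()

    def read(self, prefill_ids):
        """Phase 1 (READ): every token id seen in the prefill joins the active set."""
        self.active.update(int(t) for t in prefill_ids)

    def write(self, generated_id):
        """Phase 2 (WRITE): tokens the model emits stay active for the rest of the run."""
        self.active.add(int(generated_id))

    def mask_logits(self, logits):
        """Set all inactive token logits to -inf so they can never be sampled."""
        masked = np.full_like(logits, -np.inf)
        idx = np.fromiter(self.active, dtype=int)
        masked[idx] = logits[idx]
        return masked

rng = np.random.default_rng(8)
vocab = ActiveVocabulary(vocab_size=128_000)
vocab.read(rng.integers(0, 128_000, size=1_500))           # stand-in for prefill token ids
logits = rng.standard_normal(128_000)
next_id = int(np.argmax(vocab.mask_logits(logits)))
vocab.write(next_id)
print(f"{len(vocab.active) / vocab.vocab_size:.2%} of the vocabulary is active")
```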
Feed domain text through the model. Compact the KV cache via geometric eviction. Repeat. The cache accumulates domain knowledge — without modifying model weights.
Freeze the result to disk as a Lodestone. Restore it in a fresh process. The model inherits domain expertise instantly. A $200 phone with a Lodestone competes with a datacenter.
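A sketch of the anneal-and-freeze loop under heavy assumptions: the cache is a plain pair of key/value arrays, compaction reuses an alignment-style eviction like the sketch above, the prefill pass is a placeholder, and a Lodestone is simply the compacted cache serialized to disk. File and function names are illustrative.

```python
import numpy as np

def compact(keys, values, budget, n_eig=8):
    """Stand-in for the geometric eviction sketched above: keep the keys best aligned
    with the cache's own principal directions."""
    k = keys - keys.mean(0)
    _, _, vt = np.linalg.svd(k, full_matrices=False)
    score = np.linalg.norm(k @ vt[:n_eig].T, axis=-1)
    keep = np.sort(np.argsort(-score)[:budget])
    return keys[keep], values[keep]

def anneal(keys, values, domain_chunks, encode, budget=4096):
    """Feed domain text, append its KV entries, compact, repeat."""
    for chunk in domain_chunks:
        new_k, new_v = encode(chunk)                      # placeholder for a real prefill pass
        keys = np.concatenate([keys, new_k])
        values = np.concatenate([values, new_v])
        if len(keys) > budget:
            keys, values = compact(keys, values, budget)
    return keys, values

def freeze_lodestone(path, keys, values):
    """Persist the annealed cache so a fresh process can restore it instantly."""
    np.savez_compressed(path, keys=keys, values=values)

def restore_lodestone(path):
    data = np.load(path)
    return data["keys"], data["values"]

rng = np.random.default_rng(9)
fake_encode = lambda text: (rng.standard_normal((256, 128)), rng.standard_normal((256, 128)))
k, v = anneal(np.empty((0, 128)), np.empty((0, 128)), ["domain text"] * 30, fake_encode)
freeze_lodestone("lodestone.npz", k, v)
k2, v2 = restore_lodestone("lodestone.npz")
print(k2.shape)                                           # the Lodestone restores to (budget, d)
```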
| Model | Parameters | Cold PPL | Annealed PPL | Improvement |
|---|---|---|---|---|
| Llama 3.1 8B | 8B | 7.40 | 3.86 | −47.8% |
| MiniMax M2.5 | 228B | 5.71 | 2.56 | −55.2% |