★ DEVICE primary ★ AI application
SlimeTree-RLM
A semantic-driven record system you drop on existing systems as an orthogonal layer.
Cross-validated as a -20.4 ± 0.3 pt architectural constant across 3 external benchmarks.
A single Rust binary, 272 KB. Runs in browser, mobile, and embedded with no server required. Suppresses LLM "plausible-sounding lies" without touching a single bit of model weights. Works for AI (LLM safety) and non-AI (audit / decision / business control) alike.
3 external benchmarks validated 4 LLM cross-validation WASM-deployable Paper v10 / Patent 1-44
▶ Try the live demo (see D/μ/R routing) ※ The public demo is a joke build (no external AI, $0, no internal data). The real in-house AI is on the service page, authenticated.
A technique that keeps an AI (a large language model like ChatGPT) from saying plausible-but-wrong things (hallucinations). It never touches the model's weights; it sits outside the model as a "record body" and raises answer reliability. A tiny 272 KB part that runs in the browser or on a phone with no server.
Suppresses LLM hallucinations (plausible lies) without changing any weights. Measured a stable −20.4 ± 0.3 pt change in incorrect rate across 3 external benchmarks (2,290 questions × 3 seeds = 6,870 trials). A "performance equalizer" where 8B-class models converge to an 81% ceiling in a 4-LLM cross-validation. Procedure, rubric and seeds are all public.
A meaning-driven record body. Routes into D (deterministic) / μ (suppression) / R (reasoning): certain parts deterministically, risky parts suppressed, the LLM only when needed. Weights untouched, so it retrofits onto any model. 272 KB WASM, no server in browser/mobile, with an audit WAL.
Hallucination suppression measured as a −20.4 ± 0.3 pt structural constant over 3 benchmarks (2,290 questions × 3 seeds = 6,870 trials). A performance equalizer where Tier-A 8B-class models converge to an 81% ceiling in a 4-LLM cross-validation. A tier-③ implementation handling non-reproducible stochastic output via meaning-equivalence + convergence + residual. Procedure, LLM settings, seeds and rubric are fully public and third-party reproducible.
What it does
SlimeTree-RLM adds semantic-driven constraints and audit to existing AI models, decision engines, and business-rule layers as an orthogonal layer: lightweight, deterministic, and drop-in. It suppresses "plausible-sounding lies" structurally without touching a single bit of LLM weights, while simultaneously providing WAL (operational record) + cascade rollback + SHA-256 audit chain as standard equipment. Not AI-only: it doubles as the semantic-constraint layer for audit, approval, and operational control.
Pull "record" and "constraint" out of the model
Existing AI safety approaches (RLHF / Constitutional AI / o1 reasoning) all retrain the model or change its internal reasoning. RLM does the opposite: leave the model untouched, drop a "semantic record system" on the outside. The record system is a deterministic Rust implementation that holds input/output facts (as SemanticTime + SensoryTime pairs) plus semantic constraints, and tips generation into refusal when needed. The result, structurally: the model becomes swappable, audit ships built-in, regulations align, and offline operation works — all at once.
How it works — structural suppression
1) Branch-free 3-mode router (D / μ / R)
Input is weight-routed across three inference modes: D (Decisive, fast), μ (Moderation, refusal / "I do not know" answers), and R (Reasoning, deep). Instead of branching, each mode carries a weight w that is updated as w · exp(−η · regret), so it decays exponentially with the failure signal. As the generating modes lose weight, μ refusals rise naturally and prune hallucinations as "unnecessary generation". This is the part that implements the paper's central thesis (hallucination = unnecessary generation) at the action level.
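The weight update described above can be sketched as a multiplicative-weights rule. This is only an illustration under stated assumptions: the class name `Router`, the value of `eta`, the per-mode regret dictionary and the normalisation into routing shares are hypothetical and are not taken from the paper or the Rust implementation.

```python
import math

MODES = ("D", "mu", "R")  # Decisive, Moderation (refusal), Reasoning


class Router:
    """Hypothetical 3-mode router: each weight decays as w * exp(-eta * regret)."""

    def __init__(self, eta=0.5):
        self.eta = eta  # assumed learning rate, not a value from the source
        self.weights = {mode: 1.0 for mode in MODES}

    def update(self, regret):
        """Apply one failure signal; regret maps mode -> non-negative regret."""
        for mode in MODES:
            self.weights[mode] *= math.exp(-self.eta * regret.get(mode, 0.0))

    def shares(self):
        """Normalised routing shares; every mode keeps a share, no if/else gate."""
        total = sum(self.weights.values())
        return {mode: w / total for mode, w in self.weights.items()}


router = Router(eta=0.5)
before = router.shares()
# Assumed scenario: the generating modes D and R keep producing wrong answers,
# while refusing (mu) incurs no regret.
for _ in range(5):
    router.update({"D": 1.0, "R": 1.0, "mu": 0.0})
after = router.shares()
print(f"mu share before: {before['mu']:.3f}, after: {after['mu']:.3f}")
```

In this toy run the μ share grows only because the other two weights shrink, which is the "natural increase in refusals" the text describes.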
2) Three-tier memory: Hot Shelf (Treap) / Cold Shelf (Red-Black Tree) / Inactive Queue
Slots of meaning are stored not by wall-clock time but by (SemanticTime, SensoryTime) tuples, flowing across Hot (Treap), Cold (RB-Tree), and Inactive shelves according to credibility and forget_index. "Recently-used meanings" and "forgettable meanings" are separated structurally, so the system does not have to scan a giant context at full strength on every request; compute drops and responses speed up.
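A minimal sketch of how slots might move between the three shelves. Plain dictionaries stand in for the Treap and Red-Black Tree, and the two thresholds, the `Slot` fields' value ranges and the placement rule are assumptions made for illustration; the source only names `credibility` and `forget_index` as the deciding quantities.

```python
from dataclasses import dataclass

HOT_MIN_CREDIBILITY = 0.7   # assumed threshold
INACTIVE_MIN_FORGET = 0.8   # assumed threshold


@dataclass
class Slot:
    semantic_time: int
    sensory_time: int
    credibility: float   # assumed range 0..1, higher = more trusted
    forget_index: float  # assumed range 0..1, higher = more forgettable

    @property
    def key(self):
        # Slots are keyed by the (SemanticTime, SensoryTime) tuple, not wall-clock time.
        return (self.semantic_time, self.sensory_time)


def shelf_for(slot):
    """Pick a shelf from credibility and forget_index (hypothetical rule)."""
    if slot.forget_index >= INACTIVE_MIN_FORGET:
        return "inactive"
    if slot.credibility >= HOT_MIN_CREDIBILITY:
        return "hot"
    return "cold"


class Shelves:
    def __init__(self):
        # dicts replace the Treap / RB-Tree of the real design
        self.tiers = {"hot": {}, "cold": {}, "inactive": {}}

    def place(self, slot):
        """Remove the slot from wherever it was, then store it on its current shelf."""
        for tier in self.tiers.values():
            tier.pop(slot.key, None)
        name = shelf_for(slot)
        self.tiers[name][slot.key] = slot
        return name

    def lookup(self, key):
        """Search hot first, then cold; inactive slots are never scanned."""
        for name in ("hot", "cold"):
            if key in self.tiers[name]:
                return name
        return None


shelves = Shelves()
slot = Slot(semantic_time=3, sensory_time=10, credibility=0.9, forget_index=0.1)
first = shelves.place(slot)    # trusted and fresh
slot.credibility = 0.4
second = shelves.place(slot)   # trust dropped
slot.forget_index = 0.95
third = shelves.place(slot)    # now forgettable
print(first, second, third, shelves.lookup(slot.key))
```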
3) Parallelism and locality — Bernstein commutator + Hilbert / SpiralIndex
Semantic dependencies live on an operator ring; Bernstein commutators decide parallel feasibility mechanically. Maximal cliques are greedily covered into mutually-disjoint parallel groups. Physical slot layout uses the Hilbert curve and SpiralIndex (logarithmic spacing) to preserve spatial locality so nearby meanings land in nearby memory.
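The parallel-feasibility test can be illustrated with the classic Bernstein conditions: two operations may run in parallel when neither writes what the other reads or writes. The sketch below assumes each operation exposes explicit read and write sets, and it uses a first-fit greedy grouping as a simple stand-in for the clique cover mentioned above; the `Op` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: frozenset
    writes: frozenset


def commute(a, b):
    """Bernstein's conditions: no write/read, read/write or write/write overlap."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def parallel_groups(ops):
    """First-fit greedy grouping: every group is a set of pairwise-commuting ops.

    Groups are mutually disjoint; they are cliques of the commutation graph,
    though not guaranteed to be maximum ones.
    """
    groups = []
    for op in ops:
        for group in groups:
            if all(commute(op, other) for other in group):
                group.append(op)
                break
        else:
            groups.append([op])
    return groups


ops = [
    Op("a", reads=frozenset({"x"}), writes=frozenset({"y"})),
    Op("b", reads=frozenset({"x"}), writes=frozenset({"z"})),
    Op("c", reads=frozenset({"y"}), writes=frozenset({"x"})),  # conflicts with a and b
]
groups = [[op.name for op in group] for group in parallel_groups(ops)]
print(groups)
```

The sketch only decides which operations may share a group; ordering between groups and the Hilbert / SpiralIndex layout are out of its scope.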
4) Standard-equipment WAL + cascade rollback + audit chain
Every operation is recorded in a WAL (write-ahead log); cascades of failure can be undone via cascade rollback that only propagates on the non-commuting side. The record carries a built-in SHA-256 audit chain for tamper detection. This is an AI safety layer AND a regulatory bit-exact audit substrate at the same time.
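How a SHA-256 audit chain detects tampering can be shown with a small hash-chained log. This is a generic sketch, not the product's record format: the entry layout, the genesis value and the `AuditLog` API are assumptions, and cascade rollback is not modelled.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def entry_hash(prev_hash, payload):
    """Hash the previous entry's hash together with a canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log in which every entry commits to its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, payload):
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        entry = {"payload": payload, "prev": prev, "hash": entry_hash(prev, payload)}
        self.entries.append(entry)
        return entry["hash"]

    def verify(self):
        """Return the index of the first broken entry, or None if the chain is intact."""
        prev = GENESIS
        for index, entry in enumerate(self.entries):
            if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
                return index
            prev = entry["hash"]
        return None


log = AuditLog()
log.append({"op": "route", "mode": "D"})
log.append({"op": "route", "mode": "mu", "verdict": "refuse"})
log.append({"op": "route", "mode": "R"})
intact = log.verify()

# Simulate after-the-fact tampering with the second record.
log.entries[1]["payload"]["verdict"] = "pass"
broken_at = log.verify()
print(intact, broken_at)
```

Because each hash covers the previous hash, editing one record invalidates that record and cannot be hidden without recomputing every later entry.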
(Full theory in paper v10. Implementation: Python v0.1 + Rust port + WASM. Patent claims 1-44 cover the architecture. See "Resources" below.)
Key specs
| Item | Spec |
|---|---|
| Binary size | 272 KB Rust single WASM binary (plus a 2,210-line Python reference implementation) |
| Runtime | Browser / mobile / embedded / server; no backend server required (deploy anywhere) |
| Speed | 24× vs Python reference. When applied as the inference layer, LLM responses themselves get 5.8× faster (short path via D dominance) |
| Robustness | Zero data loss across 10,000-slot × 500-step stress (138 unit tests pass) |
| Integration | Orthogonal layer over existing systems (LLM / decision engine / business rules; AI or non-AI) |
| Audit / record | WAL + cascade rollback + SHA-256 audit chain — standard equipment |
| Family | Slime storage family / DEVICE primary |
| Delivery | WASM single-file distribution (evaluation) + individual engagement (production) |
| Paper / patent | Paper v10 (English, 33 pages) + jxiv Japanese v2 (15 pages) + patent claims 1-44 |
3 external benchmarks — the -20.4 ± 0.3 pt architectural constant
Validated not on self-made benchmarks but on 3 external public benchmarks: 2,290 questions × 3 seeds = 6,870 trials. This section is measured with Qwen3:8b as the primary base model (cross-model results are in the next "4-LLM cross-validation" section, an independent experiment). The 3 benchmarks have independent question authoring and independent difficulty axes, and baseline scores span approximately 6.6× (from 6.5% to 42.9%; 42.9 / 6.5 ≈ 6.6). Despite this, the incorrect-rate suppression converges to an extremely tight constant: -20.4 ± 0.3 pt. Cascade-based hallucination suppression is a benchmark-agnostic architectural property, not an artifact tuned to a specific benchmark.
| Benchmark | Axis | Scale (n × seeds) | Baseline | Routed (RLM) | Δ incorrect | Δ F-score |
|---|---|---|---|---|---|---|
| SimpleQA (OpenAI) | T1: long-tail entity | 500 × 3 | F 6.5% σ=0.31 | F 10.2% σ=0.47 | -20.5 pt | +3.7 pt |
| TruthfulQA (Lin et al.) | T5+T6: misconception / trick | 790 × 3 seeds (790 of standard 817, those that admit binary scoring) | Truth 9.9% σ=0.31 | Truth 30.0% σ=0.10 | -20.1 pt | +20.1 pt |
| HaluEval-QA (HotpotQA-derived) | T2+T6: false premise / multi-hop | 1,000 × 3 | F 42.9% σ=1.23 | F 64.3% σ=0.26 | -20.7 pt | +21.4 pt |
| 3-bench combined (architectural constant) | T1 ↔ T5+T6 ↔ T2+T6 (full axis cover) | 2,290 distinct Q × 3 seeds = 6,870 trials | baseline 6.5% → 42.9% (~6.6× spread) | | -20.4 ± 0.3 pt ★ | +3.7 to +21.4 pt |
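
The combined row can be re-derived from the three per-benchmark rows. The short check below assumes the "±" is the sample standard deviation of the three Δ incorrect values; the source does not state which estimator was used.

```python
from statistics import mean, stdev

# Per-benchmark values copied from the table above.
questions = {"SimpleQA": 500, "TruthfulQA": 790, "HaluEval-QA": 1000}
delta_incorrect = {"SimpleQA": -20.5, "TruthfulQA": -20.1, "HaluEval-QA": -20.7}
baseline = {"SimpleQA": 6.5, "TruthfulQA": 9.9, "HaluEval-QA": 42.9}
SEEDS = 3

trials = sum(questions.values()) * SEEDS
center = mean(delta_incorrect.values())
spread = stdev(delta_incorrect.values())  # sample standard deviation (assumption)
baseline_ratio = max(baseline.values()) / min(baseline.values())

print(f"trials = {trials}")
print(f"delta incorrect = {center:.1f} ± {spread:.1f} pt")
print(f"baseline spread = {baseline_ratio:.1f}x")
```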
Property A: variance absorption — scales with baseline noise
The variance tightening effect scales with baseline σ. On the quiet SimpleQA (baseline σ=0.31), routed σ=0.47 — slightly wider. By contrast, TruthfulQA (σ=0.31 → 0.10) is 3.1× tighter, and HaluEval-QA (σ=1.23 → 0.26) is up to 4.7× tighter. Dynamic strength scaling: the noisier the baseline, the stronger the cascade's variance tightening; on quiet baselines the effect is null or slightly wider — not a universal law but a noise-conditional property (by design).
"Is this just a mechanical refusal increase?" — rebuttal
Incorrect dropping uniformly by -20.4 pt across 3 benchmarks could be read as "μ routing simply raises refusal by a fixed amount". SimpleQA's breakdown refutes this: incorrect 86.6% → 66.1% (-20.5 pt), not_attempted 6.4% → 28.2% (+21.8 pt), correct nearly unchanged (7.0% → 5.7%, -1.3 pt). μ tips into refusal only on originally-incorrect questions, leaving correct ones essentially alone — evidence that μ structurally distinguishes question difficulty, consistent with the helpfulness 100% parity (40 Q non-trap) below. (Per-bench abstention rates for TruthfulQA / HaluEval-QA are in paper v10 §4; contact for access.)
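The SimpleQA breakdown quoted above can be checked with a few lines of arithmetic. The "accuracy when attempted" figure is derived here from the quoted percentages (correct divided by correct plus incorrect) and is not a number reported in the source.

```python
# Percentages copied from the SimpleQA breakdown above.
baseline = {"incorrect": 86.6, "not_attempted": 6.4, "correct": 7.0}
routed = {"incorrect": 66.1, "not_attempted": 28.2, "correct": 5.7}

# Change per outcome, in percentage points.
delta = {k: round(routed[k] - baseline[k], 1) for k in baseline}


def accuracy_when_attempted(row):
    """Share of attempted questions that were answered correctly (derived metric)."""
    return 100 * row["correct"] / (row["correct"] + row["incorrect"])


print(delta)
print(f"baseline sums to {sum(baseline.values()):.1f}%, routed to {sum(routed.values()):.1f}%")
print(f"accuracy when attempted: {accuracy_when_attempted(baseline):.1f}% -> "
      f"{accuracy_when_attempted(routed):.1f}%")
```

The drop in incorrect answers is almost entirely matched by the rise in not_attempted, while accuracy on attempted questions does not fall, which is the pattern the rebuttal relies on.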
Helpfulness is not lost
The design reduces incorrect answers by increasing refusals, but on 40 pure helpfulness questions (non-trap) helpfulness is empirically at 100% parity. "Questions that can be answered correctly are still answered correctly; only the questions that would have triggered a lie tip into refusal." This is established as an architectural property.
4 LLM cross-validation — Tier-A 8B-class converges at the same 81% ceiling (the performance equalizer)
To show SlimeTree-RLM is not a Qwen3-only number, we ran cross-validation on 4 LLMs (8B / 7B / 4B class) under identical conditions. Every LLM shows hallucination suppression; in particular both Tier-A models (Qwen3 and Llama 3.1, 8B-class) land at 19% hallucination = 81% ceiling after routing.
| LLM | Size | Baseline halluc | Routed halluc | Δ halluc | Δ Latency | Δ Tokens |
|---|---|---|---|---|---|---|
| Qwen3:8b | 8B | 63% | 19% | -44 pt | -85.7% | -21.0% |
| Llama 3.1:8b | 8B | 51% | 19% | -32 pt | -83.3% | -24.6% |
| Mistral 7B | 7B | 70% | 51% | -19 pt | -74.8% | -1.9% |
| Gemma 3:4B | 4B | 79% | 59% | -20 pt | -79.3% | -35.2% |
★ Performance equalizer: Two 8B-class models that started 12 pt apart (Qwen3 63% vs Llama 3.1 51% hallucination) both end at 19% hallucination = 81% correct ceiling after RLM. In other words, within the same Tier, the choice of LLM stops mattering — from the AI-provider's view this is structural dissolution of LLM vendor lock-in.
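The equalizer claim reduces to simple arithmetic on the cross-validation table; the snippet below only recomputes the published figures and introduces no new data.

```python
# (baseline hallucination %, routed hallucination %) from the table above.
results = {
    "Qwen3:8b": (63, 19),
    "Llama 3.1:8b": (51, 19),
    "Mistral 7B": (70, 51),
    "Gemma 3:4B": (79, 59),
}

delta = {model: routed - base for model, (base, routed) in results.items()}

tier_a = ("Qwen3:8b", "Llama 3.1:8b")
baseline_gap = abs(results[tier_a[0]][0] - results[tier_a[1]][0])
routed_gap = abs(results[tier_a[0]][1] - results[tier_a[1]][1])
ceiling = 100 - results[tier_a[0]][1]

print(delta)
print(f"Tier-A gap: {baseline_gap} pt before routing, {routed_gap} pt after; ceiling {ceiling}%")
```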
Validated across languages too: Japanese +54 pt / English +24 pt / Arabic +7 pt hallucination improvement (paper v10, §3 multilingual matrix).
Enterprise effects — "LLM safety net" + "audit substrate" in one
What RLM solves in an enterprise spans the AI domain and the audit / governance domain simultaneously. Satisfying both with the same Rust single binary changes the cost structure.
Where it lands by industry
| Industry | Where RLM fits |
|---|---|
| Banks / major insurers | Safety layer for business LLM deployments (structural suppression of regulated misstatements) + bit-exact audit (SHA-256 audit chain). Can run in parallel with COBOL retirement projects. |
| Central government / municipalities | Domestic supply + math-backed guarantees + air-gap audit — satisfies sole-source procurement conditions. Hallucination suppression for public-information Q&A (trust matters). |
| Healthcare / pharma | Suppression of hallucinations in EHR and clinical-support LLMs (cost of wrong answers is extreme). WAL + cascade rollback preserves clinical-log audit integrity. |
| Manufacturing / energy / telco | Semantic-driven constraints on operational control and approval engines (AI-independent). Server-less operation for embedded / edge. |
| SI vendors | Wrap a single layer on customers' existing LLMs (cloud or on-prem) to deliver "AI you can stand behind". Easy horizontal expansion. |
Capability × industry value matrix
RLM delivers four values at once — (1) LLM safety net (hallucination suppression), (2) audit substrate (WAL / cascade / audit chain), (3) operational governance (semantic-driven constraints), (4) edge operation (WASM 272 KB). Providing all four from the same implementation simplifies vendor consolidation.
Typical deployment topologies
- WASM host in front of cloud LLM: launch the WASM at the edge / gateway / API gateway; filter cloud LLM output semantically.
- Co-located with on-prem / private LLM: link as a Rust library inside the same process — no sidecar.
- Browser / mobile offline: fully offline operation on factory terminals, in-vehicle, store POS. Same WASM.
- Non-AI operational control: don't call an LLM at all — run RLM alone as a semantic-driven constraint / record layer (audit / decision).
For AI providers — "weights untouched", "no API", "Tier-crossing"
RLM lets you ship your end customers a layer that cuts incorrect answers by about 20 pt on benchmarks without touching a single bit of your LLM. The structural properties below differentiate it from existing safety layers.
(a) The model's weights are not touched
RLHF / Constitutional AI / o1 reasoning all retrain or alter the internal reasoning of the model. RLM is an external layer, so existing model deltas, contracts, and SLAs ship unchanged. Zero retraining cost.
(b) No API either — bind as a binary layer
WASM 272 KB / Rust library inserted at the function-call boundary of the inference pipeline. No HTTP API and no separate process; end-to-end responses come out 5.8× faster (short path via D dominance).
(c) Equalize Tier-A models to the same ceiling
Qwen3:8b and Llama 3.1:8b start 12 pt apart on baseline, but post-RLM both sit at 19% hallucination = 81% correct (see above). From the provider's view, swapping the base model preserves the same performance guarantee, which bears directly on model-selection freedom and long-term maintenance. The -20.4 ± 0.3 pt reduction also falls within the 10-25 pt range of Constitutional AI / o1 reasoning, and it holds as an architectural constant across 3 external benchmarks (a benchmark-agnostic property, not a magnitude race), so the numbers can serve as sales evidence on the provider's side.
(d) Helpfulness at 100% parity — do not break UX
The typical risk — "more refusals means less useful" — is empirically refuted: helpfulness measured at 100% parity across 40 pure helpfulness questions. "Answerable questions still answered, fabrication-prone questions tip into refusal" is architectural. Drop the hallucination KPI without dropping the Helpful-AI KPI.
(e) Audit log is standard equipment
The same layer carries WAL + cascade rollback + SHA-256 audit chain, so "explainability / audit requirements / regulatory compliance" need not be a separate layer. This reduces deployment friction dramatically for finance / public / healthcare SaaS.
Collaboration formats — WASM single-binary licensing, source-supplied contracts, joint benchmark runs (re-runnable on the customer's LLM), joint papers / press — can be designed individually for AI providers.
How to use — from evaluation to production
- Evaluation (browser / WASM): load the 272 KB WASM single file in one HTML page; wrap existing LLM output in your in-house environment and confirm the hallucination reduction. No server.
- PoC (representative benchmark reproduction): run a 100 Q trap suite on your LLM (8B-class or larger recommended) under conditions equivalent to SimpleQA / TruthfulQA / HaluEval-QA. 3-5 business days.
- Production integration: link as Rust library / WASM into the inference pipeline, or front it at the API Gateway / Edge. Wire WAL and the audit chain into your operational log systems.
- Audit / regulatory compliance: incorporate cascade rollback and the SHA-256 audit chain into operational audit trails (banking / healthcare / public sector).
- Operations: exploit the same-ceiling property to make LLM generation changes non-disruptive.
Validated results
- -20.4 ± 0.3 pt architectural constant (SimpleQA / TruthfulQA / HaluEval-QA: 3 external benchmarks, 2,290 questions × 3 seeds = 6,870 trials; baseline spans ~6.6×)
- Tier-A 8B-class converges to 81% correct ceiling (performance equalizer — the 12 pt baseline gap between Qwen3 and Llama 3.1 dissolves)
- Property A: variance absorption — routed σ is up to 4.7× tighter than baseline σ
- Helpfulness: 100% parity on 40 pure helpfulness questions
- Multilingual: Japanese +54 / English +24 / Arabic +7 hallucination improvement
- 5.8× faster responses (short path via D dominance, cache=200)
- 24× faster than the Python reference (Rust port)
- Zero data loss under 10,000-slot × 500-step stress (138 unit tests pass)
- Matches the magnitude of competing LLM-control techniques (Anthropic Constitutional AI / OpenAI o1 reasoning: 10-25 pt range), achieved as an architectural constant across 3 external benchmarks (the differentiator is benchmark-agnostic property, not magnitude)
Where it fits in production
- Safety net for business-LLM deployments: in domains where wrong answers are expensive (government, finance, healthcare), suppress hallucinations without changing the underlying LLM
- Audit and tamper detection: record business events with semantic constraints; preserve integrity for after-the-fact audit
- Explainable business rules and approvals: layer semantic constraints over rule definitions so decision rationale is reproducible
- Edge and embedded: 272 KB WASM, no server, offline-capable — ready for factory, in-vehicle, and store-terminal deployment
- AI provider differentiation: ship "the layer that delivers -20 pt" as an attach to your existing LLM offering
★ Local LM (on-prem GPU) deployment — run 12B models on RTX 5060 Ti class hardware
SlimeTree-RLM's R-meta verdict evaluates cloud LLM and local LLM outputs through the same interface. That means a Gemma 4 12B class model on in-house GPU can sit under the SlimeTree-RLM quality gate, and the enterprise runs with cloud billing essentially at zero.
In-house measurement (2026-06-05, RTX 5060 Ti)
Setup: NVIDIA GeForce RTX 5060 Ti (16 GB) / CUDA 13.1 / ollama 0.30.5 (gemma4 architecture native) / WSL2 Ubuntu / SlimeTree-RLM R-meta verdict integration.
| Metric | gemma3:12b | gemma4:12b Q4_K_M | gemma4:12b Q8_0 |
|---|---|---|---|
| Decode speed | 46.3 tok/s | 43.5 tok/s | 27.6 tok/s |
| Peak VRAM | 9.7 GB | 8.6 GB | 13.7 GB |
| SlimeTree-RLM judge p99 latency | ~100 µs | ~100 µs | ~100 µs |
| Sufficient rate (n=50) | 49/50 | 47/50 | 47/50 |
gemma4:12b Q4_K_M is the production default candidate (best speed / VRAM / quality balance). Q8_0 carries 1.58× the runtime and 1.59× the VRAM with no measurable quality lift on the same sample.
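The Q8_0 versus Q4_K_M comparison follows directly from the measurement table. The check below assumes runtime per token is the inverse of decode speed, which ignores prompt processing and load time.

```python
# Values copied from the in-house measurement table above.
q4 = {"tok_per_s": 43.5, "vram_gb": 8.6, "sufficient": 47}
q8 = {"tok_per_s": 27.6, "vram_gb": 13.7, "sufficient": 47}

# Time per token is 1 / decode speed, so the runtime ratio is the inverse speed ratio.
runtime_ratio = q4["tok_per_s"] / q8["tok_per_s"]
vram_ratio = q8["vram_gb"] / q4["vram_gb"]
quality_lift = q8["sufficient"] - q4["sufficient"]

print(f"Q8_0 runtime: {runtime_ratio:.2f}x, VRAM: {vram_ratio:.2f}x, "
      f"extra sufficient answers out of 50: {quality_lift}")
```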
4 viable patterns for enterprise Local LM migration
| Pattern | Scenario | SlimeTree-RLM role |
|---|---|---|
| A. Compliance-bound | Healthcare / legal / finance / defence — cloud LLMs blocked by regulation. 47/50 sufficient is a viable first-draft + human-review baseline. | SHA-256 audit chain satisfies audit requirements; R-meta verdict provides the explainability layer. |
| B. High-volume routine | 10M+ tokens/month routine (classification / summarisation / drafting / RAG). One RTX 5060 Ti sustains 3.6M tok/day; capex recovers in ~3 months. | Existing 60-80% D/µ reduction is preserved; the R verdict gate at µs scale guarantees no escalation overhead. |
| C. Narrow-domain specialist | Tax Q&A, manufacturing SOP, internal policy lookup, healthcare billing rules. LoRA fine-tuning lifts the base to frontier-general parity inside the domain. | D/µ/R three-layer gate suppresses the residual hallucination ridge after LoRA; the -20 pt architectural constant holds inside the specialist domain too. |
| D. Hybrid (the headline) | 90-95% local + 5-10% cloud frontier escalation. Frontier-class quality at 1/10 - 1/20 of the bill. | R-meta verdict is the routing decision itself. Pass = local; insufficient = escalate, decided in µs. |
Pattern B extended — 4-tier escalation
Insert Tier 0 = on-prem GPU Local LM below the existing 2-tier (Flash / Pro) escalation to drive the per-token rate effectively to zero.
| tier | LLM | Token rate | SlimeTree-RLM verdict behaviour |
|---|---|---|---|
| Tier 0 (new) | Local LM (Gemma 4 12B Q4_K_M etc.) | ¥0 / 1M tok (electricity only) | D/µ-processed R prompts answered locally; cloud billing skipped |
| Tier 1 | Gemini Flash (existing) | ~¥30 / 1M tok | Local verdict insufficient → Flash |
| Tier 2 | Gemini Pro / Claude Sonnet | ~¥500 / 1M tok | Flash insufficient → escalate to Pro |
| Tier 3 | Claude Opus / GPT-5 | ~¥5,000-15,000 / 1M tok | Frontier reasoning needed → final escalate to Opus |
Reduction: D/µ removes 60-80% of traffic before any LLM call. Of the remaining R fraction, tiers 0/1 absorb 70-95%, so frontier billing (tier 3) applies to only 3-10% of actual traffic. A ¥1M / month cloud LLM spend ends up at ¥30-100k.
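The escalation ladder can be sketched as a loop that tries the cheapest tier first and stops at the first answer judged sufficient. Everything callable here is a placeholder: `answer_fns` and `is_sufficient` stand in for the real model calls and the R-meta verdict, the tier-3 rate uses the lower bound of the quoted range, and the traffic shares in the cost example are hypothetical.

```python
# (tier name, approximate yen per 1M tokens) from the table above.
TIERS = [
    ("tier0-local", 0.0),
    ("tier1-flash", 30.0),
    ("tier2-pro", 500.0),
    ("tier3-frontier", 5000.0),  # lower bound of the quoted 5,000-15,000 range
]


def route(prompt, answer_fns, is_sufficient):
    """Try tiers cheapest first; the last tier's answer is accepted unconditionally."""
    for name, _rate in TIERS[:-1]:
        answer = answer_fns[name](prompt)
        if is_sufficient(answer):
            return name, answer
    last = TIERS[-1][0]
    return last, answer_fns[last](prompt)


def blended_rate(resolved_share):
    """Average yen per 1M tokens when a request pays for every tier it passed through."""
    total, paid_so_far = 0.0, 0.0
    for name, rate in TIERS:
        paid_so_far += rate
        total += resolved_share[name] * paid_so_far
    return total


# Stub models and verdict: in this toy setup only the tier-2 answer is sufficient.
answer_fns = {name: (lambda prompt, n=name: f"{n}:{prompt}") for name, _ in TIERS}
chosen, answer = route("q", answer_fns, lambda a: a.startswith("tier2"))

# Hypothetical share of R-traffic resolved at each tier.
shares = {"tier0-local": 0.80, "tier1-flash": 0.12, "tier2-pro": 0.05, "tier3-frontier": 0.03}
rate = blended_rate(shares)
print(chosen, f"{rate:.1f} yen per 1M tokens")
```

Charging each request for the tiers it failed on the way up is a deliberately pessimistic choice; a design that discards failed drafts at no cost would come out cheaper.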
Frontier residual (honest)
A 12B-class Local LM does NOT replace frontier (Claude Opus / GPT-5 / Gemini Pro) in:
- Multi-step agentic reasoning, novel coding problems, complex mathematical proofs — frontier wins.
- 100k+ token long-context understanding — advertised but practical quality declines.
- Dialogue UX requiring sub-2s responses (12B takes 12-19s/response in our setup, best for batch / async).
- Deep nuance in non-English languages (1-2 years to catch up with frontier).
Leave the above to cloud frontier; absorb everything else locally. SlimeTree-RLM verdict is the µs router that decides which to send where.
Related
- /integrations/#multi-agent — Local LM extension inside the multi-agent framework (technical detail)
- /service/ai/ — Local LM migration / on-prem AI deployment service
Resources / citations
Paper / patent
- Paper (published on Zenodo, CC-BY 4.0): "SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; 2026-01-14). DOI: 10.5281/zenodo.18238339. Direct PDF (968.7 KB) / Zenodo record.
- Japanese version v2 (jxiv, 15 pages, ~24,685 chars; submission in preparation).
- Patent claims 1-44: two-tier memory (Hot/Cold shelves), 3-mode router, Bernstein-commutator parallelism, Hilbert / SpiralIndex, WAL + cascade rollback, (SemanticTime, SensoryTime) tuple, credibility / forget_index, etc.
Benchmark data
- SimpleQA (OpenAI, 500 Q × 3 seeds = 1,500 trials) — incorrect -20.5 pt, F +3.7 pt.
- TruthfulQA (Lin et al., 790 Q × 3 seeds = 2,370 trials) — incorrect -20.1 pt, Truth +20.1 pt.
- HaluEval-QA (HotpotQA-derived, 1,000 Q × 3 seeds = 3,000 trials) — incorrect -20.7 pt, F +21.4 pt.
- 3-benchmark combined: 6,870 trials, -20.4 ± 0.3 pt architectural constant.
- 4 LLM cross-validation: Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B.
Implementations
- Python reference (impl/): 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo.
- Improved version (impl_v2/): Phase A → B subtype-aware routing complete; 81.3% (σ=4%) at cache=200.
- Rust port + WASM: 272 KB, 24× vs Python, 138 unit tests, 10K-slot × 500-step stress.
Related — blog / news / products
- Deep-dive blog: Just 272 KB to cut LLM hallucinations to one-third — SlimeTree-RLM (Japanese, 7 chapters)
- Applied blog (Phase D corpus + LoRA + vLLM): Gemma 4 12B on RTX 5060 Ti, 1000 prompts — enterprise Local AI: where it stands (9 chapters, LoRA + vLLM addendum, with Errata)
- News: research releases and announcements (2026-05-23 Rust/WASM port + LLM application release, and more)
- Same family, simple-record-system variant: SlimeTree-VSAM
- Category: DEVICE products
Get it / contact
For WASM single-binary evaluation, PoC (reproduction on your LLM), AI-provider partnership, SIer / reseller programme, and paper / patent material requests:
