★ DEVICE primary ★ AI application

Slime storage family / Semantic-driven record system (Patent claims 1-44)

SlimeTree-RLM

A semantic-driven record system you drop on existing systems as an orthogonal layer.
Across 3 external benchmarks it lowers the incorrect-answer rate by a cross-validated -20.4 ± 0.3 pt, tight enough to be treated as an architectural constant.

A single Rust binary, 272 KB. Runs in browser, mobile, and embedded with no server required. Suppresses LLM "plausible-sounding lies" without touching a single bit of model weights. Works for AI (LLM safety) and non-AI (audit / decision / business control) alike.

-20.4 pt: incorrect-rate Δ (Qwen3:8b; 3 external benchmarks, 2,290 questions × 3 seeds = 6,870 trials; constant to within ±0.3 pt)
81%: Tier-A 8B-class ceiling (separate experiment: 100 traps × 4 LLMs, performance equalizer)
272 KB: single Rust WASM binary, no server
24×: speed of the Rust port vs the Python reference

3 external benchmarks validated  4 LLM cross-validation  WASM-deployable  Paper v10 / Patent 1-44

What it does

SlimeTree-RLM adds semantic-driven constraints and audit to existing AI models, decision engines, and business-rule layers as an orthogonal layer: lightweight, deterministic, and drop-in. It suppresses "plausible-sounding lies" structurally without touching a single bit of LLM weights, while simultaneously providing WAL (operational record) + cascade rollback + SHA-256 audit chain as standard equipment. Not AI-only: it doubles as the semantic-constraint layer for audit, approval, and operational control.

Pull "record" and "constraint" out of the model

Existing AI safety approaches (RLHF / Constitutional AI / o1 reasoning) all retrain the model or change its internal reasoning. RLM does the opposite: leave the model untouched, drop a "semantic record system" on the outside. The record system is a deterministic Rust implementation that holds input/output facts (as SemanticTime + SensoryTime pairs) plus semantic constraints, and tips generation into refusal when needed. The result, structurally: the model becomes swappable, audit ships built-in, regulations align, and offline operation works — all at once.

Position: The semantic-driven variant of the Slime storage family. Counterpart: the simple-record-system variant SlimeTree-VSAM (LANGUAGE primary, 480× faster than PostgreSQL on VSAM compatibility). RLM is DEVICE primary, AI and non-AI alike.

How it works — structural suppression

1) Branch-free 3-mode router (D / μ / R)

Input is weight-routed across three inference modes: D (Decisive, fast), μ (Moderation: refusal / "I do not know" answers), and R (Reasoning, deep). There is no if/else branching: each mode's weight is updated as w ← w · exp(−η · regret), so it decays exponentially with that mode's failure signal. As failing modes lose weight, μ refusals rise naturally and prune hallucinations as "unnecessary generation". This is the part that implements the paper's central thesis (hallucination = unnecessary generation) at the action level.
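The weight update above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the learning rate `eta`, the regret values, the renormalisation step, and the "pick the heaviest mode" rule are hypothetical choices, not taken from the paper or the Rust implementation.

```python
import math

MODES = ("D", "mu", "R")  # Decisive, Moderation (refusal), Reasoning


def update_weights(weights, regrets, eta=0.5):
    """Apply w <- w * exp(-eta * regret) to each mode, then renormalise."""
    raw = {m: weights[m] * math.exp(-eta * regrets[m]) for m in MODES}
    total = sum(raw.values())
    return {m: raw[m] / total for m in MODES}


def route(weights):
    """Pick the mode with the largest weight (assumed selection rule)."""
    return max(MODES, key=lambda m: weights[m])


weights = {m: 1.0 / 3.0 for m in MODES}
# Hypothetical failure signal: D and R keep producing wrong answers, mu does not.
for _ in range(5):
    weights = update_weights(weights, {"D": 1.0, "mu": 0.0, "R": 0.6})
chosen = route(weights)
```

With these made-up regrets the weight of μ grows relative to D and R after every round, so the router ends up preferring refusal without any mode-specific rule being written.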

2) Three-tier memory: Hot Shelf (Treap) / Cold Shelf (Red-Black Tree) / Inactive Queue

Slots of meaning are stored not by wall-clock time but by (SemanticTime, SensoryTime) tuples, flowing across Hot (Treap), Cold (RB-Tree), and Inactive shelves according to credibility and forget_index. "Recently-used meanings" and "forgettable meanings" are separated structurally — no need to scan a giant context with full strength every time, so compute drops and responses speed up.
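A rough sketch of the shelf placement described above, in Python. Plain dicts stand in for the Treap and the Red-Black Tree, and the two thresholds (`hot_min_cred`, `inactive_min_forget`), the 0..1 score ranges, and the class names are assumptions made for illustration only; moving a slot back out of the Inactive Queue is left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Slot:
    key: tuple            # (SemanticTime, SensoryTime)
    meaning: str
    credibility: float    # assumed range 0..1
    forget_index: float   # assumed range 0..1, higher = more forgettable


class Shelves:
    def __init__(self, hot_min_cred=0.7, inactive_min_forget=0.8):
        self.hot = {}             # stand-in for the Treap
        self.cold = {}            # stand-in for the Red-Black Tree
        self.inactive = deque()   # Inactive Queue
        self.hot_min_cred = hot_min_cred
        self.inactive_min_forget = inactive_min_forget

    def place(self, slot):
        """File a slot on the shelf matching its current scores."""
        self.hot.pop(slot.key, None)
        self.cold.pop(slot.key, None)
        if slot.forget_index >= self.inactive_min_forget:
            self.inactive.append(slot)
        elif slot.credibility >= self.hot_min_cred:
            self.hot[slot.key] = slot
        else:
            self.cold[slot.key] = slot

    def lookup(self, key):
        """Check Hot first, then Cold; the Inactive Queue is never scanned."""
        return self.hot.get(key) or self.cold.get(key)


shelves = Shelves()
shelves.place(Slot((3, 10), "invoice approved", 0.9, 0.1))
shelves.place(Slot((3, 11), "rumoured merger", 0.4, 0.2))
shelves.place(Slot((1, 2), "stale session note", 0.5, 0.95))
```

The point of the sketch is the lookup path: a query touches the small Hot shelf first and never walks the forgettable slots.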

3) Parallelism and locality — Bernstein commutator + Hilbert / SpiralIndex

Semantic dependencies live on an operator ring; Bernstein commutators decide parallel feasibility mechanically. Maximal cliques are greedily covered into mutually-disjoint parallel groups. Physical slot layout uses the Hilbert curve and SpiralIndex (logarithmic spacing) to preserve spatial locality so nearby meanings land in nearby memory.
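To make the parallel-grouping idea concrete, here is a minimal Python sketch. It assumes the classic read/write-set form of Bernstein's conditions as the commutation test and a simple first-fit greedy grouping; the `Op` type and its fields are hypothetical, and the operator-ring formulation and the Hilbert / SpiralIndex layout are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: frozenset
    writes: frozenset


def commutes(a, b):
    """Bernstein's conditions: no write/read, read/write, or write/write overlap."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def parallel_groups(ops):
    """Greedy cover: put each op in the first group whose members all commute with it."""
    groups = []
    for op in ops:
        for group in groups:
            if all(commutes(op, other) for other in group):
                group.append(op)
                break
        else:
            groups.append([op])
    return groups


ops = [
    Op("a", frozenset({"x"}), frozenset({"y"})),
    Op("b", frozenset({"y"}), frozenset({"z"})),   # reads what "a" writes
    Op("c", frozenset({"p"}), frozenset({"q"})),   # independent of both
]
groups = [[op.name for op in g] for g in parallel_groups(ops)]
```

Every group is a set of pairwise-commuting operations, so its members can run in parallel; here "a" and "c" share a group while "b" must wait for "a".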

4) Standard-equipment WAL + cascade rollback + audit chain

Every operation is recorded in a WAL (write-ahead log); cascades of failure can be undone via cascade rollback that only propagates on the non-commuting side. The record carries a built-in SHA-256 audit chain for tamper detection. This is an AI safety layer AND a regulatory bit-exact audit substrate at the same time.
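The tamper-detection part can be illustrated with a generic SHA-256 hash chain in Python. This is a sketch of the general technique, not the product's record format: the JSON encoding, the all-zero genesis hash, and the function names are assumptions, and cascade rollback is not shown.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash for the first record


def append_record(log, payload):
    """Append a WAL-style record whose hash covers the previous record's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    log.append({"prev": prev_hash, "payload": payload, "hash": digest})


def verify(log):
    """Recompute every hash; return the index of the first bad record, or None."""
    prev_hash = GENESIS
    for i, rec in enumerate(log):
        body = json.dumps({"prev": prev_hash, "payload": rec["payload"]}, sort_keys=True)
        recomputed = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if rec["prev"] != prev_hash or recomputed != rec["hash"]:
            return i
        prev_hash = rec["hash"]
    return None


log = []
append_record(log, {"op": "write", "slot": 1, "value": "approved"})
append_record(log, {"op": "write", "slot": 2, "value": "refused"})
intact = verify(log)                    # None: chain is consistent
log[0]["payload"]["value"] = "denied"   # simulate tampering with an old record
tampered_at = verify(log)               # 0: first record no longer matches its hash
```

Because each hash covers the previous one, changing any historical record invalidates that record's hash and breaks the link to everything after it.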

(Full theory in paper v10. Implementation: Python v0.1 + Rust port + WASM. Patent claims 1-44 cover the architecture. See "Resources" below.)

Key specs

Binary size: 272 KB single Rust WASM binary (plus a 2,210-line Python reference implementation)
Runtime: Browser / mobile / embedded / server; all run with no server required (deploy anywhere)
Speed: 24× vs Python reference. When applied as the inference layer, LLM responses themselves get 5.8× faster (short path via D dominance)
Robustness: Zero data loss across 10,000-slot × 500-step stress (138 unit tests pass)
Integration: Orthogonal layer over existing systems (LLM / decision engine / business rules; AI or non-AI)
Audit / record: WAL + cascade rollback + SHA-256 audit chain as standard equipment
Family: Slime storage family / DEVICE primary
Delivery: WASM single-file distribution (evaluation) + individual engagement (production)
Paper / patent: Paper v10 (English, 33 pages) + jxiv Japanese v2 (15 pages) + patent claims 1-44

3 external benchmarks — the -20.4 ± 0.3 pt architectural constant

Validated not on self-made benchmarks but on 3 external public benchmarks: 2,290 distinct questions × 3 seeds each = 6,870 trials. This section is measured with Qwen3:8b as the primary base model (cross-model results are in the next "4-LLM cross-validation" section, an independent experiment). The 3 benchmarks have independent question authoring and independent difficulty axes; baseline scores span approximately 6.6× (from 6.5% to 42.9%; 42.9 / 6.5 ≈ 6.6). Despite this, the incorrect-rate suppression converges to an extremely tight constant: -20.4 ± 0.3 pt. This indicates that cascade-based hallucination suppression is a benchmark-agnostic architectural property, not an artifact tuned to a specific benchmark.

Benchmark | Axis | Scale (n × seeds) | Baseline | Routed (RLM) | Δ incorrect | Δ F-score
SimpleQA (OpenAI) | T1: long-tail entity | 500 × 3 | F 6.5% σ=0.31 | F 10.2% σ=0.47 | -20.5 pt | +3.7 pt
TruthfulQA (Lin et al.) | T5+T6: misconception / trick | 790 × 3 seeds (790 of standard 817, those that admit binary scoring) | Truth 9.9% σ=0.31 | Truth 30.0% σ=0.10 | -20.1 pt | +20.1 pt
HaluEval-QA (HotpotQA-derived) | T2+T6: false premise / multi-hop | 1,000 × 3 | F 42.9% σ=1.23 | F 64.3% σ=0.26 | -20.7 pt | +21.4 pt
3-bench combined (architectural constant) | T1 ↔ T5+T6 ↔ T2+T6 (full axis cover) | 2,290 distinct Q × 3 seeds = 6,870 trials | baseline 6.5% → 42.9% (~6.6× spread) | | -20.4 ± 0.3 pt ★ | +3.7 to +21.4 pt
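The combined figures can be re-derived from the three per-benchmark rows. The Python sketch below assumes that "± 0.3" is the sample standard deviation of the three Δ incorrect values; the source does not state how the spread was computed, so treat that as an assumption.

```python
import statistics

delta_incorrect = {"SimpleQA": -20.5, "TruthfulQA": -20.1, "HaluEval-QA": -20.7}
questions = {"SimpleQA": 500, "TruthfulQA": 790, "HaluEval-QA": 1000}
values = list(delta_incorrect.values())

mean = statistics.mean(values)        # average Δ incorrect across benchmarks
spread = statistics.stdev(values)     # sample standard deviation (n - 1), assumed
baseline_ratio = 42.9 / 6.5           # highest baseline score over lowest
trials = sum(questions.values()) * 3  # distinct questions × 3 seeds

print(f"{mean:.1f} ± {spread:.1f} pt, baseline spread {baseline_ratio:.1f}x, {trials} trials")
```

Under that assumption the script prints `-20.4 ± 0.3 pt, baseline spread 6.6x, 6870 trials`, matching the combined row.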

Property A: variance absorption — scales with baseline noise

The variance tightening effect scales with baseline σ. On the quiet SimpleQA (baseline σ=0.31), routed σ=0.47 — slightly wider. By contrast, TruthfulQA (σ=0.31 → 0.10) is 3.1× tighter, and HaluEval-QA (σ=1.23 → 0.26) is up to 4.7× tighter. Dynamic strength scaling: the noisier the baseline, the stronger the cascade's variance tightening; on quiet baselines the effect is null or slightly wider — not a universal law but a noise-conditional property (by design).
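The tightening factors quoted above are simply baseline σ divided by routed σ. A short Python check using only the σ values from the benchmark table (a ratio below 1 means the routed runs are wider than baseline):

```python
# (baseline sigma, routed sigma) per benchmark, as reported in the table
sigma = {
    "SimpleQA": (0.31, 0.47),
    "TruthfulQA": (0.31, 0.10),
    "HaluEval-QA": (1.23, 0.26),
}
tightening = {name: base / routed for name, (base, routed) in sigma.items()}
```

This gives roughly 0.66 for SimpleQA (slightly wider), 3.1 for TruthfulQA, and 4.7 for HaluEval-QA.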

"Is this just a mechanical refusal increase?" — rebuttal

Incorrect dropping uniformly by -20.4 pt across 3 benchmarks could be read as "μ routing simply raises refusal by a fixed amount". SimpleQA's breakdown refutes this: incorrect 86.6% → 66.1% (-20.5 pt), not_attempted 6.4% → 28.2% (+21.8 pt), correct nearly unchanged (7.0% → 5.7%, -1.3 pt). μ tips into refusal only on originally-incorrect questions, leaving correct ones essentially alone — evidence that μ structurally distinguishes question difficulty, consistent with the helpfulness 100% parity (40 Q non-trap) below. (Per-bench abstention rates for TruthfulQA / HaluEval-QA are in paper v10 §4; contact for access.)
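The SimpleQA breakdown can be checked arithmetically. The Python sketch below uses only the rates quoted above; because these are aggregate rates, the "share" it computes is a net figure and says nothing about individual questions.

```python
baseline = {"correct": 7.0, "incorrect": 86.6, "not_attempted": 6.4}
routed = {"correct": 5.7, "incorrect": 66.1, "not_attempted": 28.2}

# Change in each outcome rate, in percentage points
delta = {k: round(routed[k] - baseline[k], 1) for k in baseline}

# Net share of the added refusals that is matched by lost correct answers
from_correct_share = -delta["correct"] / delta["not_attempted"]
```

Both distributions sum to 100%, the three deltas cancel out, and in net terms only about 6% of the added refusals (1.3 of 21.8 pt) are matched by lost correct answers; the rest is matched by the drop in incorrect answers.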

Helpfulness is not lost

The design reduces incorrect answers by increasing refusals, but on 40 pure helpfulness questions (non-trap) helpfulness is empirically at 100% parity. "Questions that can be answered correctly are still answered correctly; only the questions that would have triggered a lie tip into refusal." This is established as an architectural property.

4 LLM cross-validation — Tier-A 8B-class converges at the same 81% ceiling (the performance equalizer)

To show SlimeTree-RLM is not a Qwen3-only number, we ran cross-validation on 4 LLMs (8B / 7B / 4B class) under identical conditions. Every LLM shows hallucination suppression; in particular both Tier-A models (Qwen3 and Llama 3.1, 8B-class) land at 19% hallucination = 81% ceiling after routing.

LLM | Size | Baseline halluc | Routed halluc | Δ halluc | Δ Latency | Δ Tokens
Qwen3:8b | 8B | 63% | 19% | -44 pt | -85.7% | -21.0%
Llama 3.1:8b | 8B | 51% | 19% | -32 pt | -83.3% | -24.6%
Mistral 7B | 7B | 70% | 51% | -19 pt | -74.8% | -1.9%
Gemma 3:4B | 4B | 79% | 59% | -20 pt | -79.3% | -35.2%

★ Performance equalizer: Two 8B-class models that started 12 pt apart (Qwen3 63% vs Llama 3.1 51% hallucination) both end at 19% hallucination = 81% correct ceiling after RLM. In other words, within the same Tier, the choice of LLM stops mattering — from the AI-provider's view this is structural dissolution of LLM vendor lock-in.

Validated across languages too: Japanese +54 pt / English +24 pt / Arabic +7 pt hallucination improvement (paper v10, §3 multilingual matrix).

Enterprise effects — "LLM safety net" + "audit substrate" in one

What RLM solves in an enterprise spans the AI domain and the audit / governance domain simultaneously. Satisfying both with the same single Rust binary changes the cost structure.

Where it lands by industry

Banks / major insurers: Safety layer for business LLM deployments (structural suppression of regulated misstatements) + bit-exact audit (SHA-256 audit chain). Can run in parallel with COBOL retirement projects.
Central government / municipalities: Domestic supply + math-backed guarantees + air-gap audit; satisfies sole-source procurement conditions. Hallucination suppression for public-information Q&A (trust matters).
Healthcare / pharma: Suppression of hallucinations in EHR and clinical-support LLMs (cost of wrong answers is extreme). WAL + cascade rollback preserves clinical-log audit integrity.
Manufacturing / energy / telco: Semantic-driven constraints on operational control and approval engines (AI-independent). Server-less operation for embedded / edge.
SI vendors: Wrap a single layer on customers' existing LLMs (cloud or on-prem) to deliver "AI you can stand behind". Easy horizontal expansion.

Capability × industry value matrix

RLM delivers four values at once — (1) LLM safety net (hallucination suppression), (2) audit substrate (WAL / cascade / audit chain), (3) operational governance (semantic-driven constraints), (4) edge operation (WASM 272 KB). Providing all four from the same implementation simplifies vendor consolidation.

Typical deployment topologies

  • WASM host in front of cloud LLM: launch the WASM at the edge / gateway / API gateway; filter cloud LLM output semantically.
  • Co-located with on-prem / private LLM: link as a Rust library inside the same process — no sidecar.
  • Browser / mobile offline: fully offline operation on factory terminals, in-vehicle, store POS. Same WASM.
  • Non-AI operational control: don't call an LLM at all — run RLM alone as a semantic-driven constraint / record layer (audit / decision).

For AI providers — "weights untouched", "no API", "Tier-crossing"

RLM lets you ship your end customers "a layer that delivers -20 pt on the benchmark without touching a single bit of your LLM". The structural properties below differentiate it from existing safety layers.

(a) The model's weights are not touched

RLHF / Constitutional AI / o1 reasoning all retrain or alter the internal reasoning of the model. RLM is an external layer, so existing model deltas, contracts, and SLAs ship unchanged. Zero retraining cost.

(b) No API either — bind as a binary layer

The 272 KB WASM binary (or the Rust library) is inserted at the function-call boundary of the inference pipeline. There is no HTTP API and no separate process, and responses come out 5.8× faster overall (short path via D dominance).

(c) Equalize Tier-A models to the same ceiling

Qwen3:8b and Llama 3.1:8b start 12 pt apart on baseline, but post-RLM both sit at 19% hallucination = 81% correct (see above). From the provider's view, "swapping the base model preserves the same performance guarantee", which is directly relevant to model-selection freedom and long-term maintenance. The -20.4 ± 0.3 pt reduction also matches the magnitude of the 10-25 pt range of Constitutional AI / o1 reasoning, and it is achieved as an architectural constant across 3 external benchmarks (a benchmark-agnostic property, not a magnitude race). Together, these numbers function as sales evidence on the provider's side.

(d) Helpfulness at 100% parity — do not break UX

The typical risk — "more refusals means less useful" — is empirically refuted: helpfulness measured at 100% parity across 40 pure helpfulness questions. "Answerable questions still answered, fabrication-prone questions tip into refusal" is architectural. Drop the hallucination KPI without dropping the Helpful-AI KPI.

(e) Audit log is standard equipment

The same layer carries WAL + cascade rollback + SHA-256 audit chain, so "explainability / audit requirements / regulatory compliance" need not be a separate layer. This reduces deployment friction dramatically for finance / public / healthcare SaaS.

Collaboration formats — WASM single-binary licensing, source-supplied contracts, joint benchmark runs (re-runnable on the customer's LLM), joint papers / press — can be designed individually for AI providers.

How to use — from evaluation to production

  1. Evaluation (browser / WASM): load the 272 KB WASM single file in one HTML page; wrap existing LLM output in your in-house environment and confirm the hallucination reduction. No server.
  2. PoC (representative benchmark reproduction): run a 100 Q trap suite on your LLM (8B-class or larger recommended) under conditions equivalent to SimpleQA / TruthfulQA / HaluEval-QA. 3-5 business days.
  3. Production integration: link as Rust library / WASM into the inference pipeline, or front it at the API Gateway / Edge. Wire WAL and the audit chain into your operational log systems.
  4. Audit / regulatory compliance: incorporate cascade rollback and the SHA-256 audit chain into operational audit trails (banking / healthcare / public sector).
  5. Operations: exploit the same-ceiling property to make LLM generation changes non-disruptive.
Recommended environment: any WebAssembly-capable runtime for evaluation (browser / wasmtime / wasmer / Wasmer Edge). Production allows native Rust linkage in server / edge / embedded. LLMs at 8B-class and above show the same-ceiling effect; 7B / 4B classes still show positive improvement.

Validated results

  • -20.4 ± 0.3 pt architectural constant (SimpleQA / TruthfulQA / HaluEval-QA: 3 external benchmarks, 2,290 questions × 3 seeds = 6,870 trials, baseline spans ~6.6×)
  • Tier-A 8B-class converges to 81% correct ceiling (performance equalizer — the 12 pt baseline gap between Qwen3 and Llama 3.1 dissolves)
  • Property A: variance absorption — routed σ is up to 4.7× tighter than baseline σ
  • Helpfulness: 100% parity on 40 pure helpfulness questions
  • Multilingual: Japanese +54 / English +24 / Arabic +7 hallucination improvement
  • 5.8× faster responses (short path via D dominance, cache=200)
  • 24× faster than the Python reference (Rust port)
  • Zero data loss under 10,000-slot × 500-step stress (138 unit tests pass)
  • Matches the magnitude of competing LLM-control techniques (Anthropic Constitutional AI / OpenAI o1 reasoning: 10-25 pt range), achieved as an architectural constant across 3 external benchmarks (the differentiator is benchmark-agnostic property, not magnitude)

Where it fits in production

  • Safety net for business-LLM deployments: in domains where wrong answers are expensive (government, finance, healthcare), suppress hallucinations without changing the underlying LLM
  • Audit and tamper detection: record business events with semantic constraints; preserve integrity for after-the-fact audit
  • Explainable business rules and approvals: layer semantic constraints over rule definitions so decision rationale is reproducible
  • Edge and embedded: 272 KB WASM, no server, offline-capable — ready for factory, in-vehicle, and store-terminal deployment
  • AI provider differentiation: ship "the layer that delivers -20 pt" as an attach to your existing LLM offering

Resources / citations

Paper / patent

  • Paper (published on Zenodo, CC-BY 4.0): "SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; 2026-01-14).
    DOI: 10.5281/zenodo.18238339. Direct PDF (968.7 KB) / Zenodo record.
  • Japanese version v2 (jxiv, 15 pages, ~24,685 chars; submission in preparation).
  • Patent claims 1-44: two-tier memory (Hot/Cold shelves), 3-mode router, Bernstein-commutator parallelism, Hilbert / SpiralIndex, WAL + cascade rollback, (SemanticTime, SensoryTime) tuple, credibility / forget_index, etc.

Benchmark data

  • SimpleQA (OpenAI, 500 Q × 3 seeds = 1,500 trials) — incorrect -20.5 pt, F +3.7 pt.
  • TruthfulQA (Lin et al., 790 Q × 3 seeds = 2,370 trials) — incorrect -20.1 pt, Truth +20.1 pt.
  • HaluEval-QA (HotpotQA-derived, 1,000 Q × 3 seeds = 3,000 trials) — incorrect -20.7 pt, F +21.4 pt.
  • 3-benchmark combined: 6,870 trials, -20.4 ± 0.3 pt architectural constant.
  • 4 LLM cross-validation: Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B.

Implementations

  • Python reference (impl/): 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo.
  • Improved version (impl_v2/): Phase A → B subtype-aware routing complete; 81.3% (σ=4%) at cache=200.
  • Rust port + WASM: 272 KB, 24× vs Python, 138 unit tests, 10K-slot × 500-step stress.

Get it / contact

For WASM single-binary evaluation, PoC (reproduction on your LLM), AI-provider partnership, SI vendor / reseller programme, and paper / patent material requests:

Contact   Partners