★ DEVICE primary ★ AI application

Slime storage family / Semantic-driven record system (Patent claims 1-44)

SlimeTree-RLM

A semantic-driven record system you drop on existing systems as an orthogonal layer.
Cross-validated as a -20.4 ± 0.3 pt architectural constant across 3 external benchmarks.

A single Rust binary, 272 KB. Runs in browser, mobile, and embedded with no server required. Suppresses LLM "plausible-sounding lies" without touching a single bit of model weights. Works for AI (LLM safety) and non-AI (audit / decision / business control) alike.

-20.4 pt

incorrect-rate Δ (abstention +21.8 to +32.0 pt / correct −1.3 to −11.3 pt / TruthfulQA primary T×I +0.7 pt)
Eval system qwen3:8b / Q4_K_M (post-training quantisation, digest 500a1f06) / TruthfulQA 790, SimpleQA 500, HaluEval-QA 1000 × 3 seeds = 6,870 trials / ±0.3 is the sample sd
Production runs gemma4:12b-it-qat, which is Q4_0 (QAT) — a different quantisation family, so this figure does not carry over.

81%

Tier-A 8B-class ceiling (separate experiment: 100 traps × 4 LLMs, performance equalizer)

272 KB

Rust single WASM, no server

24×

vs Python reference (Rust port)

3 external benchmarks validated · 4 LLM cross-validation · WASM-deployable · Paper v10 / Patent 1-44

▶ Try the live demo (see D/μ/R routing) ※ The public demo is a joke build (no external AI, $0, no internal data). The real in-house AI is on the service page, authenticated.

Suppresses LLM hallucinations (plausible lies) without changing any weights. Measured −20.4 ± 0.3 pt on incorrect-rate across 3 external benchmarks × 3 seeds = 6,870 trials (those answers move into abstention, and correct-rate also falls by 1.3 to 11.3 pt). A "performance equalizer": in a separate 4-LLM experiment, both 8B-class models converge to the same 81% ceiling. Procedure, rubric and seeds are all public.

What it does

SlimeTree-RLM adds semantic-driven constraints and audit to existing AI models, decision engines, and business-rule layers as an orthogonal layer: lightweight, deterministic, and drop-in. It suppresses "plausible-sounding lies" structurally without touching a single bit of LLM weights, while simultaneously providing WAL (operational record) + cascade rollback + SHA-256 audit chain as standard equipment. Not AI-only: it doubles as the semantic-constraint layer for audit, approval, and operational control.

Pull "record" and "constraint" out of the model

Existing AI safety approaches (RLHF / Constitutional AI / o1 reasoning) all retrain the model or change its internal reasoning. RLM does the opposite: leave the model untouched, drop a "semantic record system" on the outside. The record system is a deterministic Rust implementation that holds input/output facts (as SemanticTime + SensoryTime pairs) plus semantic constraints, and tips generation into refusal when needed. The result, structurally: the model becomes swappable, audit ships built-in, regulations align, and offline operation works — all at once.

Position: The semantic-driven variant of the Slime storage family. Counterpart: the simple-record-system variant SlimeTree-VSAM (LANGUAGE primary, 480× faster than PostgreSQL on VSAM compatibility). RLM is DEVICE primary, AI and non-AI alike.

How it works — structural suppression

1) Branch-free 3-mode router (D / μ / R)

Input is weight-routed across three inference modes: D (Decisive, fast), μ (Moderation, refusal / "I do not know" answers), and R (Reasoning, deep). No branching: each mode's weight w is updated as w · exp(−η · regret), so it decays exponentially against the failure signal. As failing modes lose weight, μ refusals rise naturally, pruning hallucinations as "unnecessary generation". This is the component that implements the paper's central thesis (hallucination = unnecessary generation) at the action level.
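The weight rule above can be sketched as a multiplicative-weights update. This is a minimal illustration, not the shipped router: the value of `eta`, the regret figures, and the renormalisation step are assumptions made for the example.

```python
import math

MODES = ("D", "mu", "R")  # Decisive, Moderation (refusal), Reasoning


def update_weights(weights, regrets, eta=0.5):
    """Apply w <- w * exp(-eta * regret) to every mode, then renormalise.

    The same arithmetic runs for all three modes; there is no
    mode-specific if/else. `eta` is an illustrative assumption.
    """
    scaled = {m: weights[m] * math.exp(-eta * regrets[m]) for m in MODES}
    total = sum(scaled.values())
    return {m: scaled[m] / total for m in MODES}


# Start from uniform weights.
weights = {m: 1.0 / 3.0 for m in MODES}

# Hypothetical failure signal: D and R keep producing wrong answers,
# while refusing (mu) incurs no regret.
for _ in range(5):
    weights = update_weights(weights, {"D": 1.0, "mu": 0.0, "R": 0.6})

print({m: round(w, 3) for m, w in weights.items()})
```

Under this assumed signal the μ weight grows at the expense of D and R without any explicit rule saying "refuse now", which is the behaviour the paragraph describes.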

2) Three-tier memory: Hot Shelf (Treap) / Cold Shelf (Red-Black Tree) / Inactive Queue

Slots of meaning are stored not by wall-clock time but by (SemanticTime, SensoryTime) tuples, flowing across Hot (Treap), Cold (RB-Tree), and Inactive shelves according to credibility and forget_index. "Recently-used meanings" and "forgettable meanings" are separated structurally — no need to scan a giant context with full strength every time, so compute drops and responses speed up.
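A rough sketch of the shelf placement follows. It keys slots by the (SemanticTime, SensoryTime) tuple and picks a shelf from credibility and forget_index. The thresholds `hot_cred` and `inactive_forget` are hypothetical, and plain dicts stand in for the Treap and Red-Black Tree, so this shows only the flow between shelves, not the real data structures.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    semantic_time: int
    sensory_time: int
    credibility: float   # assumed range 0.0 .. 1.0, higher = more trusted
    forget_index: float  # assumed range 0.0 .. 1.0, higher = more forgettable
    payload: str


def choose_tier(slot, hot_cred=0.7, inactive_forget=0.8):
    """Pick a shelf; both thresholds are illustrative assumptions."""
    if slot.forget_index >= inactive_forget:
        return "inactive"
    if slot.credibility >= hot_cred:
        return "hot"
    return "cold"


class ThreeTierMemory:
    def __init__(self):
        # dicts stand in for Treap (hot) / RB-Tree (cold) / queue (inactive)
        self.tiers = {"hot": {}, "cold": {}, "inactive": {}}

    def put(self, slot):
        key = (slot.semantic_time, slot.sensory_time)
        for tier in self.tiers.values():
            tier.pop(key, None)  # a slot lives on exactly one shelf
        self.tiers[choose_tier(slot)][key] = slot

    def lookup(self, key):
        """Search only the active shelves, hot first."""
        for name in ("hot", "cold"):
            if key in self.tiers[name]:
                return name, self.tiers[name][key]
        return None


mem = ThreeTierMemory()
mem.put(Slot(1, 10, credibility=0.9, forget_index=0.1, payload="fresh fact"))
mem.put(Slot(2, 11, credibility=0.4, forget_index=0.3, payload="weak fact"))
mem.put(Slot(3, 12, credibility=0.9, forget_index=0.95, payload="stale fact"))
```

Lookups never touch the inactive shelf, which is one way to read "no need to scan a giant context every time".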

3) Parallelism and locality — Bernstein commutator + Hilbert / SpiralIndex

Semantic dependencies live on an operator ring; Bernstein commutators decide parallel feasibility mechanically. Maximal cliques are greedily covered into mutually-disjoint parallel groups. Physical slot layout uses the Hilbert curve and SpiralIndex (logarithmic spacing) to preserve spatial locality so nearby meanings land in nearby memory.
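As an illustration of the parallel-feasibility test, the sketch below uses the classic Bernstein conditions on read/write sets (two operations may run in parallel when neither writes what the other reads or writes) and a first-fit greedy grouping. The `Op` type and the slot names are hypothetical, and the first-fit pass is a simple stand-in for the clique cover described above, not the paper's operator-ring formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    reads: frozenset
    writes: frozenset


def commutes(a, b):
    """Bernstein conditions: no write/read, read/write or write/write overlap."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def greedy_parallel_groups(ops):
    """First-fit grouping: every group is pairwise-commuting and groups are disjoint."""
    groups = []
    for op in ops:
        for group in groups:
            if all(commutes(op, other) for other in group):
                group.append(op)
                break
        else:
            groups.append([op])
    return groups


ops = [
    Op("a", frozenset({"x"}), frozenset({"y"})),
    Op("b", frozenset({"x"}), frozenset({"z"})),
    Op("c", frozenset({"y"}), frozenset({"w"})),  # reads what a writes
    Op("d", frozenset({"z"}), frozenset({"y"})),  # writes what a writes
]
groups = greedy_parallel_groups(ops)
print([[op.name for op in g] for g in groups])  # [['a', 'b'], ['c'], ['d']]
```

The grouping only answers "which operations may share a parallel step"; ordering between groups still has to respect the dependencies.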

4) Standard-equipment WAL + cascade rollback + audit chain

Every operation is recorded in a WAL (write-ahead log); cascades of failure can be undone via cascade rollback that only propagates on the non-commuting side. The record carries a built-in SHA-256 audit chain for tamper detection. This is an AI safety layer AND a regulatory bit-exact audit substrate at the same time.
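The tamper-detection idea can be shown with a small hash chain. This is a sketch under assumptions: the record fields, the all-zero genesis value, and the JSON canonicalisation are invented for the example and are not taken from the Rust implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def entry_hash(prev_hash, record):
    """SHA-256 over the previous hash plus a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log, record):
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev_hash,
                "hash": entry_hash(prev_hash, record)})


def verify(log):
    """Return the index of the first broken entry, or None if the chain is intact."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        if entry["prev"] != prev_hash:
            return i
        if entry["hash"] != entry_hash(prev_hash, entry["record"]):
            return i
        prev_hash = entry["hash"]
    return None


wal = []
append(wal, {"op": "put", "slot": 1, "value": "a"})
append(wal, {"op": "put", "slot": 2, "value": "b"})
append(wal, {"op": "route", "mode": "mu"})
print(verify(wal))  # None: chain intact
```

Because each entry's hash covers the previous hash, editing any past record changes every later expected hash, so `verify` reports the first index where the stored and recomputed values diverge.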

(Full theory in paper v10. Implementation: Python v0.1 + Rust port + WASM. Patent claims 1-44 cover the architecture. See "Resources" below.)

Key specs

Binary size	272 KB Rust single WASM binary (plus a 2,210-line Python reference implementation)
Runtime	Browser / mobile / embedded / server: all run from the same binary with no separate server process required (deploy anywhere)
Speed	24× vs Python reference. When applied as the inference layer, LLM responses themselves get 5.8× faster (short path via D dominance)
Robustness	Zero data loss across 10,000-slot × 500-step stress (138 unit tests pass)
Integration	Orthogonal layer over existing systems (LLM / decision engine / business rules; AI or non-AI)
Audit / record	WAL + cascade rollback + SHA-256 audit chain — standard equipment
Family	Slime storage family / DEVICE primary
Delivery	WASM single-file distribution (evaluation) + individual engagement (production)
Paper / patent	Paper v10 (English, 33 pages) + jxiv Japanese v2 (15 pages) + patent claims 1-44

3 external benchmarks — the -20.4 ± 0.3 pt architectural constant

Validated not on self-made benchmarks but on 3 external public benchmarks × 3 seeds each = 6,870 trials. This section is measured with Qwen3:8b as the primary base model (cross-model results are in the next "4-LLM cross-validation" section, an independent experiment). The 3 benchmarks have independent question authoring and independent difficulty axes; baseline scores span approximately 6.6× (42.9 / 6.5, from 6.5% to 42.9%). Despite this, the incorrect-rate suppression converges to a tight constant: -20.4 ± 0.3 pt. The substance of this constant, however, is the abstention (not_attempted) rate that μ-mode sets by threshold (+21.8 to +32.0 pt depending on benchmark), which appears automatically under metrics that count abstention as "not incorrect." The threshold is benchmark-independent, so the constant is too. This is a fail-closed property ("abstain when uncertain"), not "more correct answers" (confirmed by independent reproduction, n=790×3).

Benchmark	Axis	Scale (n × seeds)	Baseline	Routed (RLM)	Δ incorrect	Δ F-score
SimpleQA (OpenAI)	T1: long-tail entity	500 × 3	F 6.5% σ=0.31	F 10.2% σ=0.47	-20.5 pt	+3.7 pt
TruthfulQA (Lin et al.)	T5+T6: misconception / trick	790 × 3 seeds (790 of standard 817, those that admit binary scoring)	Truth 9.9% σ=0.31	Truth 30.0% σ=0.10	-20.1 pt	+20.1 pt
HaluEval-QA (HotpotQA-derived)	T2+T6: false premise / multi-hop	1,000 × 3	F 42.9% σ=1.23	F 64.3% σ=0.26	-20.7 pt	+21.4 pt
3-bench combined (architectural constant)	T1 ↔ T5+T6 ↔ T2+T6 (full axis cover)	2,290 distinct Q × 3 seeds = 6,870 trials	baseline 6.5% → 42.9% (~6.6× spread)		-20.4 ± 0.3 pt ★	+3.7 to +21.4 pt
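The combined row can be reproduced from the three per-benchmark Δ incorrect values in the table. The short check below uses only those published figures and Python's standard library; the variable names are illustrative.

```python
import statistics

# Δ incorrect per benchmark, copied from the table above (in pt)
delta_incorrect = {"SimpleQA": -20.5, "TruthfulQA": -20.1, "HaluEval-QA": -20.7}
questions = {"SimpleQA": 500, "TruthfulQA": 790, "HaluEval-QA": 1000}
seeds = 3

values = list(delta_incorrect.values())
mean = statistics.mean(values)
sd = statistics.stdev(values)          # sample standard deviation (n - 1)
trials = sum(questions.values()) * seeds
spread = 42.9 / 6.5                    # highest / lowest baseline score

print(f"{mean:.1f} ± {sd:.1f} pt over {trials} trials, baseline spread {spread:.1f}x")
# -20.4 ± 0.3 pt over 6870 trials, baseline spread 6.6x
```

Note that the ± 0.3 is the sample standard deviation over three benchmark-level values, not a confidence interval over the 6,870 individual trials.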

Property A: variance absorption — scales with baseline noise

The variance tightening effect scales with baseline σ. On the quiet SimpleQA (baseline σ=0.31), routed σ=0.47 — slightly wider. By contrast, TruthfulQA (σ=0.31 → 0.10) is 3.1× tighter, and HaluEval-QA (σ=1.23 → 0.26) is up to 4.7× tighter. Dynamic strength scaling: the noisier the baseline, the stronger the cascade's variance tightening; on quiet baselines the effect is null or slightly wider — not a universal law but a noise-conditional property (by design).

"Is this just a mechanical refusal increase?" — rebuttal

Incorrect dropping uniformly by -20.4 pt across 3 benchmarks could be read as "μ routing simply raises refusal by a fixed amount". SimpleQA's breakdown argues against this reading: incorrect 86.6% → 66.1% (-20.5 pt), not_attempted 6.4% → 28.2% (+21.8 pt), correct roughly preserved (7.0% → 5.7%, -1.3 pt). On SimpleQA, μ tips into refusal almost only on originally-incorrect questions and leaves correct ones essentially alone, which is evidence that μ structurally distinguishes question difficulty and is consistent with the helpfulness 100% parity (40 Q non-trap) below. The separation is not uniform across benchmarks, however: on HaluEval-QA correct also falls (18.5% → 7.2%, -11.3 pt, more than halved), while TruthfulQA's primary T×I metric is +0.7 pt (essentially neutral). (Per-bench abstention rates for TruthfulQA / HaluEval-QA are in paper v10 §4; contact for access.)
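The breakdown can be checked arithmetically. Since incorrect, not_attempted and correct sum to 100%, the three deltas must sum to zero, which also lets the HaluEval-QA abstention change be inferred from the two figures quoted for it. The sketch uses only numbers stated in this section; the inferred value is a derivation, not a separately reported measurement.

```python
simpleqa = {
    "baseline": {"incorrect": 86.6, "not_attempted": 6.4, "correct": 7.0},
    "routed": {"incorrect": 66.1, "not_attempted": 28.2, "correct": 5.7},
}


def deltas(before, after):
    """Per-category change in percentage points, rounded to one decimal."""
    return {k: round(after[k] - before[k], 1) for k in before}


d = deltas(simpleqa["baseline"], simpleqa["routed"])
print(d)  # {'incorrect': -20.5, 'not_attempted': 21.8, 'correct': -1.3}

# The three categories partition every answer, so the deltas cancel out.
residual = round(sum(d.values()), 1)

# HaluEval-QA: incorrect -20.7 pt and correct -11.3 pt are quoted in the text;
# the abstention change is whatever makes the three deltas sum to zero.
halueval_abstention = round(-(-20.7 + -11.3), 1)
print(halueval_abstention)  # 32.0
```

The inferred +32.0 pt matches the upper end of the abstention range quoted at the top of the page, and it shows how much of the HaluEval-QA refusal increase comes out of previously correct answers.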

Helpfulness is not lost

The design reduces incorrect answers by increasing refusals, but on 40 pure helpfulness questions (non-trap) helpfulness is empirically at 100% parity. "Questions that can be answered correctly are still answered correctly; only the questions that would have triggered a lie tip into refusal." This is established as an architectural property.

4 LLM cross-validation — Tier-A 8B-class (Qwen3 / Llama, n=2) near the same 81% ceiling (preliminary — needs more models)

To show SlimeTree-RLM is not a Qwen3-only number, we ran cross-validation on 4 LLMs (8B / 7B / 4B class) under identical conditions. Every LLM shows hallucination suppression; in particular both Tier-A models (Qwen3 and Llama 3.1, 8B-class) land at 19% hallucination = 81% ceiling after routing.

LLM	Size	Baseline halluc	Routed halluc	Δ halluc	Δ Latency	Δ Tokens
Qwen3:8b	8B	63%	19%	-44 pt	-85.7%	-21.0%
Llama 3.1:8b	8B	51%	19%	-32 pt	-83.3%	-24.6%
Mistral 7B	7B	70%	51%	-19 pt	-74.8%	-1.9%
Gemma 3:4B	4B	79%	59%	-20 pt	-79.3%	-35.2%

★ Performance equalizer: Two 8B-class models that started 12 pt apart (Qwen3 63% vs Llama 3.1 51% hallucination) both end at 19% hallucination = 81% correct ceiling after RLM. In other words, within the same Tier, the choice of LLM stops mattering — from the AI-provider's view this is structural dissolution of LLM vendor lock-in.

Validated across languages too: Japanese +54 pt / English +24 pt / Arabic +7 pt hallucination improvement (paper v10, §3 multilingual matrix).

Enterprise effects — "LLM safety net" + "audit substrate" in one

What RLM solves in an enterprise spans the AI domain and the audit / governance domain simultaneously. Satisfying both with the same Rust single binary changes the cost structure.

Where it lands by industry

Banks / major insurers	Safety layer for business LLM deployments (structural suppression of regulated misstatements) + bit-exact audit (SHA-256 audit chain). Can run in parallel with COBOL retirement projects.
Central government / municipalities	Domestic supply + math-backed guarantees + air-gap audit — satisfies sole-source procurement conditions. Hallucination suppression for public-information Q&A (trust matters).
Healthcare / pharma	Suppression of hallucinations in EHR and clinical-support LLMs (cost of wrong answers is extreme). WAL + cascade rollback preserves clinical-log audit integrity.
Manufacturing / energy / telco	Semantic-driven constraints on operational control and approval engines (AI-independent). Server-less operation for embedded / edge.
SI vendors	Wrap a single layer on customers' existing LLMs (cloud or on-prem) to deliver "AI you can stand behind". Easy horizontal expansion.

Capability × industry value matrix

RLM delivers four values at once — (1) LLM safety net (hallucination suppression), (2) audit substrate (WAL / cascade / audit chain), (3) operational governance (semantic-driven constraints), (4) edge operation (WASM 272 KB). Providing all four from the same implementation simplifies vendor consolidation.

Typical deployment topologies

WASM host in front of cloud LLM: launch the WASM at the edge / gateway / API gateway; filter cloud LLM output semantically.
Co-located with on-prem / private LLM: link as a Rust library inside the same process — no sidecar.
Browser / mobile offline: fully offline operation on factory terminals, in-vehicle, store POS. Same WASM.
Non-AI operational control: don't call an LLM at all — run RLM alone as a semantic-driven constraint / record layer (audit / decision).

For AI providers — "weights untouched", "no API", "Tier-crossing"

RLM lets you ship your end customers "a layer that delivers -20 pt on incorrect-rate across benchmarks without touching a single bit of your LLM". The structural properties below differentiate it from existing safety layers.

(a) The model's weights are not touched

RLHF / Constitutional AI / o1 reasoning all retrain or alter the internal reasoning of the model. RLM is an external layer, so existing model deltas, contracts, and SLAs ship unchanged. Zero retraining cost.

(b) No API either — bind as a binary layer

WASM 272 KB / Rust library inserted at the function-call boundary of the inference pipeline. No HTTP API and no separate process; net response time improves by 5.8× (short path via D dominance).

(c) Equalize Tier-A models to the same ceiling

Qwen3:8b and Llama 3.1:8b start 12 pt apart on baseline, but post-RLM both sit at 19% hallucination = 81% correct (see above). From the provider's view, "swapping the base model preserves the same performance guarantee", which is directly relevant to model-selection freedom and long-term maintenance. RLM's mechanism differs from Constitutional AI / o1 reasoning (which improve accuracy by training reasoning): RLM tips uncertain mis-generation into refusal (fail-closed), so its −20.4 ± 0.3 pt is wrong answers re-routed to abstention, not an accuracy gain. The magnitudes are therefore not directly comparable; combined with the same-ceiling property, the numbers function as sales evidence on the provider's side.

(d) Helpfulness at 100% parity — do not break UX

The typical risk — "more refusals means less useful" — is empirically refuted: helpfulness measured at 100% parity across 40 pure helpfulness questions. "Answerable questions still answered, fabrication-prone questions tip into refusal" is architectural. Drop the hallucination KPI without dropping the Helpful-AI KPI.

(e) Audit log is standard equipment

The same layer carries WAL + cascade rollback + SHA-256 audit chain, so "explainability / audit requirements / regulatory compliance" need not be a separate layer. This reduces deployment friction dramatically for finance / public / healthcare SaaS.

Collaboration formats — WASM single-binary licensing, source-supplied contracts, joint benchmark runs (re-runnable on the customer's LLM), joint papers / press — can be designed individually for AI providers.

How to use — from evaluation to production

Evaluation (browser / WASM): load the 272 KB WASM single file in one HTML page; wrap existing LLM output in your in-house environment and confirm the hallucination reduction. No server.
PoC (representative benchmark reproduction): run a 100 Q trap suite on your LLM (8B-class or larger recommended) under conditions equivalent to SimpleQA / TruthfulQA / HaluEval-QA. 3-5 business days.
Production integration: link as Rust library / WASM into the inference pipeline, or front it at the API Gateway / Edge. Wire WAL and the audit chain into your operational log systems.
Audit / regulatory compliance: incorporate cascade rollback and the SHA-256 audit chain into operational audit trails (banking / healthcare / public sector).
Operations: exploit the same-ceiling property to make LLM generation changes non-disruptive.

Recommended environment: any WebAssembly-capable runtime for evaluation (browser / wasmtime / wasmer / Wasmer Edge). Production allows native Rust linkage in server / edge / embedded. LLMs at 8B-class and above show the same-ceiling effect; 7B / 4B classes still show positive improvement.

Validated results

-20.4 ± 0.3 pt architectural constant (SimpleQA / TruthfulQA / HaluEval-QA, 3 external benchmarks × 3 seeds = 6,870 trials, baseline spans ~6.6×)
Tier-A 8B-class converges to 81% correct ceiling (performance equalizer — the 12 pt baseline gap between Qwen3 and Llama 3.1 dissolves)
Property A: variance absorption — routed σ is up to 4.7× tighter than baseline σ
Helpfulness: 100% parity on 40 pure helpfulness questions
Multilingual: Japanese +54 / English +24 / Arabic +7 hallucination improvement
5.8× faster responses (short path via D dominance, cache=200)
24× faster than the Python reference (Rust port)
Zero data loss under 10,000-slot × 500-step stress (138 unit tests pass)
Falls within the 10-25 pt range cited for competing LLM-control techniques (Anthropic Constitutional AI / OpenAI o1 reasoning), but as an architectural constant across 3 external benchmarks and by a different mechanism (re-routing to abstention rather than an accuracy gain); the differentiator is the benchmark-agnostic property, not magnitude

Where it fits in production

Safety net for business-LLM deployments: in domains where wrong answers are expensive (government, finance, healthcare), suppress hallucinations without changing the underlying LLM
Audit and tamper detection: record business events with semantic constraints; preserve integrity for after-the-fact audit
Explainable business rules and approvals: layer semantic constraints over rule definitions so decision rationale is reproducible
Edge and embedded: 272 KB WASM, no server, offline-capable — ready for factory, in-vehicle, and store-terminal deployment
AI provider differentiation: ship "the layer that delivers -20 pt" as an attach to your existing LLM offering

★ Local LM (on-prem GPU) deployment — run 12B models on RTX 5060 Ti class hardware

SlimeTree-RLM's R-meta verdict evaluates cloud LLM and local LLM outputs through the same interface. That means a Gemma 4 12B class model on in-house GPU can sit under the SlimeTree-RLM quality gate, and the enterprise runs with cloud billing essentially at zero.

In-house measurement (2026-06-05, RTX 5060 Ti)

Setup: NVIDIA GeForce RTX 5060 Ti (16 GB) / CUDA 13.1 / ollama 0.30.5 (gemma4 architecture native) / WSL2 Ubuntu / SlimeTree-RLM R-meta verdict integration.

Metric	gemma3:12b	gemma4:12b Q4_K_M	gemma4:12b Q8_0
Decode speed	46.3 tok/s	43.5 tok/s	27.6 tok/s
Peak VRAM	9.7 GB	8.6 GB	13.7 GB
SlimeTree-RLM judge p99 latency	~100 µs	~100 µs	~100 µs
Sufficient rate (n=50)	49/50	47/50	47/50

gemma4:12b Q4_K_M is the production default candidate (best speed / VRAM / quality balance). Q8_0 carries 1.58× the runtime (43.5 / 27.6 tok/s) and 1.59× the VRAM (13.7 / 8.6 GB) with no measurable quality lift on the same sample.

4 viable patterns for enterprise Local LM migration

Pattern	Scenario	SlimeTree-RLM role
A. Compliance-bound	Healthcare / legal / finance / defence — cloud LLMs blocked by regulation. 47/50 sufficient is a viable first-draft + human-review baseline.	SHA-256 audit chain satisfies audit requirements; R-meta verdict provides the explainability layer.
B. High-volume routine	10M+ tokens/month routine (classification / summarisation / drafting / RAG). One RTX 5060 Ti sustains 3.6M tok/day; capex recovers in ~3 months.	Existing 60-80% D/µ reduction is preserved; the R verdict gate runs at µs scale (~100 µs p99), so escalation adds negligible overhead.
C. Narrow-domain specialist	Tax Q&A, manufacturing SOP, internal policy lookup, healthcare billing rules. LoRA fine-tuning lifts the base to frontier-general parity inside the domain.	D/µ/R three-layer gate suppresses the residual hallucination ridge after LoRA; the -20 pt architectural constant holds inside the specialist domain too.
D. Hybrid (the headline)	90-95% local + 5-10% cloud frontier escalation. Frontier-class quality at 1/10 - 1/20 of the bill.	R-meta verdict is the routing decision itself. Pass = local; insufficient = escalate, decided in µs.

Pattern B extended — 4-tier escalation

Insert Tier 0 = on-prem GPU Local LM below the existing 3-tier cloud escalation (Flash / Pro / frontier) to drive the per-token rate effectively to zero.

tier	LLM	Token rate	SlimeTree-RLM verdict behaviour
Tier 0 (new)	Local LM (Gemma 4 12B Q4_K_M etc.)	¥0 / 1M tok (electricity only)	D/µ-processed R prompts answered locally; cloud billing skipped
Tier 1	Gemini Flash (existing)	~¥30 / 1M tok	Local verdict insufficient → Flash
Tier 2	Gemini Pro / Claude Sonnet	~¥500 / 1M tok	Flash insufficient → escalate to Pro
Tier 3	Claude Opus / GPT-5	~¥5,000-15,000 / 1M tok	Frontier reasoning needed → final escalate to Opus
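The escalation in the table can be sketched as a loop that tries tiers from cheapest to most expensive and stops at the first answer the verdict accepts. Everything here is a stand-in: `is_sufficient` plays the role of the R-meta verdict, and the tier callables are stubs rather than real model clients.

```python
def escalate(prompt, tiers, is_sufficient):
    """Return (tier_name, answer) from the cheapest tier whose answer passes.

    `tiers` is an ordered list of (name, generate) pairs, cheapest first.
    If no tier passes, the last tier's answer is returned as final.
    """
    name, answer = None, ""
    for name, generate in tiers:
        answer = generate(prompt)
        if is_sufficient(prompt, answer):
            break
    return name, answer


# Hypothetical stubs: the local model gives up on long prompts.
def local_lm(prompt):
    return "local answer" if len(prompt) < 40 else ""


def flash(prompt):
    return "flash answer"


def is_sufficient(prompt, answer):
    return bool(answer)  # placeholder verdict: any non-empty answer passes


tiers = [("tier0-local", local_lm), ("tier1-flash", flash)]
print(escalate("short question", tiers, is_sufficient))
print(escalate("a much longer question that the local model cannot handle",
               tiers, is_sufficient))
```

Cloud billing is only incurred when the loop moves past Tier 0, which is why the verdict's latency matters more than its position in the pipeline.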

Reduction: D/µ cut 60-80% of traffic up front; of the remaining R-fraction, tiers 0/1 absorb 70-95%. Frontier billing (tier 3) is 3-10% of actual traffic. A ¥1M / month cloud LLM spend ends up at ¥30-100k.

Frontier residual (honest)

A 12B-class Local LM does NOT replace frontier (Claude Opus / GPT-5 / Gemini Pro) in:

Multi-step agentic reasoning, novel coding problems, complex mathematical proofs — frontier wins.
100k+ token long-context understanding — advertised but practical quality declines.
Dialogue UX requiring sub-2s responses (12B takes 12-19s/response in our setup, best for batch / async).
Deep nuance in non-English languages (1-2 years to catch up with frontier).

Leave the above to cloud frontier; absorb everything else locally. SlimeTree-RLM verdict is the µs router that decides which to send where.

/integrations/#multi-agent — Local LM extension inside the multi-agent framework (technical detail)
/service/ai/ — Local LM migration / on-prem AI deployment service

Resources / citations

Paper / patent

Paper (published on Zenodo, CC-BY 4.0): "SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; 2026-01-14).
DOI: 10.5281/zenodo.18238339 — direct PDF (968.7 KB) / Zenodo record.
Japanese version v2 (jxiv, 15 pages, ~24,685 chars; submission in preparation).
Patent claims 1-44: two-tier memory (Hot/Cold shelves), 3-mode router, Bernstein-commutator parallelism, Hilbert / SpiralIndex, WAL + cascade rollback, (SemanticTime, SensoryTime) tuple, credibility / forget_index, etc.

Benchmark data

SimpleQA (OpenAI, 500 Q × 3 seeds = 1,500 trials) — incorrect -20.5 pt, F +3.7 pt.
TruthfulQA (Lin et al., 790 Q × 3 seeds = 2,370 trials) — incorrect -20.1 pt, Truth +20.1 pt.
HaluEval-QA (HotpotQA-derived, 1,000 Q × 3 seeds = 3,000 trials) — incorrect -20.7 pt, F +21.4 pt.
3-benchmark combined: 6,870 trials, -20.4 ± 0.3 pt architectural constant.
4 LLM cross-validation: Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B.

Implementations

Python reference (impl/): 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo.
Improved version (impl_v2/): Phase A → B subtype-aware routing complete; 81.3% (σ=4%) at cache=200.
Rust port + WASM: 272 KB, 24× vs Python, 138 unit tests, 10K-slot × 500-step stress.

Related — blog / news / products

Deep-dive blog: Just 272 KB to re-route LLM wrong answers to abstention (fail-closed) — SlimeTree-RLM (Japanese, 7 chapters)
Applied blog (Phase D corpus + LoRA + vLLM): Gemma 4 12B on RTX 5060 Ti, 1000 prompts — enterprise Local AI: where it stands (9 chapters, LoRA + vLLM addendum, with Errata)
News: research releases and announcements (2026-05-23 Rust/WASM port + LLM application release, and more)
Same family, simple-record-system variant: SlimeTree-VSAM
Category: DEVICE products

Get it / contact

For WASM single-binary evaluation, PoC (reproduction on your LLM), AI-provider partnership, SI vendor / reseller programme, and paper / patent material requests:

Contact Partners