★ DEVICE primary ★ AI application

SlimeTree-RLM — Measurement procedure and primary materials

For evaluators and procurement. The procedure, rubrics, and LLM configurations behind the -20.4 ± 0.3 pt architectural constant measured across 3 external benchmarks × 3 seeds = 6,870 trials, the 4-LLM cross-validation conditions, paper v10, and access information for patent claims 1-44.

For product overview and deployment scenarios, see the product page (/products/device/slimetree-rlm/). This page focuses on primary materials for reproduction and verification.

🎛 AI GATE This page, at your resolution.

Suppresses LLM hallucinations (plausible lies) without changing any weights. Measured a stable −20.4 ± 0.3 pt improvement across 3 external benchmarks × 3 seeds = 6,870 trials. A "performance equalizer" where 8B-class converges to an 81% ceiling across 4 LLMs. Procedure, rubric and seeds are all public.

📋 "Ask your AI at this level" copies this page's explanation with an instruction matched to the level you picked. Paste it into your own AI (Claude · GPT · Gemini · Grok) to dig deeper at that resolution.

1. Evaluation data and external benchmarks Open

We measured against external public benchmarks, not self-made ones. The author, difficulty axis, scale, and scoring metric of each benchmark are listed below. All conditions needed for reproduction are public — a same-condition PoC on your LLM can be set up in 3-5 business days.

Benchmark	Author / origin	Axis (paper §3.5)	Scale	Scoring metric	Result (RLM effect)
SimpleQA	OpenAI	T1: long-tail entity	500 Q × 3 seeds = 1,500 trials	F-score (correct / attempted), SimpleQA paper preferred metric	incorrect -20.5 pt, F +3.7 pt
TruthfulQA	Lin et al. 2022	T5+T6: misconception / trick	790 Q × 3 seeds = 2,370 trials (790 of standard 817, those that admit binary scoring)	Truth metric, Llama-3 judge / NLI-equivalent	incorrect -20.1 pt, Truth +20.1 pt
HaluEval-QA	HotpotQA-derived (THUDM)	T2+T6: false premise / multi-hop	1,000 Q × 3 seeds = 3,000 trials	binary correctness on (Question, hallucinated_answer)	incorrect -20.7 pt, F +21.4 pt
3-bench combined	3 independent question sources	T1 ↔ T5+T6 ↔ T2+T6 (full axis cover)	6,870 trials (2,290 distinct Q × 3 seeds)	seed-mean ± SD of incorrect-rate Δ	-20.4 ± 0.3 pt ★

1.1 Reproduction conditions — LLM, temperature, seed, cache

LLM	Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B (via Ollama). Primary benchmarks in this table run on Qwen3:8b; 4-LLM cross-validation in §2
temperature	baseline 0.7, R-mode 0.4 (impl_v2 Phase B, suppresses fabrication randomness)
seeds	3 seeds fixed (23, 47, 89) for reproducibility
cache	200 (absorbs decoding noise)
Scoring	SimpleQA: OpenAI preferred F-score (refusal-when-uncertain rewarded); TruthfulQA: Truth metric; HaluEval: binary correctness. Reference rubrics are kept unchanged for all 3 benchmarks
variance metric	seed-to-seed σ (standard deviation of per-seed Δ). Enables measuring Property A variance absorption
Typical run time	HaluEval 6,000 LLM calls ≈ 22.5 min (same-host Ollama, reference value for an 8B-class model)

Companion observation (Property A — variance absorption): the variance-tightening effect scales with baseline σ. On the quiet SimpleQA (σ=0.31), routed σ=0.47 — slightly wider. TruthfulQA (σ=0.31 → 0.10) is 3.1× tighter; HaluEval-QA (σ=1.23 → 0.26) is up to 4.7× tighter. Dynamic strength scaling: the noisier the baseline, the stronger the cascade's tightening; on quiet baselines the effect is null or slightly wider (a noise-conditional property by design, not a universal law).

1-2. Local LM applicability benchmark — n=1000 in-house corpus, 10 domains Published 2026-06-05

In June 2026, on our RTX 5060 Ti we ran Gemma 4 12B Q4_K_M (ollama 0.30.5) against 1,000 prompts × 10 domains, then judged every output through SlimeTree-RLM v3.113 R-meta verdict. This is the per-domain baseline for running a Local LM under the SlimeTree-RLM quality gate (95% confidence interval ±10% per domain).

Domain	n	"sufficient" quality	Canonical disclaimer detection (±95% CI)	Hallucination signal rate	Mean score
medical_factual	100	99/100	1.0% (±2.0%)	0%	0.797
medical_advisory	100	100/100	12.0% (±6.4%)	0%	0.800
legal_factual	100	94/100	8.0% (±5.3%)	0%	0.764
legal_advisory	100	93/100	27.0% (±8.7%)	0%	0.762
finance_factual	100	100/100	9.0% (±5.6%)	0%	0.800
finance_advisory	100	100/100	10.0% (±5.9%)	0%	0.800
code_factual	100	98/100	7.0% (±5.0%)	0%	0.793
business_advisory	100	100/100	8.0% (±5.3%)	0%	0.800
educational_factual	100	100/100	6.0% (±4.7%)	0%	0.800
japanese_business (advisory-leaning)	100	77/100 ★	2.0% (±2.7%) ★	4.0% ★	0.725
10-domain total	1,000	961/1000 (96.1%)	9.0% (overall)	0.4% (overall)	0.789

Key findings

Overall Local LM quality is at first-draft business grade: 96.1% sufficient (±1.2% CI). Not a frontier-cloud replacement — suitable for review-attached workflows under SlimeTree-RLM verdict.
Advisory-domain canonical disclaimer detection stays at 8-27% (legal / medical / finance). Gemma 4 12B does emit disclaimer-intent text, but the phrasing does not match the canonical patterns D/µ/R checks against.
japanese_business is a triple outlier: 77/100 sufficient + 2% disclaimer + the only domain with hallucination signals (4%). The R-meta verdict's canonical patterns assume English, so a Japanese-specific calibration is needed.

Judging at µs scale

Total time to judge 1,000 records	1.07 seconds (Hyperscan + LRU memoization, cold compile included)
Judge p50 latency	67.7 µs
Judge p99 latency	101.6 µs (SLO 200 µs ✓)
Judge p99.9 latency	163.8 µs
Judge max latency	519.3 µs (cold-compile cost, first call only)
vs. cloud LLM-as-judge	Frontier LLM judge calls take 1-3 s/record; SlimeTree-RLM ~100 µs/record = 10,000-30,000× faster

Reproduction conditions

Local LM	gemma4:12b Q4_K_M (ollama 0.30.5 with native gemma4 architecture support)
Hardware	NVIDIA GeForce RTX 5060 Ti (16 GB) / CUDA 13.1 / WSL2 Ubuntu
Generation	/api/chat, think:false, num_predict=512, temperature=0.7
Generation time	3 hours for 1,000 prompts (avg 5.5 prompts/min, ~46 tok/s sustained)
VRAM	8.7 GB sustained (well within 16 GB)
Corpus design	10 domains × 100 prompts (330 seeds + 670 template expansions), deterministic builder
Judge layer	SlimeTree-RLM R-meta verdict v3.113 (Hyperscan + memoization stacked, Phase B 121-version lineage)

How to read this: §1-2 measures "what Gemma 4 12B can do under SlimeTree-RLM" — an applicability benchmark, independent from the -20.4 ± 0.3 pt architectural constant in §1. The two together: §1-2 tells you whether to deploy Local LM and which LoRA corrections it needs; §1 tells you the architectural effect SlimeTree-RLM has on the LLM's incorrect rate. Both apply simultaneously when SlimeTree-RLM sits on top of a Local LM.

Full data is stored on our internal D drive under Phase D v0.2 corpus (2026-06-05). Available for PoC sharing on request (corpus prompts MIT, Gemma outputs covered by Gemma Terms of Use).

2. 4-LLM cross-validation Open

To show this is not a Qwen3-only number, 4 LLMs were re-run under identical conditions: 100 traps × cache=200 × seed=23, baseline vs routed.

LLM	Size	Baseline halluc	Routed halluc	Δ halluc	Δ Latency	Routes (D/μ/R)
Qwen3:8b	8B	63%	19%	-44 pt	-85.7%	51/46/3
Llama 3.1:8b	8B	51%	19%	-32 pt	-83.3%	51/46/3
Mistral 7B	7B	70%	51%	-19 pt	-74.8%	51/45/4
Gemma 3:4B	4B	79%	59%	-20 pt	-79.3%	51/46/3

★ Performance equalizer: Both Tier-A 8B-class LLMs (Qwen3 and Llama 3.1) land at 19% hallucination = 81% correct ceiling after routing. Within the same Tier, the choice of LLM stops mattering. Multilingual: Japanese +54 pt / English +24 pt / Arabic +7 pt (paper v10 §3 multilingual matrix).

3. Paper Published on Zenodo (CC-BY 4.0)

Paper (English, Zenodo)	"SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; published 2026-01-14; CC-BY 4.0). DOI: 10.5281/zenodo.18238339 Direct PDF: slimetree_rlm_paper_final_en.pdf (968.7 KB) Zenodo record: zenodo.org/records/18238339
Citation	`Sasaki, H. (2026). SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference. Zenodo. https://doi.org/10.5281/zenodo.18238339`
Japanese version v2	jxiv submission preparing, 15 pages / ~24,685 chars / 221 KB.
Target venues	EMNLP / MLSys / VLDB / AMIA / EACL / AAAI / NeurIPS (experimental-rigor requirements cleared).
Further inquiries	Contact us (affiliation-/use-case-specific supplements, PoC re-runs, etc.)

4. Patent Coverage public only — text NDA-gated

The architecture of SlimeTree-RLM is covered by patent claims 1-44. Only the coverage area is public:

(SemanticTime, SensoryTime) tuple, credibility / forget_index (claims 1, 17, 25)
Hot Shelf (Treap) + Cold Shelf (RB-Tree) (claims 2, 7, 8)
Branch-free 3-mode router, failure signal + w·exp(-η·regret), Adaptive η (claims 16, 38-42)
SAS semantic-area sampling, SpiralIndex + LazySpiralUpdate (claims 2-4, 8)
Operator ring + Bernstein commutator, Kosaraju SCC (claims 5, 11, 30-31)
Bron-Kerbosch + greedy mutually-disjoint clique cover (claims 6, 34)
Hilbert-curve index (claim 9)
WAL + cascade rollback (non-commuting-side propagation only) (claims 21, 35-37)
P_split / merge / freeze + fixed-point (claim 43)
WASM + SharedArrayBuffer + Atomics (claim 12), SlotAdapterAPI (claim 13), MetaGeneSlot GDPR/HIPAA (claim 14), Redlock distributed mutex (claim 16), LLVM Function Pass (claims 30-34), RocksDB/Redis backends (claim 19)

Text access via contact → after NDA.

5. Implementations (code) Distribution preparing

Python reference implementation	`impl/` v0.1: 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo. README maps paper §x and patent claim N.
Improved implementation	`impl_v2/`: Phase A (subtype bias trial) → Phase B (R-prompt softening + bias inversion + strict grader), reaching 81.3% (σ=4%) at cache=200.
Rust port + WASM	272 KB single binary, 24× vs Python, 138 unit tests, zero data loss under 10,000 slot × 500 step stress. WASM evaluation copies distributed individually.
Bench harness	Same-condition replay scripts for SimpleQA / TruthfulQA / HaluEval-QA, including 4-LLM Ollama connector examples.

Distribution forms (evaluation licence / joint PoC / sponsored development / OEM integration) via contact or the partners page.

6. Related

Product page: SlimeTree-RLM — product detail (deployment scenarios, enterprise / AI provider angles)
Deep-dive blog: Just 272 KB to re-route LLM wrong answers to abstention (fail-closed) — SlimeTree-RLM (Japanese, 7 chapters)
Applied blog (Phase D corpus + LoRA + vLLM): Gemma 4 12B on RTX 5060 Ti, 1000 prompts — enterprise Local AI: where it stands (9 chapters, LoRA + vLLM addendum, with Errata)
Related news: research releases and announcements
Same family, simple-record-system variant: SlimeTree-VSAM + deep-dive blog
Category: DEVICE products / Resource home

Contact Partners