★ DEVICE primary ★ AI application

SlimeTree-RLM — Measurement procedure and primary materials

For evaluators and procurement. The procedure, rubrics, and LLM configurations behind the -20.4 ± 0.3 pt architectural constant measured across 3 external benchmarks × 3 seeds = 6,870 trials, the 4-LLM cross-validation conditions, paper v10, and access information for patent claims 1-44.

For product overview and deployment scenarios, see the product page (/products/device/slimetree-rlm/). This page focuses on primary materials for reproduction and verification.

🎛 AI GATE This page, at your resolution.

Suppresses LLM hallucinations (plausible lies) without changing any weights. Measured a stable −20.4 ± 0.3 pt improvement across 3 external benchmarks × 3 seeds = 6,870 trials. A "performance equalizer" where 8B-class converges to an 81% ceiling across 4 LLMs. Procedure, rubric and seeds are all public.

📋 "Ask your AI at this level" copies this page's explanation with an instruction matched to the level you picked. Paste it into your own AI (Claude · GPT · Gemini · Grok) to dig deeper at that resolution.

1. Evaluation data and external benchmarks Open

We measured against external public benchmarks, not self-made ones. The author, difficulty axis, scale, and scoring metric of each benchmark are listed below. All conditions needed for reproduction are public — a same-condition PoC on your LLM can be set up in 3-5 business days.

BenchmarkAuthor / originAxis (paper §3.5)ScaleScoring metricResult (RLM effect)
SimpleQA OpenAI T1: long-tail entity 500 Q × 3 seeds = 1,500 trials F-score (correct / attempted), SimpleQA paper preferred metric incorrect -20.5 pt, F +3.7 pt
TruthfulQA Lin et al. 2022 T5+T6: misconception / trick 790 Q × 3 seeds = 2,370 trials
(790 of standard 817, those that admit binary scoring)
Truth metric, Llama-3 judge / NLI-equivalent incorrect -20.1 pt, Truth +20.1 pt
HaluEval-QA HotpotQA-derived (THUDM) T2+T6: false premise / multi-hop 1,000 Q × 3 seeds = 3,000 trials binary correctness on (Question, hallucinated_answer) incorrect -20.7 pt, F +21.4 pt
3-bench combined 3 independent question sources T1 ↔ T5+T6 ↔ T2+T6 (full axis cover) 6,870 trials (2,290 distinct Q × 3 seeds) seed-mean ± SD of incorrect-rate Δ -20.4 ± 0.3 pt ★

1.1 Reproduction conditions — LLM, temperature, seed, cache

LLMQwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B (via Ollama). Primary benchmarks in this table run on Qwen3:8b; 4-LLM cross-validation in §2
temperaturebaseline 0.7, R-mode 0.4 (impl_v2 Phase B, suppresses fabrication randomness)
seeds3 seeds fixed (23, 47, 89) for reproducibility
cache200 (absorbs decoding noise)
ScoringSimpleQA: OpenAI preferred F-score (refusal-when-uncertain rewarded); TruthfulQA: Truth metric; HaluEval: binary correctness. Reference rubrics are kept unchanged for all 3 benchmarks
variance metricseed-to-seed σ (standard deviation of per-seed Δ). Enables measuring Property A variance absorption
Typical run timeHaluEval 6,000 LLM calls ≈ 22.5 min (same-host Ollama, reference value for an 8B-class model)
Companion observation (Property A — variance absorption): the variance-tightening effect scales with baseline σ. On the quiet SimpleQA (σ=0.31), routed σ=0.47 — slightly wider. TruthfulQA (σ=0.31 → 0.10) is 3.1× tighter; HaluEval-QA (σ=1.23 → 0.26) is up to 4.7× tighter. Dynamic strength scaling: the noisier the baseline, the stronger the cascade's tightening; on quiet baselines the effect is null or slightly wider (a noise-conditional property by design, not a universal law).

1-2. Local LM applicability benchmark — n=1000 in-house corpus, 10 domains Published 2026-06-05

In June 2026, on our RTX 5060 Ti we ran Gemma 4 12B Q4_K_M (ollama 0.30.5) against 1,000 prompts × 10 domains, then judged every output through SlimeTree-RLM v3.113 R-meta verdict. This is the per-domain baseline for running a Local LM under the SlimeTree-RLM quality gate (95% confidence interval ±10% per domain).

Domainn"sufficient" qualityCanonical disclaimer detection (±95% CI)Hallucination signal rateMean score
medical_factual10099/1001.0% (±2.0%)0%0.797
medical_advisory100100/10012.0% (±6.4%)0%0.800
legal_factual10094/1008.0% (±5.3%)0%0.764
legal_advisory10093/10027.0% (±8.7%)0%0.762
finance_factual100100/1009.0% (±5.6%)0%0.800
finance_advisory100100/10010.0% (±5.9%)0%0.800
code_factual10098/1007.0% (±5.0%)0%0.793
business_advisory100100/1008.0% (±5.3%)0%0.800
educational_factual100100/1006.0% (±4.7%)0%0.800
japanese_business (advisory-leaning)10077/100 ★2.0% (±2.7%) ★4.0% ★0.725
10-domain total1,000961/1000 (96.1%)9.0% (overall)0.4% (overall)0.789

Key findings

  • Overall Local LM quality is at first-draft business grade: 96.1% sufficient (±1.2% CI). Not a frontier-cloud replacement — suitable for review-attached workflows under SlimeTree-RLM verdict.
  • Advisory-domain canonical disclaimer detection stays at 8-27% (legal / medical / finance). Gemma 4 12B does emit disclaimer-intent text, but the phrasing does not match the canonical patterns D/µ/R checks against.
  • japanese_business is a triple outlier: 77/100 sufficient + 2% disclaimer + the only domain with hallucination signals (4%). The R-meta verdict's canonical patterns assume English, so a Japanese-specific calibration is needed.

Judging at µs scale

Total time to judge 1,000 records1.07 seconds (Hyperscan + LRU memoization, cold compile included)
Judge p50 latency67.7 µs
Judge p99 latency101.6 µs (SLO 200 µs ✓)
Judge p99.9 latency163.8 µs
Judge max latency519.3 µs (cold-compile cost, first call only)
vs. cloud LLM-as-judgeFrontier LLM judge calls take 1-3 s/record; SlimeTree-RLM ~100 µs/record = 10,000-30,000× faster

Reproduction conditions

Local LMgemma4:12b Q4_K_M (ollama 0.30.5 with native gemma4 architecture support)
HardwareNVIDIA GeForce RTX 5060 Ti (16 GB) / CUDA 13.1 / WSL2 Ubuntu
Generation/api/chat, think:false, num_predict=512, temperature=0.7
Generation time3 hours for 1,000 prompts (avg 5.5 prompts/min, ~46 tok/s sustained)
VRAM8.7 GB sustained (well within 16 GB)
Corpus design10 domains × 100 prompts (330 seeds + 670 template expansions), deterministic builder
Judge layerSlimeTree-RLM R-meta verdict v3.113 (Hyperscan + memoization stacked, Phase B 121-version lineage)
How to read this: §1-2 measures "what Gemma 4 12B can do under SlimeTree-RLM" — an applicability benchmark, independent from the -20.4 ± 0.3 pt architectural constant in §1. The two together: §1-2 tells you whether to deploy Local LM and which LoRA corrections it needs; §1 tells you the architectural effect SlimeTree-RLM has on the LLM's incorrect rate. Both apply simultaneously when SlimeTree-RLM sits on top of a Local LM.

Full data is stored on our internal D drive under Phase D v0.2 corpus (2026-06-05). Available for PoC sharing on request (corpus prompts MIT, Gemma outputs covered by Gemma Terms of Use).

2. 4-LLM cross-validation Open

To show this is not a Qwen3-only number, 4 LLMs were re-run under identical conditions: 100 traps × cache=200 × seed=23, baseline vs routed.

LLMSizeBaseline hallucRouted hallucΔ hallucΔ LatencyRoutes (D/μ/R)
Qwen3:8b8B63%19%-44 pt-85.7%51/46/3
Llama 3.1:8b8B51%19%-32 pt-83.3%51/46/3
Mistral 7B7B70%51%-19 pt-74.8%51/45/4
Gemma 3:4B4B79%59%-20 pt-79.3%51/46/3

★ Performance equalizer: Both Tier-A 8B-class LLMs (Qwen3 and Llama 3.1) land at 19% hallucination = 81% correct ceiling after routing. Within the same Tier, the choice of LLM stops mattering. Multilingual: Japanese +54 pt / English +24 pt / Arabic +7 pt (paper v10 §3 multilingual matrix).

3. Paper Published on Zenodo (CC-BY 4.0)

Paper (English, Zenodo)"SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; published 2026-01-14; CC-BY 4.0).
DOI: 10.5281/zenodo.18238339
Direct PDF: slimetree_rlm_paper_final_en.pdf (968.7 KB)
Zenodo record: zenodo.org/records/18238339
CitationSasaki, H. (2026). SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference. Zenodo. https://doi.org/10.5281/zenodo.18238339
Japanese version v2jxiv submission preparing, 15 pages / ~24,685 chars / 221 KB.
Target venuesEMNLP / MLSys / VLDB / AMIA / EACL / AAAI / NeurIPS (experimental-rigor requirements cleared).
Further inquiriesContact us (affiliation-/use-case-specific supplements, PoC re-runs, etc.)

4. Patent Coverage public only — text NDA-gated

The architecture of SlimeTree-RLM is covered by patent claims 1-44. Only the coverage area is public:

  • (SemanticTime, SensoryTime) tuple, credibility / forget_index (claims 1, 17, 25)
  • Hot Shelf (Treap) + Cold Shelf (RB-Tree) (claims 2, 7, 8)
  • Branch-free 3-mode router, failure signal + w·exp(-η·regret), Adaptive η (claims 16, 38-42)
  • SAS semantic-area sampling, SpiralIndex + LazySpiralUpdate (claims 2-4, 8)
  • Operator ring + Bernstein commutator, Kosaraju SCC (claims 5, 11, 30-31)
  • Bron-Kerbosch + greedy mutually-disjoint clique cover (claims 6, 34)
  • Hilbert-curve index (claim 9)
  • WAL + cascade rollback (non-commuting-side propagation only) (claims 21, 35-37)
  • P_split / merge / freeze + fixed-point (claim 43)
  • WASM + SharedArrayBuffer + Atomics (claim 12), SlotAdapterAPI (claim 13), MetaGeneSlot GDPR/HIPAA (claim 14), Redlock distributed mutex (claim 16), LLVM Function Pass (claims 30-34), RocksDB/Redis backends (claim 19)

Text access via contact → after NDA.

5. Implementations (code) Distribution preparing

Python reference implementationimpl/ v0.1: 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo. README maps paper §x and patent claim N.
Improved implementationimpl_v2/: Phase A (subtype bias trial) → Phase B (R-prompt softening + bias inversion + strict grader), reaching 81.3% (σ=4%) at cache=200.
Rust port + WASM272 KB single binary, 24× vs Python, 138 unit tests, zero data loss under 10,000 slot × 500 step stress. WASM evaluation copies distributed individually.
Bench harnessSame-condition replay scripts for SimpleQA / TruthfulQA / HaluEval-QA, including 4-LLM Ollama connector examples.

Distribution forms (evaluation licence / joint PoC / sponsored development / OEM integration) via contact or the partners page.