
SlimeTree-RLM — Measurement procedure and primary materials

This page is for evaluators and procurement teams. It documents the procedure, rubrics, and LLM configurations behind the -20.4 ± 0.3 pt change in incorrect-answer rate (the "architectural constant"), measured over 6,870 trials (3 external benchmarks, 3 seeds each). It also covers the 4-LLM cross-validation conditions, paper v10, and how to request access to patent claims 1-44.

For product overview and deployment scenarios, see the product page (/products/device/slimetree-rlm/). This page focuses on primary materials for reproduction and verification.

1. Evaluation data and external benchmarks [Open]

We measured against external public benchmarks, not self-made ones. The author, difficulty axis, scale, and scoring metric of each benchmark are listed below. All conditions needed for reproduction are public — a same-condition PoC on your LLM can be set up in 3-5 business days.

| Benchmark | Author / origin | Axis (paper §3.5) | Scale | Scoring metric | Result (RLM effect) |
|---|---|---|---|---|---|
| SimpleQA | OpenAI | T1: long-tail entity | 500 Q × 3 seeds = 1,500 trials | F-score (correct / attempted), SimpleQA paper preferred metric | incorrect -20.5 pt, F +3.7 pt |
| TruthfulQA | Lin et al. 2022 | T5+T6: misconception / trick | 790 Q × 3 seeds = 2,370 trials (the 790 of the standard 817 that admit binary scoring) | Truth metric, Llama-3 judge / NLI-equivalent | incorrect -20.1 pt, Truth +20.1 pt |
| HaluEval-QA | HotpotQA-derived (RUCAIBox) | T2+T6: false premise / multi-hop | 1,000 Q × 3 seeds = 3,000 trials | binary correctness on (Question, hallucinated_answer) | incorrect -20.7 pt, F +21.4 pt |
| 3-bench combined | 3 independent question sources | T1 ↔ T5+T6 ↔ T2+T6 (full axis cover) | 6,870 trials (2,290 distinct Q × 3 seeds) | seed-mean ± SD of incorrect-rate Δ | -20.4 ± 0.3 pt ★ |
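
The totals in the combined row can be rechecked with a few lines of arithmetic. The sketch below is an illustration only: the per-seed Δ values are not listed on this page, so it uses the three per-benchmark Δ values from the table as a stand-in, and it assumes the sample standard deviation (n - 1 denominator). Under those assumptions the figures round to the reported -20.4 ± 0.3 pt.

```python
from statistics import mean, stdev

# Question counts and incorrect-rate deltas (pt), copied from the table above.
benchmarks = {
    "SimpleQA": {"questions": 500, "delta_incorrect": -20.5},
    "TruthfulQA": {"questions": 790, "delta_incorrect": -20.1},
    "HaluEval-QA": {"questions": 1000, "delta_incorrect": -20.7},
}
SEEDS = 3

distinct_questions = sum(b["questions"] for b in benchmarks.values())
total_trials = distinct_questions * SEEDS

# Assumption: per-benchmark deltas stand in for the per-seed deltas,
# which this page does not list.
deltas = [b["delta_incorrect"] for b in benchmarks.values()]
combined_mean = mean(deltas)
combined_sd = stdev(deltas)  # sample SD (n - 1); the page does not say which SD is used

print(distinct_questions, total_trials)
print(f"{combined_mean:.1f} ± {combined_sd:.1f} pt")
```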

1.1 Reproduction conditions — LLM, temperature, seed, cache

LLM: Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B (via Ollama). Primary benchmarks in the table above run on Qwen3:8b; 4-LLM cross-validation in §2
temperature: baseline 0.7, R-mode 0.4 (impl_v2 Phase B, suppresses fabrication randomness)
seeds: 3 seeds fixed (23, 47, 89) for reproducibility
cache: 200 (absorbs decoding noise)
Scoring: SimpleQA: OpenAI preferred F-score (refusal-when-uncertain rewarded); TruthfulQA: Truth metric; HaluEval: binary correctness. Reference rubrics are kept unchanged for all 3 benchmarks
variance metric: seed-to-seed σ (standard deviation of per-seed Δ). Enables measuring Property A variance absorption
Typical run time: HaluEval 6,000 LLM calls ≈ 22.5 min (same-host Ollama, reference value for an 8B-class model)
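
The conditions above can be collected into one configuration object. This is a minimal sketch, not the bench harness itself: `RunConfig`, `ollama_payload`, the `r_mode` flag, the example prompt, and the model tag spelling `qwen3:8b` are all assumptions made for illustration. The request body follows the shape of Ollama's `POST /api/generate` endpoint (`model`, `prompt`, `stream`, and `options` with `temperature` and `seed`), and nothing is sent over the network. The meaning of the cache value is not defined on this page, so it is carried as an opaque setting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    model: str                          # Ollama model tag (assumed spelling)
    baseline_temperature: float = 0.7
    r_mode_temperature: float = 0.4     # R-mode, impl_v2 Phase B
    seeds: tuple = (23, 47, 89)
    cache: int = 200                    # opaque here; semantics not given on this page

def ollama_payload(cfg: RunConfig, prompt: str, seed: int, r_mode: bool) -> dict:
    """Build the JSON body for POST /api/generate without sending it."""
    temperature = cfg.r_mode_temperature if r_mode else cfg.baseline_temperature
    return {
        "model": cfg.model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature, "seed": seed},
    }

cfg = RunConfig(model="qwen3:8b")
payloads = [
    ollama_payload(cfg, "Example question", seed, r_mode)
    for seed in cfg.seeds
    for r_mode in (False, True)
]

# Average wall time per call implied by the HaluEval reference run.
seconds_per_call = 22.5 * 60 / 6000
print(len(payloads), seconds_per_call)
```

The reference run works out to roughly 0.225 s per call on average.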
Companion observation (Property A, variance absorption): the variance-tightening effect depends on how noisy the baseline is. On the quiet SimpleQA baseline (σ=0.31) the routed σ is 0.47, slightly wider. On TruthfulQA σ falls from 0.31 to 0.10 (3.1× tighter), and on HaluEval-QA from 1.23 to 0.26 (4.7× tighter). The noisier the baseline, the stronger the cascade's tightening; on quiet baselines the effect is null or slightly wider. This is a noise-conditional property by design, not a universal law.
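
The tightening factors quoted above are the ratio of baseline σ to routed σ. The sketch below recomputes them from the figures in the paragraph; the helper name `tightening` is hypothetical, and a ratio below 1 means the routed runs are more spread out than the baseline.

```python
# (baseline sigma, routed sigma) per benchmark, copied from the paragraph above.
sigmas = {
    "SimpleQA": (0.31, 0.47),
    "TruthfulQA": (0.31, 0.10),
    "HaluEval-QA": (1.23, 0.26),
}

def tightening(baseline_sigma: float, routed_sigma: float) -> float:
    """Ratio > 1: routed runs are tighter. Ratio < 1: routed runs are wider."""
    return baseline_sigma / routed_sigma

ratios = {name: tightening(base, routed) for name, (base, routed) in sigmas.items()}

for name, ratio in ratios.items():
    verdict = "tighter" if ratio > 1 else "wider"
    print(f"{name}: {ratio:.2f}x ({verdict})")
```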

2. 4-LLM cross-validation [Open]

To show this is not a Qwen3-only number, 4 LLMs were re-run under identical conditions: 100 traps × cache=200 × seed=23, baseline vs routed.

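| LLM | Size | Baseline halluc | Routed halluc | Δ halluc | Δ Latency | Routes (D/μ/R) |
|---|---|---|---|---|---|---|
| Qwen3:8b | 8B | 63% | 19% | -44 pt | -85.7% | 51/46/3 |
| Llama 3.1:8b | 8B | 51% | 19% | -32 pt | -83.3% | 51/46/3 |
| Mistral 7B | 7B | 70% | 51% | -19 pt | -74.8% | 51/45/4 |
| Gemma 3:4B | 4B | 79% | 59% | -20 pt | -79.3% | 51/46/3 |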

★ Performance equalizer: both Tier-A 8B-class LLMs (Qwen3 and Llama 3.1) land at 19% hallucination after routing, an 81% correct ceiling. Within this tier, the choice of LLM did not change the routed result in this run (100 traps, one seed). Multilingual results: Japanese +54 pt / English +24 pt / Arabic +7 pt (paper v10 §3 multilingual matrix).
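
The Δ halluc and Routes columns in the table above can be checked for internal consistency. In this sketch the relative change is a derived figure that does not appear on this page, and treating the three route counts as summing to the 100 traps is an assumption.

```python
# (model, baseline halluc %, routed halluc %, reported delta in pt, routes D/mu/R)
rows = [
    ("Qwen3:8b", 63, 19, -44, (51, 46, 3)),
    ("Llama 3.1:8b", 51, 19, -32, (51, 46, 3)),
    ("Mistral 7B", 70, 51, -19, (51, 45, 4)),
    ("Gemma 3:4B", 79, 59, -20, (51, 46, 3)),
]

checks = []
for model, baseline, routed, reported_delta, routes in rows:
    delta = routed - baseline              # absolute change, percentage points
    relative = 100 * delta / baseline      # relative change in %, derived here
    consistent = delta == reported_delta and sum(routes) == 100
    checks.append((model, delta, round(relative, 1), consistent))
    print(f"{model}: {delta:+d} pt ({relative:+.1f}% relative), consistent={consistent}")
```

The absolute and relative views rank the models the same way here, but the relative figure makes the gap between the 8B pair and the smaller models easier to see.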

3. Paper [Published on Zenodo (CC-BY 4.0)]

Paper (English, Zenodo): "SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; published 2026-01-14; CC-BY 4.0).
DOI: 10.5281/zenodo.18238339
Direct PDF: slimetree_rlm_paper_final_en.pdf (968.7 KB)
Zenodo record: zenodo.org/records/18238339
Citation: Sasaki, H. (2026). SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference. Zenodo. https://doi.org/10.5281/zenodo.18238339
Japanese version v2: Jxiv submission in preparation, 15 pages / ~24,685 chars / 221 KB.
Target venues: EMNLP / MLSys / VLDB / AMIA / EACL / AAAI / NeurIPS (experimental-rigor requirements cleared).
Further inquiries: Contact us (affiliation-/use-case-specific supplements, PoC re-runs, etc.)

4. Patent [Coverage public only; text NDA-gated]

The architecture of SlimeTree-RLM is covered by patent claims 1-44. Only the coverage area is public:

  • (SemanticTime, SensoryTime) tuple, credibility / forget_index (claims 1, 17, 25)
  • Hot Shelf (Treap) + Cold Shelf (RB-Tree) (claims 2, 7, 8)
  • Branch-free 3-mode router, failure signal + w·exp(-η·regret), Adaptive η (claims 16, 38-42)
  • SAS semantic-area sampling, SpiralIndex + LazySpiralUpdate (claims 2-4, 8)
  • Operator ring + Bernstein commutator, Kosaraju SCC (claims 5, 11, 30-31)
  • Bron-Kerbosch + greedy mutually-disjoint clique cover (claims 6, 34)
  • Hilbert-curve index (claim 9)
  • WAL + cascade rollback (non-commuting-side propagation only) (claims 21, 35-37)
  • P_split / merge / freeze + fixed-point (claim 43)
  • WASM + SharedArrayBuffer + Atomics (claim 12), SlotAdapterAPI (claim 13), MetaGeneSlot GDPR/HIPAA (claim 14), Redlock distributed mutex (claim 16), LLVM Function Pass (claims 30-34), RocksDB/Redis backends (claim 19)
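
For readers unfamiliar with the `w·exp(-η·regret)` form named in the router bullet above, the sketch below shows a generic exponential-weights update of the kind used in standard online-learning algorithms. It is not the claimed method: the claim text is NDA-gated, so the function names, the regret values, the fixed η, and the sampling step are all assumptions made for illustration. It also uses ordinary branching and does not attempt the branch-free or adaptive-η aspects. The mode labels are borrowed from the "Routes (D/μ/R)" column in §2.

```python
import math
import random

MODES = ("D", "mu", "R")  # labels borrowed from the Routes (D/μ/R) column

def update_weights(weights, regrets, eta):
    """Scale each weight by exp(-eta * regret), then renormalise to sum to 1."""
    scaled = [w * math.exp(-eta * r) for w, r in zip(weights, regrets)]
    total = sum(scaled)
    return [s / total for s in scaled]

def pick_mode(weights, rng):
    """Sample one mode with probability proportional to its weight."""
    return rng.choices(MODES, weights=weights, k=1)[0]

weights = [1 / 3, 1 / 3, 1 / 3]
regrets = [0.0, 0.5, 1.0]        # made-up regret values, for illustration only
weights = update_weights(weights, regrets, eta=1.0)

rng = random.Random(23)          # fixed seed so the sketch is repeatable
mode = pick_mode(weights, rng)
print([round(w, 3) for w in weights], mode)
```

Modes that accumulate more regret lose weight exponentially fast, and a larger η makes the router react more sharply to each failure signal.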

The claim text is available on request via contact, after an NDA is signed.

5. Implementations (code) [Distribution in preparation]

Python reference implementation: impl/ v0.1, 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo. The README maps paper §x and patent claim N.
Improved implementation: impl_v2/, Phase A (subtype bias trial) → Phase B (R-prompt softening + bias inversion + strict grader), reaching 81.3% (σ=4%) at cache=200.
Rust port + WASM: 272 KB single binary, 24× vs Python, 138 unit tests, zero data loss under a 10,000 slot × 500 step stress test. WASM evaluation copies are distributed individually.
Bench harness: same-condition replay scripts for SimpleQA / TruthfulQA / HaluEval-QA, including 4-LLM Ollama connector examples.

Distribution forms (evaluation licence / joint PoC / sponsored development / OEM integration) are arranged via contact or the partners page.