★ DEVICE primary ★ AI application
SlimeTree-RLM — Measurement procedure and primary materials
For evaluators and procurement. The procedure, rubrics, and LLM configurations behind the -20.4 ± 0.3 pt architectural constant measured across 3 external benchmarks × 3 seeds = 6,870 trials, the 4-LLM cross-validation conditions, paper v10, and access information for patent claims 1-44.
For product overview and deployment scenarios, see the product page (/products/device/slimetree-rlm/). This page focuses on primary materials for reproduction and verification.
A technique that curbs an AI (a large language model like ChatGPT) from saying plausible-but-wrong things (hallucinations). It never touches the model's weights; it supports it from outside as a "record body" to raise answer reliability. A tiny 272 KB part that runs in the browser/phone with no server.
Suppresses LLM hallucinations (plausible lies) without changing any weights. Measured a stable −20.4 ± 0.3 pt improvement across 3 external benchmarks × 3 seeds = 6,870 trials. A "performance equalizer" where 8B-class converges to an 81% ceiling across 4 LLMs. Procedure, rubric and seeds are all public.
A meaning-driven record body. Routes into D (deterministic) / μ (suppression) / R (reasoning): certain parts deterministically, risky parts suppressed, the LLM only when needed. Weights untouched, so it retrofits onto any model. 272 KB WASM, no server in browser/mobile, with an audit WAL.
Hallucination suppression measured as a −20.4 ± 0.3 pt structural constant over 3 bench × 3 seed = 6,870 trials. A performance equalizer where Tier-A 8B-class converges to an 81% ceiling across 4 LLMs. A tier-③ implementation handling non-reproducible stochastic output via meaning-equivalence + convergence + residual. Procedure, LLM settings, seeds and rubric fully public, third-party reproducible.
📋 "Ask your AI at this level" copies this page's explanation with an instruction matched to the level you picked. Paste it into your own AI (Claude · GPT · Gemini · Grok) to dig deeper at that resolution.
1. Evaluation data and external benchmarks Open
We measured against external public benchmarks, not self-made ones. The author, difficulty axis, scale, and scoring metric of each benchmark are listed below. All conditions needed for reproduction are public — a same-condition PoC on your LLM can be set up in 3-5 business days.
| Benchmark | Author / origin | Axis (paper §3.5) | Scale | Scoring metric | Result (RLM effect) |
|---|---|---|---|---|---|
| SimpleQA | OpenAI | T1: long-tail entity | 500 Q × 3 seeds = 1,500 trials | F-score (correct / attempted), SimpleQA paper preferred metric | incorrect -20.5 pt, F +3.7 pt |
| TruthfulQA | Lin et al. 2022 | T5+T6: misconception / trick | 790 Q × 3 seeds = 2,370 trials (790 of standard 817, those that admit binary scoring) |
Truth metric, Llama-3 judge / NLI-equivalent | incorrect -20.1 pt, Truth +20.1 pt |
| HaluEval-QA | HotpotQA-derived (THUDM) | T2+T6: false premise / multi-hop | 1,000 Q × 3 seeds = 3,000 trials | binary correctness on (Question, hallucinated_answer) | incorrect -20.7 pt, F +21.4 pt |
| 3-bench combined | 3 independent question sources | T1 ↔ T5+T6 ↔ T2+T6 (full axis cover) | 6,870 trials (2,290 distinct Q × 3 seeds) | seed-mean ± SD of incorrect-rate Δ | -20.4 ± 0.3 pt ★ |
1.1 Reproduction conditions — LLM, temperature, seed, cache
| LLM | Qwen3:8b / Llama 3.1:8b / Mistral 7B / Gemma 3:4B (via Ollama). Primary benchmarks in this table run on Qwen3:8b; 4-LLM cross-validation in §2 |
|---|---|
| temperature | baseline 0.7, R-mode 0.4 (impl_v2 Phase B, suppresses fabrication randomness) |
| seeds | 3 seeds fixed (23, 47, 89) for reproducibility |
| cache | 200 (absorbs decoding noise) |
| Scoring | SimpleQA: OpenAI preferred F-score (refusal-when-uncertain rewarded); TruthfulQA: Truth metric; HaluEval: binary correctness. Reference rubrics are kept unchanged for all 3 benchmarks |
| variance metric | seed-to-seed σ (standard deviation of per-seed Δ). Enables measuring Property A variance absorption |
| Typical run time | HaluEval 6,000 LLM calls ≈ 22.5 min (same-host Ollama, reference value for an 8B-class model) |
1-2. Local LM applicability benchmark — n=1000 in-house corpus, 10 domains Published 2026-06-05
In June 2026, on our RTX 5060 Ti we ran Gemma 4 12B Q4_K_M (ollama 0.30.5) against 1,000 prompts × 10 domains, then judged every output through SlimeTree-RLM v3.113 R-meta verdict. This is the per-domain baseline for running a Local LM under the SlimeTree-RLM quality gate (95% confidence interval ±10% per domain).
| Domain | n | "sufficient" quality | Canonical disclaimer detection (±95% CI) | Hallucination signal rate | Mean score |
|---|---|---|---|---|---|
| medical_factual | 100 | 99/100 | 1.0% (±2.0%) | 0% | 0.797 |
| medical_advisory | 100 | 100/100 | 12.0% (±6.4%) | 0% | 0.800 |
| legal_factual | 100 | 94/100 | 8.0% (±5.3%) | 0% | 0.764 |
| legal_advisory | 100 | 93/100 | 27.0% (±8.7%) | 0% | 0.762 |
| finance_factual | 100 | 100/100 | 9.0% (±5.6%) | 0% | 0.800 |
| finance_advisory | 100 | 100/100 | 10.0% (±5.9%) | 0% | 0.800 |
| code_factual | 100 | 98/100 | 7.0% (±5.0%) | 0% | 0.793 |
| business_advisory | 100 | 100/100 | 8.0% (±5.3%) | 0% | 0.800 |
| educational_factual | 100 | 100/100 | 6.0% (±4.7%) | 0% | 0.800 |
| japanese_business (advisory-leaning) | 100 | 77/100 ★ | 2.0% (±2.7%) ★ | 4.0% ★ | 0.725 |
| 10-domain total | 1,000 | 961/1000 (96.1%) | 9.0% (overall) | 0.4% (overall) | 0.789 |
Key findings
- Overall Local LM quality is at first-draft business grade: 96.1% sufficient (±1.2% CI). Not a frontier-cloud replacement — suitable for review-attached workflows under SlimeTree-RLM verdict.
- Advisory-domain canonical disclaimer detection stays at 8-27% (legal / medical / finance). Gemma 4 12B does emit disclaimer-intent text, but the phrasing does not match the canonical patterns D/µ/R checks against.
- japanese_business is a triple outlier: 77/100 sufficient + 2% disclaimer + the only domain with hallucination signals (4%). The R-meta verdict's canonical patterns assume English, so a Japanese-specific calibration is needed.
Judging at µs scale
| Total time to judge 1,000 records | 1.07 seconds (Hyperscan + LRU memoization, cold compile included) |
|---|---|
| Judge p50 latency | 67.7 µs |
| Judge p99 latency | 101.6 µs (SLO 200 µs ✓) |
| Judge p99.9 latency | 163.8 µs |
| Judge max latency | 519.3 µs (cold-compile cost, first call only) |
| vs. cloud LLM-as-judge | Frontier LLM judge calls take 1-3 s/record; SlimeTree-RLM ~100 µs/record = 10,000-30,000× faster |
Reproduction conditions
| Local LM | gemma4:12b Q4_K_M (ollama 0.30.5 with native gemma4 architecture support) |
|---|---|
| Hardware | NVIDIA GeForce RTX 5060 Ti (16 GB) / CUDA 13.1 / WSL2 Ubuntu |
| Generation | /api/chat, think:false, num_predict=512, temperature=0.7 |
| Generation time | 3 hours for 1,000 prompts (avg 5.5 prompts/min, ~46 tok/s sustained) |
| VRAM | 8.7 GB sustained (well within 16 GB) |
| Corpus design | 10 domains × 100 prompts (330 seeds + 670 template expansions), deterministic builder |
| Judge layer | SlimeTree-RLM R-meta verdict v3.113 (Hyperscan + memoization stacked, Phase B 121-version lineage) |
Full data is stored on our internal D drive under Phase D v0.2 corpus (2026-06-05). Available for PoC sharing on request (corpus prompts MIT, Gemma outputs covered by Gemma Terms of Use).
2. 4-LLM cross-validation Open
To show this is not a Qwen3-only number, 4 LLMs were re-run under identical conditions: 100 traps × cache=200 × seed=23, baseline vs routed.
| LLM | Size | Baseline halluc | Routed halluc | Δ halluc | Δ Latency | Routes (D/μ/R) |
|---|---|---|---|---|---|---|
| Qwen3:8b | 8B | 63% | 19% | -44 pt | -85.7% | 51/46/3 |
| Llama 3.1:8b | 8B | 51% | 19% | -32 pt | -83.3% | 51/46/3 |
| Mistral 7B | 7B | 70% | 51% | -19 pt | -74.8% | 51/45/4 |
| Gemma 3:4B | 4B | 79% | 59% | -20 pt | -79.3% | 51/46/3 |
★ Performance equalizer: Both Tier-A 8B-class LLMs (Qwen3 and Llama 3.1) land at 19% hallucination = 81% correct ceiling after routing. Within the same Tier, the choice of LLM stops mattering. Multilingual: Japanese +54 pt / English +24 pt / Arabic +7 pt (paper v10 §3 multilingual matrix).
3. Paper Published on Zenodo (CC-BY 4.0)
| Paper (English, Zenodo) | "SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference" (SASAKI, HIROSHI; published 2026-01-14; CC-BY 4.0). DOI: 10.5281/zenodo.18238339 Direct PDF: slimetree_rlm_paper_final_en.pdf (968.7 KB) Zenodo record: zenodo.org/records/18238339 |
|---|---|
| Citation | Sasaki, H. (2026). SlimeTree-RLM: Failure-Aware Routing and Controlled Recursive Inference. Zenodo. https://doi.org/10.5281/zenodo.18238339 |
| Japanese version v2 | jxiv submission preparing, 15 pages / ~24,685 chars / 221 KB. |
| Target venues | EMNLP / MLSys / VLDB / AMIA / EACL / AAAI / NeurIPS (experimental-rigor requirements cleared). |
| Further inquiries | Contact us (affiliation-/use-case-specific supplements, PoC re-runs, etc.) |
4. Patent Coverage public only — text NDA-gated
The architecture of SlimeTree-RLM is covered by patent claims 1-44. Only the coverage area is public:
- (SemanticTime, SensoryTime) tuple, credibility / forget_index (claims 1, 17, 25)
- Hot Shelf (Treap) + Cold Shelf (RB-Tree) (claims 2, 7, 8)
- Branch-free 3-mode router, failure signal + w·exp(-η·regret), Adaptive η (claims 16, 38-42)
- SAS semantic-area sampling, SpiralIndex + LazySpiralUpdate (claims 2-4, 8)
- Operator ring + Bernstein commutator, Kosaraju SCC (claims 5, 11, 30-31)
- Bron-Kerbosch + greedy mutually-disjoint clique cover (claims 6, 34)
- Hilbert-curve index (claim 9)
- WAL + cascade rollback (non-commuting-side propagation only) (claims 21, 35-37)
- P_split / merge / freeze + fixed-point (claim 43)
- WASM + SharedArrayBuffer + Atomics (claim 12), SlotAdapterAPI (claim 13), MetaGeneSlot GDPR/HIPAA (claim 14), Redlock distributed mutex (claim 16), LLVM Function Pass (claims 30-34), RocksDB/Redis backends (claim 19)
Text access via contact → after NDA.
5. Implementations (code) Distribution preparing
| Python reference implementation | impl/ v0.1: 2,210 lines, zero dependencies, 25 unit tests pass, 80-step demo. README maps paper §x and patent claim N. |
|---|---|
| Improved implementation | impl_v2/: Phase A (subtype bias trial) → Phase B (R-prompt softening + bias inversion + strict grader), reaching 81.3% (σ=4%) at cache=200. |
| Rust port + WASM | 272 KB single binary, 24× vs Python, 138 unit tests, zero data loss under 10,000 slot × 500 step stress. WASM evaluation copies distributed individually. |
| Bench harness | Same-condition replay scripts for SimpleQA / TruthfulQA / HaluEval-QA, including 4-LLM Ollama connector examples. |
Distribution forms (evaluation licence / joint PoC / sponsored development / OEM integration) via contact or the partners page.
6. Related
- Product page: SlimeTree-RLM — product detail (deployment scenarios, enterprise / AI provider angles)
- Deep-dive blog: Just 272 KB to cut LLM hallucinations to one-third — SlimeTree-RLM (Japanese, 7 chapters)
- Applied blog (Phase D corpus + LoRA + vLLM): Gemma 4 12B on RTX 5060 Ti, 1000 prompts — enterprise Local AI: where it stands (9 chapters, LoRA + vLLM addendum, with Errata)
- Related news: research releases and announcements
- Same family, simple-record-system variant: SlimeTree-VSAM + deep-dive blog
- Category: DEVICE products / Resource home
