Why Do LLMs Return Different Answers to the Same Question? Creating "Consistency" Through Determinism

How High is Mount Fuji?

One day, I asked the local LLM in our office about the height of Mount Fuji, with temperature=0 and a fixed seed. In theory, those settings should return "the same answer every time." But:

"What is the height of Mount Fuji?" → 3,776.1 m
"How many meters is Mount Fuji?" → 3,776.12 m
"How high is Mount Fuji?" → 3,776.24 m

I asked the same thing each time, and merely changing the phrasing produced three different answers. Mount Fuji's elevation should be a single number. This is the "drift" of LLMs.

In casual conversation, it's a funny anecdote. But when the question is "what were yesterday's sales," "who approved the transaction 3 days ago," or "is this shipping cost tax-inclusive?", in other words in business that deals with facts, this drift is not funny.

What we tackled was eliminating this "drift" not through intelligence but through structure. Rather than making the LLM smarter, we normalize meaning at the input, ensuring that questions with identical meaning always converge to the same answer. Today I'll introduce an experimental log (internal codename SlimeLab) of this work.

The Idea: If the Meaning is the Same, Fold It into the Same "Key"

Think of a cache. When the same question comes in, return the previous answer without recomputing it. It's fast, and above all the answer is the same every time.

The problem is determining whether questions are "the same." An ordinary cache only works when strings match exactly. "What is the height of Mount Fuji?" and "How many meters is Mount Fuji?" are different strings, so they're treated as different, and the LLM produces drifting answers each time.

So we placed a normalization layer at the input that deterministically folds questions with identical meaning into the same key (cache key). When keys match, we return one confirmed answer. Drift disappears in principle.
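As a rough sketch of the shape of this layer (not the SlimeLab implementation): the `normalize` function, its regular expression, and the `NormalizedCache` class are hypothetical names invented for illustration, and the pattern covers only the three Mount Fuji phrasings from the introduction.

```python
import re
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str]

# Hypothetical pattern: three phrasings that all ask for a height.
_HEIGHT = re.compile(
    r"(?:what is the height of|how many meters is|how high is)\s+(.+?)\??",
    re.IGNORECASE,
)

def normalize(question: str) -> Optional[Key]:
    """Fold a question into a key, or return None when it is not recognized."""
    m = _HEIGHT.fullmatch(question.strip())
    if m is None:
        return None
    return ("height", m.group(1).lower())

class NormalizedCache:
    """Return one stored answer per key, however the question was phrased."""

    def __init__(self, backend: Callable[[str], str]) -> None:
        self._backend = backend
        self._answers: Dict[Key, str] = {}

    def ask(self, question: str) -> str:
        key = normalize(question)
        if key is None:
            # Unrecognized question: pass it through and do not cache it.
            return self._backend(question)
        if key not in self._answers:
            self._answers[key] = self._backend(question)
        return self._answers[key]

# Stand-in for a drifting LLM: every call returns a different string.
counter = iter(range(1, 100))
cache = NormalizedCache(lambda q: f"answer #{next(counter)}")
for q in ("What is the height of Mount Fuji?",
          "How many meters is Mount Fuji?",
          "How high is Mount Fuji?"):
    print(cache.ask(q))  # prints "answer #1" three times
```

The backend is called once; the other two phrasings hit the same key and reuse its answer.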

The key word is "deterministically." This is where our commitment lies: bit-exact (bit-perfect) results.

Commitment ①: Don't Cut Corners in the Middle of Calculation

"3,776.12 m" and "3.77612 km" and "377,612 cm" are the same length. To fold these into the same key, we need to align units. The naive approach:

3.77612 * 1000 == 3776.12   # → False
# Actual value: 3776.1200000000003

Computing with floating point (float) perturbs the trailing digits. Moreover, the exact result can vary with hardware, compiler settings, and library versions. In other words, nothing guarantees that intermediate results are reproduced bit for bit. Communications (error correction) and storage (hash verification) have spent decades making "both ends" bit-perfect, but the calculation and conversion in the middle have long gotten by with "roughly correct."

We keep unit conversion closed within rational numbers (fractions).

from fractions import Fraction
Fraction("3776.12")          # = 94403/25
Fraction("3.77612") * 1000   # = 94403/25  ← exactly matches
Fraction("377612") / 100     # = 94403/25  ← this also matches

All three reduce to the same irreducible fraction, 94403/25, so the three notations fold into the same key. No float rounding error creeps in. This is SlimeUnit (unit normalization).
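To make the folding step concrete, here is a minimal sketch, assuming a small conversion table. `TO_METERS`, `length_key`, and the key format are hypothetical names for illustration, not the actual SlimeUnit interface.

```python
from fractions import Fraction

# Hypothetical table: exact factor from each supported unit to meters.
TO_METERS = {
    "m": Fraction(1),
    "km": Fraction(1000),
    "cm": Fraction(1, 100),
}

def length_key(value: str, unit: str) -> str:
    """Build a canonical key for a length using exact rational arithmetic only."""
    # Parse from the decimal string (never from a float), dropping digit separators.
    exact = Fraction(value.replace(",", "")) * TO_METERS[unit]
    return f"length:{exact.numerator}/{exact.denominator} m"

print(length_key("3,776.12", "m"))   # length:94403/25 m
print(length_key("3.77612", "km"))   # length:94403/25 m
print(length_key("377,612", "cm"))   # length:94403/25 m
```

Because `Fraction` always stores the reduced form, the numerator and denominator themselves can serve as the key.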

Commitment ②: Explicitly State the "Reference Date" for Times

"3 days ago," "last Wednesday," and "2026/06/24" may all refer to the same day. But "3 days ago" is 3 days from when? Without a reference date, it's undefined.

Most date libraries resolve relative dates against the system clock at that moment (now). That is convenient, but non-deterministic. If an audit later asks you to "reproduce that day's batch," you can't, because now() has changed.

So SlimeClock never calls now(): it takes an explicit reference date (as_of) and resolves everything with integer arithmetic alone.

as_of = 2026-06-27 (Sat)
  3 days ago        → 2026-06-24
  2026/06/24        → 2026-06-24
  June 24           → 2026-06-24   ← all the same key
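The resolution above can be sketched as follows. This is an illustration under assumptions, not SlimeClock itself: `resolve_date` and `MONTHS` are hypothetical names, only three expression forms are handled, and a bare "June 24" is assumed to mean the reference date's year.

```python
import re
from datetime import date
from typing import Optional

# Spelled out rather than taken from the locale, so results never depend on the host.
MONTHS = ("January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December")

def resolve_date(text: str, as_of: date) -> Optional[date]:
    """Resolve a date expression against an explicit reference date (no clock access)."""
    text = text.strip()
    m = re.fullmatch(r"(\d+) days? ago", text)
    if m:
        # Day arithmetic on the date's ordinal, which is a plain integer.
        return date.fromordinal(as_of.toordinal() - int(m.group(1)))
    m = re.fullmatch(r"(\d{4})/(\d{1,2})/(\d{1,2})", text)
    if m:
        year, month, day = (int(g) for g in m.groups())
        return date(year, month, day)
    m = re.fullmatch(r"([A-Za-z]+) (\d{1,2})", text)
    if m and m.group(1).capitalize() in MONTHS:
        # Assumption: a month and day without a year means as_of's year.
        month = MONTHS.index(m.group(1).capitalize()) + 1
        return date(as_of.year, month, int(m.group(2)))
    return None  # unrecognized: leave unresolved rather than guess

as_of = date(2026, 6, 27)
for expr in ("3 days ago", "2026/06/24", "June 24"):
    print(expr, "->", resolve_date(expr, as_of))  # each resolves to 2026-06-24
```

Nothing in the function reads the system clock, so rerunning it with the same `as_of` gives the same result.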

Financial "T+3 business day settlement" works the same way. Using integer arithmetic, we skip weekends and holidays (our business day calendar includes the company's founding anniversary) to determine the settlement date.

T+3 (as_of=6/27 Sat)
  → Sat 27, Sun 28 skipped as weekend
  → Mon 29 (1), Tue 30 is founding anniversary, skipped, Wed 7/1 (2), Thu 7/2 (3)
  → 2026-07-02

With the reference date and calendar fixed, every calculation yields the same date, unaffected by server clock drift.
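The T+3 walk above can be reproduced with a short sketch. `add_business_days` and `HOLIDAYS` are hypothetical names, and the calendar holds only the founding anniversary from the example; a real calendar would also list public holidays.

```python
from datetime import date
from typing import FrozenSet

# Hypothetical company calendar: only the founding anniversary from the example.
HOLIDAYS: FrozenSet[date] = frozenset({date(2026, 6, 30)})

def add_business_days(as_of: date, n: int, holidays: FrozenSet[date]) -> date:
    """Return the date n business days after as_of, skipping weekends and holidays."""
    if n < 0:
        raise ValueError("n must be non-negative")
    ordinal = as_of.toordinal()  # integer day count; no clock is consulted
    remaining = n
    while remaining > 0:
        ordinal += 1
        day = date.fromordinal(ordinal)
        # weekday(): Monday is 0, Saturday is 5, Sunday is 6.
        if day.weekday() < 5 and day not in holidays:
            remaining -= 1
    return date.fromordinal(ordinal)

print(add_business_days(date(2026, 6, 27), 3, HOLIDAYS))  # 2026-07-02
```

Passing the calendar as an argument keeps it part of the input, so a change to the holiday list is visible as a change in the call rather than hidden state.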

Commitment ③: Don't Fabricate What You Don't Know

This is where we differ most from LLMs.

LLMs, when asked something they don't know, tend to make up a plausible-sounding answer (hallucination). Our normalization layer does the opposite: when there is not enough information to decide, it stops and leaves the item unknown (suspended). Internally we call this a Null Slot. Instead of probabilistically fabricating, we deterministically say "I don't know."

For example, identifying a person. SlimeWho resolves person references by matching against our internal HR data.

"Manager Tanaka," "Tanaka Ichiro," "t.tanaka@…(company email)," "that Tanaka from sales," "employee number E1001" → all the same person, folded into the same key
Just "Tanaka" → If there are 2 Tanakas in the company, it's unclear which one is meant, so the reference is suspended. We don't arbitrarily pick one.

"Not arbitrarily deciding" is critically important. Getting the approver wrong on an authorization could be fatal in an audit. Better to stop than to get it wrong.

Note: Person data is not shared externally. Matching is completed within internal data, and shared logs contain neither employee numbers nor names, only an irreversible hash (pseudonym).
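The resolve-or-suspend rule and the pseudonymized log entry might look like the sketch below. Everything here is illustrative: `ALIASES`, `resolve_person`, and `pseudonym` are hypothetical names, E1002 is an invented second Tanaka, and the keyed hash (HMAC-SHA-256) is our assumption about one way to produce a pseudonym that cannot be reversed by enumerating employee numbers. The text above only specifies "an irreversible hash."

```python
import hashlib
import hmac
from typing import Dict, FrozenSet, Optional

# Hypothetical directory: each alias maps to every employee number it could mean.
# E1001 comes from the example above; E1002 is an invented second Tanaka.
ALIASES: Dict[str, FrozenSet[str]] = {
    "manager tanaka": frozenset({"E1001"}),
    "tanaka ichiro": frozenset({"E1001"}),
    "e1001": frozenset({"E1001"}),
    "tanaka": frozenset({"E1001", "E1002"}),
}

def resolve_person(reference: str) -> Optional[str]:
    """Return the employee number, or None (a Null Slot) unless exactly one matches."""
    candidates = ALIASES.get(reference.strip().lower(), frozenset())
    if len(candidates) == 1:
        (employee_id,) = candidates
        return employee_id
    return None  # zero or several candidates: suspend instead of guessing

def pseudonym(employee_id: str, secret: bytes) -> str:
    """Keyed hash for shared logs (assumption: HMAC-SHA-256 with an internal secret)."""
    return hmac.new(secret, employee_id.encode("utf-8"), hashlib.sha256).hexdigest()

print(resolve_person("Manager Tanaka"))  # E1001
print(resolve_person("Tanaka"))          # None
```

Returning `None` forces the caller to handle the suspended case explicitly instead of receiving a plausible but arbitrary employee number.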

Verify Only When Answers Diverge

Back to Mount Fuji at the start. When the answer diverges as "3,776.1 / .12 / .24," which one should we pin as the correct answer in the key?

Here we introduced LazyVerify (lazy verification). The thinking is simple: do nothing when answers match, and verify only when they diverge.

1. First, take a majority vote. In this case, 3,776.12 is the most common answer (free, offline).

2. If the vote doesn't settle it, or we want to be cautious, consult a trusted source just once. Checking the Wikipedia page for Mount Fuji shows "the highest point of the mountain body is 3,776.12 m."

3. Freeze the confirmed answer. After this, return the same confirmed value for the same key without referencing again.

Because we verify only when answers diverge, external references trigger on only a tiny fraction of actual traffic. And once an answer is confirmed, it's frozen, so from then on it reproduces bit-perfectly. We convert one point of uncertainty at the boundary into certainty and keep everything inside deterministic. What communications and storage have done "at both ends" for decades, we've replicated in the middle of calculation.
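The three steps can be sketched as a small class. This is a sketch under assumptions rather than LazyVerify itself: `LazyVerifier` and `trusted_lookup` are hypothetical names, "majority" is interpreted here as a strict majority of the sampled answers, and the lookup is a stub standing in for the one-time reference to a trusted source.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Sequence

class LazyVerifier:
    """Confirm one answer per key, consulting the trusted source only on divergence."""

    def __init__(self, trusted_lookup: Callable[[str], Optional[str]]) -> None:
        self._trusted_lookup = trusted_lookup
        self._frozen: Dict[str, str] = {}
        self.lookups = 0  # how many times the trusted source was consulted

    def confirm(self, key: str, answers: Sequence[str]) -> Optional[str]:
        if key in self._frozen:
            return self._frozen[key]          # step 3: already frozen
        value: Optional[str] = None
        if answers:
            top, count = Counter(answers).most_common(1)[0]
            if count * 2 > len(answers):      # step 1: strict majority (or unanimity)
                value = top
        if value is None:
            self.lookups += 1                 # step 2: one external reference
            value = self._trusted_lookup(key)
        if value is not None:
            self._frozen[key] = value         # freeze the confirmed answer
        return value                          # None means still unresolved

# Stub source returning the figure quoted in step 2 above.
verifier = LazyVerifier(lambda key: "3,776.12" if key == "height:mount fuji" else None)
print(verifier.confirm("height:mount fuji", ["3,776.1", "3,776.12", "3,776.24"]))  # 3,776.12
print(verifier.confirm("height:mount fuji", ["3,776.24"]))                         # 3,776.12
print(verifier.lookups)                                                            # 1
```

Answers are kept as strings so that comparing and freezing them never goes through floating point.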

Finishing: Fold "Who, When, How Much, What" into One Key

Bundling these 3 (person, time, amount) yields something interesting. The next 3 sentences have completely different phrasing.

"Manager Tanaka approved 3 days ago for 500,000 yen"
"Tanaka Ichiro approved on 2026/06/24 for 500,000 yen"
"t.tanaka@… approved on June 24 for ¥500,000"

Normalizing these all yields

( approver=E1001, date=2026-06-24, amount=¥500000, action=approval )

and folds into the same audit key (SHA-256). Regardless of phrasing, "who approved what, when, and for how much" is uniquely confirmed. And per our principle, the ambiguous "Tanaka approved," where we don't know which Tanaka, is treated as suspended and falls into a different key, so ambiguous decisions don't slip through into confirmed logs.
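One way to derive such a key is to serialize the normalized fields in a fixed canonical form and hash the bytes. The sketch below assumes JSON with sorted keys as that canonical form; `audit_key` and its field names are hypothetical, and the actual serialization used in the experiment is not described here.

```python
import hashlib
import json
from typing import Optional

def audit_key(approver: Optional[str], date_iso: str, amount_yen: int, action: str) -> str:
    """Hash the normalized fields; approver=None marks a suspended (unresolved) person."""
    record = {
        "action": action,
        "amount_yen": amount_yen,  # integer yen, never a float
        "approver": approver,
        "date": date_iso,
    }
    # Sorted keys, fixed separators, and ASCII-only output give one byte string per record.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

confirmed = audit_key("E1001", "2026-06-24", 500000, "approval")
suspended = audit_key(None, "2026-06-24", 500000, "approval")
print(confirmed == audit_key("E1001", "2026-06-24", 500000, "approval"))  # True
print(confirmed == suspended)                                             # False
```

The three differently phrased sentences all normalize to the same four fields, so they produce the same digest, while the suspended record hashes to a different one.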

In internal control audits such as J-SOX, this very "unique identification of the approver" and "reproducibility" are crucial. International transactions that mix currencies (yen and dollars) fit the same framework: fix the exchange rate for that date and they fold into the same key.

Results

A small experiment, but meaningful numbers:

Normalization of synonymous queries: 11 phrasings → 4 keys (the remaining 7 reuse confirmed answers)
Measured drift: the raw LLM produces 4 different answers for 4 semantic variants even at temperature=0; with normalization + confirmation, the answers agree perfectly
All of it runs with no LLM, no floating point, and zero external library dependencies: integers, rationals, and hashes only
By design, logs retain no document text (hashes and flags only), so effects can be measured while preserving confidentiality

Not "smart AI" but "no-drift machinery." The probabilistic layer (LLM) is inherently best-effort, but the deterministic layer (normalization+confirmation) guarantees stability. This division of labor is the core.

Why We Do This

We're bringing a quiet but important principle into real-world operations: computation results should be reproducible bit by bit (bit-exact). But even if computations match, the result is meaningless if the chosen representative answer is false or the approver is ambiguous. So we add "correctness (verification)" and "identification (confirming who)" to "match (reproducibility)." Only then does "being deterministic = being correct" hold true.

The smarter AI becomes, the more critical the scaffolding that keeps it from fabricating becomes. Say that what you don't know is unknown, eliminate drift through structure, and reproduce what's confirmed every time. This has been a record of that quiet groundwork.

The content of this article is based on internal experiments (PoC) and is not a product specification. SlimeClock / SlimeUnit / SlimeWho / LazyVerify are experimental codenames.