DETERMINISTIC TELEMETRY FOR LLM RELIABILITY

Stop using LLMs
to judge LLMs.

Parallax Telemetry measures the structure of a model's output — how it's built, not whether a grader liked it. The kernel is deterministic: same input, same output, every coordinate backed by countable evidence. No judge model. No vibes. No variance.

8 structural axes · countable evidence per coordinate · no model in the loop

Run the instrument.

Three operations, one for each question an eval actually needs to answer. Paste text and it is measured on the server itself, deterministically, with no call to an outside model.

Score: where does this output sit in structural space, and what evidence supports it?

Compare: did my prompt change really improve the output, or just reformat it? This is the question A/B prompt testing usually can't answer.

Baseline and monitor: fit a neutral envelope from a few representative "good" outputs, then test whether a new output has drifted outside your model's normal structure. This is the CI hook.

The ruler, not the critic.

Most incumbent eval tools run on an LLM-as-judge: a second model grades the first. That's expensive, non-deterministic, and circular. Parallax Telemetry is a different category of instrument.

LLM-AS-JUDGE
  • A model grades a model — circular
  • Non-deterministic: same input, different scores
  • Per-call API cost and latency on every eval
  • Opaque: "8/10" with no auditable basis
  • Drifts when the judge model is updated
PARALLAX TELEMETRY
  • No model in the loop — measurement, not opinion
  • Deterministic: same input → same output, always
  • Runs in milliseconds, on your own hardware
  • Every coordinate traces to countable evidence
  • The kernel version is pinned; results are reproducible

It does not judge whether an answer is correct, safe, or persuasive — and never claims to. It quantifies how the output is structured, so prompts, models, and multi-turn workflows can be compared on signal instead of vibes. Two correct answers can land in very different coordinates. That's the point.

Validated across engines: 4,000+ responses, Cohen's d = 2.47. Read the Linguistic Kernel results or the full Cross-Engine EMS Study (PDF).
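
For readers unfamiliar with the statistic quoted above, here is a minimal sketch of Cohen's d in its common pooled-standard-deviation form. Which groups the study compared, and which variant of d it used, are not stated on this page, so treat the function as a general definition rather than a reproduction of the study's calculation.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)
```

By the usual rule of thumb, values of d around 0.8 already count as a large effect, which gives a sense of scale for the figure quoted above.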

Three endpoints.

A plain JSON API. No SDK required: every endpoint is curl-able. These are the same calls the playground above makes.

POST /v1/score

One output → integrity verdict, 8 structural coordinates, basin label, reasoning signals, and the evidence counts behind them.

curl -s localhost:7434/v1/score \
  -H 'content-type: application/json' \
  -d '{"text":"your model output"}'
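
The same call can be made from Python with only the standard library. This is a sketch, assuming the server from the curl example is listening on localhost:7434; `build_score_request` and `score` are illustrative helper names, not part of any shipped SDK, and the response is returned as a plain dict because its field names are not listed here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7434"  # assumption: host and port from the curl example


def build_score_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the same POST that the curl example sends to /v1/score."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/score",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(text: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON reply without assuming its fields."""
    request = build_score_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `score("your model output")` returns the decoded JSON body of the reply.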

POST /v1/compare

Two outputs → real_change or reformat_only. The novelty gate catches reformatting that masquerades as improvement.

curl -s localhost:7434/v1/compare \
  -H 'content-type: application/json' \
  -d '{"a":"baseline","b":"revision"}'
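
A sketch of how an A/B prompt test might act on the result. The `"a"` and `"b"` keys and the two verdict strings come from the text above; the response field name `verdict` is hypothetical, since the response schema is not shown here, so it is passed in as a parameter.

```python
import json

VERDICTS = {"real_change", "reformat_only"}  # the two outcomes named above


def compare_payload(baseline: str, revision: str) -> bytes:
    """JSON body for POST /v1/compare, using the "a"/"b" keys from the curl example."""
    return json.dumps({"a": baseline, "b": revision}).encode("utf-8")


def accept_revision(response: dict, verdict_key: str = "verdict") -> bool:
    """Return True only when the revision is a real change.

    verdict_key is a hypothetical field name; adjust it to the actual response.
    """
    verdict = response.get(verdict_key)
    if verdict not in VERDICTS:
        raise ValueError(f"unexpected verdict: {verdict!r}")
    return verdict == "real_change"
```

Raising on an unknown verdict, rather than treating it as a pass, keeps a schema mismatch from silently approving a prompt change.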

POST /v1/baseline + /v1/monitor

Fit a neutral envelope from your good outputs, then band any new output: stable / elevated / breach.

Integrity and drift are independent axes. Every result carries both: drift says how far the output moved from your normal; integrity says whether it structurally broke (collapsed / unreliable / low_confidence / sound). A collapsed output fails the CI gate even when it lands inside the envelope — a truncated answer that drifts nowhere is still a broken answer.

curl -s localhost:7434/v1/baseline \
  -H 'content-type: application/json' \
  -d '{"texts":["...","...","..."]}'
# → envelope, then:
curl -s localhost:7434/v1/monitor \
  -H 'content-type: application/json' \
  -d '{"text":"...","envelope":{...}}'
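
The rule that drift and integrity are checked independently can be sketched as a small gate function. Two things here are assumptions: that the bands are ordered stable < elevated < breach, and that `collapsed` is the only integrity verdict that fails the gate by itself (the text states that `collapsed` fails, and does not say how the others are treated). `gate_fails` and its `fail_on` parameter are illustrative names.

```python
BANDS = ("stable", "elevated", "breach")  # assumed order: least to most drift


def gate_fails(band: str, integrity: str, fail_on: str = "breach") -> bool:
    """Fail when drift reaches the threshold band or the output has collapsed."""
    if band not in BANDS:
        raise ValueError(f"unknown band: {band!r}")
    if fail_on not in BANDS:
        raise ValueError(f"unknown threshold: {fail_on!r}")
    drift_failed = BANDS.index(band) >= BANDS.index(fail_on)
    # Integrity is a separate axis: a collapsed output fails even inside the envelope.
    return drift_failed or integrity == "collapsed"
```

For example, `gate_fails("stable", "collapsed")` is true even though the drift band is the mildest one, which is the truncated-answer case described above.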

Fail the build on structural drift.

The point of a deterministic score is that you can assert on it. Drop the CLI into CI and gate merges when a prompt or model change pushes outputs out of envelope — no flaky judge model deciding your pipeline.

$ python3 cli.py baseline --from golden_outputs.txt --save envelope.json
envelope saved to envelope.json  (kernel lang-kernel-rebuild-v19, n=8, r95=0.27221)

$ python3 cli.py monitor --envelope envelope.json --text "$(cat candidate.txt)" --fail-on breach
band: breach   drift_ratio: 3.688   integrity: sound   top driver: void_ratio (rᵥ) -0.62
FAIL: band 'breach' >= threshold 'breach'.
$ echo $?
1   # non-zero exit → CI fails the merge
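
If a pipeline needs the individual values, for instance to annotate a pull request, the summary line can be parsed. This sketch infers the format from the single sample line above, so the pattern is an assumption and not a documented output contract; `parse_monitor_line` is an illustrative name.

```python
import re

# Pattern inferred from the one sample line shown above (an assumption).
_MONITOR_LINE = re.compile(
    r"band:\s*(?P<band>\w+)\s+"
    r"drift_ratio:\s*(?P<drift_ratio>[-+]?\d+(?:\.\d+)?)\s+"
    r"integrity:\s*(?P<integrity>\w+)"
)


def parse_monitor_line(line: str) -> dict:
    """Extract band, drift_ratio and integrity from a monitor summary line."""
    match = _MONITOR_LINE.search(line)
    if match is None:
        raise ValueError(f"not a monitor summary line: {line!r}")
    return {
        "band": match["band"],
        "drift_ratio": float(match["drift_ratio"]),
        "integrity": match["integrity"],
    }
```

The exit code remains the authoritative pass/fail signal; parsing is only useful for reporting.
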

# .github/workflows/eval.yml (envelope.json is committed, like a lockfile)
- name: Structural drift gate
  run: |
    python3 products/parallax-telemetry/cli.py monitor \
      --envelope tests/envelope.json \
      --text "$(python3 run_model.py < tests/prompt.txt)" \
      --fail-on breach
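
The same gate can also be driven from a Python test suite instead of a workflow step. This is a sketch under the assumption that the CLI path and flags are exactly those in the step above; `monitor_command` and `gate_passes` are illustrative helpers. It uses `sys.executable` where the step uses `python3`.

```python
from __future__ import annotations

import subprocess
import sys

CLI_PATH = "products/parallax-telemetry/cli.py"  # path copied from the step above


def monitor_command(envelope_path: str, text: str, fail_on: str = "breach") -> list[str]:
    """Build the argv for the monitor call, with flags copied from the step above."""
    return [
        sys.executable, CLI_PATH, "monitor",
        "--envelope", envelope_path,
        "--text", text,
        "--fail-on", fail_on,
    ]


def gate_passes(argv: list[str]) -> bool:
    """Run the command; exit code 0 passes, any non-zero exit fails the gate."""
    return subprocess.run(argv, check=False).returncode == 0
```

Passing the arguments as a list, without a shell, means the model output in `--text` needs no quoting or escaping.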

These numbers are real, not illustrative: the candidate is the same rambling cache answer you can run in the playground and the API console above, measured under lang-kernel-rebuild-v19 and pinned by the product's own test suite. A collapsed integrity verdict fails the gate even when drift is inside the envelope.