Parallax Telemetry — deterministic structural telemetry for LLM outputs

Run the instrument.

Three operations, one for each question an eval actually needs to answer. Paste text: it is measured deterministically on the server, with no model in the loop.

Score: where does this output sit in structural space, and what evidence supports it?

Compare: did my prompt change really improve the output, or just reformat it? This is the question A/B prompt testing usually can't answer.

A: baseline

B: revision

Monitor: fit a neutral envelope from a few representative "good" outputs, then test whether a new output has drifted outside your model's normal structure. This is the CI hook.

Baseline set — your model's normal voice, one output per line (≥3)

The cache is checked before computing. If the key is missing, the value is computed once and stored for later reads.
The payload is validated before parsing. If a field is malformed, the request is rejected and the error is returned to the caller.
The retry waits with exponential backoff. If three attempts fail, the request is dropped and the failure is logged for review.
The token is verified against the signing key. If it has expired, the session is refused and the user is asked to log in again.
The queue is drained in arrival order. If a job fails, it is requeued once and then moved to the dead-letter store for inspection.
The session is read from the store. If the entry is missing, a new session is created and the cookie is reissued to the client.
The config is loaded at startup. If a required key is absent, the process exits early and the missing key is named in the log.
The metric is sampled every ten seconds. If the rolling average crosses the limit, an alert fires and the on-call engineer is paged.

New output to test against the envelope — the canonical rambling cache answer (breaches at 3.688× under lang-kernel-rebuild-v19; integrity stays sound)

The ruler, not the critic.

Most incumbent eval tools rely on LLM-as-judge: a second model grades the first. That's expensive, non-deterministic, and circular. Parallax Telemetry is a different category of instrument.

LLM-AS-JUDGE

A model grades a model — circular
Non-deterministic: same input, different scores
Per-call API cost and latency on every eval
Opaque: "8/10" with no auditable basis
Drifts when the judge model is updated

PARALLAX TELEMETRY

No model in the loop — measurement, not opinion
Deterministic: same input → same output, always
Runs in milliseconds, on your own hardware
Every coordinate traces to countable evidence
The kernel version is pinned; results are reproducible

It does not judge whether an answer is correct, safe, or persuasive — and never claims to. It quantifies how the output is structured, so prompts, models, and multi-turn workflows can be compared on signal instead of vibes. Two correct answers can land in very different coordinates. That's the point.

Validated across engines: 4,000+ responses, Cohen's d = 2.47. Read the Linguistic Kernel results or the full Cross-Engine EMS Study (PDF).

Three endpoints.

A plain JSON API. No SDK required — it's curl-able. The same calls the playground above makes.

POST /v1/score

One output → integrity verdict, 8 structural coordinates, basin label, reasoning signals, and the evidence counts behind them.

curl -s localhost:7434/v1/score \
  -H 'content-type: application/json' \
  -d '{"text":"your model output"}'
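
The same call can be made from Python with only the standard library. This is a minimal sketch, assuming the server from the curl example is listening on localhost:7434. The helper names `build_score_request` and `score` are hypothetical, and the response is returned as parsed JSON without assuming any field names, since the response schema is not spelled out here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7434"  # assumption: same host and port as the curl example


def build_score_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /v1/score request without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/score",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def score(text: str) -> dict:
    """Send the request and return the parsed JSON body unchanged."""
    with urllib.request.urlopen(build_score_request(text), timeout=10) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the payload checkable in a unit test without a running server.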

POST /v1/compare

Two outputs → real_change or reformat_only. The novelty gate catches reformatting that masquerades as improvement.

curl -s localhost:7434/v1/compare \
  -H 'content-type: application/json' \
  -d '{"a":"baseline","b":"revision"}'
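
In an A/B prompt test, the compare verdict can drive a promote-or-hold decision. The sketch below assumes you have already pulled the verdict string out of the JSON response; which key holds it is not shown here, so the function takes the string directly. `revision_is_real_change` is a hypothetical name.

```python
def revision_is_real_change(verdict: str) -> bool:
    """Map a /v1/compare verdict to a promote (True) or hold (False) decision."""
    if verdict == "real_change":
        return True
    if verdict == "reformat_only":
        return False
    # Fail loudly rather than silently promoting on a value we do not recognise.
    raise ValueError(f"unexpected compare verdict: {verdict!r}")
```

Raising on anything other than the two documented verdicts means a future kernel change surfaces as an error rather than as a wrongly promoted prompt.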

POST /v1/baseline + /v1/monitor

Fit a neutral envelope from your good outputs, then band any new output: stable / elevated / breach.

Integrity and drift are independent axes. Every result carries both: drift says how far the output moved from your normal; integrity says whether it structurally broke (collapsed / unreliable / low_confidence / sound). A collapsed output fails the CI gate even when it lands inside the envelope — a truncated answer that drifts nowhere is still a broken answer.

curl -s localhost:7434/v1/baseline \
  -H 'content-type: application/json' \
  -d '{"texts":["...","...","..."]}'
# → envelope, then:
curl -s localhost:7434/v1/monitor \
  -H 'content-type: application/json' \
  -d '{"text":"...","envelope":{...}}'
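
This page does not say how the envelope is computed. Purely as an illustration of what "fit an envelope, then measure drift against it" can mean, the sketch below fits a centroid and a 95th-percentile radius over baseline coordinate vectors and reports drift as a multiple of that radius. It borrows the names `r95` and `drift_ratio` from the CLI output, but the geometry is an assumption and should not be read as the kernel's actual method.

```python
import math


def fit_envelope(baseline: list[list[float]]) -> dict:
    """Fit a centroid and 95th-percentile radius from baseline vectors.

    ASSUMPTION: illustrative geometry only, not the product's algorithm.
    """
    if len(baseline) < 3:
        raise ValueError("need at least 3 baseline vectors")
    dims = len(baseline[0])
    centroid = [sum(v[i] for v in baseline) / len(baseline) for i in range(dims)]
    dists = sorted(math.dist(v, centroid) for v in baseline)
    # Nearest-rank 95th percentile of baseline distances from the centroid.
    rank = max(1, math.ceil(0.95 * len(dists)))
    r95 = dists[rank - 1]
    if r95 == 0:
        raise ValueError("baseline vectors are identical; radius is zero")
    return {"centroid": centroid, "r95": r95}


def drift_ratio(vector: list[float], envelope: dict) -> float:
    """Distance from the centroid in units of the envelope radius."""
    return math.dist(vector, envelope["centroid"]) / envelope["r95"]
```

Under this reading, a ratio near 1 sits at the edge of the baseline's normal spread, and larger values sit proportionally further out.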

Fail the build on structural drift.

The point of a deterministic score is that you can assert on it. Drop the CLI into CI and gate merges when a prompt or model change pushes outputs out of envelope — no flaky judge model deciding your pipeline.

$ python3 cli.py baseline --from golden_outputs.txt --save envelope.json
envelope saved to envelope.json  (kernel lang-kernel-rebuild-v19, n=8, r95=0.27221)

$ python3 cli.py monitor --envelope envelope.json --text "$(cat candidate.txt)" --fail-on breach
band: breach   drift_ratio: 3.688   integrity: sound   top driver: void_ratio (rᵥ) -0.62
FAIL: band 'breach' >= threshold 'breach'.
$ echo $?
1   # non-zero exit → CI fails the merge

# .github/workflows/eval.yml — envelope.json is committed, like a lockfile
- name: Structural drift gate
  run: |
    python3 products/parallax-telemetry/cli.py monitor \
      --envelope tests/envelope.json \
      --text "$(python3 run_model.py < tests/prompt.txt)" \
      --fail-on breach
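
Outside GitHub Actions, the same gate can be driven from a Python test runner by shelling out to the CLI and checking its exit code. This is a minimal sketch that uses only the `monitor` flags shown above. The helper names and the default `cli_path` are hypothetical, and treating exit code 0 as "gate passed" is an assumption inferred from the non-zero failure case.

```python
import subprocess
import sys


def build_monitor_cmd(envelope_path: str, text: str, fail_on: str = "breach",
                      cli_path: str = "cli.py") -> list[str]:
    """Assemble the argv for `cli.py monitor` without running it."""
    return [sys.executable, cli_path, "monitor",
            "--envelope", envelope_path,
            "--text", text,
            "--fail-on", fail_on]


def passes_gate(envelope_path: str, text: str, **kwargs) -> bool:
    """Run the CLI; exit code 0 is assumed to mean the gate passed."""
    result = subprocess.run(build_monitor_cmd(envelope_path, text, **kwargs),
                            capture_output=True, text=True)
    return result.returncode == 0
```

Passing the candidate text as a single argv element avoids shell quoting problems with multi-line model output.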

These numbers are real, not illustrative: the candidate is the same rambling cache answer you can run in the playground and the API console above, measured under lang-kernel-rebuild-v19 and pinned by the product's own test suite. A collapsed integrity verdict fails the gate even when drift is inside the envelope.
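
The gate rule described above, where band and integrity are checked independently, can be sketched as a small pure function. The band order follows the `>=` comparison in the FAIL message. Two things are assumptions: that only a `collapsed` verdict fails on its own (the treatment of `unreliable` and `low_confidence` is not stated), and that a threshold other than `breach` is accepted.

```python
BANDS = ("stable", "elevated", "breach")  # ordered from least to most drift


def gate_fails(band: str, integrity: str, fail_on: str = "breach") -> bool:
    """Return True when the CI gate should fail.

    ASSUMPTION: only 'collapsed' integrity fails regardless of drift band.
    """
    if integrity == "collapsed":
        return True
    # Fail when the observed band is at or above the configured threshold.
    return BANDS.index(band) >= BANDS.index(fail_on)
```

With this rule, `gate_fails("stable", "collapsed")` is true even though drift is inside the envelope, which matches the truncated-answer case described earlier.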

Stop using LLMs to judge LLMs.
