Parallax Telemetry measures the structure of a model's output — how it's built, not whether a grader liked it. The kernel is deterministic: same input, same output, every coordinate backed by countable evidence. No judge model. No vibes. No variance.
Three operations cover the three questions an eval actually needs to answer. Paste text and it is measured on the server, deterministically.
Score: Where does this output sit in structural space, and what evidence supports it?
Compare: Did my prompt change really improve the output, or just reformat it? This is the question A/B prompt testing usually can't answer.
Monitor: Fit a neutral envelope from a few representative "good" outputs, then test whether a new output has drifted outside your model's normal structure. This is the CI hook.
Most incumbent eval tools rely on an LLM-as-judge: a second model grades the first. That's expensive, non-deterministic, and circular. Parallax Telemetry is a different category of instrument.
It does not judge whether an answer is correct, safe, or persuasive — and never claims to. It quantifies how the output is structured, so prompts, models, and multi-turn workflows can be compared on signal instead of vibes. Two correct answers can land in very different coordinates. That's the point.
Validated across engines: 4,000+ responses, Cohen's d = 2.47. Read the Linguistic Kernel results or the full Cross-Engine EMS Study (PDF).
A plain JSON API. No SDK required — it's curl-able. The same calls the playground above makes.
One output → integrity verdict, 8 structural coordinates, basin label, reasoning signals, and the evidence counts behind them.
curl -s localhost:7434/v1/score \
-H 'content-type: application/json' \
-d '{"text":"your model output"}'
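For callers who prefer Python to curl, the same request can be sketched with only the standard library. The port, path, and `text` field come from the curl call above; the helper names `build_request` and `score` are illustrative and not part of the product. Because the response schema is not spelled out on this page, the sketch returns the decoded JSON as-is rather than assuming field names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7434"  # same host and port as the curl example


def build_request(path: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request (hypothetical helper, not part of the product)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def score(text: str) -> dict:
    """POST one output to /v1/score and return the decoded JSON response."""
    with urllib.request.urlopen(build_request("/v1/score", {"text": text})) as resp:
        return json.load(resp)


# Usage, with the server running locally:
#   result = score("your model output")
#   print(json.dumps(result, indent=2))
```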
Two outputs → real_change or reformat_only. The novelty gate catches reformatting that masquerades as improvement.
curl -s localhost:7434/v1/compare \
-H 'content-type: application/json' \
-d '{"a":"baseline","b":"revision"}'
Fit a neutral envelope from your good outputs, then band any new output: stable / elevated / breach.
Integrity and drift are independent axes. Every result carries both: drift says how far the output moved from your normal; integrity says whether it structurally broke (collapsed / unreliable / low_confidence / sound). A collapsed output fails the CI gate even when it lands inside the envelope — a truncated answer that drifts nowhere is still a broken answer.
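The two-axis gate described above can be sketched as a small decision function. The band names and integrity labels come from this page; the function name `gate_fails`, the ordering stable < elevated < breach, and the choice to hard-fail only on `collapsed` are assumptions made for illustration, not the CLI's actual implementation.

```python
BANDS = ("stable", "elevated", "breach")  # assumed order, least to most drift


def gate_fails(band: str, integrity: str, fail_on: str = "breach") -> bool:
    """Return True when a result should fail the CI gate.

    Drift and integrity are checked independently: a collapsed output
    fails even if its drift band is below the threshold.
    """
    if integrity == "collapsed":
        return True
    return BANDS.index(band) >= BANDS.index(fail_on)


# A truncated answer that sits inside the envelope still fails.
print(gate_fails("stable", "collapsed"))  # True
# A sound answer fails only once its band reaches the threshold.
print(gate_fails("elevated", "sound"))    # False
print(gate_fails("breach", "sound"))      # True
```

Keeping the integrity check ahead of the band comparison means a broken output can never be waved through by a favourable drift reading.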
curl -s localhost:7434/v1/baseline \
-H 'content-type: application/json' \
-d '{"texts":["...","...","..."]}'
# → envelope, then:
curl -s localhost:7434/v1/monitor \
-H 'content-type: application/json' \
-d '{"text":"...","envelope":{...}}'
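The two calls chain naturally: the envelope produced by /v1/baseline is passed back in the `envelope` field of /v1/monitor. The following is a minimal sketch, assuming the baseline response body is itself the envelope object (the page only says the call yields an envelope). `post_json` and `fit_and_monitor` are hypothetical helper names, and the transport is injectable so the flow can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:7434"  # same host and port as the curl examples


def post_json(path: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def fit_and_monitor(
    good_texts: list[str],
    candidate: str,
    post: Callable[[str, dict], dict] = post_json,
) -> dict:
    """Fit an envelope from known-good outputs, then band one candidate."""
    # Assumption: the /v1/baseline response is the envelope itself.
    envelope = post("/v1/baseline", {"texts": good_texts})
    return post("/v1/monitor", {"text": candidate, "envelope": envelope})
```

In a real pipeline the envelope would more likely be fitted once and stored, so that every later run is measured against the same reference.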
The point of a deterministic score is that you can assert on it. Drop the CLI into CI and gate merges when a prompt or model change pushes outputs out of envelope — no flaky judge model deciding your pipeline.
$ python3 cli.py baseline --from golden_outputs.txt --save envelope.json
envelope saved to envelope.json (kernel lang-kernel-rebuild-v19, n=8, r95=0.27221)
$ python3 cli.py monitor --envelope envelope.json --text "$(cat candidate.txt)" --fail-on breach
band: breach
drift_ratio: 3.688
integrity: sound
top driver: void_ratio (rᵥ) -0.62
FAIL: band 'breach' >= threshold 'breach'.
$ echo $?
1 # non-zero exit → CI fails the merge
# .github/workflows/eval.yml — envelope.json is committed, like a lockfile
- name: Structural drift gate
  run: |
    python3 products/parallax-telemetry/cli.py monitor \
      --envelope tests/envelope.json \
      --text "$(python3 run_model.py < tests/prompt.txt)" \
      --fail-on breach
These numbers are real, not illustrative: the candidate is the same rambling cache answer you can run in the playground and the API console above, measured under lang-kernel-rebuild-v19 and pinned by the product's own test suite. A collapsed integrity verdict fails the gate even when drift is inside the envelope.
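The same gate can also be driven from a Python test suite instead of a workflow step. This sketch uses only the `monitor` flags shown in the transcript above (`--envelope`, `--text`, `--fail-on`) and relies on the non-zero exit code for failure. The helper names `monitor_command` and `passes_gate` are hypothetical, and the path to `cli.py` is assumed to be relative to the working directory.

```python
import subprocess
import sys


def monitor_command(envelope_path: str, text: str, fail_on: str = "breach") -> list[str]:
    """Build the argv for `cli.py monitor` using the flags shown above."""
    return [
        sys.executable, "cli.py", "monitor",
        "--envelope", envelope_path,
        "--text", text,
        "--fail-on", fail_on,
    ]


def passes_gate(envelope_path: str, text: str) -> bool:
    """Run the CLI and report whether it exited with status 0."""
    result = subprocess.run(monitor_command(envelope_path, text))
    return result.returncode == 0


# Usage inside a test, with cli.py and envelope.json present:
#   assert passes_gate("envelope.json", candidate_text)
```

Passing the arguments as a list rather than a shell string sidesteps quoting problems when the candidate text contains quotes or newlines.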