Benchmark Methodology

How we measure the Crucible™ evidence pipeline — test sets, evaluation criteria, cost calculations, and limitations.

Benchmark results are snapshots, not universal guarantees. Model performance, pricing, and capabilities change over time. We publish methodology so you can evaluate the claims independently.

1. Test Set

Financial Filings

Commercial Contracts

Multilingual Stress Test

2. Evaluation Criteria

Metric | Method | Description
Completeness Recall | Deterministic | Does the output contain all required evidence strings from the source document?
Faithfulness | LLM Judge | GPT-5.4 mini (temperature=0) verifies each finding against the source text. Are claims supported by cited evidence?
Contradiction Detection | Deterministic + structural | Keyword scan + structure analysis for contradictions, corrections, and restated figures
Translation Accuracy | Structural | JSON structure validation: key preservation, type integrity, translation completeness
Evidence Contract | Schema check | Every finding must contain: claim, evidence, sourceRef, confidence, verificationStatus
Forbidden Hallucination | Deterministic | Output must not contain numbers, entities, or dates absent from the source document
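
The deterministic checks in the table above can be illustrated with a minimal Python sketch. This is not the Crucible™ implementation: the function names, the number-matching regex, and the sample strings are hypothetical, and only the five Evidence Contract field names come from the table. The hallucination check here covers numbers only; entities and dates would need their own extraction step.

```python
import re

# Field names are taken from the Evidence Contract row; everything else
# in this sketch (function names, regex, sample data) is an assumption.
REQUIRED_FIELDS = ("claim", "evidence", "sourceRef", "confidence", "verificationStatus")

# Matches integers and numbers with "." or "," separators, e.g. 4,200 or 12.5.
NUMBER_PATTERN = re.compile(r"\d+(?:[.,]\d+)*")


def completeness_recall(output: str, required_strings: list[str]) -> float:
    """Fraction of required evidence strings that appear verbatim in the output."""
    if not required_strings:
        return 1.0
    found = sum(1 for s in required_strings if s in output)
    return found / len(required_strings)


def missing_contract_fields(finding: dict) -> list[str]:
    """Evidence Contract fields that are absent from a single finding."""
    return [field for field in REQUIRED_FIELDS if field not in finding]


def unsupported_numbers(output: str, source: str) -> set[str]:
    """Numbers present in the output that never occur in the source text."""
    return set(NUMBER_PATTERN.findall(output)) - set(NUMBER_PATTERN.findall(source))


if __name__ == "__main__":
    source = "Revenue rose to 4,200 in 2025, up from 3,900."
    output = "Revenue was 4,200 in 2025; margin was 12.5%."
    print(completeness_recall(output, ["4,200", "2025", "3,900"]))
    print(missing_contract_fields({"claim": "x", "evidence": "y", "sourceRef": "p.3"}))
    print(unsupported_numbers(output, source))
```

Exact substring matching keeps these checks reproducible across runs, at the cost of missing paraphrased or reformatted evidence (for example "4200" versus "4,200").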

3. Latest Results (May 2026)

Overall Score: 98.8%
Faithfulness: 94.0%
CJK Recall: 20/20
Hallucinations: 0

Metric | Amalgix | Claude 4.6 Direct | GPT-5.4 Direct
Analyze Document | 100.0% | 76.4% | 49.0%
Summarize | 93.8% | 87.5% | 77.5%
Financial Filing | 100.0% | 100.0% | 30.0%
Contract Review | 100.0% | 100.0% | 70.0%
Contradiction Detection | 100.0% | 100.0% | 77.8%
Faithfulness (exclusive) | 94.0% | N/A | N/A

Ablation: Crucible™ pipeline adds +6.0 percentage points over the same Claude model called directly — measuring the value of cross-model verification above raw model capability.

4. Cost Comparison Methodology

The "3.8× cheaper" claim is calculated by comparing the total cost of a 1MB document analysis:

Provider | 1MB Cost | Method
Amalgix | $0.82 | x402 quote via estimate_cost
GPT-5.5 Pro | $3.12 | OpenAI API pricing (input + output tokens)
Claude Opus 4.8 | $2.88 | Anthropic API pricing (input + output tokens)

Amalgix achieves lower cost by routing different analysis stages to cost-efficient specialist models (e.g., GPT-5.4 mini for extraction, Claude Haiku for verification) rather than calling a single frontier model for the entire task.

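Ratio: $3.12 / $0.82 ≈ 3.8× against GPT-5.5 Pro, the more expensive of the two comparison models. Against Claude Opus 4.8 the ratio is $2.88 / $0.82 ≈ 3.5×. Savings range from 72% to 74% depending on the comparison model. Actual costs vary with document complexity, language, and analysis depth.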
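
The arithmetic behind the ratio and the savings range can be reproduced with a short Python sketch. The per-1MB costs are copied from the table above; the function names are hypothetical and chosen only for illustration.

```python
# Per-1MB analysis cost in USD, copied from the cost table.
COSTS_USD = {
    "Amalgix": 0.82,
    "GPT-5.5 Pro": 3.12,
    "Claude Opus 4.8": 2.88,
}


def cost_ratio(baseline: float, candidate: float) -> float:
    """How many times more expensive the baseline is than the candidate."""
    return baseline / candidate


def savings_pct(baseline: float, candidate: float) -> float:
    """Percentage saved by using the candidate instead of the baseline."""
    return (1 - candidate / baseline) * 100


if __name__ == "__main__":
    amalgix = COSTS_USD["Amalgix"]
    for provider in ("GPT-5.5 Pro", "Claude Opus 4.8"):
        baseline = COSTS_USD[provider]
        print(
            f"{provider}: {cost_ratio(baseline, amalgix):.1f}x, "
            f"{savings_pct(baseline, amalgix):.1f}% savings"
        )
    # GPT-5.5 Pro: 3.8x, 73.7% savings
    # Claude Opus 4.8: 3.5x, 71.5% savings
```

Rounded to whole percentages, the two savings figures give the 72% and 74% endpoints quoted above.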

5. Known Limitations

6. Reproducibility

We do not publish the full test set (it includes licensed commercial contracts). However:

Not investment advice. Not legal advice. Amalgix outputs are analytical tools for AI agents and developers. Users must verify critical findings against original source documents. Benchmark results do not constitute a warranty of fitness for any specific use case.

Last updated: May 2026 · Report version: v1.1.0