Benchmark Methodology

How we measure the Crucible™ evidence pipeline — test sets, evaluation criteria, cost calculations, and limitations.

Benchmark results are snapshots, not universal guarantees. Model performance, pricing, and capabilities change over time. We publish methodology so you can evaluate the claims independently.

1. Test Set

Financial Filings

47 public SEC filings — 10-K annual reports from Fortune 500 and mid-cap companies
Languages: English, Simplified Chinese, mixed-language (CJK + English in same document)
Size range: 50KB – 5MB
Includes earnings transcripts, investor decks, and 20-F cross-border filings

Commercial Contracts

12 commercial contracts: MSAs, NDAs, SaaS agreements, procurement agreements, and vendor agreements
Focus: obligation extraction, risk clauses, deadline identification, liability exposure
Includes multi-party and cross-jurisdictional contracts

Multilingual Stress Test

Documents containing Japanese (兆/億/円), Korean (조/억/원), and Chinese (万/亿/元) financial units
Deterministic numeric conversion verification
20/20 recall on target financial metrics across CJK + English
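As an illustration of what deterministic numeric conversion can look like for these units, here is a minimal sketch. It is not the Crucible™ implementation: the function name `parse_cjk_amount`, the multiplier table, and the inclusion of the Korean unit 만 are assumptions made for the example. It reads 兆 and 조 as 10^12, 億/亿/억 as 10^8, and 万/만 as 10^4.

```python
import re
from decimal import Decimal

# Assumed multipliers for the large-number units (hypothetical table).
UNIT_MULTIPLIERS = {
    "兆": 10**12, "조": 10**12,
    "億": 10**8, "亿": 10**8, "억": 10**8,
    "万": 10**4, "만": 10**4,
}
# Trailing currency characters mapped to ISO 4217 codes.
CURRENCY_CODES = {"円": "JPY", "원": "KRW", "元": "CNY"}

_UNITS = "".join(UNIT_MULTIPLIERS)
# One segment = a number (optional commas/decimals) plus an optional unit.
_SEGMENT = re.compile(rf"(\d[\d,]*(?:\.\d+)?)([{_UNITS}]?)")


def parse_cjk_amount(text):
    """Convert a string such as '1兆2345億円' to (Decimal amount, currency code)."""
    compact = re.sub(r"\s+", "", text)
    currency = None
    if compact and compact[-1] in CURRENCY_CODES:
        currency = CURRENCY_CODES[compact[-1]]
        compact = compact[:-1]

    matches = list(_SEGMENT.finditer(compact))
    # Reject input that is not fully covered by number+unit segments.
    if not matches or "".join(m.group(0) for m in matches) != compact:
        raise ValueError(f"unrecognized amount: {text!r}")

    total = Decimal(0)
    for m in matches:
        number, unit = m.groups()
        total += Decimal(number.replace(",", "")) * UNIT_MULTIPLIERS.get(unit, 1)
    return total, currency


print(parse_cjk_amount("1兆2345億円"))   # amount 1234500000000, JPY
print(parse_cjk_amount("3조 5000억원"))  # amount 3500000000000, KRW
print(parse_cjk_amount("1.5亿元"))       # amount 150000000.0, CNY
```

Using `Decimal` rather than floats keeps the conversion exact, so the same input always yields the same value and can be compared against an expected figure with plain equality.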

2. Evaluation Criteria

Metric	Method	Description
Completeness Recall	Deterministic	Does the output contain all required evidence strings from the source document?
Faithfulness	LLM Judge	GPT-5.4 mini (temperature=0) verifies each finding against the source text. Are claims supported by cited evidence?
Contradiction Detection	Deterministic + structural	Keyword scan + structure analysis for contradictions, corrections, and restated figures
Translation Accuracy	Structural	JSON structure validation — key preservation, type integrity, translation completeness
Evidence Contract	Schema check	Every finding must contain: `claim`, `evidence`, `sourceRef`, `confidence`, `verificationStatus`
Forbidden Hallucination	Deterministic	Output must not contain numbers, entities, or dates absent from the source document
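The three non-LLM checks in the table (Evidence Contract, Completeness Recall, Forbidden Hallucination) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the production evaluator: the function names are hypothetical, recall is assumed to be verbatim substring matching, and the hallucination check covers numbers only (entities and dates are left out).

```python
import re

# Field names taken from the Evidence Contract row above.
REQUIRED_FIELDS = ("claim", "evidence", "sourceRef", "confidence", "verificationStatus")


def missing_contract_fields(finding):
    """Return the required keys that are absent from one finding dict."""
    return [field for field in REQUIRED_FIELDS if field not in finding]


def completeness_recall(output_text, required_strings):
    """Fraction of required evidence strings found verbatim in the output."""
    if not required_strings:
        return 1.0
    hits = sum(1 for s in required_strings if s in output_text)
    return hits / len(required_strings)


_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def _numbers(text):
    # Normalize by dropping thousands separators so "1,200" equals "1200".
    return {token.replace(",", "") for token in _NUMBER.findall(text)}


def unsupported_numbers(output_text, source_text):
    """Numbers that appear in the output but nowhere in the source."""
    return _numbers(output_text) - _numbers(source_text)


source = "Revenue was $1,200 million in 2025, up 8%."
output = "Revenue reached 1200 million in 2025, a 9% rise."
finding = {"claim": "Revenue rose", "evidence": "up 8%", "sourceRef": "p.3"}

print(missing_contract_fields(finding))                   # ['confidence', 'verificationStatus']
print(completeness_recall(output, ["2025", "8%"]))        # 0.5
print(unsupported_numbers(output, source))                # {'9'}
```

All three checks are pure string and set operations, which is what makes them repeatable across runs, unlike the LLM-judged Faithfulness metric.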

3. Latest Results (May 2026)

Overall Score: 98.8%
Faithfulness: 94.0%
CJK Recall: 20/20

Hallucinations

Metric	Amalgix	Claude 4.6 Direct	GPT-5.4 Direct
Analyze Document	100.0%	76.4%	49.0%
Summarize	93.8%	87.5%	77.5%
Financial Filing	100.0%	100.0%	30.0%
Contract Review	100.0%	100.0%	70.0%
Contradiction Detection	100.0%	100.0%	77.8%
Faithfulness (exclusive)	94.0%	N/A	N/A

Ablation: the Crucible™ pipeline scores 6.0 percentage points higher than the same Claude model called directly. The gap measures what cross-model verification adds on top of raw model capability.
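The page does not state how the overall score is aggregated. As a sanity check, an unweighted mean of the five task rows in the table (excluding the Faithfulness row, which has no baseline values) happens to reproduce both the 98.8% overall score and the 6.0-point gap. The sketch below assumes that aggregation; treat it as one consistent reading, not the documented formula.

```python
from statistics import mean

# Task rows from the results table, in order: Analyze Document, Summarize,
# Financial Filing, Contract Review, Contradiction Detection.
TASK_SCORES = {
    "Amalgix": [100.0, 93.8, 100.0, 100.0, 100.0],
    "Claude 4.6 Direct": [76.4, 87.5, 100.0, 100.0, 100.0],
    "GPT-5.4 Direct": [49.0, 77.5, 30.0, 70.0, 77.8],
}

# Assumption: overall score = unweighted mean of the five task scores.
overall = {name: round(mean(scores), 1) for name, scores in TASK_SCORES.items()}

# Percentage-point gap between the pipeline and the direct Claude baseline.
ablation_gain = round(
    mean(TASK_SCORES["Amalgix"]) - mean(TASK_SCORES["Claude 4.6 Direct"]), 1
)

print(overall["Amalgix"])            # 98.8
print(overall["Claude 4.6 Direct"])  # 92.8
print(ablation_gain)                 # 6.0
```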

4. Cost Comparison Methodology

The "3.8× cheaper" claim is calculated by comparing the total cost of a 1MB document analysis:

Provider	1MB Cost	Method
Amalgix	$0.82	x402 quote via `estimate_cost`
GPT-5.5 Pro	$3.12	OpenAI API pricing (input + output tokens)
Claude Opus 4.8	$2.88	Anthropic API pricing (input + output tokens)

Amalgix achieves lower cost by routing different analysis stages to cost-efficient specialist models (e.g., GPT-5.4 mini for extraction, Claude Haiku for verification) rather than calling a single frontier model for the entire task.

Ratio: $3.12 / $0.82 ≈ 3.8× against GPT-5.5 Pro; against Claude Opus 4.8 the ratio is $2.88 / $0.82 ≈ 3.5×. Savings range from 72% to 74% depending on the comparison model. Actual costs vary with document complexity, language, and analysis depth.
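The arithmetic behind the ratio and the savings range can be reproduced from the table values. The helper names below are illustrative only; the prices are the 1MB figures quoted in this section.

```python
# 1MB analysis cost in USD, copied from the cost table above.
COSTS_USD = {
    "Amalgix": 0.82,
    "GPT-5.5 Pro": 3.12,
    "Claude Opus 4.8": 2.88,
}


def cost_ratio(baseline, ours):
    """How many times more expensive the baseline is."""
    return baseline / ours


def savings_pct(baseline, ours):
    """Percentage saved relative to the baseline price."""
    return (1 - ours / baseline) * 100


ours = COSTS_USD["Amalgix"]
for name in ("GPT-5.5 Pro", "Claude Opus 4.8"):
    baseline = COSTS_USD[name]
    print(f"{name}: {cost_ratio(baseline, ours):.1f}x, "
          f"{savings_pct(baseline, ours):.0f}% savings")
# GPT-5.5 Pro: 3.8x, 74% savings
# Claude Opus 4.8: 3.5x, 72% savings
```

To apply the same calculation to your own workload, replace the Amalgix figure with a quote for your document size.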

5. Known Limitations

Latency: Complex multilingual/CJK financial stress tests take 40–50 seconds. Quick/lightweight calls are faster. We do not claim "real-time" for deep evidence analysis.
Faithfulness ceiling: 94.0% faithfulness means approximately 1 in 17 findings may have a weak or imprecise source citation. All outputs include confidence scores so agents can filter low-confidence findings.
Test set scope: Results are measured on English, Chinese, Japanese, and Korean financial documents, plus the commercial contracts described in Section 1. Performance on other languages or document types (other legal material, academic, medical) has not been systematically benchmarked.
Model dependency: Results depend on current model routing (GPT-5.4 mini, Claude Haiku/Sonnet, Mistral Medium). Model updates by providers may change performance.
Cost variability: Pricing is dynamic and depends on document size, analysis depth, and model selection. The $0.82/1MB example is representative, not guaranteed.
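The "1 in 17" figure follows directly from the faithfulness rate, and the suggested mitigation is to filter on confidence. The sketch below shows both. It assumes `confidence` is a number between 0 and 1 and uses an arbitrary threshold of 0.8; neither the scale nor a recommended cutoff is specified on this page, and `filter_findings` is a hypothetical helper.

```python
def filter_findings(findings, min_confidence=0.8):
    """Keep findings whose confidence meets the threshold.

    Assumes `confidence` is a float in [0, 1]; findings without the
    field are treated as confidence 0 and dropped.
    """
    return [f for f in findings if f.get("confidence", 0.0) >= min_confidence]


# 94.0% faithfulness leaves a 6% share of weak or imprecise citations,
# i.e. about one finding in 1 / 0.06.
faithfulness = 0.94
one_in_n = 1 / (1 - faithfulness)
print(round(one_in_n))  # 17

# Made-up findings, for illustration only.
findings = [
    {"claim": "Revenue grew 8%", "confidence": 0.95},
    {"claim": "Debt covenant breached", "confidence": 0.55},
    {"claim": "Auditor changed"},  # no confidence field
]
kept = filter_findings(findings)
print([f["claim"] for f in kept])  # ['Revenue grew 8%']
```

A stricter threshold trades coverage for precision, so the right cutoff depends on how costly a weak citation is in your use case.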

6. Reproducibility

We do not publish the full test set (it includes licensed commercial contracts). However:

The SEC filing subset uses publicly available 10-K filings from EDGAR. Specific CIKs/tickers can be provided on request.
You can verify cost claims by calling `estimate_cost` (free) with your own document size.
Every Amalgix response includes `confidence`, `verificationStatus`, and `sourceRef`, so you can inspect the basis for each finding in your own calls.
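The evaluation judge (GPT-5.4 mini) runs at temperature=0 to minimize run-to-run variation. Hosted models are not guaranteed to return identical output on every call, so judged scores may differ slightly between runs.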

Not investment advice. Not legal advice. Amalgix outputs are analytical tools for AI agents and developers. Users must verify critical findings against original source documents. Benchmark results do not constitute a warranty of fitness for any specific use case.

Last updated: May 2026 · Report version: v1.1.0