Benchmark Methodology
How we measure the Crucible™ evidence pipeline — test sets, evaluation criteria, cost calculations, and limitations.
Benchmark results are snapshots, not universal guarantees. Model performance, pricing, and capabilities change over time. We publish methodology so you can evaluate the claims independently.
1. Test Set
Financial Filings
- 47 public SEC filings — 10-K annual reports from Fortune 500 and mid-cap companies
- Languages: English, Simplified Chinese, mixed-language (CJK + English in same document)
- Size range: 50KB – 5MB
- Includes earnings transcripts, investor decks, and 20-F cross-border filings
Commercial Contracts
- 12 commercial contracts — MSA, NDA, SaaS agreement, procurement, vendor agreements
- Focus: obligation extraction, risk clauses, deadline identification, liability exposure
- Includes multi-party and cross-jurisdictional contracts
Multilingual Stress Test
- Documents containing Japanese (兆/億/円), Korean (조/억/원), and Chinese (万/亿/元) financial units
- Deterministic numeric conversion verification
- 20/20 recall on target financial metrics across CJK + English
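As an illustration of what deterministic numeric conversion can look like for the units listed above, here is a minimal sketch. It is not the Crucible implementation: the function name `parse_cjk_amount`, the unit table, and the choice to read 兆 as 10^12 are all assumptions made for this example.

```python
import re
from decimal import Decimal
from typing import Optional

# Hypothetical unit table for illustration only.
# Assumption: 兆 is read as 10**12 (its meaning can differ by region).
UNIT_MULTIPLIERS = {
    "万": 10**4, "萬": 10**4, "만": 10**4,
    "億": 10**8, "亿": 10**8, "억": 10**8,
    "兆": 10**12, "조": 10**12,
}
CURRENCY_CODES = {"円": "JPY", "元": "CNY", "원": "KRW"}

_NUMBER = r"[0-9][0-9,]*(?:\.[0-9]+)?"
_UNITS = "".join(UNIT_MULTIPLIERS)
_SEGMENT = re.compile("(" + _NUMBER + ")([" + _UNITS + "])")


def parse_cjk_amount(text: str) -> tuple[Decimal, Optional[str]]:
    """Convert a string such as '1兆2,345億円' to (amount, currency code)."""
    text = text.strip()
    currency = None
    for symbol, code in CURRENCY_CODES.items():
        if text.endswith(symbol):
            currency, text = code, text[: -len(symbol)]
            break
    if not text:
        raise ValueError("no numeric content")

    total, pos = Decimal(0), 0
    # Consume consecutive <digits><unit> segments from the start of the string.
    for match in _SEGMENT.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected text before {match.group(0)!r}")
        digits, unit = match.groups()
        total += Decimal(digits.replace(",", "")) * UNIT_MULTIPLIERS[unit]
        pos = match.end()

    # Anything left must be a plain number without a unit, e.g. '3万5000'.
    tail = text[pos:]
    if tail:
        if not re.fullmatch(_NUMBER, tail):
            raise ValueError(f"cannot parse {tail!r}")
        total += Decimal(tail.replace(",", ""))
    return total, currency
```

Using `Decimal` rather than floats keeps the conversion exact, so the parsed value can be compared for equality against the expected figure in a test set.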
2. Evaluation Criteria
| Metric | Method | Description |
| Completeness Recall | Deterministic | Does the output contain all required evidence strings from the source document? |
| Faithfulness | LLM Judge | GPT-5.4 mini (temperature=0) verifies each finding against the source text. Are claims supported by cited evidence? |
| Contradiction Detection | Deterministic + structural | Keyword scan + structure analysis for contradictions, corrections, and restated figures |
| Translation Accuracy | Structural | JSON structure validation: key preservation, type integrity, translation completeness |
| Evidence Contract | Schema check | Every finding must contain: claim, evidence, sourceRef, confidence, verificationStatus |
| Forbidden Hallucination | Deterministic | Output must not contain numbers, entities, or dates absent from the source document |
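The deterministic checks in the table can be approximated in a few lines. The sketch below is illustrative only: the function names are hypothetical, matching is plain substring and regex comparison, and the hallucination check covers numbers only (not entities or dates). The five required field names are taken from the Evidence Contract row.

```python
import re

# Field names from the Evidence Contract row above.
REQUIRED_FIELDS = ("claim", "evidence", "sourceRef", "confidence", "verificationStatus")

_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def completeness_recall(output_text: str, required_evidence: list[str]) -> float:
    """Fraction of required evidence strings found verbatim in the output."""
    if not required_evidence:
        return 1.0
    hits = sum(1 for item in required_evidence if item in output_text)
    return hits / len(required_evidence)


def missing_contract_fields(finding: dict) -> list[str]:
    """Required fields that are absent from a single finding."""
    return [field for field in REQUIRED_FIELDS if field not in finding]


def _numbers(text: str) -> set[str]:
    # Drop thousands separators so '1,200' and '1200' compare equal.
    return {token.replace(",", "") for token in _NUMBER.findall(text)}


def unsupported_numbers(output_text: str, source_text: str) -> set[str]:
    """Numbers that appear in the output but nowhere in the source."""
    return _numbers(output_text) - _numbers(source_text)
```

For example, with the source "Revenue was $1,200 million in 2024." and the output "Revenue reached 1200 million in 2024, up 15%.", `unsupported_numbers` flags only "15", since that figure has no counterpart in the source.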
3. Latest Results (May 2026)
| Metric | Amalgix | Claude 4.6 Direct | GPT-5.4 Direct |
| Analyze Document | 100.0% | 76.4% | 49.0% |
| Summarize | 93.8% | 87.5% | 77.5% |
| Financial Filing | 100.0% | 100.0% | 30.0% |
| Contract Review | 100.0% | 100.0% | 70.0% |
| Contradiction Detection | 100.0% | 100.0% | 77.8% |
| Faithfulness (exclusive) | 94.0% | N/A | N/A |
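Ablation: the Crucible™ pipeline adds +6.0 percentage points over the same Claude model called directly. This figure is consistent with the mean gap across the five task rows above (5.98 points, rounded); per-task gaps range from 0 to 23.6 points. It measures the value of cross-model verification above raw model capability.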
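The ablation figure can be checked against the results table. The sketch below assumes the +6.0 value is the unweighted mean of the per-task gaps between the Amalgix and Claude 4.6 Direct columns; that reading reproduces the published number, but it is an inference from the table rather than a documented formula.

```python
# Scores copied from the results table (percent). The Faithfulness row is
# excluded because it has no Claude Direct value.
amalgix = {
    "Analyze Document": 100.0,
    "Summarize": 93.8,
    "Financial Filing": 100.0,
    "Contract Review": 100.0,
    "Contradiction Detection": 100.0,
}
claude_direct = {
    "Analyze Document": 76.4,
    "Summarize": 87.5,
    "Financial Filing": 100.0,
    "Contract Review": 100.0,
    "Contradiction Detection": 100.0,
}

# Per-task gap in percentage points, rounded to one decimal like the table.
gaps = {task: round(amalgix[task] - claude_direct[task], 1) for task in amalgix}

# Assumed aggregation: unweighted mean over the five tasks.
mean_gap = sum(gaps.values()) / len(gaps)

print(gaps)
print(f"mean gap: {mean_gap:.2f} percentage points")
```

Note that three of the five tasks show no gap at all, so the average is driven by Analyze Document and Summarize.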
4. Cost Comparison Methodology
The "3.8× cheaper" claim is calculated by comparing the total cost of a 1MB document analysis:
| Provider | 1MB Cost | Method |
| Amalgix | $0.82 | x402 quote via estimate_cost |
| GPT-5.5 Pro | $3.12 | OpenAI API pricing (input + output tokens) |
| Claude Opus 4.8 | $2.88 | Anthropic API pricing (input + output tokens) |
Amalgix achieves lower cost by routing different analysis stages to cost-efficient specialist models (e.g., GPT-5.4 mini for extraction, Claude Haiku for verification) rather than calling a single frontier model for the entire task.
Ratio: $3.12 / $0.82 ≈ 3.8× against GPT-5.5 Pro; against Claude Opus 4.8 the ratio is $2.88 / $0.82 ≈ 3.5×. Savings range from roughly 72% (vs. Claude Opus 4.8) to 74% (vs. GPT-5.5 Pro). Actual costs vary with document complexity, language, and analysis depth.
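The ratio and savings figures follow directly from the three prices in the table. A short worked calculation, using only those published numbers (the helper names are ours):

```python
# 1MB analysis cost in USD, copied from the cost table above.
costs = {
    "Amalgix": 0.82,
    "GPT-5.5 Pro": 3.12,
    "Claude Opus 4.8": 2.88,
}


def cost_ratio(baseline: str) -> float:
    """How many times more the baseline costs than Amalgix."""
    return costs[baseline] / costs["Amalgix"]


def savings_percent(baseline: str) -> float:
    """Percentage saved by using Amalgix instead of the baseline."""
    return (1 - costs["Amalgix"] / costs[baseline]) * 100


for name in ("GPT-5.5 Pro", "Claude Opus 4.8"):
    print(f"{name}: {cost_ratio(name):.2f}x, {savings_percent(name):.1f}% savings")
```

The headline 3.8× figure corresponds to the more expensive of the two baselines; the comparison against Claude Opus 4.8 is closer to 3.5×.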
5. Known Limitations
- Latency: Complex multilingual/CJK financial stress tests take 40–50 seconds. Quick/lightweight calls are faster. We do not claim "real-time" for deep evidence analysis.
- Faithfulness ceiling: 94.0% faithfulness means approximately 1 in 17 findings may have a weak or imprecise source citation. All outputs include confidence scores so agents can filter low-confidence findings.
- Test set scope: Results are measured on English, Chinese, Japanese, and Korean financial documents and on the commercial contracts described in Section 1. Performance on other languages or document types (other legal documents, academic, medical) has not been systematically benchmarked.
- Model dependency: Results depend on current model routing (GPT-5.4 mini, Claude Haiku/Sonnet, Mistral Medium). Model updates by providers may change performance.
- Cost variability: Pricing is dynamic and depends on document size, analysis depth, and model selection. The $0.82/1MB example is representative, not guaranteed.
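To make the faithfulness limitation concrete, the sketch below shows one way an agent might filter findings by confidence. The field names come from the evidence contract, but the value formats are assumptions for illustration: confidence is treated as a float between 0 and 1, the verificationStatus strings and the 0.8 threshold are hypothetical.

```python
def filter_findings(findings: list[dict], min_confidence: float = 0.8) -> list[dict]:
    """Keep findings whose confidence meets the threshold.

    Assumes `confidence` is a float in [0, 1]; a finding without a
    confidence value is treated as confidence 0 and dropped.
    """
    return [f for f in findings if f.get("confidence", 0.0) >= min_confidence]


# Hypothetical findings, shaped after the evidence contract fields.
findings = [
    {"claim": "Revenue grew 12%", "evidence": "...", "sourceRef": "p.4",
     "confidence": 0.95, "verificationStatus": "verified"},
    {"claim": "Debt matures in 2027", "evidence": "...", "sourceRef": "p.19",
     "confidence": 0.55, "verificationStatus": "unverified"},
]

kept = filter_findings(findings)

# 94.0% faithfulness leaves 6% of findings with a weak citation,
# i.e. about one finding in every 1 / 0.06 ≈ 16.7.
one_in_n = 1 / (1 - 0.94)
```

A stricter threshold trades recall for precision, so the right value depends on how costly a weakly cited finding is in the calling application.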
6. Reproducibility
We do not publish the full test set (it includes licensed commercial contracts). However:
- The SEC filing subset uses publicly available 10-K filings from EDGAR. Specific CIKs/tickers can be provided on request.
- You can verify cost claims by calling estimate_cost (free) with your own document size.
- Every Amalgix response includes confidence, verificationStatus, and sourceRef, so you can inspect the basis for each finding in your own calls.
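- The evaluation judge (GPT-5.4 mini) runs at temperature=0 to keep run-to-run variation low. Hosted models are not guaranteed to be bit-for-bit deterministic, so judge scores may differ slightly between runs.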
Not investment advice. Not legal advice. Amalgix outputs are analytical tools for AI agents and developers. Users must verify critical findings against original source documents. Benchmark results do not constitute a warranty of fitness for any specific use case.
Last updated: May 2026 · Report version: v1.1.0