Benchmarks
Reproducible accuracy and latency measurements. All numbers from automated CI runs, not marketing claims.
Accuracy — NLI scoring
Evaluated on LLM-AggreFact (29,320 claim–source pairs). Balanced accuracy accounts for class imbalance.
| Scorer | Parameters | Balanced Accuracy | Latency (GPU) | License |
|---|---|---|---|---|
| FactCG-DeBERTa-v3-Large | 0.4B | 75.8% | 14.6 ms | MIT |
| MiniCheck-Flan-T5-Large | 0.8B | 77.4% | ~40 ms | MIT |
| MiniCheck-DeBERTa-Large | 0.4B | ~73% | ~15 ms | MIT |
| Heuristic-only (Lite) | 0 | ~55% | <0.5 ms | AGPL |
Note: FactCG is the default scorer. It achieves 98% of MiniCheck-7B accuracy at 7× lower latency. The heuristic-only mode is a zero-dependency fallback useful for CPU-constrained environments where approximate scoring is acceptable.
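Balanced accuracy is the unweighted mean of per-class recall, so a scorer cannot inflate its score by always predicting the majority class. A minimal reference implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so the majority class cannot dominate."""
    per_class = []
    for cls in set(y_true):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        per_class.append(hits / total)
    return sum(per_class) / len(per_class)

# Skewed toy labels: always predicting 1 scores 50% here, not 80%.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.5
```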
Latency — hardware comparison
p99 latency per claim–source pair. 16-pair batch. FactCG-DeBERTa scorer.
| Hardware | Backend | VRAM / RAM | Latency (p99) |
|---|---|---|---|
| NVIDIA GTX 1060 6 GB | ONNX CUDA | 6 GB | 17.9 ms/pair |
| AMD RX 6600 XT 8 GB | ROCm | 8 GB | 80.1 ms/pair |
| AMD EPYC 9575F | ONNX CPU | — | 118.9 ms/pair |
| Intel Xeon E5-2640 v3 | ONNX CPU | — | 207.3 ms/pair |
| Any CPU (heuristic) | Python stdlib | 0 | <0.5 ms/pair |
Takeaway: Even a 6-year-old GTX 1060 achieves 17.9 ms — well within real-time streaming requirements. ONNX CPU is viable for batch processing. The heuristic mode enables deployment on any hardware including edge devices.
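For readers reproducing these numbers, per-pair p99 can be measured with a simple timing harness; this is a sketch, not the exact CI methodology, and `fn` stands in for any batch scorer:

```python
import statistics
import time

def p99_latency(fn, pairs, batch_size=16, warmup=3):
    """Per-pair p99 latency of a batch scorer, mirroring the 16-pair setup above."""
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    for batch in batches[:warmup]:  # warm caches/allocators before timing
        fn(batch)
    per_pair = []
    for batch in batches:
        t0 = time.perf_counter()
        fn(batch)
        elapsed = time.perf_counter() - t0
        per_pair.extend([elapsed / len(batch)] * len(batch))
    return statistics.quantiles(per_pair, n=100)[98]  # 99th percentile
```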
Rust acceleration — backfire-kernel
Performance-critical compute functions compiled to native code via PyO3 FFI. 5,000 iterations, median times.
| Function | Python | Rust | Speedup |
|---|---|---|---|
| sanitiser_score | 57 µs | 2.1 µs | 27× |
| probs_to_confidence (200×3) | 486 µs | 15 µs | 33× |
| temporal_freshness | 53 µs | 2.5 µs | 21× |
| probs_to_divergence | 89 µs | 8.2 µs | 11× |
| verify_numeric | 42 µs | 5.8 µs | 7.2× |
| detect_task_type | 38 µs | 6.1 µs | 6.2× |
| extract_reasoning_steps | 65 µs | 12 µs | 5.4× |
| word_overlap | 31 µs | 7.5 µs | 4.1× |
| softmax | 22 µs | 6.8 µs | 3.2× |
| lite_score | 47 µs | 26 µs | 1.8× |
| lite_score_batch | 520 µs | 185 µs | 2.8× |
| bidirectional_divergence | 110 µs | 14 µs | 7.9× |
| Geometric mean (12 functions) | — | — | 9.4× |
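A common way to make a native kernel optional is a try-import fallback that binds one public name; a minimal sketch (the module and function names here are illustrative, not the package's actual API):

```python
def _py_word_overlap(a: str, b: str) -> float:
    """Pure-Python fallback: Jaccard overlap of whitespace tokens."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

try:
    from backfire_kernel import word_overlap  # Rust, via PyO3 (hypothetical import path)
except ImportError:
    word_overlap = _py_word_overlap  # same signature, pure Python
```

Callers import one name and never branch on whether the native kernel is present.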
Note: Rust acceleration is optional. Director Class AI falls back to pure Python automatically if backfire-kernel is not installed. Install it with `pip install "director-ai[rust]"` (the quotes prevent shell glob expansion of the brackets).
Known limitations
Summarisation false positive rate: 10.5%
Summary-style outputs trigger more false positives than Q&A-style outputs. Domain presets help mitigate this. Active area of improvement.
Knowledge base dependency
NLI scoring quality depends on your knowledge base coverage. Without a KB, F1 on domain-specific datasets such as PubMedQA drops to 62.1%. With a curated KB, performance matches the benchmarks above.
Long documents need ≥16 GB VRAM
Documents exceeding 4,096 tokens require chunking or a GPU with ≥16 GB VRAM for single-pass scoring. RTX 4090 or A6000 recommended for production.
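One simple chunking strategy is overlapping windows, so claims near a boundary still see surrounding context; the window and overlap sizes below are illustrative defaults, not fixed requirements:

```python
def chunk_tokens(tokens, max_len=4096, overlap=256):
    """Split a token list into overlapping windows for chunked scoring."""
    if len(tokens) <= max_len:
        return [tokens]
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, len(tokens) - overlap, step)]
```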
ONNX CPU is slow
p99 latency can exceed 200 ms/pair on older server CPUs (up to 383 ms/pair on some hardware). Use heuristic mode or GPU acceleration for real-time applications; CPU remains viable for batch processing.
Run your own benchmarks
`director-ai bench` runs the full benchmark suite on your hardware. Results are reproducible and CI-verified.