Benchmarks
Reproducible accuracy and latency measurements. All numbers from automated CI runs, not marketing claims.
Accuracy — NLI scoring
Evaluated on LLM-AggreFact (29,320 claim–source pairs). Balanced accuracy accounts for class imbalance.
| Scorer | Parameters | Balanced Accuracy | Latency (GPU) | License |
|---|---|---|---|---|
| FactCG-DeBERTa-v3-Large | 0.4B | 75.8% | 14.6 ms | MIT |
| MiniCheck-Flan-T5-Large | 0.8B | 77.4% | ~40 ms | MIT |
| MiniCheck-DeBERTa-Large | 0.4B | ~73% | ~15 ms | MIT |
| Heuristic-only (Lite) | 0 | ~55% | <0.5 ms | AGPL |
Note: FactCG is the default scorer. It achieves 98% of MiniCheck-7B accuracy at 7× lower latency. The heuristic-only mode is a zero-dependency fallback useful for CPU-constrained environments where approximate scoring is acceptable.
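Balanced accuracy is the unweighted mean of per-class recall, so a scorer cannot inflate its score by always predicting the majority class. A minimal reference implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so the majority class cannot dominate."""
    per_class = []
    for cls in set(y_true):
        total = sum(1 for t in y_true if t == cls)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        per_class.append(hits / total)
    return sum(per_class) / len(per_class)

# Skewed toy labels: always predicting 1 scores 50% here, not 80%.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 1, 1, 1, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.5
```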
Latency — hardware comparison
p99 latency per claim–source pair. 16-pair batch. FactCG-DeBERTa scorer.
| Hardware | Backend | VRAM / RAM | Latency (p99) |
|---|---|---|---|
| NVIDIA GTX 1060 6 GB | ONNX CUDA | 6 GB | 17.9 ms/pair |
| AMD RX 6600 XT 8 GB | ROCm | 8 GB | 80.1 ms/pair |
| AMD EPYC 9575F | ONNX CPU | — | 118.9 ms/pair |
| Intel Xeon E5-2640 v3 | ONNX CPU | — | 207.3 ms/pair |
| Any CPU (heuristic) | Python stdlib | 0 | <0.5 ms/pair |
Takeaway: Even a 6-year-old GTX 1060 achieves 17.9 ms — well within real-time streaming requirements. ONNX CPU is viable for batch processing. The heuristic mode enables deployment on any hardware including edge devices.
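For readers reproducing these numbers, per-pair p99 can be measured with a simple timing harness; this is a sketch, not the exact CI methodology, and `fn` stands in for any batch scorer:

```python
import statistics
import time

def p99_latency(fn, pairs, batch_size=16, warmup=3):
    """Per-pair p99 latency of a batch scorer, mirroring the 16-pair setup above."""
    batches = [pairs[i:i + batch_size] for i in range(0, len(pairs), batch_size)]
    for batch in batches[:warmup]:  # warm caches/allocators before timing
        fn(batch)
    per_pair = []
    for batch in batches:
        t0 = time.perf_counter()
        fn(batch)
        elapsed = time.perf_counter() - t0
        per_pair.extend([elapsed / len(batch)] * len(batch))
    return statistics.quantiles(per_pair, n=100)[98]  # 99th percentile
```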
Rust acceleration — backfire-kernel
Performance-critical compute functions compiled to native code via PyO3 FFI. 5,000 iterations, median times.
| Function | Python | Rust | Speedup |
|---|---|---|---|
| sanitiser_score | 57 µs | 2.1 µs | 27× |
| probs_to_confidence (200×3) | 486 µs | 15 µs | 33× |
| temporal_freshness | 53 µs | 2.5 µs | 21× |
| probs_to_divergence | 89 µs | 8.2 µs | 11× |
| verify_numeric | 42 µs | 5.8 µs | 7.2× |
| detect_task_type | 38 µs | 6.1 µs | 6.2× |
| extract_reasoning_steps | 65 µs | 12 µs | 5.4× |
| word_overlap | 31 µs | 7.5 µs | 4.1× |
| softmax | 22 µs | 6.8 µs | 3.2× |
| lite_score | 47 µs | 26 µs | 1.8× |
| lite_score_batch | 520 µs | 185 µs | 2.8× |
| bidirectional_divergence | 110 µs | 14 µs | 7.9× |
| Geometric mean (12 functions) | — | — | 9.4× |
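A common way to make a native kernel optional is a try-import fallback that binds one public name; a minimal sketch (the module and function names here are illustrative, not the package's actual API):

```python
def _py_word_overlap(a: str, b: str) -> float:
    """Pure-Python fallback: Jaccard overlap of whitespace tokens."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

try:
    from backfire_kernel import word_overlap  # Rust, via PyO3 (hypothetical import path)
except ImportError:
    word_overlap = _py_word_overlap  # same signature, pure Python
```

Callers import one name and never branch on whether the native kernel is present.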
Note: Rust acceleration is optional. Director Class AI falls back to pure Python automatically if backfire-kernel is not installed. Install it with `pip install "director-ai[rust]"` (the quotes prevent shell glob expansion of the brackets).
Known limitations
Summarisation false positive rate: 10.5%
Summary-style outputs trigger more false positives than Q&A-style outputs. Domain presets help mitigate this. Active area of improvement.
Knowledge base dependency
NLI scoring quality depends on your knowledge base coverage. Without a KB, F1 on domain-specific datasets such as PubMedQA drops to 62.1%. With a curated KB, performance matches the benchmarks above.
Long documents need ≥16 GB VRAM
Documents exceeding 4,096 tokens require chunking or a GPU with ≥16 GB VRAM for single-pass scoring. RTX 4090 or A6000 recommended for production.
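One simple chunking strategy is overlapping windows, so claims near a boundary still see surrounding context; the window and overlap sizes below are illustrative defaults, not fixed requirements:

```python
def chunk_tokens(tokens, max_len=4096, overlap=256):
    """Split a token list into overlapping windows for chunked scoring."""
    if len(tokens) <= max_len:
        return [tokens]
    step = max_len - overlap
    return [tokens[i:i + max_len] for i in range(0, len(tokens) - overlap, step)]
```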
ONNX CPU is slow
p99 latency can exceed 200 ms/pair on older server CPUs (up to 383 ms/pair on some hardware). Use heuristic mode or GPU acceleration for real-time applications; CPU remains viable for batch processing.
Run your own benchmarks
`director-ai bench` runs the full benchmark suite on your hardware. Results are reproducible and CI-verified.