Director Class AI

Benchmarks

Reproducible accuracy and latency measurements. All numbers from automated CI runs, not marketing claims.

Accuracy — NLI scoring

Evaluated on LLM-AggreFact (29,320 claim–source pairs). Balanced accuracy accounts for class imbalance.
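Balanced accuracy is the unweighted mean of per-class recall, so a scorer cannot look good simply by predicting the majority class. A minimal sketch of the metric (the standard definition, not the actual evaluation harness):

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        total = sum(1 for t in y_true if t == c)
        recalls.append(tp / total)
    return sum(recalls) / len(recalls)

# On a 90/10 imbalanced split, a majority-class predictor scores 0.5, not 0.9:
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
print(balanced_accuracy(y_true, y_pred))  # → 0.5
```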

| Scorer | Parameters | Balanced accuracy | Latency (GPU) | License |
|---|---|---|---|---|
| FactCG-DeBERTa-v3-Large | 0.4B | 75.8% | 14.6 ms | MIT |
| MiniCheck-Flan-T5-Large | 0.8B | 77.4% | ~40 ms | MIT |
| MiniCheck-DeBERTa-Large | 0.4B | ~73% | ~15 ms | MIT |
| Heuristic-only (Lite) | 0 | ~55% | <0.5 ms | AGPL |
Note: FactCG is the default scorer. It achieves 98% of MiniCheck-7B accuracy at 7× lower latency. The heuristic-only mode is a zero-dependency fallback useful for CPU-constrained environments where approximate scoring is acceptable.

Latency — hardware comparison

p99 latency per claim–source pair, measured with 16-pair batches using the FactCG-DeBERTa scorer.

| Hardware | Backend | VRAM / RAM | Latency (p99) |
|---|---|---|---|
| NVIDIA GTX 1060 6 GB | ONNX CUDA | 6 GB | 17.9 ms/pair |
| AMD RX 6600 XT 8 GB | ROCm | 8 GB | 80.1 ms/pair |
| AMD EPYC 9575F | ONNX CPU | — | 118.9 ms/pair |
| Intel Xeon E5-2640 v3 | ONNX CPU | — | 207.3 ms/pair |
| Any CPU (heuristic) | Python stdlib | 0 | <0.5 ms/pair |
Takeaway: Even a 6-year-old GTX 1060 achieves 17.9 ms — well within real-time streaming requirements. ONNX CPU is viable for batch processing. The heuristic mode enables deployment on any hardware including edge devices.
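As a rough sanity check, per-pair latency converts directly to single-stream throughput. This ignores queueing and batch-fill effects, so treat it as an idealised upper-bound sketch:

```python
# Convert the per-pair p99 latencies above into approximate pairs/second,
# assuming back-to-back processing at the measured latency.
def pairs_per_second(latency_ms_per_pair: float) -> float:
    return 1000.0 / latency_ms_per_pair

for hw, ms in [("GTX 1060 (ONNX CUDA)", 17.9),
               ("EPYC 9575F (ONNX CPU)", 118.9)]:
    print(f"{hw}: ~{pairs_per_second(ms):.0f} pairs/s")
```

On these numbers the GTX 1060 sustains roughly 56 pairs/s single-stream, comfortably ahead of typical token-generation rates for streaming use.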

Rust acceleration — backfire-kernel

Performance-critical compute functions are compiled to native code via PyO3 FFI. Timings are the median of 5,000 iterations.

| Function | Python | Rust | Speedup |
|---|---|---|---|
| sanitiser_score | 57 µs | 2.1 µs | 27× |
| probs_to_confidence (200×3) | 486 µs | 15 µs | 33× |
| temporal_freshness | 53 µs | 2.5 µs | 21× |
| probs_to_divergence | 89 µs | 8.2 µs | 11× |
| verify_numeric | 42 µs | 5.8 µs | 7.2× |
| detect_task_type | 38 µs | 6.1 µs | 6.2× |
| extract_reasoning_steps | 65 µs | 12 µs | 5.4× |
| word_overlap | 31 µs | 7.5 µs | 4.1× |
| softmax | 22 µs | 6.8 µs | 3.2× |
| lite_score | 47 µs | 26 µs | 1.8× |
| lite_score_batch | 520 µs | 185 µs | 2.8× |
| bidirectional_divergence | 110 µs | 14 µs | 7.9× |
| **Geometric mean (12 functions)** | | | **9.4×** |
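The measurement method above (median of 5,000 iterations) and the geometric-mean summary can be reproduced with a small harness. This is a generic sketch, not the project's actual benchmark code:

```python
import math
import statistics
import time

def median_time_us(fn, *args, iterations=5000):
    """Median wall-clock time of fn(*args) in microseconds."""
    samples = []
    for _ in range(iterations):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1e6)
    return statistics.median(samples)

def geometric_mean(speedups):
    """Geometric mean: the conventional average for ratios like speedups."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))
```

The median resists outliers from GC pauses and OS scheduling noise, which is why micro-benchmarks prefer it over the mean.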
Note: Rust acceleration is optional. Director Class AI falls back to pure Python automatically if backfire-kernel is not installed. Install with `pip install director-ai[rust]`.
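The automatic fallback typically follows the standard optional-extension pattern. The sketch below is illustrative only: the Jaccard word-overlap maths is a hypothetical stand-in, not backfire-kernel's real implementation.

```python
def _py_word_overlap(a: str, b: str) -> float:
    """Pure-Python fallback: Jaccard overlap of word sets (illustrative)."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

try:
    # Prefer the native Rust path when the optional wheel is installed.
    from backfire_kernel import word_overlap  # type: ignore
except ImportError:
    # Otherwise fall back to pure Python, transparently to callers.
    word_overlap = _py_word_overlap
```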

Known limitations

- **Summarisation false positive rate: 10.5%.** Summary-style outputs trigger more false positives than Q&A-style outputs. Domain presets help mitigate this; it is an active area of improvement.
- **Knowledge base dependency.** NLI scoring quality depends on your knowledge base coverage. Without a KB (e.g., PubMedQA), F1 drops to 62.1%; with a curated KB, performance matches the benchmarks above.
- **Long documents need ≥16 GB VRAM.** Documents exceeding 4,096 tokens require chunking or a GPU with ≥16 GB VRAM for single-pass scoring. An RTX 4090 or A6000 is recommended for production.
- **ONNX CPU is slow.** 383 ms/pair on older CPUs; use heuristic mode or GPU acceleration for real-time applications. CPU remains viable for batch processing.
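The chunking requirement for long documents can be met with a simple sliding window. A minimal sketch, using a whitespace split as a stand-in for the model's actual tokenizer (real token counts will differ):

```python
def chunk(text: str, max_tokens: int = 4096, overlap: int = 256) -> list[str]:
    """Split text into overlapping windows of at most max_tokens tokens."""
    tokens = text.split()  # stand-in for real tokenization
    step = max_tokens - overlap
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, max(len(tokens) - overlap, 1), step)]
```

The overlap keeps claims that straddle a window boundary visible to at least one chunk; each chunk is then scored independently against the source.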

Run your own benchmarks

`director-ai bench` runs the full benchmark suite on your hardware. Results are reproducible and CI-verified.

Install and benchmark
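Collected from this page, the install and benchmark commands (the `[rust]` extra enables the optional backfire-kernel acceleration):

```shell
pip install director-ai[rust]   # optional Rust acceleration via backfire-kernel
director-ai bench               # run the full benchmark suite on your hardware
```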