Token-Level Truth: Real-Time Hallucination Detection for Production LLMs
HaluGate delivers real-time, token-level hallucination detection for production LLMs. Leveraging tool context as ground truth, it prevents unsupported claims, offering fast, explainable, and cost-effective verification for reliable AI deployment.
Imagine your Large Language Model (LLM) calls a tool, retrieves accurate data, yet still delivers an incorrect answer. This is the challenge of extrinsic hallucination, where models confidently disregard factual ground truth. Building upon our existing Signal-Decision Architecture, we are proud to introduce HaluGate. HaluGate is a conditional, token-level hallucination detection pipeline designed to identify and intercept unsupported claims before they ever reach your users. It operates without relying on an LLM-as-judge or a separate Python runtime, offering fast, explainable verification precisely at the point of delivery.
The Problem: Hallucinations Impede Production LLM Deployment
Hallucinations represent the most significant obstacle to deploying LLMs effectively in production environments. Across diverse sectors—including legal (fabricated case citations), healthcare (incorrect drug interactions), finance (invented financial data), and customer service (non-existent policies)—a consistent pattern emerges: AI systems generate plausible-sounding content that appears authoritative but fails upon closer examination. The critical issue isn't overt nonsense, but rather subtle fabrications embedded within otherwise accurate responses. These errors often demand specialized domain expertise or external verification to detect. For enterprises, this inherent uncertainty transforms LLM deployment into a potential liability instead of a valuable asset.
Scenario: When Tools Provide Correct Data, But Models Fail
To illustrate this challenge, consider a typical function-calling interaction:
User: “When was the Eiffel Tower built?”
Tool Call: get_landmark_info("Eiffel Tower")
Tool Response:
{"name": "Eiffel Tower", "built": "1887-1889", "height": "330 meters", "location": "Paris, France"}
LLM Response: “The Eiffel Tower was built in 1950 and stands at 500 meters tall in Paris, France.”
In this example, the tool successfully retrieved accurate data. However, the LLM’s response, while appearing factual, contains two fabricated elements—extrinsic hallucinations that directly contradict the provided context. This failure mode is particularly deceptive because:
- Users tend to trust the output, as they observe a tool was invoked.
- Traditional content filters often miss such errors, as they don't involve toxic or harmful content.
- Evaluation becomes costly if relying on another LLM to judge accuracy.
This raises a crucial question: What if we could automatically detect such errors in real-time, with millisecond latency?
The Insight: Leveraging Function Calling as Ground Truth
The fundamental insight is that modern function-calling APIs inherently provide crucial grounding context. When users pose factual questions, LLMs invoke tools for tasks such as database lookups, API calls, or document retrieval. The results from these tool calls are semantically equivalent to retrieved documents in a Retrieval-Augmented Generation (RAG) system.

This eliminates the need for separate retrieval infrastructure or relying on powerful models like GPT-4 as judges. Instead, we extract three essential components directly from the existing API flow:
| Component | Source | Purpose |
|---|---|---|
| Context | Tool message content | Ground truth for verification |
| Question | User message | Intent understanding |
| Answer | Assistant response | Claims to verify |
The core question then becomes: Is the answer faithful to the provided context?
Why Not Rely on an LLM-as-Judge?
The seemingly straightforward approach of using another LLM for verification introduces fundamental challenges in a production environment:
| Approach | Latency | Cost | Explainability |
|---|---|---|---|
| GPT-4 as judge | 2-5 seconds | $0.01-0.03/request | Low (black box) |
| Local LLM judge | 500ms-2s | GPU compute | Low |
| HaluGate | 76-162ms | CPU only | High (token-level + NLI) |
Furthermore, LLM-based judges often exhibit several biases:
- Position bias: A tendency to favor specific answer positions.
- Verbosity bias: Longer answers may be rated higher, irrespective of their actual accuracy.
- Self-preference: Models may favor outputs that align with their own stylistic patterns.
- Inconsistency: Identical inputs can sometimes yield different judgments.
These limitations underscored the need for a solution that is faster, more cost-effective, and provides superior explainability.
HaluGate: A Two-Stage Detection Pipeline
HaluGate employs a conditional, two-stage pipeline meticulously designed to balance efficiency with detection precision.

Stage 1: HaluGate Sentinel (Prompt Classification)
Not every query necessitates hallucination detection. For instance, consider the following prompt types:
| Prompt | Needs Fact-Check? | Reason |
|---|---|---|
| “When was Einstein born?” | ✅ Yes | Verifiable fact |
| “Write a poem about autumn” | ❌ No | Creative task |
| “Debug this Python code” | ❌ No | Technical assistance |
| “What’s your opinion on AI?” | ❌ No | Opinion request |
| “Is the Earth round?” | ✅ Yes | Factual claim |
Applying token-level detection to creative writing or code review tasks is inefficient and risks generating false positives (e.g., "your poem contains unsupported claims!").
The Importance of Pre-classification: Token-level detection scales linearly with context length. For example, a 4K token RAG context might require ~125ms for detection, while 16K tokens could take ~365ms. In typical production workloads, approximately 35% of queries are non-factual. Pre-classification offers a 72.2% efficiency gain by completely bypassing expensive detection for creative, coding, and opinion-based queries.
HaluGate Sentinel is a ModernBERT-based classifier specifically designed to answer one critical question: Does this prompt warrant factual verification?

This model is trained on a meticulously curated dataset comprising:
- Fact-Check Needed (Positive Class):
- Question Answering: SQuAD, TriviaQA, Natural Questions, HotpotQA
- Truthfulness: TruthfulQA (addressing common misconceptions)
- Hallucination Benchmarks: HaluEval, FactCHD
- Information-Seeking Dialogue: FaithDial, CoQA
- RAG Datasets: neural-bridge/rag-dataset-12000
- No Fact-Check Needed (Negative Class):
- Creative Writing: WritingPrompts, story generation
- Code: CodeSearchNet docstrings, programming tasks
- Opinion/Instruction: Dolly non-factual, Alpaca creative
This binary classification achieves a 96.4% validation accuracy with an inference latency of approximately 12ms through native Rust/Candle integration.
Stage 2: Token-Level Detection with NLI Explanation
For prompts identified as fact-seeking, HaluGate initiates a sophisticated two-model detection pipeline.
Token-Level Hallucination Detection
In contrast to sentence-level classifiers that provide a singular "hallucinated/not hallucinated" label, token-level detection precisely identifies which specific tokens within a response are unsupported by the provided context.

The model architecture for token-level detection follows this structure:
Input: [CLS] context [SEP] question [SEP] answer [SEP]
↓
ModernBERT Encoder
↓
Token Classification Head (Binary per token)
↓
Label: 0 = Supported, 1 = Hallucinated (for answer tokens only)
Key design principles include:
- Answer-only classification: Only tokens within the answer segment are classified, excluding context or question tokens.
- Span merging: Consecutive hallucinated tokens are merged into unified spans to enhance readability.
- Confidence thresholding: A configurable threshold (default 0.8) is applied to balance precision and recall.
NLI Explanation Layer
Merely knowing that something is a hallucination is insufficient; understanding why is crucial. Our Natural Language Inference (NLI) model classifies each detected span against its context:

| NLI Label | Meaning | Severity | Action |
|---|---|---|---|
| CONTRADICTION | Claim conflicts with context | 4 (High) | Flag as error |
| NEUTRAL | Claim not supported by context | 2 (Medium) | Flag as unverifiable |
| ENTAILMENT | Context supports the claim | 0 | Filter false positive |
How the Ensemble Works: Token-level detection alone yields only 59% F1 on the hallucinated class, meaning nearly half of all hallucinations are missed, and one-third of flags are false positives. While we experimented with a unified 5-class model (e.g., SUPPORTED/CONTRADICTION/FABRICATION), it achieved a mere 21.7% F1. This highlights that token-level classification on its own struggles to discern why something is incorrect. The two-stage approach transforms a basic detector into an actionable system: the initial token-level detection provides recall (identifying potential issues), while the NLI layer enhances precision (filtering false positives) and offers crucial explainability (categorizing why each span is problematic).
Integration with the Signal-Decision Architecture
HaluGate is not a standalone component; it is deeply integrated into our Signal-Decision Architecture as a novel signal type and plugin.
fact_check as a Signal Type
Just as our architecture utilizes keyword, embedding, and domain signals, fact_check is now a first-class signal type, enabling robust conditional logic.

This integration allows decisions to be precisely conditioned on whether a query is fact-seeking. It's important to note that even advanced frontier models exhibit hallucination variance between releases (e.g., GPT-5.2's system card shows a measurable hallucination delta compared to previous versions), underscoring the critical need for continuous verification regardless of model sophistication.
decisions:
-
name: "factual-query-with-verification"
priority: 100
rules:
operator: "AND"
conditions:
-
type: "fact_check"
name: "needs_fact_check"
-
type: "domain"
name: "general"
plugins:
-
type: "hallucination"
configuration:
enabled: true
use_nli: true
hallucination_action: "header"
Request-Response Context Propagation
A key engineering challenge lies in propagating state across the request-response boundary, as classification occurs at request time while detection happens at response time.

The RequestContext structure efficiently carries all necessary state:
RequestContext:
# Classification results (set at request time)
FactCheckNeeded: true
FactCheckConfidence: 0.87
# Tool context (extracted at request time)
HasToolsForFactCheck: true
ToolResultsContext: "Built 1887-1889, 330 meters..."
UserContent: "When was the Eiffel Tower built?"
# Detection results (set at response time)
HallucinationDetected: true
HallucinationSpans: [
"1950",
"500 meters"
]
HallucinationConfidence: 0.92
The hallucination Plugin
The hallucination plugin is configured on a per-decision basis, allowing for granular control over its behavior:
plugins:
-
type: "hallucination"
configuration:
enabled: true
use_nli: true # Enable NLI explanations
hallucination_action: "header" # Action when hallucination detected ("header" | "body" | "block" | "none")
unverified_factual_action: "header" # Action when fact-check needed but no tool context
include_hallucination_details: true # Include detailed info in response
| Action | Behavior |
|---|---|
header | Adds warning headers, allows response to pass through |
body | Injects a warning directly into the response body |
block | Returns an error response, preventing LLM output |
none | Logs the event only, with no user-visible action |
Response Headers: Actionable Transparency
Detection results are transparently communicated via HTTP headers, empowering downstream systems to implement custom policies:
HTTP/1.1 200 OK
Content-Type: application/json
x-vsr-fact-check-needed: true
x-vsr-hallucination-detected: true
x-vsr-hallucination-spans: 1950; 500 meters
x-vsr-nli-contradictions: 2
x-vsr-max-severity: 4
For factual responses that cannot be verified due to a lack of available tools, specific headers provide transparency:
HTTP/1.1 200 OK
x-vsr-fact-check-needed: true
x-vsr-unverified-factual-response: true
x-vsr-verification-context-missing: true
These headers facilitate various actions:
- UI Disclaimers: Display warnings to users when confidence levels are low.
- Human Review Queues: Route flagged responses for manual expert review.
- Audit Logging: Track unverified claims for compliance and analytical purposes.
- Conditional Blocking: Automatically block responses containing high-severity contradictions.
The Complete Pipeline: Three Distinct Paths

The HaluGate pipeline manages requests through three distinct paths:
| Path | Condition | Latency Added | Action |
|---|---|---|---|
| Path 1 | Non-factual prompt | ~12ms (classifier only) | Passes through |
| Path 2 | Factual + No tools | ~12ms | Adds warning headers |
| Path 3 | Factual + Tools available | 76-162ms | Full detection + headers |
Model Architecture Deep Dive
Let's examine the three specialized models that collectively power HaluGate:

HaluGate Sentinel: Binary Prompt Classification
- Architecture: ModernBERT-base with a LoRA adapter and a binary classification head.
- Training:
- Base Model:
answerdotai/ModernBERT-base - Fine-tuning: LoRA (rank=16, alpha=32, dropout=0.1)
- Training Data: 50,000 samples meticulously gathered from 14 diverse datasets.
- Loss Function: CrossEntropy with class weights to address data imbalance.
- Optimization: AdamW, learning rate=2e-5, trained over 3 epochs.
- Base Model:
- Inference:
- Input: Raw prompt text.
- Output: (class_id, confidence).
- Latency: Approximately 12ms on CPU.
The LoRA (Low-Rank Adaptation) approach enables efficient fine-tuning, preserving the vast pre-trained knowledge while only updating a small fraction of parameters (2.2%, or 3.4M out of 149M) during training.
HaluGate Detector: Token-Level Binary Classification
- Architecture: ModernBERT-base coupled with a token classification head.
- Input Format:
[CLS] The Eiffel Tower was built in 1887-1889 and is 330 meters tall. [SEP] When was the Eiffel Tower built? [SEP] The Eiffel Tower was built in 1950 and is 500 meters tall. [SEP]^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Answer tokens (classification targets) - Output: A binary label (0=Supported, 1=Hallucinated) for each token within the answer.
- Post-processing Steps:
- Predictions are filtered to apply only to the answer segment.
- A configurable confidence threshold (default: 0.8) is applied.
- Consecutive hallucinated tokens are merged into unified spans.
- The system returns these spans along with their confidence scores.
HaluGate Explainer: Three-Way NLI Classification
-
Architecture: ModernBERT-base, specifically fine-tuned for Natural Language Inference (NLI).
-
Input Format:
[CLS] The Eiffel Tower was built in 1887-1889. [SEP] built in 1950 [SEP]^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ Premise (context) Hypothesis (span) -
Output: A three-way classification with confidence scores:
- ENTAILMENT (0): The context unequivocally supports the claim.
- NEUTRAL (1): The claim cannot be definitively determined from the context.
- CONTRADICTION (2): The context directly conflicts with the claim.
-
Severity Mapping:
| NLI Label | Severity Score | Interpretation |
|---|---|---|
| ENTAILMENT | 0 | Likely false positive—filter out |
| NEUTRAL | 2 | Claim is unverifiable |
| CONTRADICTION | 4 | Direct factual error |
The Advantages of Native Rust/Candle Implementation
All three HaluGate models execute natively using Candle, Hugging Face’s machine learning framework written in Rust, leveraging CGO bindings for Go integration.

This architectural choice provides significant benefits:
| Aspect | Python (PyTorch) | Native (Candle) |
|---|---|---|
| Cold start | 5-10 seconds | <500ms |
| Memory | 2-4GB per model | 500MB-1GB per model |
| Latency | +50-100ms overhead | Near-zero overhead |
| Deployment | Python runtime required | Single binary |
| Scaling | GIL contention | True parallelism |
This native implementation eliminates the need for a separate Python service, sidecar containers, or external model servers, ensuring that all processing runs efficiently in-process.
Latency Breakdown
The following table outlines the measured latency for each component within the HaluGate production pipeline:
| Component | P50 | P99 | Notes |
|---|---|---|---|
| Fact-check classifier | 12ms | 28ms | ModernBERT inference |
| Tool context extraction | 1ms | 3ms | JSON parsing |
| Hallucination detector | 45ms | 89ms | Token classification |
| NLI explainer | 18ms | 42ms | Per-span classification |
| Total overhead (when detection runs) | 76ms | 162ms |
This total overhead of 76-162ms is remarkably low and negligible when compared to typical LLM generation times, which range from 5 to 30 seconds. This makes HaluGate a practical solution for synchronous request processing.
Configuration Reference
Below is a complete reference for configuring HaluGate's hallucination mitigation:
hallucination_mitigation:
# Stage 1: Prompt classification
fact_check_model:
model_id: "models/halugate-sentinel"
threshold: 0.6 # Confidence threshold for FACT_CHECK_NEEDED
use_cpu: true
# Stage 2a: Token-level detection
hallucination_model:
model_id: "models/halugate-detector"
threshold: 0.8 # Token confidence threshold
use_cpu: true
# Stage 2b: NLI explanation
nli_model:
model_id: "models/halugate-explainer"
threshold: 0.9 # NLI confidence threshold
use_cpu: true
# Signal rules for fact-check classification
fact_check_rules:
-
name: needs_fact_check
description: "Query contains factual claims that should be verified"
-
name: no_fact_check_needed
description: "Query is creative, code-related, or opinion-based"
# Decision with hallucination plugin
decisions:
-
name: "verified-factual"
priority: 100
rules:
operator: "AND"
conditions:
-
type: "fact_check"
name: "needs_fact_check"
plugins:
-
type: "hallucination"
configuration:
enabled: true
use_nli: true
hallucination_action: "header"
unverified_factual_action: "header"
include_hallucination_details: true
Beyond Production: HaluGate as an Evaluation Framework
While HaluGate is primarily engineered for real-time production deployment, its robust pipeline is equally effective for offline model evaluation. Rather than intercepting live requests, HaluGate can process benchmark datasets to systematically quantify hallucination rates across various LLMs.

Evaluation Workflow
- Load Dataset: Utilize established QA/RAG benchmarks (e.g., TriviaQA, Natural Questions, HotpotQA) or integrate custom enterprise datasets featuring context-question pairs.
- Generate Responses: Execute the LLM under test against each query, providing the relevant context.
- Detect Hallucinations: Feed the (context, query, response) triples through the HaluGate Detector.
- Classify Severity: Employ the HaluGate Explainer to categorize the severity and type of each identified hallucinated span.
- Aggregate Metrics: Calculate crucial metrics such as overall hallucination rates, contradiction ratios, and detailed breakdowns by category.
Limitations and Scope
HaluGate is specifically designed to address extrinsic hallucinations, which occur when verification can be grounded by tool or RAG context. It has specific limitations:
What HaluGate Cannot Detect
| Limitation | Example | Reason |
|---|---|---|
| Intrinsic hallucinations | Model states “Einstein was born in 1900” without any tool call | No external context for verification |
| No-context scenarios | User asks a factual question, but no tools are defined | Absence of ground truth |
Transparent Degradation
For requests classified as fact-seeking but where tool context is unavailable, HaluGate explicitly flags responses as “unverified factual” instead of silently permitting them. This is indicated by specific HTTP headers:
x-vsr-fact-check-needed: true
x-vsr-unverified-factual-response: true
x-vsr-verification-context-missing: true
This transparent approach empowers downstream systems to manage uncertainty and unverified claims appropriately.
Conclusion
HaluGate introduces a principled and robust approach to hallucination detection for production LLM deployments, characterized by:
- Conditional Verification: Non-factual queries are intelligently skipped, while factual ones undergo rigorous verification.
- Token-Level Precision: It precisely identifies which specific claims within a response are unsupported by the context.
- Explainable Results: The NLI classification provides clear insights into why a particular claim is deemed problematic.
- Zero-Latency Integration: Leveraging native Rust inference, HaluGate integrates seamlessly without the need for Python sidecars, ensuring minimal overhead.
- Actionable Transparency: Rich HTTP headers empower downstream systems to enforce custom policies and actions.
HaluGate ensures that if your LLM calls a tool, receives accurate data, yet still generates an incorrect answer, it will be detected and addressed proactively—before it ever reaches your users.