Token-Level Truth: Real-Time Hallucination Detection for Production LLMs


HaluGate delivers real-time, token-level hallucination detection for production LLMs. Leveraging tool context as ground truth, it intercepts unsupported claims before delivery, offering fast, explainable, and cost-effective verification for reliable AI deployment.

Imagine your Large Language Model (LLM) calls a tool, retrieves accurate data, yet still delivers an incorrect answer. This is the challenge of extrinsic hallucination, where models confidently disregard factual ground truth. Building upon our existing Signal-Decision Architecture, we are proud to introduce HaluGate. HaluGate is a conditional, token-level hallucination detection pipeline designed to identify and intercept unsupported claims before they ever reach your users. It operates without relying on an LLM-as-judge or a separate Python runtime, offering fast, explainable verification precisely at the point of delivery.

The Problem: Hallucinations Impede Production LLM Deployment

Hallucinations represent the most significant obstacle to deploying LLMs effectively in production environments. Across diverse sectors—including legal (fabricated case citations), healthcare (incorrect drug interactions), finance (invented financial data), and customer service (non-existent policies)—a consistent pattern emerges: AI systems generate plausible-sounding content that appears authoritative but fails upon closer examination. The critical issue isn't overt nonsense, but rather subtle fabrications embedded within otherwise accurate responses. These errors often demand specialized domain expertise or external verification to detect. For enterprises, this inherent uncertainty transforms LLM deployment into a potential liability instead of a valuable asset.

Scenario: When Tools Provide Correct Data, But Models Fail

To illustrate this challenge, consider a typical function-calling interaction:

User: “When was the Eiffel Tower built?”

Tool Call: get_landmark_info("Eiffel Tower")

Tool Response:

{"name": "Eiffel Tower", "built": "1887-1889", "height": "330 meters", "location": "Paris, France"}

LLM Response: “The Eiffel Tower was built in 1950 and stands at 500 meters tall in Paris, France.”

In this example, the tool successfully retrieved accurate data. However, the LLM’s response, while appearing factual, contains two fabricated elements—extrinsic hallucinations that directly contradict the provided context. This failure mode is particularly deceptive because:

  • Users tend to trust the output, as they observe a tool was invoked.
  • Traditional content filters often miss such errors, as they don't involve toxic or harmful content.
  • Evaluation becomes costly if relying on another LLM to judge accuracy.

This raises a crucial question: What if we could automatically detect such errors in real-time, with millisecond latency?

The Insight: Leveraging Function Calling as Ground Truth

The fundamental insight is that modern function-calling APIs inherently provide crucial grounding context. When users pose factual questions, LLMs invoke tools for tasks such as database lookups, API calls, or document retrieval. The results from these tool calls are semantically equivalent to retrieved documents in a Retrieval-Augmented Generation (RAG) system.

This eliminates the need for separate retrieval infrastructure or relying on powerful models like GPT-4 as judges. Instead, we extract three essential components directly from the existing API flow:

| Component | Source | Purpose |
| --- | --- | --- |
| Context | Tool message content | Ground truth for verification |
| Question | User message | Intent understanding |
| Answer | Assistant response | Claims to verify |

The core question then becomes: Is the answer faithful to the provided context?
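To make the extraction concrete, here is a minimal Python sketch of pulling the (context, question, answer) triple out of an OpenAI-style chat transcript. This is illustrative only: the role names follow the OpenAI chat format, and HaluGate's actual extraction runs natively in Go/Rust, not Python.

```python
def extract_verification_triple(messages):
    """Extract (context, question, answer) from an OpenAI-style chat
    transcript. Tool results become the grounding context; the latest
    user and assistant turns supply the question and the answer."""
    context = "\n".join(m["content"] for m in messages if m["role"] == "tool")
    questions = [m["content"] for m in messages if m["role"] == "user"]
    answers = [m["content"] for m in messages if m["role"] == "assistant"]
    return (
        context,                              # ground truth for verification
        questions[-1] if questions else "",   # intent understanding
        answers[-1] if answers else "",       # claims to verify
    )
```

No retrieval infrastructure is involved: every component is already present in the request/response flow.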

Why Not Rely on an LLM-as-Judge?

The seemingly straightforward approach of using another LLM for verification introduces fundamental challenges in a production environment:

| Approach | Latency | Cost | Explainability |
| --- | --- | --- | --- |
| GPT-4 as judge | 2-5 seconds | $0.01-0.03/request | Low (black box) |
| Local LLM judge | 500ms-2s | GPU compute | Low |
| HaluGate | 76-162ms | CPU only | High (token-level + NLI) |

Furthermore, LLM-based judges often exhibit several biases:

  • Position bias: A tendency to favor specific answer positions.
  • Verbosity bias: Longer answers may be rated higher, irrespective of their actual accuracy.
  • Self-preference: Models may favor outputs that align with their own stylistic patterns.
  • Inconsistency: Identical inputs can sometimes yield different judgments.

These limitations underscored the need for a solution that is faster, more cost-effective, and provides superior explainability.

HaluGate: A Two-Stage Detection Pipeline

HaluGate employs a conditional, two-stage pipeline meticulously designed to balance efficiency with detection precision.

Stage 1: HaluGate Sentinel (Prompt Classification)

Not every query necessitates hallucination detection. For instance, consider the following prompt types:

| Prompt | Needs Fact-Check? | Reason |
| --- | --- | --- |
| “When was Einstein born?” | ✅ Yes | Verifiable fact |
| “Write a poem about autumn” | ❌ No | Creative task |
| “Debug this Python code” | ❌ No | Technical assistance |
| “What’s your opinion on AI?” | ❌ No | Opinion request |
| “Is the Earth round?” | ✅ Yes | Factual claim |

Applying token-level detection to creative writing or code review tasks is inefficient and risks generating false positives (e.g., "your poem contains unsupported claims!").

The Importance of Pre-classification: Token-level detection scales linearly with context length. For example, a 4K token RAG context might require ~125ms for detection, while 16K tokens could take ~365ms. In typical production workloads, approximately 35% of queries are non-factual. Pre-classification offers a 72.2% efficiency gain by completely bypassing expensive detection for creative, coding, and opinion-based queries.

HaluGate Sentinel is a ModernBERT-based classifier specifically designed to answer one critical question: Does this prompt warrant factual verification?

This model is trained on a meticulously curated dataset comprising:

  • Fact-Check Needed (Positive Class):
    • Question Answering: SQuAD, TriviaQA, Natural Questions, HotpotQA
    • Truthfulness: TruthfulQA (addressing common misconceptions)
    • Hallucination Benchmarks: HaluEval, FactCHD
    • Information-Seeking Dialogue: FaithDial, CoQA
    • RAG Datasets: neural-bridge/rag-dataset-12000
  • No Fact-Check Needed (Negative Class):
    • Creative Writing: WritingPrompts, story generation
    • Code: CodeSearchNet docstrings, programming tasks
    • Opinion/Instruction: Dolly non-factual, Alpaca creative

This binary classification achieves a 96.4% validation accuracy with an inference latency of approximately 12ms through native Rust/Candle integration.

Stage 2: Token-Level Detection with NLI Explanation

For prompts identified as fact-seeking, HaluGate initiates a sophisticated two-model detection pipeline.

Token-Level Hallucination Detection

In contrast to sentence-level classifiers that provide a singular "hallucinated/not hallucinated" label, token-level detection precisely identifies which specific tokens within a response are unsupported by the provided context.

The model architecture for token-level detection follows this structure:

Input: [CLS] context [SEP] question [SEP] answer [SEP]
        ↓
ModernBERT Encoder
        ↓
Token Classification Head (binary per token)
        ↓
Label: 0 = Supported, 1 = Hallucinated (for answer tokens only)

Key design principles include:

  • Answer-only classification: Only tokens within the answer segment are classified, excluding context or question tokens.
  • Span merging: Consecutive hallucinated tokens are merged into unified spans to enhance readability.
  • Confidence thresholding: A configurable threshold (default 0.8) is applied to balance precision and recall.

NLI Explanation Layer

Merely knowing that something is a hallucination is insufficient; understanding why is crucial. Our Natural Language Inference (NLI) model classifies each detected span against its context:

| NLI Label | Meaning | Severity | Action |
| --- | --- | --- | --- |
| CONTRADICTION | Claim conflicts with context | 4 (High) | Flag as error |
| NEUTRAL | Claim not supported by context | 2 (Medium) | Flag as unverifiable |
| ENTAILMENT | Context supports the claim | 0 | Filter false positive |

How the Ensemble Works: Token-level detection alone yields only 59% F1 on the hallucinated class, meaning nearly half of all hallucinations are missed, and one-third of flags are false positives. While we experimented with a unified 5-class model (e.g., SUPPORTED/CONTRADICTION/FABRICATION), it achieved a mere 21.7% F1. This highlights that token-level classification on its own struggles to discern why something is incorrect. The two-stage approach transforms a basic detector into an actionable system: the initial token-level detection provides recall (identifying potential issues), while the NLI layer enhances precision (filtering false positives) and offers crucial explainability (categorizing why each span is problematic).
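The ensemble's second stage can be sketched in a few lines of Python. This is a simplified stand-in, not the production Go/Rust code: `nli_classify` is a hypothetical callable representing the HaluGate Explainer, and the severity values mirror the table above.

```python
SEVERITY = {"ENTAILMENT": 0, "NEUTRAL": 2, "CONTRADICTION": 4}

def explain_spans(spans, nli_classify):
    """Second stage of the ensemble: run NLI over each detector span,
    dropping ENTAILMENT spans (likely detector false positives) and
    attaching a severity label to everything that remains."""
    kept = []
    for span in spans:
        label = nli_classify(span)
        if label == "ENTAILMENT":
            continue  # context supports the claim: filter the false positive
        kept.append({"span": span, "label": label, "severity": SEVERITY[label]})
    return kept
```

The division of labor is deliberate: the token-level detector is tuned for recall, and this NLI pass restores precision while labeling why each surviving span is problematic.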

Integration with the Signal-Decision Architecture

HaluGate is not a standalone component; it is deeply integrated into our Signal-Decision Architecture as a novel signal type and plugin.

fact_check as a Signal Type

Just as our architecture utilizes keyword, embedding, and domain signals, fact_check is now a first-class signal type, enabling robust conditional logic.

This integration allows decisions to be precisely conditioned on whether a query is fact-seeking. It's important to note that even advanced frontier models exhibit hallucination variance between releases (e.g., GPT-5.2's system card shows a measurable hallucination delta compared to previous versions), underscoring the critical need for continuous verification regardless of model sophistication.

decisions:
  -
    name: "factual-query-with-verification"
    priority: 100
    rules:
      operator: "AND"
      conditions:
        -
          type: "fact_check"
          name: "needs_fact_check"
        -
          type: "domain"
          name: "general"
    plugins:
      -
        type: "hallucination"
        configuration:
          enabled: true
          use_nli: true
          hallucination_action: "header"

Request-Response Context Propagation

A key engineering challenge lies in propagating state across the request-response boundary, as classification occurs at request time while detection happens at response time.

The RequestContext structure efficiently carries all necessary state:

RequestContext:
  # Classification results (set at request time)
  FactCheckNeeded: true
  FactCheckConfidence: 0.87

  # Tool context (extracted at request time)
  HasToolsForFactCheck: true
  ToolResultsContext: "Built 1887-1889, 330 meters..."
  UserContent: "When was the Eiffel Tower built?"

  # Detection results (set at response time)
  HallucinationDetected: true
  HallucinationSpans: [
    "1950",
    "500 meters"
  ]
  HallucinationConfidence: 0.92
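As a rough Python analogue of the structure above (the production struct lives in Go; field names here simply mirror the sketch):

```python
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    """Illustrative mirror of the RequestContext sketched above."""
    # Classification results (set at request time)
    fact_check_needed: bool = False
    fact_check_confidence: float = 0.0
    # Tool context (extracted at request time)
    has_tools_for_fact_check: bool = False
    tool_results_context: str = ""
    user_content: str = ""
    # Detection results (set at response time)
    hallucination_detected: bool = False
    hallucination_spans: list = field(default_factory=list)
    hallucination_confidence: float = 0.0
```

A single instance is created at request time, carried across the proxy boundary, and filled in as the response streams back.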

The hallucination Plugin

The hallucination plugin is configured on a per-decision basis, allowing for granular control over its behavior:

plugins:
  -
    type: "hallucination"
    configuration:
      enabled: true
      use_nli: true                 # Enable NLI explanations
      hallucination_action: "header" # Action when hallucination detected ("header" | "body" | "block" | "none")
      unverified_factual_action: "header" # Action when fact-check needed but no tool context
      include_hallucination_details: true # Include detailed info in response

| Action | Behavior |
| --- | --- |
| header | Adds warning headers, allows response to pass through |
| body | Injects a warning directly into the response body |
| block | Returns an error response, preventing LLM output |
| none | Logs the event only, with no user-visible action |
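A minimal Python sketch of dispatching on `hallucination_action` might look as follows. The dict-based response object and the 403 status for `block` are assumptions for illustration; the header names match those documented later in this post.

```python
def apply_action(action, response, spans):
    """Apply the configured hallucination_action to a response.
    `response` is a simplified dict stand-in with "headers" and "body"."""
    if action == "header":
        response["headers"]["x-vsr-hallucination-detected"] = "true"
        response["headers"]["x-vsr-hallucination-spans"] = "; ".join(spans)
    elif action == "body":
        response["body"] = "[warning: unsupported claims detected]\n" + response["body"]
    elif action == "block":
        # Assumed status code; the real gateway may differ.
        response = {"status": 403, "headers": {}, "body": "blocked: hallucination detected"}
    # action == "none": log only; the response passes through untouched
    return response
```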

Response Headers: Actionable Transparency

Detection results are transparently communicated via HTTP headers, empowering downstream systems to implement custom policies:

HTTP/1.1 200 OK
Content-Type: application/json
x-vsr-fact-check-needed: true
x-vsr-hallucination-detected: true
x-vsr-hallucination-spans: 1950; 500 meters
x-vsr-nli-contradictions: 2
x-vsr-max-severity: 4

For factual responses that cannot be verified due to a lack of available tools, specific headers provide transparency:

HTTP/1.1 200 OK
x-vsr-fact-check-needed: true
x-vsr-unverified-factual-response: true
x-vsr-verification-context-missing: true

These headers facilitate various actions:

  • UI Disclaimers: Display warnings to users when confidence levels are low.
  • Human Review Queues: Route flagged responses for manual expert review.
  • Audit Logging: Track unverified claims for compliance and analytical purposes.
  • Conditional Blocking: Automatically block responses containing high-severity contradictions.
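The policies above can be driven entirely by the `x-vsr-*` headers. Here is one possible downstream routing sketch in Python (the policy thresholds and return labels are examples, not part of HaluGate itself):

```python
def route_response(headers):
    """Example downstream policy keyed off HaluGate's response headers."""
    severity = int(headers.get("x-vsr-max-severity", "0"))
    if severity >= 4:
        return "block"          # direct contradiction: stop the response
    if headers.get("x-vsr-hallucination-detected") == "true":
        return "human-review"   # flagged but lower severity: queue for review
    if headers.get("x-vsr-unverified-factual-response") == "true":
        return "ui-disclaimer"  # factual claim with no tool context: warn the user
    return "pass"
```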

The Complete Pipeline: Three Distinct Paths

The HaluGate pipeline manages requests through three distinct paths:

| Path | Condition | Latency Added | Action |
| --- | --- | --- | --- |
| Path 1 | Non-factual prompt | ~12ms (classifier only) | Passes through |
| Path 2 | Factual + no tools | ~12ms | Adds warning headers |
| Path 3 | Factual + tools available | 76-162ms | Full detection + headers |
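The routing logic reduces to two booleans, sketched here in Python for clarity:

```python
def choose_path(fact_check_needed, has_tool_context):
    """Route a request down one of the three pipeline paths."""
    if not fact_check_needed:
        return 1  # non-factual: classifier only, pass straight through
    if not has_tool_context:
        return 2  # factual but unverifiable: add warning headers
    return 3      # factual with ground truth: full detection + NLI
```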

Model Architecture Deep Dive

Let's examine the three specialized models that collectively power HaluGate:

HaluGate Sentinel: Binary Prompt Classification

  • Architecture: ModernBERT-base with a LoRA adapter and a binary classification head.
  • Training:
    • Base Model: answerdotai/ModernBERT-base
    • Fine-tuning: LoRA (rank=16, alpha=32, dropout=0.1)
    • Training Data: 50,000 samples meticulously gathered from 14 diverse datasets.
    • Loss Function: CrossEntropy with class weights to address data imbalance.
    • Optimization: AdamW, learning rate=2e-5, trained over 3 epochs.
  • Inference:
    • Input: Raw prompt text.
    • Output: (class_id, confidence).
    • Latency: Approximately 12ms on CPU.

The LoRA (Low-Rank Adaptation) approach enables efficient fine-tuning, preserving the vast pre-trained knowledge while only updating a small fraction of parameters (2.2%, or 3.4M out of 149M) during training.

HaluGate Detector: Token-Level Binary Classification

  • Architecture: ModernBERT-base coupled with a token classification head.
  • Input Format:

    [CLS] The Eiffel Tower was built in 1887-1889 and is 330 meters tall. [SEP]   ← context
    When was the Eiffel Tower built? [SEP]                                        ← question
    The Eiffel Tower was built in 1950 and is 500 meters tall. [SEP]              ← answer tokens (classification targets)
  • Output: A binary label (0=Supported, 1=Hallucinated) for each token within the answer.
  • Post-processing Steps:
    • Predictions are filtered to apply only to the answer segment.
    • A configurable confidence threshold (default: 0.8) is applied.
    • Consecutive hallucinated tokens are merged into unified spans.
    • The system returns these spans along with their confidence scores.
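The thresholding and span-merging steps can be sketched as follows. This is an illustrative Python version of the post-processing logic, assuming per-token hallucination probabilities are already available from the detector:

```python
def merge_spans(tokens, probs, threshold=0.8):
    """Merge consecutive tokens whose hallucination probability clears
    the threshold into unified spans, returning (text, confidence)
    pairs where confidence is the span's maximum token probability."""
    spans, current, conf = [], [], 0.0
    for tok, p in zip(tokens, probs):
        if p >= threshold:
            current.append(tok)
            conf = max(conf, p)
        elif current:
            spans.append((" ".join(current), conf))
            current, conf = [], 0.0
    if current:  # flush a span that runs to the end of the answer
        spans.append((" ".join(current), conf))
    return spans
```

Lowering the threshold trades precision for recall; the 0.8 default matches the configuration shown elsewhere in this post.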

HaluGate Explainer: Three-Way NLI Classification

  • Architecture: ModernBERT-base, specifically fine-tuned for Natural Language Inference (NLI).

  • Input Format:

    [CLS] The Eiffel Tower was built in 1887-1889. [SEP]   ← premise (context)
    built in 1950 [SEP]                                    ← hypothesis (span)
  • Output: A three-way classification with confidence scores:

    • ENTAILMENT (0): The context unequivocally supports the claim.
    • NEUTRAL (1): The claim cannot be definitively determined from the context.
    • CONTRADICTION (2): The context directly conflicts with the claim.
  • Severity Mapping:

| NLI Label | Severity Score | Interpretation |
| --- | --- | --- |
| ENTAILMENT | 0 | Likely false positive; filter out |
| NEUTRAL | 2 | Claim is unverifiable |
| CONTRADICTION | 4 | Direct factual error |

The Advantages of Native Rust/Candle Implementation

All three HaluGate models execute natively using Candle, Hugging Face’s machine learning framework written in Rust, leveraging CGO bindings for Go integration.

This architectural choice provides significant benefits:

| Aspect | Python (PyTorch) | Native (Candle) |
| --- | --- | --- |
| Cold start | 5-10 seconds | <500ms |
| Memory | 2-4GB per model | 500MB-1GB per model |
| Latency | +50-100ms overhead | Near-zero overhead |
| Deployment | Python runtime required | Single binary |
| Scaling | GIL contention | True parallelism |

This native implementation eliminates the need for a separate Python service, sidecar containers, or external model servers, ensuring that all processing runs efficiently in-process.

Latency Breakdown

The following table outlines the measured latency for each component within the HaluGate production pipeline:

| Component | P50 | P99 | Notes |
| --- | --- | --- | --- |
| Fact-check classifier | 12ms | 28ms | ModernBERT inference |
| Tool context extraction | 1ms | 3ms | JSON parsing |
| Hallucination detector | 45ms | 89ms | Token classification |
| NLI explainer | 18ms | 42ms | Per-span classification |
| Total overhead (when detection runs) | 76ms | 162ms | Sum of the stages above |

This 76-162ms overhead is negligible compared with typical LLM generation times of 5 to 30 seconds, making HaluGate practical for synchronous request processing.

Configuration Reference

Below is a complete reference for configuring HaluGate's hallucination mitigation:

hallucination_mitigation:
  # Stage 1: Prompt classification
  fact_check_model:
    model_id: "models/halugate-sentinel"
    threshold: 0.6 # Confidence threshold for FACT_CHECK_NEEDED
    use_cpu: true

  # Stage 2a: Token-level detection
  hallucination_model:
    model_id: "models/halugate-detector"
    threshold: 0.8 # Token confidence threshold
    use_cpu: true

  # Stage 2b: NLI explanation
  nli_model:
    model_id: "models/halugate-explainer"
    threshold: 0.9 # NLI confidence threshold
    use_cpu: true

# Signal rules for fact-check classification
fact_check_rules:
  -
    name: needs_fact_check
    description: "Query contains factual claims that should be verified"
  -
    name: no_fact_check_needed
    description: "Query is creative, code-related, or opinion-based"

# Decision with hallucination plugin
decisions:
  -
    name: "verified-factual"
    priority: 100
    rules:
      operator: "AND"
      conditions:
        -
          type: "fact_check"
          name: "needs_fact_check"
    plugins:
      -
        type: "hallucination"
        configuration:
          enabled: true
          use_nli: true
          hallucination_action: "header"
          unverified_factual_action: "header"
          include_hallucination_details: true

Beyond Production: HaluGate as an Evaluation Framework

While HaluGate is primarily engineered for real-time production deployment, its robust pipeline is equally effective for offline model evaluation. Rather than intercepting live requests, HaluGate can process benchmark datasets to systematically quantify hallucination rates across various LLMs.

Evaluation Workflow

  1. Load Dataset: Utilize established QA/RAG benchmarks (e.g., TriviaQA, Natural Questions, HotpotQA) or integrate custom enterprise datasets featuring context-question pairs.
  2. Generate Responses: Execute the LLM under test against each query, providing the relevant context.
  3. Detect Hallucinations: Feed the (context, query, response) triples through the HaluGate Detector.
  4. Classify Severity: Employ the HaluGate Explainer to categorize the severity and type of each identified hallucinated span.
  5. Aggregate Metrics: Calculate crucial metrics such as overall hallucination rates, contradiction ratios, and detailed breakdowns by category.
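Step 5 can be sketched as a simple aggregation over per-response results. The `{"labels": [...]}` shape below is a hypothetical per-response output format (one NLI label per detected span), chosen for illustration:

```python
def aggregate_metrics(results):
    """Aggregate per-response detection output into benchmark-level
    metrics: the fraction of responses with any flagged span, and the
    fraction of flagged spans labeled CONTRADICTION."""
    total = len(results)
    with_spans = sum(1 for r in results if r["labels"])
    all_labels = [label for r in results for label in r["labels"]]
    contradictions = sum(1 for label in all_labels if label == "CONTRADICTION")
    return {
        "hallucination_rate": with_spans / total if total else 0.0,
        "contradiction_ratio": contradictions / len(all_labels) if all_labels else 0.0,
    }
```

Breakdowns by category or severity follow the same pattern with additional grouping keys.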

Limitations and Scope

HaluGate is specifically designed to address extrinsic hallucinations, which occur when verification can be grounded by tool or RAG context. It has specific limitations:

What HaluGate Cannot Detect

| Limitation | Example | Reason |
| --- | --- | --- |
| Intrinsic hallucinations | Model states “Einstein was born in 1900” without any tool call | No external context for verification |
| No-context scenarios | User asks a factual question, but no tools are defined | Absence of ground truth |

Transparent Degradation

For requests classified as fact-seeking but where tool context is unavailable, HaluGate explicitly flags responses as “unverified factual” instead of silently permitting them. This is indicated by specific HTTP headers:

x-vsr-fact-check-needed: true
x-vsr-unverified-factual-response: true
x-vsr-verification-context-missing: true

This transparent approach empowers downstream systems to manage uncertainty and unverified claims appropriately.

Conclusion

HaluGate introduces a principled and robust approach to hallucination detection for production LLM deployments, characterized by:

  • Conditional Verification: Non-factual queries are intelligently skipped, while factual ones undergo rigorous verification.
  • Token-Level Precision: It precisely identifies which specific claims within a response are unsupported by the context.
  • Explainable Results: The NLI classification provides clear insights into why a particular claim is deemed problematic.
  • In-Process Integration: Leveraging native Rust inference, HaluGate runs without Python sidecars or model servers, keeping added latency to tens of milliseconds.
  • Actionable Transparency: Rich HTTP headers empower downstream systems to enforce custom policies and actions.

HaluGate ensures that if your LLM calls a tool, receives accurate data, yet still generates an incorrect answer, it will be detected and addressed proactively—before it ever reaches your users.