Four-Phase Pipeline

Every document in Talonic passes through a four-phase extraction pipeline: Resolve, Agent, Validate, and Re-read. Together, the phases attach a confidence score and per-cell provenance to every extracted value.

Pipeline overview

| Phase | Purpose | Output |
| --- | --- | --- |
| 1. Resolve | Parse document, classify type, match schema fields to Field Registry | Document model with field mappings |
| 2. Agent | AI extraction of values from document regions | Raw extraction with bounding boxes |
| 3. Validate | Type checking, confidence scoring, constraint enforcement | Validated data with confidence scores |
| 4. Re-read | Cross-check low-confidence values against source | Final data with provenance traces |

Phase 1: Resolve

The Resolve phase parses the document into a structured model. It identifies the document type using the 529-type ontology, maps schema fields to Field Registry entries, and prepares the extraction plan. For documents submitted without a schema, this phase also performs field discovery.
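The field-mapping step can be pictured as a lookup from schema field names to registry entries. The sketch below is illustrative only: `RegistryEntry`, `normalize`, `map_fields`, and the sample registry IDs are hypothetical names, not Talonic's data model or API, and real matching is presumably more sophisticated than alias lookup.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class RegistryEntry:
    """Hypothetical Field Registry entry (illustrative, not Talonic's model)."""
    registry_id: str
    aliases: Tuple[str, ...]


def normalize(name: str) -> str:
    # Lowercase and keep only letters and digits so that
    # "Invoice_No" and "invoice no" compare equal.
    return "".join(ch for ch in name.lower() if ch.isalnum())


def map_fields(schema_fields: List[str], registry: List[RegistryEntry]):
    """Return (mappings, unmatched) for the given schema field names."""
    index: Dict[str, str] = {}
    for entry in registry:
        for alias in entry.aliases:
            index[normalize(alias)] = entry.registry_id

    mappings: Dict[str, str] = {}
    unmatched: List[str] = []
    for field_name in schema_fields:
        registry_id = index.get(normalize(field_name))
        if registry_id is None:
            unmatched.append(field_name)  # candidate for manual mapping
        else:
            mappings[field_name] = registry_id
    return mappings, unmatched


# Sample data, invented for illustration.
registry = [
    RegistryEntry("invoice.number", ("invoice number", "invoice_no")),
    RegistryEntry("invoice.total", ("total", "amount due")),
]
mappings, unmatched = map_fields(["Invoice_No", "Amount Due", "PO ref"], registry)
```

Here `unmatched` collects schema fields with no registry entry, which is the kind of gap an extraction plan would need to surface before the Agent phase runs.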

Phase 2: Agent

The Agent phase runs AI extraction against the document model. It locates values in the source document and extracts them with bounding box coordinates. Each extraction is tied to a specific region of the original document, forming the basis for per-cell provenance.
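To make the link between a value and its source region concrete, here is a minimal sketch of what an extraction record with a bounding box might look like. The class names, field names, and coordinate convention (page index plus corner coordinates) are assumptions for illustration, not Talonic's actual output format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Hypothetical region: page index plus top-left and bottom-right corners."""
    page: int
    x0: float
    y0: float
    x1: float
    y1: float

    def contains(self, x: float, y: float) -> bool:
        return self.x0 <= x <= self.x1 and self.y0 <= y <= self.y1


@dataclass(frozen=True)
class Extraction:
    """One extracted cell, tied to the region it was read from."""
    field: str
    value: str
    box: BoundingBox


def fields_at(extractions: List[Extraction], page: int, x: float, y: float) -> List[str]:
    """Return the fields whose source region covers a point on a page."""
    return [
        e.field
        for e in extractions
        if e.box.page == page and e.box.contains(x, y)
    ]


# Sample data, invented for illustration.
extractions = [
    Extraction("invoice_number", "INV-1042", BoundingBox(0, 400, 50, 560, 70)),
    Extraction("total", "1250.00", BoundingBox(0, 400, 700, 560, 720)),
]
hit = fields_at(extractions, page=0, x=450, y=710)
```

Keeping the region on every record is what lets a reviewer jump from a cell back to the exact spot in the original document.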

Phase 3: Validate

Validation applies type constraints, format checks, and the confidence gate. Each value receives a confidence score: values whose score clears the gate threshold are accepted, and values below the gate are flagged for human review. The Schema Graph enforces field-level constraints from the schema definition.
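As a rough illustration of type and format checks, the sketch below validates raw string values against a per-field type. The type names (`date`, `amount`), the `CHECKS` table, and the error messages are hypothetical; the real constraints come from the schema definition and are not described here.

```python
import re
from datetime import date
from typing import Callable, Dict


def is_iso_date(raw: str) -> bool:
    # date.fromisoformat raises ValueError for malformed or impossible dates.
    try:
        date.fromisoformat(raw)
        return True
    except ValueError:
        return False


def is_amount(raw: str) -> bool:
    # Optional sign, digits, and an optional two-digit fraction.
    return re.fullmatch(r"-?\d+(\.\d{2})?", raw) is not None


# Hypothetical mapping from field type to its format check.
CHECKS: Dict[str, Callable[[str], bool]] = {
    "date": is_iso_date,
    "amount": is_amount,
}


def validate(field_types: Dict[str, str], values: Dict[str, str]) -> Dict[str, str]:
    """Return a dict of field -> error message; empty means all checks passed."""
    errors: Dict[str, str] = {}
    for field_name, kind in field_types.items():
        raw = values.get(field_name)
        if raw is None:
            errors[field_name] = "missing"
        elif not CHECKS[kind](raw):
            errors[field_name] = f"not a valid {kind}"
    return errors


# Sample data, invented for illustration.
errors = validate(
    {"issued": "date", "total": "amount", "due": "date"},
    {"issued": "2024-02-30", "total": "1250.00"},
)
```

In this sample, `issued` fails because 30 February does not exist, `due` fails because it is absent, and `total` passes.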

Confidence gate

The confidence gate is a configurable threshold (default 0.85) that determines whether extracted values are accepted automatically or flagged for review. Adjust the gate per schema to balance automation with accuracy. Values below the gate remain in the extraction output but are marked as requiring human verification.
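The gate itself amounts to a threshold comparison. The following sketch splits scored values into accepted and flagged groups using the documented default of 0.85. The function name and the treatment of a score exactly equal to the gate (accepted here) are assumptions, since the text only specifies what happens to values below the gate.

```python
from typing import Dict, Tuple

DEFAULT_GATE = 0.85  # default threshold stated in the documentation


def apply_gate(
    scores: Dict[str, float], gate: float = DEFAULT_GATE
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Split field -> confidence into (accepted, needs_review).

    Assumption: a score equal to the gate counts as accepted.
    Flagged values are kept, not dropped, so they stay in the output.
    """
    accepted: Dict[str, float] = {}
    needs_review: Dict[str, float] = {}
    for field_name, score in scores.items():
        if score >= gate:
            accepted[field_name] = score
        else:
            needs_review[field_name] = score
    return accepted, needs_review


# Sample scores, invented for illustration.
scores = {"invoice_number": 0.97, "total": 0.91, "due_date": 0.62}
accepted, needs_review = apply_gate(scores)

# A stricter, per-schema gate sends more values to human review.
strict_accepted, strict_review = apply_gate(scores, gate=0.95)
```

Raising the gate trades automation for accuracy: with the stricter threshold in the sample, `total` moves from the accepted group to the review group.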

Phase 4: Re-read

The Re-read phase cross-checks flagged and low-confidence values against the source document. It performs a second pass using different extraction strategies to improve accuracy. Values whose confidence rises above the gate after re-reading are accepted. The reasoning trace records both extraction attempts.
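One way to picture the second pass is sketched below: a value under the gate is re-extracted, the higher-confidence attempt wins, and both attempts are kept in a trace. The `Attempt` class, the strategy labels, and the choice to keep the better of the two attempts are assumptions for illustration; the documentation does not specify how the two passes are reconciled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Attempt:
    """One extraction attempt (hypothetical shape)."""
    strategy: str
    value: str
    confidence: float


def reread(first: Attempt, second_pass: Callable[[], Attempt], gate: float = 0.85) -> dict:
    """Re-extract a low-confidence value and keep every attempt in the trace."""
    trace: List[Attempt] = [first]
    if first.confidence < gate:
        # Only values below the gate trigger a second pass.
        trace.append(second_pass())
    best = max(trace, key=lambda attempt: attempt.confidence)
    return {
        "value": best.value,
        "confidence": best.confidence,
        "accepted": best.confidence >= gate,  # assumption: equal counts as accepted
        "trace": trace,
    }


# Sample attempts, invented for illustration.
result = reread(
    Attempt("first-pass", "1,250.00", 0.62),
    lambda: Attempt("second-pass", "1250.00", 0.93),
)
```

If the second pass still scores below the gate, `accepted` stays false and the value remains flagged for human review, with both attempts available in the trace.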

Monitoring pipeline progress

The extract endpoint returns a job ID for asynchronous tracking. Use that ID to track pipeline progress via the jobs API: each job reports its current phase and completion percentage. To avoid polling, use webhooks to receive a notification when extraction completes.
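A polling loop around the jobs API might look like the sketch below. It deliberately takes the HTTP call as a parameter, because the endpoint paths and response fields are not given here: `fetch_status`, the `state`, `phase`, and `progress` keys, and the `completed` and `failed` values are all hypothetical stand-ins for whatever the jobs API actually returns.

```python
import time
from typing import Callable, Dict


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal state or the timeout expires.

    fetch_status is a caller-supplied function that queries the jobs API
    and returns a dict with a "state" key (assumed shape).
    """
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)
        if status["state"] in ("completed", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
        sleep(interval)


# Simulated responses, invented for illustration; a real caller would
# pass a function that performs the HTTP request instead.
responses = iter([
    {"state": "running", "phase": "agent", "progress": 40},
    {"state": "running", "phase": "validate", "progress": 75},
    {"state": "completed", "phase": "re-read", "progress": 100},
])
final = wait_for_job(lambda job_id: next(responses), "job_123", sleep=lambda seconds: None)
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access. For long-running jobs, a webhook handler is usually the better fit than polling.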

Frequently asked questions