Four-Phase Pipeline
Every document in Talonic passes through a four-phase extraction pipeline: Resolve, Agent, Validate, and Re-read. Together, the phases produce a confidence score and per-cell provenance for every extracted value.
Pipeline overview
| Phase | Purpose | Output |
|---|---|---|
| 1. Resolve | Parse document, classify type, match schema fields to Field Registry | Document model with field mappings |
| 2. Agent | AI extraction of values from document regions | Raw extraction with bounding boxes |
| 3. Validate | Type checking, confidence scoring, constraint enforcement | Validated data with confidence scores |
| 4. Re-read | Cross-check low-confidence values against source | Final data with provenance traces |
Phase 1: Resolve
The Resolve phase parses the document into a structured model. It identifies the document type using the 529-type ontology, maps schema fields to Field Registry entries, and prepares the extraction plan. For documents submitted without a schema, this phase also performs field discovery.
Phase 2: Agent
The Agent phase runs AI extraction against the document model. It locates values in the source document and extracts them with bounding box coordinates. Each extraction is tied to a specific region of the original document, forming the basis for per-cell provenance.
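As a rough illustration of how an extraction can be tied to a document region, the sketch below models one raw extraction with its bounding box. The class names, field names, and the normalized coordinate convention are assumptions made for this example, not Talonic's actual output format.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BoundingBox:
    # Hypothetical layout: page number plus a rectangle in
    # normalized page coordinates (0.0 to 1.0).
    page: int
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class RawExtraction:
    # One extracted cell: the schema field, the extracted text,
    # and the region of the source document it came from.
    field_name: str
    value: str
    box: BoundingBox


def to_record(extraction: RawExtraction) -> dict:
    """Flatten an extraction into a plain dict, e.g. for JSON output."""
    return asdict(extraction)
```

Keeping the box on the extraction itself means any later phase can point back to the exact region that produced a value.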
Phase 3: Validate
Validation applies type constraints, format checks, and the confidence gate. Values whose confidence score is above the gate threshold are accepted automatically; values below the gate are flagged for human review. The Schema Graph enforces field-level constraints from the schema definition.
Confidence gate
The confidence gate is a configurable threshold (default 0.85) that determines whether extracted values are accepted automatically or flagged for review. Adjust the gate per schema to balance automation with accuracy. Values below the gate remain in the extraction output but are marked as requiring human verification.
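The gate logic can be sketched in a few lines. This is a minimal illustration, assuming confidence scores lie between 0 and 1 and that a value exactly at the threshold is accepted; `CellValue`, `apply_gate`, and `needs_review` are hypothetical names, not part of the Talonic API.

```python
from dataclasses import dataclass
from typing import List

DEFAULT_GATE = 0.85  # default threshold stated in the text


@dataclass
class CellValue:
    field_name: str
    value: str
    confidence: float
    needs_review: bool = False  # hypothetical marker for human verification


def apply_gate(cells: List[CellValue], gate: float = DEFAULT_GATE) -> List[CellValue]:
    """Mark values below the gate; nothing is removed from the output."""
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must be between 0 and 1")
    for cell in cells:
        # Assumption: a score equal to the gate counts as accepted.
        cell.needs_review = cell.confidence < gate
    return cells
```

Raising the gate sends more values to review and lowers the risk of accepting a wrong value; lowering it does the opposite.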
Phase 4: Re-read
The Re-read phase cross-checks flagged and low-confidence values against the source document. It performs a second pass using different extraction strategies to improve accuracy. Values whose confidence rises above the gate after re-reading are accepted. The reasoning trace records both extraction attempts.
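The decision made after a second pass can be sketched as follows. This is an illustration under stated assumptions: the names are hypothetical, and keeping the higher-confidence attempt when both stay below the gate is a choice made for the example, not documented Talonic behavior.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attempt:
    strategy: str      # hypothetical label for the extraction strategy
    value: str
    confidence: float


@dataclass
class RereadResult:
    value: str
    confidence: float
    accepted: bool
    trace: List[Attempt] = field(default_factory=list)


def resolve_reread(first: Attempt, second: Attempt, gate: float = 0.85) -> RereadResult:
    """Combine the original extraction and the re-read into one result."""
    trace = [first, second]  # both attempts are kept for the trace
    if second.confidence >= gate:
        # The re-read cleared the gate, so its value is accepted.
        return RereadResult(second.value, second.confidence, True, trace)
    # Assumption: otherwise keep the more confident attempt, still flagged.
    best = max(trace, key=lambda a: a.confidence)
    return RereadResult(best.value, best.confidence, False, trace)
```

Returning the full trace alongside the chosen value lets a reviewer see what each pass produced and why the final value was selected.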
Monitoring pipeline progress
The extract endpoint returns a job ID for asynchronous tracking. Use that ID to track pipeline progress via the jobs API: each job reports its current phase and completion percentage. To avoid polling, use webhooks to receive a notification when extraction completes.
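A polling loop for job progress might look like the sketch below. The actual jobs API request is left to a caller-supplied `fetch_status` function, because the endpoint path and response shape are not given here; the `status`, `phase`, and `percent` keys and the terminal status names are assumptions for illustration.

```python
import time
from typing import Any, Callable, Dict, Optional

# Hypothetical terminal states; check the jobs API for the real names.
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    job_id: str,
    fetch_status: Callable[[str], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    on_progress: Optional[Callable[[Dict[str, Any]], None]] = None,
) -> Dict[str, Any]:
    """Poll until the job reaches a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status(job_id)  # caller wraps the real API call
        if on_progress is not None:
            on_progress(status)  # e.g. log phase and percent
        if status.get("status") in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
        sleep(interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any HTTP client and easy to test. For long-running jobs, a webhook is usually the better fit than polling.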