
What Is Agentic Data Routing? A Primer for Data Engineers

Wei Tan · November 3, 2025

The term "agentic" has been applied to everything from marketing copy to database migrations in the past two years, to the point where it's lost most of its technical precision. So before describing what agentic data routing actually is, it helps to define what it isn't.

It's not an LLM generating SQL queries. It's not a chatbot interface for your warehouse. It's not magic — there's no black-box model making opaque decisions about your data pipeline. Agentic data routing is a decision layer that monitors upstream state, evaluates routing policies you define, and executes transform changes without requiring manual intervention. The "agentic" part means the system acts — it doesn't just alert.

Traditional Orchestration vs. Agentic Routing: The Core Difference

Traditional orchestration tools — Airflow, Prefect, Dagster — are extremely good at scheduling and dependency resolution. You define a DAG: task A runs before task B, task C runs when tasks A and B succeed. If task A fails, the DAG halts at a defined point. That's deterministic, auditable, and reliable for stable pipelines.

The problem is that real data pipelines are not stable. Your upstream schema changes. A source system starts returning nulls for a column that was previously never null. A new data source comes online and needs to be merged with an existing model. In traditional orchestration, the response to all of these is: the DAG fails, an alert fires, a human intervenes, the DAG resumes. Human-in-the-loop for every schema event.

Agentic routing flips this. Instead of halting and alerting, the system has a pre-defined policy: "If column user_id is renamed in source table events, update the downstream transform to reference the new column name and re-run the affected models. Log the action. Alert only if the re-run fails or if the column rename is ambiguous (multiple candidate columns)." The human reviews the log; they don't have to wake up to perform the rerouting manually.

This requires the routing layer to understand the semantic relationship between source columns and transform models — which is why agentic routing is inseparable from the semantic layer. You can't route intelligently if you don't know what the data means.

The Architecture of an Agentic Routing Pipeline

A minimal agentic data routing system has four components:

1. Schema registry and snapshot store. The system maintains a versioned snapshot of every source schema it monitors. Each sync cycle, it diffs the incoming schema against the stored snapshot. Changes — column additions, renames, type changes, removals — are logged as events with timestamps.

2. Semantic model graph. The semantic model maps source columns to semantic concepts (dimensions, measures, entity keys). When a schema change event fires, the system can walk the graph to find every downstream model that references the changed column. This is column-level lineage, not just table-level.

3. Routing policy engine. Policies are defined declaratively — either in YAML or as structured rules. A policy says: "If a column is renamed, resolve the most likely new name using fuzzy matching against candidate columns (edit distance ≤ 2, same type) and update references automatically. If confidence is below threshold, queue for human review." Policies can also define circuit breakers: "If more than 30% of a source table's columns change in a single sync, treat as a breaking schema change and halt affected transforms pending review."

4. Execution layer with rollback. When the routing engine decides to reroute a transform, it generates the updated transform code, runs it in a dry-run mode to verify row counts and null rates are within expected bounds, then promotes to production. Every decision and its outcome is logged. If a rerouting decision causes a downstream model to fail validation, the system rolls back to the previous transform version automatically.
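The diffing, rename-matching, and circuit-breaker steps from components 1 and 3 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names diff_schema, route, and Decision are hypothetical, and so is the way changed columns are counted for the 30% breaker (a rename counts once). The matching rule is the one quoted above: same type and an edit distance of at most 2.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


@dataclass(frozen=True)
class Decision:
    action: str  # "auto_reroute", "review", or "halt"
    detail: str


def diff_schema(old: dict[str, str], new: dict[str, str]):
    """Compare two {column: type} snapshots; return (removed, added, retyped)."""
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    retyped = sorted(c for c in set(old) & set(new) if old[c] != new[c])
    return removed, added, retyped


def route(old, new, max_distance=2, breaker_ratio=0.30) -> list[Decision]:
    """Apply the rename policy and circuit breaker to one schema sync."""
    removed, added, retyped = diff_schema(old, new)
    # Assumption: a rename shows up as one removal plus one addition,
    # so count the larger of the two rather than their sum.
    changed = max(len(removed), len(added)) + len(retyped)
    if old and changed / len(old) > breaker_ratio:
        return [Decision("halt", f"{changed} of {len(old)} columns changed")]

    decisions = []
    for col in removed:
        candidates = [
            c for c in added
            if new[c] == old[col] and edit_distance(col, c) <= max_distance
        ]
        if len(candidates) == 1:
            decisions.append(Decision("auto_reroute", f"{col} -> {candidates[0]}"))
        else:
            # Zero or several candidates: ambiguous, so a human decides.
            decisions.append(Decision("review", f"{col}: {len(candidates)} candidates"))
    for col in retyped:
        decisions.append(Decision("review", f"{col}: type changed"))
    return decisions
```

With a four-column snapshot in which user_id becomes usr_id, route returns a single auto_reroute decision. If half the columns change in one sync, it returns a single halt decision instead.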

A Concrete Scenario: Column Rename Detection at a Growing Analytics Team

Consider an analytics team at an early-stage SaaS company managing order processing data. Their source system is a third-party CRM that periodically restructures its export schema. In February, the CRM renames account.client_id to account.customer_id, a routine normalization the vendor shipped as part of a schema cleanup.

Without agentic routing: the next dbt run fails on every model that references client_id. An on-call alert fires. The data engineer on call opens a PR, updates 14 model files, runs the models manually, verifies. Total elapsed time: two to three hours, depending on whether the engineer is available and how deeply nested the references are.

With agentic routing: the schema diff detects client_id → customer_id (same type: VARCHAR, same position in the schema). The raw edit distance between the two names is 7, well beyond a simple string-distance cutoff, so the match rests on type and position rather than name similarity alone. Confidence score: 0.94. The routing engine is configured to auto-reroute on confidence ≥ 0.90. It updates the 14 model references, re-runs the affected models, confirms row counts match the previous run within a 0.1% tolerance, and closes the event. Total elapsed time: four minutes. The engineer sees a Slack message: "Schema change resolved automatically. 14 models updated. Review log attached." They review it over coffee, not at 3am.
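Finding the affected models in a scenario like this is a walk over the column-level lineage graph described earlier. The sketch below assumes lineage is stored as a plain adjacency map of "model.column" strings. The function name, the graph, and the model names are all hypothetical and much smaller than the 14-model case.

```python
from collections import deque


def affected_models(lineage: dict[str, list[str]], changed_column: str) -> list[str]:
    """Breadth-first walk from a changed source column to all downstream models.

    Nodes are "model.column" strings; edges point from a column to the
    columns derived from it.
    """
    seen = {changed_column}
    queue = deque([changed_column])
    models = set()
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.add(child)
                models.add(child.split(".", 1)[0])
                queue.append(child)
    return sorted(models)


# Hypothetical lineage for the CRM example; model names are illustrative.
LINEAGE = {
    "account.client_id": ["stg_accounts.client_id"],
    "stg_accounts.client_id": ["dim_customers.customer_key", "fct_orders.customer_key"],
    "fct_orders.customer_key": ["rpt_revenue.customer_key"],
}

print(affected_models(LINEAGE, "account.client_id"))
# ['dim_customers', 'fct_orders', 'rpt_revenue', 'stg_accounts']
```

Tracking visited nodes keeps the walk safe on diamond-shaped dependencies, where two models feed the same downstream report.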

LLM Components: Where They Fit and Where They Don't

There are two places where language model capabilities genuinely improve agentic data routing, and a lot of places where they add latency and cost without value.

SQL synthesis for novel transform generation. When a new source table appears and needs to be integrated into an existing semantic model, you can use a small fine-tuned model to generate an initial transform — essentially few-shot prompting against your existing transform catalog. The LLM produces a first draft. A rule-based validator checks it for schema consistency and SQL safety before it runs. The LLM is not trusted for production SQL without the validator pass.
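As a rough illustration of what a rule-based validator pass might look like, the sketch below checks a drafted transform for three things: a single statement, a read-only shape, and references to known columns only. It is deliberately conservative and regex-based, not a SQL parser. It assumes column references are written as table.column with no aliases, and validate_draft and the schema map are hypothetical names.

```python
import re

# Keywords that should never appear in a read-only transform draft.
FORBIDDEN = {"insert", "update", "delete", "drop", "alter",
             "truncate", "create", "grant", "merge"}
# Matches table.column references (assumes no table aliases).
QUALIFIED = re.compile(r"\b([A-Za-z_][A-Za-z0-9_]*)\.([A-Za-z_][A-Za-z0-9_]*)\b")


def validate_draft(sql: str, schema: dict[str, set[str]]) -> list[str]:
    """Return a list of problems; an empty list means the draft passed."""
    problems = []
    body = sql.strip().rstrip(";").strip()
    if ";" in body:
        problems.append("multiple statements")
    # Underscores stay inside words, so updated_at is not read as "update".
    words = re.findall(r"[A-Za-z_]+", body.lower())
    if not words or words[0] not in {"select", "with"}:
        problems.append("must start with SELECT or WITH")
    for word in sorted(FORBIDDEN & set(words)):
        problems.append(f"forbidden keyword: {word}")
    for table, column in QUALIFIED.findall(body):
        if table not in schema:
            problems.append(f"unknown table: {table}")
        elif column not in schema[table]:
            problems.append(f"unknown column: {table}.{column}")
    return problems
```

A draft that fails any check is rejected before it reaches the dry run. Because the keyword check also fires on string literals, it can reject safe drafts, which is the intended bias for unreviewed generated SQL.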

Ambiguous rename resolution. When confidence is below threshold (edit distance > 2, a type mismatch, or multiple candidates), you can use an LLM to reason over column names and their descriptions from the semantic model. "Is cust_ref_code more likely to be the rename of client_id or billing_ref?" The LLM outputs a ranked list with reasoning. A human reviews the top candidate before the rerouting is applied.

We're not saying LLMs are bad for data pipelines — the synthesis and disambiguation use cases are real. We're saying that running every routing decision through a large model is expensive, slow, and unnecessary when 80% of schema drift events are simple renames detectable with string distance metrics.

Cost Guardrails and Observability

Any agentic system running autonomously against production data needs hard cost guardrails. For data routing specifically, this means:

Compute budget per routing event — how many models can be re-run automatically before the system pauses and requests human review. For most teams, this is 20-50 models per auto-routing event.
Row count divergence threshold — if the rerouted model produces a row count more than X% different from the previous run, halt and alert. Typical setting: 5% for core metrics, 15% for intermediate models.
LLM token budget — if using an LLM for disambiguation, set a daily token cap so a noisy upstream schema doesn't run up an unexpected API bill. Most routing events shouldn't need the LLM at all.
Circuit breaker on cascade — if rerouting one model causes three downstream models to fail validation, stop the cascade, roll back, alert. Don't let one schema change propagate failures through the entire DAG.
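Three of the guardrails above (compute budget, row count divergence, cascade breaker) reduce to simple threshold checks. The sketch below shows one way to combine them. The Guardrails and evaluate names, the order in which checks run, and the returned action strings are assumptions for illustration. The defaults mirror the typical settings quoted above, and the LLM token cap is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrails:
    max_models_per_event: int = 50     # compute budget per routing event
    max_row_divergence: float = 0.05   # 5%, the core-metrics setting
    max_cascade_failures: int = 3      # downstream failures that trip the breaker


def row_divergence(previous_rows: int, current_rows: int) -> float:
    """Relative change in row count versus the previous run."""
    if previous_rows == 0:
        return 0.0 if current_rows == 0 else float("inf")
    return abs(current_rows - previous_rows) / previous_rows


def evaluate(g: Guardrails, models_to_rerun: int, previous_rows: int,
             current_rows: int, downstream_failures: int) -> str:
    """Return the action the routing engine should take for this event."""
    if downstream_failures >= g.max_cascade_failures:
        return "rollback_and_alert"
    if models_to_rerun > g.max_models_per_event:
        return "pause_for_review"
    if row_divergence(previous_rows, current_rows) > g.max_row_divergence:
        return "halt_and_alert"
    return "proceed"
```

For a 14-model reroute where the row count moves from 100,000 to 100,050 with no downstream failures, evaluate returns "proceed". A 10% row count swing on the same event returns "halt_and_alert".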

Observability is the other half of this. Every routing decision — whether automatic or queued for human review — should be logged with: timestamp, triggering schema event, confidence score, action taken (or not taken), validation outcome, and rollback status. This log is how you audit the system, tune confidence thresholds, and build trust with stakeholders who are skeptical of autonomous pipeline changes.
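One possible shape for that log record is a small immutable structure serialized as one JSON line per decision. The field names below follow the list in the paragraph above. The RoutingLogEntry class and to_log_line helper are hypothetical, and the values in the usage example are taken from the CRM scenario.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RoutingLogEntry:
    timestamp: str            # ISO 8601, UTC
    schema_event: str         # the triggering schema change
    confidence: float
    action: str               # action taken, or "none" if queued for review
    validation_outcome: str
    rolled_back: bool


def to_log_line(entry: RoutingLogEntry) -> str:
    """Serialize one decision as a single JSON line with stable key order."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = RoutingLogEntry(
    timestamp=datetime.now(timezone.utc).isoformat(),
    schema_event="rename account.client_id -> account.customer_id",
    confidence=0.94,
    action="auto_reroute",
    validation_outcome="passed",
    rolled_back=False,
)
print(to_log_line(entry))
```

Writing one self-describing line per decision keeps the log easy to filter by confidence score later, which is what threshold tuning needs.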

When Agentic Routing Makes Sense

Agentic routing adds the most value when: your upstream schemas change frequently (more than once a month), your transform layer is complex (50+ models with deep dependency chains), and your on-call rotation is small (schema changes require waking people up). If your schema is stable, your pipeline is shallow, and you have a large team that handles incidents quickly, the ROI is lower.

The right frame is not "should we automate everything" but "which routing decisions are safe to automate, and what are the exact conditions under which we require a human." That decision table is your routing policy. Writing it explicitly — rather than having it live in the heads of your senior data engineers — is itself a meaningful improvement in pipeline reliability.