
Data Contracts and the Semantic Layer: Enforcing Consistency Upstream

Wei Tan · March 30, 2026

Data contracts have been discussed in the data engineering community for several years, and the concept is straightforward: a formal agreement between a data producer (the team that owns the upstream system generating the data) and data consumers (the analytics team, the ML pipeline, the reporting layer) about what that data will contain, at what frequency, with what quality guarantees.

What's less discussed is where contracts should be enforced — and the answer most teams land on, after a few cycles of contract violation incidents, is that enforcement at the ingestion layer alone is insufficient. You need contract enforcement at the semantic layer too.

The Producer-Consumer Model

The canonical data contract pattern comes from Chad Sanderson's work on data contracts (public writing, widely referenced in the modern data stack community). The model: the data producer publishes a schema contract (column names, types, nullability guarantees, an SLA for freshness) and the consumer depends on that contract being honored. When the contract changes, the producer must version the change, notify consumers, and support the old version for a defined migration window.

In practice, this requires three things: a schema registry where contracts are stored (Confluent Schema Registry, AWS Glue Data Catalog, a Git-versioned YAML directory — the implementation varies), a validation mechanism that enforces the contract at the boundary where data enters the system, and a notification mechanism that alerts consumers when contracts are updated or about to be broken.

The ingestion layer (Fivetran, Airbyte, a custom Kafka consumer) is the natural first enforcement point. When data arrives from a producer, it's validated against the registered contract schema. If a required column is missing, or a column type has changed incompatibly, the data is rejected and an alert fires — before the bad data enters the raw table in your warehouse.
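As a rough illustration of that boundary check, the sketch below validates one incoming record against a registered contract. The ColumnSpec type, the ORDERS_CONTRACT columns, and the validate_record helper are all hypothetical names invented for this example; a real ingestion tool or schema registry would supply its own equivalents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """One column in a (hypothetical) schema contract."""
    name: str
    py_type: type
    nullable: bool = False


# Illustrative contract for raw.orders; the column list is an assumption.
ORDERS_CONTRACT = [
    ColumnSpec("order_id", int),
    ColumnSpec("order_status", str),
    ColumnSpec("shipped_at", str, nullable=True),
]


def validate_record(record: dict, contract: list[ColumnSpec]) -> list[str]:
    """Return contract violations; an empty list means the record complies."""
    violations = []
    for col in contract:
        if col.name not in record:
            violations.append(f"missing column: {col.name}")
            continue
        value = record[col.name]
        if value is None:
            if not col.nullable:
                violations.append(f"null in non-nullable column: {col.name}")
        elif not isinstance(value, col.py_type):
            violations.append(
                f"type mismatch in {col.name}: "
                f"expected {col.py_type.__name__}, got {type(value).__name__}"
            )
    return violations
```

A caller would reject the record and fire an alert whenever the returned list is non-empty, so the bad row never reaches the raw table.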

This is the right first line of defense. But it's not sufficient on its own.

Why Ingestion-Layer Enforcement Isn't Enough

Consider what happens after data passes ingestion validation. It lands in a raw schema in your warehouse. From there, dbt models transform it into staging models, intermediate models, and finally into marts and semantic models that your BI tools and analysts query. Each of those downstream transforms is a consumer of the raw table.

If the ingestion contract says "column order_status is VARCHAR NOT NULL", and the producer currently emits the values pending, confirmed, shipped, and cancelled, the raw table is contract-compliant as long as the type and nullability hold. But your semantic model might define a metric, "confirmed order rate", as the count of rows where order_status = 'confirmed' divided by total orders. If the source starts using a new status value ("approved" instead of "confirmed" after a backend code change), the data still passes ingestion validation (it's still a VARCHAR NOT NULL), but your semantic metric is now computing the wrong thing: approved orders silently drop out of the numerator.

The ingestion-layer contract caught structural compliance. It didn't catch semantic drift. The semantic layer is where semantic contracts need to live.
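The drift is easy to reproduce in a few lines. In this sketch the status lists are made-up sample data, and passes_structural_check is a stand-in for a VARCHAR NOT NULL check rather than any real tool's validator.

```python
def passes_structural_check(statuses: list) -> bool:
    """Stand-in for VARCHAR NOT NULL: every value is a non-null string."""
    return all(isinstance(s, str) for s in statuses)


def confirmed_order_rate(statuses: list[str]) -> float:
    """Share of orders whose status is exactly 'confirmed'."""
    return sum(s == "confirmed" for s in statuses) / len(statuses)


# Made-up sample data: the same four orders before and after the backend
# starts writing "approved" where it used to write "confirmed".
before = ["confirmed", "confirmed", "pending", "shipped"]
after = ["approved", "approved", "pending", "shipped"]

# Both batches pass the structural check, but the metric changes.
rate_before = confirmed_order_rate(before)  # 2 of 4 rows match
rate_after = confirmed_order_rate(after)    # 0 of 4 rows match
```

Nothing in the structural check distinguishes the two batches, which is why the metric definition itself needs a contract.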

Data Contracts at the Semantic Layer

A semantic-layer contract extends the basic schema contract with three additional dimensions:

Value set contracts. Not just "this column is VARCHAR" but "this column takes values from {pending, confirmed, shipped, cancelled}; any value not in this set is treated as anomalous and surfaced for review." This catches the case where a new status code such as "approved" is added silently. In dbt, you'd express this as an accepted_values test. In a semantic layer context, it means the metric definition validates against the expected domain of input values.
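A value set check reduces to a set-membership test over the column. The sketch below is a minimal version of that idea; the expected set (pending, confirmed, shipped, cancelled) and the function name are assumptions chosen for illustration, not part of dbt or any semantic layer product.

```python
from collections import Counter

# Illustrative expected domain for order_status (an assumption).
EXPECTED_ORDER_STATUSES = frozenset({"pending", "confirmed", "shipped", "cancelled"})


def unexpected_values(values, expected) -> dict:
    """Count each observed value that falls outside the expected set."""
    return dict(Counter(v for v in values if v not in expected))


observed = ["pending", "approved", "approved", "shipped"]
anomalies = unexpected_values(observed, EXPECTED_ORDER_STATUSES)
```

Returning counts rather than a bare boolean gives the reviewer a sense of how widespread the new value is before deciding whether to extend the contract.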

Metric integrity contracts. A metric contract specifies the expected behavioral properties of a computed metric: "monthly_active_users should be between 1,000 and 50,000; a value outside this range triggers an alert." This is statistical validation of an aggregate value, not schema validation and not a row-level check. Tools like Soda and Monte Carlo apply this kind of monitoring to computed tables. The semantic layer is where these expectations should be attached, because the metric definition is where the computation logic lives.
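The range expectation on a computed metric can be sketched as a bounds check on a single aggregate value. ValueRange and range_breach are hypothetical names for this example; monitoring tools expose their own configuration for the same idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueRange:
    """Inclusive bounds a computed metric is expected to stay within."""
    min: float
    max: float


def range_breach(metric_name: str, value: float, bounds: ValueRange) -> Optional[str]:
    """Return an alert message if the value is out of bounds, else None."""
    if bounds.min <= value <= bounds.max:
        return None
    return f"{metric_name}={value} outside [{bounds.min}, {bounds.max}]"


mau_bounds = ValueRange(min=1000, max=50000)
```

With these bounds, a reading of 12000 produces no alert, while a reading of 730 returns a message that can be routed to the metric owner.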

Lineage contracts. A lineage contract says: "this metric depends on these specific source columns; if any of those columns change, the metric owner must be notified and must re-validate the metric definition." This is the contract that bridges ingestion changes to semantic impacts. Without it, the ingestion team changes a column and the semantic team finds out when a dashboard breaks.
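One way to picture a lineage contract is as a lookup from metrics to the source columns they read. The registry contents and the affected_metrics helper below are invented for illustration; in practice the mapping would more likely be derived from the semantic model files or a lineage tool.

```python
# Hypothetical lineage registry: metric name -> (table, column) dependencies.
LINEAGE = {
    "order_confirmation_rate": {("raw.orders", "order_status")},
    "monthly_active_users": {("raw.users", "user_id"), ("raw.users", "last_seen_at")},
}


def affected_metrics(changed_columns, lineage) -> list[str]:
    """Return the metrics that depend on any of the changed columns."""
    changed = set(changed_columns)
    return sorted(name for name, deps in lineage.items() if deps & changed)
```

When the ingestion team proposes a change to raw.orders.order_status, calling affected_metrics with that column yields the list of metric owners to notify before the change ships.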

A Concrete Implementation Pattern

Consider an analytics team at a growing e-commerce company with three data sources (orders, users, inventory) feeding into 60 dbt models and a semantic layer with 30 defined metrics. Here's what a contract-enforced semantic layer looks like in practice:

Each metric in the YAML semantic model file has a contracts block alongside the metric definition:

metrics:
  - name: order_confirmation_rate
    type: ratio
    label: Order Confirmation Rate
    type_params:
      numerator: confirmed_orders
      denominator: total_orders
    contracts:
      source_columns:
        - table: raw.orders
          column: order_status
          expected_values: [pending, confirmed, shipped, cancelled, refunded]
          pii_classification: none
      value_range:
        min: 0.40
        max: 0.95
        alert_on_breach: true
      freshness_sla_hours: 4

The semantic layer engine evaluates these contracts on every pipeline run. If order_status contains a value not in expected_values, the run logs a contract violation warning (not a failure — you can decide whether an unknown status code should fail the pipeline or just alert). If the computed metric value falls outside [0.40, 0.95], an alert fires to the metric owner. If the source hasn't been updated within 4 hours, the metric is marked stale.
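The freshness check and the warn-versus-fail decision described above might look like the following. This is a sketch under assumptions: is_stale, run_outcome, and the fail_on_violation flag are hypothetical and not taken from any particular semantic layer engine.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_updated: datetime, now: datetime, sla_hours: float) -> bool:
    """True when the source was last updated longer ago than the SLA allows."""
    return now - last_updated > timedelta(hours=sla_hours)


def run_outcome(violations: list[str], fail_on_violation: bool = False) -> str:
    """Map contract violations to a run status.

    By default violations only warn; setting fail_on_violation=True
    makes any violation fail the pipeline run.
    """
    if not violations:
        return "ok"
    return "fail" if fail_on_violation else "warn"


# Fixed timestamps so the example is deterministic.
now = datetime(2026, 3, 30, 12, 0, tzinfo=timezone.utc)
loaded_5h_ago = datetime(2026, 3, 30, 7, 0, tzinfo=timezone.utc)
loaded_3h_ago = datetime(2026, 3, 30, 9, 0, tzinfo=timezone.utc)
```

Keeping the severity decision in one small function makes it straightforward to warn on an unknown status code for one metric and fail the run for another.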

This is enforcement at the right altitude. Ingestion handles structural schema compliance. The semantic layer handles semantic integrity and statistical behavioral properties.

Versioning and Migration Windows

Contracts are only useful if they're versioned. When a data producer needs to change a column — add a new status value, rename a field, change a type — the process should be:

Publish the new contract version to the schema registry.
Notify all registered consumers (automated, via the registry's notification mechanism).
Support both old and new schema versions during a migration window (typically 2-4 weeks for analytics pipelines).
After the migration window, deprecate the old version.
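The migration-window rule in these steps can be expressed as a small date calculation. In this sketch, ContractVersion, the four-week default, and the publication dates are illustrative assumptions rather than features of a specific schema registry.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ContractVersion:
    version: int
    published_on: date


def supported_versions(versions, today: date, window=timedelta(weeks=4)) -> list[int]:
    """Return version numbers consumers may still rely on.

    An older version stays supported until `window` has elapsed since its
    successor was published; the latest version is always supported.
    """
    ordered = sorted(versions, key=lambda v: v.version)
    supported = []
    for current, successor in zip(ordered, ordered[1:]):
        if today <= successor.published_on + window:
            supported.append(current.version)
    if ordered:
        supported.append(ordered[-1].version)
    return supported


# Hypothetical publication dates.
history = [
    ContractVersion(1, date(2026, 1, 5)),
    ContractVersion(2, date(2026, 3, 2)),
]
```

Two weeks after version 2 is published both versions are still supported; once the four-week window has passed, only version 2 remains.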

In practice, this requires the ingestion layer to support dual-write during the migration window — writing to both the old and new column names simultaneously. Fivetran handles schema additions automatically (new columns appear in the destination). Column renames require more care — most ingestion tools treat a rename as a drop + add, which means the old column disappears unless you explicitly add a passthrough.
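A passthrough for a renamed column can be as simple as populating both names on each record during the window. The dual_write helper and the column names below are hypothetical; managed ingestion tools would need their own configuration or a downstream view to achieve the same effect.

```python
def dual_write(record: dict, renames: dict[str, str]) -> dict:
    """Return a copy of the record with both old and new column names set.

    `renames` maps old column name -> new column name. Whichever side is
    present in the incoming record is copied to the other side.
    """
    out = dict(record)
    for old, new in renames.items():
        if new in record and old not in record:
            out[old] = record[new]
        elif old in record and new not in record:
            out[new] = record[old]
    return out


# Hypothetical rename: the producer renamed order_status to status.
RENAMES = {"order_status": "status"}
```

Consumers still reading order_status keep working during the migration window, and the mapping is deleted once the old contract version is deprecated.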

We're not saying every team needs a full contract versioning system with migration windows — that's overhead that doesn't make sense for a 2-person team with 5 data sources. We're saying that as team size grows and data sources multiply, the cost of unmanaged schema changes accumulates and eventually exceeds the cost of implementing lightweight contract governance.

PII as a First-Class Contract Dimension

The pii_classification field in the contract block above is worth explaining. When a new column appears in a source table — through additive drift — the default assumption should be that it might contain PII until confirmed otherwise. A contract enforcement system that validates PII classification on new columns before they're queryable in analytics contexts prevents the pattern of personal data landing in unmasked analytical tables.

The practical implementation: new columns in raw tables are quarantined behind a view that returns NULL for any column lacking a PII classification tag. An analytics engineer must review the column and classify it as either none (no PII) or one of user identifier, contact info, financial, or sensitive attribute before it becomes accessible. This adds one review step per new column in exchange for significantly reduced PII exposure risk in the analytics warehouse.
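A quarantine view of this kind could be generated from the list of raw columns and their classification tags. The function below is a sketch: the view, table, and column names are invented, the SQL is generic rather than tuned to a specific warehouse, and identifiers are interpolated without quoting, so it assumes trusted input.

```python
def quarantine_view_sql(view: str, table: str, columns: list[str],
                        classifications: dict[str, str]) -> str:
    """Build a view that returns NULL for columns with no PII classification.

    Classified columns pass through unchanged; masking of columns that are
    classified as PII is a separate concern not handled here.
    """
    select_items = [
        col if col in classifications else f"NULL AS {col}"
        for col in columns
    ]
    return (
        f"CREATE OR REPLACE VIEW {view} AS "
        f"SELECT {', '.join(select_items)} FROM {table}"
    )


# gift_note is a hypothetical new column that nobody has classified yet.
sql = quarantine_view_sql(
    view="analytics.orders_safe",
    table="raw.orders",
    columns=["order_id", "order_status", "gift_note"],
    classifications={"order_id": "none", "order_status": "none"},
)
```

Regenerating the view after each classification review is what releases a column from quarantine.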

Data contracts at the semantic layer aren't just about catching upstream schema changes. They're the mechanism for encoding what your data means — not just what it contains — and enforcing that meaning as your data ecosystem grows and producers and consumers evolve independently.