Data Engineering · 7 min read

Building Data Pipelines That Don't Break at 3 AM

Sarah Mitchell

If you've ever been woken up at 3 AM because a data pipeline failed and the morning dashboards are empty, this post is for you. After building and maintaining pipelines that process billions of records daily, here's what we've learned about building data infrastructure that actually works.

Why Pipelines Fail

Most pipeline failures fall into a few categories:

**Schema changes.** An upstream API adds a field, removes a field, or changes a type. Your pipeline doesn't expect it and crashes — or worse, silently produces wrong data.

**Volume spikes.** Black Friday hits and your pipeline gets 10x the normal volume. It either falls over or runs so slowly that downstream consumers don't get their data on time.

**Infrastructure failures.** A node goes down, a network partition happens, a cloud service has an outage. Your pipeline doesn't handle partial failures gracefully.

**Silent data quality issues.** The pipeline runs successfully, but the data it produces is wrong. No one notices until someone makes a bad business decision based on stale or incorrect numbers.

Patterns That Work

Schema Contracts

Define explicit contracts between producers and consumers. Use tools like Avro, Protobuf, or JSON Schema to validate data at pipeline boundaries. When a schema change breaks the contract, you want a loud failure at ingestion — not a silent corruption downstream.
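
As a rough sketch, a contract check at the ingestion boundary might look like this in Python with the jsonschema library. The schema fields and the `validate_record` helper are illustrative, not a real contract:

```python
from jsonschema import Draft7Validator

# Illustrative contract for an "orders" feed; the real fields come from
# whatever the producer and consumer have agreed on.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "customer_id", "amount", "created_at"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "created_at": {"type": "string", "format": "date-time"},
    },
    # Tolerate new upstream fields, but reject missing or mistyped ones.
    "additionalProperties": True,
}

validator = Draft7Validator(ORDER_SCHEMA)

def validate_record(record: dict) -> list:
    """Return a list of contract violations; empty means the record passes."""
    return [error.message for error in validator.iter_errors(record)]

violations = validate_record({"order_id": 123, "amount": -5})
if violations:
    # Fail loudly at ingestion instead of letting bad data flow downstream.
    raise ValueError(f"Schema contract violated: {violations}")
```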

Idempotent Processing

Design every stage of your pipeline to be safely re-runnable. If a stage fails halfway through, you should be able to restart it without duplicating data. This means using upserts instead of inserts, processing data in deterministic batches, and using watermarks to track progress.
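
For instance, a batch loader stays re-runnable when the upsert and the watermark update commit in the same transaction. A minimal sketch, using SQLite as a stand-in for the warehouse (table and pipeline names are made up):

```python
import sqlite3  # stand-in for a real warehouse connection

# Upsert: re-processing the same batch overwrites rows instead of duplicating them.
UPSERT_EVENT = """
INSERT INTO events (event_id, payload, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (event_id) DO UPDATE SET
    payload = excluded.payload,
    updated_at = excluded.updated_at
"""

def load_batch(conn: sqlite3.Connection, batch: list, watermark: str) -> None:
    """Idempotent load: the batch and the watermark land together, so a
    failed run can simply be restarted from the previous watermark."""
    with conn:  # one transaction; rolls back automatically on error
        conn.executemany(UPSERT_EVENT, batch)
        conn.execute(
            "UPDATE pipeline_watermarks SET last_processed = ? WHERE pipeline = 'events'",
            (watermark,),
        )
```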

Data Quality Checks

Build quality checks directly into your pipeline, not as an afterthought. At minimum, check:

  • Row counts match expectations (within a tolerance)
  • Key columns are never null
  • Values fall within expected ranges
  • Freshness — data arrived within the expected window

Tools like dbt tests, Great Expectations, or simple SQL assertions work well here.
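
A bare-bones version of these checks as post-load assertions might look like the sketch below, assuming a `run_query` helper that returns a single scalar from the warehouse (the table names and exact SQL dialect are placeholders):

```python
def run_query(sql: str) -> float:
    """Placeholder: run `sql` against the warehouse and return one scalar value."""
    raise NotImplementedError

def check_daily_orders(expected_rows: int, tolerance: float = 0.1) -> None:
    # Row counts match expectations (within a tolerance)
    rows = run_query("SELECT COUNT(*) FROM analytics.orders WHERE load_date = CURRENT_DATE")
    assert abs(rows - expected_rows) <= expected_rows * tolerance, f"row count off: {rows}"

    # Key columns are never null
    null_keys = run_query("SELECT COUNT(*) FROM analytics.orders WHERE order_id IS NULL")
    assert null_keys == 0, f"{null_keys} rows with NULL order_id"

    # Values fall within expected ranges
    negatives = run_query("SELECT COUNT(*) FROM analytics.orders WHERE amount < 0")
    assert negatives == 0, f"{negatives} rows with negative amounts"

    # Freshness: data arrived within the expected window (date math is dialect-specific)
    lag_minutes = run_query(
        "SELECT DATEDIFF('minute', MAX(loaded_at), CURRENT_TIMESTAMP) FROM analytics.orders"
    )
    assert lag_minutes <= 60, f"data is {lag_minutes} minutes stale"
```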

Circuit Breakers

When a pipeline stage fails repeatedly, stop retrying and alert. Don't let a failing stage consume resources and create cascading failures. Implement exponential backoff with a maximum retry count, and have a dead-letter queue for records that can't be processed.
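
One way to sketch this in Python, where `process` and `send_to_dead_letter_queue` are hypothetical stand-ins for your stage logic and DLQ client:

```python
import time

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 2

def process_with_backoff(record, process, send_to_dead_letter_queue):
    """Retry a record with exponential backoff; after MAX_RETRIES, park it in
    the dead-letter queue instead of blocking the rest of the batch."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return process(record)
        except Exception as exc:
            if attempt == MAX_RETRIES:
                send_to_dead_letter_queue(record, str(exc))
                return None
            # 2s, 4s, 8s, 16s ... capped so no retry waits more than a minute
            time.sleep(min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), 60))
```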

Observability

You can't fix what you can't see. Every pipeline should have:

  • Metrics: Records processed, processing time, error rates, lag
  • Logs: Structured, searchable, with correlation IDs
  • Alerts: Based on SLOs, not just error counts. "Data is 30 minutes late" is more useful than "5 errors occurred"
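
As a small illustration, structured log lines carrying a correlation ID plus an SLO-style freshness check might look like this; the event names, fields, and thresholds are invented for the example:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")

def log_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one structured, searchable JSON line per pipeline event."""
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id,
                            "ts": time.time(), **fields}))

def freshness_slo_breached(last_success_epoch: float, slo_minutes: int = 30) -> bool:
    """Alert on 'data is N minutes late', not on raw error counts."""
    lag_minutes = (time.time() - last_success_epoch) / 60
    return lag_minutes > slo_minutes

run_id = str(uuid.uuid4())  # correlation ID shared by every log line in this run
log_event("batch_started", correlation_id=run_id, records_expected=100_000)
```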

The Modern Data Stack

Our current go-to architecture for most data workloads:

  • Orchestration: Airflow or Dagster for workflow management
  • Transformation: dbt for SQL-based transformations with built-in testing
  • Storage: Snowflake or BigQuery for the warehouse, with S3/GCS for the lake
  • Streaming: Kafka for real-time ingestion when needed
  • Quality: dbt tests + custom assertions for data validation
  • Infrastructure: Terraform for everything, deployed via CI/CD
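
To show how these pieces fit together, here is a minimal Airflow DAG sketch that chains ingestion, dbt transformations, and dbt tests; the task names, paths, and schedule are placeholders rather than a recommended layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily batch: ingest raw data, transform with dbt, then run dbt tests.
with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_orders",
        bash_command="python /opt/pipeline/ingest_orders.py",
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/orders",
    )
    test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt/orders",
    )

    ingest >> transform >> test
```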

Start Simple

The biggest mistake we see is over-engineering on day one. Start with batch processing. Add streaming only when you have a real-time requirement. Use managed services instead of self-hosting. You can always add complexity later — removing it is much harder.

Build the pipeline that solves today's problem reliably, and design it so tomorrow's problem is an extension, not a rewrite.

data engineering · data pipelines · ETL · data quality
