Data Engineering · 7 min read

Building Data Pipelines That Don't Break at 3 AM

Sarah Mitchell

If you've ever been woken up at 3 AM because a data pipeline failed and the morning dashboards are empty, this post is for you. After building and maintaining pipelines that process billions of records daily, here's what we've learned about building data infrastructure that actually works.

Why Pipelines Fail

Most pipeline failures fall into a few categories:

**Schema changes.** An upstream API adds a field, removes a field, or changes a type. Your pipeline doesn't expect it and crashes — or worse, silently produces wrong data.

**Volume spikes.** Black Friday hits and your pipeline gets 10x the normal volume. It either falls over or runs so slowly that downstream consumers don't get their data on time.

**Infrastructure failures.** A node goes down, a network partition happens, a cloud service has an outage. Your pipeline doesn't handle partial failures gracefully.

**Silent data quality issues.** The pipeline runs successfully, but the data it produces is wrong. No one notices until someone makes a bad business decision based on stale or incorrect numbers.

Patterns That Work

Schema Contracts

Define explicit contracts between producers and consumers. Use tools like Avro, Protobuf, or JSON Schema to validate data at pipeline boundaries. When a schema change breaks the contract, you want a loud failure at ingestion — not a silent corruption downstream.
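
As a rough sketch, a contract check at the ingestion boundary might look like this in Python with the jsonschema library. The schema fields and the `validate_record` helper are illustrative, not a real contract:

```python
from jsonschema import Draft7Validator

# Illustrative contract for an "orders" feed; the real fields come from
# whatever the producer and consumer have agreed on.
ORDER_SCHEMA = {
    "type": "object",
    "required": ["order_id", "customer_id", "amount", "created_at"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "created_at": {"type": "string", "format": "date-time"},
    },
    # Tolerate new upstream fields, but reject missing or mistyped ones.
    "additionalProperties": True,
}

validator = Draft7Validator(ORDER_SCHEMA)

def validate_record(record: dict) -> list:
    """Return a list of contract violations; empty means the record passes."""
    return [error.message for error in validator.iter_errors(record)]

violations = validate_record({"order_id": 123, "amount": -5})
if violations:
    # Fail loudly at ingestion instead of letting bad data flow downstream.
    raise ValueError(f"Schema contract violated: {violations}")
```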

Idempotent Processing

Design every stage of your pipeline to be safely re-runnable. If a stage fails halfway through, you should be able to restart it without duplicating data. This means using upserts instead of inserts, processing data in deterministic batches, and using watermarks to track progress.
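
For instance, a batch loader stays re-runnable when the upsert and the watermark update commit in the same transaction. A minimal sketch, using SQLite as a stand-in for the warehouse (table and pipeline names are made up):

```python
import sqlite3  # stand-in for a real warehouse connection

# Upsert: re-processing the same batch overwrites rows instead of duplicating them.
UPSERT_EVENT = """
INSERT INTO events (event_id, payload, updated_at)
VALUES (?, ?, ?)
ON CONFLICT (event_id) DO UPDATE SET
    payload = excluded.payload,
    updated_at = excluded.updated_at
"""

def load_batch(conn: sqlite3.Connection, batch: list, watermark: str) -> None:
    """Idempotent load: the batch and the watermark land together, so a
    failed run can simply be restarted from the previous watermark."""
    with conn:  # one transaction; rolls back automatically on error
        conn.executemany(UPSERT_EVENT, batch)
        conn.execute(
            "UPDATE pipeline_watermarks SET last_processed = ? WHERE pipeline = 'events'",
            (watermark,),
        )
```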

Data Quality Checks

Build quality checks directly into your pipeline, not as an afterthought. At minimum, check:

  • Row counts match expectations (within a tolerance)
  • Key columns are never null
  • Values fall within expected ranges
  • Freshness — data arrived within the expected window

Tools like dbt tests, Great Expectations, or simple SQL assertions work well here.
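
A bare-bones version of these checks as post-load assertions might look like the sketch below, assuming a `run_query` helper that returns a single scalar from the warehouse (the table names and exact SQL dialect are placeholders):

```python
def run_query(sql: str) -> float:
    """Placeholder: run `sql` against the warehouse and return one scalar value."""
    raise NotImplementedError

def check_daily_orders(expected_rows: int, tolerance: float = 0.1) -> None:
    # Row counts match expectations (within a tolerance)
    rows = run_query("SELECT COUNT(*) FROM analytics.orders WHERE load_date = CURRENT_DATE")
    assert abs(rows - expected_rows) <= expected_rows * tolerance, f"row count off: {rows}"

    # Key columns are never null
    null_keys = run_query("SELECT COUNT(*) FROM analytics.orders WHERE order_id IS NULL")
    assert null_keys == 0, f"{null_keys} rows with NULL order_id"

    # Values fall within expected ranges
    negatives = run_query("SELECT COUNT(*) FROM analytics.orders WHERE amount < 0")
    assert negatives == 0, f"{negatives} rows with negative amounts"

    # Freshness: data arrived within the expected window (date math is dialect-specific)
    lag_minutes = run_query(
        "SELECT DATEDIFF('minute', MAX(loaded_at), CURRENT_TIMESTAMP) FROM analytics.orders"
    )
    assert lag_minutes <= 60, f"data is {lag_minutes} minutes stale"
```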

Circuit Breakers

When a pipeline stage fails repeatedly, stop retrying and alert. Don't let a failing stage consume resources and create cascading failures. Implement exponential backoff with a maximum retry count, and have a dead-letter queue for records that can't be processed.
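
One way to sketch this in Python, where `process` and `send_to_dead_letter_queue` are hypothetical stand-ins for your stage logic and DLQ client:

```python
import time

MAX_RETRIES = 5
BASE_DELAY_SECONDS = 2

def process_with_backoff(record, process, send_to_dead_letter_queue):
    """Retry a record with exponential backoff; after MAX_RETRIES, park it in
    the dead-letter queue instead of blocking the rest of the batch."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return process(record)
        except Exception as exc:
            if attempt == MAX_RETRIES:
                send_to_dead_letter_queue(record, str(exc))
                return None
            # 2s, 4s, 8s, 16s ... capped so no retry waits more than a minute
            time.sleep(min(BASE_DELAY_SECONDS * 2 ** (attempt - 1), 60))
```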

Observability

You can't fix what you can't see. Every pipeline should have:

  • Metrics: Records processed, processing time, error rates, lag
  • Logs: Structured, searchable, with correlation IDs
  • Alerts: Based on SLOs, not just error counts. "Data is 30 minutes late" is more useful than "5 errors occurred"
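
As a small illustration, structured log lines carrying a correlation ID plus an SLO-style freshness check might look like this; the event names, fields, and thresholds are invented for the example:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("pipeline")

def log_event(event: str, correlation_id: str, **fields) -> None:
    """Emit one structured, searchable JSON line per pipeline event."""
    logger.info(json.dumps({"event": event, "correlation_id": correlation_id,
                            "ts": time.time(), **fields}))

def freshness_slo_breached(last_success_epoch: float, slo_minutes: int = 30) -> bool:
    """Alert on 'data is N minutes late', not on raw error counts."""
    lag_minutes = (time.time() - last_success_epoch) / 60
    return lag_minutes > slo_minutes

run_id = str(uuid.uuid4())  # correlation ID shared by every log line in this run
log_event("batch_started", correlation_id=run_id, records_expected=100_000)
```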

The Modern Data Stack

Our current go-to architecture for most data workloads:

  • Orchestration: Airflow or Dagster for workflow management
  • Transformation: dbt for SQL-based transformations with built-in testing
  • Storage: Snowflake or BigQuery for the warehouse, with S3/GCS for the lake
  • Streaming: Kafka for real-time ingestion when needed
  • Quality: dbt tests + custom assertions for data validation
  • Infrastructure: Terraform for everything, deployed via CI/CD
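
To show how these pieces fit together, here is a minimal Airflow DAG sketch that chains ingestion, dbt transformations, and dbt tests; the task names, paths, and schedule are placeholders rather than a recommended layout:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily batch: ingest raw data, transform with dbt, then run dbt tests.
with DAG(
    dag_id="daily_orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_raw_orders",
        bash_command="python /opt/pipeline/ingest_orders.py",
    )
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/orders",
    )
    test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt/orders",
    )

    ingest >> transform >> test
```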

Start Simple

The biggest mistake we see is over-engineering on day one. Start with batch processing. Add streaming only when you have a real-time requirement. Use managed services instead of self-hosting. You can always add complexity later — removing it is much harder.

Build the pipeline that solves today's problem reliably, and design it so tomorrow's problem is an extension, not a rewrite.

data engineering · data pipelines · ETL · data quality
