Building Data Pipelines That Don't Break at 3 AM
If you've ever been woken up at 3 AM because a data pipeline failed and the morning dashboards are empty, this post is for you. We've built and maintained pipelines that process billions of records daily; here's what that experience has taught us about building data infrastructure that actually works.
Why Pipelines Fail
Most pipeline failures fall into a few categories:
**Schema changes.** An upstream API adds a field, removes a field, or changes a type. Your pipeline doesn't expect it and crashes — or worse, silently produces wrong data.
**Volume spikes.** Black Friday hits and your pipeline gets 10x the normal volume. It either falls over or runs so slowly that downstream consumers don't get their data on time.
**Infrastructure failures.** A node goes down, the network partitions, a cloud service has an outage. Your pipeline doesn't handle partial failure gracefully.
**Silent data quality issues.** The pipeline runs successfully, but the data it produces is wrong. No one notices until someone makes a bad business decision based on stale or incorrect numbers.
Patterns That Work
Schema Contracts
Define explicit contracts between producers and consumers. Use tools like Avro, Protobuf, or JSON Schema to validate data at pipeline boundaries. When a schema change breaks the contract, you want a loud failure at ingestion — not a silent corruption downstream.
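As a minimal sketch of the idea, assuming a Python pipeline and the `jsonschema` library (the event schema and field names here are illustrative, not from any real contract):

```python
import jsonschema  # pip install jsonschema

# Illustrative contract for an incoming event. In practice this would live
# in a shared registry that producers and consumers both read from.
EVENT_SCHEMA = {
    "type": "object",
    "properties": {
        "event_id": {"type": "string"},
        "user_id": {"type": "integer"},
        "amount_cents": {"type": "integer"},
    },
    "required": ["event_id", "user_id", "amount_cents"],
    "additionalProperties": False,  # an unexpected new field fails loudly
}

def ingest(record: dict) -> dict:
    # Raises jsonschema.ValidationError at the pipeline boundary instead of
    # letting a removed field or changed type corrupt data downstream.
    jsonschema.validate(instance=record, schema=EVENT_SCHEMA)
    return record
```

The same check belongs on the producer side too: a contract only earns its name when both ends validate against it.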
Idempotent Processing
Design every stage of your pipeline to be safely re-runnable. If a stage fails halfway through, you should be able to restart it without duplicating data. This means using upserts instead of inserts, processing data in deterministic batches, and using watermarks to track progress.
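Here's a rough sketch of that pattern, using SQLite as a stand-in for a real warehouse (the table names, the `load_batch` helper, and the watermark scheme are all illustrative):

```python
import sqlite3

def load_batch(conn: sqlite3.Connection, batch_id: str, rows: list[tuple]) -> None:
    """Load one deterministic batch; safe to re-run after a partial failure."""
    with conn:  # one transaction: the whole batch lands, or none of it does
        # Watermark check: skip batches that already committed.
        if conn.execute("SELECT 1 FROM watermarks WHERE batch_id = ?",
                        (batch_id,)).fetchone():
            return
        # Upsert, not insert: re-processing a row overwrites it
        # instead of duplicating it.
        conn.executemany(
            """INSERT INTO orders (order_id, amount_cents) VALUES (?, ?)
               ON CONFLICT(order_id) DO UPDATE SET
                   amount_cents = excluded.amount_cents""",
            rows,
        )
        conn.execute("INSERT INTO watermarks (batch_id) VALUES (?)", (batch_id,))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, amount_cents INTEGER)")
conn.execute("CREATE TABLE watermarks (batch_id TEXT PRIMARY KEY)")
load_batch(conn, "2024-01-01", [("o1", 500), ("o2", 1200)])
load_batch(conn, "2024-01-01", [("o1", 500), ("o2", 1200)])  # re-run: no-op
```

Running the same batch twice is a no-op: the watermark short-circuits the re-run, and even without it the upsert would overwrite rather than duplicate.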
Data Quality Checks
Build quality checks directly into your pipeline, not as an afterthought. At minimum, check:
**Row counts.** Did this run produce roughly the expected volume, or did an upstream source quietly go empty?
**Null rates.** Are key columns suddenly full of nulls?
**Freshness.** Is the newest record recent enough for downstream consumers?
**Uniqueness.** Are primary keys still unique, or did a re-run duplicate data?
Tools like dbt tests, Great Expectations, or simple SQL assertions work well here.
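If a full framework is more than you need, a handful of SQL assertions gets you most of the value. A minimal sketch of a check runner (the `orders` table and the specific checks are illustrative):

```python
import sqlite3

# Each check is a SQL query that must return zero rows to pass.
CHECKS = {
    "orders_not_empty":
        "SELECT 1 FROM (SELECT COUNT(*) AS n FROM orders) WHERE n = 0",
    "no_null_order_ids":
        "SELECT 1 FROM orders WHERE order_id IS NULL LIMIT 1",
    "no_negative_amounts":
        "SELECT 1 FROM orders WHERE amount_cents < 0 LIMIT 1",
}

def run_checks(conn: sqlite3.Connection) -> None:
    failures = [name for name, sql in CHECKS.items()
                if conn.execute(sql).fetchone() is not None]
    if failures:
        # Fail the run loudly instead of publishing bad data downstream.
        raise RuntimeError(f"Data quality checks failed: {failures}")
```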
Circuit Breakers
When a pipeline stage fails repeatedly, stop retrying and alert. Don't let a failing stage consume resources and create cascading failures. Implement exponential backoff with a maximum retry count, and have a dead-letter queue for records that can't be processed.
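A rough sketch of that combination, with an in-memory list standing in for a real dead-letter queue and a hypothetical `alert` hook:

```python
import time

MAX_RETRIES = 5
dead_letter_queue = []  # stand-in for a real DLQ topic or table

def alert(message: str) -> None:
    # Hypothetical hook: wire this to PagerDuty, Slack, etc.
    print(f"ALERT: {message}")

def process_with_retries(record, process) -> None:
    """Retry with exponential backoff, then dead-letter instead of retrying forever."""
    for attempt in range(MAX_RETRIES):
        try:
            process(record)
            return
        except Exception:
            # Backoff: 1s, 2s, 4s, 8s, 16s (capped) before the next attempt.
            time.sleep(min(2 ** attempt, 30))
    # Out of retries: park the record for inspection and keep the stage moving.
    dead_letter_queue.append(record)
    alert(f"record dead-lettered after {MAX_RETRIES} attempts: {record!r}")
```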
Observability
You can't fix what you can't see. Every pipeline should have:
**Structured logs.** Per-stage logs with enough context to trace a failed run without grepping free text.
**Metrics.** Throughput, latency, and error rate for each stage.
**Freshness alerts.** Alarms that fire when data is late relative to what consumers expect, not just when a job crashes.
**Lineage.** When a dashboard looks wrong, you should be able to trace the number back to its sources.
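One cheap way to get the logging and metrics pieces is a stage wrapper that emits one structured JSON record per run; a sketch under that assumption (the field names and the `run_stage` helper are illustrative):

```python
import json
import logging
import time

log = logging.getLogger("pipeline")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def run_stage(name, fn, *args):
    """Run one pipeline stage and emit a structured log record for it."""
    start = time.monotonic()
    status, rows = "error", 0
    try:
        result = fn(*args)
        status = "ok"
        # Illustrative assumption: each stage returns a list of output rows.
        rows = len(result)
        return result
    finally:
        # One JSON record per stage run; a log pipeline can aggregate these
        # into throughput, latency, and error-rate metrics.
        log.info(json.dumps({
            "stage": name,
            "status": status,
            "rows_out": rows,
            "duration_s": round(time.monotonic() - start, 3),
        }))
```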
The Modern Data Stack
Our current go-to architecture for most data workloads:
Start Simple
The biggest mistake we see is over-engineering on day one. Start with batch processing. Add streaming only when you have a real-time requirement. Use managed services instead of self-hosting. You can always add complexity later — removing it is much harder.
Build the pipeline that solves today's problem reliably, and design it so tomorrow's problem is an extension, not a rewrite.