How to Build a Data-Driven Organization: A Practical BI Transformation Roadmap
How to build a data-driven organization: a practical BI transformation roadmap across people, process, technol...
How to build a real-time data pipeline at enterprise scale: Kafka, Flink, CDC, exactly-once, schema management, and a reference architecture.
To build a real-time data pipeline for enterprise scale, you stream change events through a durable log (almost always Apache Kafka), process them with a stateful engine such as Apache Flink, enforce contracts with a schema registry, and guarantee correctness with exactly-once semantics end to end. That sentence is the whole pipeline in one breath. The hard part is the operational discipline that keeps it correct when volume, schema drift, and partial failures arrive together.
Real-time pipelines have moved from a competitive edge to table stakes. Fraud scoring, inventory sync, personalisation, observability, and operational analytics all expect freshness measured in seconds, not the overnight batch windows that defined data engineering a decade ago. Kafka now sits in production at more than 80% of the Fortune 100 [6], which tells you the architectural pattern has stabilised even as the tooling around it keeps moving.
This guide is written for CTOs, heads of platform, and data leaders who need a reference architecture they can defend in a budget review. We cover the core components, change data capture (CDC), exactly-once delivery, schema governance, a phased roadmap, the mistakes that quietly destroy data quality, realistic costs, and a build-versus-buy decision you can actually act on. If you want a second pair of senior hands on the design, teams like Mind Supernova help enterprises stand up streaming platforms without a multi-quarter ramp. You can talk to our engineering team when you reach that point.
Key Takeaways
- A durable, partitioned log (Kafka or a managed equivalent) is the backbone. Everything else is a producer or consumer of that log.
- Exactly-once is achievable but expensive in latency and complexity. Most enterprises should design idempotent consumers and reserve true exactly-once for financial and ledger flows.
- CDC from operational databases via log-based tools (Debezium) is the cleanest way to stream existing systems without rewriting them.
- A schema registry with enforced compatibility rules prevents the single most common production outage: a producer change that silently breaks every downstream consumer.
- Build the platform yourself only if streaming is a core competency or compliance demands it. Otherwise managed services (Confluent, MSK, Redpanda Cloud) win on total cost of ownership.
Before any architecture, agree on a definition. "Real-time" is not a single number. It is a latency budget tied to a business outcome, and conflating the tiers is how teams overspend by an order of magnitude.
Hard real-time (sub-100ms) is for fraud blocking, ad bidding, and trading. Near real-time (one to ten seconds) covers personalisation, operational dashboards, and alerting. Streaming ETL (tens of seconds to a few minutes) feeds analytics and feature stores. Each tier has a different cost curve, and the engineering effort to shave latency rises steeply at the bottom.
Scale has three independent axes: throughput (events per second), state size (how much the processing layer must remember to compute a result), and consumer fan-out (how many systems read the same stream). A pipeline doing 50,000 events per second with tiny state is far easier than 5,000 events per second that maintains a multi-terabyte windowed join. Size your design against the axis that actually stresses you, not the headline event count.
| Tier | Latency budget | Typical use case | Processing approach | Relative cost |
|---|---|---|---|---|
| Hard real-time | <100ms | Fraud block, ad bidding | In-memory stream processor, hot state | Very high |
| Near real-time | 1–10s | Personalisation, alerting | Flink or Kafka Streams, windowed | High |
| Streaming ETL | 10s–5min | Feature stores, ops analytics | Micro-batch or stream-to-lake | Moderate |
| Batch (for contrast) | Minutes–hours | Reporting, reconciliation | Scheduled jobs on the warehouse | Low |
Every enterprise real-time pipeline, regardless of vendor, decomposes into the same five layers. Get the responsibilities clear and the technology choices follow.
Sources produce events: application emitters, IoT devices, clickstreams, and database change feeds. They all land in a partitioned, replayable log. Kafka is the default for good reason; its partitioning model gives you ordered, parallel consumption and its retention lets consumers replay history after a bug fix. Redpanda offers a Kafka-compatible API with lower operational overhead, and Apache Pulsar separates compute from storage for very large multi-tenant estates. The log is the contract boundary of your whole platform.
This layer transforms, enriches, aggregates, and joins streams. Apache Flink is the heavyweight choice for stateful processing with event-time semantics and strong exactly-once support. Kafka Streams is a library that runs inside your own services and suits simpler, Kafka-native topologies. Spark Structured Streaming fits teams already invested in Spark and comfortable with micro-batch latency. The decision usually comes down to how much state you carry and whether you need true event-time windowing.
Processed output lands in a warehouse, a lakehouse, an OLAP store such as ClickHouse or Apache Pinot for low-latency queries, or a key-value store for serving features. Cutting across all of this sits governance: the schema registry, lineage, and access control. Treat governance as a first-class layer, not an afterthought, because it is what keeps the platform trustworthy as it grows. For how this serving layer feeds analytics and the semantic model, see our piece on modern BI architecture.
Most enterprises do not start with a clean greenfield event stream. They start with operational databases (PostgreSQL, MySQL, SQL Server, Oracle) that already hold the truth. CDC is how you turn those databases into event sources without rewriting the applications on top of them.
There are two broad approaches, and the difference matters enormously in production. Query-based CDC polls tables for changed rows using a timestamp or version column. It is simple but misses deletes, adds load to the source, and introduces polling latency. Log-based CDC reads the database transaction log directly (the write-ahead log in Postgres, the binlog in MySQL). It captures every change including deletes, imposes near-zero query load, and preserves transaction order. For enterprise scale, log-based CDC is almost always the right answer.
Debezium is the de facto open-source standard for log-based CDC, running as a set of Kafka Connect connectors. It snapshots the initial table state, then streams every subsequent change as a structured event. The common pattern pairs Debezium with the outbox pattern: your application writes business data and an event record in the same transaction, and Debezium streams the outbox table. This sidesteps the dual-write problem, where writing to the database and publishing to Kafka separately can leave the two permanently out of sync after a crash.
| Approach | Captures deletes | Source load | Latency | Ordering | Best for |
|---|---|---|---|---|---|
| Query-based polling | No | High | Polling interval | Weak | Small tables, quick prototypes |
| Log-based (Debezium) | Yes | Very low | Sub-second | Transaction-ordered | Enterprise CDC at scale |
| Trigger-based | Yes | High | Low | Strong | Legacy DBs without log access |
| Outbox + log-based | Yes | Low | Sub-second | Transaction-ordered | Reliable app-driven events |
Distributed systems give you three delivery guarantees: at-most-once (fast, can lose data), at-least-once (no loss, can duplicate), and exactly-once (no loss, no duplicates). Most pipelines naturally land on at-least-once, which means duplicates will happen and you must handle them.
Exactly-once processing is genuinely achievable within a Kafka and Flink stack. Kafka supports idempotent producers and transactions, and Flink coordinates these with checkpoint barriers so that state changes and output commits happen atomically. The result is end-to-end exactly-once within the streaming system. The catch is the boundary: once data leaves for an external system (a REST API, a non-transactional sink), the guarantee depends on that sink supporting idempotent or transactional writes too.
The price is real. Exactly-once adds coordination latency, increases checkpoint overhead, and complicates failure recovery. Our recommendation for most enterprises: build idempotent consumers as the default discipline (use a natural key or a deduplication store), and reserve true exactly-once for flows where a duplicate is financially or legally unacceptable, such as ledger postings, billing, and payments. Idempotency gives you correctness with far less operational weight. The pattern is worth more than the guarantee.
Ask any team that has run streaming in anger about their worst outage, and a large share will describe the same thing: a producer changed a field, no consumer was warned, and dozens of downstream jobs failed or, worse, silently corrupted data. The schema registry exists to make that class of failure impossible.
A registry (Confluent Schema Registry, Apicurio, or AWS Glue Schema Registry) stores the schema for each topic and enforces compatibility rules on every new version. Producers register schemas; the registry rejects an incompatible change before it ever reaches the log. The serialisation format is usually Avro, Protobuf, or JSON Schema, with Avro the most common in Kafka estates for its compact binary encoding and rich evolution rules.
Compatibility modes are the policy you set per topic. Backward compatibility lets new consumers read old data, which suits consumer-led upgrades. Forward compatibility lets old consumers read new data, which suits producer-led upgrades. Full compatibility enforces both and is the safest default for shared, high-fan-out topics. Pick the mode deliberately per topic rather than applying one global rule, because the right answer depends on who upgrades first.
| Mode | Allowed change | Who can upgrade first | Use when |
|---|---|---|---|
| Backward | Add optional fields, remove fields | Consumers | Many consumers, one producer |
| Forward | Add fields, remove optional fields | Producers | One consumer, evolving producers |
| Full | Add or remove optional fields only | Either order | Shared, high-fan-out topics |
| None | Anything (no checks) | Coordinated only | Avoid in production |
Here is a reference architecture that holds up across most enterprise contexts. Operational databases and application services feed Kafka via Debezium and native producers. Flink performs stateful processing and writes to both a serving store and a lakehouse. The schema registry governs every topic, and a metadata layer tracks lineage and access.
SOURCES LOG / TRANSPORT PROCESSING SERVING
+-----------+ +----------------+ +-----------+ +---------------+
| App / API |--produce-->| | | |--low-->| OLAP / KV |
+-----------+ | KAFKA (topics | | FLINK | latency| (ClickHouse, |
+-----------+ Debezium | + partitions |---> | stateful | | Pinot, Redis)|
| Postgres |---(CDC)--->| + retention) | | processing| +---------------+
+-----------+ | | | exactly- |--batch->+---------------+
+-----------+ +-------+--------+ | once | sink | Lakehouse / |
| IoT / Logs|--produce-->| | +-----------+ | Warehouse |
+-----------+ +-------+--------+ +---------------+
|
+-------v--------+ +-----------------------------+
| SCHEMA REGISTRY|<---->| Governance: lineage, access |
+----------------+ +-----------------------------+
Use these questions in order. Each one narrows your design before you commit budget.
Every choice trades one property for another. Stronger delivery guarantees cost latency and throughput. More processing state increases recovery time after a failure. Higher retention buys replay safety at the price of storage. Self-managed infrastructure cuts licence cost but raises the operational burden that quietly consumes your best engineers. The right design is not the one with the strongest guarantees everywhere; it is the one that spends complexity only where the business outcome justifies it.
Consider a familiar enterprise pattern, the kind LinkedIn pioneered when it created Kafka and that large retailers now run daily. A retailer with stores, a website, and a mobile app needs a single, accurate view of inventory across all channels within seconds, so it never sells stock it does not have.
Point-of-sale systems and the e-commerce platform write to separate operational databases. Debezium captures every stock movement from those databases as CDC events into Kafka. A Flink job maintains the running stock level per SKU per location as keyed state, applies reservations, and emits an authoritative availability event. That event feeds a low-latency serving store the website queries, and the same stream lands in the lakehouse for demand forecasting.
The design choices map directly to the framework. Latency budget is near real-time (a few seconds), so Flink windowing fits. Duplicates would oversell or undersell stock, so the reservation flow uses exactly-once while the analytics sink tolerates at-least-once. Schema governance is strict because merchandising, fulfilment, and finance all consume the availability topic. This is the same shape whether you are a retailer, a logistics firm, or a payments business; only the entities change.
Do not attempt the full architecture in one programme. Streaming platforms succeed when they grow with proven demand. The phasing below has worked across our team's collective experience standing up these systems.
These failures are predictable. Every one of them is cheaper to design out than to debug at scale.
Streaming costs split into infrastructure, licensing or managed-service fees, and the often-underestimated operational headcount. The figures below are industry estimates and vary widely by region, volume, and retention; treat them as planning anchors, not quotes.
| Component | Self-managed (open source) | Managed service | Main cost driver |
|---|---|---|---|
| Message broker | Cluster compute and storage | Throughput plus retention tiers | Events/sec and retention days |
| Stream processing | Flink cluster compute | Per-task or per-CPU pricing | State size and parallelism |
| Serving store | OLAP/KV cluster | Query and storage units | Query latency target |
| Operations | 2–4 senior engineers | Reduced, partly outsourced | SLA and platform breadth |
The point most cost models miss is people. A self-managed Kafka and Flink platform realistically needs two to four senior engineers to run well, and senior streaming talent is scarce. That salary line frequently dwarfs the infrastructure bill, which is exactly why the build-versus-buy calculation rarely turns on licence fees alone.
Our recommendation is direct. Buy managed services unless streaming infrastructure is a genuine core competency or a compliance constraint forces self-hosting. Managed Kafka (Confluent Cloud, Amazon MSK, Redpanda Cloud) and managed Flink remove the operational toil that consumes senior engineers, and the total cost of ownership usually favours managed once you price in headcount and the cost of outages during the learning curve.
Build (self-manage) when you have unusual scale economics where licence fees genuinely exceed a dedicated team, strict data-residency or air-gap requirements, or deep existing platform expertise you want to leverage. Even then, "build" rarely means writing a broker; it means operating proven open-source software (Kafka, Flink, Debezium) on your own infrastructure.
The pragmatic middle path most enterprises take: start fully managed to prove value and learn the patterns, then selectively bring high-volume workloads in-house once the economics are clear and the team is ready. This is where an engineering partner earns its keep. Mind Supernova can provide senior data engineers who have run these stacks, with 4+ hours of daily UK overlap and people who can start in 5–7 days, so you build capability without a long hiring cycle. Whether you staff up through staff augmentation or a dedicated team, the goal is the same: own the architecture, borrow the experience.
Kafka Streams is enough for stateless or lightly stateful, Kafka-native processing and runs inside your own services with no extra cluster. Choose Flink when you need large stateful joins, true event-time windowing, or end-to-end exactly-once across multiple systems. Many enterprises run both for different workloads.
Yes, within a Kafka and Flink boundary, using idempotent producers, transactions, and coordinated checkpoints. The limitation is external sinks: the guarantee only holds end to end if every sink supports idempotent or transactional writes. For most flows, idempotent consumers on at-least-once delivery are cheaper and sufficient.
Direct reads poll the database, add query load, miss deletes, and lag behind by the polling interval. Log-based CDC reads the transaction log instead, capturing every change including deletes in transaction order with near-zero source load. For enterprise scale, log-based CDC with a tool like Debezium is the standard approach.
Schema changes that break downstream consumers, followed closely by unmonitored consumer lag and skewed partition keys. A schema registry with enforced compatibility rules eliminates the first class entirely. Treating consumer lag as a first-class alert and choosing partition keys carefully addresses the other two.
Most enterprises should use a managed service. The operational burden of self-hosting Kafka and Flink typically requires two to four senior engineers, and that headcount usually outweighs any licence savings. Self-host only when scale economics, data residency, or compliance genuinely force it, and start managed regardless to learn the patterns first.
A real-time pipeline at enterprise scale is not a single product you buy. It is a layered architecture: a durable log at the core, CDC to stream what you already have, stateful processing where the value is created, and strict schema governance to keep it trustworthy. The teams that succeed resist the temptation to put exactly-once and sub-second latency everywhere, and instead spend complexity only where a business outcome demands it.
This quarter: define your latency tiers against real business outcomes, stand up a managed broker and schema registry, and ship one latency-tolerant CDC use case end to end. Next 90 days: introduce stateful processing for your first real-time workload, make idempotent consumers the team standard, and write the replay runbook before you need it.
If you want senior data engineers who have built and operated streaming platforms to accelerate that roadmap, schedule a call with our engineering team. For the wider organisational change that turns these pipelines into decisions, the companion read is our BI transformation roadmap. And if AI workloads will sit on top of this stream, see how modern data platforms for AI-driven organisations build on the same foundation, plus the broader enterprise AI stack.
How to build a data-driven organization: a practical BI transformation roadmap across people, process, technol...
A modern data platform for AI is the governed foundation that makes enterprise intelligence possible. Here is...
Modern BI architecture explained: from data warehouses to lakehouse and self-service analytics, with a referen...