How to Build a Real-Time Data Pipeline for Enterprise Scale

How to build a real-time data pipeline at enterprise scale: Kafka, Flink, CDC, exactly-once, schema management, and a reference architecture.

To build a real-time data pipeline for enterprise scale, you stream change events through a durable log (almost always Apache Kafka), process them with a stateful engine such as Apache Flink, enforce contracts with a schema registry, and guarantee correctness with exactly-once semantics end to end. That sentence is the whole pipeline in one breath. The hard part is the operational discipline that keeps it correct when volume, schema drift, and partial failures arrive together.

Real-time pipelines have moved from a competitive edge to table stakes. Fraud scoring, inventory sync, personalisation, observability, and operational analytics all expect freshness measured in seconds, not the overnight batch windows that defined data engineering a decade ago. Kafka now sits in production at more than 80% of the Fortune 100 [6], which tells you the architectural pattern has stabilised even as the tooling around it keeps moving.

This guide is written for CTOs, heads of platform, and data leaders who need a reference architecture they can defend in a budget review. We cover the core components, change data capture (CDC), exactly-once delivery, schema governance, a phased roadmap, the mistakes that quietly destroy data quality, realistic costs, and a build-versus-buy decision you can actually act on. If you want a second pair of senior hands on the design, teams like Mind Supernova help enterprises stand up streaming platforms without a multi-quarter ramp. You can talk to our engineering team when you reach that point.

Key Takeaways

  • A durable, partitioned log (Kafka or a managed equivalent) is the backbone. Everything else is a producer or consumer of that log.
  • Exactly-once is achievable but expensive in latency and complexity. Most enterprises should design idempotent consumers and reserve true exactly-once for financial and ledger flows.
  • CDC from operational databases via log-based tools (Debezium) is the cleanest way to stream existing systems without rewriting them.
  • A schema registry with enforced compatibility rules prevents the single most common production outage: a producer change that silently breaks every downstream consumer.
  • Build the platform yourself only if streaming is a core competency or compliance demands it. Otherwise managed services (Confluent, MSK, Redpanda Cloud) win on total cost of ownership.

What "real-time" actually means at enterprise scale

Before any architecture, agree on a definition. "Real-time" is not a single number. It is a latency budget tied to a business outcome, and conflating the tiers is how teams overspend by an order of magnitude.

Hard real-time (sub-100ms) is for fraud blocking, ad bidding, and trading. Near real-time (one to ten seconds) covers personalisation, operational dashboards, and alerting. Streaming ETL (tens of seconds to a few minutes) feeds analytics and feature stores. Each tier has a different cost curve, and the engineering effort to shave latency rises steeply at the bottom.

Scale has three independent axes: throughput (events per second), state size (how much the processing layer must remember to compute a result), and consumer fan-out (how many systems read the same stream). A pipeline doing 50,000 events per second with tiny state is far easier than 5,000 events per second that maintains a multi-terabyte windowed join. Size your design against the axis that actually stresses you, not the headline event count.

Latency tiers and their typical architecture fit
| Tier | Latency budget | Typical use case | Processing approach | Relative cost |
|---|---|---|---|---|
| Hard real-time | <100ms | Fraud block, ad bidding | In-memory stream processor, hot state | Very high |
| Near real-time | 1–10s | Personalisation, alerting | Flink or Kafka Streams, windowed | High |
| Streaming ETL | 10s–5min | Feature stores, ops analytics | Micro-batch or stream-to-lake | Moderate |
| Batch (for contrast) | Minutes–hours | Reporting, reconciliation | Scheduled jobs on the warehouse | Low |

The core components of a streaming pipeline

Every enterprise real-time pipeline, regardless of vendor, decomposes into the same five layers: ingestion, the durable log, stream processing, storage and serving, and governance. Get the responsibilities clear and the technology choices follow.

Ingestion and the durable log

Sources produce events: application emitters, IoT devices, clickstreams, and database change feeds. They all land in a partitioned, replayable log. Kafka is the default for good reason; its partitioning model gives you ordered, parallel consumption and its retention lets consumers replay history after a bug fix. Redpanda offers a Kafka-compatible API with lower operational overhead, and Apache Pulsar separates compute from storage for very large multi-tenant estates. The log is the contract boundary of your whole platform.
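To make the ordering claim concrete, here is a minimal in-memory sketch of key-based partitioning. It is an illustration, not Kafka's implementation: the partition count is hypothetical, and CRC32 stands in for whatever hash a real client library uses. The point is that every event for one key lands in one partition, so per-key order survives parallel consumption.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical partition count for the example topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is a stand-in for a real client's hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def append_all(events):
    """Append (key, value) pairs to an in-memory partitioned log."""
    log = defaultdict(list)
    for key, value in events:
        log[partition_for(key)].append((key, value))
    return log


def history(log, key):
    """Read one key's events back from the single partition that owns it."""
    return [value for k, value in log[partition_for(key)] if k == key]


# Two orders interleaved in arrival order (illustrative data).
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-2", "cancelled"),
    ("order-1", "shipped"),
]
log = append_all(events)
print(history(log, "order-1"))  # ['created', 'paid', 'shipped']
```

Order is only guaranteed within a partition, which is why the choice of key matters as much as the partition count.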

Stream processing

This layer transforms, enriches, aggregates, and joins streams. Apache Flink is the heavyweight choice for stateful processing with event-time semantics and strong exactly-once support. Kafka Streams is a library that runs inside your own services and suits simpler, Kafka-native topologies. Spark Structured Streaming fits teams already invested in Spark and comfortable with micro-batch latency. The decision usually comes down to how much state you carry and whether you need true event-time windowing.
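Event-time windowing is easier to reason about with a small example. The sketch below is plain Python rather than Flink or Kafka Streams code, and the window size and event shape are assumptions. It assigns each event to a tumbling window by the timestamp carried in the event, not by arrival order, so a late arrival still counts in the window it belongs to.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling window size


def window_start(event_time_s: int, size_s: int = WINDOW_SECONDS) -> int:
    """Return the start of the tumbling window that contains the event time."""
    return event_time_s - (event_time_s % size_s)


def count_by_event_time(events):
    """Count events per (window start, key) using the event's own timestamp."""
    counts = defaultdict(int)
    for event_time_s, key in events:
        counts[(window_start(event_time_s), key)] += 1
    return dict(counts)


# (event_time_seconds, key) in ARRIVAL order: the third event arrives late.
arrivals = [(5, "sku-1"), (65, "sku-1"), (50, "sku-1"), (70, "sku-2")]
result = count_by_event_time(arrivals)
print(result)  # {(0, 'sku-1'): 2, (60, 'sku-1'): 1, (60, 'sku-2'): 1}
```

A production engine also has to decide when a window is complete (watermarks) and what to do with events that arrive after that point. This sketch leaves both out.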

Storage, serving, and governance

Processed output lands in a warehouse, a lakehouse, an OLAP store such as ClickHouse or Apache Pinot for low-latency queries, or a key-value store for serving features. Cutting across all of this sits governance: the schema registry, lineage, and access control. Treat governance as a first-class layer, not an afterthought, because it is what keeps the platform trustworthy as it grows. For how this serving layer feeds analytics and the semantic model, see our piece on modern BI architecture.

Change data capture: streaming your existing databases

Most enterprises do not start with a clean greenfield event stream. They start with operational databases (PostgreSQL, MySQL, SQL Server, Oracle) that already hold the truth. CDC is how you turn those databases into event sources without rewriting the applications on top of them.

There are two broad approaches, and the difference matters enormously in production. Query-based CDC polls tables for changed rows using a timestamp or version column. It is simple but misses deletes, adds load to the source, and introduces polling latency. Log-based CDC reads the database transaction log directly (the write-ahead log in Postgres, the binlog in MySQL). It captures every change including deletes, imposes near-zero query load, and preserves transaction order. For enterprise scale, log-based CDC is almost always the right answer.
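The difference between the two approaches shows up in a toy model. The sketch below simulates a source table with a version column and a transaction log. The class and field names are hypothetical and nothing here talks to a real database. A poll on the version column sees the update but never learns about the delete, while reading the log sees both, in order.

```python
class SourceDb:
    """Toy source database: a table keyed by primary key plus a change log."""

    def __init__(self):
        self.table = {}
        self.tx_log = []
        self.version = 0

    def write(self, op, pk, data=None):
        self.version += 1
        if op == "delete":
            self.table.pop(pk, None)  # the row disappears from the table
        else:
            self.table[pk] = {"pk": pk, "version": self.version, **(data or {})}
        # Every operation, including deletes, is recorded in the log.
        self.tx_log.append({"op": op, "pk": pk, "version": self.version})


def poll_changes(db, since_version):
    """Query-based CDC: rows whose version column advanced since the last poll."""
    return [row for row in db.table.values() if row["version"] > since_version]


def read_log(db, since_version):
    """Log-based CDC: every logged operation since the last position."""
    return [entry for entry in db.tx_log if entry["version"] > since_version]


db = SourceDb()
db.write("insert", 1, {"qty": 10})
db.write("insert", 2, {"qty": 5})
last_seen = db.version  # both readers are caught up to here
db.write("update", 1, {"qty": 9})
db.write("delete", 2)

polled = [row["pk"] for row in poll_changes(db, last_seen)]
logged = [(e["op"], e["pk"]) for e in read_log(db, last_seen)]
print(polled)  # [1]  (the delete of row 2 is invisible)
print(logged)  # [('update', 1), ('delete', 2)]
```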

Debezium is the de facto open-source standard for log-based CDC, running as a set of Kafka Connect connectors. It snapshots the initial table state, then streams every subsequent change as a structured event. The common pattern pairs Debezium with the outbox pattern: your application writes business data and an event record in the same transaction, and Debezium streams the outbox table. This sidesteps the dual-write problem, where writing to the database and publishing to Kafka separately can leave the two permanently out of sync after a crash.
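The application side of the outbox pattern can be sketched with SQLite from the Python standard library. The table and column names are illustrative assumptions, not a schema Debezium requires. The business row and the event row are written in one transaction, so a crash between the two writes leaves neither behind.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE outbox ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " aggregate_id TEXT NOT NULL,"
    " event_type TEXT NOT NULL,"
    " payload TEXT NOT NULL)"
)


def place_order(conn, order_id, crash_before_event=False):
    """Write the order and its outbox event atomically."""
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        conn.execute(
            "INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,)
        )
        if crash_before_event:
            raise RuntimeError("simulated crash between the two writes")
        conn.execute(
            "INSERT INTO outbox (aggregate_id, event_type, payload) VALUES (?, ?, ?)",
            (order_id, "OrderPlaced", json.dumps({"order_id": order_id})),
        )


place_order(conn, "o-1")
try:
    place_order(conn, "o-2", crash_before_event=True)
except RuntimeError:
    pass  # the whole transaction was rolled back

orders = [r[0] for r in conn.execute("SELECT id FROM orders ORDER BY id")]
events = [r[0] for r in conn.execute("SELECT aggregate_id FROM outbox ORDER BY id")]
print(orders, events)  # ['o-1'] ['o-1']
```

A log-based CDC connector would then stream rows from the outbox table. The application never publishes to the broker directly, so there is no second write to fall out of sync.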

CDC approaches compared
ApproachCaptures deletesSource loadLatencyOrderingBest for
Query-based pollingNoHighPolling intervalWeakSmall tables, quick prototypes
Log-based (Debezium)YesVery lowSub-secondTransaction-orderedEnterprise CDC at scale
Trigger-basedYesHighLowStrongLegacy DBs without log access
Outbox + log-basedYesLowSub-secondTransaction-orderedReliable app-driven events

Exactly-once semantics: the promise and the price

Distributed systems give you three delivery guarantees: at-most-once (fast, can lose data), at-least-once (no loss, can duplicate), and exactly-once (no loss, no duplicates). Most pipelines naturally land on at-least-once, which means duplicates will happen and you must handle them.

Exactly-once processing is genuinely achievable within a Kafka and Flink stack. Kafka supports idempotent producers and transactions, and Flink coordinates these with checkpoint barriers so that state changes and output commits happen atomically. The result is end-to-end exactly-once within the streaming system. The catch is the boundary: once data leaves for an external system (a REST API, a non-transactional sink), the guarantee depends on that sink supporting idempotent or transactional writes too.
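The coordination idea can be shown with a toy simulation. This is not Flink's implementation: the sink, the checkpoint structure, and the crash injection are all hypothetical. It is loosely modelled on the principle described above, where state is snapshotted and output committed together. After a crash, uncommitted output is discarded and processing resumes from the last snapshot, so a record that is processed twice still affects the committed output once.

```python
class SimulatedCrash(Exception):
    pass


class TransactionalSink:
    """Toy sink: writes stay pending until commit() makes them visible."""

    def __init__(self):
        self.committed = []
        self._pending = []

    def write(self, record):
        self._pending.append(record)

    def commit(self):
        self.committed.extend(self._pending)
        self._pending = []

    def abort(self):
        self._pending = []


def run(source, sink, checkpoint_every, crash_at=None):
    """Emit a running total per record; crash once at offset crash_at."""
    checkpoint = {"offset": 0, "total": 0}  # durable snapshot of offset + state
    crashed = False
    while True:
        # (Re)start from the last snapshot.
        offset, total = checkpoint["offset"], checkpoint["total"]
        try:
            while offset < len(source):
                if offset == crash_at and not crashed:
                    crashed = True
                    raise SimulatedCrash()
                total += source[offset]
                sink.write(total)
                offset += 1
                if offset % checkpoint_every == 0:
                    # Snapshot state and commit output as one step.
                    checkpoint = {"offset": offset, "total": total}
                    sink.commit()
            checkpoint = {"offset": offset, "total": total}
            sink.commit()
            return total
        except SimulatedCrash:
            sink.abort()  # drop output produced since the last snapshot


sink = TransactionalSink()
final_total = run([1, 2, 3, 4, 5], sink, checkpoint_every=2, crash_at=3)
print(final_total, sink.committed)  # 15 [1, 3, 6, 10, 15]
```

The record at offset 2 is processed twice, once before the crash and once after recovery. Its first output was never committed, so only the second one becomes visible.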

The price is real. Exactly-once adds coordination latency, increases checkpoint overhead, and complicates failure recovery. Our recommendation for most enterprises: build idempotent consumers as the default discipline (use a natural key or a deduplication store), and reserve true exactly-once for flows where a duplicate is financially or legally unacceptable, such as ledger postings, billing, and payments. Idempotency gives you correctness with far less operational weight. The pattern is worth more than the guarantee.

A practical idempotency pattern

  1. Assign every event a stable, deterministic key (an order ID, a transaction UUID), never a generated timestamp.
  2. On the consumer, check whether the key has already been processed using a fast store (Redis, a database unique constraint).
  3. Process and persist the result and the processed-key marker in a single transaction where possible.
  4. Treat a duplicate as a no-op, not an error. Log it, do not alert on it.
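The four steps above can be sketched with SQLite standing in for the consumer's database. The table names and event shape are assumptions made for illustration. A primary-key constraint on the processed-key table acts as the deduplication check, and the marker and the business update share one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_events (event_key TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO balances (account, amount) VALUES ('acc-1', 0)")
conn.commit()


def handle(conn, event):
    """Apply an event once. Returns False (a no-op) if the key was seen before."""
    try:
        with conn:  # one transaction: marker + business update
            # Steps 2 and 3: the unique constraint rejects an already-seen key.
            conn.execute(
                "INSERT INTO processed_events (event_key) VALUES (?)",
                (event["key"],),
            )
            conn.execute(
                "UPDATE balances SET amount = amount + ? WHERE account = ?",
                (event["amount"], event["account"]),
            )
        return True
    except sqlite3.IntegrityError:
        # Step 4: a duplicate is expected under at-least-once delivery.
        return False


deliveries = [
    {"key": "txn-001", "account": "acc-1", "amount": 10},
    {"key": "txn-002", "account": "acc-1", "amount": 5},
    {"key": "txn-001", "account": "acc-1", "amount": 10},  # redelivered
]
results = [handle(conn, e) for e in deliveries]
balance = conn.execute(
    "SELECT amount FROM balances WHERE account = 'acc-1'"
).fetchone()[0]
print(results, balance)  # [True, True, False] 15
```

Because the marker insert and the balance update commit together, a crash between them cannot leave a key marked as processed without its effect, or the reverse.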

Schema governance: the registry that prevents 3am incidents

Ask any team that has run streaming in anger about their worst outage, and a large share will describe the same thing: a producer changed a field, no consumer was warned, and dozens of downstream jobs failed or, worse, silently corrupted data. The schema registry exists to stop that class of failure before it reaches production.

A registry (Confluent Schema Registry, Apicurio, or AWS Glue Schema Registry) stores the schema for each topic and enforces compatibility rules on every new version. Producers register schemas; the registry rejects an incompatible change before it ever reaches the log. The serialisation format is usually Avro, Protobuf, or JSON Schema, with Avro the most common in Kafka estates for its compact binary encoding and rich evolution rules.

Compatibility modes are the policy you set per topic. Backward compatibility lets new consumers read old data, which suits consumer-led upgrades. Forward compatibility lets old consumers read new data, which suits producer-led upgrades. Full compatibility enforces both and is the safest default for shared, high-fan-out topics. Pick the mode deliberately per topic rather than applying one global rule, because the right answer depends on who upgrades first.

Schema compatibility modes and what they allow
| Mode | Allowed change | Who can upgrade first | Use when |
|---|---|---|---|
| Backward | Add optional fields, remove fields | Consumers | Many consumers, one producer |
| Forward | Add fields, remove optional fields | Producers | One consumer, evolving producers |
| Full | Add or remove optional fields only | Either order | Shared, high-fan-out topics |
| None | Anything (no checks) | Coordinated only | Avoid in production |
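The rules in the table can be expressed as a small checker. This is a deliberately simplified model, not the logic any real registry runs. A schema here is a hypothetical dict of field name to an "is optional" flag, and type changes, renames, and nested records are ignored.

```python
def is_backward_compatible(old, new):
    """A reader on the NEW schema can read OLD data: added fields must be optional."""
    added = new.keys() - old.keys()
    return all(new[field] for field in added)


def is_forward_compatible(old, new):
    """A reader on the OLD schema can read NEW data: removed fields must be optional."""
    removed = old.keys() - new.keys()
    return all(old[field] for field in removed)


def is_full_compatible(old, new):
    return is_backward_compatible(old, new) and is_forward_compatible(old, new)


# Field name -> True if optional (has a default), False if required.
v1 = {"order_id": False, "note": True}

add_optional = {**v1, "channel": True}
add_required = {**v1, "channel": False}
drop_required = {"note": True}
drop_optional = {"order_id": False}

print(is_full_compatible(v1, add_optional))       # True
print(is_backward_compatible(v1, add_required))   # False
print(is_forward_compatible(v1, add_required))    # True
print(is_backward_compatible(v1, drop_required))  # True
print(is_forward_compatible(v1, drop_required))   # False
print(is_full_compatible(v1, drop_optional))      # True
```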

Reference architecture, decision framework, and trade-offs

Here is a reference architecture that holds up across most enterprise contexts. Operational databases and application services feed Kafka via Debezium and native producers. Flink performs stateful processing and writes to both a serving store and a lakehouse. The schema registry governs every topic, and a metadata layer tracks lineage and access.

 SOURCES                  LOG / TRANSPORT         PROCESSING            SERVING
 +-----------+            +----------------+      +-----------+        +---------------+
 | App / API |--produce-->|                |      |           |--low-->| OLAP / KV     |
 +-----------+            |  KAFKA (topics |      |  FLINK    | latency| (ClickHouse,  |
 +-----------+   Debezium |  + partitions  |--->  | stateful  |        |  Pinot, Redis)|
 | Postgres  |---(CDC)--->|  + retention)  |      | processing|        +---------------+
 +-----------+            |                |      |  exactly- |-batch->+---------------+
 +-----------+            |                |      |  once     | sink   | Lakehouse /   |
 | IoT / Logs|--produce-->|                |      +-----------+        | Warehouse     |
 +-----------+            +-------+--------+                           +---------------+
                                  |
                          +-------v--------+      +-----------------------------+
                          | SCHEMA REGISTRY|<---->| Governance: lineage, access |
                          +----------------+      +-----------------------------+
Reference real-time pipeline: CDC and app events into Kafka, Flink processing with exactly-once, dual serving and lakehouse sinks, governed by a schema registry.

Decision framework

Use these questions in order. Each one narrows your design before you commit budget.

  1. What is the latency budget tied to a business outcome? If anything over a minute is acceptable, question whether you need a dedicated stream processor at all. Micro-batch streaming ETL into a lakehouse may be enough.
  2. How much state does processing require? Large windowed joins push you toward Flink; stateless transforms suit Kafka Streams.
  3. Do duplicates cause financial or legal harm? If yes, design exactly-once for that flow. If no, idempotent at-least-once is cheaper and simpler.
  4. How many teams consume the same data? High fan-out demands strict schema governance and a platform team to own the contract.
  5. Is streaming a core competency or a means to an end? This answer drives the build-versus-buy call below.
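The five questions can be written down as a small function, which is a handy way to record the answers for a given flow in a design review. The field names and output labels are hypothetical. The only threshold taken from the text is the one-minute latency boundary, and the rest are yes/no answers you supply.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    latency_budget_s: float    # Q1: latency budget tied to the business outcome
    large_state: bool          # Q2: large windowed joins or similar state
    duplicates_harmful: bool   # Q3: duplicates cause financial or legal harm
    high_fan_out: bool         # Q4: many teams consume the same data
    streaming_is_core: bool    # Q5: streaming is a core competency


def recommend(req: Requirements) -> dict:
    """Translate the answers into the design leanings described in the list above."""
    return {
        # Q1: over a minute, micro-batch or streaming ETL may be enough.
        "needs_stream_processor": req.latency_budget_s <= 60,
        # Q2: large state pushes toward Flink, otherwise Kafka Streams.
        "processor": "Flink" if req.large_state else "Kafka Streams",
        # Q3: exactly-once only where duplicates do real harm.
        "delivery": "exactly-once" if req.duplicates_harmful
        else "idempotent at-least-once",
        # Q4: high fan-out needs strict governance and a platform team.
        "strict_schema_governance": req.high_fan_out,
        # Q5: feeds the build-versus-buy call.
        "sourcing": "self-manage is an option" if req.streaming_is_core
        else "buy managed",
    }


inventory_flow = Requirements(
    latency_budget_s=5,
    large_state=True,
    duplicates_harmful=True,
    high_fan_out=True,
    streaming_is_core=False,
)
print(recommend(inventory_flow))
```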

Trade-off analysis

Every choice trades one property for another. Stronger delivery guarantees cost latency and throughput. More processing state increases recovery time after a failure. Higher retention buys replay safety at the price of storage. Self-managed infrastructure cuts licence cost but raises the operational burden that quietly consumes your best engineers. The right design is not the one with the strongest guarantees everywhere; it is the one that spends complexity only where the business outcome justifies it.

A real-world pattern: streaming inventory for omnichannel retail

Consider a familiar enterprise pattern, built on the log-centric approach LinkedIn pioneered when it created Kafka and that large retailers now run daily. A retailer with stores, a website, and a mobile app needs a single, accurate view of inventory across all channels within seconds, so it never sells stock it does not have.

Point-of-sale systems and the e-commerce platform write to separate operational databases. Debezium captures every stock movement from those databases as CDC events into Kafka. A Flink job maintains the running stock level per SKU per location as keyed state, applies reservations, and emits an authoritative availability event. That event feeds a low-latency serving store the website queries, and the same stream lands in the lakehouse for demand forecasting.
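The keyed-state logic of that job can be sketched in plain Python. This is not Flink code, and the event types and field names are hypothetical. State is kept per (SKU, location), and every stock movement or reservation emits a fresh availability event.

```python
from collections import defaultdict


def run_inventory(events):
    """Maintain stock and reservations per (sku, location); emit availability."""
    on_hand = defaultdict(int)   # keyed state: units physically in stock
    reserved = defaultdict(int)  # keyed state: units held by reservations
    availability = []
    for event in events:
        key = (event["sku"], event["location"])
        if event["type"] == "stock_movement":
            on_hand[key] += event["delta"]   # positive = receipt, negative = sale
        elif event["type"] == "reservation":
            reserved[key] += event["qty"]
        else:
            continue  # ignore event types this job does not own
        availability.append(
            {
                "sku": event["sku"],
                "location": event["location"],
                "available": on_hand[key] - reserved[key],
            }
        )
    return availability


# Illustrative CDC-style events from two channels.
events = [
    {"type": "stock_movement", "sku": "sku-1", "location": "store-a", "delta": 10},
    {"type": "reservation", "sku": "sku-1", "location": "store-a", "qty": 3},
    {"type": "stock_movement", "sku": "sku-1", "location": "store-a", "delta": -2},
    {"type": "stock_movement", "sku": "sku-1", "location": "web", "delta": 5},
]
out = run_inventory(events)
print([e["available"] for e in out])  # [10, 7, 5, 5]
```

In a real job this state would live in the processor's state backend and be restored from checkpoints after a failure. Here it is just two dictionaries.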

The design choices map directly to the framework. Latency budget is near real-time (a few seconds), so a stateful Flink job fits. Duplicates would oversell or undersell stock, so the reservation flow uses exactly-once while the analytics sink tolerates at-least-once. Schema governance is strict because merchandising, fulfilment, and finance all consume the availability topic. This is the same shape whether you are a retailer, a logistics firm, or a payments business; only the entities change.

A phased implementation roadmap

Do not attempt the full architecture in one programme. Streaming platforms succeed when they grow with proven demand. The phasing below has worked across our team's collective experience standing up these systems.

Phase 1 (weeks 1–6): foundation and one use case

  • Stand up Kafka (managed to start), the schema registry, and basic monitoring.
  • Pick one high-value, latency-tolerant use case. Resist the urge to start with the hardest flow.
  • Implement CDC for one source database with Debezium and the outbox pattern.
  • Establish schema compatibility policy and a deployment process for connectors.

Phase 2 (weeks 7–16): stateful processing and serving

  • Introduce Flink for the first stateful job; define checkpointing and state backends.
  • Add a serving store and prove an end-to-end latency target with real load.
  • Build idempotent consumers as the standard pattern across the team.
  • Add lineage and a data catalogue so consumers can discover topics safely.

Phase 3 (months 4–9): scale, harden, and platformise

  • Introduce exactly-once only for the flows that need it, with documented justification.
  • Establish a platform team owning golden paths so product teams self-serve. This mirrors the broader move toward internal platforms covered in cloud-native application development.
  • Add multi-region replication and a disaster-recovery runbook with tested replay.
  • Implement quota, cost attribution per team, and capacity planning.

Common mistakes that quietly destroy pipelines

These failures are predictable. Every one of them is cheaper to design out than to debug at scale.

  • Treating Kafka as a database. The log is transport and short-to-medium-term replay, not your system of record. Project state into a proper store.
  • No schema registry. Skipping it to "move fast" guarantees a breaking change will take down consumers within months.
  • Poor partition key choice. A skewed key sends most traffic to one partition, capping throughput regardless of how many brokers you add.
  • Ignoring backpressure. A slow consumer or sink will silently grow lag until retention expires and data is lost. Monitor consumer lag as a first-class metric.
  • Chasing exactly-once everywhere. You pay the full latency and complexity cost for flows that would have been fine with idempotency.
  • No replay strategy. When a processing bug corrupts output, you need to reprocess from the log. If retention is too short, you cannot.
  • Underestimating state management. Large Flink state means slow recovery and big checkpoints. Plan state size as carefully as throughput.
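Two of these mistakes, unmonitored lag and skewed keys, are cheap to detect from offsets alone. The helpers below are a sketch over plain dictionaries. The offsets, the alert threshold, and the assumption that every partition starts at offset 0 are illustrative, and in practice the numbers would come from your broker's admin or metrics API.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: log end offset minus the consumer's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)
        for partition, end in end_offsets.items()
    }


def lagging_partitions(lag, threshold):
    """Partitions whose lag exceeds the alert threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)


def hottest_partition_share(message_counts):
    """Fraction of all messages held by the busiest partition (skew indicator)."""
    total = sum(message_counts.values())
    return max(message_counts.values()) / total if total else 0.0


# Illustrative offsets for a three-partition topic.
end_offsets = {0: 1000, 1: 9000, 2: 1100}
committed = {0: 990, 1: 4000, 2: 1100}

lag = partition_lag(end_offsets, committed)
print(lag)                                  # {0: 10, 1: 5000, 2: 0}
print(lagging_partitions(lag, 1000))        # [1]
print(hottest_partition_share(end_offsets) > 0.8)  # True
```

With evenly distributed keys, the hottest partition's share sits close to one divided by the partition count. A share far above that points to a partition key worth revisiting.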

Cost considerations

Streaming costs split into infrastructure, licensing or managed-service fees, and the often-underestimated operational headcount. The figures below are industry estimates and vary widely by region, volume, and retention; treat them as planning anchors, not quotes.

Indicative annual cost components for an enterprise pipeline (estimates)
| Component | Self-managed (open source) | Managed service | Main cost driver |
|---|---|---|---|
| Message broker | Cluster compute and storage | Throughput plus retention tiers | Events/sec and retention days |
| Stream processing | Flink cluster compute | Per-task or per-CPU pricing | State size and parallelism |
| Serving store | OLAP/KV cluster | Query and storage units | Query latency target |
| Operations | 2–4 senior engineers | Reduced, partly outsourced | SLA and platform breadth |

The point most cost models miss is people. A self-managed Kafka and Flink platform realistically needs two to four senior engineers to run well, and senior streaming talent is scarce. That salary line frequently dwarfs the infrastructure bill, which is exactly why the build-versus-buy calculation rarely turns on licence fees alone.

Build versus buy

Our recommendation is direct. Buy managed services unless streaming infrastructure is a genuine core competency or a compliance constraint forces self-hosting. Managed Kafka (Confluent Cloud, Amazon MSK, Redpanda Cloud) and managed Flink remove the operational toil that consumes senior engineers, and the total cost of ownership usually favours managed once you price in headcount and the cost of outages during the learning curve.

Build (self-manage) when you have unusual scale economics where licence fees genuinely exceed a dedicated team, strict data-residency or air-gap requirements, or deep existing platform expertise you want to leverage. Even then, "build" rarely means writing a broker; it means operating proven open-source software (Kafka, Flink, Debezium) on your own infrastructure.

The pragmatic middle path most enterprises take: start fully managed to prove value and learn the patterns, then selectively bring high-volume workloads in-house once the economics are clear and the team is ready. This is where an engineering partner earns its keep. Mind Supernova can provide senior data engineers who have run these stacks, with 4+ hours of daily UK overlap and people who can start in 5–7 days, so you build capability without a long hiring cycle. Whether you staff up through staff augmentation or a dedicated team, the goal is the same: own the architecture, borrow the experience.

Frequently asked questions

Do I need both Kafka and Flink, or can Kafka Streams do the job?

Kafka Streams is enough for stateless or lightly stateful, Kafka-native processing and runs inside your own services with no extra cluster. Choose Flink when you need large stateful joins, true event-time windowing, or end-to-end exactly-once across multiple systems. Many enterprises run both for different workloads.

Is exactly-once delivery really achievable?

Yes, within a Kafka and Flink boundary, using idempotent producers, transactions, and coordinated checkpoints. The limitation is external sinks: the guarantee only holds end to end if every sink supports idempotent or transactional writes. For most flows, idempotent consumers on at-least-once delivery are cheaper and sufficient.

How is CDC different from just reading from the database?

Direct reads poll the database, add query load, miss deletes, and lag behind by the polling interval. Log-based CDC reads the transaction log instead, capturing every change including deletes in transaction order with near-zero source load. For enterprise scale, log-based CDC with a tool like Debezium is the standard approach.

What is the biggest cause of production incidents in streaming pipelines?

Schema changes that break downstream consumers, followed closely by unmonitored consumer lag and skewed partition keys. A schema registry with enforced compatibility rules blocks incompatible schema changes before they reach consumers. Treating consumer lag as a first-class alert and choosing partition keys carefully addresses the other two.

Should we self-host Kafka or use a managed service?

Most enterprises should use a managed service. The operational burden of self-hosting Kafka and Flink typically requires two to four senior engineers, and that headcount usually outweighs any licence savings. Self-host only when scale economics, data residency, or compliance genuinely force it, and start managed regardless to learn the patterns first.

Conclusion: build the log first, the guarantees second

A real-time pipeline at enterprise scale is not a single product you buy. It is a layered architecture: a durable log at the core, CDC to stream what you already have, stateful processing where the value is created, and strict schema governance to keep it trustworthy. The teams that succeed resist the temptation to put exactly-once and sub-second latency everywhere, and instead spend complexity only where a business outcome demands it.

This quarter: define your latency tiers against real business outcomes, stand up a managed broker and schema registry, and ship one latency-tolerant CDC use case end to end. Next 90 days: introduce stateful processing for your first real-time workload, make idempotent consumers the team standard, and write the replay runbook before you need it.

If you want senior data engineers who have built and operated streaming platforms to accelerate that roadmap, schedule a call with our engineering team. For the wider organisational change that turns these pipelines into decisions, the companion read is our BI transformation roadmap. And if AI workloads will sit on top of this stream, see how modern data platforms for AI-driven organisations build on the same foundation, plus the broader enterprise AI stack.

References

  1. Gartner. Worldwide IT and cloud spending forecasts, 2024–2026. https://www.gartner.com/en/newsroom/press-releases/2024-11-19-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-total-723-billion-dollars-in-2025
  2. CNCF. Annual Survey 2025 (Kubernetes and cloud-native adoption). https://www.cncf.io/reports/cncf-annual-survey-2025/
  3. Flexera. 2025 State of the Cloud Report (cost, multi-cloud, repatriation). https://www.flexera.com/blog/finops/the-latest-cloud-computing-trends-flexera-2025-state-of-the-cloud-report/
  4. IBM. Cost of a Data Breach Report 2025. https://www.ibm.com/reports/data-breach
  5. Stack Overflow. 2025 Developer Survey (tooling and adoption). https://stackoverflow.co/company/press/archive/stack-overflow-2025-developer-survey/
  6. Confluent / Apache Kafka. Kafka adoption across the Fortune 100. https://kafka.apache.org/
  7. Wavestone. 2024 Data and AI Leadership Executive Survey. https://www.wavestone.com/en/news/2024-data-and-ai-leadership-executive-survey-41/