Batch processing is no longer enough. Modern enterprises need to react to events as they happen - fraud detection in milliseconds, IoT telemetry at scale, real-time personalisation, and live operational dashboards. This guide walks through architecting production-grade streaming pipelines using Apache Kafka and Azure Event Hubs.
Streaming Architecture Overview
End-to-end real-time data pipeline architecture: IoT / Apps / APIs → Message Broker → Flink / Spark → DB / Dashboard
Kafka vs Azure Event Hubs: When to Use Which
Apache Kafka (self-managed)
- ✓ Full control over configuration
- ✓ Multi-cloud / on-premise
- ✓ Kafka Streams & Connect ecosystem
- ✓ Unlimited retention policies
- ⚠ Requires operational expertise

Azure Event Hubs
- ✓ Fully managed PaaS
- ✓ Kafka-compatible API
- ✓ Auto-scale with throughput units
- ✓ Native Azure integration
- ⚠ Azure ecosystem lock-in
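Because Event Hubs exposes a Kafka-compatible endpoint on port 9093, an existing Kafka client can switch over with configuration changes only. A minimal sketch of that configuration follows; the namespace name is a placeholder, and the connection string is left for you to supply:

```python
# Event Hubs speaks the Kafka protocol on port 9093 over SASL_SSL.
# The username is the literal string '$ConnectionString'; the password
# is the namespace-level Event Hubs connection string.
event_hubs_config = {
    'bootstrap.servers': 'my-namespace.servicebus.windows.net:9093',  # placeholder namespace
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'sasl.username': '$ConnectionString',
    'sasl.password': '<your-event-hubs-connection-string>',
}
```

The same dict can be passed to a `confluent_kafka.Producer` unchanged, which is what makes Event Hubs a low-friction migration target for existing Kafka workloads.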
Step-by-Step: Building the Pipeline
Define Your Topics & Partitioning Strategy
Design your topic structure around business domains, not technical components. Use partition keys to ensure ordered processing within a business entity (e.g., customer ID, device ID). Target 3-6 partitions per topic for most workloads; scale to hundreds for high-throughput scenarios.
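The ordering guarantee works because a keyed message always hashes to the same partition. Kafka's default partitioner uses murmur2 internally; the sketch below uses CRC32 purely to illustrate the stable hash-mod pattern:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a business key to a partition deterministically.
    Kafka's default partitioner uses murmur2; crc32 here is a
    stand-in to illustrate the hash-mod pattern."""
    return zlib.crc32(key.encode('utf-8')) % num_partitions

# All events for one customer land on the same partition,
# so per-customer ordering is preserved across producers.
assert partition_for('customer-42', 6) == partition_for('customer-42', 6)
```

Note the corollary: changing the partition count re-shuffles keys, so pick partition counts with headroom up front rather than resizing a live topic.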
Set Up Producers with Schema Registry
Use Apache Avro or Protobuf schemas with a Schema Registry to enforce data contracts between producers and consumers. This prevents breaking changes and enables schema evolution without downtime.
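Schema evolution in practice means new fields must carry defaults so old records remain readable. The sketch below shows a hypothetical Avro value schema for an orders topic and a backward-compatible v2; the registry client calls themselves are omitted:

```python
import json

# Hypothetical Avro value schema for an 'orders' topic (v1).
order_schema_v1 = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Backward-compatible evolution: the new field carries a default,
# so consumers on v2 can still decode records written with v1.
order_schema_v2 = {
    **order_schema_v1,
    "fields": order_schema_v1["fields"] + [
        {"name": "currency", "type": "string", "default": "USD"},
    ],
}

# A Schema Registry stores schemas as JSON strings, keyed by subject.
payload = json.dumps(order_schema_v2)
```

Registering v2 under the same subject then lets producers roll forward independently of consumers, which is the "evolution without downtime" property described above.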
Configure Consumer Groups & Processing Guarantees
Choose your processing semantics carefully: at-least-once (default, safe), at-most-once (fast, lossy), or exactly-once (Kafka Transactions / idempotent consumers). Most enterprise workloads need at-least-once with idempotent sinks.
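"At-least-once with idempotent sinks" means the sink must tolerate redelivery. A minimal sketch, using a hypothetical in-memory store (a production sink would use the target database's upsert or a unique-key constraint instead):

```python
class IdempotentSink:
    """At-least-once delivery implies duplicates; dedupe on a
    stable event ID so redeliveries are harmless no-ops."""

    def __init__(self):
        self._seen = set()
        self.rows = []

    def write(self, event: dict) -> bool:
        event_id = event['event_id']
        if event_id in self._seen:
            return False          # duplicate delivery: skip silently
        self._seen.add(event_id)
        self.rows.append(event)
        return True

sink = IdempotentSink()
sink.write({'event_id': 'e1', 'value': 10})
sink.write({'event_id': 'e1', 'value': 10})   # redelivery after a retry
```

The key design choice is that deduplication keys off an ID assigned at the producer, not at the consumer, so the same logical event is recognised no matter how many times the broker redelivers it.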
Build Stream Processing Logic
Use Apache Flink for complex event processing with windowed aggregations, or Kafka Streams for lightweight, embedded processing. For Azure-native workloads, Azure Stream Analytics provides SQL-like stream processing without infrastructure management.
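To make the windowed-aggregation idea concrete, here is a plain-Python sketch of a tumbling (fixed, non-overlapping) event-time window count, the same shape a Flink or Kafka Streams windowed aggregation produces:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_ms=60_000):
    """Count events per (window, key). Each event is a
    (timestamp_ms, key) pair; the window start is the timestamp
    rounded down to the nearest window boundary."""
    counts = defaultdict(int)
    for ts_ms, key in events:
        window_start = ts_ms - (ts_ms % window_ms)
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(5_000, 'device-1'), (59_000, 'device-1'), (61_000, 'device-1')]
# device-1 has two events in window [0s, 60s) and one in [60s, 120s)
```

Real engines add what this sketch omits: watermarks to decide when a window is complete despite late or out-of-order events, and state checkpointing so the counts survive restarts.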
Sink to Your Analytics Layer
Route processed data to your target systems: Azure Cosmos DB for operational queries, Azure Synapse for data warehousing, Power BI for real-time dashboards, or back into another Kafka topic for downstream services.
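A fan-out router for this step can be as simple as a mapping from event type to sink. The sketch below is hypothetical; the sink callables stand in for a Cosmos DB upsert, a Synapse load, or a re-publish to another topic:

```python
# Hypothetical routing table: event type -> downstream sinks.
ROUTES = {
    'order.completed': ['cosmos', 'synapse'],
    'metric.reading': ['powerbi'],
}

def route(event: dict, sinks: dict) -> list:
    """Deliver one processed event to every sink registered for
    its type; unknown types fall through to a dead-letter sink."""
    targets = ROUTES.get(event['type'], ['dead-letter'])
    for name in targets:
        sinks[name](event)
    return targets
```

Keeping the routing table as data rather than code means new destinations can be added without touching the processing logic, and the dead-letter fallback ensures no event is silently dropped.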
Sample Producer Configuration
```python
# Kafka Producer - Python
from confluent_kafka import Producer
import json

config = {
    'bootstrap.servers': 'your-cluster.kafka.azure.net:9093',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    # SASL_SSL also requires credentials, e.g.:
    # 'sasl.username': '...',
    # 'sasl.password': '...',
    'acks': 'all',                 # Durability guarantee
    'retries': 3,
    'linger.ms': 5,                # Batch for throughput
    'compression.type': 'snappy'   # Reduce network I/O
}

producer = Producer(config)

def delivery_report(err, msg):
    """Called once per message to confirm delivery or surface errors."""
    if err is not None:
        print(f'Delivery failed for key {msg.key()}: {err}')

def publish_event(topic, key, data):
    producer.produce(
        topic=topic,
        key=key.encode('utf-8'),
        value=json.dumps(data).encode('utf-8'),
        callback=delivery_report
    )
    # Serve delivery callbacks without blocking; call flush() only
    # at shutdown, since flushing per message defeats batching.
    producer.poll(0)
```
Monitoring & Observability
Critical metrics to monitor across your streaming pipeline:
- Throughput (msgs/sec)
- End-to-end latency (target < 500 ms)
- Error rate (target < 0.1%)
- Consumer lag (target < 1,000 msgs)
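Consumer lag is the single most telling of these signals: it is the gap, per partition, between the latest offset in the log and the consumer group's committed offset. A minimal sketch of the computation (offset values here are illustrative):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Lag per partition = latest log offset minus the consumer
    group's committed offset. A growing lag means consumers are
    falling behind producers."""
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

lag = consumer_lag({0: 1_500, 1: 980}, {0: 1_450, 1: 980})
# partition 0 is 50 messages behind; partition 1 is caught up
```

In production you would feed the same arithmetic from `kafka-consumer-groups.sh` output or Azure Monitor metrics and alert when lag trends upward rather than on any single reading.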
"Real-time data isn't a luxury anymore - it's the baseline expectation. The organisations that win are the ones that can act on events in seconds, not hours."
Need a real-time data architecture?
Our data engineers design and build production streaming pipelines at scale.
Talk to Our Data Team