
Building Real-Time Data Pipelines with Apache Kafka & Azure Event Hubs

By Hibba Limited · September 2025 · 10 min read

Batch processing is no longer enough. Modern enterprises need to react to events as they happen - fraud detection in milliseconds, IoT telemetry at scale, real-time personalisation, and live operational dashboards. This guide walks through architecting a production-grade streaming pipeline using Apache Kafka and Azure Event Hubs.

Streaming Architecture Overview

End-to-end real-time data pipeline architecture:

Producers (IoT / Apps / APIs) → Kafka / Event Hubs (Message Broker) → Stream Processing (Flink / Spark) → Sink / Analytics (DB / Dashboard)

Kafka vs Azure Event Hubs: When to Use Which

Apache Kafka
  • ✓ Full control over configuration
  • ✓ Multi-cloud / on-premises
  • ✓ Kafka Streams & Connect ecosystem
  • ✓ Unlimited retention policies
  • ⚠ Requires operational expertise
Azure Event Hubs
  • ✓ Fully managed PaaS
  • ✓ Kafka-compatible API
  • ✓ Auto-scale with throughput units
  • ✓ Native Azure integration
  • ⚠ Azure ecosystem lock-in

Step-by-Step: Building the Pipeline

1. Define Your Topics & Partitioning Strategy

Design your topic structure around business domains, not technical components. Use partition keys to ensure ordered processing within a business entity (e.g., customer ID, device ID). Target 3-6 partitions per topic for most workloads; scale to hundreds for high-throughput scenarios.
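A quick way to see why the partition key matters: the broker hashes each record's key to pick a partition, so every event for the same business entity lands on the same partition and is consumed in order. The sketch below illustrates the idea with a CRC32 hash; Kafka's default partitioner actually uses murmur2, and the constants here are illustrative.

```python
# Illustration: keyed events map deterministically to partitions,
# which is what preserves per-key ordering. Kafka's default
# partitioner uses murmur2; zlib.crc32 here is just for demonstration.
import zlib

NUM_PARTITIONS = 6  # illustrative; matches the 3-6 guideline above

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition index."""
    return zlib.crc32(key.encode('utf-8')) % num_partitions

# All events keyed by the same customer ID hit the same partition,
# so they are processed in order for that customer.
p = partition_for('customer-42')
assert all(partition_for('customer-42') == p for _ in range(100))
```

Changing the partition count remaps keys, which is why it is worth sizing partitions up front rather than resizing a live topic.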

2. Set Up Producers with Schema Registry

Use Apache Avro or Protobuf schemas with a Schema Registry to enforce data contracts between producers and consumers. This prevents breaking changes and enables schema evolution without downtime.
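As an illustration, an Avro schema for an order event might look like the following (the record and field names are hypothetical). Adding the optional `channel` field with a default is a backward-compatible evolution that a Schema Registry running in backward-compatibility mode would accept:

```json
{
  "type": "record",
  "name": "OrderPlaced",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "customer_id", "type": "string"},
    {"name": "amount", "type": "double"},
    {"name": "channel", "type": ["null", "string"], "default": null}
  ]
}
```

Because `channel` has a default, consumers on the old schema can still read new records, and consumers on the new schema can read old ones.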

3. Configure Consumer Groups & Processing Guarantees

Choose your processing semantics carefully: at-least-once (default, safe), at-most-once (fast, lossy), or exactly-once (Kafka Transactions / idempotent consumers). Most enterprise workloads need at-least-once with idempotent sinks.
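At-least-once delivery means the broker may redeliver an event after a consumer restart, so the sink must tolerate duplicates. A minimal sketch of an idempotent sink, deduplicating on a stable event ID (the in-memory set stands in for whatever dedupe store your real sink uses):

```python
# Sketch: at-least-once delivery can produce duplicates; an idempotent
# sink drops redeliveries by checking a stable event ID before writing.
processed_ids: set = set()
sink: list = []

def write_idempotent(event: dict) -> bool:
    """Write the event unless its ID has already been processed."""
    if event['event_id'] in processed_ids:
        return False              # duplicate redelivery: safely ignored
    processed_ids.add(event['event_id'])
    sink.append(event)
    return True

# The broker redelivers event 'e1' after a consumer restart:
write_idempotent({'event_id': 'e1', 'amount': 10})
write_idempotent({'event_id': 'e1', 'amount': 10})  # duplicate, dropped
assert len(sink) == 1
```

In production the dedupe state typically lives in the sink itself (a unique-key constraint or an upsert), not in consumer memory.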

4. Build Stream Processing Logic

Use Apache Flink for complex event processing with windowed aggregations, or Kafka Streams for lightweight, embedded processing. For Azure-native workloads, Azure Stream Analytics provides SQL-like stream processing without infrastructure management.
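The core of a windowed aggregation is simple: bucket each event by flooring its timestamp to the window size, then aggregate per (key, window). A plain-Python sketch of a 60-second tumbling-window count, the kind of operation Flink or Kafka Streams would run for you (window size and key names are illustrative):

```python
# Sketch: 60-second tumbling-window event count per key.
# Window start = event timestamp floored to the window size.
from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(ts: float) -> int:
    """Floor a timestamp to the start of its tumbling window."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS

counts = defaultdict(int)   # (key, window_start) -> event count

def process(key: str, ts: float) -> None:
    counts[(key, window_start(ts))] += 1

for ts in (0, 15, 59, 61, 125):
    process('device-7', ts)

# Timestamps 0, 15, 59 share window 0; 61 falls in window 60;
# 125 falls in window 120.
assert counts[('device-7', 0)] == 3
assert counts[('device-7', 60)] == 1
```

Real stream processors add what this sketch omits: event-time watermarks, late-arrival handling, and fault-tolerant state.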

5. Sink to Your Analytics Layer

Route processed data to your target systems: Azure Cosmos DB for operational queries, Azure Synapse for data warehousing, Power BI for real-time dashboards, or back into another Kafka topic for downstream services.
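Fan-out to multiple targets is usually a routing table from destination name to sink writer. A minimal sketch (the handler lists stand in for real Cosmos DB / Synapse / Kafka producer clients, and the `destinations` field is a hypothetical envelope attribute):

```python
# Sketch: route each processed event to one or more sinks.
# The lists are stand-ins for real sink clients.
operational_store: list = []   # e.g. Cosmos DB writes
warehouse: list = []           # e.g. Synapse staging

ROUTES = {
    'operational': operational_store.append,
    'warehouse': warehouse.append,
}

def route(event: dict) -> None:
    """Dispatch an event to every sink named in its envelope."""
    for dest in event.get('destinations', []):
        ROUTES[dest](event)

route({'id': 1, 'destinations': ['operational', 'warehouse']})
route({'id': 2, 'destinations': ['warehouse']})
assert len(operational_store) == 1 and len(warehouse) == 2
```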

Sample Producer Configuration

# Kafka Producer - Python
from confluent_kafka import Producer
import json

config = {
    'bootstrap.servers': 'your-cluster.kafka.azure.net:9093',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'acks': 'all',                    # Wait for all in-sync replicas
    'retries': 3,
    'linger.ms': 5,                   # Batch for throughput
    'compression.type': 'snappy'      # Reduce network I/O
}

producer = Producer(config)

def delivery_report(err, msg):
    if err is not None:
        print(f'Delivery failed for {msg.key()}: {err}')

def publish_event(topic, key, data):
    producer.produce(
        topic=topic,
        key=key.encode('utf-8'),
        value=json.dumps(data).encode('utf-8'),
        callback=delivery_report
    )
    producer.poll(0)    # Serve delivery callbacks without blocking

# Flush once at shutdown - flushing per message defeats batching
producer.flush()

Monitoring & Observability

Critical metrics to monitor across your streaming pipeline

  • Consumer lag: target < 1,000 messages behind the head of the log
  • End-to-end latency: target < 500 ms from produce to sink
  • Error rate: target < 0.1% of processed messages
  • Throughput: messages/sec, tracked against your expected baseline

"Real-time data isn't a luxury anymore - it's the baseline expectation. The organisations that win are the ones that can act on events in seconds, not hours."

Need a real-time data architecture?

Our data engineers design and build production streaming pipelines at scale.

Talk to Our Data Team