Kafka in Production: A Practical Guide for Trading Systems
Introduction
Kafka is the backbone of most modern trading data pipelines. This guide covers everything you need to know to design, configure, and operate Kafka reliably at scale — from producer idempotency to consumer group rebalancing to scaling strategies.
1. Cluster Setup
Never run a single broker in production. The minimum viable setup is 3 brokers with a KRaft quorum.
replication.factor = 3 # each partition stored on 3 brokers
min.insync.replicas = 2 # at least 2 must confirm writes
This combination means you can lose 1 broker and still accept writes and serve reads. With min.insync.replicas=2 and replication.factor=3, producers using acks=all will still get confirmation as long as 2 of 3 brokers are alive.
KRaft (no ZooKeeper) requires an odd number of controller nodes for majority voting — 3, 5, or 7. In large clusters, separate controller and broker roles to avoid resource contention.
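The majority rule can be made concrete with a little arithmetic — a minimal sketch, not a Kafka API:

```java
// Quorum arithmetic for KRaft controllers: a cluster of n voters needs a
// majority of floor(n/2) + 1 to elect a leader, so it tolerates
// floor((n - 1) / 2) failures. Even counts add cost without adding tolerance.
public class Quorum {
    static int majority(int controllers) {
        return controllers / 2 + 1;
    }

    static int tolerableFailures(int controllers) {
        return (controllers - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n : new int[]{3, 4, 5}) {
            System.out.println(n + " controllers: majority=" + majority(n)
                    + ", tolerates " + tolerableFailures(n) + " failure(s)");
        }
        // 3 and 4 controllers both tolerate exactly 1 failure,
        // which is why odd counts are the convention.
    }
}
```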
2. Topic Design — Plan Partitions Upfront
The most important rule in Kafka operations: over-partition at creation. You can increase partitions later, but it changes key routing (more on this in the scaling section) and you can never reduce partitions. Design for 10-20x your current expected volume.
partitions = max_expected_consumers × 2-3
Partition key strategy
The partition key determines which partition a message lands on:
hash(key) % num_partitions = partition number
Same key → always same partition → ordering guaranteed per key. For trading systems:
- Trades: key by symbol — all BTCUSD trades go to the same partition, preserving time ordering per symbol
- Orders: key by order_id — all events for the same order are ordered
- Orderbook: key by symbol — latest snapshot per symbol stays ordered
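The routing mechanics can be sketched in a few lines. Note that Kafka's default partitioner actually uses murmur2 over the serialized key bytes; String.hashCode below is a stand-in to show the mechanics, not the real hash:

```java
import java.util.List;

// Sketch of key-based partition routing: hash(key) % num_partitions.
// String.hashCode stands in for Kafka's murmur2 hash for illustration.
public class PartitionRouting {
    static int partitionFor(String key, int numPartitions) {
        // mask the sign bit so the modulo result is always non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 12;
        for (String symbol : List.of("BTCUSD", "ETHUSD", "BTCUSD")) {
            System.out.println(symbol + " -> partition " + partitionFor(symbol, partitions));
        }
        // Both BTCUSD messages land on the same partition,
        // so per-symbol ordering is preserved.
    }
}
```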
Topic design by data type
| Topic | Partitions | Retention | Cleanup Policy |
|---|---|---|---|
| orders | 12-24 | 7 days+ | delete |
| market-trades | 24-48 | 24 hours | delete |
| orderbook | 12-24 | 1 hour | compact |
| audit | 6-12 | Years | delete |
Log compaction (cleanup.policy=compact) is useful for orderbook snapshots — the broker keeps only the latest message per key, discarding older snapshots for the same symbol.
3. Producer Configuration
Core reliability settings
acks=all
enable.idempotence=true
retries=2147483647
max.in.flight.requests.per.connection=5
acks=all — the broker leader waits for all in-sync replicas to confirm before ACKing the producer. Combined with min.insync.replicas=2, this ensures the message is on at least 2 brokers before you get a success response.
enable.idempotence=true — the broker assigns each producer a PID (Producer ID) and tracks a sequence number per partition. If a retry arrives with the same sequence number, the broker discards it. This prevents duplicates caused by retried sends within a single producer session.
Note: idempotence is scoped to one producer instance. If the producer process restarts, it gets a new PID and the broker treats it as a new producer.
Throughput settings
linger.ms=5
batch.size=16384
compression.type=lz4
linger.ms controls how long the producer waits to accumulate messages into a batch before sending. The tradeoff: lower values give lower latency; higher values give larger batches and higher throughput. For market data, 5-20ms is typical.
compression.type=lz4 — compresses batches before sending. LZ4 is fast and reduces network and disk I/O significantly for JSON payloads. Worth enabling for any high-volume topic.
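To see why compression pays off on JSON market data — repeated field names and symbols compress heavily — here is a small illustration. The JDK's Deflater stands in because LZ4 isn't in the standard library; LZ4's ratios differ, but the point holds:

```java
import java.util.zip.Deflater;

// Repetitive JSON (same keys, same symbols) compresses dramatically.
// Deflater is a stand-in for LZ4, which is not in the JDK.
public class CompressionDemo {
    public static void main(String[] args) {
        String tick = "{\"symbol\":\"BTCUSD\",\"price\":64250.5,\"qty\":0.25}";
        byte[] input = tick.repeat(100).getBytes();

        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[input.length];
        int compressed = deflater.deflate(buffer);
        deflater.end();

        System.out.println(input.length + " bytes -> " + compressed + " bytes");
    }
}
```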
Transactions — exactly-once Kafka-to-Kafka
props.put(TRANSACTIONAL_ID_CONFIG, "producer-1"); // stable across restarts
producer.initTransactions();
try {
producer.beginTransaction();
producer.send(new ProducerRecord<>("output-topic", key, value));
producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata()); // input offsets commit atomically with the transaction
producer.commitTransaction();
} catch (Exception e) {
producer.abortTransaction();
}
Transactions wrap producing to an output topic and committing the input offset atomically. Either both happen or neither does. The TRANSACTIONAL_ID_CONFIG is stable across producer restarts — the broker uses it to fence zombie producers and deduplicate replayed transactions.
Important: Kafka transactions only work when the output is also a Kafka topic. They do not extend to external systems like PostgreSQL.
4. Consumer Configuration
Offset management
The most critical decision is when to commit offsets.
props.put("enable.auto.commit", "false"); // always disable for production
At-most-once (commit before processing — avoid for financial data):
consumer.commitSync();
processRecord(record);
// if crash here: message skipped → data loss
At-least-once (commit after processing — standard):
processRecord(record);
consumer.commitSync();
// if crash here: message replayed → duplicate (handled by idempotent sink)
Shutdown pattern — always flush on shutdown:
try {
while (running) {
var records = consumer.poll(Duration.ofMillis(100));
for (var record : records) {
processRecord(record);
}
consumer.commitAsync(); // fast path during normal operation
}
} finally {
consumer.commitSync(); // ensure final commit before exit
consumer.close();
}
Rebalance listener — commit before losing partition ownership:
consumer.subscribe(topics, new ConsumerRebalanceListener() {
public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
consumer.commitSync(); // commit before partition is reassigned
}
public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
// resume from last committed offset on newly assigned partitions
}
});
Performance settings
max.poll.records=500
max.poll.interval.ms=300000
session.timeout.ms=45000
heartbeat.interval.ms=3000
fetch.min.bytes=1
fetch.max.wait.ms=500
max.poll.interval.ms is a common production pitfall. If your processing takes longer than this interval between poll() calls, Kafka assumes the consumer is dead and triggers a rebalance. The fix: reduce max.poll.records, increase the interval, or move slow processing to a separate thread.
max.poll.records=500
processing_time_per_record=500ms
total_batch_time=250s → must be < max.poll.interval.ms (300s)
# If processing slows to 700ms/record:
total_batch_time=350s → exceeds 300s → kicked out of group → rebalance
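The arithmetic above can be turned into a back-of-envelope sizing check. The helper names and the 50% headroom factor are illustrative, not Kafka settings:

```java
// Budget check: does a full poll batch fit inside max.poll.interval.ms,
// given a measured per-record processing time? The headroom factor leaves
// room for slowdowns so a slow batch doesn't trigger a rebalance.
public class PollBudget {
    static long batchTimeMs(int maxPollRecords, long perRecordMs) {
        return maxPollRecords * perRecordMs;
    }

    static int safeMaxPollRecords(long maxPollIntervalMs, long perRecordMs, double headroom) {
        return (int) ((maxPollIntervalMs * headroom) / perRecordMs);
    }

    public static void main(String[] args) {
        long interval = 300_000; // max.poll.interval.ms
        System.out.println(batchTimeMs(500, 500)); // 250000 ms: fits within 300s
        System.out.println(batchTimeMs(500, 700)); // 350000 ms: exceeds 300s -> rebalance
        System.out.println(safeMaxPollRecords(interval, 700, 0.5)); // 214 records is safe at 700 ms/record
    }
}
```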
Consumer group assignment strategy
partition.assignment.strategy=org.apache.kafka.clients.consumer.CooperativeStickyAssignor
The default RangeAssignor stops all consumers during rebalance (stop-the-world). CooperativeStickyAssignor only moves partitions that actually need to move — existing assignments continue processing uninterrupted. For trading systems where any pause causes lag spikes, this is essential.
5. Delivery Guarantees
The three guarantees
| Guarantee | Risk | Mechanism |
|---|---|---|
| At-most-once | Data loss | Commit before processing |
| At-least-once | Duplicates | Commit after processing |
| Exactly-once | Neither | Kafka transactions (Kafka→Kafka only) |
The production standard: effectively exactly-once
True exactly-once using Kafka transactions adds significant complexity and overhead, and only works when the output stays within Kafka. Most production systems use a simpler approach:
At-least-once delivery + Idempotent sink = Effectively exactly-once
The sink handles deduplication using a natural unique key:
-- Trades: deduplicate by exchange-assigned trade ID
INSERT INTO trades (trade_id, symbol, source, ...)
VALUES (?, ?, ?, ...)
ON CONFLICT (symbol, source, trade_id) DO NOTHING;
-- OHLCV bars: deduplicate by window
INSERT INTO ohlcv (symbol, source, window_start, ...)
VALUES (?, ?, ?, ...)
ON CONFLICT (symbol, source, window_start) DO UPDATE SET ...;
-- Orders: deduplicate by client-generated order ID
INSERT INTO orders (order_id, ...)
VALUES (?, ...)
ON CONFLICT (order_id) DO NOTHING;
This approach works with any sink (PostgreSQL, TimescaleDB, S3, Cassandra) and has no performance overhead on the Kafka layer.
Unique ID strategy
For the idempotent sink pattern to work, every message needs a stable unique ID that survives retries:
- Exchange trades: use the exchange-assigned trade ID (e.g. Binance t field)
- OHLCV windows: use (symbol, source, window_start) — deterministic from the window definition
- Client orders: generate a UUID before the first network call, persist it locally, then send
The critical rule: generate the ID before the Kafka send, not after. A retry must carry the same ID as the original attempt.
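The whole pattern can be sketched in memory: a replayed message carrying the same trade ID is silently dropped, mirroring what ON CONFLICT DO NOTHING does in the database. Class and method names here are illustrative:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory sketch of the idempotent-sink pattern. putIfAbsent is the
// map equivalent of INSERT ... ON CONFLICT (trade_id) DO NOTHING.
public class IdempotentSink {
    private final Map<String, String> store = new LinkedHashMap<>();

    // returns true if the row was inserted, false if it was a duplicate
    boolean insert(String tradeId, String payload) {
        return store.putIfAbsent(tradeId, payload) == null;
    }

    int size() {
        return store.size();
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        System.out.println(sink.insert("t-1001", "BTCUSD buy 0.5")); // true: first delivery
        System.out.println(sink.insert("t-1001", "BTCUSD buy 0.5")); // false: at-least-once replay, dropped
        System.out.println(sink.size()); // 1: effectively exactly-once in the sink
    }
}
```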
6. Scaling Strategy
Adding brokers
Kafka does not automatically rebalance data to new brokers. After adding brokers, use the partition reassignment tool:
# Step 1: Generate reassignment plan
kafka-reassign-partitions --generate \
--topics-to-move-json-file topics.json \
--broker-list "1,2,3,4,5,6" \
--bootstrap-server kafka:9092 > reassignment.json
# Step 2: Execute (runs in background, no downtime)
kafka-reassign-partitions --execute \
--reassignment-json-file reassignment.json \
--bootstrap-server kafka:9092
# Step 3: Verify
kafka-reassign-partitions --verify \
--reassignment-json-file reassignment.json \
--bootstrap-server kafka:9092
More brokers = distributed disk I/O = real throughput increase. More partitions on the same broker = more consumer parallelism but no producer throughput gain (same disk).
Increasing partition count
kafka-topics --alter --topic market-trades \
--partitions 24 \
--bootstrap-server kafka:9092
No downtime required. However, key routing changes:
Before: hash("BTCUSD") % 12 = partition 3
After: hash("BTCUSD") % 24 = partition 15 # different partition
Old messages for BTCUSD remain in partition 3. New messages go to partition 15. Ordering across the boundary is lost until old messages are consumed.
Production preference: blue/green topic migration
For critical topics, prefer creating a new topic over altering:
1. Create "market-trades-v2" with 24 partitions
2. Producer writes to both "market-trades" and "market-trades-v2"
3. Consumer drains "market-trades" to completion
4. Consumer switches fully to "market-trades-v2"
5. Decommission "market-trades"
No key routing issues, no ordering concerns, clean cutover.
7. Message Size Configuration
All three sides must be configured consistently — a mismatch causes silent failures or stuck consumers.
# Broker
message.max.bytes=10485760 # max size of a single message
replica.fetch.max.bytes=10485760 # brokers exchanging replicas (most forgotten)
# Producer
max.request.size=10485760 # max size of a produce request (batch of messages)
# Consumer
fetch.max.bytes=52428800 # max bytes in a fetch response (across all partitions)
max.partition.fetch.bytes=10485760 # max bytes per partition per fetch
The most common mistake is forgetting replica.fetch.max.bytes. The message writes successfully but followers can't replicate it — the ISR shrinks silently until min.insync.replicas is violated and producers start failing.
8. Monitoring — What to Watch
Critical alerts (page immediately)
UnderReplicatedPartitions > 0 → broker down or lagging
OfflinePartitionsCount > 0 → data unavailable, producers will fail
ActiveControllerCount != 1 → controller election failure
ISR shrinking below replication factor → losing fault tolerance; below min.insync.replicas, acks=all producers fail
Warning alerts (investigate within hours)
Consumer lag growing steadily → consumer can't keep up with producer
Consumer lag sudden spike → likely rebalance or consumer restart
Producer request rate dropping → batching issue or broker throttling
Fetch request rate dropping → consumer stalled
Operational metrics
# Check consumer lag
kafka-consumer-groups \
--bootstrap-server kafka:9092 \
--describe --group my-consumer-group
# Check ISR status
kafka-topics \
--bootstrap-server kafka:9092 \
--describe --topic market-trades
# List running consumer groups
kafka-consumer-groups \
--bootstrap-server kafka:9092 \
--list
Tools: Prometheus + kafka_exporter for metrics, Burrow (LinkedIn) for consumer lag tracking, Grafana for dashboards and alerting.
Summary
The production Kafka philosophy for trading systems:
- Design topics for peak load upfront — partition count is effectively immutable on hot topics
- Use acks=all + enable.idempotence=true on every producer — non-negotiable for financial data
- Commit offsets manually, after processing — at-least-once as your baseline guarantee
- Make every sink idempotent with a natural unique key — ON CONFLICT DO NOTHING/UPDATE
- Use CooperativeStickyAssignor — eliminate stop-the-world rebalances
- Add brokers before adding partitions — real throughput comes from distributed I/O
- Monitor ISR and consumer lag — these are your early warning signals for everything that goes wrong