
Event-Driven Systems

Event-driven systems decouple producers from consumers by passing facts (events) through a broker. They’re the backbone of microservices, real-time pipelines, and any system where “things happen and other things react.”


| Term | Definition | Example |
| --- | --- | --- |
| Event | An immutable fact — something that happened | OrderPlaced, PaymentFailed |
| Command | A request for a specific action to occur | PlaceOrder, ProcessPayment |
| Message | Generic envelope — events and commands are both messages | Any payload sent over a broker |

  • Event Notification — Services emit lightweight events; consumers decide what to do. Loose coupling, but consumers must fetch state if they need details. Use when you want decoupling and don’t need full event history.

  • Event-Carried State Transfer — Events carry enough data so consumers never need to call back. Reduces coupling further but bloats event payloads. Use when consumers need state immediately and round-trips are expensive.

  • Event Sourcing — State is derived by replaying a log of events; the event store is the source of truth. Enables full audit trail and time-travel debugging. Use when auditability, compliance, or temporal queries matter.

  • CQRS (Command Query Responsibility Segregation) — Write model (commands) and read model (queries) are separate. Often paired with Event Sourcing but not required. Use when read/write access patterns diverge significantly.
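The core idea of Event Sourcing above can be sketched in a few lines: state is never stored directly, only derived by folding the event log. The event names and order fields here are illustrative, not a real event-store API.

```python
from dataclasses import dataclass, field

# Hypothetical events for an order aggregate; names are illustrative.
@dataclass(frozen=True)
class Event:
    kind: str          # e.g. "OrderPlaced", "ItemAdded", "OrderCancelled"
    data: dict = field(default_factory=dict)

def replay(events):
    """Derive current state by folding the event log, oldest first."""
    state = {"status": None, "items": []}
    for e in events:
        if e.kind == "OrderPlaced":
            state["status"] = "placed"
        elif e.kind == "ItemAdded":
            state["items"].append(e.data["sku"])
        elif e.kind == "OrderCancelled":
            state["status"] = "cancelled"
    return state

log = [
    Event("OrderPlaced"),
    Event("ItemAdded", {"sku": "A-1"}),
    Event("ItemAdded", {"sku": "B-2"}),
]
state = replay(log)
print(state)  # state rebuilt purely from the log
```

Because the log is the source of truth, replaying a prefix of it gives you the state at any earlier point in time, which is what makes audit trails and time-travel debugging possible.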


Sagas manage distributed transactions across services without two-phase commit.

| | Choreography | Orchestration |
| --- | --- | --- |
| How it works | Each service reacts to events and emits new ones | A central coordinator tells each service what to do |
| Coupling | Low — services only know their own events | Higher — coordinator knows the full flow |
| Observability | Hard — flow is implicit across services | Easy — flow is explicit in one place |
| Failure handling | Compensating events per service | Coordinator handles rollback logic |
| Best for | Simple, stable workflows | Complex, long-running, or frequently-changing flows |
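The orchestration approach can be sketched as a coordinator that runs saga steps in order and, on failure, runs each completed step's compensation in reverse. The step names (reserve stock, charge card, ship) are illustrative, not a real saga framework.

```python
# Minimal saga-orchestration sketch: run steps in order; on failure,
# compensate completed steps in reverse (no two-phase commit needed).

def run_saga(steps):
    """steps: list of (action, compensation) callables."""
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            # Roll back via compensating transactions, newest first.
            for comp in reversed(completed):
                comp()
            return "rolled_back"
    return "committed"

trace = []

def fail_ship():
    raise RuntimeError("ship failed")  # simulated downstream failure

steps = [
    (lambda: trace.append("reserve_stock"), lambda: trace.append("release_stock")),
    (lambda: trace.append("charge_card"),   lambda: trace.append("refund_card")),
    (fail_ship,                             lambda: None),
]
result = run_saga(steps)
print(result, trace)
```

Note that compensations run in reverse order of completion: the card is refunded before the stock reservation is released, mirroring how the steps were applied.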

| Semantic | Guarantee | Cost | Use Case |
| --- | --- | --- | --- |
| At-most-once | May lose messages | Cheapest — fire and forget | Metrics, telemetry where loss is acceptable |
| At-least-once | No message loss; duplicates possible | Moderate — requires retry logic | Most production systems; pair with idempotent consumers |
| Exactly-once | No loss, no duplicates | Expensive — distributed coordination | Financial transactions, billing (verify broker support) |
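The "pair with idempotent consumers" advice for at-least-once delivery looks like this in miniature: the broker redelivers an event, and the consumer skips it because it has already recorded the event ID. The event shape and in-memory ID set are illustrative; in production the processed-ID store must be durable.

```python
# At-least-once delivery means duplicates can arrive: here the same event
# id is delivered twice, and the consumer applies it exactly once.

processed_ids = set()   # in production: a durable store (DB, cache)
balance = 0

def handle(event):
    global balance
    if event["id"] in processed_ids:
        return "skipped"              # duplicate: already applied
    balance += event["amount"]
    processed_ids.add(event["id"])
    return "applied"

deliveries = [
    {"id": "evt-1", "amount": 100},
    {"id": "evt-1", "amount": 100},   # redelivered by the broker
    {"id": "evt-2", "amount": 50},
]
results = [handle(e) for e in deliveries]
print(results, balance)  # duplicate skipped; each amount counted once
```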

| | Apache Kafka | Azure Service Bus |
| --- | --- | --- |
| Model | Distributed log (pull) | Message broker (push/pull) |
| Retention | Configurable — messages persist after consumption | Deleted after consumption (or DLQ) |
| Replay | ✅ Built-in — rewind consumer offset | ❌ Not supported |
| Ordering | Per-partition only | Per session (with sessions enabled) |
| Throughput | Millions of msgs/sec | Thousands of msgs/sec |
| Delivery | At-least-once; exactly-once within Kafka via transactions | At-least-once; duplicate-detection window available |
| Dead-letter | Manual (separate topic) | ✅ Built-in DLQ per queue/topic |
| Best for | Event streaming, log aggregation, replay scenarios | Workflow integration, enterprise messaging, Azure-native apps |

In distributed systems, writes propagate asynchronously — there’s a window where different nodes see different data. That’s the trade-off you accept for availability and partition tolerance (CAP theorem).

How to handle it:

  • Idempotent consumers — Process the same message twice safely; use unique event IDs.
  • Outbox pattern — Write to DB and event log atomically; avoids dual-write inconsistency.
  • Saga compensations — If a downstream step fails, emit compensating events to undo prior steps.
  • Read-your-writes consistency — Route reads to the primary (write) node briefly after a mutation if stale reads are unacceptable.
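The outbox pattern above is worth seeing concretely: the business write and the event record commit in one local transaction, so there is no dual-write gap, and a separate relay publishes from the outbox afterwards. This sketch uses SQLite for brevity; the table names and payload shape are illustrative.

```python
import json
import sqlite3

# Outbox sketch: order row and event row commit atomically together;
# a relay later reads unpublished outbox rows and hands them to the broker.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT,
                     payload TEXT, published INTEGER DEFAULT 0);
""")

def place_order(order_id):
    with db:  # one atomic transaction covers BOTH writes
        db.execute("INSERT INTO orders VALUES (?, 'placed')", (order_id,))
        db.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "OrderPlaced", "order_id": order_id}),),
        )

def relay_once(publish):
    """Publish unpublished outbox rows, then mark them as sent."""
    rows = db.execute(
        "SELECT id, payload FROM outbox WHERE published = 0").fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        db.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    db.commit()

place_order("o-42")
sent = []                 # stand-in for a broker publish call
relay_once(sent.append)
print(sent)
```

If the process crashes between the transaction and the relay run, the event is still safely in the outbox and will be published on the next relay pass (at-least-once, so downstream consumers should still be idempotent).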

A dead-letter queue (DLQ) is where messages go when they can’t be processed — after max retries, on deserialization failure, or when they expire.

A poison message is one that causes the consumer to crash or loop repeatedly. Without a DLQ, it blocks the whole queue.

Key practices:

  • Always configure a DLQ — never let a broker silently drop failed messages.
  • Alert on DLQ depth; a growing DLQ is a canary for schema or logic bugs.
  • Include original metadata (error reason, retry count) in the DLQ message for debugging.
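These practices can be sketched as a small consumer wrapper: retry up to a limit, then move the message to a DLQ with the error reason and retry count attached. The message shapes and the in-memory DLQ list are illustrative of what a broker or consumer framework would do for you.

```python
# Poison-message handling sketch: bounded retries, then dead-letter with
# debugging metadata instead of blocking the queue.

MAX_RETRIES = 3
dlq = []

def consume(message, handler):
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            handler(message)
            return "ok"
        except Exception as exc:
            last_error = str(exc)
    dlq.append({
        "original": message,          # keep the payload for debugging
        "error": last_error,          # why it failed
        "retry_count": MAX_RETRIES,   # how many attempts were made
    })
    return "dead_lettered"

def handler(msg):
    if msg.get("body") is None:
        raise ValueError("deserialization failure")

r1 = consume({"body": "fine"}, handler)
r2 = consume({"body": None}, handler)   # poison message → DLQ, queue unblocked
print(r1, r2, dlq)
```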

  • Confusing Event Sourcing with event-driven — They often coexist but are independent concepts. Interviewers use this to separate practitioners from readers.
  • Choreography spaghetti — With enough services, choreography produces invisible workflows. No single place shows the full business process.
  • Schema evolution breaks consumers — Adding a required field or renaming a property can silently break downstream consumers. Use schema registries (Confluent Schema Registry, Azure Schema Registry) and favour additive changes.
  • Kafka ordering is per-partition — Globally ordered processing requires a single partition, which kills parallelism. Design partition keys carefully.
  • Exactly-once is harder than it sounds — Broker-level exactly-once doesn’t cover your database writes. You still need idempotent consumers for true end-to-end guarantees.
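The partition-key point can be made concrete: messages with the same key always hash to the same partition, so their relative order is preserved, while different keys spread across partitions for parallelism. The hash below is illustrative only; Kafka's default partitioner uses murmur2, not CRC32.

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    """Illustrative key-based partitioner: same key → same partition."""
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# All events for one order share a key, so they stay in order:
p1 = partition_for("order-123")
p2 = partition_for("order-123")
print(p1 == p2)  # True: same key, same partition, order preserved
```

This is why choosing the partition key is a design decision: key by order ID and you get per-order ordering with parallelism across orders; key by nothing (round-robin) and you get throughput but no ordering guarantee at all.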