Figure 1. Components Diagram - illustrates the core parts of the pattern: the service, the outbox table, the relay process, and the messaging system.
Figure 2. Sequence Diagram - shows the step-by-step lifecycle of a change, from transaction commit to event dispatch.
There are many articles that explain the transactional outbox pattern, and multiple libraries can help implement it without much effort. For example, in .NET, MassTransit supports the transactional outbox pattern and can handle event persistence automatically. But in real projects, it’s not always that easy. Sometimes we can’t use a library due to specific requirements or constraints. Other times, the library has limitations, or its features aren’t clearly documented. That’s when it becomes important to understand what to consider when running the transactional outbox pattern in production - how to make it stable, observable, scalable, and what trade-offs and failure modes to be aware of.
At first glance, selecting unsent events might seem simple - just run a SELECT query to fetch a small batch (e.g., 10 rows) from the outbox table. But once we introduce multiple relay instances for scalability or fault tolerance, things get more complicated.
We need to ensure that:
No two relay instances pick up the same rows at the same time.
No rows are skipped or left stuck if an instance fails mid-batch.
To achieve this, we must add row-level locking to selection queries. This allows multiple relays to work safely in parallel without picking the same rows.
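For reference, a minimal outbox table might look like the sketch below (PostgreSQL flavor; the exact columns are an assumption, chosen to match the queries that follow):

CREATE TABLE outbox (
    id         UUID PRIMARY KEY,                  -- globally unique event ID
    type       TEXT NOT NULL,                     -- event type, used for routing/deserialization
    payload    JSONB NOT NULL,                    -- serialized event body
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    sent_at    TIMESTAMPTZ NULL                   -- set only after the broker confirms delivery
);

-- partial index keeps the "fetch unsent" query cheap as the table grows
CREATE INDEX ix_outbox_unsent ON outbox (created_at) WHERE sent_at IS NULL;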
PostgreSQL offers a clean solution using FOR UPDATE SKIP LOCKED:
SELECT * FROM outbox
WHERE sent_at IS NULL
ORDER BY created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;
FOR UPDATE: locks the selected rows for the duration of the transaction.
SKIP LOCKED: skips any rows already locked by other transactions.

MySQL 8.0 and newer support the same FOR UPDATE SKIP LOCKED syntax, so the same approach applies directly.
In SQL Server, we can achieve similar behavior using locking hints:
SELECT TOP 10 *
FROM outbox WITH (ROWLOCK, UPDLOCK, READPAST)
WHERE sent_at IS NULL
ORDER BY created_at;
ROWLOCK: enforces row-level locking (instead of page or table).
READPAST: skips rows that are already locked by other transactions.
UPDLOCK: acquires update locks instead of shared locks.

Choosing the right row-locking strategy helps keep the relay process safe and reliable under load.
Let’s say the relay process grabs a batch of events, starts sending them and crashes halfway through. What then?
The good news is: if we’re using FOR UPDATE SKIP LOCKED (or equivalent), those unprocessed rows will become visible again once the transaction rolls back or the database connection closes. As long as we only mark events as sent after the broker confirms delivery, we’re safe - nothing gets lost.
The downside? Some events might get retried. That’s fine if consumers handle duplicates (see idempotency), but it’s something to be aware of.
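Putting these pieces together, one possible shape of the relay’s claim/publish/mark cycle is sketched below (PostgreSQL flavor; :published_ids is a placeholder that the application binds with the IDs the broker acknowledged):

BEGIN;

-- claim a batch; the rows stay locked until COMMIT or ROLLBACK
SELECT id, type, payload FROM outbox
WHERE sent_at IS NULL
ORDER BY created_at
LIMIT 10
FOR UPDATE SKIP LOCKED;

-- publish the claimed events to the broker here (application code, not SQL),
-- and continue only once the broker acknowledges them

UPDATE outbox
SET sent_at = now()
WHERE id = ANY(:published_ids);

COMMIT;
-- if the process crashes before COMMIT, the locks are released and the
-- unmarked rows become visible to other relay instances again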
In many implementations, especially those guided by default library configurations like MassTransit, it’s common to co-host the relay process - the background worker responsible for publishing events from the outbox - within the main API service. This setup is convenient and works well for small-scale systems with low traffic or a limited number of replicas (typically fewer than five).
However, this strategy doesn’t scale well.
As the service scales horizontally - say, to 10, 15, or even 30 replicas - each instance runs its own copy of the relay logic. This results in redundant and overlapping polling, where every instance queries the outbox table at regular intervals (even a modest 25-50 ms). What initially seemed like a harmless background task becomes a storm of database queries.
This creates two critical issues:
Database pressure and lock contention. Every relay process instance attempts to acquire locks on the outbox table to safely claim and process events. These locks are not cheap. In one of my services running on Amazon Aurora, I noticed that the database was spending considerable CPU time managing locks rather than executing queries.
Mismatched scaling profiles. The core of the problem is architectural. REST API traffic and the relay process scale differently. The API scales with user demand, which might require many instances to handle load. The outbox relay process, on the other hand, scales with write throughput - that is, how many new events are added to the outbox table. In many real-world systems, a single relay process instance is enough to handle even high volumes of events. Scaling it beyond that typically yields no benefit and only introduces locking and concurrency overhead.
The better approach is to separate the relay process from the API instance and run them as independent deployment units. They can still live in the same codebase but should be deployed with different configurations. For instance, one deployment might have the relay process enabled while another disables it and only handles HTTP traffic. This way, each component scales on its own terms. MassTransit supports this model: whether the relay process runs at startup can be controlled through configuration (search for DisableDeliveryService).
It’s easy to overlook, but the transactional outbox pattern guarantees at-least-once delivery - not exactly-once. This means that due to transient failures (e.g., network issues, broker timeouts, retries), consumers might receive duplicate events. If downstream processing isn’t prepared for this, it risks inconsistent state or triggering actions multiple times (e.g., sending duplicate emails or creating double charges).
To deal with this, there are two main options:
Make logic idempotent. Design consumers so they can safely handle the same event more than once. For example, inserting a record only if it doesn’t exist.
Deduplicate events explicitly. Maintain a separate table (aka an inbox table) to store processed event IDs for a defined retention period (e.g., one hour, six hours, or even a few days). Before handling an event, the consumer checks this store. If the ID is found, the event is skipped; otherwise, it’s processed and the ID is saved (see the sketch below).
The event ID in the inbox table must be globally unique - not just the domain entity’s ID. A single entity (e.g., a user or product) can generate multiple distinct events over time, and all of them must be independently tracked.
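A minimal sketch of that check in SQL (PostgreSQL flavor; table and column names are assumptions):

CREATE TABLE inbox (
    event_id     UUID PRIMARY KEY,                      -- globally unique event ID, not the entity ID
    processed_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- run inside the same transaction as the consumer's own writes
INSERT INTO inbox (event_id)
VALUES (:event_id)
ON CONFLICT (event_id) DO NOTHING;
-- if no row was inserted, the event was already processed and can be skipped;
-- otherwise apply the business logic and commit both changes together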
Besides duplicates, out-of-order event delivery is another common edge case in real-world deployments. It can happen, for example, when failed sends are retried, when several relay instances publish batches in parallel, or when the broker only guarantees ordering within a single partition.
To handle this, events can carry a version or timestamp so consumers can detect and ignore stale updates, related events can be routed to the same partition where the broker supports it, and handlers can be designed to tolerate re-ordering where strict ordering isn’t required.
Designing for eventual consistency often means tolerating some degree of event disorder, but planning for it upfront ensures a more resilient system.
Outbox (and inbox) tables tend to grow silently in the background. With every new event or processed event, a new row gets added. It’s easy to overlook this at first. In systems with moderate traffic and decent indexing, performance might hold up fine even with hundreds of thousands of records.
But over time, things change. Queries that once ran instantly start to slow down. The database has to sift through more data, manage larger indexes, and handle more storage. Fetching new events or checking for duplicates becomes more expensive. Left unchecked, this can lead to subtle but persistent performance issues.
That’s why it’s important to build in regular cleanup of both outbox and inbox tables. For the outbox, once an event has been successfully dispatched, it can often be removed immediately, unless there’s a need to keep it around for debugging or auditing. For the inbox, the retention window depends on how long duplicate events are expected to arrive. In most systems, keeping processed event IDs for a few hours to a couple of days is enough to cover retries and delayed deliveries.
There’s no one-size-fits-all schedule: some systems might clean up every few hours, others daily. What matters is that cleanup is automatic and consistent. It doesn’t need to be complex - just enough to keep things running smoothly and prevent silent degradation over time.
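A cleanup job can be as simple as two scheduled DELETE statements; the retention values below are placeholders, not recommendations:

-- drop outbox rows that were dispatched more than a day ago
DELETE FROM outbox
WHERE sent_at IS NOT NULL
  AND sent_at < now() - INTERVAL '1 day';

-- drop inbox entries older than the deduplication window
DELETE FROM inbox
WHERE processed_at < now() - INTERVAL '2 days';

On large tables it’s worth deleting in small batches so the job doesn’t hold long-running transactions or add lock pressure of its own.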
The transactional outbox pattern implicitly introduces a secondary queue into the system - one that exists within the database, independent of the messaging system. Most teams already monitor their messaging systems (e.g., RabbitMQ dashboards, Kafka exporters, Azure metrics), but the outbox itself often goes unobserved, especially when it’s implemented manually or with lightweight libraries.
This oversight can be dangerous. The outbox is effectively a queue, and like any queue, it requires monitoring to ensure inflow and outflow stay balanced. If the relay process can’t keep up, the queue grows, introducing latency in event propagation and, eventually, customer-facing lag or missed actions.
At a minimum, we should monitor:
The number of unsent events in the outbox (queue depth).
The age of the oldest unsent event.
The inflow rate (events written) versus the outflow rate (events dispatched).

When these metrics drift apart - e.g., event age rising, or inflow consistently outpacing outflow - it’s a signal that something is wrong: the relay is failing, overloaded, or misconfigured.
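If the library doesn’t expose these metrics, they can be scraped straight from the table. A rough sketch (PostgreSQL flavor):

-- queue depth and age of the oldest unsent event
SELECT
    count(*)                                                  AS pending_events,
    coalesce(extract(epoch FROM now() - min(created_at)), 0)  AS oldest_pending_age_seconds
FROM outbox
WHERE sent_at IS NULL;

-- rough inflow vs outflow over the last minute
-- (only meaningful if dispatched rows are not deleted immediately)
SELECT
    count(*) FILTER (WHERE created_at > now() - INTERVAL '1 minute') AS inflow_last_minute,
    count(*) FILTER (WHERE sent_at    > now() - INTERVAL '1 minute') AS outflow_last_minute
FROM outbox;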
Modern runtimes like .NET make it easy to instrument this kind of telemetry using OpenTelemetry, Prometheus exporters, or Application Insights. Don’t guess - observe. Set up alerts, visualize trends, and establish baselines.
This kind of visibility is especially critical because the outbox pattern operates under eventual consistency. For some domains, a few seconds of delay might be acceptable. For others, even short delays can introduce real risk.
I once worked on a system with an internal currency mechanism. Users could spend internal currency on incentives. The balance deduction logic used an eventually consistent pipeline built on the transactional outbox pattern. While we had safeguards, especially to roll back transactions if users didn’t have enough points, timing mattered. A user could trigger multiple transactions before the first slow deduction was processed - and effectively “overspend”. Since some incentives involved real goods, this caused a serious incident.
Had we monitored the outbox queue itself, not just the external broker, we could have spotted the delay early, throttled users, or paused processing. Instead, the backlog built silently.
The root cause was the lack of separation between the API and the relay process.
Rule of thumb: when a new queue is introduced (explicit or implicit), we MUST also introduce new monitoring. Without it, we’re flying blind.
The transactional outbox pattern might seem simple at first glance. Just write an event to a table and send it later. But like most things in distributed systems, the devil is in the details. Making it work reliably in production means thinking about:
Concurrent relay instances and row-level locking.
How and where the relay process is deployed and scaled.
Duplicate and out-of-order delivery on the consumer side.
Outbox and inbox table growth and cleanup.
Monitoring the outbox as a queue in its own right.
If these parts are ignored, problems don’t show up immediately. They sneak in gradually - longer delays, subtle inconsistencies, and performance cliffs that are hard to debug when it’s already too late. But with the right approach, the transactional outbox pattern becomes a solid foundation for building resilient, event-driven systems.
For most teams, especially when starting out, using a ready-to-use library is the best move. Libraries like MassTransit for .NET or Axon Framework for Java take care of the boilerplate and edge cases, letting you focus on domain logic. If you can fit them into your stack and constraints, use them. Just be sure you understand how they behave in production, especially around delivery guarantees, scaling, and observability.
This article focused on the classic outbox setup - write to the database, then poll it regularly from the relay process. But this isn’t the only way.
In the next article, I’ll look at other implementation options - ways to reduce lag and scale better. Coming soon.