In the freshly released time tracking area of Tedian, a new error popped up in the event store implementation. Some interactions in the GUI sporadically caused PG::UniqueViolation errors in the event store.
The event store is built upon PostgreSQL and uses an events table. One mechanism to ensure correctness is a unique constraint on stream_name and position. In plain English this means that within a stream of events, every position can only be occupied once.
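The post doesn't show the actual schema, but a minimal sketch of such a table might look like this (all columns beyond stream_name and position are assumptions, not the real Tedian schema):

```sql
-- Minimal sketch of an events table, not the actual Tedian schema.
-- The UNIQUE constraint is the correctness mechanism described above:
-- at most one row per (stream_name, position) pair.
CREATE TABLE events (
  id          bigserial   PRIMARY KEY,
  stream_name text        NOT NULL,
  position    bigint      NOT NULL,
  type        text        NOT NULL,
  data        jsonb       NOT NULL,
  recorded_at timestamptz NOT NULL DEFAULT now(),
  UNIQUE (stream_name, position)
);
```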
As mentioned above, this constraint was being violated sporadically. The cause was multiple commands affecting the same event stream being submitted in rapid succession.

The multithreaded application server accepted the commands simultaneously, invoked the command handlers, and submitted the resulting events to the event store. Every so often two write transactions to the events table would overlap and cause the aforementioned error.
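Assuming the write path derives the next position from the current stream head (a sketch on my part, not the actual implementation; stream name and payload are made up), the race looks like this:

```sql
-- Hypothetical write path: derive the next position from the current
-- stream head, then insert.
BEGIN;
INSERT INTO events (stream_name, position, type, data)
SELECT 'timeTracking-123',
       coalesce(max(position), -1) + 1,
       'EntryRecorded',
       '{"minutes": 90}'
FROM events
WHERE stream_name = 'timeTracking-123';
COMMIT;
-- Under READ COMMITTED, a second transaction running the same statement
-- before the first commits reads the same stream head, computes the same
-- position, and its INSERT fails with the unique violation (surfaced as
-- PG::UniqueViolation by Ruby's pg gem).
```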
This failure scenario immediately reminded me of optimistic locking, a feature the event store supports but that isn't used in this case: the commands are non-destructive and their order doesn't matter. But I think it's not the same problem.

Optimistic locking is supposed to protect against scenarios where a command was handled and the resulting events published during the invocation of another command.

The problem here occurs because two writes to the same event stream happen at the same time. That sounds like the same thing, but I think there is a subtle difference I can't articulate yet.
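For contrast, optimistic locking usually means the caller states which stream version its decision was based on, and the write is rejected if the stream has moved on in the meantime. A sketch of that check (my illustration, not the store's actual API):

```sql
-- Optimistic locking sketch: the caller supplies the stream position it
-- last observed (42 here). The INSERT writes zero rows if the stream
-- head has moved on, signalling that the command must be retried.
INSERT INTO events (stream_name, position, type, data)
SELECT 'timeTracking-123', 42 + 1, 'EntryRecorded', '{"minutes": 90}'
WHERE (SELECT coalesce(max(position), -1)
       FROM events
       WHERE stream_name = 'timeTracking-123') = 42;
```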
How do I solve this?
I went with option three for the time being, but I don't know how to proceed.
On the GUI side I solved the problem by ensuring that the commands are sent sequentially instead of in parallel. This stopped the production failures and bought us time to dwell on the issue without time pressure. I researched how the Eventide project solves this particular problem and learned about exclusive PostgreSQL advisory locks in the process. (Relevant source code directory, Initial solution and Improved solution.) After making sure this actually solved the concurrency problem, I adopted the approach in my own event store implementation.
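The core of that approach, roughly sketched (this is my own illustration, not Eventide's actual code): take a transaction-scoped advisory lock derived from the stream name before writing, so writers to the same stream are serialized while writers to different streams don't block each other.

```sql
BEGIN;
-- hashtext() maps the stream name to an integer advisory lock key.
-- pg_advisory_xact_lock() blocks until the lock is free and releases it
-- automatically at COMMIT/ROLLBACK, so it cannot leak.
SELECT pg_advisory_xact_lock(hashtext('timeTracking-123'));

-- With the lock held, reading the stream head and inserting the next
-- event can no longer interleave with another writer on this stream
-- (assuming every writer takes the same lock).
INSERT INTO events (stream_name, position, type, data)
SELECT 'timeTracking-123',
       coalesce(max(position), -1) + 1,
       'EntryRecorded',
       '{"minutes": 90}'
FROM events
WHERE stream_name = 'timeTracking-123';
COMMIT;
```

A hash collision between two stream names only causes unnecessary serialization, never incorrect data, so deriving the lock key via hashing is safe.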
How does optimistic locking work with multiple events from one command if there is a clash with a different command that also resulted in multiple events?
Possible answer: use another lock within a transaction to ensure all events are written as a batch. At the moment I don't know whether this can work or how it interacts with the lock on the stream. I also realised that the event store can only write multiple events to the same stream; my implementation already implicitly requires that, but it's good to be explicitly aware of this fact.
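Sketching that idea under the advisory lock from above (again an assumption on my part, not a verified design): because the lock is transaction-scoped, writing the whole batch inside one transaction already makes it atomic, and the batch occupies consecutive positions.

```sql
BEGIN;
SELECT pg_advisory_xact_lock(hashtext('timeTracking-123'));

-- All events produced by one command are written under the same lock
-- and commit (or roll back) as a unit, at consecutive positions.
WITH head AS (
  SELECT coalesce(max(position), -1) AS pos
  FROM events
  WHERE stream_name = 'timeTracking-123'
)
INSERT INTO events (stream_name, position, type, data)
SELECT 'timeTracking-123', head.pos + batch.ord, batch.type, batch.data
FROM head,
     (VALUES (1, 'EntryRecorded',  '{"minutes": 90}'::jsonb),
             (2, 'EntryCorrected', '{"minutes": 75}'::jsonb)
     ) AS batch(ord, type, data);
COMMIT;
```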
I think sending multiple commands at the same time is an anti-pattern. There should be a single bulk command to allow for recovery in failure scenarios.
Do you have questions, ideas or feedback? I'd like to hear from you and I'm looking forward to exchanging thoughts. Please send an email to david@strauss.io and say “Hello”.