At Robinhood (my previous company), most of the interesting technical challenges I worked on involved a database. But at Mighty, my first 18 months didn't involve anything with databases! After having worked for months on Mighty's streaming team, I recently switched to work on the infrastructure team to help the company scale to serve many more users. The job of the infrastructure team at Mighty is very similar to a subset of the service AWS, GCP, and Azure provide: provide users with secure and performant virtual machines while giving them an easy way to manage them. Part of this job has involved using databases to store virtual machine state.
Race Conditions
One of the trickiest parts of using databases, as I'm sure most people will tell you, is managing race conditions at scale. Race conditions typically occur when you have a read, modify, then write operation. For example:
def assign_user_to_machine(user, machine):
execute_sql("BEGIN")
machine = execute_sql(f"SELECT * FROM machines WHERE id = {machine.id}")
if (machine.state != "open"):
raise Exception("Cant assign a user to non-open machine")
execute_sql("UPDATE users SET machine_id = {machine.id} WHERE id = {user.id}")
execute_sql("UPDATE machines SET state = 'assigned' WHERE id = {machine.id}")
execute_sql("COMMIT")
Say this database transaction is being run on two separate threads of execution. At first glance, it may seem impossible for two users to be assigned to the same machine. However, in an environment with multiple simultaneous threads of execution, it is possible. Say both these functions are executed simultaneously:
Execution #1: assign_user_to_machine(User1, Machine1)
Execution #2: assign_user_to_machine(User2, Machine1)
Say Execution #1 and Execution #2 both get to right before this line:
execute_sql("UPDATE users SET machine_id = {machine.id} WHERE id = {user.id}")
To both executions, it appears that the machine is the open
state. Now, once execution of both the threads completes, two users will have machine_id = [Machine1.id](http://Machine1.id)
and the machine will be put in the assigned
state. This is really bad!
Concurrency Control
There are two ways that I'm aware of to prevent race conditions: pessimistic concurrency control and optimistic concurrency control.
Pessimistic concurrency control is pre-emptively using explicit locking on database tables and rows before modifying the objects. Optimistic concurrency control does not use explicit locking, but rather checks that the row hasn't been altered by a separate concurrent thread of execution when it needs to update it. If it has been altered, then abort the transaction. I will demonstrate some examples of both these methods below.
Pessimistic Concurrency Control
At Robinhood, the primary method of concurrency control was to use pessimistic concurrency control. When we had to solve this problem at Mighty, my gut said to use the same technique. Pessimistic concurrency control involves explicit locking. The example from above has been modified to be race condition free by using a SELECT FOR UPDATE
SQL statement. SELECT FOR UPDATE
acquires a lock explicitly on all rows in the specified table that match the WHERE
condition.
def assign_user_to_machine(user, machine):
execute_sql("BEGIN")
machine = execute_sql(f"SELECT FOR UPDATE * FROM machines WHERE id = {machine.id}")
if (machine.state != "open"):
raise Exception("Cant assign a user to non-open machine")
execute_sql("UPDATE users SET machine_id = {machine.id} WHERE id = {user.id}")
execute_sql("UPDATE machines SET state = 'assigned' WHERE id = {machine.id}")
execute_sql("COMMIT")
Now Execution #1 and Execution #2 of this function must occur serially rather than in parallel. If Execution #1 acquires the lock explicitly via SELECT FOR UPDATE
on Machine1, Execution #2 will be blocked on the SELECT FOR UPDATE
statement till Execution #1 commits the transaction, and as a result, releases the lock. The lock is held till the transaction has either been committed or rolled back, at which point the lock is released. Once the transaction completes, Execution #2 can continue to execute. Execution #2 will notice that the machine in question is now in the assigned
state, so it will throw an exception Can't assign a user to a a non-open machine
. This new code prevents the race condition and resulted in the expected behavior!
Optimistic Concurrency Control
Implementing optimistic concurrency control is best explained by showing an example. The code would look like the following to implement optimistic concurrency control on our initial example:
def assign_user_to_machine(user, machine):
execute_sql("BEGIN")
machine = execute_sql(f"SELECT * FROM machines WHERE id = {machine.id}")
if (machine.state != "open"):
raise Exception("Cant assign a user to non-open machine")
execute_sql("UPDATE users SET machine_id = {machine.id} WHERE id = {user.id}")
num_updated = execute_sql("UPDATE machines SET state = 'assigned' WHERE id = {machine.id} AND state = 'open'")
if num_updated == 0:
execute_sql("ROLLBACK")
raise Exception("Another transaction concurrently modified this machine")
execute_sql("COMMIT")
Execution #1 and Execution #2 can still occur concurrently, however, two users will never be assigned to the same machine. Let's run through our initial example of how a race condition may happen to show that it will no longer happen with our new code. First, say that Execution #1 and Execution #2 both execute up until the following line:
num_updated = execute_sql("UPDATE machines SET state = 'assigned' WHERE id = {machine.id} AND state = 'open'")
Because the UPDATE
statement acquires lock implicitly on the Machine1 row, only one of these statements can occur at a single point in time. Say Execution #2 executes it first. After Execution #2 completes executing the statement, the machine's state is now assigned
. Note that Execution #2 must complete the transaction till Execution #1 can continue. This is because the lock implicitly acquired by the UPDATE
statement in Execution #2 is held from the UPDATE
statement till the transaction is either committed or rolled back. Execution #2 continues, num_updated
is 1 so the transaction commits and the lock on Machine1 is released. Execution #1 can continue since the lock on Machine1 has been released. When executing the UPDATE
statement, Execution #1 finds that no machines match the WHERE
clause of the UPDATE
statement since the machine with id 1 (Machine1) is not in the state open
. num_updated
is 0 and an Exception
is raised and the entire transaction is rolled back. This includes a revert of the UPDATE users SET machine_id = ...
statement! Now you can see that optimistic concurrency control can also prevent race conditions.
When should I use optimistic vs. pessimistic concurrency control?
Optimistic concurrency control is better when there isn't a lot of contention. If you expect that race conditions will be very rare, then optimistic concurrency control may be the most performant since you are minimizing critical sections of your application by not using any explicit locking. At Mighty, we chose to prevent race conditions by using pessimistic concurrency control. Why? Two reasons.
Firstly, I feel that a hidden cost of optimistic concurrency control is maintaining and reasoning about application logic is harder in this world. Because of this, it impedes developer velocity and increases the probability of bugs. Pessimistic concurrency control, on the other hand, is easier to write and reason about. This is because there is a simple heuristic we can use when updating data: explicitly lock everything that you are updating in a pre-defined order (I explain why this is important in the Deadlocks
section).
Secondly, I believe in not solving performance problems before they are an issue! As a small startup, Mighty is likely very far from running into performance problems due to lock contention any time soon.
Database Connections and Holding onto Locks
One concern that we had when using pessimistic concurrency control was: if we are using locks more, does that increase the likelihood that we can hold onto a lock forever in the case of application crashes or network blips? I had a limited understanding of how database connections and transactions worked, and this question forced me to dive a bit deeper into it.
One possible scenario that is helpful to examine is: A client initiates a TCP connection to Postgres, begun a transaction, acquired a lock, and then goes offline. In this situation, Postgres is still waiting for the transaction to terminate and so still has not released the lock in question. Eventually, Postgres will notice that the client has gone offline. How? It will wait for the client TCP connection to terminate. TCP uses keepalives messages from the database to the client (Postgres settings here: https://www.postgresql.org/docs/9.5/runtime-config-connection.html) to determine if the connection to the client is still healthy. When sufficient keepalive messages go un-acked by the offline client, TCP will determine the client is no longer connected and will terminate the connection. When Postgres notices the connection has terminated, all transactions that the specific client connection has been executing will end (rolled back, not committed) and, as a result, all locks that the transaction held will be released.
That answered our question! Our database was not at risk of holding onto locks forever due to application crashes or network blips.
The only risk we hold now is if an active and online client held onto a lock longer than expected. Because we are in control of all clients connecting to our database, this was deemed to be lower priority since it is more easily preventable. Our rule of thumb for reducing the time locks are held are to minimize the amount of time it takes to execute a transaction that holds onto locks. Among other things, this means ensuring that no network I/O happens inside a transaction.
Deadlocks
Another concern the team had about increasing the amount of explicit locking in our application is deadlocks. Deadlock is possible even without explicit locking in our application, this is illustrated in the Deadlock
section in Postgres documentation. As discussed earlier, this is because UPDATE
statements acquire locks implicitly and don't release it till the transaction is complete.
To prevent this form of deadlock, as the documentation mentions, it is important to run UPDATE
queries, always in a consistent order, in all transactions, across all your clients.
Another form of deadlock can happen in a similar way, but when explicit locks (vs. implicit locks acquired by UPDATE
) are called simultaneously in an inconsistent order.
Explicit locking in a consistent manner can prevent both cases of deadlock. Additionally, it turns out that this can be very simple by using the following heuristic across all transactions, across all clients:
- If a row is being updated in a transaction, then acquire an explicit lock for the row at the beginning of the transaction.
- If you are acquiring a lock: First, always lock rows in the
users
table in the order of theirid
. Then, always lock rows in themachines
table in the same order of theirid
. Follow this pattern for all other tables that may require a lock. We made this very simple in our application by adding one function that is used for all Postgres locking. That function always obeys the consistent lock ordering that we want.
This heuristic should help us eliminate all deadlocks in our codebase! Explicit locking is no longer an issue since it is always done in a consistent order. Implicit locking from UPDATE
statements is also not an issue given the heuristic because the locks are explicitly acquired at the beginning of the transaction in the correct order.
Woo!
It was a lot of fun looking into this over the last day or two. It made me feel like I've really leveled up over the last almost three years of being a software engineer. I remember it'd take me a week or more to understand topics of similar difficulty a year ago, but now it only takes a couple hours to have an understanding I feel confident in. Looking back and feeling progress is always motivating!