Load Testing Adventures With SyncServerII: mySQL Locking and Deadlocks

Summary: In doing load testing with SyncServerII, mySQL deadlocks arose under certain conditions. SyncServerII enables multiple users access to shared groups of cloud-stored files. When different users access the same sharing group, a lock needs to be temporarily held to preserve user assumptions about the state of the data. Looking at various methods, I found that mySQL’s GET_LOCK and RELEASE_LOCK distributed locking mechanism was suited to the requirements of locking in SyncServerII and resolved the existing deadlock issue.

The iOS app Neebla was first released to the App Store about three months ago, on March 10, 2019. So it was high time that I did some load testing with its server, SyncServerII. While the server was designed to handle multiple concurrent users, it hasn’t received much testing of that concurrency. For example, the server uses a multi-threaded Kitura-based system, and I need to make sure this design works under load. And just as importantly while it is currently running on AWS (I use Elastic Beanstalk), I have not yet set up auto-scaling on AWS. Scaling is needed so that when more users are signed-in at one time, AWS can dynamically create more server (EC2) instances to accommodate that load.

To do this load testing, I needed a means to simulate variable numbers of users on the system. Currently, I’m using http://locust.io. This enables scripting of simulating users in Python, and varying the number of users in tests. Running the simulation with five users accessing the same sharing group, I started running into problems. The tasks comprised use of several endpoints on the server: Index, file upload, file download, file deletion, and done uploads. Specifically, I started running into deadlocks in the mySQL database used by SyncServerII:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The deadlocks were related to my use of locks for the done uploads endpoint.

Background: Master versions

Some background may be useful at this point. First, the general data model of SyncServerII revolves around a master version associated with a sharing group. A sharing group comprises a set of individually versioned files. The master version is an integer value, providing an overall version for the set of files in a sharing group, and enables a client to know when the collection of files for a sharing group (e.g., an Album in Neebla) has changed. In general, a user makes changes to data (contained in files) on the basis of data presented to them. And those changes may be made based on the specific data state being presented. In SyncServerII terms, if a sharing group master version is 100 (representing a certain data state), the user may decide to make certain data state changes. If the master version had a different value (e.g., 101), the user might make other certain data state changes.

This concept translates through to the SyncServerII needing to be careful about master version changes and to carefully inform clients about changes to the master version for a sharing group. For example, when a DoneUploads endpoint call is performed — which commits changes to a series of files — the client gives the server its currently known value for the master version. The server checks that client value against the current server database value for the sharing group. If it is the same, the DoneUploads proceeds. If it is different, the DoneUploads call fails, and the client is informed.

Continuing this reasoning, it follows that DoneUploads requests by several clients for the same sharing group must be serialized. For example, for a specific sharing group, suppose the current value of the master version is 100. Suppose that client A and client B do DoneUploads endpoint requests at about the same time, to commit their different changes to same sharing group. These clients are making changes to the sharing group, for their users, in the context of a specific data state — i.e., for the specific master version value of 100. Since client A and client B both want to make changes to the same sharing group, only one of them can be allowed to proceed. We cannot randomly let their changes be applied, because this would violate their incoming assumptions about why their users were making changes to the data in the first place. Thus, the first of clients A and B to have their DoneUploads requests received will be given a lock. And allowed to make its changes. One of these changes, as part of the DoneUploads endpoint operation, is to increment the master version value — to 101. When client B next tries its DoneUploads request, it will fail, and will need to retry its request. This makes sense because client B’s assumptions were violated about the data state conditions under which it was trying to make its changes. For example, the client might want to inform the user of the data state changes on the server, and the user might want to make changes to the data before they are committed.

ShortLocks: The incoming approach

When I started load testing for SyncServerII, I was using the following locking approach. I had a mySQL table called ShortLocks. Each row had a sharingGroupUUID (the identifiers used for sharing groups) and an expiry date as its main attributes. The sharingGroupUUID was a unique key for the table. A client creates a lock by inserting a row into the table, in the context of an SQL transaction. My concept was to rely on a characteristic of transactions and unique key values: Another client trying to insert a row with the same sharingGroupUUID would block. To release a lock, the client deletes the row from the table. As part of a cleanup process, stale locks (based on their expiry date) are removed when creating locks.

While this method worked fine on a lightly loaded system, I started to run into problems when multiple clients were trying to concurrently operate on the same sharing group. What I ran across was something like that found here. Specifically, I ran into mySQL deadlocks. At first I thought this had something to do with the fact that the sharingGroupUUID was a foreign key. However, when I removed the foreign key relation, I kept running into the same thing. It seems that my assumption of purely row level locking in the ShortLocks table was incorrect. My guess is that table level locking starts to be used: “It is likely that the delete statement will affect a large fraction of the total rows in the table. Eventually this might lead to a table lock being acquired when deleting. Holding on to a lock (in this case row- or page locks) and acquiring more locks is always a deadlock risk.” (from prior reference).

“FOR UPDATE”: Another approach

My next approach was to try to use a “Select … FOR UPDATE” on the SharingGroup’s table to create the lock. The SharingGroup’s table in SyncServerII records information about each sharing group. This raised a usage issue that the only way to release the “FOR UPDATE” lock in mySQL is to do a COMMIT. And so the lock becomes intrinsically tied to the transaction which is a little inconvenient in my case.

A far more serious issue however was that I started having mySQL deadlocks arising with my upload endpoint requests. Digging into this, it was apparently occurring because of sharingGroupUUID foreign key usage in the uploads table, which records information about uploads.

Isolation levels

It seems conceptually possible that I could use the isolation levels provided in mySQL databases to resolve this issue. However, not all of the scenarios that I need are satisfied. Briefly, the scenario when
(a) a transaction that updates a master version is
(b) followed by a transaction that reads the same master version
doesn’t have the characteristics that I need. For details see Appendix I.

mySQL distributed locking: Current approach

Digging around more, I learned about a distributed locking mechanism available in mySQL. Specifically, GET_LOCK and RELEASE_LOCK, which enable specifically named locks. It has further attributes:

1) Other sessions trying to get the same lock block until it’s released.
2) You can specify a timeout when acquiring the lock.
3) You can do an explicit release (RELEASE_LOCK), which is independent of a COMMIT.
4) If the locking session ends without unlocking, the lock is released.
5) It’s independent of database keys, so I don’t run into deadlock issues based on foreign keys.

The lock name used for GET_LOCK and RELEASE_LOCK are the sharingGroupUUID for the sharing group.

To add further specifics, in the DoneUploads endpoint, the GET_LOCK and RELEASE_LOCK surround a sequence of mySQL statements including:
a) an update to the MasterVersion table to increment the current master version for the sharing group, and
b)UPDATE’s and INSERT’s into the FileIndex table, and DELETE’s to the Upload table to transfer uploaded file records for the sharing group. The Upload table records uploads from a client, and the FileIndex table records the currently active set of files for sharing groups.

This locking usage for DoneUploads ensures that updates to the FileIndex across multiple clients accessing the same sharing group are serialized. It also ensures that a single master version depicts a specific data state in the FileIndex for a specific sharing group.

For the Index endpoint, the GET_LOCK and RELEASE_LOCK surround the entire set of mySQL select statements used in implementing the endpoint. This ensures that the client gets a consistent snapshot of the FileIndex with a specific master version value for a particular sharing group.

Incorporating this lock change into SyncServerII, and rerunning my tests with Locust, I am now getting my concurrent user load testing operating without errors.

Appendix I: Details of experiments with isolation levels.

CREATE TABLE MasterVersionTest(
id INT,
masterVersion INT,
UNIQUE (id)) ENGINE=InnoDB;

INSERT INTO MasterVersionTest (id, masterVersion) VALUES (1, 100);

# — — — — — — — — — — — — — — — — — — — — — — — — — — —

# Scenario 1
# Transaction 1 updates a row for a given pre-existing masterVersion value

BEGIN;
UPDATE MasterVersionTest SET masterVersion = masterVersion + 1 WHERE masterVersion = 100;

# Transaction 2 attempts the same update (THIS NEEDS TO FAIL)

BEGIN;
UPDATE MasterVersionTest SET masterVersion = masterVersion + 1 WHERE masterVersion = 100;

# Results:
# Trial 1: Transaction 2: ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

# Trial 2: I committed the 1st transaction before the time out on the second. The second then “fails”: “Rows matched: 0 Changed: 0 Warnings: 0”

# Results: On a timeout, we naturally get the needed failure. On a non-timeout we can generate the failure by checking the number of rows matched.

# Note that this is being carried out with a REPEATABLE-READ isolation level.

# — — — — — — — — — — — — — — — — — — — — — — — — — — — —

# Scenario 2
# Transaction 1 updates a row for a given pre-existing masterVersion value

BEGIN;
UPDATE MasterVersionTest SET masterVersion = masterVersion + 1 WHERE masterVersion = 101;

# Transaction 2 selects a row for that prior pre-existing masterValue (THIS NEEDS TO FAIL)

BEGIN;
select * from MasterVersionTest WHERE masterVersion = 101;

# Results: Transaction two proceeds *without* failing, finding the previous value of the row in the table prior to the update.

# Note that this is being carried out with a REPEATABLE-READ isolation level.

# — — — — — — — — — — — — — — — — — — — — — — — — — — — —

# Scenario 2 with READ-COMMITTED isolation level

# Transaction 1 updates a row for a given pre-existing masterVersion value

SET GLOBAL TRANSACTION ISOLATION LEVEL READ COMMITTED;

BEGIN;
UPDATE MasterVersionTest SET masterVersion = masterVersion + 1 WHERE masterVersion = 102;

# Transaction 2 selects a row for that prior pre-existing masterValue (THIS NEEDS TO FAIL)

BEGIN;
select * from MasterVersionTest WHERE masterVersion = 102;

# Results: Transaction 2 proceeds *without* failing, finding the previous value of the row in the table prior to the update. Same result as with REPEATABLE READ.

# — — — — — — — — — — — — — — — — — — — — — — — — — — — —

# Scenario 2 with SERIALIZABLE isolation level

# Transaction 1 updates a row for a given pre-existing masterVersion value

SET GLOBAL TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN;
UPDATE MasterVersionTest SET masterVersion = masterVersion + 1 WHERE masterVersion = 103;

# Transaction 2 selects a row for that prior pre-existing masterValue (THIS NEEDS TO FAIL)

BEGIN;
select * from MasterVersionTest WHERE masterVersion = 103;

# Results: Transaction 2 proceeds *without* failing, finding the previous value of the row in the table prior to the update. Same result as prior isolation levels.

# READ-UNCOMMITTED isolation level isn’t interesting to me because: “That means transactions could be reading data that may not even exist eventually because the other transaction that was updating the data rolled-back the changes and didn’t commit.” (https://mydbops.wordpress.com/2018/06/22/back-to-basics-isolation-levels-in-mysql/)

# — — — — — — — — — — — — — — — — — — — — — — — — — — — —

# Scenario 3

# Transaction 1 selects a row for that a pre-existing masterValue

# Transaction 2 updates a row for a that pre-existing masterVersion value (THIS NEEDS TO FAIL)

# NOTE: I haven’t tried this because of the problems I had with the earlier scenario.