22. Diagnosing Lock Contention
In Chapter 18, you were introduced to the world of locks in MySQL. If you have not read Chapter 18 yet, you are strongly encouraged to do so now, as this chapter is closely related. You may even want to refresh your memory if it has been a while since you read it. Lock issues are one of the most common causes of performance problems, and the impact can be severe. In the worst cases, queries fail and connections pile up so that no new connections can be made. Therefore, it is important to know how to investigate lock issues and remediate the problems.
Flush locks
Metadata and schema locks
Record-level locks including gap locks
Deadlocks
Each category of locks uses different techniques to determine the cause of the lock contention. When you read the examples, you should bear in mind that similar techniques can be used to investigate lock issues that do not match the examples exactly.
The Symptoms: This describes how you identify that you are encountering this kind of lock issue.
The Cause: The underlying reason you encounter this kind of lock issue. This relates to the general discussion of locks in Chapter 18.
The Setup: This includes the steps to set up the lock issue, if you want to try it yourself. As lock contention requires multiple connections, a prompt such as Connection 1> is used to indicate which connection each statement should be executed in. If you want to follow the investigation with no more information than you would have in a real-world case, you can skip this section and return to it after getting through the investigation.
The Investigation: The details of the investigation. This draws on the “Monitoring Locks” section of Chapter 18.
The Solution: How you resolve the immediate lock problem, so you minimize the outage caused by it.
The Prevention: A discussion of how to reduce the chance of encountering the issue. This is closely related to the section “Reduce Locking Issues” in Chapter 18.
Enough talk; the first lock category to be discussed is flush locks.
One of the common lock issues encountered in MySQL is the flush lock. When this issue happens, users will typically complain that queries are not returning, and monitoring may show that queries are piling up and eventually MySQL will run out of connections. Issues around the flush lock can also sometimes be one of the most difficult lock issues to investigate.
The query state of new queries is “Waiting for table flush.” This may occur for all new queries or only for queries accessing specific tables.
More and more connections are created.
Eventually, new connections fail as MySQL runs out of connections. The error received for new connections is ER_CON_COUNT_ERROR: “ERROR 1040 (HY000): Too many connections” when using the classic MySQL protocol (by default port 3306) or “MySQL Error 5011: Could not open session” when using the X Protocol (by default port 33060).
There is at least one query that has been running longer than the oldest request for a flush lock.
There may be a FLUSH TABLES statement in the process list, but this is not always the case.
When the FLUSH TABLES statement has waited for lock_wait_timeout, an ER_LOCK_WAIT_TIMEOUT error occurs: ERROR: 1205: Lock wait timeout exceeded; try restarting transaction. Since the default value for lock_wait_timeout is 365 days, this is only likely to occur if the timeout has been reduced.
If you connect with the mysql command-line client with a default schema set, the connection may seem to hang before you get to the prompt. The same can happen if you change the default schema with a connection open.
The blocking behavior of the mysql command-line client does not occur if you start the client with the -A option, which disables collecting the auto-completion information. A better solution is to use MySQL Shell, which fetches the auto-completion information in a way that does not block on the flush lock.
If you see these symptoms, it is time to understand what is causing the lock issue.
When a connection requests a table to be flushed, it requires all references to the table to be closed which means no active queries can be using the table. So, when a flush request arrives, it must wait for all queries using the tables that are to be flushed to finish. Note that unless you explicitly specify which tables to flush, it is just the query and not the entire transaction that must finish. Obviously, the case where all tables are flushed, for example, due to FLUSH TABLES WITH READ LOCK, is the most severe as it means all active queries must finish before the flush statement can proceed.
When the wait for the flush lock becomes a problem, it means that there are one or more queries preventing the FLUSH TABLES statement from obtaining the flush lock. Since the FLUSH TABLES statement requires an exclusive lock, it in turn stops subsequent queries from acquiring the shared lock they need.
This issue is often seen in connection with backups where the backup process needs to flush all tables and get a read lock in order to create a consistent backup.
A special case can occur when the FLUSH TABLES statement has timed out or has been killed, but the subsequent queries are not proceeding. When that happens, it is because the low-level table definition cache (TDC) version lock is not released. This is a case that can cause confusion as it is not obvious why the subsequent queries are still waiting for the table flush.
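As an illustration, this kind of contention can be reproduced with three connections. The statements below are a sketch along the lines of the setup used in this chapter's example; they assume the world sample database is installed, and the specific ID values are only placeholders:

```sql
-- Connection 1> A long-running query that keeps world.city open:
SELECT *, SLEEP(180) FROM world.city WHERE ID = 130;

-- Connection 2> The flush statement; it must wait for Connection 1:
FLUSH TABLES WITH READ LOCK;

-- Connection 3> A query on the same table; it now waits for the
-- pending flush lock with the state "Waiting for table flush":
SELECT * FROM world.city WHERE ID = 131;
```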
The use of SLEEP(180) in the first query means you have three minutes (180 seconds) to execute the two other queries and perform the investigation. If you need more time, you can increase the duration of the sleep. You are now ready to start the investigation.
The investigation of flush locks requires you to look at the list of queries running on the instance. Unlike other lock contentions, there are no Performance Schema tables or InnoDB monitor report that can be used to query for the blocking query directly.
Investigating flush lock contention using sys.session
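A process list like the one analyzed below can be obtained with a query along these lines (the columns chosen here are for illustration; the sys.session view includes many more):

```sql
SELECT thd_id, conn_id, state,
       current_statement, statement_latency
  FROM sys.session;
```

Because the view already orders by statement latency in descending order, the longest-running queries appear first.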
There are four queries in the output. The sys.session and sys.processlist views by default sort the queries according to the execution time in descending order. This makes it easy to investigate issues like contention around the flush lock where the query time is the primary thing to consider when looking for the cause.
You start out looking for the FLUSH TABLES statement (the case where there is no FLUSH TABLES statement will be discussed shortly). In this case, that is thd_id = 53 (the second row). Notice that the state of the FLUSH statement is “Waiting for table flush.” You then look for queries that have been running for a longer time. In this case, there is only one query: the one with thd_id = 30. This is the query that blocks the FLUSH TABLES WITH READ LOCK from completing. In general, there may be more than one query.
The two remaining queries are a query being blocked by the FLUSH TABLES WITH READ LOCK and the query to obtain the output. Together, the first three queries form a typical example of a long-running query blocking a FLUSH TABLES statement, which in turn blocks other queries.
To get the process list report in MySQL Workbench, choose the Client Connections item under Management in the navigator pane to the left of the screen. You cannot choose which columns to include, and to make the text readable, only part of the report is included in the screenshot. The Id column corresponds to conn_id in the sys.session output, and Thread (the rightmost column) corresponds to thd_id. The full screenshot is included in this book’s GitHub repository as figure_22_1_workbench_flush_lock.png.
The Processes report is found under the Metrics menu item for individual instances. You can choose which columns you want to include in the output. An example of the report with more details can be found in the book’s GitHub repository as figure_22_2_mem_flush_lock.png.
An advantage of reports like the ones in MySQL Workbench and MySQL Enterprise Monitor is that they use existing connections to create the report. In cases where the lock issue causes all connections to be used, it can be invaluable to be able to get the list of queries using a monitoring solution.
Flush lock contention without a FLUSH TABLES statement
This situation is identical to the previous one except the FLUSH TABLES statement is gone. In this case, find the query that has been waiting the longest with the state “Waiting for table flush.” Queries that have been running longer than this query has been waiting are the ones preventing the TDC version lock from being released. In this case, that means thd_id = 30 is the blocking query.
Once you have identified the issue and the principal queries involved, you need to decide what to do about the issue.
There are two levels of solving the issue. First of all, you need to resolve the immediate problem of queries not executing. Second, you need to work at avoiding the issue in the future. This subsection will discuss the immediate solution, and the next will consider how to reduce the chance of the issue occurring.
To resolve the immediate issue, you have the option of waiting for the queries to complete or starting to kill queries. If you can redirect the application to use another instance while the flush lock contention is ongoing, you may be able to let the situation resolve itself by letting the long-running queries complete. If there are data-changing queries among those running or waiting, you need to consider whether letting them complete will leave the system in a consistent state. One option may be to continue in read-only mode with the read queries executed on a different instance.
If you decide to kill queries, you can try to kill the FLUSH TABLES statement. If that works, it is the simplest solution. However, as discussed, that will not always help, and in that case the only solution is to kill the queries that are preventing the FLUSH TABLES statement from completing. If the long-running queries look like runaway queries and the application/client that executed them is no longer waiting for them, you may want to kill them without trying to kill the FLUSH TABLES statement first.
One important consideration when looking to kill a query is how much data has been changed. For a pure SELECT query (not involving stored routines), that is always nothing, and from the perspective of work done, it is safe to kill it. For INSERT, UPDATE, DELETE, and similar queries, however, the changed data must be rolled back if the query is killed. It will usually take longer to roll back changes than it took to make them in the first place, so be prepared to wait a long time for the rollback if there are many changes. You can use the information_schema.INNODB_TRX table to estimate the amount of work done by looking at the trx_rows_modified column. If there is a lot of work to roll back, it is usually better to let the query complete.
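For example, to estimate how much work a transaction would have to roll back, you can query INNODB_TRX for the connection in question (the connection id 30 here is a placeholder):

```sql
SELECT trx_id, trx_started, trx_rows_modified
  FROM information_schema.INNODB_TRX
 WHERE trx_mysql_thread_id = 30;
```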
When a DML statement is killed, the work it has done must be rolled back. The rollback will usually take longer than creating the change, sometimes much longer. You need to factor that in, if you consider killing a long-running DML statement.
Of course, optimally you prevent the issue from happening at all.
Flush lock contention happens because of the combination of a long-running query and a FLUSH TABLES statement. So, to prevent the issue, you need to look at what you can do to keep these two conditions from being present at the same time.
Finding, analyzing, and handling long-running queries are discussed in other chapters throughout the book. One option of particular interest is to set a timeout for the query. This is supported for SELECT statements using the max_execution_time system variable and the MAX_EXECUTION_TIME(N) optimizer hint and is a great way to protect against runaway queries. Some connectors also have support for timing out queries.
To avoid long-running SELECT queries, you can configure the max_execution_time option or set the MAX_EXECUTION_TIME(N) optimizer hint. This will make the SELECT statement time out after the specified period and help prevent issues like flush lock waits.
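As a sketch, the timeout can be set for the session or for a single statement; the value is in milliseconds, and it only applies to SELECT statements:

```sql
-- Limit all SELECT statements in this session to 10 seconds:
SET SESSION max_execution_time = 10000;

-- Or limit just one statement using the optimizer hint:
SELECT /*+ MAX_EXECUTION_TIME(10000) */ *
  FROM world.city
 WHERE CountryCode = 'USA';
```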
Some long-running queries cannot be prevented. It may be a reporting job, building a cache table, or another task that must access a lot of data. In that case, the best you can do is to try to avoid them running while it is also necessary to flush the tables. One option is to schedule the long-running queries to run at different times than when it is necessary to flush tables. Another option is to have the long-running queries run on a different instance than the jobs that require flushing tables.
A common task that requires flushing the tables is taking a backup. In MySQL 8, you can avoid that issue by using the backup and log locks. For example, MySQL Enterprise Backup (MEB) does this in version 8.0.16 and later, so InnoDB tables are never flushed. Alternatively, you can perform the backup at a period with low usage, so the potential for conflicts is lower, or you can even do the backup while the system is in read-only mode and avoid the FLUSH TABLES WITH READ LOCK altogether.
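For example, a backup process in MySQL 8 can protect itself with the backup lock instead of flushing tables. The backup lock blocks statements that could invalidate the backup, such as DDL, but does not block DML:

```sql
LOCK INSTANCE FOR BACKUP;
-- ... perform the backup while DML continues to run ...
UNLOCK INSTANCE;
```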
Another lock type that often causes confusion is the metadata lock.
Metadata and Schema Locks
In MySQL 5.7 and earlier, metadata locks were often a source of confusion. The problem is that it is not obvious who holds the metadata lock. In MySQL 5.7, instrumentation of the metadata locks was added to the Performance Schema, and in MySQL 8.0 it is enabled by default. With the instrumentation enabled, it becomes easy to determine who is blocking the connection trying to obtain the lock.
A DDL statement and possibly other queries are stuck in the state “Waiting for table metadata lock.”
Queries may be piling up. The queries that are waiting all use the same table. (There may potentially be more than one group of queries waiting if there are DDL statements for multiple tables waiting for metadata locks.)
When the DDL statement has waited for lock_wait_timeout, an ER_LOCK_WAIT_TIMEOUT error occurs: ERROR: 1205: Lock wait timeout exceeded; try restarting transaction. Since the default value for lock_wait_timeout is 365 days, this is only likely to occur if the timeout has been reduced.
There is a long-running query or a long-running transaction. In the latter case, the transaction may be idle or executing a query that does not use the table that the DDL statement acts on.
What makes the situation potentially confusing is the last point: there may not be any long-running queries that are the clear candidates for causing the lock issue. So what is the cause of the metadata lock contention?
Remember that the metadata locks exist to protect the schema definition (as well as being used with explicit locks). The schema protection will exist for as long as a transaction is active, so when a transaction queries a table, the metadata lock will last until the end of the transaction. Therefore, you may not see any long-running queries. In fact, the transaction holding the metadata lock may not be doing anything at all.
In short, the metadata lock exists because one or more connections may rely on the schema for a given table not changing, or because they have explicitly locked the table using either the LOCK TABLES or the FLUSH TABLES WITH READ LOCK statement.
At this point, you can start the investigation. The situation will not resolve itself (unless you have a low value for lock_wait_timeout or you are prepared to wait a year), so you have all the time you want. When you want to resolve the block, start by terminating the ALTER TABLE statement in Connection 2 to avoid modifying the world.city table, and then commit or roll back the transaction in Connection 1.
If you have the wait/lock/metadata/sql/mdl Performance Schema instrument enabled (the default in MySQL 8), it is straightforward to investigate metadata lock issues. You can use the metadata_locks table in the Performance Schema to list the granted and pending locks. However, a simpler way to get a summary of the lock situation is to use the schema_table_lock_waits view in the sys schema.
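For example, the raw lock data can be listed directly from the Performance Schema; filtering on table objects reduces the noise from other metadata lock types:

```sql
SELECT object_type, object_schema, object_name,
       lock_type, lock_status, owner_thread_id
  FROM performance_schema.metadata_locks
 WHERE object_type = 'TABLE';
```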
A metadata lock wait issue
Two connections are waiting for a metadata lock (on the world.city table). A third connection is included (conn_id = 9) which is idle, as can be seen from the NULL statement latency (in some versions earlier than 8.0.18, you may also see that the current statement is NULL). In this case, the list of queries is limited to those with an active query or an active transaction, but usually you will start out with a full process list. However, to make it easy to focus on the important parts, the output is filtered.
Finding metadata lock contention
The output shows that there are four cases of queries waiting and blocking. This may be surprising, but it happens because there are several locks involved and there is a chain of waits. Each row is a pair of a waiting and blocking connection. The output uses “pid” for the process list id which is the same as the connection id used in earlier outputs. The information includes what the lock is on, details about the waiting connection, details about the blocking connection, and two queries that can be used to kill the blocking query or connection.
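An output like the one described can be generated by querying the view directly; the column selection here is a subset for illustration:

```sql
SELECT object_schema, object_name,
       waiting_pid, waiting_lock_type, waiting_query,
       blocking_pid, blocking_lock_type,
       sql_kill_blocking_query, sql_kill_blocking_connection
  FROM sys.schema_table_lock_waits\G
```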
The first row shows process list id 7090 waiting on itself. That sounds like a deadlock, but it is not. The reason is that the ALTER TABLE first took a shared lock that can be upgraded and then tried to get the exclusive lock, which is the one that is waiting. Because there is no explicit information on which existing lock is actually blocking the new lock, this pairing ends up being included.
The second row shows that the SELECT statement is waiting for process list id 7090 which is the ALTER TABLE. This is the reason that connections can start to pile up as the DDL statement requires an exclusive lock, so it will block requests for shared locks.
Looking for the connection causing the metadata lock block
The first query looks for a wait for an exclusive metadata lock where the blocking process list id is not itself. In this case, that immediately gives the main block contention. The second query determines the number of blocking queries triggered by each process list id. It may not be as simple as shown in this example, but using queries as shown here will help narrow down where the lock contention is.
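The two queries described can be sketched as follows against the same view:

```sql
-- Exclusive metadata lock waits where a connection
-- is not "blocking" itself:
SELECT *
  FROM sys.schema_table_lock_waits
 WHERE waiting_lock_type = 'EXCLUSIVE'
   AND waiting_pid <> blocking_pid\G

-- How many waits is each blocking connection responsible for?
SELECT blocking_pid, COUNT(*)
  FROM sys.schema_table_lock_waits
 WHERE waiting_pid <> blocking_pid
 GROUP BY blocking_pid
 ORDER BY COUNT(*) DESC;
```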
Investigating a transaction
The first query uses the INNODB_TRX table in the Information Schema. It, for example, shows when the transaction was started, so you can determine how long it has been active. The trx_rows_modified column is also useful to know how much data has been changed by the transaction in case it is decided to roll back the transaction. Note that what InnoDB calls the MySQL thread id (the trx_mysql_thread_id column) is actually the connection id.
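A sketch of such a query, with the blocking connection id as a placeholder:

```sql
-- Replace 7089 with the connection id of the blocking transaction:
SELECT trx_id, trx_state, trx_started,
       trx_rows_locked, trx_rows_modified,
       trx_mysql_thread_id
  FROM information_schema.INNODB_TRX
 WHERE trx_mysql_thread_id = 7089;
```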
If you are using MySQL 8.0.15 or earlier, use the sys.format_time() function instead.
The third query uses the events_statements_history table to find the previous queries executed in the transaction. The NESTING_EVENT_ID column is set to the value of the EVENT_ID from the output of the events_transactions_current table, and the NESTING_EVENT_TYPE column is set to match a transaction. This ensures that only events that are children of the ongoing transaction are returned. The result is ordered by the EVENT_ID (of the statement) to get the statements in the order they were executed. By default, the events_statements_history table will include at most the ten latest queries for the connection.
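The query can be sketched as follows, where the thread id (57) and the transaction's event id (11) are placeholders that would come from the earlier steps of the investigation:

```sql
SELECT event_id, current_schema, sql_text
  FROM performance_schema.events_statements_history
 WHERE thread_id = 57
   AND nesting_event_id = 11
   AND nesting_event_type = 'TRANSACTION'
 ORDER BY event_id;
```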
In this example, the investigation shows that the transaction has executed two queries: one selecting from the world.city table and one selecting from the world.country table. It is the first of these queries causing the metadata lock contention.
The fourth query uses the session_connect_attrs table to find the attributes submitted by the connection. Not all clients and connectors submit attributes, or they may be disabled, so this information is not always available. When the attributes are available, they can be useful to find out where the offending transaction is executed from. In this example, you can see the connection is from MySQL Shell (mysqlsh). This can be useful if you want to commit an idle transaction.
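For example, with the process list id of the offending connection as a placeholder:

```sql
SELECT attr_name, attr_value
  FROM performance_schema.session_connect_attrs
 WHERE processlist_id = 7089
 ORDER BY ordinal_position;
```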
For a metadata lock contention, you essentially have two options to resolve the issue: make the blocking transaction complete or kill the DDL statement. To complete the blocking transaction, you will need to either commit it or roll it back. If you kill the connection, it triggers a rollback of the transaction, so you need to take into consideration how much work will need to be rolled back. In order to commit the transaction, you must find where the connection is executed and commit it that way. You cannot commit a transaction owned by a different connection.
Killing the DDL statement will allow the other queries to proceed, but it does not solve the issue in the long term if the lock is held by an abandoned but still active transaction. In that case, it can be an option to kill both the DDL statement and the connection with the abandoned transaction. That way you avoid having the DDL statement continue to block subsequent queries while the transaction rolls back. Then, when the rollback has completed, you can retry the DDL statement.
The key to avoiding metadata lock contention is to avoid a long-running transaction at the same time as you need to execute DDL statements for the tables used by the transaction. You can, for example, execute DDL statements at times when you know there are no long-running transactions. You can also set the lock_wait_timeout option to a low value, which makes the DDL statement give up after lock_wait_timeout seconds. While that does not avoid the lock problem, it mitigates the issue by preventing the DDL statement from stopping other queries from executing. You can then find the root cause without the stress of having a large part of the application not working.
You can also aim at reducing how long transactions are active. One option is to split a large transaction into several smaller transactions, if it is not required that all operations are performed as an atomic unit. You should also make sure that the transaction is not kept open for an unnecessarily long time by avoiding interactive work, file I/O, transferring data to the end user, and so on while the transaction is active.
One common cause of a long-running transaction is that the application or client does not commit or roll back the transaction at all. This is particularly likely to happen with the autocommit option disabled. When autocommit is disabled, any query – even a plain read-only SELECT statement – will start a new transaction when there is not already an active transaction. This means that an innocent-looking query may start a transaction, and if the developer is not aware that autocommit is disabled, then the developer may not think about explicitly ending the transaction. The autocommit setting is enabled by default in MySQL Server, but some connectors disable it by default.
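You can check the effective setting at both scopes and re-enable it for the session if needed:

```sql
SELECT @@global.autocommit, @@session.autocommit;
SET SESSION autocommit = ON;
```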
That concludes the discussion about investigating metadata locks. The next level of locks to look at are the record locks.
Record lock contention is the most frequently encountered lock contention, but usually also the least intrusive as the default lock wait timeout is just 50 seconds, so there is not the same potential for queries piling up. That said, there are cases – as will be shown – where record locks can cause MySQL to come to a grinding halt. This section will look into investigating InnoDB record lock issues in general and lock wait timeout issues in more detail. Investigating the specifics of deadlocks is deferred until the next section.
The symptoms of InnoDB record lock contention are often very subtle and not easily recognizable. In severe cases, you will get a lock wait timeout or a deadlock error, but in many cases, there may be no direct symptoms. Rather the symptom is that queries are slower than normal. This may range from being a fraction of a second slower to being many seconds slower.
In the figure, notice how the latency graph for the query increases toward the end of the period and then suddenly drops again. There is also a red icon to the right of the normalized query – that icon means the query has returned errors. In this case the error is a lock wait timeout, but that cannot be seen from the figure. The donut-shaped chart to the left of the normalized query also shows a red area indicating the query response time index for the query at times is considered poor. The large graph at the top shows a small dip showing there were enough issues in the instance to cause a general degradation of the performance of the instance.
InnoDB lock metrics
For this discussion, the innodb_row_lock_% and lock_timeouts metrics are the most interesting. The three time variables are in milliseconds. It can be seen there has been a single lock wait timeout which on its own is not necessarily a cause for concern. You can also see there have been 354 cases when a lock could not be granted immediately (innodb_row_lock_waits) and there have been waits up to more than 51 seconds (innodb_row_lock_time_max). When the general level of lock contention increases, you will see these metrics increase as well.
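These metrics can be pulled from the sys.metrics view, for example:

```sql
SELECT Variable_name, Variable_value
  FROM sys.metrics
 WHERE Variable_name LIKE 'innodb_row_lock%'
    OR Variable_name = 'lock_timeouts';
```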
The graphs show a general increase in locking. The number of lock waits has two periods with increased lock waits and then drops off again. The row lock time graph shows a similar pattern. This is a typical sign of intermittent lock issues.
InnoDB works with shared and exclusive locks on the row data, index records, gaps, and insert intention locks. When there are two transactions that attempt to access the data in conflicting ways, one query will have to wait until the required lock becomes available. In short, two requests for a shared lock can be granted at the same time, but once there is an exclusive lock, no connections can acquire a lock on the same record.
As it is exclusive locks that are the most likely to cause lock contention, it is usually DML queries that change data that are the cause of InnoDB record lock contention. Another source is SELECT statements doing preemptive locking by adding the FOR SHARE (or LOCK IN SHARE MODE) or FOR UPDATE clause.
In this example the lock wait timeout for Connection 2 is set to 300 seconds. The START TRANSACTION for Connection 2 is not required but allows you to roll both transactions back when you are done to avoid making changes to the data.
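The setup described can be sketched with two connections, again assuming the world sample database (the row chosen is a placeholder):

```sql
-- Connection 1> Take an exclusive record lock and keep
-- the transaction open:
START TRANSACTION;
UPDATE world.city SET Population = Population + 1 WHERE ID = 130;

-- Connection 2> Allow a long wait, then request a lock
-- on the same row (this blocks until Connection 1 ends):
SET SESSION innodb_lock_wait_timeout = 300;
START TRANSACTION;
UPDATE world.city SET Population = Population + 1 WHERE ID = 130;
```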
The investigation of record locks is very similar to investigating metadata locks. You can query the data_locks and data_lock_waits tables in the Performance Schema, which show the raw lock data and pending locks, respectively. There is also the sys.innodb_lock_waits view, which queries the two tables to find pairs of locks with one being blocked by the other.
The data_locks and data_lock_waits tables are new in MySQL 8. In MySQL 5.7 and earlier, there were two similar tables in the Information Schema named INNODB_LOCKS and INNODB_LOCK_WAITS. An advantage of using the innodb_lock_waits view is that it works the same (but with some extra information in MySQL 8) across the MySQL versions.
Retrieving lock information from the innodb_lock_waits view
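The view can be queried directly; the vertical output format makes the many columns easier to read:

```sql
SELECT * FROM sys.innodb_lock_waits\G
```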
wait_: These columns show some general information around the age of the lock wait.
locked_: These columns show what is locked ranging from the schema to the index as well as the lock type.
waiting_: These columns show details of the transaction that is waiting for the lock to be granted including the query and the lock mode requested.
blocking_: These columns show details of the transaction that is blocking the lock request. Note that in the example, the blocking query is NULL. This means the transaction is idle at the time the output was generated. Even when there is a blocking query listed, the query may not have anything to do with the lock that there is contention for – other than the query is executed by the same transaction that holds the lock.
sql_kill_: These two columns provide the KILL queries that can be used to kill the blocking query or connection.
The column blocking_query is the query currently executed (if any) for the blocking transaction. It does not mean that the query itself is necessarily causing the lock request to block.
The case where the blocking_query column is NULL is a common situation. It means that the blocking transaction is currently not executing a query. This may be because it is between two queries. If this is an extended period, it suggests the application is doing work that ideally should be done outside the transaction. More commonly, the transaction is not executing a query because it has been forgotten about, either in an interactive session where the human has forgotten to end the transaction or in an application flow that does not ensure transactions are committed or rolled back.
The solution depends on the extent of the lock waits. If it is a few queries having short lock waits, it may very well be acceptable to just let the affected queries wait for the lock to become available. Remember locks are there to ensure the integrity of the data, so locks are not inherently a problem. Locks are only a problem when they cause a significant impact on the performance or cause queries to fail to an extent where it is not feasible to retry them.
If the lock situation lasts for an extended period – particularly if the blocking transaction has been abandoned – you can consider killing the blocking transaction. As always you need to consider that the rollback may take a significant amount of time if the blocking transaction has performed a large amount of work.
For queries that fail because of a lock wait timeout error, the application should retry them. Remember that by default a lock wait timeout only rolls back the query that was executing when the timeout occurred. The rest of the transaction is left as it was before the query. A failure to handle the timeout may thus leave an unfinished transaction with its own locks that can cause further lock issues. Whether just the query or the whole transaction is rolled back is controlled by the innodb_rollback_on_timeout option.
It is very important that a lock wait timeout is handled as otherwise it may leave the transaction with locks that are not released. If that happens, other transactions may not be able to acquire the locks they require.
Preventing significant record-level lock contention largely follows the guidelines that were discussed in the section “Reduce Locking Issues” in Chapter 18. To recapitulate the discussion, the way to reduce lock wait contention is largely about reducing the size and duration of transactions, using indexes to reduce the number of records accessed, and possibly switching the transaction isolation level to READ COMMITTED to release locks earlier and reduce the number of gap locks.
One of the lock issues causing the most concern for database administrators is deadlocks. This is partly because of the name and partly because, unlike the other lock issues discussed, they always cause an error. However, there is nothing especially worrying about deadlocks compared to other locking issues. On the contrary, the fact that they cause an error means that you learn about them sooner, and the lock issue resolves itself.
You can also check the LATEST DETECTED DEADLOCK section in the output of the InnoDB monitor, for example, by executing SHOW ENGINE INNODB STATUS. This shows when the latest deadlock occurred, and thus you can use it to judge how frequently deadlocks occur. If you have the innodb_print_all_deadlocks option enabled, the error log will contain the information for all detected deadlocks. The details of the InnoDB monitor output for deadlocks will be covered in “The Investigation” after the cause of deadlocks and the setup have been discussed.
The deadlock shown in the figure is due to two record locks on the primary keys of a table. That is one of the simplest deadlocks that can occur. As shown when investigating a deadlock, the cycle can be more complex than this.
The key is that the two transactions both update the city and country tables but in opposite order. The setup completes by explicitly rolling back both transactions to ensure the tables are left without changes.
The main tool to analyze deadlocks is the section with information about the latest detected deadlock in the InnoDB monitor output. If you have the innodb_print_all_deadlocks option enabled (OFF by default), you may also have the deadlock information from the error log; however, the information is the same, so it does not change the analysis.
When the deadlock occurred.
Information for the first of the transactions involved in the deadlock.
Information for the second of the transactions involved in the deadlock.
Which of the transactions was rolled back. This information is not included in the error log output when innodb_print_all_deadlocks is enabled.
The numbering of the two transactions is arbitrary, and its main purpose is to be able to refer to one transaction or the other. The two parts with transaction information are the most important ones. They include how long the transaction was active, some statistics about the size of the transaction in terms of locks taken, undo log entries, and the like, the query that was blocked waiting for a lock, and information about the locks involved in the deadlock.
The lock information is not as easy to interpret as when you use the data_locks and data_lock_waits tables and the sys.innodb_lock_waits view. However, it is not too difficult once you have tried to perform the analysis a few times.
Create some deadlocks on purpose in a test system and study the resulting deadlock information. Then work your way through the information to determine why the deadlock occurred. Since you know the queries, it is easier to interpret the lock data.
Example of a detected deadlock
The deadlock occurred on November 6, 2019, at 18:29:07 in the server time zone. You can use this information to see if the information is for the same deadlock as the deadlock reported by a user.
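If you want to compare the reported time with an incident report programmatically, the timestamp at the start of the deadlock section can be parsed directly. A small sketch (the header variable holds just the timestamp from this example; in the real output, the line continues with thread information, which is omitted here):

```python
from datetime import datetime

# Timestamp from the start of the LATEST DETECTED DEADLOCK section.
# It is reported in the server's time zone.
header = "2019-11-06 18:29:07"
deadlock_time = datetime.strptime(header, "%Y-%m-%d %H:%M:%S")
print(deadlock_time.isoformat())  # → 2019-11-06T18:29:07
```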
This is a case where the deadlock involved multiple tables. While the two queries work on different tables, that does not on its own prove that more queries are involved, as a foreign key can make a single query take locks on two tables. In this case though, the Code column is the primary key of the country table, and the only foreign key involved is from the CountryCode column of the city table to the Code column of the country table (showing this is left as an exercise for the reader using the world sample database). So it is unlikely that the two queries deadlocked on their own.
The deadlock output is from MySQL 8.0.18, which added additional information to the output. This discussion only uses information that is also available in earlier releases. However, if you are still using an earlier release, upgrading will make it easier to investigate deadlocks.
InnoDB includes the value in hexadecimal notation but also tries to decode it as a string, so here it is clear that the value is “AUS”, which is not surprising since that value also appears in the WHERE clause of the query. It is not always that obvious, so you should always confirm the value from the lock output. You can also see from the information that the column is sorted in ascending order in the index.
So far so good. Since the transactions are waiting for these locks, it can of course be inferred that the other transaction holds the lock. In version 8.0.18 and later, InnoDB includes the full list of locks held by both transactions; in earlier versions, InnoDB only includes this explicitly for one of the transactions, so you need to determine what other queries the transactions have executed.
You can see that both connections were made using the root@localhost account. If you ensure that each application and role has its own user, the account can help you narrow down who executed the transactions.
If the connections still exist, you can also use the events_statements_history table in the Performance Schema to find the latest queries executed by the connection. These may not be the queries involved in the deadlock, depending on whether the connection has executed more queries since, but they may nevertheless provide a clue to what the connection is used for. If the connections no longer exist, you may in principle be able to find the queries in the events_statements_history_long table, but you will need to map the “MySQL thread id” (the connection ID) to the Performance Schema thread ID, for which there is no trivial way. Also, the events_statements_history_long consumer is not enabled by default.
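For connections that still exist, the mapping is straightforward, because the Performance Schema threads table contains both the THREAD_ID and the PROCESSLIST_ID (the connection ID). The sketch below shows one way you might query the statement history for a connection (the query string is an assumption, not the book's example) and simulates the thread ID lookup in Python with sample rows; connection ID 62 matches the example in this section.

```python
# One possible query: fetch the recent statements for a connection by
# resolving its processlist ID to the Performance Schema thread ID.
STATEMENTS_FOR_CONNECTION = """
SELECT event_id, sql_text
  FROM performance_schema.events_statements_history
 WHERE thread_id = (SELECT thread_id
                      FROM performance_schema.threads
                     WHERE processlist_id = %s)
 ORDER BY event_id
"""

def thread_id_for_connection(threads_rows, connection_id):
    """Resolve a connection (processlist) ID to a Performance Schema
    thread ID from rows of (thread_id, processlist_id) tuples."""
    for thread_id, processlist_id in threads_rows:
        if processlist_id == connection_id:
            return thread_id
    return None

# Simulated rows from performance_schema.threads; background threads
# have no processlist ID.
rows = [(10, None), (87, 61), (88, 62)]
print(thread_id_for_connection(rows, 62))  # → 88
```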
Finding the queries involved in the deadlock
Notice that for connection ID 62 (the second of the transactions), the MySQL error number is included, and the third row has it set to 1213 – a deadlock. MySQL Shell automatically executes a SHOW WARNINGS statement when an error is encountered, which is the statement in row 4. Notice also that the nesting event is NULL for the ROLLBACK of transaction 2 but not for the ROLLBACK of transaction 1. That is because the deadlock triggered the whole of transaction 2 to be rolled back (so the explicit ROLLBACK for transaction 2 did not do anything).
The deadlock was triggered by transaction 1 first updating the population of the city table and then of the country table. Transaction 2 first updated the population of the country table and then tried to insert a new city into the city table. This is a typical example of two workflows updating records in different orders and thus being prone to deadlocks.
Analyze the deadlock information from InnoDB to determine the locks involved in the deadlock and get as much information as possible about the connections.
Use other sources such as the Performance Schema to find more information about the queries in the transactions. Often it is necessary to analyze the application to get the list of queries.
Now that you know what triggered the deadlock, what is required to solve the issue?
This means that for transaction 1, there is nothing to do. After transaction 2 has been rolled back, transaction 1 can continue and complete its work.
For transaction 2, InnoDB has rolled back the whole transaction, so all you need to do is to retry the transaction. Remember to execute all queries again instead of relying on values returned during the first attempt; otherwise, you may be using outdated values.
Always be prepared to handle deadlocks and lock wait timeouts. For deadlocks, or when the transaction has been rolled back after a lock wait timeout, retry the entire transaction. For lock wait timeouts where only the query has been rolled back, retry the query, possibly after a delay.
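A retry wrapper in the application can cover both cases. The following is a minimal sketch: the DatabaseError class and its errno attribute are stand-ins for whatever your connector raises (for example, mysql-connector-python's errors expose an errno attribute), and the transaction callable is assumed to execute all of its queries from scratch on every attempt, so retrying the whole transaction is safe regardless of the innodb_rollback_on_timeout setting.

```python
import random
import time

ER_LOCK_WAIT_TIMEOUT = 1205  # Lock wait timeout exceeded
ER_LOCK_DEADLOCK = 1213      # Deadlock found when trying to get lock

class DatabaseError(Exception):
    """Stand-in for a connector's database error with a MySQL errno."""
    def __init__(self, errno, msg=""):
        super().__init__(msg)
        self.errno = errno

def run_with_retry(transaction, max_attempts=3, base_delay=0.1):
    """Run the transaction, retrying on deadlock or lock wait timeout
    with a small randomized backoff between attempts. Any other error
    is re-raised immediately."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DatabaseError as exc:
            if exc.errno not in (ER_LOCK_DEADLOCK, ER_LOCK_WAIT_TIMEOUT):
                raise
            if attempt == max_attempts:
                raise
            time.sleep(base_delay * attempt * random.random())

# Example: a transaction that deadlocks once, then succeeds on retry.
attempts = []
def flaky_transaction():
    attempts.append(1)
    if len(attempts) == 1:
        raise DatabaseError(ER_LOCK_DEADLOCK, "Deadlock found")
    return "committed"

print(run_with_retry(flaky_transaction))  # → committed
```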
If deadlocks occur relatively rarely, you do not really need to do anything more. Deadlocks are a fact of life, so do not be alarmed by encountering a few of them. If deadlocks cause a significant impact, you need to look at making changes to prevent some of the deadlocks.
Reduce the work done by each transaction by splitting large transactions into several smaller ones and adding indexes to reduce the number of locks taken.
Consider the READ COMMITTED transaction isolation level if it is suitable for your application to reduce the number of locks and how long they are held.
Make sure transactions are kept open for as short a time as possible.
Access records in the same order, if necessary by executing SELECT ... FOR UPDATE or SELECT ... FOR SHARE queries to take the locks preemptively.
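As a sketch of the last point, an application can enforce a deterministic lock order by sorting the primary key values before taking the locks. The city table follows the world sample database used in this chapter; the IDs and the one-statement-per-row approach are arbitrary illustration choices.

```python
def lock_cities_in_order(city_ids):
    """Return one SELECT ... FOR UPDATE statement per city ID, sorted
    so all transactions acquire the row locks in the same order."""
    return [
        f"SELECT ID FROM city WHERE ID = {city_id} FOR UPDATE"
        for city_id in sorted(city_ids)
    ]

# Both transactions request the locks in the same order, regardless of
# the order the application supplied the IDs in.
print(lock_cities_in_order([3813, 130]))
print(lock_cities_in_order([130, 3813]))
```

With every transaction locking rows in ascending ID order, two transactions touching the same rows cannot form a cycle on those row locks; one simply waits for the other.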
That concludes the discussion of how to investigate locks. You may encounter lock cases that do not entirely match the cases discussed in this chapter; however, the techniques to investigate the issues will be similar.
This chapter has shown you how you can use the resources available in MySQL to investigate lock-related issues. The chapter included examples of investigating four different types of lock issues: flush locks, metadata locks, record locks, and deadlocks. Each issue type used different features of MySQL including the process list, the lock tables in the Performance Schema, and the InnoDB monitor output.
There are many other lock types that can cause lock wait issues. The methods discussed in this chapter go a long way to investigate issues caused by other lock types as well. In the end, the only way to become an expert on investigating locks is experience, but the techniques from this chapter provide a good starting point.
That concludes Part V about query analysis. Part VI is about improving the queries starting out with a discussion of improving performance through the configuration.