
22. Diagnosing Lock Contention


In Chapter 18, you were introduced to the world of locks in MySQL. If you have not read Chapter 18 yet, you are strongly encouraged to do so now as this chapter is closely related. You may even want to refresh your memory if it has been a while since you read it. Lock issues are one of the most common causes of performance problems, and the impact can be severe. In the worst cases, queries can fail, and connections pile up so no new connections can be made. Therefore, it is important to know how to investigate lock issues and remediate the problems.

This chapter will discuss four categories of lock issues:
  • Flush locks

  • Metadata and schema locks

  • Record-level locks including gap locks

  • Deadlocks

Each category of locks requires different techniques to determine the cause of the lock contention. When you read the examples, keep in mind that similar techniques can be used to investigate lock issues that do not exactly match the examples.

For each lock category, the discussion has been split into six parts:
  • The Symptoms: This describes how you identify that you are encountering this kind of lock issue.

  • The Cause: The underlying reason you encounter this kind of lock issue. This is related to the general discussion of the locks in Chapter 18.

  • The Setup: This includes the steps to set up the lock issue, if you want to try it yourself. As lock contention requires multiple connections, the prompt, for example, Connection 1>, indicates which connection to use for each statement. If you want to follow the investigation with no more information than you would have in a real-world case, you can skip this section and go back and review it after getting through the investigation.

  • The Investigation: The details of the investigation. This draws on the “Monitoring Locks” section of Chapter 18.

  • The Solution: How you resolve the immediate lock problem, so you minimize the outage caused by it.

  • The Prevention: A discussion of how to reduce the chance of encountering the issue. This is closely related to the section “Reduce Locking Issues” in Chapter 18.

Enough talk, the first lock category that will be discussed is flush locks.

Flush Locks

One of the common lock issues encountered in MySQL is the flush lock. When this issue happens, users will typically complain that queries are not returning, and monitoring may show that queries are piling up and eventually MySQL will run out of connections. Issues around the flush lock can also sometimes be one of the most difficult lock issues to investigate.

The Symptoms

The main symptom of a flush lock issue is that the database comes to a grinding halt where all new queries using some or all tables end up waiting for the flush lock. The telltale signs to look for include the following:
  • The query state of new queries is “Waiting for table flush.” This may occur for all new queries or only for queries accessing specific tables.

  • More and more connections are created.

  • Eventually, new connections fail as MySQL is out of connections. The error received for new connections is ER_CON_COUNT_ERROR: “ERROR 1040 (HY000): Too many connections” when using the classic MySQL protocol (by default port 3306) or “MySQL Error 5011: Could not open session” when using the X Protocol (by default port 33060).

  • There is at least one query that has been running longer than the oldest request for a flush lock has been waiting.

  • There may be a FLUSH TABLES statement in the process list, but this is not always the case.

  • When the FLUSH TABLES statement has waited for lock_wait_timeout, an ER_LOCK_WAIT_TIMEOUT error occurs: ERROR: 1205: Lock wait timeout exceeded; try restarting transaction. Since the default value for lock_wait_timeout is 365 days, this is only likely to occur if the timeout has been reduced.

  • If you connect with the mysql command-line client with a default schema set, the connection may seem to hang before you get to the prompt. The same can happen if you change the default schema with a connection open.

Tip

The issue of the mysql command-line client blocking does not occur if you start the client with the -A option, which disables collecting the auto-completion information. A better solution is to use MySQL Shell, which fetches the auto-completion information in a way that does not block due to the flush lock.
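If you need to use the mysql command-line client during such an incident, a minimal example (assuming a local server and the world schema as the default schema) is to start it with auto-rehash disabled:
shell> mysql -A -u root -p world
The -A option is the short form of --no-auto-rehash, so the client does not collect the table metadata that would otherwise block on the flush lock.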

If you see these symptoms, it is time to understand what is causing the lock issue.

The Cause

When a connection requests a table to be flushed, it requires all references to the table to be closed which means no active queries can be using the table. So, when a flush request arrives, it must wait for all queries using the tables that are to be flushed to finish. Note that unless you explicitly specify which tables to flush, it is just the query and not the entire transaction that must finish. Obviously, the case where all tables are flushed, for example, due to FLUSH TABLES WITH READ LOCK, is the most severe as it means all active queries must finish before the flush statement can proceed.
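As a hedged illustration of the difference in scope (the table names are just examples), the following statements flush only specific tables versus all tables:
FLUSH TABLES world.city, world.country WITH READ LOCK;
FLUSH TABLES WITH READ LOCK;
The first statement only has to wait for queries using the two named tables, whereas the second must wait for all active queries to finish.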

When the wait for the flush lock becomes a problem, it means that there are one or more queries preventing the FLUSH TABLES statement from obtaining the flush lock. Since the FLUSH TABLES statement requires an exclusive lock, it in turn stops subsequent queries from acquiring the shared lock they need.

This issue is often seen in connection with backups where the backup process needs to flush all tables and get a read lock in order to create a consistent backup.

A special case can occur when the FLUSH TABLES statement has timed out or has been killed, but the subsequent queries are not proceeding. When that happens, it is because the low-level table definition cache (TDC) version lock is not released. This is a case that can cause confusion as it is not obvious why the subsequent queries are still waiting for the table flush.

The Setup

The lock situation that will be investigated involves three connections (not including the connection used for the investigation). The first connection executes a slow query, the second flushes all tables with a read lock, and the last connection executes a quick query. The statements are
Connection 1> SELECT city.*, SLEEP(180) FROM world.city WHERE ID = 130;
Connection 2> FLUSH TABLES WITH READ LOCK;
Connection 3> SELECT * FROM world.city WHERE ID = 3805;

The use of SLEEP(180) in the first query means you have three minutes (180 seconds) to execute the two other queries and perform the investigation. If you want more time, you can increase the duration of the sleep. You are now ready to start the investigation.

The Investigation

The investigation of flush locks requires you to look at the list of queries running on the instance. Unlike other lock contentions, there are no Performance Schema tables or InnoDB monitor reports that can be used to query for the blocking query directly.

Listing 22-1 shows an example of the output using the sys.session view. Similar results will be produced using the alternative ways to get a list of queries. The thread and connection ids as well as the statement latencies will vary.
mysql> SELECT thd_id, conn_id, state,
              current_statement,
              statement_latency
         FROM sys.session
        WHERE command = 'Query'\G
*************************** 1. row ***************************
           thd_id: 30
          conn_id: 9
            state: User sleep
current_statement: SELECT city.*, SLEEP(180) FROM city WHERE ID = 130
statement_latency: 49.97 s
*************************** 2. row ***************************
           thd_id: 53
          conn_id: 14
            state: Waiting for table flush
current_statement: FLUSH TABLES WITH READ LOCK
statement_latency: 44.48 s
*************************** 3. row ***************************
           thd_id: 51
          conn_id: 13
            state: Waiting for table flush
current_statement: SELECT * FROM world.city WHERE ID = 3805
statement_latency: 41.93 s
*************************** 4. row ***************************
           thd_id: 29
          conn_id: 8
            state: NULL
current_statement: SELECT thd_id, conn_id, state, ... ession WHERE command = 'Query'
statement_latency: 56.13 ms
4 rows in set (0.0644 sec)
Listing 22-1

Investigating flush lock contention using sys.session

There are four queries in the output. The sys.session and sys.processlist views by default sort the queries according to the execution time in descending order. This makes it easy to investigate issues like contention around the flush lock where the query time is the primary thing to consider when looking for the cause.

You start out looking for the FLUSH TABLES statement (the case where there is no FLUSH TABLES statement will be discussed shortly). In this case, that is thd_id = 53 (the second row). Notice that the state of the FLUSH statement is “Waiting for table flush.” You then look for queries that have been running for a longer time. In this case, there is only one query: the one with thd_id = 30. This is the query that blocks the FLUSH TABLES WITH READ LOCK from completing. In general, there may be more than one query.

The two remaining queries are a query being blocked by the FLUSH TABLES WITH READ LOCK and the query to obtain the output. Together, the first three queries form a typical example of a long-running query blocking a FLUSH TABLES statement, which in turn blocks other queries.

You can also get the process list from MySQL Workbench and in some cases also from your monitoring solution. Figure 22-1 shows how to get the process list from MySQL Workbench.
Figure 22-1

Showing the client connections in MySQL Workbench

To get the process list report in MySQL Workbench, choose the Client Connections item under Management in the navigator pane to the left of the screen. You cannot choose which columns to include, and to make the text readable, only part of the report is included in the screenshot. The Id column corresponds to conn_id in the sys.session output, and Thread (the rightmost column) corresponds to thd_id. The full screenshot is included in this book’s GitHub repository as figure_22_1_workbench_flush_lock.png.

Figure 22-2 shows an example of the Processes report from MySQL Enterprise Monitor (MEM) for the same lock situation.
Figure 22-2

The Processes report in MEM for a flush lock investigation

The Processes report is found under the Metrics menu item for individual instances. You can choose which columns you want to include in the output. An example of the report with more details can be found in the book’s GitHub repository as figure_22_2_mem_flush_lock.png.

An advantage of reports like the ones in MySQL Workbench and MySQL Enterprise Monitor is that they use existing connections to create the report. In cases where the lock issue causes all connections to be used, it can be invaluable to be able to get the list of queries from a monitoring solution.

As mentioned, the FLUSH TABLES statement may not always be present in the list of queries. The reason there are still queries waiting for table flush is the low-level TDC version lock. The principles of the investigation remain the same, but it can seem confusing. Listing 22-2 shows such an example using the same setup but killing the connection executing the flush statement before the investigation (Ctrl+C can be used in MySQL Shell in the connection executing FLUSH TABLES WITH READ LOCK).
mysql> SELECT thd_id, conn_id, state,
              current_statement,
              statement_latency
         FROM sys.session
        WHERE command = 'Query'\G
*************************** 1. row ***************************
           thd_id: 30
          conn_id: 9
            state: User sleep
current_statement: SELECT *, SLEEP(180) FROM city WHERE ID = 130
statement_latency: 24.16 s
*************************** 2. row ***************************
           thd_id: 51
          conn_id: 13
            state: Waiting for table flush
current_statement: SELECT * FROM world.city WHERE ID = 3805
statement_latency: 20.20 s
*************************** 3. row ***************************
           thd_id: 29
          conn_id: 8
            state: NULL
current_statement: SELECT thd_id, conn_id, state, ... ession WHERE command = 'Query'
statement_latency: 47.02 ms
3 rows in set (0.0548 sec)
Listing 22-2

Flush lock contention without a FLUSH TABLES statement

This situation is identical to the previous one except that the FLUSH TABLES statement is gone. In this case, find the query that has been waiting the longest with the state “Waiting for table flush.” Queries that have been running longer than this query has been waiting are the ones preventing the TDC version lock from being released. In this case, that means thd_id = 30 is the blocking query.

Once you have identified the issue and the principal queries involved, you need to decide what to do about the issue.

The Solution

There are two levels of solving the issue. First, you need to resolve the immediate problem of queries not executing. Second, you need to work on avoiding the issue in the future. This subsection will discuss the immediate solution, and the next will consider how to reduce the chance of the issue occurring.

To resolve the immediate issue, you have the option of waiting for the queries to complete or starting to kill queries. If you can redirect the application to use another instance while the flush lock contention is ongoing, you may be able to let the situation resolve itself by letting the long-running queries complete. If there are data-changing queries among those running or waiting, you need to consider whether letting them complete will leave the system in a consistent state. One option may be to continue in read-only mode with the read queries executed on a different instance.

If you decide to kill queries, you can try to kill the FLUSH TABLES statement. If that works, it is the simplest solution. However, as discussed, that will not always help, and in that case the only solution is to kill the queries that are preventing the FLUSH TABLES statement from completing. If the long-running queries look like runaway queries and the application/client that executed them is no longer waiting for them, you may want to kill them without trying to kill the FLUSH TABLES statement first.

One important consideration when looking to kill a query is how much data has been changed. For a pure SELECT query (not involving stored routines), that is always nothing, and from the perspective of work done, it is safe to kill it. For INSERT, UPDATE, DELETE, and similar queries, however, the changed data must be rolled back if the query is killed. It will usually take longer to roll back changes than it took to make them in the first place, so be prepared to wait a long time for the rollback if there are many changes. You can use the information_schema.INNODB_TRX table to estimate the amount of work done by looking at the trx_rows_modified column. If there is a lot of work to roll back, it is usually better to let the query complete.
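As a small sketch of that check (the ordering is just for convenience), you can list the active transactions with the most modified rows first:
mysql> SELECT trx_mysql_thread_id, trx_started, trx_rows_modified
         FROM information_schema.INNODB_TRX
        ORDER BY trx_rows_modified DESC;
The trx_mysql_thread_id column is the connection id, so it can be matched against the conn_id column in the sys.session output.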

Caution

When a DML statement is killed, the work it has done must be rolled back. The rollback will usually take longer than creating the change, sometimes much longer. You need to factor that in if you are considering killing a long-running DML statement.

Of course, optimally you prevent the issue from happening at all.

The Prevention

The flush lock contention happens because of the combination of a long-running query and a FLUSH TABLES statement. So, to prevent the issue, you need to look at what you can do to avoid having these two conditions present at the same time.

Finding, analyzing, and handling long-running queries are discussed in other chapters throughout the book. One option of particular interest is to set a timeout for the query. This is supported for SELECT statements using the max_execution_time system variable and the MAX_EXECUTION_TIME(N) optimizer hint and is a great way to protect against runaway queries. Some connectors also have support for timing out queries.

Tip

To avoid long-running SELECT queries, you can configure the max_execution_time option or set the MAX_EXECUTION_TIME(N) optimizer hint. This will make the SELECT statement time out after the specified period and help prevent issues like flush lock waits.
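As a minimal sketch (the 60000-millisecond value is just an example), the timeout can be set for the session or per query with the optimizer hint:
mysql> SET SESSION max_execution_time = 60000;
mysql> SELECT /*+ MAX_EXECUTION_TIME(60000) */ * FROM world.city;
Note that max_execution_time is specified in milliseconds and only applies to read-only SELECT statements.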

Some long-running queries cannot be prevented. It may be a reporting job, building a cache table, or another task that must access a lot of data. In that case, the best you can do is to try to avoid them running while it is also necessary to flush the tables. One option is to schedule the long-running queries to run at different times than when it is necessary to flush tables. Another option is to have the long-running queries run on a different instance than the jobs that require flushing tables.

A common task that requires flushing the tables is taking a backup. In MySQL 8, you can avoid that issue by using the backup and log locks. For example, MySQL Enterprise Backup (MEB) does this in version 8.0.16 and later, so InnoDB tables are never flushed. Alternatively, you can perform the backup at a period with low usage, so the potential for conflicts is lower, or you can even do the backup while the system is in read-only mode and avoid the FLUSH TABLES WITH READ LOCK altogether.
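A minimal sketch of using the backup lock (this requires the BACKUP_ADMIN privilege) is to hold it for the duration of the backup instead of flushing the tables:
mysql> LOCK INSTANCE FOR BACKUP;
-- Perform the backup
mysql> UNLOCK INSTANCE;
The backup lock blocks statements such as DDL that would make the backup inconsistent while still allowing normal DML to proceed.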

Another lock type that often causes confusion is the metadata lock.

Metadata and Schema Locks

In MySQL 5.7 and earlier, metadata locks were often a source of confusion. The problem is that it is not obvious who holds the metadata lock. In MySQL 5.7, instrumentation of the metadata locks was added to the Performance Schema, and in MySQL 8.0 it is enabled by default. With the instrumentation enabled, it becomes easy to determine who is blocking the connection trying to obtain the lock.

The Symptoms

The symptoms of metadata lock contention are similar to those of flush lock contention. In a typical situation, there will be a long-running query or transaction, a DDL statement waiting for the metadata lock, and possibly queries piling up. The symptoms to look out for are as follows:
  • A DDL statement and possibly other queries are stuck in the state “Waiting for table metadata lock.”

  • Queries may be piling up. The queries that are waiting all use the same table. (There may potentially be more than one group of queries waiting if there are DDL statements for multiple tables waiting for the metadata lock.)

  • When the DDL statement has waited for lock_wait_timeout, an ER_LOCK_WAIT_TIMEOUT error occurs: ERROR: 1205: Lock wait timeout exceeded; try restarting transaction. Since the default value for lock_wait_timeout is 365 days, this is only likely to occur if the timeout has been reduced.

  • There is a long-running query or a long-running transaction. In the latter case, the transaction may be idle or executing a query that does not use the table that the DDL statement acts on.

What makes the situation potentially confusing is the last point: there may not be any long-running queries that are the clear candidates for causing the lock issue. So what is the cause of the metadata lock contention?

The Cause

Remember that the metadata locks exist to protect the schema definition (as well as being used with explicit locks). The schema protection will exist for as long as a transaction is active, so when a transaction queries a table, the metadata lock will last until the end of the transaction. Therefore, you may not see any long-running queries. In fact, the transaction holding the metadata lock may not be doing anything at all.

In short, the metadata lock exists because one or more connections may rely on the schema for a given table not changing, or they have explicitly locked the table using either the LOCK TABLES or the FLUSH TABLES WITH READ LOCK statement.

The Setup

The example investigation of metadata locks uses three connections like in the previous example. The first connection is in the middle of a transaction, the second connection tries to add an index to the table used by the transaction, and the third connection attempts to execute a query against the same table. The queries are
Connection 1> START TRANSACTION;
Query OK, 0 rows affected (0.0003 sec)
Connection 1> SELECT * FROM world.city WHERE ID = 3805\G
*************************** 1. row ***************************
         ID: 3805
       Name: San Francisco
CountryCode: USA
   District: California
 Population: 776733
1 row in set (0.0006 sec)
Connection 1> SELECT Code, Name FROM world.country WHERE Code = 'USA'\G
*************************** 1. row ***************************
Code: USA
Name: United States
1 row in set (0.0020 sec)
Connection 2> ALTER TABLE world.city ADD INDEX (Name);
Connection 3> SELECT * FROM world.city WHERE ID = 130;

At this point, you can start the investigation. The situation will not resolve itself (unless you have a low value for lock_wait_timeout or you are prepared to wait a year), so you have all the time you want. When you want to resolve the block, start by terminating the ALTER TABLE statement in Connection 2 to avoid modifying the world.city table. Then commit or roll back the transaction in Connection 1.

The Investigation

If you have the wait/lock/metadata/sql/mdl Performance Schema instrument enabled (the default in MySQL 8), it is straightforward to investigate metadata lock issues. You can use the metadata_locks table in the Performance Schema to list the granted and pending locks. However, a simpler way to get a summary of the lock situation is to use the schema_table_lock_waits view in the sys schema.
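If you want to look at the raw lock data, a simple example query (limiting the output to the world schema) against the metadata_locks table is:
mysql> SELECT OBJECT_TYPE, OBJECT_SCHEMA, OBJECT_NAME,
              LOCK_TYPE, LOCK_STATUS, OWNER_THREAD_ID
         FROM performance_schema.metadata_locks
        WHERE OBJECT_SCHEMA = 'world';
The LOCK_STATUS column shows whether each lock is GRANTED or PENDING, and OWNER_THREAD_ID can be matched against the thd_id values used elsewhere in the examples.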

As an example, consider the metadata lock wait issue that can be seen in Listing 22-3 where three connections are involved. The WHERE clause has been chosen to just include the rows of interest for this investigation.
mysql> SELECT thd_id, conn_id, state,
              current_statement,
              statement_latency
         FROM sys.session
        WHERE command = 'Query' OR trx_state = 'ACTIVE'\G
*************************** 1. row ***************************
           thd_id: 30
          conn_id: 9
            state: NULL
current_statement: SELECT Code, Name FROM world.country WHERE Code = 'USA'
statement_latency: NULL
*************************** 2. row ***************************
           thd_id: 7130
          conn_id: 7090
            state: Waiting for table metadata lock
current_statement: ALTER TABLE world.city ADD INDEX (Name)
statement_latency: 19.92 m
*************************** 3. row ***************************
           thd_id: 51
          conn_id: 13
            state: Waiting for table metadata lock
current_statement: SELECT * FROM world.city WHERE ID = 130
statement_latency: 19.78 m
*************************** 4. row ***************************
           thd_id: 107
          conn_id: 46
            state: NULL
current_statement: SELECT thd_id, conn_id, state, ... Query' OR trx_state = 'ACTIVE'
statement_latency: 56.77 ms
4 rows in set (0.0629 sec)
Listing 22-3

A metadata lock wait issue

Two connections are waiting for a metadata lock (on the world.city table). There is also a third connection included (conn_id = 9) that is idle, which can be seen from the NULL statement latency (in some versions earlier than 8.0.18, you may also see that the current statement is NULL). In this case, the list of queries is limited to those with an active query or an active transaction, but usually you will start out with a full process list. However, to make it easy to focus on the important parts, the output is filtered.

Once you know there is a metadata lock issue, you can use the sys.schema_table_lock_waits view to get information about the lock contention. Listing 22-4 shows an example of the output corresponding to the just discussed process list.
mysql> SELECT *
         FROM sys.schema_table_lock_waits\G
*************************** 1. row ***************************
               object_schema: world
                 object_name: city
           waiting_thread_id: 7130
                 waiting_pid: 7090
             waiting_account: root@localhost
           waiting_lock_type: EXCLUSIVE
       waiting_lock_duration: TRANSACTION
               waiting_query: ALTER TABLE world.city ADD INDEX (Name)
          waiting_query_secs: 1219
 waiting_query_rows_affected: 0
 waiting_query_rows_examined: 0
          blocking_thread_id: 7130
                blocking_pid: 7090
            blocking_account: root@localhost
          blocking_lock_type: SHARED_UPGRADABLE
      blocking_lock_duration: TRANSACTION
     sql_kill_blocking_query: KILL QUERY 7090
sql_kill_blocking_connection: KILL 7090
*************************** 2. row ***************************
               object_schema: world
                 object_name: city
           waiting_thread_id: 51
                 waiting_pid: 13
             waiting_account: root@localhost
           waiting_lock_type: SHARED_READ
       waiting_lock_duration: TRANSACTION
               waiting_query: SELECT * FROM world.city WHERE ID = 130
          waiting_query_secs: 1210
 waiting_query_rows_affected: 0
 waiting_query_rows_examined: 0
          blocking_thread_id: 7130
                blocking_pid: 7090
            blocking_account: root@localhost
          blocking_lock_type: SHARED_UPGRADABLE
      blocking_lock_duration: TRANSACTION
     sql_kill_blocking_query: KILL QUERY 7090
sql_kill_blocking_connection: KILL 7090
*************************** 3. row ***************************
               object_schema: world
                 object_name: city
           waiting_thread_id: 7130
                 waiting_pid: 7090
             waiting_account: root@localhost
           waiting_lock_type: EXCLUSIVE
       waiting_lock_duration: TRANSACTION
               waiting_query: ALTER TABLE world.city ADD INDEX (Name)
          waiting_query_secs: 1219
 waiting_query_rows_affected: 0
 waiting_query_rows_examined: 0
          blocking_thread_id: 30
                blocking_pid: 9
            blocking_account: root@localhost
          blocking_lock_type: SHARED_READ
      blocking_lock_duration: TRANSACTION
     sql_kill_blocking_query: KILL QUERY 9
sql_kill_blocking_connection: KILL 9
*************************** 4. row ***************************
               object_schema: world
                 object_name: city
           waiting_thread_id: 51
                 waiting_pid: 13
             waiting_account: root@localhost
           waiting_lock_type: SHARED_READ
       waiting_lock_duration: TRANSACTION
               waiting_query: SELECT * FROM world.city WHERE ID = 130
          waiting_query_secs: 1210
 waiting_query_rows_affected: 0
 waiting_query_rows_examined: 0
          blocking_thread_id: 30
                blocking_pid: 9
            blocking_account: root@localhost
          blocking_lock_type: SHARED_READ
      blocking_lock_duration: TRANSACTION
     sql_kill_blocking_query: KILL QUERY 9
sql_kill_blocking_connection: KILL 9
4 rows in set (0.0024 sec)
Listing 22-4

Finding metadata lock contention

The output shows that there are four cases of queries waiting and blocking. This may be surprising, but it happens because there are several locks involved and there is a chain of waits. Each row is a pair of a waiting and blocking connection. The output uses “pid” for the process list id which is the same as the connection id used in earlier outputs. The information includes what the lock is on, details about the waiting connection, details about the blocking connection, and two queries that can be used to kill the blocking query or connection.

The first row shows process list id 7090 waiting on itself. That sounds like a deadlock, but it is not. The reason is that the ALTER TABLE first took a shared lock that can be upgraded and then tried to get the exclusive lock which is waiting. Because there is no explicit information on which existing lock is actually blocking the new lock, this information ends up being included.

The second row shows that the SELECT statement is waiting for process list id 7090 which is the ALTER TABLE. This is the reason that connections can start to pile up as the DDL statement requires an exclusive lock, so it will block requests for shared locks.

The third and fourth rows are where the underlying issue for the lock contention is revealed. Process list id 9 is blocking both of the other connections, which shows that this is the main culprit blocking the DDL statement. So, when you are investigating an issue like this, look for a connection waiting for an exclusive metadata lock that is blocked by another connection. If there is a large number of rows in the output, you can also look for the connection causing the most blocks and use that as a starting point. Listing 22-5 shows an example of how you can do this.
mysql> SELECT *
         FROM sys.schema_table_lock_waits
        WHERE waiting_lock_type = 'EXCLUSIVE'
              AND waiting_pid <> blocking_pid\G
*************************** 1. row ***************************
               object_schema: world
                 object_name: city
           waiting_thread_id: 7130
                 waiting_pid: 7090
             waiting_account: root@localhost
           waiting_lock_type: EXCLUSIVE
       waiting_lock_duration: TRANSACTION
               waiting_query: ALTER TABLE world.city ADD INDEX (Name)
          waiting_query_secs: 4906
 waiting_query_rows_affected: 0
 waiting_query_rows_examined: 0
          blocking_thread_id: 30
                blocking_pid: 9
            blocking_account: root@localhost
          blocking_lock_type: SHARED_READ
      blocking_lock_duration: TRANSACTION
     sql_kill_blocking_query: KILL QUERY 9
sql_kill_blocking_connection: KILL 9
1 row in set (0.0056 sec)
mysql> SELECT blocking_pid, COUNT(*)
         FROM sys.schema_table_lock_waits
        WHERE waiting_pid <> blocking_pid
        GROUP BY blocking_pid
        ORDER BY COUNT(*) DESC;
+--------------+----------+
| blocking_pid | COUNT(*) |
+--------------+----------+
|            9 |        2 |
|         7090 |        1 |
+--------------+----------+
2 rows in set (0.0028 sec)
Listing 22-5

Looking for the connection causing the metadata lock block

The first query looks for a wait on an exclusive metadata lock where the blocking process list id is not the waiting connection itself. In this case, that immediately reveals the main lock contention. The second query determines the number of blocking queries triggered by each process list id. It may not be as simple as shown in this example, but using queries as shown here will help narrow down where the lock contention is.

Once you have determined where the lock contention originates, you need to determine what the transaction is doing. In this case, the root of the lock contention is Connection 9. Going back to the process list output, you can see that it is not doing anything in this case:
*************************** 1. row ***************************
           thd_id: 30
          conn_id: 9
            state: NULL
current_statement: SELECT Code, Name FROM world.country WHERE Code = 'USA'
statement_latency: NULL
What did this connection do to take the metadata lock? The fact that there is no current statement that involves the world.city table suggests the connection has an active transaction open. In this case, the transaction is idle (as seen by statement_latency = NULL), but it could also be that there was a query executing that is unrelated to the metadata lock on the world.city table. In either case, you need to determine what the transaction was doing prior to the current state. You can use the Performance Schema and Information Schema for this. Listing 22-6 shows an example of investigating the status and recent history of a transaction.
mysql> SELECT *
         FROM information_schema.INNODB_TRX
        WHERE trx_mysql_thread_id = 9\G
*************************** 1. row ***************************
                    trx_id: 283529000061592
                 trx_state: RUNNING
               trx_started: 2019-06-15 13:22:29
     trx_requested_lock_id: NULL
          trx_wait_started: NULL
                trx_weight: 0
       trx_mysql_thread_id: 9
                 trx_query: NULL
       trx_operation_state: NULL
         trx_tables_in_use: 0
         trx_tables_locked: 0
          trx_lock_structs: 0
     trx_lock_memory_bytes: 1136
           trx_rows_locked: 0
         trx_rows_modified: 0
   trx_concurrency_tickets: 0
       trx_isolation_level: REPEATABLE READ
         trx_unique_checks: 1
    trx_foreign_key_checks: 1
trx_last_foreign_key_error: NULL
 trx_adaptive_hash_latched: 0
 trx_adaptive_hash_timeout: 0
          trx_is_read_only: 0
trx_autocommit_non_locking: 0
1 row in set (0.0006 sec)
mysql> SELECT *
         FROM performance_schema.events_transactions_current
        WHERE THREAD_ID = 30\G
*************************** 1. row ***************************
                      THREAD_ID: 30
                       EVENT_ID: 113
                   END_EVENT_ID: NULL
                     EVENT_NAME: transaction
                          STATE: ACTIVE
                         TRX_ID: NULL
                           GTID: AUTOMATIC
                  XID_FORMAT_ID: NULL
                      XID_GTRID: NULL
                      XID_BQUAL: NULL
                       XA_STATE: NULL
                         SOURCE: transaction.cc:219
                    TIMER_START: 12849615560172160
                      TIMER_END: 18599491723543808
                     TIMER_WAIT: 5749876163371648
                    ACCESS_MODE: READ WRITE
                ISOLATION_LEVEL: REPEATABLE READ
                     AUTOCOMMIT: NO
           NUMBER_OF_SAVEPOINTS: 0
NUMBER_OF_ROLLBACK_TO_SAVEPOINT: 0
    NUMBER_OF_RELEASE_SAVEPOINT: 0
          OBJECT_INSTANCE_BEGIN: NULL
               NESTING_EVENT_ID: 112
             NESTING_EVENT_TYPE: STATEMENT
1 row in set (0.0008 sec)
mysql> SELECT EVENT_ID, CURRENT_SCHEMA,
              SQL_TEXT
         FROM performance_schema.events_statements_history
        WHERE THREAD_ID = 30
              AND NESTING_EVENT_ID = 113
              AND NESTING_EVENT_TYPE = 'TRANSACTION'\G
*************************** 1. row ***************************
      EVENT_ID: 114
CURRENT_SCHEMA: world
      SQL_TEXT: SELECT * FROM world.city WHERE ID = 3805
*************************** 2. row ***************************
      EVENT_ID: 115
CURRENT_SCHEMA: world
      SQL_TEXT: SELECT * FROM world.country WHERE Code = 'USA'
2 rows in set (0.0036 sec)
mysql> SELECT ATTR_NAME, ATTR_VALUE
         FROM performance_schema.session_connect_attrs
        WHERE PROCESSLIST_ID = 9;
+-----------------+------------+
| ATTR_NAME       | ATTR_VALUE |
+-----------------+------------+
| _pid            | 23256      |
| program_name    | mysqlsh    |
| _client_name    | libmysql   |
| _thread         | 20164      |
| _client_version | 8.0.18     |
| _os             | Win64      |
| _platform       | x86_64     |
+-----------------+------------+
7 rows in set (0.0006 sec)
Listing 22-6

Investigating a transaction

The first query uses the INNODB_TRX table in the Information Schema. It, for example, shows when the transaction was started, so you can determine how long it has been active. The trx_rows_modified column is also useful for knowing how much data has been changed by the transaction in case you decide to roll it back. Note that what InnoDB calls the MySQL thread id (the trx_mysql_thread_id column) is actually the connection id.

The second query uses the events_transactions_current table from the Performance Schema to get more transaction information. You can use the TIMER_WAIT column to determine the age of the transaction. The value is in picoseconds, so it can be easier to understand what the value is by using the FORMAT_PICO_TIME() function:
mysql> SELECT FORMAT_PICO_TIME(5749876163371648) AS Age;
+--------+
| Age    |
+--------+
| 1.60 h |
+--------+
1 row in set (0.0003 sec)

If you are using MySQL 8.0.15 or earlier, use the sys.format_time() function instead.

The third query uses the events_statements_history table to find the previous queries executed in the transaction. The NESTING_EVENT_ID column is set to the value of the EVENT_ID from the output of the events_transactions_current table, and the NESTING_EVENT_TYPE column is set to match a transaction. This ensures that only events that are children of the ongoing transaction are returned. The result is ordered by the EVENT_ID (of the statement) to get the statements in the order they were executed. By default, the events_statements_history table will include at most the ten latest queries for the connection.

In this example, the investigation shows that the transaction has executed two queries: one selecting from the world.city table and one selecting from the world.country table. It is the first of these queries causing the metadata lock contention.

The fourth query uses the session_connect_attrs table to find the attributes submitted by the connection. Not all clients and connectors submit attributes, or they may be disabled, so this information is not always available. When the attributes are available, they can be useful to find out where the offending transaction is executed from. In this example, you can see the connection is from MySQL Shell (mysqlsh). This can be useful if you want to commit an idle transaction.

The Solution

For metadata lock contention, you essentially have two options to resolve the issue: make the blocking transaction complete or kill the DDL statement. To complete the blocking transaction, you will need to either commit it or roll it back. If you kill the connection, it triggers a rollback of the transaction, so you need to take into consideration how much work will need to be rolled back. In order to commit the transaction, you must find where the connection is executing from and commit it from there. You cannot commit a transaction owned by a different connection.

Killing the DDL statement will allow the other queries to proceed, but it does not solve the issue in the long term if the lock is held by an abandoned but still active transaction. For cases where there is an abandoned transaction holding the metadata lock, it can however be an option to kill both the DDL statement and the connection with the abandoned transaction. That way you avoid having the DDL statement continue to block subsequent queries while the transaction rolls back. Then, when the rollback has completed, you can retry the DDL statement.

The Prevention

The key to avoiding metadata lock contention is to avoid a long-running transaction at the same time as you need to execute DDL statements for the tables used by the transaction. You can, for example, execute DDL statements at times when you know there are no long-running transactions. You can also set the lock_wait_timeout option to a low value, which makes the DDL statement give up after lock_wait_timeout seconds. While that does not avoid the lock problem, it mitigates the issue by preventing the DDL statement from stopping other queries from executing. You can then find the root cause without the stress of having a large part of the application not working.
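A minimal sketch of that mitigation (the 5-second value is just an example) is to lower the timeout for the session before running the DDL statement:
mysql> SET SESSION lock_wait_timeout = 5;
mysql> ALTER TABLE world.city ADD INDEX (Name);
ERROR: 1205: Lock wait timeout exceeded; try restarting transaction
If the metadata lock cannot be obtained within five seconds, the ALTER TABLE fails with ER_LOCK_WAIT_TIMEOUT instead of blocking other queries for an extended period.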

You can also aim at reducing how long transactions are active. One option is to split a large transaction into several smaller transactions, if it is not required that all operations are performed as an atomic unit. You should also make sure that the transaction is not kept open for an unnecessarily long time by avoiding interactive work, file I/O, transferring data to the end user, and so on while the transaction is active.

One common cause of a long-running transaction is that the application or client does not commit or roll back the transaction at all. This is particularly likely to happen with the autocommit option disabled. When autocommit is disabled, any query – even a plain read-only SELECT statement – will start a new transaction when there is not already an active transaction. This means that an innocent-looking query may start a transaction, and if the developer is not aware that autocommit is disabled, then the developer may not think about explicitly ending the transaction. The autocommit setting is enabled by default in MySQL Server, but some connectors disable it by default.
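If you suspect this is happening, a quick check (and, if desired, re-enabling the setting for the session) can be done as follows:
mysql> SELECT @@autocommit;
mysql> SET SESSION autocommit = ON;
With autocommit enabled, single statements outside an explicit transaction commit immediately, so their metadata locks are released as soon as the statement completes.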

That concludes the discussion about investigating metadata locks. The next level of locks to look at are the record locks.

Record-Level Locks

Record lock contention is the most frequently encountered lock issue, but usually also the least intrusive as the default lock wait timeout is just 50 seconds, so there is not the same potential for queries piling up. That said, there are cases – as will be shown – where record locks can cause MySQL to come to a grinding halt. This section will look into investigating InnoDB record lock issues in general and lock wait timeout issues in more detail. Investigating the specifics of deadlocks is deferred until the next section.

The Symptoms

The symptoms of InnoDB record lock contention are often very subtle and not easily recognizable. In severe cases, you will get a lock wait timeout or a deadlock error, but in many cases, there may be no direct symptoms. Rather the symptom is that queries are slower than normal. This may range from being a fraction of a second slower to being many seconds slower.

For cases where there is a lock wait timeout, you will see an ER_LOCK_WAIT_TIMEOUT error like the one in the following example:
ERROR: 1205: Lock wait timeout exceeded; try restarting transaction
When the queries are slower than they would be without lock contention, the most likely way to detect the issue is through monitoring, either using something similar to the Query Analyzer in MySQL Enterprise Monitor or detecting lock contention using the sys.innodb_lock_waits view. Figure 22-3 shows an example of a query in the Query Analyzer. The sys schema view will be used when discussing the investigation of record lock contention. The figure is also available in full size in this book’s GitHub repository as figure_22_3_quan.png.
Figure 22-3

Example of a lock contention detected in the Query Analyzer

In the figure, notice how the latency graph for the query increases toward the end of the period and then suddenly drops again. There is also a red icon to the right of the normalized query – that icon means the query has returned errors. In this case the error is a lock wait timeout, but that cannot be seen from the figure. The donut-shaped chart to the left of the normalized query also shows a red area, indicating that the query response time index for the query is at times considered poor. The large graph at the top shows a small dip, showing there were enough issues on the instance to cause a general degradation of its performance.

There are also several instance-level metrics that show how much locking is occurring for the instance. These can be very useful to monitor the general lock contention over time. Listing 22-7 shows the available metrics using the sys.metrics view.
mysql> SELECT Variable_name,
              Variable_value AS Value,
              Enabled
         FROM sys.metrics
        WHERE Variable_name LIKE 'innodb_row_lock%'
              OR Type = 'InnoDB Metrics - lock';
+-------------------------------+--------+---------+
| Variable_name                 | Value  | Enabled |
+-------------------------------+--------+---------+
| innodb_row_lock_current_waits | 0      | YES     |
| innodb_row_lock_time          | 595876 | YES     |
| innodb_row_lock_time_avg      | 1683   | YES     |
| innodb_row_lock_time_max      | 51531  | YES     |
| innodb_row_lock_waits         | 354    | YES     |
| lock_deadlocks                | 0      | YES     |
| lock_rec_lock_created         | 0      | NO      |
| lock_rec_lock_removed         | 0      | NO      |
| lock_rec_lock_requests        | 0      | NO      |
| lock_rec_lock_waits           | 0      | NO      |
| lock_rec_locks                | 0      | NO      |
| lock_row_lock_current_waits   | 0      | YES     |
| lock_table_lock_created       | 0      | NO      |
| lock_table_lock_removed       | 0      | NO      |
| lock_table_lock_waits         | 0      | NO      |
| lock_table_locks              | 0      | NO      |
| lock_timeouts                 | 1      | YES     |
+-------------------------------+--------+---------+
17 rows in set (0.0203 sec)
Listing 22-7

InnoDB lock metrics

For this discussion, the innodb_row_lock_% and lock_timeouts metrics are the most interesting. The three time variables are in milliseconds. It can be seen that there has been a single lock wait timeout, which on its own is not necessarily a cause for concern. You can also see there have been 354 cases where a lock could not be granted immediately (innodb_row_lock_waits) and that there have been waits of more than 51 seconds (innodb_row_lock_time_max). When the general level of lock contention increases, you will see these metrics increase as well.

Even better than monitoring the metrics manually, ensure your monitoring solution records the metrics and can plot them over time in time series graphs. Figure 22-4 shows an example of the metrics plotted for the same incident that was found in Figure 22-3.
Figure 22-4

Timeseries graphs for InnoDB row lock metrics

The graphs show a general increase in locking. The number of lock waits has two periods with increased lock waits and then drops off again. The row lock time graph shows a similar pattern. This is a typical sign of intermittent lock issues.

The Cause

InnoDB works with shared and exclusive locks on the row data, index records, and gaps, as well as insert intention locks. When two transactions attempt to access the data in conflicting ways, one query will have to wait until the required lock becomes available. In short, two requests for a shared lock can be granted at the same time, but once there is an exclusive lock, no other connections can acquire a lock on the same record.

As it is exclusive locks that are the most likely to cause lock contention, it is usually DML queries that change data that are the cause of InnoDB record lock contention. Another source is SELECT statements doing preemptive locking by adding the FOR SHARE (or LOCK IN SHARE MODE) or FOR UPDATE clause.
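A couple of examples of such preemptive locking (using the world.city table from the other examples) are:
Connection 1> SELECT * FROM world.city WHERE ID = 130 FOR SHARE;
Connection 1> SELECT * FROM world.city WHERE ID = 130 FOR UPDATE;
The FOR SHARE variant takes a shared lock on the matching records, whereas FOR UPDATE takes an exclusive lock, so the latter blocks both writers and other locking readers until the transaction ends.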

The Setup

This example requires just two connections to set up the scenario that is being investigated: the first connection has an ongoing transaction, and the second tries to update a row that the first connection holds a lock for. Since the default timeout waiting for InnoDB locks is 50 seconds, you can optionally choose to increase this timeout for the second connection that will block to give you more time to perform the investigation. The setup is
Connection 1> START TRANSACTION;
Query OK, 0 rows affected (0.0002 sec)
Connection 1> UPDATE world.city
                 SET Population = 5000000
               WHERE ID = 130;
Query OK, 1 row affected (0.0005 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Connection 2> SET SESSION innodb_lock_wait_timeout = 300;
Query OK, 0 rows affected (0.0003 sec)
Connection 2> START TRANSACTION;
Query OK, 0 rows affected (0.0003 sec)
Connection 2> UPDATE world.city SET Population = Population * 1.10 WHERE CountryCode = 'AUS';

In this example the lock wait timeout for Connection 2 is set to 300 seconds. The START TRANSACTION for Connection 2 is not required but allows you to roll both transactions back when you are done to avoid making changes to the data.

The Investigation

The investigation of record locks is very similar to investigating metadata locks. You can query the data_locks and data_lock_waits tables in the Performance Schema, which show the raw lock data and pending locks, respectively. There is also the sys.innodb_lock_waits view, which queries the two tables to find pairs of locks with one being blocked by the other.
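If you want to inspect the raw lock data directly, a simple example query against the data_locks table is:
mysql> SELECT ENGINE_TRANSACTION_ID, OBJECT_SCHEMA, OBJECT_NAME,
              INDEX_NAME, LOCK_TYPE, LOCK_MODE, LOCK_STATUS, LOCK_DATA
         FROM performance_schema.data_locks;
The LOCK_STATUS column shows whether each lock is GRANTED or WAITING, and LOCK_DATA shows the index values the lock covers (for example, the primary key value for a record lock).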

Note

The data_locks and data_lock_waits tables are new in MySQL 8. In MySQL 5.7 and earlier, there were two similar tables in the Information Schema named INNODB_LOCKS and INNODB_LOCK_WAITS. An advantage of using the innodb_lock_waits view is that it works the same (but with some extra information in MySQL 8) across the MySQL versions.

In most cases, it is easiest to start the investigation using the innodb_lock_waits view and only dive into the Performance Schema tables as needed. Listing 22-8 shows an example of the output from innodb_lock_waits for a lock wait situation.
mysql> SELECT * FROM sys.innodb_lock_waits\G
*************************** 1. row ***************************
                wait_started: 2019-06-15 18:37:42
                    wait_age: 00:00:02
               wait_age_secs: 2
                locked_table: `world`.`city`
         locked_table_schema: world
           locked_table_name: city
      locked_table_partition: NULL
   locked_table_subpartition: NULL
                locked_index: PRIMARY
                 locked_type: RECORD
              waiting_trx_id: 3317978
         waiting_trx_started: 2019-06-15 18:37:42
             waiting_trx_age: 00:00:02
     waiting_trx_rows_locked: 2
   waiting_trx_rows_modified: 0
                 waiting_pid: 4172
               waiting_query: UPDATE city SET Population = P ... 1.10 WHERE CountryCode = 'AUS'
             waiting_lock_id: 1999758099664:525:6:131:1999728339632
           waiting_lock_mode: X,REC_NOT_GAP
             blocking_trx_id: 3317977
                blocking_pid: 9
              blocking_query: NULL
            blocking_lock_id: 1999758097920:525:6:131:1999728329336
          blocking_lock_mode: X,REC_NOT_GAP
        blocking_trx_started: 2019-06-15 18:37:40
            blocking_trx_age: 00:00:04
    blocking_trx_rows_locked: 1
  blocking_trx_rows_modified: 1
     sql_kill_blocking_query: KILL QUERY 9
sql_kill_blocking_connection: KILL 9
1 row in set (0.0145 sec)
Listing 22-8

Retrieving lock information from the innodb_lock_waits view

The columns in the output can be divided into five sections based on the prefix of the column name. The groups are
  • wait_: These columns show some general information around the age of the lock wait.

  • locked_: These columns show what is locked ranging from the schema to the index as well as the lock type.

  • waiting_: These columns show details of the transaction that is waiting for the lock to be granted including the query and the lock mode requested.

  • blocking_: These columns show details of the transaction that is blocking the lock request. Note that in the example, the blocking query is NULL. This means the transaction is idle at the time the output was generated. Even when there is a blocking query listed, the query may not have anything to do with the lock that there is contention for – other than the query is executed by the same transaction that holds the lock.

  • sql_kill_: These two columns provide the KILL queries that can be used to kill the blocking query or connection.

Note

The column blocking_query is the query currently executed (if any) for the blocking transaction. It does not mean that the query itself is necessarily causing the lock request to block.

The case where the blocking_query column is NULL is a common situation. It means that the blocking transaction is currently not executing a query. This may be because it is between two queries. If that lasts for an extended period, it suggests the application is doing work that ideally should be done outside the transaction. More commonly, the transaction is not executing a query because it has been forgotten about, either in an interactive session where the human has forgotten to end the transaction or in an application flow that does not ensure transactions are committed or rolled back.

The Solution

The solution depends on the extent of the lock waits. If it is a few queries having short lock waits, it may very well be acceptable to just let the affected queries wait for the lock to become available. Remember locks are there to ensure the integrity of the data, so locks are not inherently a problem. Locks are only a problem when they cause a significant impact on the performance or cause queries to fail to an extent where it is not feasible to retry them.

If the lock situation lasts for an extended period – particularly if the blocking transaction has been abandoned – you can consider killing the blocking transaction. As always you need to consider that the rollback may take a significant amount of time if the blocking transaction has performed a large amount of work.

For queries that fail because of a lock wait timeout error, the application should retry them. Remember that by default a lock wait timeout only rolls back the query that was executing when the timeout occurred. The rest of the transaction is left as it was before the query. A failure to handle the timeout may thus leave an unfinished transaction with its own locks that can cause further lock issues. Whether just the query or the whole transaction is rolled back is controlled by the innodb_rollback_on_timeout option.

Caution

It is very important that a lock wait timeout is handled as otherwise it may leave the transaction with locks that are not released. If that happens, other transactions may not be able to acquire the locks they require.

The Prevention

Preventing significant record-level lock contention largely follows the guidelines that were discussed in the section “Reduce Locking Issues” in Chapter 18. To recapitulate the discussion, the way to reduce lock wait contention is largely about reducing the size and duration of transactions, using indexes to reduce the number of records accessed, and possibly switching the transaction isolation level to READ COMMITTED to release locks earlier and reduce the number of gap locks.
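As a minimal example of the last point (whether READ COMMITTED is appropriate depends on the application), the isolation level can be changed for a session before starting the transaction:
mysql> SET SESSION transaction_isolation = 'READ-COMMITTED';
mysql> START TRANSACTION;
In READ COMMITTED, InnoDB releases record locks on rows that do not match the statement and largely avoids gap locks, which reduces the scope for lock contention.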

Deadlocks

One of the lock issues causing the most concern for database administrators is deadlocks. This is partly because of the name and partly because, unlike the other lock issues discussed, they always cause an error. However, there is nothing especially worrying about deadlocks compared to other locking issues. On the contrary, the fact that they cause an error means that you know about them sooner, and the lock issue resolves itself.

The Symptoms

The symptoms are straightforward. The victim of a deadlock receives an error, and the lock_deadlocks InnoDB metric increments. The error that will be returned to the transaction that InnoDB chooses as the victim is ER_LOCK_DEADLOCK:
ERROR: 1213: Deadlock found when trying to get lock; try restarting transaction
The lock_deadlocks metric is very useful to keep an eye on how often deadlocks occur. A convenient way to track the value of lock_deadlocks is to use the sys.metrics view:
mysql> SELECT *
         FROM sys.metrics
        WHERE Variable_name = 'lock_deadlocks'\G
*************************** 1. row ***************************
 Variable_name: lock_deadlocks
Variable_value: 42
          Type: InnoDB Metrics - lock
       Enabled: YES
1 row in set (0.0087 sec)

You can also check the LATEST DETECTED DEADLOCK section in the output of the InnoDB monitor, for example, by executing SHOW ENGINE INNODB STATUS. This shows when the latest deadlock occurred, and thus you can use it to judge how frequently deadlocks occur. If you have the innodb_print_all_deadlocks option enabled, the error log will contain the deadlock information for every deadlock. The details of the InnoDB monitor output for deadlocks will be covered in “The Investigation” after the cause of deadlocks and the setup have been discussed.
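If you want every deadlock recorded rather than just the latest, a minimal example of enabling the option (SET PERSIST keeps the setting across restarts) is:
mysql> SET PERSIST innodb_print_all_deadlocks = ON;
With the option enabled, the deadlock information is written to the error log each time a deadlock occurs.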

The Cause

Deadlocks are caused by locks being obtained in different orders for two or more transactions. Each transaction ends up holding a lock that the other transaction needs. This lock may be a record lock, gap lock, predicate lock, or insert intention lock. Figure 22-5 shows an example of a circular dependency that triggers a deadlock.
Figure 22-5

A circular lock dependency triggering a deadlock

The deadlock shown in the figure is due to two record locks on the primary keys of a table. That is one of the simplest deadlocks that can occur. As shown when investigating a deadlock, the cycle can be more complex than this.

The Setup

This example uses two connections like the previous example, but this time both make changes; Connection 1 ends up blocking until Connection 2 hits a deadlock error and rolls back its changes. Connection 1 updates the population of Australia and its cities by 10%, whereas Connection 2 adds the population of the city of Darwin to Australia's population and inserts the city. The statements are
Connection 1> START TRANSACTION;
Query OK, 0 rows affected (0.0001 sec)
Connection 1> UPDATE world.city SET Population = Population * 1.10 WHERE CountryCode = 'AUS';
Query OK, 14 rows affected (0.0010 sec)
Rows matched: 14  Changed: 14  Warnings: 0
Connection 2> START TRANSACTION;
Query OK, 0 rows affected (0.0003 sec)
Connection 2> UPDATE world.country SET Population = Population + 146000 WHERE Code = 'AUS';
Query OK, 1 row affected (0.0317 sec)
Rows matched: 1  Changed: 1  Warnings: 0
Connection 1> UPDATE world.country SET Population = Population * 1.1 WHERE Code = 'AUS';
-- Blocks
Connection 2> INSERT INTO world.city VALUES (4080, 'Darwin', 'AUS', 'Northern Territory', 146000);
ERROR: 1213: Deadlock found when trying to get lock; try restarting transaction
Connection 2> ROLLBACK;
Query OK, 0 rows affected (0.0003 sec)
Connection 1> ROLLBACK;
Query OK, 0 rows affected (0.3301 sec)

The key is that the two transactions both update the city and country tables but in opposite order. The setup completes by explicitly rolling back both transactions to ensure the tables are left without changes.

The Investigation

The main tool to analyze deadlocks is the section with information about the latest detected deadlock in the InnoDB monitor output. If you have the innodb_print_all_deadlocks option enabled (OFF by default), you may also have the deadlock information from the error log; however, the information is the same, so it does not change the analysis.
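If you want each deadlock written to the error log as it happens, instead of only keeping the latest one in memory, you can enable the option at runtime; for example (SET PERSIST also keeps the setting across restarts and requires a privileged account):
mysql> SET PERSIST innodb_print_all_deadlocks = ON;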

The deadlock information contains four parts describing the deadlock and the result. The parts are
  • When the deadlock occurred.

  • Information for the first of the transactions involved in the deadlock.

  • Information for the second of the transactions involved in the deadlock.

  • Which of the transactions was rolled back. This information is not included in the error log when innodb_print_all_deadlocks is enabled.

The numbering of the two transactions is arbitrary, and its main purpose is to make it possible to refer to one transaction or the other. The two parts with transaction information are the most important ones. They include how long the transaction was active, some statistics about the size of the transaction in terms of locks taken, undo log entries, and the like, the query that was blocked waiting for a lock, and information about the locks involved in the deadlock.

The lock information is not as easy to interpret as when you use the data_locks and data_lock_waits tables and the sys.innodb_lock_waits view. However, it is not too difficult once you have tried to perform the analysis a few times.

Tip

Create some deadlocks on purpose in a test system and study the resulting deadlock information. Then work your way through the information to determine why the deadlock occurred. Since you know the queries, it is easier to interpret the lock data.
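When you reproduce a lock conflict in a test system, you can also inspect the locks while the transactions are still blocking (before the deadlock occurs) using the Performance Schema lock tables mentioned above; a minimal example:
mysql> SELECT ENGINE_TRANSACTION_ID, OBJECT_NAME, INDEX_NAME,
              LOCK_TYPE, LOCK_MODE, LOCK_STATUS, LOCK_DATA
         FROM performance_schema.data_locks;
Comparing this output with the corresponding entries in the deadlock section of the InnoDB monitor makes it easier to learn how the two formats relate.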

For this deadlock investigation, consider the deadlock section from the InnoDB monitor shown in Listing 22-9. The listing is rather long and the lines wide, so the output is also available in this book’s GitHub repository as listing_22_9_deadlock.txt, allowing you to open it in a text editor of your choice.
mysql> SHOW ENGINE INNODB STATUS\G
...
------------------------
LATEST DETECTED DEADLOCK
------------------------
2019-11-06 18:29:07 0x4b78
*** (1) TRANSACTION:
TRANSACTION 6260, ACTIVE 62 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 6 lock struct(s), heap size 1136, 30 row lock(s), undo log entries 14
MySQL thread id 61, OS thread handle 22592, query id 39059 localhost ::1 root updating
UPDATE world.country SET Population = Population * 1.1 WHERE Code = 'AUS'
*** (1) HOLDS THE LOCK(S):
RECORD LOCKS space id 160 page no 14 n bits 1368 index CountryCode of table `world`.`city` trx id 6260 lock_mode X locks gap before rec
Record lock, heap no 652 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 3; hex 415554; asc AUT;;
 1: len 4; hex 800005f3; asc     ;;
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 161 page no 5 n bits 128 index PRIMARY of table `world`.`country` trx id 6260 lock_mode X locks rec but not gap waiting
Record lock, heap no 16 PHYSICAL RECORD: n_fields 17; compact format; info bits 0
 0: len 3; hex 415553; asc AUS;;
 1: len 6; hex 000000001875; asc      u;;
 2: len 7; hex 0200000122066e; asc     " n;;
 3: len 30; hex 4175737472616c6961202020202020202020202020202020202020202020; asc Australia                     ; (total 52 bytes);
 4: len 1; hex 05; asc  ;;
 5: len 26; hex 4175737472616c696120616e64204e6577205a65616c616e6420; asc Australia and New Zealand ;;
 6: len 4; hex 483eec4a; asc H> J;;
 7: len 2; hex 876d; asc  m;;
 8: len 4; hex 812267c0; asc  "g ;;
 9: len 4; hex 9a999f42; asc    B;;
 10: len 4; hex c079ab48; asc  y H;;
 11: len 4; hex e0d9bf48; asc    H;;
 12: len 30; hex 4175737472616c6961202020202020202020202020202020202020202020; asc Australia                     ; (total 45 bytes);
 13: len 30; hex 436f6e737469747574696f6e616c204d6f6e61726368792c204665646572;asc Constitutional Monarchy, Feder; (total 45 bytes);
 14: len 30; hex 456c69736162657468204949202020202020202020202020202020202020; asc Elisabeth II                  ; (total 60 bytes);
 15: len 4; hex 80000087; asc     ;;
 16: len 2; hex 4155; asc AU;;
*** (2) TRANSACTION:
TRANSACTION 6261, ACTIVE 37 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 2
MySQL thread id 62, OS thread handle 2044, query id 39060 localhost ::1 root update
INSERT INTO world.city VALUES (4080, 'Darwin', 'AUS', 'Northern Territory', 146000)
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 161 page no 5 n bits 128 index PRIMARY of table `world`.`country` trx id 6261 lock_mode X locks rec but not gap
Record lock, heap no 16 PHYSICAL RECORD: n_fields 17; compact format; info bits 0
 0: len 3; hex 415553; asc AUS;;
 1: len 6; hex 000000001875; asc      u;;
 2: len 7; hex 0200000122066e; asc     " n;;
 3: len 30; hex 4175737472616c6961202020202020202020202020202020202020202020; asc Australia                     ; (total 52 bytes);
 4: len 1; hex 05; asc  ;;
 5: len 26; hex 4175737472616c696120616e64204e6577205a65616c616e6420; asc Australia and New Zealand ;;
 6: len 4; hex 483eec4a; asc H> J;;
 7: len 2; hex 876d; asc  m;;
 8: len 4; hex 812267c0; asc  "g ;;
 9: len 4; hex 9a999f42; asc    B;;
 10: len 4; hex c079ab48; asc  y H;;
 11: len 4; hex e0d9bf48; asc    H;;
 12: len 30; hex 4175737472616c6961202020202020202020202020202020202020202020; asc Australia                     ; (total 45 bytes);
 13: len 30; hex 436f6e737469747574696f6e616c204d6f6e61726368792c204665646572; asc Constitutional Monarchy, Feder; (total 45 bytes);
 14: len 30; hex 456c69736162657468204949202020202020202020202020202020202020; asc Elisabeth II                  ; (total 60 bytes);
 15: len 4; hex 80000087; asc     ;;
 16: len 2; hex 4155; asc AU;;
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 160 page no 14 n bits 1368 index CountryCode of table `world`.`city` trx id 6261 lock_mode X locks gap before rec insert intention waiting
Record lock, heap no 652 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 3; hex 415554; asc AUT;;
 1: len 4; hex 800005f3; asc     ;;
*** WE ROLL BACK TRANSACTION (2)
Listing 22-9

Example of a detected deadlock

The deadlock occurred on November 6, 2019, at 18:29:07 in the server time zone. You can use this information to see if the information is for the same deadlock as the deadlock reported by a user.

The interesting part is the information for the two transactions. You can see that transaction 1 was updating the population of the country with Code = 'AUS':
UPDATE world.country SET Population = Population * 1.1 WHERE Code = 'AUS'
Transaction 2 was attempting to insert a new city:
INSERT INTO world.city VALUES (4080, 'Darwin', 'AUS', 'Northern Territory', 146000)

This is a case where the deadlock involves multiple tables. While the two queries work on different tables, that does not on its own prove that more queries are involved, as a foreign key can cause a single query to take locks on two tables. In this case though, the Code column is the primary key of the country table, and the only foreign key involved is from the CountryCode column of the city table to the Code column of the country table (you can verify this in the world sample database, for example, with the query below). So it is not likely that the two queries deadlocked on their own.
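One way to confirm the foreign key relationship is to query the Information Schema; for example:
mysql> SELECT CONSTRAINT_NAME, COLUMN_NAME,
              REFERENCED_TABLE_NAME, REFERENCED_COLUMN_NAME
         FROM information_schema.KEY_COLUMN_USAGE
        WHERE TABLE_SCHEMA = 'world'
              AND TABLE_NAME = 'city'
              AND REFERENCED_TABLE_NAME IS NOT NULL;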

Note

The deadlock output is from MySQL 8.0.18, which added additional information to the output. This discussion only uses information that is also available in earlier releases. However, if you are still using an earlier release, upgrading will make it easier to investigate deadlocks.

The next thing to observe is what locks are being waited on. Transaction 1 waits for an exclusive lock on the primary key of the country table:
RECORD LOCKS space id 161 page no 5 n bits 128 index PRIMARY of table `world`.`country` trx id 6260 lock_mode X locks rec but not gap waiting
The value of the primary key can be found in the record information that follows. It can seem a little overwhelming, as InnoDB includes all the information related to the record. Since it is a primary key record, the whole row is included. This is useful for understanding what data is in the row, particularly if the primary key does not carry that information on its own, but it can be confusing the first time you see it. The primary key of the country table is the first column of the table, so it is the first line of the record information that contains the value of the primary key the lock is requested for:
 0: len 3; hex 415553; asc AUS;;

InnoDB includes the value in hexadecimal notation but also tries to decode it as a string (the asc part of the output), so here it is clear that the value is “AUS”, which is not surprising since that is also in the WHERE clause of the query. The decoded value is not always that obvious, so you should always confirm it against the hexadecimal notation in the lock output.
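If you want to double-check a decoded value, you can convert the hexadecimal string back to text in MySQL itself; for example, using the hex value from the record above:
mysql> SELECT UNHEX('415553') AS decoded_value;
which returns AUS.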

Transaction 2 waits for an insert intention lock on the CountryCode index of the city table:
RECORD LOCKS space id 160 page no 14 n bits 1368 index CountryCode of table `world`.`city` trx id 6261 lock_mode X locks gap before rec insert intention waiting
You can see the lock request involves a gap before record. The lock information is simpler in this case as there are only two columns in the CountryCode index – the CountryCode column and the primary key (ID column) since the CountryCode index is a nonunique secondary index. The index is effectively (CountryCode, ID), and the values for the gap before record are as follows:
 0: len 3; hex 415554; asc AUT;;
 1: len 4; hex 800005f3; asc     ;;
This shows that the value of CountryCode is “AUT”, which is not surprising given it is the next value after “AUS” in ascending alphabetical order. The value for the ID column is the hex value 0x5f3, which is 1523 in decimal. If you query for cities with CountryCode = 'AUT' and sort them in the order of the CountryCode index, you can see that ID = 1523 is the first city found:
mysql> SELECT *
         FROM world.city
        WHERE CountryCode = 'AUT'
        ORDER BY CountryCode, ID
        LIMIT 1;
+------+------+-------------+----------+------------+
| ID   | Name | CountryCode | District | Population |
+------+------+-------------+----------+------------+
| 1523 | Wien | AUT         | Wien     |    1608144 |
+------+------+-------------+----------+------------+
1 row in set (0.0006 sec)
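The hexadecimal-to-decimal conversion for the ID column can also be done directly in MySQL; for example:
mysql> SELECT CONV('5f3', 16, 10) AS id_value;
which returns 1523.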

So far so good. Since the transactions are waiting for these locks, it can of course be inferred that the other transaction holds the lock. In version 8.0.18 and later, InnoDB includes the full list of locks held by both transactions; in earlier versions, InnoDB only includes this explicitly for one of the transactions, so you need to determine what other queries the transactions have executed.

From the information available, you can make some educated guesses. For example, the INSERT statement is blocked by a gap lock on the CountryCode index. An example of a query that would take that gap lock is one using the condition CountryCode = 'AUS'. The deadlock information also includes details about the two connections owning the transactions, which may help you:
MySQL thread id 61, OS thread handle 22592, query id 39059 localhost ::1 root updating
MySQL thread id 62, OS thread handle 2044, query id 39060 localhost ::1 root update

You can see that both connections were made using the root@localhost account. If you ensure that you have different users for each application and role, the account may help you narrow down who executed the transactions.

If the connections still exist, you can also use the events_statements_history table in the Performance Schema to find the latest queries executed by the connection. These may not be the queries involved in the deadlock, depending on whether the connection has executed more queries since, but they may nevertheless provide a clue to what the connection is used for. If the connections no longer exist, you may in principle be able to find the queries in the events_statements_history_long table, but you will need to map the “MySQL thread id” (the connection ID) to the Performance Schema thread ID, and there is no trivial way to do that for a connection that is gone. Also, the events_statements_history_long consumer is not enabled by default.
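If you want the long statement history to be collected for future investigations, you can enable the consumer at runtime (the setting is lost on restart unless it is also added to the configuration); for example:
mysql> UPDATE performance_schema.setup_consumers
          SET ENABLED = 'YES'
        WHERE NAME = 'events_statements_history_long';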

In this particular case, both connections still exist, and they have not done anything other than roll back their transactions. Listing 22-10 shows how you can find the queries involved in the transactions. Be aware that the queries may return more rows than shown here, depending on which client you are using and which other queries have been executed in the connection.
mysql> SELECT SQL_TEXT, NESTING_EVENT_ID,
              NESTING_EVENT_TYPE
         FROM performance_schema.events_statements_history
        WHERE THREAD_ID = PS_THREAD_ID(61)
        ORDER BY EVENT_ID\G
*************************** 1. row ***************************
          SQL_TEXT: START TRANSACTION
  NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
*************************** 2. row ***************************
          SQL_TEXT: UPDATE world.city SET Population = Population * 1.10 WHERE CountryCode = 'AUS'
  NESTING_EVENT_ID: 37
NESTING_EVENT_TYPE: TRANSACTION
*************************** 3. row ***************************
          SQL_TEXT: UPDATE world.country SET Population = Population * 1.1 WHERE Code = 'AUS'
  NESTING_EVENT_ID: 37
NESTING_EVENT_TYPE: TRANSACTION
*************************** 4. row ***************************
          SQL_TEXT: ROLLBACK
  NESTING_EVENT_ID: 37
NESTING_EVENT_TYPE: TRANSACTION
4 rows in set (0.0007 sec)
mysql> SELECT SQL_TEXT, MYSQL_ERRNO,
              NESTING_EVENT_ID,
              NESTING_EVENT_TYPE
         FROM performance_schema.events_statements_history
        WHERE THREAD_ID = PS_THREAD_ID(62)
        ORDER BY EVENT_ID\G
*************************** 1. row ***************************
          SQL_TEXT: START TRANSACTION
       MYSQL_ERRNO: 0
  NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
*************************** 2. row ***************************
          SQL_TEXT: UPDATE world.country SET Population = Population + 146000 WHERE Code = 'AUS'
       MYSQL_ERRNO: 0
  NESTING_EVENT_ID: 810
NESTING_EVENT_TYPE: TRANSACTION
*************************** 3. row ***************************
          SQL_TEXT: INSERT INTO world.city VALUES (4080, 'Darwin', 'AUS', 'Northern Territory', 146000)
       MYSQL_ERRNO: 1213
  NESTING_EVENT_ID: 810
NESTING_EVENT_TYPE: TRANSACTION
*************************** 4. row ***************************
          SQL_TEXT: SHOW WARNINGS
       MYSQL_ERRNO: 0
  NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
*************************** 5. row ***************************
          SQL_TEXT: ROLLBACK
       MYSQL_ERRNO: 0
  NESTING_EVENT_ID: NULL
NESTING_EVENT_TYPE: NULL
10 rows in set (0.0009 sec)
Listing 22-10

Finding the queries involved in the deadlock

Notice that for connection id 62 (the second of the transactions), the MySQL error number is included, and the third row has it set to 1213 – a deadlock. MySQL Shell automatically executes a SHOW WARNINGS statement when an error is encountered, which is the statement in row 4. Notice also that the nesting event is NULL for the ROLLBACK of transaction 2 but not for the ROLLBACK of transaction 1. That is because the deadlock triggered the whole transaction to be rolled back (so the ROLLBACK for transaction 2 did not do anything).

The deadlock was triggered by transaction 1 first updating the population of the city table and then of the country table. Transaction 2 first updated the population of the country table and then tried to insert a new city into the city table. This is a typical example of two workflows updating records in different orders and thus being prone to deadlocks.

Summarizing the investigation, it consists of two steps:
  1. Analyze the deadlock information from InnoDB to determine the locks involved in the deadlock and get as much information as possible about the connections.

  2. Use other sources such as the Performance Schema to find more information about the queries in the transactions. Often it is necessary to analyze the application to get the list of queries.

Now that you know what triggered the deadlock, what is required to solve the issue?

The Solution

Deadlocks are the easiest lock situation to resolve as InnoDB automatically chooses one of the transactions as the victim and rolls it back. In the deadlock examined in the previous discussion, transaction 2 was chosen as the victim, which can be seen from the deadlock output:
*** WE ROLL BACK TRANSACTION (2)

This means that for transaction 1, there is nothing to do. After transaction 2 has been rolled back, transaction 1 can continue and complete its work.

For transaction 2, InnoDB has rolled back the whole transaction, so all you need to do is to retry the transaction. Remember to execute all queries again instead of relying on values returned during the first attempt; otherwise, you may be using outdated values.

Tip

Always be prepared to handle deadlocks and lock wait timeouts. For deadlocks or when the transaction has been rolled back after a lock wait timeout, retry the entire transaction. For lock wait timeouts where only the query has been rolled back, retry the query possibly adding a delay.

If deadlocks occur relatively rarely, you do not really need to do anything more. Deadlocks are a fact of life, so do not be alarmed by encountering a few of them. If deadlocks cause a significant impact, you need to look at making changes to prevent some of the deadlocks.

The Prevention

Reducing deadlocks is very similar to reducing record lock contention in general, with the addition that it is very important to acquire locks in the same order throughout the application. It is recommended to read the section “Reduce Locking Issues” in Chapter 18 again. The main points to reduce deadlocks are to reduce the number of locks, to reduce how long they are held, and to take them in the same order:
  • Reduce the work done by each transaction by splitting large transactions into several smaller ones and adding indexes to reduce the number of locks taken.

  • Consider the READ COMMITTED transaction isolation level if it is suitable for your application to reduce the number of locks and how long they are held.

  • Make sure transactions are only held open for as short a time as possible.

  • Access records in the same order, if necessary by executing SELECT ... FOR UPDATE or SELECT ... FOR SHARE queries to take the locks preemptively.
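As an example of the last point, a transaction that needs to change both the country and city tables can take the locks in a fixed order up front; a minimal sketch using the tables from this chapter:
Connection 1> START TRANSACTION;
Connection 1> SELECT * FROM world.country WHERE Code = 'AUS' FOR UPDATE;
Connection 1> SELECT * FROM world.city WHERE CountryCode = 'AUS' FOR UPDATE;
-- Make the changes to the country and city tables here, then commit.
Connection 1> COMMIT;
If both transactions in the deadlock example had locked the country row first in this way, the second transaction would simply have blocked on its first SELECT ... FOR UPDATE instead of deadlocking later.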

That concludes the discussion of how to investigate locks. You may encounter lock cases that do not entirely match the cases discussed in this chapter; however, the techniques to investigate the issues will be similar.

Summary

This chapter has shown you how you can use the resources available in MySQL to investigate lock-related issues. The chapter included examples of investigating four different types of lock issues: flush locks, metadata locks, record locks, and deadlocks. Each issue type used different features of MySQL including the process list, the lock tables in the Performance Schema, and the InnoDB monitor output.

There are many other lock types that can cause lock wait issues. The methods discussed in this chapter go a long way to investigate issues caused by other lock types as well. In the end, the only way to become an expert on investigating locks is experience, but the techniques from this chapter provide a good starting point.

That concludes Part V about query analysis. Part VI is about improving the queries starting out with a discussion of improving performance through the configuration.