4. Test Data
Testing is a very important part of the performance tuning work as it is important that you have verified that your changes work before you apply them to your production system. The best data for verification of your changes closely relates to your production data; however, for exploring how MySQL works, it can be better to use some generic test data. This chapter introduces four standard data sets with installation instructions as well as a few other data sets that are available.
The world, world_x, and sakila databases are used as test data in the remainder of this book.
First, however, you need to know how you can download the databases.
Downloading the Example Databases
The employee data (the employees database) is downloaded from Giuseppe Maxia’s (also known as The Data Charmer) GitHub repository, whereas the other databases are downloaded from Oracle’s MySQL’s site. The download with the employee data also includes a copy of the sakila database. For the employee data, the world database, and the sakila database, there is also documentation available.
If you are not using the latest version of the data, you may see warnings about deprecated features, when you install the test databases. You can ignore those warnings, however, it is recommended to get the latest version of the data.
The menagerie database is a tiny two-table database with a total of fewer than 20 rows that was created for the tutorials section in the MySQL manual. It will not be discussed further.
The world Database
The world sample database is one of the most commonly used databases for simple tests. It consists of three tables with a few hundred to a few thousand rows. This makes it a small data set which means it can easily be used even on small test instances.
The country table includes information about 239 countries and serves as the parent table in foreign keys from the city and countrylanguage tables. There are a total of 4079 cities in the database and 984 combinations of country and language.
The downloaded file consists of a single file named world.sql.gz or world.sql.zip depending on whether you chose the Gzip or Zip link. In either case, the downloaded archive contains a single file world.sql. The installation of the data is straightforward as all that is required is to execute the script.
In either case, add the path to the world.sql file if it is not located in the directory where you started MySQL Shell or mysql.
A related database is world_x which contains the same data as world, but it is organized differently.
The world_x Database
While there are no foreign keys from the city and countryinfo tables, they can be joined to the country table using the CountryCode column and doc->>'$.Code' value, respectively. The _id column of the countryinfo table is an example of a stored generated column where the value is extracted from the JSON document in the doc column.
The installation of the world_x database is very similar to the world database. You download either the world_x-db.tar.gz or world_x-db.zip file and extract it. The extracted files include a file named world_x.sql as well as a README file. The world_x.sql file includes all the statements required to create the schema.
Add the path to the world_x.sql file if it is not located in the current directory.
The world and world_x databases are very simple which makes them easy to use; however, sometimes you will need something a little more complex which the sakila database can deliver.
The sakila Database
The sakila database is a realistic database that contains a schema for a film rental business with information about the films, inventory, stores, staff, and customers. It adds a full text index, a spatial index, views, and stored programs to provide a more complete example of using MySQL features. The database size is still very moderate making it suitable for small instances.
The tables with customer-related data (plus addresses for staff and stores) are in the area in the top-left corner. The area in the lower left includes data related to the business, and the area in the top right contains information about the films and inventory. The lower right is used for the views and stored routines.
You can view the entire diagram (though formatted differently) by opening the sakila.mwb file included with the installation in MySQL Workbench. This is also a good example of how you can use enhanced entity-relationship (EER) diagrams in MySQL Workbench to document your schema.
There are four tables with data related to the customers. The customer table is the main table, and the address information is stored in the address, city, and country tables.
The business tables contain information about the stores, staff, rentals, and payments. The store and staff tables have foreign keys in both directions with staff belonging to a store and a store having a manager that is part of the staff. Rentals and payments are handled by a staff member and thus indirectly linked to a store, and payments are for a rental.
The main table in the inventory group is the film table which contains the metadata about the films the stores offer. Additionally, there is the film_text table with the title and description with a full text index.
There is a many-to-many relationship between the film and the category and actor tables. Finally, there is a foreign key from the inventory table to the store table in the business group.
The views can be used like reports and can be divided into two categories. The film_list, nicer_but_slower_film_list, and actor_info views are related to the films stored in the database. The second category contains information related to the stores in the sales_by_store, sales_by_film_category, staff_list, and customer_list views.
The film_in_stock() and film_not_in_stock() procedures return a result set consisting of the inventory ids for a given film and store based on whether the film is in stock or not. The total number of inventory entries found is returned as an out parameter. The rewards_report() procedure generates a report based on minimum spends for the last month.
The get_customer_balance() function returns the balance for a given customer on a given data. The two remaining functions check the status of an inventory id with inventory_held_by_customer() returning customer id of the customer currently renting that item (and NULL if no customer is renting it), and if you want to check whether a given inventory id is in stock, you can use the inventory_in_stock() function.
The downloaded file expands into a directory with three files, of which two create the schema and data and the last file contains the ETL diagram in the format used by MySQL Workbench.
This section and the examples later in the book use the copy of the sakila database that is downloaded from MySQL’s homepage.
Add the path to the files if they are not located in the current directory.
Common for the three data sets thus far is that they contain little data. While this is in many cases a nice feature as it makes it easier to work with, in some cases you need a bit more data to explore the difference in query plans. The employees database is an option with more data.
The employees Database
The employees database (called employee data on the MySQL documentation download page; the name of the GitHub repository is test_db) was originally created by Fusheng Wang and Carlo Zaniolo and is the largest of the test data sets linked from MySQL’s homepage. The total size of the data files is around 180 MiB for the non-partitioned version and 440 MiB for the partitioned version.
The optional partitioning of the salaries and titles tables
The size of each table in the employees database
By today’s standards, it is still a relatively small amount of data, but it is big enough that you can start to see some performance differences for different query plans.
The built-in help for the stored routines in the employees database
You can download a ZIP file with the files required for the installation, or you can clone the GitHub repository at https://github.com/datacharmer/test_db. At the time of writing, there is only a single branch named master. If you have downloaded the ZIP file, it will unzip into a directory named test_db-master.
There are several files. The two relevant for installing the employees database in MySQL 8 are employees.sql and employees_partitioned.sql. The difference is whether the salaries and titles tables are partitioned. (There is also employees_partitioned_5.1.sql which is meant for MySQL 5.1 where the partitioning scheme used in employees_partitioned.sql is not supported.)
In addition to the data sets discussed here, there are some other choices to obtain example data to work with.
It can happen that you need to perform testing that requires data with some requirements that are not fulfilled by the standard example databases discussed thus far. Fortunately, there are other options available.
Do not discount the possibility to create your own custom example database, for example, by using data masking on your production data.
If you are looking for a very large real-world example, then you can download the Wikipedia database as described at https://en.wikipedia.org/wiki/Wikipedia:Database_download. The English Wikipedia dump from September 20, 2019, is 16.3 GiB in bzip2 compressed XML format.
If you are looking for JSON data, then an option is the earthquake information from the United States Geological Survey (USGS) which is provided in GeoJSON format with options to download information for earthquakes for the last hour, day, week, or month optionally filtered by the strength of the earthquake. The format description and links to the feeds can be found at https://earthquake.usgs.gov/earthquakes/feed/v1.0/geojson.php. Since the data includes geographic information in the GeoJSON format, it can be useful for testing that requires spatial indexes.
The benchmark tools described in the previous chapter also include test data or support creating test data. This data may also be useful for your own testing.
There are other example databases available if you search the Internet. In the end, the important things to consider are whether the data has a good size for your testing and whether it uses the features you require.
This chapter has introduced four standard example databases and some other examples of test data. The four standard databases that were discussed were world, world_x, sakila, and employees. These can all be found through the MySQL manual at https://dev.mysql.com/doc/index-other.html. Except for employees, these databases are used for the examples in this book unless stated otherwise.
The world and world_x databases are the simplest with the difference that world_x uses JSON to store some of the information, whereas the world database is purely relational. These databases do not contain much data, but because of their small size and simplicity, they are useful for simple tests and examples. Particularly the world database is used extensively in this book.
The sakila database has a much more complex schema including different index types, views, and stored routines. This makes it more realistic and allows for more complex tests. Yet, the size of the data is still small enough to use it on even small MySQL instances. It is also used extensively in this book.
The employees database has a schema that is in between the world and sakila databases in complexity, but has significantly more data making it better for testing the difference between various query plans. It is also useful if you need to generate some load on the instance, for example, using table scans. The employees database is not directly used in this book, but if you want to reproduce some of the examples where some load is required, then this is the best of the four standard test databases to use.
You should not limit yourself to consider the standard test databases. You may be able to create your own, create one using a benchmark tool, or find data made available on the Internet. Wikipedia’s database and the earthquake data from the United States Geological Survey (USGS) are examples of data that can be downloaded.
This completes the introduction to MySQL query performance tuning. Part II goes through the common sources of information in connection with diagnosing performance problems starting with the Performance Schema.