
2. The SQL Engine


The focus of this chapter is to break open the mysteries of each SQL storage engine and understand why a particular flavor of Azure SQL technology suits one scenario over another. We will analyze the underlying architecture of each service so that development choices can be well informed and well reasoned. Once we understand how each implementation of the SQL engine in Azure processes and stores data, we can look at the direction Microsoft is taking that technology and forecast whether the same choice would be made in the future. The knowledge gained in this chapter should provide you with the capability to understand your source data and therefore to choose which SQL engine should be used to store and process that data.

Later in this book, we will move out of the structured SQL world and discuss how we can utilize Azure data lake technology to more efficiently work with our data; however, those services are agnostic to the SQL engine that we decide best suits our use case and therefore can be decided upon later. As a primary focus, we must understand our SQL options, and from there, we can tailor our metadata, preparation routines, and development tools to suit that engine.

The Four Vs

The Microsoft Azure platform has a wealth of data storage options at the user’s disposal, each with different features and traits that make them well suited for a given type of data and scenario. Given the flexible and dynamic nature of cloud computing, Microsoft has built a comprehensive platform that ensures all varieties of data can be catered for. The acknowledgment of the need to cater to differing types of data gets neatly distilled into what is known in the data engineering world as “The 3 Vs” – volume, variety, and velocity.

Any combination of volume, variety, and velocity can be addressed using a storage solution in the Azure platform. Often people refer to a fourth V, “value,” which I think is a worthy addition, as the value can often get lost in the volume.

As the volume increases, the curation process to distil value from data becomes more complex, and therefore, specific tools and solutions can be used to help that process, validating the need for a fourth V. When attempting to tackle any one or combination of the four Vs, it is important to understand the full set of options available so that a well-informed decision can be made. Understanding the reasons why a certain technology should be chosen over another is essential to any development process, as this can then inform the code, structure, and integration of that technology.

To use an example, if you needed to store a large amount of enterprise data that was a complete mix of file types and sizes, you would use an Azure Storage account. This would allow you to organize your data into a clear structure and efficiently increase your account size as and when you need. The aspects of that technology help to reduce the complexities of dealing with large-scale data and remove any barriers to entry. Volume, check. Variety, check.

Alternatively, if the requirement was to store JavaScript Object Notation (JSON) documents so that they can be efficiently queried, then the best option would be to utilize Cosmos DB. While there is nothing stopping JSON data being stored in Blob Storage, the ability to index and query JSON data using Cosmos DB makes this an obvious choice. The guaranteed latency and throughput options of Cosmos DB mean that high-velocity data is easily ingested. When the volume begins to increase, then Cosmos DB will scale with it. Velocity, check. Volume, check.

Moving to a data warehouse, we know we will have a large amount of well-structured, strongly typed data that needs to rapidly serve up analytical insight. We need a SQL engine. Crucially, this is where the fourth V, “value,” comes into play. Datasets being used to feed a data warehouse may contain many attributes that are not especially valuable, and good practice dictates that these attributes are trimmed off before arriving in the data warehouse. The golden rule is that data stored in a data warehouse should be well curated and of utmost value. A SQL engine makes surfacing that valuable data easy, and further to that, no other storage option can facilitate joining of datasets to produce previously uncovered value as effortlessly as a SQL engine can. Value, check.

However, a wrinkle in the decision process is that Azure provides two types of SQL engine to choose from. Each can tackle any challenge posed by the four Vs, but it is wise to understand which engine solves which “V” best. Understanding the nuances of each flavor of Azure SQL will help developers make informed decisions about how to load, query, and manage the data warehouse.

The first SQL engine we will examine in this chapter is Azure Synapse Analytics (formerly Azure SQL Data Warehouse). This massively parallel processing (MPP) service provides scalability, elasticity, and concurrency, all underpinned by the well-loved Microsoft SQL Server engine. The clue is certainly in the former title; this is a good option for data warehousing. However, there are other factors that mean this may not be the right choice in all scenarios. While Azure Synapse Analytics has a wealth of optimizations targeted at data warehousing, there are some reasons why the second SQL option, Azure SQL Database, may be more suitable.

Azure SQL Database is an OLTP type system that is optimized for reads and writes; however, it has some interesting features that make it a great candidate for a data warehouse environment. The recent advent of Azure SQL Database Hyperscale means that Azure SQL Database can scale up to 100 TB and provide additional read-only compute nodes to serve up analytical data. A further advantage is that Azure SQL Database has intelligent query processing and can be highly reactive to changes in runtime conditions, allowing for peak performance to be maintained at critical times. Finally, there are multiple deployment options for Azure SQL Database that include managed instances and elastic pools. In essence, a managed instance is a full-blown SQL Server instance deployed to the cloud and provides the closest match to an existing on-premises Microsoft SQL Server implementation in Azure. Elastic pool databases utilize a single pool of compute resource to allow for a lower total cost of ownership, as databases can consume more or fewer resources from the pool rather than having to be scaled independently.

Azure Synapse Analytics

When implementing an on-premises data warehouse, there are many constraints placed upon the developer. Initially there is the hassle of setting up and configuring the server, and even if this is taken care of already, there is always a maintenance and management overhead that cannot be ignored. Once the server is set up, further thought needs to be applied to file management and growth. In addition, the data warehouse itself is limited to the confines of the physical box, and often large databases have to utilize complex storage solutions to mitigate this issue.

However, if you are reading this book, then it is clear you are no longer interested in this archaic and cumbersome approach to data warehousing. By making the move up to the Azure cloud, you can put the days of server management behind you, safe in the knowledge that Microsoft will take care of all that. And what’s more, Azure does not just host a standard SQL instance without the servers; Microsoft has restructured the underlying architecture entirely so that it is tailored for the cloud environment. This is then extended further to the point that Azure Synapse Analytics is not only purpose-built for the cloud but purpose-built for large-scale data warehousing.

Understanding Distributions

A key factor that needs to be understood when working with Azure Synapse Analytics is that of distributions. In a standard SQL Server implementation, you are working in a symmetric multi-processing (SMP) environment, which means there is a single storage point coupled to a set of CPUs, and queries are parallelized across those CPUs using a service bus. The main problem here is that all the CPUs need to access the same storage, and this can become a bottleneck, especially when running large analytical queries.

When you begin using Azure Synapse Analytics, you are now in a massively parallel processing (MPP) environment.

There are a number of key differences between SMP and MPP environments, and they are illustrated in Figure 2-1. The most important is that storage is now widely distributed and coupled to a specific amount of compute. The benefit here is that each node of the engine is essentially a separate SQL database and can access its own storage separately from all the other nodes without causing contention.
Figure 2-1

Diagram of SMP vs. MPP

Figure 2-1 shows how in an SMP environment, there can be contention for storage resources due to the single point of access; however, this problem is alleviated in the MPP environment as each compute node is coupled to its own storage.

In an MPP environment, when a query gets executed, the control node sends a copy of the query to each compute node in the engine. From here the compute node can access its allotted storage, perform the query, and return the results back to the control node to be aggregated with the other result sets.

The First Problem

The concept of separating compute and storage is fundamental to Azure Synapse Analytics, and while this produces an ideal platform to run blazing fast analytical queries, it can also begin to pose problems. As the service is built to run in the cloud, the notion of scaling the resource to meet needs comes into play, and while it is simple enough to add and subtract computation resource, scaling up distributed data storage is trickier.

Let’s imagine we have ten glasses of water – these are our storage distributions. Now let’s add two athletes that need that water as fast as possible – these are our compute nodes. An essential consideration here is that we are only as fast as our slowest athlete; if the water is poorly distributed and contains skew, then one athlete will have to become idle and wait for the other to finish. Now, it would be easy for us to introduce two additional athletes and clearly the water would be consumed twice as fast. However, as the glasses get more and more full, we decide that we actually need 20 glasses to hold all the water to avoid any overflow and so place 10 more glasses on the table. To avoid skew and unbalanced consumption, we would now need to completely redistribute our water across all 20 glasses, and this action becomes very inefficient when we want to do this regularly.

Bringing this back to the warehouse scenario, you can see why scaling storage can become problematic when the data needs to be evenly distributed. To get around the issue, Microsoft has fixed the number of distributions at 60. Whatever size your data is, you will have to distribute it over those 60 storage nodes. This ensures that the compute can be scaled up to further parallelize the processing, but the storage layer does not need to change at all. It is worth mentioning here that distributed tables are presented as a single table, as if they were stored in an SMP type system.

However, now that we know our data will be distributed 60 ways regardless of the compute size, we are faced with the next question. How do we distribute our data? The key thing to remember is that we want to minimize skew. To define skew more clearly, it is the imbalance of data being stored on one storage node vs. another. Thankfully Microsoft has made it easy for us to monitor skew with some handy Dynamic Management Views (DMVs), but I will introduce these fully later. First, let’s understand how we can mitigate skew.

ROUND ROBIN Distribution

The first way to mitigate skew is to use the ROUND ROBIN approach. At the point of ingesting your data, Azure Synapse Analytics will assign each row to the next available storage node in the system. Figure 2-2 shows how each new incoming row is distributed to each storage node sequentially.
Figure 2-2

Diagram of data being distributed row by row onto each distribution

The syntax to write an Azure Synapse Analytics table that uses Round Robin distribution is documented in Listing 2-1.
CREATE TABLE myTable
  (
    id int NOT NULL,
    firstName varchar(20),
    lastName varchar(20)
  )
WITH (
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
     );
Listing 2-1

Data definition language (DDL) statement to create a table with Round Robin distribution

This approach eliminates skew because the distribution is completely removed from the context of your data; you are guaranteed an even spread. A simple sum of a column grouped by another column would perform fine because each node can determine its result and pass it back to the control node to be aggregated. However, at some point the data will need to be joined back together, only now your data is spread far and wide across the warehouse, and, importantly, the server has no way of knowing which storage node holds the matching records for a join.

To analyze the problem further, we can use the scenario of joining a fact table to a dimension table. To perform the join, each node needs to obtain the dimension rows from the other nodes in the warehouse and store that data on its own storage. Once it has those rows, it can perform the join and return the result. This process is called data movement and is a large cost on the query plan. Further, this movement is conducted at query runtime, and therefore you must wait for these additional steps to take place before any results can be obtained. Unfortunately, this movement is performed for each query that requires it, and the result is removed once the query completes.
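
If you want to see whether a given query will incur data movement, the EXPLAIN command can be prefixed to a SELECT statement in Azure Synapse Analytics to return the distributed plan as XML; shuffle move operations in that plan indicate data being redistributed at runtime. The following is a minimal sketch using hypothetical fact and dimension tables:
EXPLAIN
SELECT d.productName, SUM(f.salesAmount) AS totalSales
FROM dbo.FactSales AS f
JOIN dbo.DimProduct AS d ON f.productId = d.productId
GROUP BY d.productName;
If both tables were distributed using ROUND_ROBIN, you would typically see a shuffle move step in the returned plan, which is exactly the runtime cost described earlier.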

HASH Distribution

If we are to avoid the problems of data movement, we need to distribute our data more intelligently. The method for this is to use HASH distribution, which will create a hash of a column’s value and locate matching values on the same node. As shown in Figure 2-3, when Hash distribution is used, each row is hashed using a set key and then grouped with other rows that have the same hashed value.
Figure 2-3

Diagram of data being distributed using Hash keys

The syntax to write an Azure Synapse Analytics table that uses Hash distribution is shown in Listing 2-2.
CREATE TABLE myTable
  (
    id int NOT NULL,
    firstName varchar(20),
    lastName varchar(20)
  )
WITH (
    DISTRIBUTION = HASH (id),
    CLUSTERED COLUMNSTORE INDEX
    );
Listing 2-2

DDL code to create a table with HASH distribution on the “id” column

To use the fact and dimension scenario again, if all the dimension and fact rows are stored on the same storage node, then no data movement is required. All the joining can be performed in isolation. For this to work, however, the following things need to be considered:
  • Which column to distribute on?

  • What is the cardinality of that column?

The Distribution Column

If we are to use the Hash distribution approach, then we must plan which column we will use to distribute our data. For a column to be considered as a Hash distribution column, it should contain the following properties:
  • High cardinality

  • Even distribution

  • Often used in joins

  • Not used in filters

To expand on each of these points, a column with a very low cardinality (fewer than 60 unique values) will not use our entire storage allocation as the server will not have enough values to distribute the data on. To avoid this and maximize performance, an ideal number would be over 600, but really the more the better. Secondly, an even distribution means that we can still eliminate the problem of skew. It is unlikely to be as smooth as a Round Robin distribution, but by analyzing the data upfront, we should get an idea of whether there is a strong bias toward some particular values over others, and if there is, then it would not be a good distribution column. If the chosen hash column is often used in joins, for example, customer or product, then the likelihood of the server being able to avoid data movement increases dramatically. Finally, if the column is commonly used as a search predicate, then you will be limiting the opportunity for parallelism as the filter could remove the need to run the query on certain nodes of the warehouse.

If none of your columns have more than 60 unique values, then you should explore the possibility of creating a new column that can be a composite of several columns in the table, thereby gaining a higher cardinality. To remove the need for data movement, you should use this column in the join arguments. You may also notice that in order to understand the joins and filters that will be commonly used, you will need to establish the types of queries being run on the warehouse by your users. Once you have this knowledge, then you can plan your distribution accordingly.
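
As a rough sketch of that advice, the following CTAS statement builds a hypothetical composite key from two lower-cardinality columns and distributes on it; the table and column names are illustrative only, and any joins would then need to use the composite column on both sides to avoid data movement.
-- Build a higher-cardinality composite key and distribute on it (hypothetical names)
CREATE TABLE dbo.FactSales_Distributed
WITH (
    DISTRIBUTION = HASH (distributionKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT
     CONCAT(CAST(storeId AS varchar(10)), '|', CAST(productId AS varchar(10))) AS distributionKey
    ,*
FROM dbo.FactSales_Staging;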

How to Check if You Have the Right Column

Ultimately, if you are designing your warehouse with Hash distribution in mind, you will choose a column to distribute on. Once you have this and have loaded your data, you will need to determine if the distribution played out like you expected or whether some unforeseen aspect of the data has made it not a good column for distribution. To check your skew and distribution, there are Dynamic Management Views (DMVs). These are system views put together by Microsoft that provide easy insight into the inner workings of your server. The following SQL code can be used to review the space used, and therefore the skew, across your distributions:
DBCC PDW_SHOWSPACEUSED('dbo.myTable');

From the information returned, you may determine that the designated column is not the most appropriate, and in that case, you can easily redistribute the data by redefining your table with a CREATE TABLE AS SELECT (CTAS) statement.
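
Because a table’s distribution cannot be altered in place, the common pattern is to create a new copy of the table with the desired distribution and then swap the names over. A minimal sketch, assuming a table named dbo.myTable, might look like this:
-- Create a copy of the table distributed on a different column
CREATE TABLE dbo.myTable_new
WITH (
    DISTRIBUTION = HASH (lastName),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.myTable;

-- Swap the tables over, keeping the original until the change is verified
RENAME OBJECT dbo.myTable TO myTable_old;
RENAME OBJECT dbo.myTable_new TO myTable;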

REPLICATED Distribution

A third option for data distribution is to utilize the REPLICATED distribution. Rather than distributing data across the server, a full copy of the table is placed on each compute node of the engine, not each storage node. When a query is executed that requires joining to that table, data movement can be spared as the data is already in the right place. In the context of a warehouse, replicated tables can be very effective when used for smaller dimension tables (less than 2 GB on disk – more on how to determine this later). When designing the warehouse, there are likely to be some tables that will be joined using a column that is not used for distribution. In these instances, data movement would be required unless one of the tables was replicated, in which case the data is already accessible to the compute node. The syntax to write an Azure Synapse Analytics table that uses Replicated Distribution is shown in Listing 2-3.
CREATE TABLE myTable
  (
    id int NOT NULL,
    firstName varchar(20),
    lastName varchar(20)
  )
WITH (
    DISTRIBUTION = REPLICATE,
    CLUSTERED INDEX (lastName)
    );
Listing 2-3

DDL statement to create a table using replicate distribution

Note that in the preceding DDL statement, a clustered index is chosen over a clustered column store index. This is because a clustered index is more performant than a heap; however, a table that is being replicated is not likely to have enough rows to gain any real benefit from the clustered column store index used for the HASH and ROUND_ROBIN tables.

As with all design decisions, however, there are considerations that need to be made. In the case of replicated tables, it is important to consider the logistics of replicating data across each compute node. The goal is to reduce the number of rebuilds for that table, and the operations that cause rebuilds are the following:
  • Data is inserted, deleted, or updated.

  • The warehouse is scaled up or down.

  • The definition of the table is changed.

The rebuild itself is twofold. When the data is first updated, the table is copied to a master version of the table. This ensures that the insert, delete, or update operation can be completed most efficiently. Only once the replicated table is selected from will the data be further copied onto the compute nodes in the server. On the first read of the data, the query will run against the master table while the data is copied asynchronously to the compute nodes. After this, any subsequent queries will run against the replicated copy of the data.
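
To see whether the replicated copies are currently valid, the cache state can be inspected, and a trivial SELECT can be issued after a load to trigger the asynchronous copy before users arrive. A short sketch, with a hypothetical table name, is shown here:
-- Check the cache state of replicated tables ('Ready' or 'NotReady')
SELECT t.[name], c.[state]
FROM sys.pdw_replicated_table_cache_state AS c
JOIN sys.tables AS t ON c.object_id = t.object_id;

-- A lightweight read triggers the background copy onto the compute nodes
SELECT TOP 1 * FROM dbo.myReplicatedTable;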

Hopefully this explanation of the different distribution types available through Azure Synapse Analytics offers some insight into the benefits of massively parallel processing and some of the challenges that need to be overcome. Without doubt, one of the benefits of the cloud computing model is the separation of compute and storage and the flexibility this can provide.

Resource Management

Understanding how to manage the resources allocated to an Azure Synapse Analytics instance is vital to ensuring the engine performs well for the users it serves but also does not cost the Earth to run. Ultimately the amount of compute assigned to your server is determined by the number of Compute Data Warehouse Units (cDWUs). This setting is a blended metric that combines CPU, memory, and I/O into a normalized figure that can be used to determine performance and is also known as the service objective. As a starter, the smallest cDWU setting for an Azure Synapse Analytics instance is cDWU 100. This equates to one compute node with 60 GB of memory that is therefore responsible for all 60 storage distributions. This could be scaled up to a cDWU 500, meaning that you still have a single compute node in charge of 60 storage distributions, but it now has 300 GB of memory. As you get past cDWU 500, you begin to increase the number of compute nodes; for example, a cDWU 5000 is 10x more powerful than the 500, meaning you would have 10 compute nodes, each aligned with 6 storage nodes and 300 GB of memory. The highest setting is cDWU 30000, meaning that each of your 60 compute nodes is attached to a single storage node, with 18,000 GB of memory available across the warehouse.
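
The service objective can be changed through the portal, PowerShell, or plain T-SQL run against the master database of the logical server. A minimal sketch, using a hypothetical database name, looks like this:
-- Scale the warehouse to DW1000c (run against the master database)
ALTER DATABASE [MySynapseDW] MODIFY (SERVICE_OBJECTIVE = 'DW1000c');

-- Monitor the progress of the scale operation from the master database
SELECT * FROM sys.dm_operation_status;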

Resource Classes

Given the amount of resource allocated by the service objective, it is up to you to further tweak how this is utilized in the server to ensure maximum performance. The first concept to grasp is that of resource classes. The purpose of resource classes is to predetermine the amount of compute assigned to each query so that you can plan the load on your server more accurately. The two levers controlled by resource classes are concurrency and resource utilization, and the interaction between the two is such that a larger resource class will increase the resource utilization per query but limit the amount of concurrency available to the server. A smaller resource class does the opposite and will limit the amount of resource provided to a query but will increase the concurrency, meaning more queries can be run at the same time. Concurrency slots is the name given to the amount of concurrency available to the server, and this is explained later in the chapter.

The implementation of resource classes is done through user security roles which have been preconfigured on the server for you to use. In practice there are two types of resource class:
  • Static resource classes

  • Dynamic resource classes

Static Resource Classes

A static resource class provides a fixed amount of compute to a query regardless of the service objective, meaning that as an Azure Synapse Analytics instance scales up, the amount of concurrency available to run queries is directly increased. There is a range of sizes to choose from, ranging from staticrc10 up to staticrc80, and each level assigns an increasing number of concurrency slots to a query. Note that the number of concurrency slots assigned to a query does not change as you scale up an Azure Synapse Analytics instance. Within Azure Synapse Analytics, concurrency slots are akin to reserving seats at a busy restaurant. Each query “books” a set number of concurrency slots, and that number directly affects the number of other queries that can be run at the same time. As soon as that query completes, the concurrency slots go back in the pool. The static type of resource class is tailored for scenarios where the data volumes are well understood and consistent. Let’s look at a few scenarios.

If you are using an Azure Synapse Analytics instance that is scaled to DWU1000c, you will have 40 total concurrency slots. This means your maximum number of queries run at any one time is 32. This could be 32 analyst type users running queries under the staticrc10 resource class which, at the DWU1000c level, assigns 1 concurrency slot per staticrc10 query. However, not all of your users will be analysts, and some may be “load” users – specific user accounts configured to run batch loads within the warehouse. These loads may be large, and for the query to execute in good time, you can assign your load user to a larger static resource class so that more memory is assigned for the query. If we use a staticrc60, then our query will be gifted 32 concurrency slots, taking up a lot more of the available resources. While this query is running, all queries that require more than eight concurrency slots will be queued until the query completes.

Now let’s say you have a second load that needs to be processed regularly and efficiently alongside your first load. To allow this to happen, you must scale the warehouse up. If we were to choose a DWU3000c setting, then we now have 100 concurrency slots to play with, and because a staticrc60 query consistently assigns 32 concurrency slots, we know we can safely run two of these queries side by side with some additional head room for user queries on top.
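
Resource classes are assigned by adding a user to the corresponding database role, which is typically done with sp_addrolemember. A short sketch, assuming a user called LoadUser already exists, would be:
-- Grant the load user the staticrc60 resource class
EXEC sp_addrolemember 'staticrc60', 'LoadUser';

-- Revert the user to the default (smallrc) by removing the role membership
EXEC sp_droprolemember 'staticrc60', 'LoadUser';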

Dynamic Resource Classes

Dynamic resource classes work very differently, and rather than assigning the same amount of concurrency regardless of the service objective, they actually increase the amount of concurrency per resource class as the Azure Synapse Analytics instance scales. As a result, there are only four dynamic resource classes:
  • smallrc: 3% of available concurrency (the default for all users)

  • mediumrc: 10% of available concurrency

  • largerc: 22% of available concurrency

  • xlargerc: 70% of available concurrency

To use the same example, an Azure Synapse Analytics instance scaled to DWU1000c will allocate 22% of concurrency slots to a largerc workload. Given that there are 40 available slots, this equates to 8 being assigned to the query. However, if we found that the queries being run under the largerc were becoming slow, we could again increase the service objective to DWU3000c and now our query will be granted over 3x the amount of resource with 26 concurrency slots. While this will ensure our query completes faster, it does not mean that more queries can run at the same time.

Obviously in a full implementation of an Azure Synapse Analytics, you would expect to see a mix of both types of resource class being used – some static for predictable and consistent workloads and perhaps some dynamic for less routine, occasional workloads.

Pausing and Resuming the Warehouse

Because the compute and storage resources of the warehouse are not tightly coupled, you have full control over the scale and even the status of the warehouse. As a user, you can scale the data warehouse up at peak times to ensure maximum processing power and then scale the server back down when processing is completed. You can then turn the server off completely at night and on weekends if required so that the cost of your warehouse can be dramatically reduced. Bear in mind that while compute can be paused, storage cannot, and this will be charged for regardless of compute scale. As mentioned previously in this chapter, scale operations can have side effects and so should be planned for in advance and not done on a whim. Additionally, pausing and resuming the warehouse can take time, and this should be planned for when designing the warehouse.

The operation itself can be performed either through the Azure portal or by using a REST API call from Azure Data Factory. Figure 2-4 points out the Pause/Resume button.
Figure 2-4

Diagram of portal button to pause and resume warehouse

The Data Factory pipeline shown in Figure 2-5 demonstrates a method to pause or resume an Azure Synapse Analytics instance using Data Factory orchestration of REST API calls.
Figure 2-5

A Data Factory pipeline showing how to pause a Synapse Analytics instance

  1. Activity one is a Web activity that obtains a bearer token. This is required in order to authenticate the request to pause or resume the warehouse. The configuration of this activity is shown as follows:

     URL: https://login.microsoftonline.com/<tenant-id>/oauth2/token

     Method: POST

     Headers: "Content-Type": "application/x-www-form-urlencoded"

     Body: grant_type=client_credentials&client_id=<service-principal-client-id>&client_secret=<service-principal-secret-key>&resource=https%3A%2F%2Fmanagement.azure.com
  2. Activity 2 is also a Web activity and posts the request to the Azure management API, using the previously fetched bearer token for authentication. The configuration for this activity is shown as follows:

     URL: https://management.azure.com/subscriptions/<subscription-id>/resourceGroups/<resource-group-name>/providers/Microsoft.Sql/servers/<sql-server-name (without .database.windows.net)>/databases/<database-name>/pause?api-version=2017-10-01-preview

     Method: POST

     Headers:
     "Content-Type": "application/json"
     "Authorization": @concat('Bearer ', activity('Obtain Bearer Token').output.access_token)

     Body: {} (A valid body is needed to validate Data Factory but not for the actual request.)

Note

The “Authorization” header must be entered as dynamic content so as to use the value from the previous activity. You can access this pane using “Alt + P”.

This pipeline could then be triggered using a wall clock type schedule or even by a custom invocation. These types of invocations will be discussed later in the book.

A further option for scaling your Azure Synapse Analytics instance is to use Azure Automation. This service allows you to execute runbooks (essentially PowerShell scripts) on a schedule or via a webhook API. It is important to consider that the warehouse can take a short while to come back online, and while it is paused, no queries can be run against the data, nor can you access the data by other means.

Workload Management

Another feature of Azure Synapse Analytics is that of workload management and importance. Importance is a feature that allows specific users to be tagged as higher priority, and therefore this affects the order in which the server processes queries. As mentioned earlier, if there are no concurrency slots remaining, then queries enter into a queue and this queue is built up in a first in, first out (FIFO) manner, meaning that the first query to queue will ordinarily be the first query to be processed.

However, let’s imagine a scenario where you have two users querying your busy warehouse, both waiting in the queue. Let’s say one user is an analyst and the other is the CEO, who has been tagged with high importance. In this scenario, even if the analyst submitted their query before the CEO, the CEO’s query will be pulled off the queue and executed first.

Additionally, this importance feature can affect how the server handles locking. Locking is used throughout the warehouse to ensure consistent reads and is a fundamental concept in any database engine. If we have a table that is regularly updated, then there will often be a lock in place on this table, thereby blocking other processes until the lock is released. Without importance in place, queries will be handed the lock in a chronological order. However, this can be changed with importance, ensuring that the important queries obtain the lock prior to the normal queries, thereby ensuring the important queries complete quicker.

Finally, importance will even permeate into the optimizer, as this is the part of the Azure Synapse Analytics instance that estimates the size of each job and decides when to execute them. Usually the optimizer prioritizes throughput and will therefore execute jobs as soon as a sufficient amount of resource is available; however, in some scenarios, this can cause big delays for larger processes. If there are a number of small queries running and some further small queries in the queue, then as the small queries complete, the small queued queries will be executed because the required resources match up. If a large query enters the queue, it will have to wait until enough resource is available before it can execute. However, if the query is tagged with importance, then the optimizer is aware that it needs to make room for the larger query and will avoid continually pulling smaller jobs off the queue. This ensures that the required amount of resource becomes available sooner.
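
Importance is applied by classifying requests from a given login into a workload group with a priority attached. The sketch below assumes a login named TheCEO and uses the xlargerc group purely as an example:
-- Tag queries from a hypothetical CEO account with high importance
CREATE WORKLOAD CLASSIFIER [wgcCEO]
WITH (
    WORKLOAD_GROUP = 'xlargerc',
    MEMBERNAME = 'TheCEO',
    IMPORTANCE = HIGH
);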

PolyBase

PolyBase is a technology that provides a seamless interface between your data lake platform and your Azure Synapse Analytics instance. The data in your data lake can be exposed to your Azure Synapse Analytics instance as an external table, meaning the data within the file or files can be operated on as if it were a table in the database.

To do this, there are a few additional components required, which are
  • Database scoped credential

  • External data source

  • File format

The database scoped credential is used to authenticate the Azure Synapse Analytics instance into the data lake. The permissions here should be tightly controlled and well planned out. Multiple versions of these credentials may be needed to ensure the correct granularity. The pieces of required information are the service principal id (and its accompanying authentication end point) and the secret or key that is created for that service principal. The syntax for the credential creation is shown in Listing 2-4.
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
  WITH
    IDENTITY = '{service principal id}@{OAuth 2.0 token endpoint}',
    SECRET = '{service principal secret key}';
Listing 2-4

The syntax used to create a database scoped credential

The next requirement is the external data source. This now makes the connection to your data lake and is used to describe the type of external source, as data lake is one of a number of options, while also supplying the root path of the data. It is important that the path specified here only goes as far as the top level required as further directory navigation can be added on when using the data source from a query. This data source also references the database scoped credential, so it is important to ensure that the service principal you use has the access that is required for the external data source. The syntax for the external data source is shown in Listing 2-5.
CREATE EXTERNAL DATA SOURCE [DataLakeSource]
    WITH (
        TYPE = HADOOP,
        LOCATION = N'abfss://{container name}@{account name}.dfs.core.windows.net',
        CREDENTIAL = [DataLakeCredential])
Listing 2-5

The syntax used to create an external data source for the data lake

Finally, we need to specify a file format so that our Azure Synapse Analytics understands how to read the data it finds in the lake. Here we can set a number of options about the files we want to read. A key point to bear in mind here is that the file format cannot be parameterized, so it is important to read from a standardized layer in your data lake so that you can reduce the number of file formats needed. The syntax for creating the file format is shown in Listing 2-6.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = '|',
        DATE_FORMAT = 'dd/MM/yyyy',
        STRING_DELIMITER = '"'
    )
);
Listing 2-6

The syntax used to create a file format that reads pipe delimited data and formats dates into the UK standard format

We can now very easily ingest data into our warehouse by using these components to access data in the lake in a secure and robust way. Given these three elements, we can utilize external tables to expose the data in the lake as if it were a standard SQL table. Alongside this external table, we can also determine what happens to rows that do not fit the definition of our external table and where they should be landed. The feature allows us to easily handle bad rows, whether they be caused by data type violation or additional columns. Listing 2-7 shows how you can define the external table so that it uses the previous three components to access data in the lake.
CREATE EXTERNAL TABLE [dbo].[ExternalTable]
(
    [Col_one]   TINYINT         NULL,
    [Col_two]   VARCHAR(100)    NULL,
    [Col_three] NUMERIC(2,2)    NULL
)
WITH
(
     DATA_SOURCE = DataLakeSource
    ,LOCATION = '/Read_directory'
    ,FILE_FORMAT = PipeDelimitedText
    ,REJECT_TYPE = VALUE
    ,REJECT_VALUE = 100
    ,REJECTED_ROW_LOCATION= '/Reject_directory'
)
Listing 2-7

The syntax to define an external table that reads from the “Read_directory,” which is a subfolder of the root defined in the external data source. Additionally, we have specified that PolyBase should fail the ingestion if more than 100 rows are invalid, writing the bad rows into the “Reject_directory.”

The preceding REJECT_TYPE argument can be defined as “VALUE” or “PERCENTAGE.” The value reject type means that PolyBase will fail reads from this table if the absolute number of rows specified in the REJECT_VALUE argument is exceeded. Alternatively, if the type is set to percentage, then the read will fail if the specified percentage of rows is invalid. Additionally, you must set the REJECT_SAMPLE_VALUE, which tells Azure Synapse Analytics how many rows to attempt to read as a batch before moving on to the next batch. If the batch size is set at 1000 and the reject value is 10 (note, not 0.1), then Azure Synapse Analytics will read in the first 1000 rows, and if more than 100 of those rows fail, the batch will be failed. If fewer than 100 rows fail, then Azure Synapse Analytics will complete the batch and begin reading the next 1000 rows.
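
For completeness, the same external table could be declared with a percentage-based reject policy rather than an absolute one; the names below mirror the earlier listing, and the thresholds match the example just described:
CREATE EXTERNAL TABLE [dbo].[ExternalTablePct]
(
    [Col_one]   TINYINT         NULL,
    [Col_two]   VARCHAR(100)    NULL,
    [Col_three] NUMERIC(2,2)    NULL
)
WITH
(
     DATA_SOURCE = DataLakeSource
    ,LOCATION = '/Read_directory'
    ,FILE_FORMAT = PipeDelimitedText
    ,REJECT_TYPE = PERCENTAGE
    ,REJECT_VALUE = 10
    ,REJECT_SAMPLE_VALUE = 1000
    ,REJECTED_ROW_LOCATION = '/Reject_directory'
)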

In order to finally persist this data into the warehouse, we need to land the data in an internal table. An internal table can be treated exactly the same as a regular SQL table; however, the data is of course distributed across the 60 storage nodes as defined in the table definition. The way to do this is to utilize the CREATE TABLE AS SELECT statement which allows you to create a table as the output of a select statement. The syntax in Listing 2-8 shows how you can select the contents of the external table defined previously (this is a file in the data lake) and land it in an internal table.
CREATE TABLE dbo.InternalTable
        WITH
        (
            DISTRIBUTION = ROUND_ROBIN
        )
        AS
        SELECT
             Col_one
             ,Col_two
             ,Col_three
        FROM dbo.ExternalTable
Listing 2-8

The CTAS syntax to read data from the external table

An important point to emphasize here is that this method of ingesting data into the warehouse is the only way to load data in bulk in a minimally logged manner. All other methods, such as SSIS, Data Factory, and others, push data through the control node, which thereby causes a bottleneck. As a result, this route for loading data should be used before all others to ensure data is processed as efficiently as possible.

Azure SQL Database

Azure SQL Database (Azure SQL DB) is a major cloud-hosted database technology offering from Microsoft and can be thought of as a Platform as a Service version of a traditional on-premises SQL database. There are of course major alterations to the way the service is deployed so that, as a user, you get the much beloved SQL engine combined with the benefits of it being cloud hosted. The point to make clear upfront is that an Azure SQL DB is a single database ONLY; there is no server instance surrounding the database, and this means no access to the SQL Agent, PolyBase, cross-database queries, and so on; however, there are alternative deployment options that make some of those things available. When creating an Azure SQL Database, you will see that a logical server is created; however, this is a namespace only and holds none of the items mentioned previously. Because this book is focused on cloud data warehousing, it may seem unclear why a developer would not simply choose Azure Synapse Analytics when designing their architecture. This section will outline the reasons why an Azure SQL DB may be a better fit for some scenarios and speak about the features that make it so.

The Cloud-Based OLTP Engine

Many data warehouse developers will be familiar with the difference between online transactional processing (OLTP) systems and online analytical processing (OLAP) systems and when to use which system. Generally speaking, an OLAP engine would be preferable for a SQL data warehouse, particularly one used for decision support, because most queries will be using aggregations and grouping to compute large-scale calculations, and therefore the engine is tuned for enhanced query performance over transactional inserts and updates. An OLTP type database would be more commonly used as a source of data for a warehouse and may be the focal point of a great number of transactions, often at very large scale and volume. That said, beneath a threshold, there is no reason why a standard OLTP type system cannot handle the analytical queries presented by a user; in fact for smaller data warehouses, this may be a more appropriate option.

The Benefits of Azure SQL Database

When designing a data platform solution, there are several points that need to be considered, and, in a number of categories, Azure Synapse Analytics falls short when compared to Azure SQL Database. This is not to say that workarounds cannot be created; however, some of the following benefits may be a really critical requirement. The rest of this section discusses those concepts that may nudge Azure SQL DB in front of Azure Synapse Analytics when designing a data platform for analytics.

Improved Concurrency

One element that is perhaps taken for granted in an on-premises SQL Server implementation is that of a high level of concurrency. Having the ability to process a great number of queries at any given time is often essential, given the nature of a database. However, Azure Synapse Analytics has a limit on the number of concurrent requests, and even at the highest service objectives, this limit is 128 queries at one time. Just to be clear, this means that no more than 128 queries can be run at the same time on Azure Synapse Analytics, and often this number is smaller; for example, a DWU1000c data warehouse has a concurrency limit of 32! If there is a large analytical community looking to use the warehouse alongside a host of report and load users, these concurrency slots will quickly run out and processes will be throttled unless you can afford to scale up. Alternatively, you could review the option of using Azure SQL DB which, due to the nature of being a write-optimized OLTP engine, is designed to process a high level of transactions concurrently. Because the engine is built using a traditional SMP architecture, the processing route for queries is much simpler; they are evaluated by the optimizer and then passed to the execution service instead of, as in the case of Azure Synapse Analytics, being distributed across a network of compute nodes to then be aggregated back together once all nodes have completed.

Trickle-Fed Data Warehouses

Before embarking on any data warehouse project, it is important to understand the data that will be loaded and the queries that will be performed. Once you have a good understanding of this, you can begin to make justified decisions about how you will load and process the data in your warehouse. In some cases, there may be the need to ingest very large files regularly and blend these with equally large, if not larger, tables of existing data; this is where Azure Synapse Analytics comes in handy. However, there can also be instances where smaller, more frequent files are common, and this is where an Azure SQL DB may become a more desirable option.

In Azure Synapse Analytics, the CREATE TABLE AS SELECT (CTAS) statement is the go-to method for loading tables. This approach means that you literally recreate the table every time using the result of a SELECT statement. To produce the effect of UPDATES and INSERTS, you produce the data in multiple SELECT statements and union them together to create the entire table in one query. While this is very efficient for blending large datasets, it becomes very inefficient if you only need to add a few records – a simple INSERT and UPDATE would suffice. While INSERT and UPDATE both exist in Azure Synapse Analytics, they do not automatically create statistics, and therefore any stored procedures using them need to do that manually; this additional complexity makes the pattern cumbersome and difficult to maintain. In the case of Azure SQL DB, we can easily reuse existing logic where it exists or create procedures using common patterns and well-understood processes such as upserts or merges.
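
As a brief sketch of that kind of pattern in Azure SQL DB, a MERGE statement can upsert a trickle of new and changed rows in a single operation; the table and column names here are hypothetical:
-- Upsert a small batch of staged rows into a dimension table
MERGE dbo.DimCustomer AS target
USING stg.Customer AS source
    ON target.customerId = source.customerId
WHEN MATCHED THEN
    UPDATE SET target.firstName = source.firstName,
               target.lastName  = source.lastName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (customerId, firstName, lastName)
    VALUES (source.customerId, source.firstName, source.lastName);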

Further to this, data warehousing is no longer just about processing regular, batched up source files; a warehouse should be able to accept event-driven or streaming data and often these records can arrive in micro batches (one or so records at a time). Were we to use an Azure Synapse Analytics, this would mean rebuilding the entire table every few seconds or so just to incorporate a handful of records. Obviously, this approach is completely inefficient; however, an Azure SQL DB would handle these micro batches easily and allow for a wider variety of data ingestion patterns.

Managing Slowly Changing Dimensions

Often the issue of slowly changing dimensions is one to be tackled prior to ingesting any data because it is important to document which dimensions will have slowly changing elements and what type is needed. To ensure a common understanding, the three main types are
  • Type 0: Data is not affected at all and no updates are made. The dimension is append only.

  • Type 1: Data is simply overwritten so that the latest state of the record is maintained in the dimension table.

  • Type 2: The latest version of the record is inserted into the dimension table and the historical record is marked to indicate it is no longer current. This can either be with a set of date bounds or an “is current” flag.

While there are additional types that can be implemented, the logic can be derived from one of the preceding three options. When using Azure SQL DB, the implementation logic of slowly changing dimensions becomes simple because very often we can write a single statement that takes care of the update in the case of Type 1, or a collection of inserts and updates to cover Type 2. Conversely, in Azure Synapse Analytics, the statement needs to be composed of a number of SELECT statements that then get unioned together to form a final result, which is the entire table. This means that even records that are not changing need to form part of the SELECT statement. Additionally, this becomes more awkward to debug and report on as part of a warehouse processing routine because the logic is not broken out into steps as is the case in Azure SQL DB.
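
To illustrate, a minimal Type 2 pattern in Azure SQL DB can be broken into two readable steps; the table, staging schema, and column names are assumptions for the sake of the example:
-- Step 1: close off the current record for any customer whose attributes have changed
UPDATE dim
SET dim.isCurrent = 0,
    dim.validTo   = SYSUTCDATETIME()
FROM dbo.DimCustomer AS dim
JOIN stg.Customer AS stg
    ON dim.customerId = stg.customerId
WHERE dim.isCurrent = 1
  AND (dim.firstName <> stg.firstName OR dim.lastName <> stg.lastName);

-- Step 2: insert a new current record for changed and brand new customers
INSERT INTO dbo.DimCustomer (customerId, firstName, lastName, validFrom, validTo, isCurrent)
SELECT stg.customerId, stg.firstName, stg.lastName, SYSUTCDATETIME(), NULL, 1
FROM stg.Customer AS stg
LEFT JOIN dbo.DimCustomer AS dim
    ON dim.customerId = stg.customerId AND dim.isCurrent = 1
WHERE dim.customerId IS NULL;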

Intelligent Query Processing and Tuning

Another feature of Azure SQL DB that makes this technology stand head and shoulders above others is that artificial intelligence has been integrated directly into the SQL engine to allow for adaptive query processing and automated performance tuning. The primary reason for this feature is to compensate for poor statistics in the database and ensure that a query is as performant as can be, even once the plan has been sized and handed off to the executor. Given that a warehouse implementation is all about the ability to query and read data, this feature helps to compensate for the fact that there are not multiple compute nodes processing the query, and instead allows the warehouse to proactively pursue the best possible performance. The intelligent query processing and tuning is manifested in a number of features of the SQL engine which are available in all deployments of Azure SQL DB.

Automatic Tuning

The first feature is automatic tuning, which learns from the collective pool of Azure SQL Databases and feeds the insights gathered back into your target databases at times of low activity. The feature itself can be turned off entirely or applied to a level where recommendations are generated but not applied. However, in the full implementation of automatic tuning, the service will generate tuning suggestions and automatically apply them for you with the additional benefit that the engine will verify the benefit of the recommendation and, if there is no discernible improvement, will roll back the change. The recommendations will be made up of CREATE and DROP INDEX suggestions and FORCE LAST GOOD PLAN suggestions. The CREATE INDEX element will identify missing indexes and create them while also verifying the improvement to the workload, whereas DROP INDEX will actively remove surplus or duplicate indexes. The FORCE LAST GOOD PLAN element will identify queries that are using a query plan that is not as performant as a previous plan and will force the better plan instead of the more recent one.
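
Automatic tuning can be configured in the portal or directly with T-SQL; the following is a sketch that enables all three options for the current database and then reviews the recommendations that have been generated:
-- Enable the automatic tuning options for the current database
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON, CREATE_INDEX = ON, DROP_INDEX = ON);

-- Review the tuning recommendations and their current state
SELECT name, type, reason, state
FROM sys.dm_db_tuning_recommendations;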

Adaptive Query Processing

Adaptive query processing is a major change to the way a query is executed in SQL Server. In a usual query process, the plans are produced and sized with the smallest one being chosen and executed; however, there can be times when poor statistics mean that the query was incorrectly sized and is therefore not the most efficient. Despite this, the optimizer continues to run the query based on the plan. Adaptive query processing allows the engine to adjust the subsequent plan based on the row counts that are accumulated throughout execution and becomes effective through a number of individual features which are
  • Batch mode memory grant feedback

  • Row mode memory grant feedback

  • Batch mode adaptive join

  • Interleaved execution

Batch Mode Memory Grant Feedback

The memory grant controls the amount of memory that is given to a query to process and is estimated by the optimizer prior to the execution of the query. The reason for this is to ensure that the query has enough memory to execute efficiently but not so much that concurrency within the database is drastically reduced. The value is then stored alongside the plan in the plan cache. However, if the memory grant has not been correctly estimated, then the performance hit to your query can be devastating. A grant that is too low will cause spills onto disk, which become very expensive compared to reading directly from memory. Alternatively, an oversized estimate will unnecessarily reduce the amount of parallelism and resource available to other activities in the database.

With this feature enabled, the SQL engine will review the estimated memory grant vs. the actual required to read all rows into memory and update the number attached to the plan in the cache. This means that subsequent queries will use the updated estimate rather than the initial one that was incorrect.

The same feature is also available for row mode queries; however, at the time of writing, this is in preview.

Adaptive Joins

The adaptive join feature allows the SQL engine to choose a join mode after the first input has been scanned, meaning that there is a realistic evaluation of rows before deciding on the type of join to be performed. The types in question are Hash mode, which is the default, and Nested loop mode. With this feature enabled, a threshold is put in place to determine whether the number of rows is small enough to be executed better by a Nested loop type join or whether the plan should continue to use Hash mode. If the process does in fact switch from Hash mode to Nested loops but has already read in rows from the input, then these rows are preserved and do not have to be read again; although there is still a slight overhead in the use of adaptive joins, this is still a very useful feature for workloads that often vary in size.

Interleaved Execution

As mentioned previously, a standard query plan will be produced by the optimizer and then run by the executor; however, this linear mode of planning and running queries can cause performance issues when the estimates are not correct. Currently, without interleaved execution, Multi-statement Table-Valued Functions would always use a fixed cardinality estimate of 100, regardless of the actual number. This often means there can be large discrepancies between that estimate and the actual number of rows; however, interleaved execution allows for the optimization process to be paused, a better estimate to be gathered and then resumed with that estimate in hand, thereby informing the optimizer of how to write the subsequent plan in the best way. This means that subsequent join algorithms are more efficient and memory spills are far less likely to occur.
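
These behaviors are enabled automatically under the appropriate database compatibility level, but they can also be toggled individually per database; the following sketch shows the database scoped configuration switches that correspond to the features above, as documented at the time of writing:
-- Toggle the adaptive query processing features for the current database
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_MEMORY_GRANT_FEEDBACK = ON;
ALTER DATABASE SCOPED CONFIGURATION SET ROW_MODE_MEMORY_GRANT_FEEDBACK = ON;  -- in preview at the time of writing
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ADAPTIVE_JOINS = ON;
ALTER DATABASE SCOPED CONFIGURATION SET INTERLEAVED_EXECUTION_TVF = ON;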

Hyperscale

By this point, I am sure you can see that there are many reasons why an Azure SQL Database may provide a richer feature set than Azure Synapse Analytics and certainly an on-premises solution. However, a standard deployment of Azure SQL DB does have an upper limit on the size of your database, which is currently set at 4 TB (not tiny, but not enough by many standards), and that is why Microsoft has completely redesigned the architecture from the ground up to be entirely tailored to the cloud. The new approach is termed Azure SQL Database Hyperscale and is the latest addition to the vCore purchasing model. The technology has been tested with databases up to 100 TB, although this is not a technical limitation, and Microsoft actively encourages customers with larger databases to push that limit further, claiming confidently that the Hyperscale technology will cope with it.

The reason that Hyperscale databases can scale to such large capacities is because the entire architecture of the resource has been adapted to exploit the cheap storage and flexible compute resources that are made available when working in a cloud-based platform. In much the same way that Azure Synapse Analytics separates storage from compute, Azure SQL DB Hyperscale does the same. This means that storage can scale linearly, but the compute power used to process that data can grow and shrink as required. Despite this similarity, the data in Azure SQL DB Hyperscale is not distributed like in Azure Synapse Analytics. The architecture still facilitates an SMP approach to data access which means that storage is essentially held in one place and only written to using a single master compute node.

The Hyperscale Architecture

To start from the top of the Hyperscale stack, we have the compute nodes. The compute nodes house the relational engine, SQL Server, and control all interaction with the rest of the Hyperscale service. There will always be a single primary compute node that handles read and write transactions for the database; however, this can be supported by multiple read-only secondaries that can be used as hot secondaries for failover functionality but can also handle read-only workloads – such as hefty analytical queries. Additionally, these compute nodes utilize SSD caches, named Resilient Buffer Pool Extensions (RBPEX), so that the time to fetch page data can be minimized. A key point of interest relating to the purpose of this book is the concept of read-only secondaries. These can be utilized by setting the ApplicationIntent parameter to ReadOnly in the connection string, indicating to the service that the connection is for read-only queries and can therefore be routed to the read-only secondary nodes rather than the read-write master node.
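
As a quick illustration, a connection made with ApplicationIntent=ReadOnly in the connection string is routed to a read-only secondary, and the following query, run on that connection, confirms which type of replica was reached; this is a sketch rather than a full connection example:
-- Returns READ_ONLY when connected to a secondary replica, READ_WRITE on the primary
SELECT DATABASEPROPERTYEX(DB_NAME(), 'Updateability') AS replica_updateability;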

Supporting the compute nodes is a set of page servers, which are really what allow Hyperscale to reach the scale that it does because there is no finite number of page servers in a given Hyperscale implementation. As the database continues to grow, more page servers are allocated to the service. Each page server handles a 1 TB subset of the data pages and delivers them to the compute nodes on demand, additionally making use of the RBPEX caching to avoid network round trips and support the low latency guarantees made by Microsoft. Importantly, the page servers are allocated 1 TB at a time, so each time a new page server is created, it will handle the next 1 TB of data; however, the service itself is billed in 1 GB increments so you do not pay for excessive storage although it is allocated to your service anyway. The other role of the page servers is to ensure the pages are kept up-to-date by replaying log transactions from the log service.

At the lowest level is remote Azure Storage, which is updated by the page servers and is the final destination for the data. It also holds the snapshots that are created for backups and that enable Accelerated Disaster Recovery.

The final piece of the Hyperscale puzzle is the log service, which again is implemented very differently from an on-premises transaction log. In an on-premises implementation of SQL Server, the server itself maintains a log file that continues to populate until it reaches a certain threshold and then begins to overwrite the previous log records, giving the impression that the log is circular. With Hyperscale, this is not the case. Because cloud storage is cheap, the log can easily be portioned off and stored in long-term cold storage, meaning that log storage is practically infinite. The other key role of the log service is to accept transactions from the primary compute node and apply those changes to the secondary compute nodes and to the pages stored on the page servers. As you can imagine, waiting for the log service to complete that level of activity would add significant latency to a query response, so the log service provides a landing area that persists the transaction record into a cache. Once persisted, the transaction is considered logged, and the replication of the transaction to compute nodes and page servers is done in the background without delaying the query response.

With this architecture in mind, the flow of data through the Hyperscale service involves several hops. For reads, the data will ideally be sitting in the RBPEX cache on the compute node and can therefore be accessed very quickly; if it is not on the compute node, the read has to go back to the page servers to fetch the data from there. For writes, the transaction is passed from the primary compute node to the log service. It is then the role of the log service to apply the transaction to the secondary compute nodes and the page servers; finally, the page servers apply the change to the remote Azure Storage files.

Accelerated Disaster Recovery

A key concern for anyone managing a large database is “how long would it take to restore if it went offline?” In Hyperscale, this operation can be done very efficiently regardless of the size of data. It makes no difference to the restore activity whether the data is 1 TB or 100 TB, which is an incredible level of comfort for whoever must answer that question. Were the database to go offline and require a restore, the only activity needed is to repopulate the page servers with the data stored in remote Azure Storage. Given that this operation is scaled out across the page servers in the instance, no more than a single terabyte must be restored onto any given page server, regardless of the size of the database. To put this into perspective, a restore of a 50 TB Hyperscale database means that 50 page servers are created and each populated with a terabyte of data from remote storage; Microsoft has demoed this 50 TB restore completing in just 8 minutes.

While it is of course possible to manage a large, multi-terabyte database on-premises, a restore of that database would take considerably longer than 8 minutes. These kinds of disaster recovery options simply could not be achieved with the boxed version of SQL Server because of the scale-out operations required to facilitate them.

Azure SQL Deployment Options

When assessing the features that set Azure SQL Database and Azure Synapse Analytics apart, a key consideration is the deployment options. Often this drives a number of conversations: cost, maintainability, management overhead, and the like. As both are cloud-native solutions, scalability and compute size can be tailored with ease. Even if an initial deployment is very small and lightweight, a production scale-up can easily be planned and implemented. Further to this, the size and scale of each solution can be tailored to meet the needs of users and processes throughout the day or week using Azure Automation scripts.
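As a sketch of what such a scheduled scale operation might issue, the service objective of both resources can be changed with plain T-SQL run against the logical server's master database; the database names and target sizes below are purely illustrative.

-- Scale an Azure SQL Database up ahead of the working day (illustrative name and size)
ALTER DATABASE [SalesDb] MODIFY (SERVICE_OBJECTIVE = 'S3');

-- Scale an Azure Synapse Analytics SQL pool ahead of a heavy load window
ALTER DATABASE [SalesDw] MODIFY (SERVICE_OBJECTIVE = 'DW400c');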

Both Azure SQL DB and Azure Synapse Analytics are supported by Visual Studio SQL Server Data Tools (although support for Azure Synapse Analytics is currently in preview), allowing for seamless deployment and schema compare via Visual Studio. This means that from a development perspective, there should be little change between current on-premises practices and cloud practices; both are maintained and source controlled through Visual Studio.

Even though the development experience may be roughly the same between the two Azure SQL options, the target deployment platforms can vary greatly. Azure Synapse Analytics has a single deployment option as a stand-alone resource managed through the Azure portal. The deployment can be automated through the use of ARM templates; however, this only makes the deployment of that single Azure Synapse Analytics instance more efficient. Conversely, Azure SQL DB has a variety of options that can make the move to the cloud easier due to the flexibility of the platform.

Azure SQL Database Managed Instances

A managed instance is the closest cloud alternative to a traditional on-premises deployment of SQL Server. Without a managed instance, you create a logical SQL server that is no more than a namespace to group individual databases; with a managed instance, there is a real SQL Server instance hosting the databases, and therefore access to SQL Agent, Database Mail, linked servers, cross-database queries, change data capture, and other instance-scoped features. While this offers a level of comfort and reduces the amount of application rework, you also benefit from the Platform as a Service gains that Azure has to offer. Features such as automatic patching, automated backups, and virtual network integration are all configured out of the box without any management overhead for the business. For users looking simply to migrate to Azure with minimal disruption, this can be a very useful deployment option; however, significant cost optimizations are available if the stand-alone database deployment option can be used.
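As a simple illustration of why this matters, a cross-database query such as the following (database, schema, and table names are hypothetical) runs as-is on a managed instance, just as it would on an on-premises SQL Server, but is not supported on a stand-alone Azure SQL Database.

-- Cross-database join: works on SQL Server and Azure SQL Database Managed Instance,
-- but not on a single Azure SQL Database (all names are placeholders)
SELECT o.OrderId, c.CustomerName
FROM Sales.dbo.Orders AS o
JOIN Crm.dbo.Customers AS c
    ON c.CustomerId = o.CustomerId;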

Azure SQL Database Elastic Pools

A second deployment option is that of elastic pools. Here, a pool of resources is created and shared across multiple single databases so that there is a single cost to pay and far more headroom to deal with sporadic spikes in database usage. Elastic pools work well when multiple databases need occasional high levels of performance but generally average quite a low eDTU setting, especially when the peaks occur at varying times. In the scenario that you are supporting multiple databases that occasionally require high performance, without elastic pools you would have to trade off between scaling to a tier that can handle peak usage and overpaying the rest of the time, or scaling to a lower tier and sacrificing performance, particularly at peak times. When designing an Azure SQL DB deployment, if elastic pools seem like a good option, it is important to plan the size of the pool, the service tier of the contained databases, and the times at which those databases peak. You will need to know how many databases can spike at any given time while still remaining within your elastic pool size, but also how to ensure that there is enough activity in the pool to make it more cost-efficient than scaling the databases separately.
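The pool itself is created through the portal, PowerShell, or an ARM template, but as a minimal sketch, an existing database can be moved into, or back out of, a pool with T-SQL; the pool and database names below are hypothetical.

-- Move an existing database into an elastic pool (names are placeholders)
ALTER DATABASE [CustomerDb01] MODIFY (SERVICE_OBJECTIVE = ELASTIC_POOL(name = SharedPool));

-- Move it back to a stand-alone service objective if the pool no longer fits
ALTER DATABASE [CustomerDb01] MODIFY (SERVICE_OBJECTIVE = 'S2');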

Azure SQL Database V-Core Tiers

When Azure SQL Database first arrived, the scale, and therefore pricing, of your database was configured using DTUs (Database Transaction Units). A single DTU is an abstracted metric that comprises storage, memory, and CPU to provide an easy single figure that is directly related to the overall performance of the database. However, the arrival of the V-Core option allows you to scale storage and compute separately, meaning the database can be completely tailored to your individual needs. When creating the database, you would choose the number of V-Cores to instantiate and then set a max storage size. The V-Core purchasing model is also available at different tiers, offering different performance characteristics and high availability/disaster recovery options.
  • The lowest tier is General Purpose, which is the standard for most business workloads.

  • Next is Hyperscale, which supports databases above 4 TB while guaranteeing high performance even at very large scale.

  • Last is the Business Critical tier, which offers the highest level of performance and reliability, although it is still limited to a 4 TB maximum.

A point worth mentioning is that Hyperscale databases use the V-Core purchasing model but deliver performance that moves between the General Purpose and Business Critical tiers. When data is served directly from the compute node's local RBPEX, performance is at Business Critical level without the cost overhead. Only when the Hyperscale service gets a cache miss on the compute node's local RBPEX does it have to go back to the page server, and in that case performance resembles the General Purpose tier.

Within the V-Core tier is the ability to choose between a “provisioned” deployment and a “serverless” deployment. A provisioned deployment means that the deployed resource is always active and therefore always chargeable. Alternatively, a serverless deployment allows the service to be paused and resumed as needed, meaning you only pay for what you actively use. This can provide a huge cost saving in development and test environments but may not be suitable for a production deployment. The reason is that once the database is paused, the first query issued to the service will resume it but will not itself complete successfully. Once resumed, all other queries complete as expected until the specified inactivity threshold is reached, at which point the service pauses again automatically. If this deployment option is of interest, it is possible to orchestrate a dummy query as an early part of the ETL process so that the service is running when needed.
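As a minimal sketch, run against the logical server's master database and using illustrative names and sizes, a provisioned and a serverless General Purpose database could be created as follows. The trivial query at the end is the kind of dummy statement an ETL process might issue up front to resume a paused serverless database; note that the auto-pause delay itself is configured through the portal, PowerShell, or an ARM template rather than T-SQL.

-- Provisioned General Purpose database: 4 vCores, always running (names and sizes are illustrative)
CREATE DATABASE [WarehouseDev]
    (EDITION = 'GeneralPurpose', SERVICE_OBJECTIVE = 'GP_Gen5_4', MAXSIZE = 500 GB);

-- Serverless General Purpose database: compute is billed only while the database is active
CREATE DATABASE [WarehouseSandbox]
    (EDITION = 'GeneralPurpose', SERVICE_OBJECTIVE = 'GP_S_Gen5_2', MAXSIZE = 250 GB);

-- A dummy wake-up query an ETL pipeline might run first so that a paused
-- serverless database is resumed before the real workload starts
SELECT 1 AS WakeUp;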

Azure Synapse Analytics vs. Azure SQL Database

Now that the fundamentals of each technology option have been outlined, it is important to understand the attributes of your data that may drive you to use a particular Azure SQL engine over another.

The Right Type of Data

The first thing to confirm in this design process is that your data is going to be structured in a tabular format. These two SQL options only support tabular data and therefore should not be used to store semi-structured or unstructured data, such as documents, JSON data, or multimedia files, directly, unless it is stored as text in a tabular column. For JSON data, you could consider Azure Cosmos DB, and unstructured data and multimedia can be stored in the data lake. Of course, there may be scenarios where you need to process JSON from a source system into the data warehouse, in which case you can load the JSON into an NVARCHAR(MAX) column and then read it using the OPENJSON table-valued function. If the data you need to store cannot be loaded and queried using a SQL database engine, then neither of these options is for you.
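As a minimal sketch of that pattern, assuming a hypothetical staging table and JSON shape, the raw documents can be landed as text and then shredded into columns with OPENJSON.

-- Hypothetical staging table holding raw JSON documents as text (assumes the stg schema exists)
CREATE TABLE stg.SourceDocuments (
    DocumentId INT IDENTITY(1, 1) PRIMARY KEY,
    RawJson    NVARCHAR(MAX) NOT NULL
);

-- Shred each document into a tabular shape ready for the warehouse load
SELECT d.DocumentId, j.CustomerId, j.OrderDate, j.OrderValue
FROM stg.SourceDocuments AS d
CROSS APPLY OPENJSON(d.RawJson)
WITH (
    CustomerId INT            '$.customerId',
    OrderDate  DATE           '$.orderDate',
    OrderValue DECIMAL(18, 2) '$.orderValue'
) AS j;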

The Size of the Data

When choosing your SQL engine, the size of the data plays a key role. If your database is less than 1 TB and not likely to grow beyond that point, then Azure Synapse Analytics is not a good option and you should look to use Azure SQL DB. Conversely, if the database is already 1 TB or bigger and is expected to grow, then Azure Synapse Analytics is firmly back on the table. If your data volumes are between 1 and 4 TB, the cheaper option sits with Azure SQL Database: here we see a 2 TB database costing roughly £1.3k per month vs. an Azure Synapse Analytics instance at the same size costing £3.9k. When we scale this up to 100 TB, there are a number of changes to be aware of. Firstly, only a Hyperscale Azure SQL Database can support a database that large, so if you want to use an Azure SQL Database, your options are limited to Hyperscale. Alternatively, you could swap to Azure Synapse Analytics, as at 100 TB you really benefit from the massively parallel nature of the architecture. Full disclosure: the Azure Synapse Analytics instance is still more expensive, but it will likely perform large-scale analytical queries better than a Hyperscale database due to its distributed nature, especially when the data is correctly spread across distributions so that common joins are heavily optimized, something a Hyperscale database cannot do because it does not store data in this way. Ultimately, an Azure SQL Database will always be cheaper than an Azure Synapse Analytics instance; however, it is also not optimized for analytical loads and does not contain features such as PolyBase, so at small scales of data a SQL DB will almost always be the better option. As volumes increase, performance becomes more critical, and this is where Azure Synapse Analytics earns its place.

The Frequency of the Data

Given that this book is focused on data warehousing, I am discounting the need for traditional OLTP workloads; however, there are very often scenarios where a data warehouse needs to be trickle fed. In these scenarios, the patterns that are often used in Azure Synapse Analytics become inefficient and cumbersome; when the opposite is true, and data arrives at massive scale in large batches at regular intervals, the PolyBase and CTAS pattern makes Azure Synapse Analytics a much more efficient processing option. When planning the ingestion process for your warehouse, it is essential to understand the needs of your users and the availability of your data. If you need rapidly refreshing dashboards loaded from an event-based source system, micro transactions are needed and an Azure SQL Database is likely the better option. Should you only need to refresh a dashboard once or twice a day with data that arrives with row counts in the billions, Azure Synapse Analytics will be able to ingest and process that data much faster. Should you need to combine approaches, you could experiment with a SQL DB that processes your micro transactions into batches and loads them into Azure Synapse Analytics, or explore the lambda architecture that is detailed later in this book.
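For context, a stripped-down example of the PolyBase and CTAS pattern might look like the following; the external table is assumed to already exist over files in the data lake, and all table, column, and distribution choices are illustrative only.

-- Load a new fact table in one parallel operation from an external (PolyBase) table
CREATE TABLE dbo.FactSales
WITH (
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT CustomerKey, OrderDateKey, SalesAmount
FROM stg.ext_Sales;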

The Availability of the Data

Any data warehouse project comes with a set of nonfunctional requirements: things that are required to satisfy the brief but don't necessarily deliver a functional advantage to the solution. Often these requirements include the recovery point objective (RPO), the amount of data lost after an incident, and the recovery time objective (RTO), the time it takes to get a system back up and operational. In Azure Synapse Analytics, automatic restore points are taken regularly throughout the day and kept for a default of 7 days; however, you can also manually create restore points after significant events in the warehouse to ensure the maximum granularity of restore options and therefore a minimal RPO. Azure SQL Database also has very good options for RPO and RTO; within Hyperscale in particular, giant databases (e.g., 50 TB+) can be restored in under 10 minutes with a 0-minute RPO because the page servers simply need to be repopulated from the snapshots in Azure Storage. In addition to the RPO and RTO requirements, concurrency can heavily affect the availability of your data: in Azure Synapse Analytics, the number of concurrent queries is limited by the cDWU setting you have configured, whereas Azure SQL DB supports much higher concurrency given that it is an SMP system. If a very large number of concurrent users need to query the warehouse, then an Azure Synapse Analytics instance may struggle to cope with this requirement without the use of Azure Analysis Services or another database on top.

The Integration of Data

Both flavors of Azure SQL integrate seamlessly with Azure Data Factory, the cloud integration tool of choice when working in Azure. However, Azure Synapse Analytics can also make use of PolyBase, providing a seamless layer between the data lake and the data warehouse.
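For context, a stripped-down PolyBase setup on an Azure Synapse Analytics SQL pool might look like the following sketch; the storage location, file format, and table definition are placeholders for whatever your data lake actually contains, and a database-scoped credential would normally be required for a secured lake.

-- External data source pointing at the data lake (location is a placeholder)
CREATE EXTERNAL DATA SOURCE DataLake
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://raw@mydatalake.dfs.core.windows.net'
    -- , CREDENTIAL = DataLakeCredential  -- needed for a secured lake
);

-- File format describing the source files
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- External table over the files, ready to be read via PolyBase (assumes the stg schema exists)
CREATE EXTERNAL TABLE stg.ext_Sales (
    CustomerKey  INT,
    OrderDateKey INT,
    SalesAmount  DECIMAL(18, 2)
)
WITH (
    DATA_SOURCE = DataLake,
    LOCATION = '/sales/',
    FILE_FORMAT = ParquetFormat
);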

In summary, Azure SQL Database is the cheaper option and potentially the more flexible one across a number of scenarios; however, there are specific features of Azure Synapse Analytics that make it a candidate for any data warehousing scenario where the data volumes are larger than 1 TB. When designing the warehouse, a worthwhile exercise is to write down the pooled knowledge of the incoming data, the expected queries, the ingestion patterns, and so on, and determine which engine serves each of those attributes better. From there, you can begin to discuss the features that mean the most to you and your organization and ignore those that are not essential. A final and fairly fundamental point: while the core concepts of each technology remain consistent, the features change and improve over time, so it is important to keep up with each technology in case a key new feature arrives that changes the way you think about it.