
5. The Role of the Data Lake

Matt How, Alton, UK

As the data needs of a business change, the methods to store, organize, and audit data need to change as well. Big data is the buzzword of the day, and big data needs a scalable storage platform. Multimedia files such as images, videos, and audio files need to be co-located and reported against, and so a platform that can accommodate such diverse data types is required. A modern data platform may also need to ingest data at incredibly high speeds, and having a platform that can cope with streaming and scale accordingly is essential. There is such a variety of requirements for data storage in modern businesses that managing and maintaining storage systems specifically for each would be impossible. What is needed is a simple option that implements a “less is more” approach to offer the scalability and diversity required. What is needed is a data lake.

The term data lake was first used in 2010 by founder and former chief technical officer of Pentaho, James Dixon, who was speaking about the inherent restrictions of a regular data mart. These of course are size, time to deliver value, and research/experimentation capabilities.

If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.

—James Dixon 2010 https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/

The way the data lake overcomes these restrictions is by being a much more generic store for raw data, meaning that users can easily deliver data of any type into the lake while rapidly deriving insight from it because the data does not need to be coerced and bound to the schema of a data mart. No longer will analysts have to wait for months to even begin exploring a dataset, only to discover that the essential data they need has been aggregated away into the ether. Now they can dive straight into the data lake, doing as much cleaning as necessary, and once the value of the data has been proven, a proper process can be built to funnel it into a warehouse. In practice, the high-value datasets may well go via the data lake and more or less immediately into a data mart; however, with the limitless storage capabilities of a data lake, there is never a reason to throw data away. In fact, these datasets can often hold unprecedented insight that can only be discovered when enough of the data is held in the same place and in its raw, low-level format.

Additionally, users can benefit from unstructured datasets such as images and videos that could never be represented in a traditional data mart. This capability is of particular interest to data science teams looking to extrapolate tags or metadata about images before blending that data with some other dataset such as customer or product. What’s more, in a data lake environment, the data can be nicely co-located so that a semi-structured JSON file can easily be joined to a Parquet file which can then be updated to hold the output of some AI algorithm – the possibilities are truly endless when data storage is not a barrier.

The key point here, though, is that the data in a data lake is in a raw, untranslated state and cannot easily be read or evaluated using traditional SQL-based methods. Depending on the user and their intention, this can be beneficial; data science teams often prefer to do all the cleaning and loading from raw to model stage themselves. However, to fuel a data warehouse, a degree of structure is required. In order to use a data lake in conjunction with a data warehouse, we must use the lake as a raw storage area that acts as a landing and staging platform. Crucially, we need a structure for the lake that allows us to properly segment business areas for security or logical reasons. Without this kind of structure, we would find ourselves in charge of a data swamp: a place where data comes to die and insights cannot be discovered.

The Modern Enterprise and Its Data Lake

Any organization likely already has a data lake; they may just not call it that. They may call it SharePoint, or “The Intranet,” or even just the shared network drive. Branded data lake technologies such as Azure Data Lake Gen 1 and Gen 2 are flagship products that specialize in being data lakes; however, these other systems can also compete in some areas. Just because a data lake is not called a data lake doesn’t mean it doesn’t do the same job. Often, however, a cloud-based data lake holds a special place between these technologies: it is not quite user friendly enough to be used daily by a nontechnical user, yet it is much easier to access and load data into than an SFTP site. What’s more, the Azure Data Lake technologies make use of Azure Active Directory (AD) integrated security and can be closely tied in with existing security configurations.

So, if your organization does not have an Azure data lake, but you do already use some sort of large-scale file repository, do you still need a data lake? The answer is yes. Whereas systems like an intranet or SharePoint are built to maximize collaboration, the data lake should be a developer-led initiative so that the structure is conducive to warehouse loading and data science research if required. This may mean breaking apart data silos; where data was previously kept together in isolation, files should be relocated so that they can be loaded more efficiently. Because the lake is easily accessed and defined by users with the right permissions, development of the lake in this way can be rapid and agile, not to mention that the lake is scalable to almost limitless capacities. It requires very little maintenance or upkeep as there are no servers to worry about; Microsoft takes care of all of that for you, and the only concern for the business is the structure and quality of the data in the lake. This founding feature of cloud data lakes means that data should never be thrown away unless there is a very good reason to do so (GDPR, etc.). Any data stream identified in the enterprise should be directed to output data into the lake in some capacity. Even if there is no actual processing or defined purpose for the data, the fact that it is captured means that it can be profiled, analyzed, and built upon when the time is right.

Azure Data Lake Technology

The Azure platform has three offerings that can be considered candidates for a data lake:
  • Azure Data Lake Gen 1

  • Azure Data Lake Gen 2

  • Azure Storage

From a functional perspective, these products are fairly similar; however, there are aspects that differ, and these distinguishing features are important to understand. Additionally, Microsoft are fully bought into the concept of the data lake and therefore continually develop their offerings to ensure they remain competitive products that lead the market.

Azure Data Lake Gen 1

The initial data lake offering, Azure Data Lake Gen 1 (ADL Gen1), is a well-matured product at this point and has been the go-to data lake technology in Azure for a good number of years. While it has since been superseded by Azure Data Lake Gen 2 (ADL Gen2), it is worth a few sentences to explain why this product was beneficial and how its architecture was structured. ADL Gen1 is built on an Apache Hadoop file system (HDFS) and exposes the WebHDFS REST API layer to calling applications. This means it is easily integrated with other technologies that understand those APIs, such as Spark and Hive. A feature of an HDFS-type file system is that it can store files of any type and size; there are no restrictions whatsoever. Files can range from bytes to petabytes in size, and ADL Gen1 will have no problem storing, reading, and writing them. In fact, when files are deposited into Azure Data Lake Gen 1, they are split across a number of storage servers to offer maximum resiliency as well as parallel reading capabilities. This splitting of data means that analytical compute resources that run on top of the lake, such as Spark, Hive, and Azure Synapse Analytics, are able to run as efficiently as possible. Lastly, ADL Gen1 implements Active Directory integrated security, so that access to folders and files can be managed through groups to a high degree of granularity.

Azure Blob Storage

Before Azure Data Lake Gen 2 became generally available in February 2019, the only alternative to Azure Data Lake Gen 1 was Azure Storage, or Blob Storage as it is commonly known. Azure Storage also supports an HDFS-compatible file system client and therefore offers optimizations for parallel reads and analytical queries; however, it exposes its own set of Azure Storage APIs through the Windows Azure Storage Blob (WASB) driver rather than the more generic WebHDFS APIs. One more major difference between the two technologies is the way that files and folders are implemented. In Azure Data Lake Gen 1, folders are true folders in that they are stand-alone objects in the system; this is known as a hierarchical file system. In Blob Storage, files are stored as objects in a container, which is a flat namespace. The concept of folders does not really exist; however, virtual directories can be implemented using part of the object name. Despite this, all the tools that work with Blob Storage use the name “folder” to describe levels in the system. If you create an empty “folder” and navigate away from it, you will notice that the folder no longer appears to exist; because no object has that folder as part of its name, the “folder” itself does not exist. This can be confusing at first, although in practice it is rarely an issue. A further benefit that Blob Storage has over Azure Data Lake Gen 1 is its redundancy options. In ADL Gen1, data is locally redundant, meaning copies are stored within the same Azure region. Blob Storage, however, can offer locally redundant, zone-redundant, and geo-redundant storage, making the recovery options a bit more flexible.

Azure Data Lake Gen 2

Finally, we have Azure Data Lake Gen 2, which is essentially the marriage of ADL Gen1 and Azure Storage. The technology is mostly based on Azure Blob Storage, so costs are low and features such as geo-redundancy are available by default. There are, of course, a couple of differentiating factors that make this a true data lake technology optimized for big data analytics rather than a generic object storage engine. The first is hierarchical namespaces. This feature allows a directory structure to be realized physically rather than being mimicked, as is the case in Azure Blob Storage. As mentioned earlier, folders do not technically exist in Blob Storage, and so any change to the directory structure incurs the need to iterate over each object and perform an update. With the implementation of hierarchical namespaces, a directory update becomes a simple metadata change in the storage engine and data access is simplified greatly, thereby improving query performance. Another addition is the ABFS driver, which is available in all Apache Hadoop environments, such as Azure Databricks and Azure Synapse Analytics, and is specifically optimized for big data analytics. Previously, the WASB driver was used to complete the complex mappings between the HDFS semantics and the object store interface used in Azure Blob Storage. With the arrival of hierarchical namespaces, the system semantics are now aligned, and so the mapping exercise is no longer required, making reads much more efficient. The security implementation for Azure Data Lake Gen 2 is very similar to that of Gen 1 now that folders are no longer virtualized. Azure Active Directory is fully integrated, and permissions can be set for each file and folder. The permissions themselves can be assigned through the Azure Portal and also using Azure Storage Explorer.

Each of these data lake technologies integrates excellently with Azure Synapse Analytics via the PolyBase engine, which can make full use of the distributed storage structure to read data into the instance in parallel. Azure SQL DB, on the other hand, is not a distributed system and has no support for HDFS-type file storage, meaning that data moving from any of the preceding data lake options into Azure SQL DB must be loaded via an integration engine such as Azure Data Factory.
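
As a brief illustration of that PolyBase integration, the sketch below shows how a Synapse dedicated SQL pool could be pointed at a Data Lake Gen 2 account. The storage account and container names are placeholders, and it assumes the pool's managed identity has already been granted access to the storage account; the external data source name matches the AzureStorage source referenced later in Listing 5-2.
-- A minimal sketch, assuming the dedicated SQL pool's managed identity has
-- been granted access (for example, Storage Blob Data Contributor) on the
-- storage account, and a database master key already exists.
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Service Identity';

-- "data" and "mydatalake" are placeholder container and account names
CREATE EXTERNAL DATA SOURCE AzureStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'abfss://data@mydatalake.dfs.core.windows.net',
    CREDENTIAL = DataLakeCredential
);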

Planning the Enterprise Data Lake

When you first begin using an operating system (OS), such as Windows, the first thing you probably take for granted is that all the files are neatly packaged into folders for you, and the OS provider will usually have already created some empty ones that are preconfigured for things that you use regularly, such as downloads, music, and images. Despite this, there will always be an area where you are encouraged to apply your own organizational structure; for users of Windows, this is the Documents folder. Here, the file system has been well thought out so that when new data arrives in your system, there is a clear place for it to go, and files can easily be located when needed. Imagine firing up your PC to find that every file on your machine was stored in a single folder. You would be completely lost! Additionally, some folders are purposefully locked down to avoid you accidentally deleting something that is critical to your system: again, good planning of the file structure.

Moving out of the realm of a user’s PC and into that of cloud data lakes, the same principles still very much apply. A data lake without a folder structure is a data swamp and is of very little use to anyone, in much the same way as a machine with all its files stored in one place would be. The first step is to define the purpose of your lake and determine which parts of the data processing pipeline will be hosted in the data lake. Commonly, the data lake is used to store data in its rawest form. However, there are tools that can perform complex cleaning and relational logic to data, all within the data lake. This next section will explore when to use a data lake to fulfil various needs that are common in data warehouse scenarios.

Storing Raw Data

The primary usage of the data lake should be storing files in their raw format and so a specific directory should be defined for that purpose. Once data arrives in this directory, it should be immutable (never overwritten or changed) so that you can always roll back to a previous point in time if needed. Additionally, keeping all files in their raw state means that future solutions developed outside of your data warehouse do not have a dependence on your cleaning and transformation logic, thereby reducing the need for regular changes to the ETL processing. It is also best practice to group data by source system, again to ensure that future solutions can easily be developed without interfering with warehouse processing.

In terms of security, the data in this directory should be tightly locked down so that files cannot be deleted or overwritten, and only new files can be added. Often, the Azure Data Factory responsible for copying the data into this directory would have write-only access, and a separate Data Factory (or at least linked service) with read-only access would be used to move data out of this directory. Generally, only an administrator would have both read and write access, and this configuration ensures that there is isolation of concerns for each Data Factory that is working with the files. This directory could be called “RAW” and an example structure is shown in Figure 5-1.
Figure 5-1

A folder hierarchy showing the RAW directory with one source system and two datasets

Storing Cleaned Data

Often a SQL engine is used for all processing once RAW data has been ingested because it has all the cleaning capabilities available out of the box. Azure Synapse Analytics and Azure SQL DB could be used to clean and standardize your data from its RAW state into a prepared state, and this is a recommended approach for most integration scenarios; however, there are some exceptions. If, for example, your data is particularly large, it may be much more efficient to leave it in the data lake and use a compute engine, such as Spark or Hive, that can operate on data stored in the lake without the need for data movement. Additionally, if your data consists of complex semi-structured files or completely unstructured media files, the logic to read and standardize that data may be easier to implement using Spark or Hive. In these scenarios, I recommend cleaning your data within the data lake, and as files are cleaned, sterilized, and perhaps batched or split, they should be stored in a new area that indicates this activity has taken place. It is important to separate this data because it is no longer true to its source and therefore may obscure some detail that is required by another team or process, now or in the future. This area could be called “Clean” and is the first step to distilling value. The security here could be more relaxed as there may be analysts wishing to access this cleaned but still relatively untouched data. As with Raw, this space should continue to group data by source system and will very closely resemble the structure of Raw so that the path from Raw to Clean is easily followed. Figure 5-2 shows how both Raw and Clean could be laid out.
Figure 5-2

A folder hierarchy showing the RAW and CLEAN directories with one source system and two datasets

Storing Transformed Data

As with the clean directory, simple tabular data that needs to be transformed in some way should be loaded into a SQL engine. Again, all the capability is built into the engine, and often development teams have pre-agreed patterns or methods for transforming data so that it is ready for ingestion into a data warehouse. However, the same exceptions are still valid because at this point your large or less structured files are cleaned and prepared but not necessarily any smaller or more structured. If they were not tabular to begin with, then they most likely are still not tabular, and therefore it may again be easier to read and transform the data using an HDFS-based engine such as Spark or Hive. The goal of this processing step, however, is to coerce the data into a tabular format so that it can live in a SQL table as part of your data warehouse. That said, avoid the temptation to transform and load your data into the data warehouse in one step. While this might seem more efficient, having the process split out makes maintenance much simpler and provides a clear checkpoint for data before it arrives in your warehouse. As such, a new directory should be created to store these transformed files separately from your raw and clean files. This new directory will no longer follow the source system-based structure, as nearly all data transformation steps alter the file's schema or join rows across datasets, and we should now start to group data by its logical usage. An example would be the processing of customer records from multiple source systems into a single conformed dimension. This file no longer belongs to any single source system and therefore should be grouped under “Customer.” As you can see, the data stored here would closely resemble facts and dimensions, and so this directory should be called “Warehouse.” You may have both analysts and applications consuming this data, so security needs to be heavily considered. Figure 5-3 shows how the data lake may support the Warehouse directory.
Figure 5-3

A folder hierarchy showing the RAW, CLEAN, and WAREHOUSE directories

Facilitating Experimentation

If you have users that want to experiment in the lake, perhaps by transforming data in new, unexplored ways, this may also require a separate “Experiments” area so that the data arriving here will not affect the more defined movement of data through the lake. The security here is very much dependent on the scenario; however, you could have user-specific folders where the security is set up as such. Generally, the usage of this area varies, although most organizations that use this concept successfully have analysts or developers pull data from Raw into their own defined spaces and build proof of concept reports to whet the appetite of the business. When a report is considered valuable, the processing created in the Experiments area of the lake can easily be replicated on top of Raw because the data is in the same state. Figure 5-4 shows how user-specific folders can be used to hold copies of Raw data for experimentation.
Figure 5-4

An implementation of an EXPERIMENT area where Joe Bloggs can experiment with Raw data without affecting the warehouse processing

Implementing the Enterprise Data Lake

When implementing an enterprise data lake on Azure, it is important to remember that the lake should be for the benefit of every single employee; even users without current requirements may well have business-critical data in the future, and the goal is to create a solution that is generic and future proof enough to ensure these scenarios can be implemented with the least amount of developer effort. This can easily be achieved with proper planning of directory structure and security, but as the lake develops over time, this principle is important to keep in mind. That said, this book is focused on data warehousing and how a warehouse can be fed from a data lake. Therefore, we will discuss the specifics of this approach, but in practice, these steps should only form part of the lake and not dictate the entirety of its purpose.

Another key attribute of the lake is its relative cost compared to its value. A data lake is a way to store immense amounts of data while paying very little to do so, with the added benefit that the implementation of the HDFS APIs means data is very efficiently read from the lake. For any solutions that have a tight budgetary constraint, a data lake means that the expensive compute resource is only spent deriving actual analytical value and not cleaning and transforming data, a task that can be done by a lower-level technology. In this section, we will discuss cleaning opportunities in the lake and how these activities can be completed without the use of an expensive SQL engine.

Security Configuration in Azure Data Lake

Before explaining the details for each directory in the lake, it is important to discuss the nature in which permissions are applied in Azure Data Lake Gen 1 and Gen 2. The permissions that can be set are as follows:
  • Read: The ability to read a file or list the contents of a folder

  • Write: The ability to add, delete, and overwrite folders and files

  • Execute: The ability to iterate through a folder and access the subfolders within it

Each folder and file is treated as a separate object in the hierarchy and therefore has its own set of permissions. A common “gotcha” for people new to administering a data lake is the EXECUTE configuration. This permission is essential when a process needs to navigate through the data lake, as the READ permission alone does not permit access to any subfolders in a directory structure. Figure 5-5 shows how to correctly configure permissions for file access.
Figure 5-5

The correct permission setup to allow for file access

A key aspect to keep in mind is that permissions do not inherit from their parent folders. Where a principal has access to a folder, new files and folders added to that folder will not, by default, be accessible to that principal. This sounds problematic at first because any new file or folder incurs the need to update permissions; however, if we know that a principal will need access to every new addition into the parent folder, we can create a default permission entry that ensures the security configuration of the parent is applied to every new object for that principal. This concept is illustrated in Figure 5-6.
Figure 5-6

Default permissions are configured on the “Sales” folder to allow for a new file to be accessed by a principal

If there is a chance that folders will be added to the “Sales” folder, then the “Sales” folder would need READ and EXECUTE configured as Default. Figure 5-7 shows how to configure these permissions correctly.
Figure 5-7

A diagram showing the correct configuration for folders that may become parent folders

Applying Security in Azure Data Lake Gen 2

In order to apply security configurations to Azure Data Lake Gen 2, you must have the Azure Storage Explorer application installed. Assuming you have this application, follow these steps to configure security for either a service principal or AD group:
  1. The key piece of information you need is the AD object ID of the group or principal. This can be found by accessing the Azure Active Directory resource via the Azure portal. See Figure 5-8 for reference.
    Figure 5-8

    Highlighting where Azure Active Directory can be accessed

     
  2. From here you can locate any of the key principals that you may need to configure security for. Figure 5-9 shows the main areas of interest.
    Figure 5-9

    Emphasizing the key areas to configure security in Azure Active Directory

     
  3. Within each of these areas is the ability to search for a group or principal by name or application ID, and the object ID is then easily located either within the search result itself or by clicking the application and locating the object ID item. For reference, an object ID is a GUID that could resemble the following: 1abc6475-79cd-4292-8203-c6c926b3b679.

     
  4. Once you have the object ID for the object you want to configure permissions for, open the Azure Storage Explorer application and locate your Data Lake Gen 2 instance in the tree menu on the left-hand side. Click it to open the folder view in the main window and right-click the first folder to see the dialog box shown in Figure 5-10.
    Figure 5-10

    An image showing the “Manage Access” dialog in Azure Storage Explorer

     
  5. Paste the object ID into the bottom text box and click Add. Once the object ID is validated, you will see it highlighted with the boxes below unchecked. From here you can check the “Access” boxes to determine what permissions are applied directly to the folder and the “Default” boxes to determine what permissions are applied to new files and folders created underneath the selected object.

     

Implementing a Raw Directory

As mentioned previously, the first area for inbound data should be the Raw directory of the data lake. The route into this area should be simple and low maintenance so that there is little to no barrier to entry and data can quickly be consumed and stored securely in a place where it will not be lost. Immediately within the Raw folder should be top-level source system folders that group related data together. By operating in this way, security can be configured to meet any requirement. If you have a source system that is capable of writing data directly into the lake, then this system can be granted access to write into this single folder without the ability to affect any other source system. Conversely, if you need to obtain data yourself using Azure Data Factory, then you could allow this Data Factory the ability to write into each of the folders as required. Further to this, with the advent of GDPR, there is the need to understand and process sensitive data separately from nonsensitive data. For reference, sensitive data includes attributes such as race, ethnic origin, politics, religion, genetics, and others that can be linked to specific individuals by either a unique identifier or more natural aspects such as name, email address, and phone number. As such, it may be prudent to subcategorize Raw into Sensitive and Non-Sensitive, also ensuring that any processes writing into the lake are only able to do so into the correct folders. This could therefore require two Data Factories, one that operates with Sensitive data and one that operates with Non-Sensitive data.

A further key benefit of the Raw directory is the resilience that it offers to the overall solution. By storing data redundantly in the lake, you can ensure that you always have the ability to rehydrate your data warehouse should the need arise. Of course, the larger the data volumes, the more difficult a full hydration may be, but at least with the data stored in Raw, you always have the option.

Partitioning

A common pattern in any file system that is updated daily is to partition the data by a batch id or arrival date. This is so that deltas can be easily derived and lineage accurately tracked. A data lake is no exception to this, and it is encouraged that any writing process creates a daily folder or batch folder for each load. Were the frequency even higher, for example, hourly or by the minute, then you could weigh up the pros and cons of partitioning the structure to that level versus grouping data by date. If you are receiving data from a source provider, then it should be mandated that they create date folders within the lake, underneath their source system folder. Alternatively, if you have to obtain data yourself with Data Factory, then these folders can easily be created using an expression. Listing 5-1 shows some code that could be inserted into the sink dataset's directory path to create folders for the year, month, and day.
@concat(
    'raw/Source System 1/',
    formatDateTime(utcnow(),'yyyy/MM/dd'),
    '/'
)
Listing 5-1

Data Factory expression to concatenate current datetime values with directory paths

Once this pattern is in place, most tools that operate over a data lake (including Azure Synapse Analytics) can begin reading data at the table root level, which in this case would be /raw/Source System 1/. All of the date partitions and subfolders underneath are completely transparent to the engine, and the data can therefore be treated as a single dataset, regardless of which year, month, or day partition a file is stored under. It does not matter how many files are included within the hierarchy; Azure Synapse Analytics will have access to them all. A key point to understand here is that over-partitioning of data can be a bad thing. This is known as the small file problem and generally arises when files are split up to the point that the overhead of reading multiple files exceeds the benefit gained through parallelism. Essentially the engine has too many files to read, and because the files are so small, the cost of opening each file outweighs the time spent reading it. Depending on the scale of your data warehouse, you can achieve different amounts of parallelism when reading data from your data lake. For example, an Azure Synapse Analytics instance running at 500 cDWU can have a maximum of 40 external readers, meaning that 40 chunks of 512 MB can be read at once. Be aware that compressed files can bottleneck performance because, although there may be less data to retrieve from disk, PolyBase cannot open multiple threads on a compressed file.
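
As a sketch of what this table root pattern looks like from the Azure Synapse Analytics side, the hypothetical external table below is pointed at /raw/Source System 1/ and reads every file beneath the date partition folders created by Listing 5-1 as one dataset. The column names are illustrative, and the data source and file format are assumed to already exist (as in Listing 5-2).
-- A hypothetical external table over the table root. PolyBase recurses into
-- the yyyy/MM/dd subfolders, so all partitions appear as a single dataset.
CREATE EXTERNAL TABLE stage.SourceSystem1Customer
(
    CustomerCode  NVARCHAR(1000),
    CustomerName  NVARCHAR(1000),
    EmailAddress  NVARCHAR(1000)
)
WITH
(
    LOCATION = '/raw/Source System 1/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile
);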

An additional consideration when implementing an enterprise data lake that needs to feed Azure Synapse Analytics is that the PolyBase engine cannot push filtering predicates down onto the data lake layer. This means that an external table pointed at the table root will have to read the full dataset every single time, and this will gradually degrade performance as data volumes grow. To mitigate this issue, files could be loaded into an “Active” table location so that only relevant files are exposed to the external table, with nonrelevant files moved out of this location; a full history can still be accessed when needed, while daily loads remain optimized.

Choosing a File Format

A major consideration with any data platform implementation is that of the file formats used throughout the system. In principle, the data lake can house files of any type; however, best practice dictates that a standard file type convention is adopted so that consistency can be maintained. The formats available to PolyBase are
  • Delimited text files: CSV files and the like.

  • RC files: Record columnar format that generates groups of rows and then processes these into key value pairs.

  • ORC files: Optimized row columnar format that uses encoding and lightweight indexes.

  • Parquet: Similar to ORC files; however, Parquet also supports nested attributes and strong data typing. PolyBase cannot read nested Parquet files but can utilize the internal metadata that defines data type information.

With delimited text files, the data types are not enforced, and so the external table should define each column as an NVARCHAR(1000) type so that any value can be read in. However, this also mandates an additional processing step to coerce the untyped values into strongly typed values. To mitigate this, Parquet could be your default format because the files themselves contain metadata describing each column, meaning data does not have to be loaded into an untyped table and then transformed into a strongly typed one.
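
To make the contrast concrete, the sketch below shows the two options side by side: an untyped external table over delimited text and a strongly typed external table over Parquet. The file format and table names are illustrative, and the AzureStorage data source is assumed to exist already.
-- Option 1: delimited text, every column read in as untyped NVARCHAR(1000)
CREATE EXTERNAL TABLE stage.Sales_Untyped
(
    SaleId      NVARCHAR(1000),
    SaleDate    NVARCHAR(1000),
    SaleAmount  NVARCHAR(1000)
)
WITH
(
    LOCATION = '/raw/Source System 1/Sales/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile
);

-- Option 2: Parquet carries its own schema, so columns can be strongly typed
CREATE EXTERNAL FILE FORMAT ParquetFile
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE stage.Sales_Typed
(
    SaleId      INT,
    SaleDate    DATE,
    SaleAmount  DECIMAL(18,2)
)
WITH
(
    LOCATION = '/raw/Source System 1/Sales/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = ParquetFile
);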

Implementing a Clean Directory

Up until now we have discussed mostly the Raw area of the data lake, and while this is arguably the most critical area to get right, there are other areas to focus on. Depending on how you choose to clean your data, there are some major considerations to evaluate and the route into this area can vary greatly depending on the technology choices of the platform.

Cleaning Within a Database

Both Azure SQL Database and Azure Synapse Analytics are highly capable of applying complex and repeatable cleaning rules to datasets, and therefore if your data is all tabular, then this should be your primary method. The T-SQL language that is native to both Azure Synapse Analytics and Azure SQL DB contains reams of functions designed to help developers achieve these goals. Common functions that are used heavily are TRIM, SUBSTRING, LEFT, RIGHT, UPPER, LOWER, COALESCE, REPLACE, CAST, CONVERT, and CONCAT, and often they are used in conjunction with each other. Further benefits of this approach include easier deployment and source control using Visual Studio. While the code to clean the data can be common across both Azure SQL engines, the method to hydrate your database with Raw data to be cleaned is different. Azure Synapse Analytics would use PolyBase to obtain the data directly from the data lake, while Azure SQL DB would have to use Azure Data Factory to bulk copy the data. When using Data Factory, you may be tempted to call a cleaning stored procedure from the copy activity itself, as per Figure 5-11; however, this is poor practice as it changes the insert from a minimally logged bulk operation into a highly transacted one, which hits performance. Figure 5-11 shows the configuration to use a stored procedure as part of the Data Factory copy activity.
Figure 5-11

Image of Data Factory using stored procedure called from Copy activity
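
For reference, the cleaning logic itself is typically wrapped in a stored procedure that is executed as a separate step once the bulk copy has completed, in keeping with the advice above. The sketch below is a minimal, hypothetical example that moves data from an untyped staging table into a typed clean table using the T-SQL functions mentioned earlier; all object names are illustrative.
-- A minimal sketch of a cleaning procedure; stage.Customer and clean.Customer
-- are hypothetical tables.
CREATE PROCEDURE clean.LoadCustomer
AS
BEGIN
    INSERT INTO clean.Customer (CustomerCode, CustomerName, EmailAddress, CreatedDate)
    SELECT
        UPPER(LTRIM(RTRIM(CustomerCode))),             -- standardize codes
        COALESCE(NULLIF(CustomerName, ''), 'Unknown'), -- default empty names
        LOWER(EmailAddress),                           -- normalize email casing
        CAST(CreatedDate AS DATE)                      -- coerce text into a date
    FROM stage.Customer;
END;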

There is of course a redundancy benefit to writing the cleaned data out into the lake which could also enable subsequent solutions that require clean data to piggyback on this output. Were this to be of interest, then Azure Synapse Analytics can again use PolyBase to do the opposite of the import. By creating an external table from an internal table, you create a new file in the data lake that could be picked up by a subsequent process. An example of this is shown in Listing 5-2.
CREATE EXTERNAL TABLE dbo.CleanCustomer
WITH
(
    LOCATION = '/Clean/Sales System/Customer/',
    DATA_SOURCE = AzureStorage,
    FILE_FORMAT = TextFile
)
AS
SELECT TOP [N] *
FROM clean.Customer;
Listing 5-2

An example of a Create External Table As Select (CETAS) statement

The preceding code shows how to create the external table from the internal clean.Customer table, specifying the location and file type. Additionally, note the use of TOP here. This is used to force all the data through the control node of the Azure Synapse Analytics engine, thereby producing one file instead of 60 (one per storage distribution).

While processing data in Azure SQL engines should be your primary choice, there are also issues with this approach. The first is that Azure Synapse Analytics is expensive. The massively parallel processing (MPP) engine at the core of Azure Synapse Analytics is designed for blazing fast analytics and should not be thought of as a regular SQL engine in terms of cost or capability. To maximize your investment in Azure Synapse Analytics, you want to ensure that you are using it to serve users' queries across giant datasets rather than consuming concurrency slots to perform menial ELT tasks. To avoid placing these activities on your Azure Synapse Analytics instance, you could of course utilize the cheaper SQL engine, Azure SQL Database. The drawback here is that data movement pipelines need to be defined and orchestrated to move data into and out of your SQL engine. Figure 5-12 shows how data can be moved between different components of a solution.
Figure 5-12

A diagram showing a polyglot approach to ingest, process, and egress data to and from the data lake

Cleaning Within a Data Lake

A different approach is to leave the data in the lake and not move it into a SQL engine at all. This relies on specialist data lake processing tools, which are becoming increasingly popular due to their flexible nature and their ability to really capitalize on the underlying storage engine. Databricks is one such processing tool and is built upon the Apache Spark engine, using clusters to scale out compute jobs and in-memory storage to enhance performance. Working with Databricks to clean data is beyond the scope of this book; however, all required cleaning activities can easily be undertaken using any of the following:
  • Spark SQL: A SQL language that abstracts a set of dataframe APIs

  • Python: The world’s most popular programming language with a whole heap of external libraries to solve every possible scenario

  • R: Traditionally a statistical language that can perform complex data transformations

  • Scala: The native language of Spark and the language that Spark SQL, Python, and R compile into

The benefits of this approach are that the data does not have to move as far; Databricks connects to the data lake by impersonating a service principal and then exploits its deep integrations with the HDFS ecosystem. Additionally, a truly immense file that would be difficult or too time consuming to load into SQL can easily be processed by Databricks, as its partitions will be exploited and the workload parallelized. Databricks can also rack up a cost, however; be cautious with features such as autoscaling and default cluster sizing, as you can begin to consume compute resources long before you realize how much it is costing, and be sure to terminate a cluster when it is not needed.
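
As a flavor of what this looks like in practice, the Spark SQL sketch below reads a raw delimited file straight from the lake, applies some simple cleaning functions, and writes the result back out as Parquet under a Clean path. It assumes the lake has been mounted at /mnt/datalake, and the file and column names are purely illustrative.
-- Expose a raw CSV folder in the lake as a temporary view
CREATE OR REPLACE TEMPORARY VIEW raw_customer
USING csv
OPTIONS (path '/mnt/datalake/raw/Source System 1/Customer/', header 'true');

-- Write a cleaned copy back to the lake as Parquet under the Clean area
CREATE TABLE clean_customer
USING parquet
LOCATION '/mnt/datalake/clean/Source System 1/Customer/'
AS
SELECT
    upper(trim(CustomerCode)) AS CustomerCode,
    initcap(CustomerName)     AS CustomerName,
    lower(EmailAddress)       AS EmailAddress
FROM raw_customer;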

Cleaning Within Azure Data Factory

A final option that coincides nicely with the topic of Databricks is Azure Data Factory Mapping Data Flows. These are graphical data flows that are created in Azure Data Factory but executed as Scala jobs on a Databricks cluster. They allow developers to drag and drop well-defined activities into a left-to-right flow and configure properties at each step. Similar to SSIS Data Flows, they can perform row- and column-based transformation operations while also handling aggregations, lookups, and filtering. At the time of writing, Mapping Data Flows were only recently released (May 2019) and are therefore a fairly immature offering; however, they do provide a low/no-code option for working with data outside a SQL environment.

Implementing a Transformed Directory

Once again, certain characteristics of your data may dictate that the role of your data lake extends all the way to implementing business logic using Spark or Hive instead of a traditional relational engine. In this case you would want to carve out a further area of the data lake, likely to be named “Warehouse.” “Warehouse” is the area where clean data is joined and transformed into a shape that resembles facts and dimensions, although the data has not yet been assigned surrogate or dimension keys or undergone slowly changing dimension logic. The operative word here is “joined,” and joining requires a relational engine because often we are combining two or more source entities to create a single conformed warehouse entity that encompasses the constituent parts. Databricks has relational capability through its implementation of Spark SQL and PySpark, which allows a developer to write traditional-looking SQL that can perform joins without the need to actually move data into a SQL engine. This approach can drastically decrease the overhead of processing data, as the relational engine is brought directly to the data rather than the data being brought to it. Additionally, Spark's performance is founded upon processing data in memory, and it can perform such tasks at a very large scale due to its distributed architecture.
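
A hedged sketch of such a join is shown below: two cleaned entities from different source systems are conformed into a single warehouse-shaped Customer table without leaving the lake. The table and column names are hypothetical, and it assumes the cleaned data has already been registered as tables or views in Databricks and that the lake is mounted at /mnt/datalake.
-- Conform customer records from two source systems into one entity in the
-- Warehouse area of the lake; all names are illustrative.
CREATE TABLE warehouse_customer
USING parquet
LOCATION '/mnt/datalake/warehouse/Customer/'
AS
SELECT
    coalesce(s.CustomerCode, c.AccountNumber) AS CustomerCode,
    coalesce(s.CustomerName, c.AccountName)   AS CustomerName,
    s.EmailAddress,
    c.CreditLimit
FROM clean_sales_customer AS s
FULL OUTER JOIN clean_crm_account AS c
    ON s.CustomerCode = c.AccountNumber;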

Regardless of whether SQL or the data lake is chosen for each step, all of the cleaning and warehouse operations can be orchestrated using Azure Data Factory. ADF can call stored procedures in either SQL engine and can also invoke Databricks “notebooks” (repeatable scripts), meaning that wherever the processing of the data is done, the orchestration and control of that processing is handled by ADF. This approach is now known as ELT:
  • Extract data from source files or source database.

  • Load data into a SQL engine or data lake.

  • Transform data using SQL stored procs or notebooks that are executed by Azure Data Factory.

This approach is most effective when datasets begin to cross the boundary into “big data,” as the data is then transformed, aggregated, and processed in a proper engine that has the scale to cope with such a task, rather than in flight between a source and destination, as is the pattern with SQL Server Integration Services. Essentially the compute resource is brought to the data, and the data is transformed in place.

In summary, the role of the enterprise data lake is to support a SQL engine when the data becomes too large or too loosely structured. In either of these cases, there is very little trade-off between using the data lake and using SQL because of the quality of tools that are available, such as Databricks. Further, the integrations between the lake and Azure Synapse Analytics mean that the two can work cohesively to provide a “best of both” solution. Of course, the data lake holds another benefit in that no matter what the data is, the lake can handle it. Whether it be multimedia files, frequently arriving log files, or ginormous data files, the lake can not only store this data but also provide a base for rich analytics against these datasets. Of course, multimedia files cannot be read into a relational SQL data warehouse, but by being in an accessible location, their metadata can be used for reporting and analytics if needed.

Another key point is that when a new stream of data is uncovered, it can simply be pointed at the data lake to be stored away safely until a team of developers is ready to do something with it. With no limits on storage, there is never a reason not to store everything. Even if no value is derived from that data immediately, by storing the data in a platform such as a data lake, it is ripe for analytics as soon as the need arises. While I mention that technologies such as SharePoint and shared drives can be treated as data lakes in many organizations, the offerings within the Azure platform implement features that ensure the storage platform is not only limitless and accessible but also easily integrated into database engines and existing security structures. This ensures the data lake is flexible to the needs of the business but robust enough to underpin mission-critical systems such as a data warehouse.

Example Polyglot Architectures

The following figures and explanations discuss a number of different ways the technology offerings could be blended to produce a solution that covers all bases.

Example One

Figure 5-13 contains a diagram displaying a solution with the following characteristics:
  • Small/medium data warehouse that ingests moderate amounts of data per day

  • Has little need for processing data back to lake after RAW

  • May have a need for regular micro inserts or updates

Figure 5-13

A diagram showing a polyglot architecture with a SQL preference

Example Two

The diagram in Figure 5-14 displays a solution with the following characteristics:
  • Large warehouse that ingests massive amounts of data per day

  • Has need for processing data back to lake after RAW

  • Ingests data only in large batches – no micro batches

Figure 5-14

A diagram showing a polyglot architecture with a blend of Azure Synapse Analytics and Azure Data Lake Gen 2

Example Three

Figure 5-15 displays a solution with the following characteristics:
  • Small/medium warehouse that needs Spark or Hive to clean complex or ginormous datasets

  • Has ability to accept smaller batches or micro inserts

  • Needs to serve a broader analytical community of analysts

Figure 5-15

A polyglot architecture that utilizes Databricks to assist with data cleaning and preparation

Example Four

Figure 5-16 displays a solution with the following characteristics:
  • Large warehouse capable of processing highly complex and ginormous datasets

  • Has ability to ingest giant datasets from the lake in parallel

  • Seamless lake integration via PolyBase

Figure 5-16

A polyglot architecture that utilizes mostly lake processing, with a SQL engine layer for presentation