5. The Role of the Data Lake
As the data needs of a business change, the methods to store, organize, and audit data need to change as well. Big data is the buzzword of the day, and big data needs a scalable storage platform. Multimedia files such as images, videos, and audio files need to be co-located and reported against, so a platform that can accommodate such diverse data types is required. A modern data platform may also need to ingest data at incredibly high speeds, and having a platform that can cope with streaming and scale accordingly is essential. There is such a variety of data storage requirements in modern businesses that managing and maintaining a dedicated storage system for each would be impossible. What is needed is a simple option that implements a “less is more” approach to offer the scalability and diversity required. What is needed is a data lake.
If you think of a Data Mart as a store of bottled water, cleansed and packaged and structured for easy consumption, the Data Lake is a large body of water in a more natural state. The contents of the Data Lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.
The way the data lake overcomes these restrictions is by being a much more generic store for raw data, meaning that users can easily deliver data of any type into the lake while rapidly deriving insight from it, because the data does not need to be coerced and bound to the schema of a data mart. No longer will analysts have to wait for months to even begin exploring a dataset, only to discover that the essential data they need has been aggregated away into the ether. Now they can dive straight into the data lake, doing as much cleaning as necessary, and once the value of the data has been proven, a proper process can be built to funnel it into a warehouse. In practice, the high-value datasets may well go via the data lake and more or less immediately into a data mart; however, with the limitless storage capabilities of a data lake, there is never a reason to throw data away. In fact, these datasets can often hold unprecedented insight that can only be discovered when enough of the data is held in the same place and in its raw, low-level format.
Additionally, users can benefit from unstructured datasets such as images and videos that could never be represented in a traditional data mart. This capability is of particular interest to data science teams looking to extrapolate tags or metadata about images before blending that data with some other dataset such as customer or product. What’s more, in a data lake environment, the data can be nicely co-located so that a semi-structured JSON file can easily be joined to a Parquet file which can then be updated to hold the output of some AI algorithm – the possibilities are truly endless when data storage is not a barrier.
The key point here, though, is that the data in a data lake is in a raw, untranslated state and cannot easily be read or evaluated using traditional SQL-based methods. Depending on the user and their intention, this can be beneficial; data science teams often prefer to do all cleaning and loading from raw to model stage themselves. However, to fuel a data warehouse, a degree of structure is required. To use a data lake in conjunction with a data warehouse, we must treat the lake as a raw landing and staging area. Crucially, we need a structure for the lake that allows us to properly segment business areas for security or logical reasons. Without this kind of structure, we would find ourselves in charge of a data swamp – a place where data comes to die and insights cannot be discovered.
The Modern Enterprise and Its Data Lake
Any organization will likely already have a data lake, although they may not call it that. They may call it SharePoint, or “The Intranet,” or even just the shared network drive. Branded data lake technologies such as Azure Data Lake Gen 1 and Gen 2 are flagship products that specialize in being data lakes; however, these other systems can compete in some areas. Just because a data lake is not called a data lake doesn’t mean it doesn’t do the same job. Often, however, a cloud-based data lake holds a special place between these technologies – it is not quite so user friendly as to be used daily by a nontechnical user, yet it is much easier to access and load data into than an SFTP site. What’s more, the Azure Data Lake technologies make use of Azure Active Directory (AD) integrated security and can be closely tied in with existing security configurations.
So, if your organization does not have an Azure data lake but does already use some sort of large-scale file repository, do you still need a data lake? The answer is yes. Whereas systems like an intranet or SharePoint are built to maximize collaboration, the data lake should be a developer-led initiative so that the structure is conducive to warehouse loading and, if required, data science research. This may mean breaking apart data silos; where data was previously kept in isolation, files should be relocated so that they can be loaded more efficiently. Because the lake is easily accessed and defined by users with the right permissions, development of the lake in this way can be rapid and agile, and the lake is scalable to almost limitless capacities. It requires very little maintenance or upkeep because there are no servers to worry about – Microsoft takes care of all of that for you; the only concern for the business is the structure and quality of the data in the lake. This founding feature of cloud data lakes means that data should never be thrown away unless there is a very good reason to do so (GDPR, etc.). Any data stream that is identified in the enterprise should be directed to output data into the lake in some capacity. Even if there is no actual processing or defined purpose for the data, the fact that it is captured means that it can be profiled, analyzed, and built upon when the time is right.
Azure Data Lake Technology
Azure Data Lake Gen 1
Azure Data Lake Gen 2
From a functional perspective, these products are obviously fairly similar; however, there are aspects of them that differ, and these distinguishing features are important to understand. Additionally, Microsoft is fully bought into the concept of the data lake and therefore continually develops its offerings to ensure they are competitive products that lead the market.
Azure Data Lake Gen 1
The initial data lake offering, Azure Data Lake Gen 1 (ADL Gen1), is a well-matured product at this point and has been the go-to data lake technology in Azure for a good number of years. While it has since been superseded by Azure Data Lake Gen 2 (ADL Gen2), it is worth a few sentences to explain why this product was beneficial and how the architecture was structured. ADL Gen1 is built using an Apache Hadoop Distributed File System (HDFS) and exposes the WebHDFS REST API layer to calling applications. This means it is easily integrated with other technologies that understand those APIs, such as Spark and Hive. A feature of an HDFS-type file system is that it can store files of any type and size; there are no restrictions whatsoever. Files can range from bytes to petabytes in size, and ADL Gen1 will have no problems storing, reading, and writing them. In fact, when files are deposited into Azure Data Lake Gen 1, they are split across a number of storage servers to offer maximum resiliency as well as parallel reading capabilities. This splitting of data means that analytical compute resources that run on top of the lake, such as Spark, Hive, and Azure Synapse Analytics, are able to run as efficiently as possible. Lastly, ADL Gen1 implements Active Directory integrated security, so access to folders and files can be managed through groups to a high degree of granularity.
Azure Blob Storage
Before Azure Data Lake Gen 2 became generally available in February 2019, the only alternative to Azure Data Lake Gen 1 was Azure Storage, or Blob Storage as it is commonly known. Azure Storage also supports an HDFS-based file system client and therefore offers optimizations for parallel reads and analytical queries; however, it exposes its own set of Azure Storage APIs, surfaced through its Windows Azure Storage Blob (WASB) driver, rather than the more generic WebHDFS APIs. One more major difference between the two technologies is the way that files and folders are implemented. In Azure Data Lake Gen 1, folders are true folders in that they are stand-alone objects in the system; this is known as a hierarchical file system. In Blob Storage, files are stored as objects in a container, which is a flat namespace. The concept of folders does not really exist; however, virtual directories can be implemented using part of the object name. Despite this, all the tools that work with Blob Storage use the name “folder” to describe levels in the system. If you create an empty “folder” and navigate away from it, you will notice that it no longer appears to exist; because no object has that folder as part of its name, the “folder” itself does not exist. This can be confusing at first, although in practice it is rarely an issue. A further benefit that Blob Storage has over Azure Data Lake Gen 1 is its redundancy options. In ADL Gen1, data is locally redundant, meaning copies are stored within the same Azure region. Blob Storage, however, can offer locally redundant, zone-redundant, and geo-redundant storage, making the recovery options a bit more flexible.
Azure Data Lake Gen 2
Finally, we have Azure Data Lake Gen 2, which is essentially the marriage of ADL Gen1 and Azure Storage. The technology is mostly based on Azure Blob Storage, so costs are low and features such as geo-redundancy are available by default. There are, of course, a couple of differentiating factors that make this a true data lake technology optimized for big data analytics rather than a generic object storage engine. The first is hierarchical namespaces. This feature allows a directory structure to be realized physically rather than being mimicked, as is the case in Azure Blob Storage. As mentioned earlier, folders do not technically exist in Blob Storage, so any change to the directory structure incurs the need to iterate over each object and perform an update. With the implementation of hierarchical namespaces, a directory update becomes a simple metadata change in the storage engine, and data access is simplified greatly, thereby improving query performance. Another addition is the ABFS driver, which is available in all Apache Hadoop environments, such as Azure Databricks and Azure Synapse Analytics, and is specifically optimized for big data analytics. Previously, the WASB driver was used to complete the complex mappings between the HDFS semantics and the object store interface used in Azure Blob Storage. With the arrival of hierarchical namespaces, the system semantics are now aligned, so the mapping exercise is no longer required, making reads much more efficient. The security implementation for Azure Data Lake Gen 2 is very similar to that of Gen 1 now that folders are no longer virtualized. Azure Active Directory is fully integrated, and permissions can be set for each file and folder. The permissions themselves can be assigned through the Azure Portal and also using Azure Storage Explorer.
Each of these data lake technologies integrates excellently with Azure Synapse Analytics via the PolyBase engine and can make full use of the distributed storage structure to read data into the instance in parallel. Azure SQL DB, on the other hand, is not a distributed system and has no support for HDFS-type file storage, meaning that data moving from any of the preceding data lake options into Azure SQL DB must be loaded via an integration engine such as Azure Data Factory.
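To illustrate that integration, the following is a minimal sketch of how a dedicated SQL pool in Azure Synapse Analytics might be pointed at an Azure Data Lake Gen 2 account so that PolyBase can read from it in parallel. The account, container, and credential names are hypothetical, and the sketch assumes a database master key already exists and that the workspace’s managed identity has been granted access to the storage account:

-- Credential based on the managed identity (hypothetical name).
CREATE DATABASE SCOPED CREDENTIAL DataLakeCredential
WITH IDENTITY = 'Managed Service Identity';

-- External data source pointing at the root of the lake's file system.
CREATE EXTERNAL DATA SOURCE DataLakeSource
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://datalake@mystorageaccount.dfs.core.windows.net',
    CREDENTIAL = DataLakeCredential
);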
Planning the Enterprise Data Lake
When you first begin using an operating system (OS), such as Windows, one thing you probably take for granted is that all the files are neatly packaged into folders for you, and the OS provider will usually have already created some empty ones that are preconfigured for things you use regularly, such as downloads, music, images, and others. Despite this, there will always be an area where you are encouraged to put in your own organizational structure – for users of Windows, this is your Documents folder. Here, the file system has been well thought out so that when new data arrives in your system, there is a clear place for it to go and files can easily be located when needed. Imagine firing up your PC to find that every file on your machine was stored in a single folder. You would be completely lost! Additionally, some folders are purposefully locked down to avoid you accidentally deleting something that is critical to your system – again, good planning of the file structure.
Moving out of the realm of a user’s PC and into that of cloud data lakes, the same principles still very much apply. A data lake without a folder structure is a data swamp and is of very little use to anyone, in much the same way as a machine with all its files stored in one place would be. The first step is to define the purpose of your lake and determine which parts of the data processing pipeline will be hosted in the data lake. Commonly, the data lake is used to store data in its rawest form. However, there are tools that can perform complex cleaning and apply relational logic to data, all within the data lake. The next section will explore when to use a data lake to fulfill various needs that are common in data warehouse scenarios.
Storing Raw Data
The primary usage of the data lake should be storing files in their raw format and so a specific directory should be defined for that purpose. Once data arrives in this directory, it should be immutable (never overwritten or changed) so that you can always roll back to a previous point in time if needed. Additionally, keeping all files in their raw state means that future solutions developed outside of your data warehouse do not have a dependence on your cleaning and transformation logic, thereby reducing the need for regular changes to the ETL processing. It is also best practice to group data by source system, again to ensure that future solutions can easily be developed without interfering with warehouse processing.
Storing Cleaned Data
Storing Transformed Data
Implementing the Enterprise Data Lake
When implementing an enterprise data lake on Azure, it is important to remember that the lake should be for the benefit of every single employee – even users without current requirements may well have business-critical data in the future, and the goal is to create a solution that is generic and future proof enough to ensure these scenarios can be implemented with the least amount of developer effort. This can easily be achieved with proper planning of directory structure and security, but as the lake develops over time, this principle is important to keep in mind. That said, we are looking at data warehousing and how a warehouse can be fed from a data lake. Therefore, we will begin to discuss the specifics of this approach, but in practice, these steps should only form part of the lake and not dictate the entirety of its purpose.
Another key attribute of the lake is its relative cost compared to its value. A data lake is a way to store immense amounts of data while paying very little to do so, with the added benefit that the implementation of the HDFS APIs means data is very efficiently read from the lake. For any solutions that have a tight budgetary constraint, a data lake means that the expensive compute resource is only spent deriving actual analytical value and not cleaning and transforming data, a task that can be done by a lower-level technology. In this section, we will discuss cleaning opportunities in the lake and how these activities can be completed without the use of an expensive SQL engine.
Security Configuration in Azure Data Lake
Read: The ability to read a file or list the contents of a folder
Write: The ability to add, delete, and overwrite folders and files
Execute: The ability to iterate through a folder and access the subfolders within it
Applying Security in Azure Data Lake Gen 2
1. The key piece of information you need is the AD object id of the group or principal. This can be found by accessing the Azure Active Directory resource via the Azure portal. See Figure 5-8 for reference.
2. From here you can locate any of the key principals that you may need to configure security for. Figure 5-9 shows the main areas of interest.
3. Within each of these areas is the ability to search for a group or principal by name or application id, and the Object id is then easily located either within the search result itself or by clicking the application and locating the object id item. For reference, an object id is a GUID that could resemble the following: 1abc6475-79cd-4292-8203-c6c926b3b679.
4. Once you have the Object id for the object you want to configure permissions for, open the Azure Storage Explorer application and locate your Data Lake Gen2 instance from the tree menu on the left-hand side. Click it to open the folder view in the main window and right-click the first folder to see the dialog box shown in Figure 5-10.
In the bottom text box, paste in the object id and click Add. Once the object id is validated, you will see the new object id highlighted and the boxes below unchecked. From here you can check the “Access” boxes to determine which permissions are applied directly to the folder and the “Default” boxes to determine which permissions are applied to new files and folders created underneath the selected object.
Implementing a Raw Directory
As mentioned previously, the first area for inbound data should be the Raw directory of the data lake. The route into this area should be simple and low maintenance so that there is little to no barrier to entry and data can quickly be consumed and stored securely away in a place where it will not be lost. Immediately within the Raw folder should be top-level source system folders that group related data together. By operating in this way, security can be configured to meet any requirement. If you have a source system that is capable of writing data directly into the lake, then this system can be granted access to write into this single folder without the ability to affect any other source systems. Conversely, if you need to obtain data yourself using Azure Data Factory, then you could allow this Data Factory the ability to write into each of the folders as required. Further to this, with the advent of GDPR, there is the need to understand and process sensitive data separately from nonsensitive data. For reference, sensitive data includes attributes such as race, ethnic origin, politics, religion, genetics, and others that can be linked to specific individuals by either a unique identifier or more natural aspects such as name, email address, and phone number. As such, it may be prudent to subcategorize Raw into Sensitive and Non-Sensitive, also ensuring that any processes writing into the lake are only able to do so into the correct folders. This could therefore mean operating two Data Factories, one that works with Sensitive data and one that works with Non-Sensitive data.
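As a purely illustrative sketch (the source system names are hypothetical), a Raw area organized along these lines might look like the following, with the date partitioning described later in this section sitting underneath each source:

/raw
    /Sensitive
        /CRM System
            /2019/05/31/customers.csv
    /Non-Sensitive
        /Source System 1
            /2019/05/31/sales.csv
        /Source System 2
            /2019/05/31/stock.csv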
A further key benefit of the Raw directory is the resilience that it offers to the overall solution. By storing data redundantly in the lake, you can ensure that you always have the ability to rehydrate your data warehouse should the need arise. Of course, the larger the data volumes, the more difficult a full hydration may be, but at least with the data stored in Raw, you always have the option.
Data Factory expression to concatenate current datetime values with directory paths
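As a minimal sketch, assuming a hypothetical source system folder and file name, such an expression might look like the following, producing a path such as raw/Source System 1/2019/05/31/sales.csv:

@concat('raw/Source System 1/',
        formatDateTime(utcnow(), 'yyyy'), '/',
        formatDateTime(utcnow(), 'MM'), '/',
        formatDateTime(utcnow(), 'dd'), '/sales.csv')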
Once this pattern is in place, most tools that operate over a data lake (including Azure Synapse Analytics) can begin reading data at the table root level, which in this case would be /raw/Source System 1/. All of the date partitioning and subfolders underneath are completely transparent to the engine, and the data can therefore be treated as a single dataset, regardless of which year, month, or day partition the file is stored under. It does not matter how many files are included within the hierarchy; Azure Synapse Analytics will have access to them all. A key point to understand here is that over-partitioning of data can be a bad thing. This is known as the small file problem and generally arises when files are split up to the point that the overhead of reading multiple files exceeds the benefit gained through parallelism. Essentially the engine has too many files to read; because each file is so small, it is finished almost instantly, and the engine then incurs the overhead of opening the next one. Depending on the scale of your data warehouse, you can achieve different amounts of parallelism when reading data from your data lake. For example, an Azure Synapse Analytics instance running at 500 cDWU can have a maximum of 40 external readers, meaning that forty 512 MB chunks of data can be read at once. Be aware that compressed files can bottleneck performance because although there may be less data to retrieve from disk, PolyBase cannot open multiple threads on a compressed file.
An additional consideration when implementing an enterprise data lake that needs to feed Azure Synapse Analytics is that the PolyBase engine cannot push filtering predicates down onto the data lake layer. This means that an external table pointed at the table root will have to read the full dataset every single time, and performance will gradually degrade as history accumulates. To mitigate this issue, files could be loaded into an “Active” table location so that only relevant files are exposed to the external table; nonrelevant files are moved out of this location into the wider history, which can still be accessed when needed, while daily loads remain optimized.
Choosing a File Format
Delimited text files: CSV files and the like.
RC files: Record columnar format that generates groups of rows and then processes these into key value pairs.
ORC files: Optimized row columnar format that uses encoding and lightweight indexes.
Parquet files: Similar to ORC files; however, they also support nested attributes and strong data typing. PolyBase cannot read nested Parquet files but can utilize the internal metadata that defines data type information.
With delimited text files, the data types are not enforced, and so the external table should define each column as an NVARCHAR(1000) type so that any value can be read in. However, this also mandates an additional processing step to coerce the untyped values into strongly typed values. To mitigate this, Parquet could be your default file format because the files themselves contain metadata describing each column, meaning data does not have to be loaded into an untyped table and then transformed into a strongly typed one.
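To make this concrete, the following is a minimal sketch of the two approaches, reusing the hypothetical DataLakeSource data source from earlier; the file format names, table names, columns, and locations are illustrative only:

-- Delimited text: no type enforcement, so every column lands as NVARCHAR
-- and must be cast into typed columns in a later step.
CREATE EXTERNAL FILE FORMAT CsvFileFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', FIRST_ROW = 2));

CREATE EXTERNAL TABLE stage.CustomerCsv (
    CustomerId   NVARCHAR(1000),
    CustomerName NVARCHAR(1000),
    CreatedDate  NVARCHAR(1000)
)
WITH (LOCATION = '/raw/Source System 1/',
      DATA_SOURCE = DataLakeSource,
      FILE_FORMAT = CsvFileFormat);

-- Parquet: the file metadata carries the types, so the external table
-- can be strongly typed from the outset.
CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (FORMAT_TYPE = PARQUET);

CREATE EXTERNAL TABLE stage.CustomerParquet (
    CustomerId   INT,
    CustomerName NVARCHAR(100),
    CreatedDate  DATE
)
WITH (LOCATION = '/raw/Source System 2/',
      DATA_SOURCE = DataLakeSource,
      FILE_FORMAT = ParquetFileFormat);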
Implementing a Clean Directory
Up until now we have discussed mostly the Raw area of the data lake, and while this is arguably the most critical area to get right, there are other areas to focus on. There are some major considerations to evaluate when choosing how to clean your data, and the route into this area can vary greatly depending on the technology choices of the platform.
Cleaning Within a Database
An example of a Create External Table As Select (CETAS) statement
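A minimal sketch of such a statement, reusing the hypothetical DataLakeSource and ParquetFileFormat objects from the earlier sketches, might look like this:

CREATE EXTERNAL TABLE ext.Customer
WITH (
    LOCATION = '/clean/Customer/',
    DATA_SOURCE = DataLakeSource,
    FILE_FORMAT = ParquetFileFormat
)
AS
-- TOP forces the result through the control node so that a single
-- file is written, rather than one file per storage distribution.
SELECT TOP 2147483647 *
FROM clean.Customer;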
The preceding code shows how to create the external table from the internal clean.Customer table, specifying the location and file type. Additionally, note the use of TOP here. This is used to force all the data through the control node of the Azure Synapse Analytics engine, thereby producing one file instead of 60 (one per storage distribution).
Cleaning Within a Data Lake
Spark SQL: A SQL language that abstracts a set of dataframe APIs
Python: The world’s most popular programming language with a whole heap of external libraries to solve every possible scenario
R: Traditionally a statistical language that can perform complex data transformations
Scala: The native language of Spark; Spark SQL, Python, and R commands are ultimately executed by Spark’s Scala/JVM engine
The benefits of this approach are that the data does not have to move as far; Databricks connects to the data lake by impersonating a service principal and then exploits its deep integration with the HDFS ecosystem. Additionally, a truly immense file that would be difficult or too time consuming to load into SQL can easily be processed by Databricks, as its partitions will be exploited and the workload parallelized. Databricks can also rack up a cost, however; be cautious with features such as auto scaling and default sizing, as you can begin to consume compute resources long before you realize how much it is costing, and be sure to terminate a cluster when it is not needed.
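As a brief sketch of what cleaning inside the lake can look like, the following Spark SQL could run in a Databricks notebook to read a hypothetical raw CSV folder and write a cleaned, typed Parquet table back into a clean area of the lake; the mount point, paths, table names, and columns are all illustrative assumptions:

-- Expose the raw CSV files as a temporary view (path is hypothetical).
CREATE OR REPLACE TEMPORARY VIEW raw_customer
USING CSV
OPTIONS (path '/mnt/datalake/raw/Source System 1/', header 'true');

-- Write a cleaned, typed copy back to the clean area as Parquet.
CREATE TABLE IF NOT EXISTS clean_customer
USING PARQUET
LOCATION '/mnt/datalake/clean/Customer/'
AS
SELECT CAST(CustomerId AS INT)             AS CustomerId,
       TRIM(CustomerName)                  AS CustomerName,
       TO_DATE(CreatedDate, 'yyyy-MM-dd')  AS CreatedDate
FROM raw_customer
WHERE CustomerId IS NOT NULL;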
Cleaning Within Azure Data Factory
A final option that coincides nicely with the topic of Databricks is Azure Data Factory Mapping Data Flows. These are graphical data flows that are created using Azure Data Factory but executed as Scala jobs on a Databricks cluster. They allow developers to drag and drop well-defined activities into a left-to-right flow and configure properties at each step. Similar to SSIS Data Flows, they can perform row- and column-based transformation operations while also handling aggregations, lookups, and filtering. At the time of writing, Mapping Data Flows had only recently been released (May 2019) and are therefore a fairly immature offering at this stage; however, they do provide a low/no code option for working with data outside of a SQL environment.
Implementing a Transformed Directory
Once again, certain characteristics of your data may dictate that the role of your data lake extends all the way to implementing business logic using Spark or Hive instead of a traditional relational engine. In this case you would want to carve out a further area of the data lake, likely to be named “Warehouse.” This is the area where clean data is joined and transformed into a shape that resembles facts and dimensions, although the data has not yet had surrogate or dimension keys assigned or undergone slowly changing dimension logic. The operative word here is “joined,” and joining requires relational capability because often we are combining two or more source entities to create a single conformed warehouse entity that encompasses the constituent parts. Databricks has a relational capability through the implementation of Spark SQL and PySpark, which allows a developer to write traditional-looking SQL that can perform joins without the need to actually move data into a SQL engine. This approach can drastically decrease the overhead to process data, as the relational engine is brought directly to the data rather than the data being brought to it. Additionally, Spark’s performance is founded upon processing data in memory, and it can perform such tasks at a very large scale due to its distributed architecture.
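To sketch the idea (table names, columns, and the lake path are hypothetical, and the clean tables are assumed to exist from earlier cleaning steps), a conformed warehouse entity might be built inside the lake with a Spark SQL join along these lines:

-- Join two cleaned source entities into a single conformed entity stored
-- in the Warehouse area of the lake (names and paths are illustrative).
CREATE TABLE IF NOT EXISTS warehouse_customer
USING PARQUET
LOCATION '/mnt/datalake/warehouse/DimCustomer/'
AS
SELECT c.CustomerId,
       c.CustomerName,
       a.City,
       a.Country
FROM clean_customer AS c
LEFT JOIN clean_address AS a
       ON a.CustomerId = c.CustomerId;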
Extract data from source files or source database.
Load data into a SQL engine or data lake.
Transform data using SQL stored procs or notebooks that are executed by Azure Data Factory.
This approach is most effective when datasets begin to cross the boundary into “big data” territory, as the data is transformed, aggregated, and processed in a proper engine that has the scale to cope with such a task, rather than in flight between a source and a destination, as is the pattern with SQL Server Integration Services. Essentially, the compute resource is brought to the data, and the data is transformed in place.
In summary, the role of the enterprise data lake is to support a SQL engine when the data becomes too large or too loosely structured. In either of these cases, there is very little trade-off between using the data lake and using SQL because of the quality of the tools that are available, such as Databricks. Further, the integrations between the lake and Azure Synapse Analytics mean that the two can work cohesively to provide a “best of both” solution. Of course, the data lake holds another benefit: no matter what the data is, the lake can handle it. Whether it be multimedia files, frequently arriving log files, or ginormous data files, the lake can not only store this data but also provide a base for rich analytics against these datasets. Of course, multimedia files cannot be read into a relational SQL data warehouse, but by being in an accessible location, their metadata can be used for reporting and analytics if needed.
Another key point is that when a new stream of data is uncovered, it can simply be pointed at the data lake to be stored away safely until a team of developers is ready to do something with it. With no limits on storage, there is never a reason not to store everything. Even if no value is derived from that data immediately, by storing the data in a platform such as a data lake, it is ripe for analytics as soon as the need arises. While I mention that technologies such as SharePoint and shared drives can be treated as data lakes in many organizations, the offerings within the Azure platform implement features that ensure the storage platform is not only limitless and accessible but also easily integrated with database engines and existing security structures. This ensures the data lake is flexible to the needs of the business but robust enough to underpin mission-critical systems such as a data warehouse.
Example Polyglot Architectures
The following figures and explanations discuss a number of different ways the technology offerings could be blended to produce a solution that covers all bases.
Small/medium data warehouse that ingests moderate amounts of data per day
Has little need for processing data back to lake after RAW
May have a need for regular micro inserts or updates
Large warehouse that ingests massive amounts of data per day
Has need for processing data back to lake after RAW
Ingests data only in large batches – no micro batches
Small/medium warehouse that needs Spark or Hive to clean complex or ginormous datasets
Has ability to accept smaller batches or micro inserts
Needs to serve a broader analytical community of analysts
Large warehouse capable of processing highly complex and ginormous datasets
Has ability to ingest giant datasets from the lake in parallel
Seamless lake integration via PolyBase