6. Data Delivery and Visualization – Streaming Integration

Chapter 6. Data Delivery and Visualization

Chapter 2 discusses collecting real-time data, and the different techniques that you can use to obtain streaming data from different sources. This chapter looks at the opposite of that: now that you have your streaming data, where do you put it?

Generally, when starting to work with streaming data, businesses have a specific problem – or problems – to solve. This means the target, or where the data will be placed, is top of mind from the beginning. Which target you choose and the way that target will solve the specific problems you face will be very different depending on the particular use case.

Certain technologies are better suited for real-time applications. They’re superior at gaining insights from data as it flows in. Others are better suited for long-term storage and then, downstream, to perform large-scale analytics. Still others, instead of delivering data into some ultimate location, make it possible to get immediate insights by performing processing or analytics within the streaming integration platform.

In this chapter, we talk about streaming data targets: databases, files, Hadoop or other big data technologies, and messaging systems, among others. We also investigate the prevalence of cloud services and how moving to the cloud can alter the way you use target technologies. This latter point is very important given that the majority of businesses have already begun their transition to the cloud.

Here are some of the questions we answer about working with streaming data targets:

  • How do you determine the end goal of your streaming integration?

  • Where should you put streaming data?

  • What should you do with it?

The first step when choosing a target technology is to understand your use case. For example, a common goal in a streaming initiative is to migrate (or replicate) an on-premises database to (into) a cloud database. This can be because an application is being moved from on-premises infrastructure to the cloud. Or perhaps some reporting on an on-premises application is being done in the cloud.

In most of these cases, a database is the optimal target – a database that’s similar to the original database and kept synchronized from an on-premises source to a database target. However, if the goal is to do long-term analytics, or perhaps even machine learning, on data that was gathered in a continuous fashion using streaming technologies, a database might not be the most appropriate target.

Another option would be to move the data into storage. This could be done using a Hadoop technology like Hadoop Distributed File System (HDFS), or you could put it into a cloud storage environment like Amazon Simple Storage Service (Amazon S3), Microsoft Azure Data Lake Store, or Google Cloud Storage. Your target could be a cloud data warehouse or even an on-premises data warehouse. Again, it depends on what you want to get out of the streaming data – what your particular use case, or goal, is.

This is why, when looking at streaming integration platforms, an important consideration is their ability to write data continuously into your technology of choice and to be able to single source the data once from wherever you’re pulling it from, whether a database, messaging system, or IoT device.

After you articulate your goal, it’s time to consider the target. We can repurpose data in multiple ways. Some of it can be written to an on-premises messaging system like Kafka. Some can go into cloud storage. And some can go into a cloud data warehouse. The important point is that a streaming integration platform should be able to work with all of the different types of target technologies. Data that’s been collected once can also be written with different processing technologies to multiple different targets.

The techniques to work with these different targets are as varied as they are for data collection (see Chapter 2). Which one to choose also depends on the mode and the nature of the source data as well as its mission criticality, which we discuss more fully in Chapter 7. But it’s important to keep mission criticality in mind from the beginning because you must determine ahead of time, based on your use case, how important the streaming data in question is. Can you afford to lose any of it? What would happen if you had duplicate data? Based on that you can determine the level of reliability that is needed and choose your target given that some targets are better suited to highly reliable scenarios than others.

In this chapter we discuss each of the different potential targets and their respective pros and cons.

Databases and Data Warehouses

When you write into databases from a streaming integration system, you need to consider a number of issues:

  • How do I want to write the data?

  • How do I deal with different tables?

  • How do I map the data from a source schema to a target schema?

  • How do I manage data-type differences from, for example, an Oracle database to a Teradata database?

  • How do I optimize performance?

  • How do I deal with changes?

  • How do I deal with transactionality?

These questions persist across all the variants of databases and data storage that you might need to integrate with as a target, both on-premises and in the cloud. It’s also important to include document stores, MongoDB, and newer scalable databases like Google Spanner and Azure Cosmos DB that are available only in the cloud.

Whenever writing into any of these databases, for example, the first step is to identify how to translate the data. This means taking the event streams from the streaming integration platform and transforming them into what you want them to appear like in the database. This typically requires a lot of configuration. Whatever streaming integration platform you choose must be capable of enabling you to do this.

To configure your data streams correctly, you must consider your use case. For example, Java Database Connectivity (JDBC) is the most common way to connect when a database is the target on a Java-based streaming processing platform. But even with that relatively simple technology, you need to make configuration choices based upon the specific use case.

If you’re collecting real-time IoT data, and the goal is to insert all of the data into a table in a target database, you could write it as inserts into that table. That’s a very simple use case.

However, recall from Chapter 2 that one of the major techniques for collecting data from databases is CDC, or being able to collect database operations, inserts, updates, and deletes as changes. When doing that, it might be necessary for certain use cases to maintain a change – to keep track of whether it was an insert as opposed to an update or a delete – and to keep that information about the data available so you can take appropriate action in the target database.

Take migrating an on-premises database into the cloud. Not only is a snapshot in time of all of the data in that database required, it’s important to recognize that unless you can stop the application related to that database from operating, changes are going to continue to be made to that database. You cannot stop most mission-critical applications. They’re too important to the business.

To move an important application to the cloud, then, requires also moving dynamically changing data in the database to the cloud. This is referred to as either an online migration or phased migration to the cloud. Simply making a copy of whatever is in the database at a given moment doesn’t work.

Here’s why. An initial copy of the data can be captured at a particular point in time, but then any changes that happen to that data from that point until you are ready to start using that new database must be added to the target. Only then can it be tested to ensure that the application is working and subsequently rolled into production.

In this use case, you would effectively utilize a combination of batch-oriented load of data from tables into a target database, but then overlay that with changes that have been – and which continue to be – made to that data. This provides you with the ability to not just copy an on-premises database, but to synchronize that database to a new cloud instance.

The point is, it’s reasonably trivial to write all data in a streaming architecture to a target database as inserts. It’s more complex to honor the changes (e.g., to apply inserts into a target as inserts, updates as updates, and deletes as deletes). It becomes even more complicated to honor transactionality, and transaction boundaries that might have been present on the source.

For example, if five changes happened within a single transaction on a streaming source, it might be necessary to ensure that those five changes – perhaps a couple of inserts, an update, and a delete – make it all the way to the target database and that they are done within the scope of a single transaction. That’s because they need to either all be there or not.

Therefore, for databases, you need to be able to handle very high throughput, respect change, and honor transaction boundaries. When batching operations together, and considering parallelizing operations for performance reasons, you need to do this while keeping transactional integrity and potential relationships within the database tables in mind.

Data warehouses have their own challenges. The biggest difference between writing streaming data into databases and writing it into data warehouses is that the latter typically are not built to receive streaming data one event at a time. They are more likely designed to be bulk loaded with large amounts of data in one go, and that doesn’t always align with the event-by-event nature of streaming integration. Thus, when considering on-premises data warehouses, all of them work better when you can load them in bulk.

The same is true of cloud data warehouses. Amazon Redshift, Google BigQuery, Azure SQL Data Warehouse, and Snowflake all operate better from a loading perspective if you do it in bulk. But even with data warehouses, loading bulk inserts isn’t always the answer. Most of these data warehouses support modifications through different mechanisms that can be quite convoluted, especially when you aren’t loading all the data, but trying to apply changes that might have occurred.

When using streaming integration, it is possible to keep a cloud data warehouse like Snowflake synchronized with an on-premises relational database like Oracle by utilizing the proper technology integration points. You must keep in mind, however, that real-time latencies aren’t usually possible when writing into a data warehouse. Data warehouses are typically not designed to receive data in a real-time fashion, and they certainly are not designed to also create the kind of data structures typically found in a database in real time from source data. Typically, there is a batch-processing step that needs to happen, which means that the best kind of latencies possible with data warehouses, even with streaming integration technologies, are in the order of minutes, not seconds or subseconds.

Files

Files are really nothing more than chunks of data that have been named. It’s very common to write streaming data into file targets, on-premises and in the cloud. In this section, we talk about both. We’re also going to discuss files separately from technologies like HDFS, which is part of the Hadoop ecosystem, for reasons that will be clear in the next section.

When writing to files, it’s important to consider what data is going into them. Is all of the data being stored in one big file? Is it a “rolling” file so that as soon as it reaches a certain size or triggers some other criteria, it rolls over to a new file? That scenario would, of course, require you to implement a numbering or naming system to distinguish the files from one another.

Perhaps the goal is to write to multiple files in parallel based on some criteria in the data. For example, if you’re reading change data from databases, you might want to split it based on metadata such as timestamps or source location and write it into one file per table.

Because of this, the ability of your streaming platform to take streaming data and choose which of potentially multiple buckets it goes into is an essential consideration. Here are some of the questions that arise:

  • How do you split the data?

  • How do you write the files?

  • How do you name the files?

  • Where do they reside (on-premises or in the cloud)?

  • Are they managed by a storage system, or are they directory entries?

  • What is the format of the data?

These are all important considerations, but the last, the format of the data, is the most critical. When writing data to files, what does it look like? Does it look like raw data? Perhaps it is delimited in some way. Maybe it needs to be in JSON, which is structured human-readable text. Or perhaps one of the industry-standard binary formats.

Then there’s the decision of which technology to use to work with the filesystem. Is it Hadoop? Is it cloud storage? All of those considerations are very much based upon the use case and the end goal. For example, are the files going to be used to store things for long-term retrieval? Are they going to be utilized for doing machine learning somewhere down the line? The important point is that the streaming data platform you choose is flexible enough to accommodate what you want to achieve, across a wide variety of use cases.

Storage Technologies

The technology used to store data as files has changed dramatically in the past few years. It has transitioned from simple directory structures on network filesystems, through distributed scalable storage designed for parallel processing of data in the on-premises Hadoop big data world, to almost infinite, elastically scalable and globally distributed storage using cloud storage mechanisms.

Within the Hadoop world, the HDFS is the basic technology. However, you might also want to write into HBase as a key/value store, or write data using Hive – a SQL-based HDFS interface – into a filesystem. You might even be using a filesystem as an intermediary for some other integration that ends up being transparent to the user.

For example, if writing data into the Snowflake database, it might be desirable to land intermediary files on Amazon S3 or Azure Data Lake Store, which can then be picked up immediately by Snowflake and loaded as external tables. That storage system can act as a staging area for loading into HDFS, in which case all the questions we discussed about how often to write and the format of the data still must be answered. This might or might not be abstracted from the user perspective, depending on which streaming integration system is being used.

Typically, with Hadoop-based systems, external tables for HDFS play a large role in where the data is targeted. Kudu, a newer, real-time data warehouse infrastructure built on top of Hadoop does this. Although working with raw data can be important, it might be necessary to abstract it so that you can write directly into a Hadoop-based technology and make it easier to process or analyze.

There are many blurred lines between these various storage technologies, and trade-offs are often required. When uploading to the cloud, the interfaces don’t always provide ways of writing real-time streams but require file uploads, so determining an optimal microbatch strategy that optimizes speed, latency and cost is important.

Messaging Systems

Messaging targets are very important as targets because they continue the good work of stream processing that have already been started by the streaming integration platform.

Just as it is important to be able to read from messaging systems (as is discussed in Chapter 2), it’s important to be able to write into them. Your reasons for using messaging systems as targets will vary considerably by use case. It might be that processed or analyzed data is being delivered for other downstream purposes, or that the messaging system is being utilized as a bridge between different environments or different locations.

Messaging systems can be useful as a way to enable others to work with the same data or as an entry point into a cloud environment. And because they are inherently streaming themselves, messaging systems can often be the jumping-off point to something else while maintaining a streaming architecture. For example, a use case might require taking real-time database information or real-time legacy data and pushing that out into a cloud messaging system like Azure Event Hubs. Then, that real-time change data becomes the backbone for future processing or analytics.

When working with messaging systems, it’s important to understand the major differences in both the APIs used to access them, and the underlying capabilities of the messaging system that determine what can and can’t be done using that messaging system as a target.

In addition to popular on-site messaging systems such as IBM MQ, Java Message Service (JMS)–based systems, and Kafka, it’s also important to consider newer cloud native technologies, and technologies that span on-premises and cloud such as Solace. Cloud native systems include Amazon Kinesis, Google Pub/Sub, and Azure Event Hubs. These, however, require specific APIs to work with them.

JMS offers a common way for Java applications to access any JMS-compliant system. With JMS-based messaging systems, there are both queues and topics that can be written into. Both can have transactional support as well persistent messaging, ensuring that everything can be guaranteed to be written at least once into JMS targets.

However, for cases in which exactly-once processing is required, queues can be challenging. Topics can have multiple subscribers for the same data, so it’s possible, such as in the case of restart after failure, to query a topic to determine what has been written. Queues are typically peer to peer and thus cannot be queried in the same way and need different strategies to ensure exactly-once processing. We discuss this more in Chapter 7.

On the other hand, if you use a Kafka-based messaging system, every reader of a topic is independent, and maintains its own index of what has been written, allowing you to get around this challenge. Although Kafka is very popular, it’s not the only messaging system out there, and streaming integration platforms that work only with Kafka are extremely limiting.

JMS-based messaging systems give you the ability to integrate with a lot of different messaging systems through a single technique. Alternatively, generic protocols like Advanced Message Queuing Protocol (AMQP) allow you to work with many different providers. Which approach you choose depends on your specific use case.

Application Programming Interfaces

It may also be important to deliver real-time data to a target destination through application programming interfaces (APIs). This isn’t a new subject because all of the target technologies we’ve discussed also work through APIs. JMS has an API, Kafka has an API, databases have APIs, and at the lowest level, you have APIs to write to files, as well.

But when we talk about APIs in this section, we’re focusing on business application–level APIs. The working definition of API we’re using is: an interface or communication protocol that dictates how two systems can interact. These can be RESTful APIs or streaming socket–based APIs. But they have the same goal: to deliver real-time streaming data directly into an application.

For example, the Salesforce API would provide the ability to write into Salesforce and update records continually using streaming data. APIs can be used to deliver data continuously to everything from a mission-critical enterprise application, to a project-management support system, to a bug-tracking system, to keep them up to date. The suitability of any particular API for delivery from streaming integration depends a lot on workload and the API mechanism. RESTful APIs which require a request/response per payload might not scale to high-throughput rates, unless they are called with microbatches of data. Streaming APIs such as WebSockets can scale more easily but often require more robust connection management to avoid timeouts and handle failures.

Cloud Technologies

Working with cloud technologies is essential for any streaming integration platform because so many streaming use cases involve moving data continuously in real time from an on-site application to the cloud.

From a streaming integration perspective, cloud technologies aren’t that different to working with their on-premises counterparts, including databases, files, and messaging systems. What is different is how you connect into the cloud.

It’s important to understand that each cloud vendor has its own security mechanisms. These are proprietary security rules that you must follow, and they include proprietary ways to generate the keys needed to connect. Each cloud vendor also has its own secure access control. Therefore, in addition to thinking about the target technology itself – the database, the file system, the messaging system – you need to consider how you use that technology to access whatever cloud vendor(s) you select (Figure 6-1).

Figure 6-1. A streaming platform should support different technologies across cloud vendors

Streaming integration platforms should support all of this. Databases, data warehouses, storage, messaging systems, Hadoop infrastructures, and other target technologies should be able to connect to and across Google Cloud, Amazon AWS, and Microsoft Azure. And, significantly, not only should your streaming platform support all these different technologies across all these different clouds, they must support multiple instances of them across multiple cloud platforms.

Here are some considerations when the cloud is your target for streaming data integration:

  • What cloud platform should you choose (or which one[s] have already been chosen at your organization)?

  • Who manages the cloud?

  • Am I allowed to access it directly, or do I need to use an internally defined proxy?

  • How do I access it?

  • What authentication do I need?

  • Where do I store credentials to get in?

Many quite complicated cloud use cases exist. For example, some digital transformation use cases involve two-way moving of data from on premises up to the cloud and then back down from the cloud to on premises. Some involve multicloud scenarios with single-source data, some of which is going into one cloud technology hosted by one cloud provider, others going into another cloud technology hosted by another cloud provider.

You even have intercloud use cases. That’s when the streaming integration platform is used to synchronize a database provided by one cloud provider, say AWS, into a database on another cloud provider, like Google. The reason you might do this is to have a totally redundant backup that’s continually updated and hosted by a different cloud provider for peace of mind.

In summary, delivery into cloud by definition involves working both with the target technology you choose, but also working with a particular cloud or clouds. Much depends on the source of the data and the purpose of the integration. Keep in mind, additionally, that all of these different technologies to choose from as targets could also be used as sources from the cloud, as discussed in Chapter 2.

Visualization

When attempting to manage all these data pipelines, streaming integrations, data processing, and data analytics you have put into place, it is natural to want to know: what is happening? You might want to simply do basic monitoring, but more frequently you will have particular SLAs or key performance indicators (KPIs) that you need to attain. You might want to trigger alerts if analytics identifies that certain thresholds have been breached. In these cases, you want some sort of visualization.

Visualization in streaming integration is a little different to what most people are accustomed to in BI applications. This is because the data being visualized is moving in real time. You’re seeing up-to-the-second results of the analytics that you’ve done. You’re seeing what’s happening right now.

So, visualization for real-time data must support continuous and rapidly changing data. Your streaming platform needs to support mechanisms for displaying this data: be capable of producing different charts, graphs, tables, and other options for viewing the data. Perhaps you are tracking multiple KPIs or SLAs. Maybe you want to track alerts or anomalies detected. You might want to look at trend lines and see the direction in which your data is heading.

It might also be important that you can drill down from a high-level view to a more detailed view. This could be based on analytics to get more insight into real-time data. Being able to dynamically filter data is critical when it comes to real-time visualization. And this should be easy to do; your users should have as easy a time working with real-time streaming data through an integration platform as they do working with large amounts of static data through standard business intelligence tools.