Chapter 7. Mission Criticality
When it comes to streaming integration – or simply any type of real-time data processing – it’s fairly straightforward to build the integrations we’ve been talking about in a lab scenario. You can try it, piece it together with bits of open source, test it, and generally get things working as a viable proof of concept (POC).
But, when it comes time to put these things into production, that’s another matter altogether. When building continuous data pipelines that need to run around the clock and support the applications that the business depends upon, then many questions arise:
How do I make sure that it works 24/7?
How do I ensure that it can handle failure?
How can I be sure that it can scale to the levels I need it to, given the expected growth of data within my organization?
How do I make sure that the data remains secure as it moves around the enterprise at high speeds?
This is all about mission criticality. And it can’t be an afterthought. You need to design your streaming integration infrastructure with mission criticality in mind. This is one of the biggest issues to consider when investigating streaming integration platforms.
We’ve talked about the benefits of building a distributed platform for scaling streaming integration. Clustering1 is one of the most popular methods for building a distributed environment. But there are several ways to build a scalable and reliable cluster. Your choice of stream-processing platform becomes critical at this point. When doing your due diligence, make sure that the platform has the following characteristics:
Requires little upfront configuration
Is self-managing and self-governing
Allows users to add and remove cluster nodes without interrupting processing flow
The old-school way of scaling was to have a standalone node that could be scaled up by adding hardware. You would simply add more CPUs and more memory to that standalone machine. Of course, for reliability reasons, you would have replicated the application and its data to a backup node and switched over to it in case of failure. That’s how most enterprises structured their databases.
Early incarnations of stream processing used the same paradigm. Because distributed architecture had not yet reached mainstream, the pioneers of stream processing relied on scaling on top of a single machine. But there are limitations to this approach. Among other things, memory management in a Java-based environment causes performance to suffer if you take the single-processor route.
However, with the new breed of stream processing, we use a modern method, one made popular by the Hadoop approach to big data. This paradigm involves building large, scalable clusters of nodes over which the stream integration and processing can be distributed.
Because of this, it’s important that the stream integration platform you choose supports clustering. It should also be able to handle scalability, failover, and recovery by distributing the load or failing over the load to another node in the cluster if one or more machines go down.
Your streaming integration platform must also make all of this easy to work with. Some platforms are very configuration heavy, especially in the Hadoop world, and you must do an immense amount of configuration upfront to determine which nodes in a cluster should take on which tasks and what purpose each node serves. That’s why we recommend choosing a platform that has a “low-touch” way of building out a cluster.
The low-touch approach basically allows the software to determine what’s best. Effectively, you start up a certain number of nodes, those nodes talk among themselves, and they automatically determine which nodes services should run on, including management services, monitoring services, and all the required other services to keep stream integration and processing humming along nicely. Additionally, a low-touch stream integration platform allows you to add or remove nodes without interrupting processing.
Scalability and Performance
Scalability is obviously an essential aspect of building a successful stream integration system. As we discussed in the previous section, you need to do this very differently from the traditional ways that enterprises have approached scalability. There’s only so much you can scale on a single node.
So scaling out rather than scaling up is best (Figure 7-1). But as we explained, your stream integration platform should be able to add additional nodes to the cluster as necessary, as the load increases. Additionally, the cluster should be self-aware, aware of the resources it’s using on every node, and be able to distribute the load and scaling out as necessary to maintain performance.
Keep in mind that different parts of the architecture can require different scale-out capabilities. It might be possible to read incoming data at a certain rate. For example, one part might be reading changes that are happening within a database at a rate of 100,000 changes per second. But, downstream from that, some processing might be necessary, and maybe that processing is quite complex, and the fastest that processing task can run is 25,000 changes per second. In such a case, you have a problem.
If a distributed clustered streaming integration platform is designed well, it should be possible to partition the events that are being processed over the available nodes. For example, if there are four nodes in the cluster, the streaming integration platform should be able to partition the processing over those nodes, with each one processing 25,000 changes per second. Overall, that downstream processing is occurring at 100,000 per second, keeping up with your reader.
Not only should you be able to parallelize events and have different pieces of that streaming architecture running in different volumes around the cluster, you should also be able to dictate which pieces scale. For example, you might not be able to scale the reader, because it will work only if it has one connection into your source. Perhaps you’re reading from a database change log, or you’re reading from a messaging queue that can have only one reader to guarantee the order of events. In those cases, you don’t want to scale the reader, but you might want to scale the processing. Make sure that the streaming platform you choose allows you to do this.
To give another example, in our experience, the biggest bottleneck to a distributed clustered architecture like this is the writing into the various targets (Chapter 6). You might want to have a single reader, maybe some parallelism across the cluster for the processing, but you might want even more parallelism in the writing side, for the data targets. In such cases, it’s crucial that you can scale not just the cluster, but the individual streaming components that are running within the cluster.
This brings us to an important point to make about scalability: it is irrevocably tied to performance. The example we just gave – the need to be able to scale individual streaming components within a cluster – is critical for performance.
Another important scalability-related performance concept is that you should attempt to do everything on a single node, as much as is possible. This includes reading from a source, all the processing, and the writing to a target. A single node should handle it all.
This is because every time data is moved between nodes, either on a single machine or on separate physical machines, data is moving over the network. This involves serialization and deserialization of the objects that you’re working with, which inevitably adds overhead.
In environments that require high levels of security, that data might also be encrypted as it moves over the network. In such cases, you need to worry about network bandwidth and also the additional CPU usage required to serialize and deserialize, and encrypt and decrypt, that data.
From a performance perspective, when designing data flows, you want to try and keep those hops between physical instances of the cluster as low as possible. Ideally, you would read once, and if you need to distribute it out over a cluster for processing, you would do that once, but then it should stay on the same nodes as it passes down the data pipeline. Similarly, for performance reasons, you don’t want to allow input/output (I/O) functions such as writing to disk between individual processing steps.
Enterprises struggle with streaming integration architectures because of these challenges. As a result, they end up with unnecessary overhead on their distributed environments. For example, if utilizing a persistent messaging system between every step in the data pipeline, it is physically writing over the network, serializing the data, maybe encrypting it, writing it to disk, then reading from disk, and then deserializing it and possibly having to decrypt it, as well. That’s a long pipeline.
That is going to add overhead, and in high-performance, scalable, low-latency systems, you want to minimize overhead. So as much as possible, try to avoid I/O, and stick to in-memory processing of data only.
Constant serialization and deserialization of data can also be a problem. Ideally, your streaming platform should make this easy to do. For example, if you’re choosing to work with SQL for ease-of-use reasons, you still want to make sure that at runtime it is as fast as if you’d actually written custom code yourself. So, deploy a stream integration platform that gives you highly optimized code even if generated by SQL.
Another point related to scalability and performance: as we mentioned before, when working with distributed caches, it’s important to make sure that if you’re joining event data with data that you’ve loaded into memory in a distributed cache, for either reference or context purposes, you still can bring the event to the physical machine that actually possesses the data for that cache. That’s because the cache itself can be distributed.
Some of the data will be present on only some of the nodes in the cluster. You want to make sure that each event lands on a node in the cluster that physically has the data that it’s going to join with. Otherwise, a remote lookup will be necessary, invoking serialization and deserialization and significantly slowing performance.
In summary, when evaluating stream processing platforms, be cautious. A system might work well when testing in a lab with low throughput, when scalability isn’t a factor. Before going into production, you will need to perform scalability tests to ensure that your distributed architecture will work – and stay low-latency – when processing high-volume loads.
Failover and Reliability
The next three topics we cover – failover, recovery, and exactly-once processing – are about reliability: how to ensure that you can recover data, events, and processing results in case of a system failure, large or small.
First, failover. The stream-processing platform you choose needs to recognize that failures can and will happen and must have ways of dealing with those failures.
There are many different types of failure. There are physical machine crashes, and there are software crashes. There are problems reading from data sources. There are difficulties writing into targets that are outside of the platform. There may be issues with the network, causing extended outages. The streaming integration platform needs to be able to be aware that these things happen. Your platform needs to be able to handle all of them effectively and with ease.
In the case of the failure of a single node in the cluster that’s doing stream processing, you should be able to failover data flow to another node, or another set of nodes, and have it pick up where it left off, without losing anything.
This is where the cluster comes in with regard to reliability. For scalability, as you recall, the cluster is important because of parallelism. For reliability, it’s important that you have redundancy so that you can easily have an application automatically failover to other nodes if the node that the application is running on goes down for some reason.
In scalability, we talked about adding nodes to scale. You also want to add nodes so that the cluster can handle nodes disappearing, and still maintain the application.
Now, operating in a cluster doesn’t necessarily mean that the application will continue running uninterrupted. If all of the pieces of an application are running on a single node that goes down, by definition, it’s going to be interrupted. The point of failover, however, it should be able to pick up from where it left off, and catch up, using the same recovery techniques that we discuss in the next section.
From an external perspective, there may be a slight bump in the processing rate, but after a period of time, things should go back to normal without losing any data.
Recovery is about coming back from a failure with operations and data intact.
You might not necessarily care whether you lose some data in a failure. Completely recovering from a failure might not be a significant issue in such cases. Perhaps you’re monitoring a real-time IoT device. Even if you miss some data while things are failing over, the results of the analysis aren’t going to be affected. You just might have a short period of time in which you don’t have insight into the device you are monitoring.
But in most cases – especially when the application is mission-critical and the results have to accurately represent all the incoming data – full recovery becomes very important.
There are quite a few recovery strategies employed by streaming integration platforms. For example, a platform can replicate states, so that for every operation that occurs on one node, the state required for that tool occurs on another node.
The issue is that it takes bandwidth and resources to do that replication. If you’re doing it for every event that’s coming in, that process is going to consume a lot of additional resources, which could considerably slow down the data flow. After all, replicating requires network transformation in the form of all the serialization and deserialization we’ve discussed. That will severely slow things down.
An alternate way of handling recovery is to enable checkpointing. Checkpointing means that you are periodically recording where you are in terms of the processing. What was the last event received? What was the last event output from the system before it went down?
Checkpointing takes into account the different types of constructs that can exist between a source and a target. That includes queries, windows, pattern matching, event buffers, and anything else that happens to the data. For example, you might be joining messages together, you might be discarding messages, or you might be aggregating certain pieces of data before you output it into a target. So, it’s not necessarily a case of simply remembering the last event that came into the system.
There is a caveat: using checkpointing for recovery does require rewindable data sources. If you are recording states using checkpointing, perhaps while utilizing data windows, and you have a minute’s worth of data in a window to recover, you need to rebuild that minute’s worth of data, and then pick up outputting data after you get to the point where you’re able to start processing again.
Effectively, a low “watermark” in checkpointing is a pointer into your data source that represents the first event from which you need to rebuild the state: the point of failover. A high watermark is the first new event that needs to be processed.
This means that if an entire cluster fails, you would need to be able to go back in time and ask your data sources, “please give me the last minute’s worth of data again.” And although it’s possible to go back in time and reread data with certain types of data sources such as change data or file sources, with other types of data sources it simply won’t work. Certain messaging systems, and especially IoT data, for example, are not rewindable. First, you might not be able to contact them in that way. Second, they might not record that amount of information themselves. Small IoT devices, for example, physically might not be able to replay the data.
That’s where the persistent data streams that we talked about earlier in the book become important. They record all of the incoming data at the source so that when you are in a recovery situation, you can rewind into that data stream, pick up from the low watermark, and then continue processing when you get to the high watermark. When using persistent data streams, using checkpointing for recovery has a much lower impact on processing because you need only periodically record checkpoints.
Of course, even the mechanism where you replicate a state across a cluster could fail if the entire cluster disappeared. Some of the checkpointing could have issues in the place where you’re recording where the checkpoint has disappeared.
One of the strategies to deal with that is to record checkpoints alongside a target. For example, say you are writing to a database as a target, and the data delivered are the results of stream processing. If you could write the checkpoint information also into that database, you could always pick up from where you left off. That’s because the data is present with the data target that you’re writing to. The information about how the target got into this state is also present within the target. That’s true for databases. It can also be true for messaging systems and some other types of targets, as well.
Understanding recovery brings us to the different types of reliability guarantees that you can have in a streaming integration system. For the purposes of this book, there are at-least-once processing and exactly-once processing.
At-least-once processing is where you guarantee that every incoming message will be read and delivered to a target. But it doesn’t guarantee that it will be delivered to a target only once.
Exactly-once processing is what it sounds like. It means that every incoming event is processed, but you write the results of that processing only once to the target. Once exactly. You do not write it again. You never have duplicates.
Why does this matter? Some recovery scenarios, trying to be extra cautious not to miss anything, might write data or events or other functions into a target where they have already been written. That works satisfactorily in some circumstances, either because the target can deal with it, or downstream processing from the target can handle it.
But there are other cases for which you do need exactly-once processing. This isn’t possible all the time, but theoretically it is possible for situations in which you can write checkpointing information into the target. It’s also possible where you can ask the target, “what was the last event that was given to you?” Then, based on that, the platform can determine the next data it should be given.
By validating the data that’s been written to a target, by writing checkpoints to a target, or by including metadata in the data that you write to a target, you can ensure exactly-once processing.
The final criteria around mission criticality is security (Figure 7-2). The three major issues about security when it comes to stream processing are as follows:
Ensuring that you have policies around who can do what to the data
Encrypting data in flight
Encrypting data at rest
In this section, we’ll discuss these three issues.
The first aspect of security centers around who has access. That means the importance of being able to stipulate which individuals are authorized to work with data streams, who can start and stop applications, and who can view data but do nothing else.
For this type of security, you need to make sure that the stream-processing platform you’re using has end-to-end, policy-based security. You need to be able to define different sets of users, and which kinds of permissions each set possesses.
With leading streaming-integration platforms that also offer analytics, you should also be able to specify which users can build data flows. Perhaps some users are restricted to only setting up initial data sources. Other users can’t access those data sources but can work with preprocessed streams. Still other users can build dashboards and analytics. And finally, those who are often called “end” users, who are only authorized to view the end result of the analytics. They can simply work with the final dashboards, and that’s it.
In summary, you can have developers, and you can have administrators and monitors of data. And then you can have end users who are looking at analytics results. Each has access to different aspects of the data.
Another part of security involves being able to identify intermediate raw data streams that might contain PII or other sensitive data. For example, a data flow might exist in which you obfuscate, or mask, some of that data. Because the data is going to be used for analytics, it’s not necessary to have the real data there. In fact, having the raw data displayed could be a security issue.
In such scenarios, the streaming platform’s authorization mechanism and policies kick in. You could make sure that people only were authorized to access the intermediate data streams, the ones that contained the obfuscated sensitive data.
These kinds of security policies, security restrictions, and user access rights need to be able to be applied across all of the components in a homogeneous way.
Encrypting Data in Flight
The other side of security revolves around data protection, which is ensuring that data is encrypted and thus not visible to unauthorized users. Because streaming platforms typically aren’t persisting data – unless you’re using a persistent data stream – and they’re not writing to intermediate data stores, the most important place that data can be encrypted is within the data streams themselves, when they’re moving from node to node.
They’re going across a network, so it’s essential that a stream-processing platform permits you to encrypt the data as it’s flowing across a network, in an industry-standard way, based on what your company policies are.
But also, if you are utilizing persistent data streams that do write data to disk, those persistent streams should also be encrypted, as well, almost by default, because you are writing it to disk.
Thus, encrypting data in stream, as it’s moving, or if it’s being stored in a persistent data stream, is essential.
Encrypting Data at Rest
When writing data to targets, your streaming platform should have the ability to encrypt data based on the capabilities of the targets. It should be able to provide the correct configuration and credentials to enable whatever encryption is available with that target. However, for the security-conscious organization, encrypting data with its own keys and algorithms, and managing keys security should also be possible through the streaming-integration platform.
Additionally, typically sources and targets often contain important information that must remain secret. A database might possess a password that is needed to connect to it. Or a cloud service might have an access key to allow connection. Your stream processing platform must protect that information by encrypting it in an industry-standard way so that no one can hack in and acquire credentials that give them access to sensitive systems.
These represent the three most critical considerations about security and streaming platforms. The ability to address security within your streaming integration solution is key to supporting mission-critical applications.
1 Clustering refers here to a set of nodes that are ready to run one or more streaming pipeline. There are no jobs in streaming, because they run continually. After the pipelines are deployed to the cluster they run until stopped.