Building AI Models for High-Frequency Streaming Data – Part Two
Many data scientists have implemented machine or deep learning algorithms on static data or in batch, but what considerations must you make when building models for a streaming environment? In this post, we will discuss these considerations.
By Heather Gorr, Ph.D., Senior MATLAB Product Manager, MathWorks
AI continues making headlines in the data science community, and predictive models are front and center in engineering applications such as autonomous driving and equipment monitoring. Introducing AI models into engineering systems can be challenging, however, especially when predictions must be reported in near real-time on data from multiple sensors.
This post discusses what to consider when building models for a streaming environment. It is Part 2 of a two-part blog series, following the Part 1 topic of data management and strategies for aligning times and resampling data.
Streaming high-frequency data
What is streaming? If streaming movies or music comes to mind, you’ve got the right idea! Data is incoming continuously, but instead of simply watching, actions must be taken based on the information. Therefore, predictions must be made and reported continuously.
So, what does this mean for an AI model? Consider an example of predicting equipment failure using sensors for temperature, pressure, and current. The flow looks something like this:
Figure 1: Diagram of streaming workflow. © 1984–2020 The MathWorks, Inc.
The raw sensor data is passed to a messaging service for initial data management. Then additional data processing and model predictions are performed. The model is updated based on recent data, and results are sent to a dashboard (repeatedly!).
The first step is to plan out the system with the team. It is important to capture requirements and decide on parameters throughout the system before building anything. It is also helpful to build a full streaming prototype as early as possible, then come back to tune algorithms.
In our example, we used Apache Kafka for messaging, which is a distributed streaming platform with APIs for many languages to facilitate reading and writing data to the stream. One of those APIs is a MATLAB interface, which we used here. We can also specify how to manage out-of-order data, buffering, and other parameters ideal for high-frequency data.
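The idea behind handling out-of-order data can be sketched independently of Kafka. The following is a minimal illustration in Python, not the MATLAB interface used in our example: the `ReorderBuffer` class, its `max_delay` parameter, and the message layout are all hypothetical. It holds messages in a min-heap keyed by timestamp and releases them in time order once the newest timestamp seen is far enough ahead.

```python
import heapq


class ReorderBuffer:
    """Hypothetical buffer that releases messages in timestamp order.

    A message is held until the newest timestamp seen so far is at least
    `max_delay` seconds ahead of it, giving late arrivals a chance to land.
    """

    def __init__(self, max_delay):
        self.max_delay = max_delay
        self._heap = []              # min-heap of (timestamp, sequence, payload)
        self._newest = float("-inf")
        self._count = 0              # tie-breaker so payloads are never compared

    def push(self, timestamp, payload):
        """Add one message; return the messages that are now safe to emit."""
        heapq.heappush(self._heap, (timestamp, self._count, payload))
        self._count += 1
        self._newest = max(self._newest, timestamp)
        ready = []
        while self._heap and self._heap[0][0] <= self._newest - self.max_delay:
            ts, _, item = heapq.heappop(self._heap)
            ready.append((ts, item))
        return ready

    def flush(self):
        """Emit everything still buffered, in timestamp order."""
        ready = []
        while self._heap:
            ts, _, item = heapq.heappop(self._heap)
            ready.append((ts, item))
        return ready
```

The trade-off is latency versus ordering: a larger `max_delay` tolerates later arrivals but delays every prediction, and a message that arrives later than `max_delay` would still be emitted out of order in this sketch.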
One important parameter to consider is the time window. It controls how much data enters the system for each prediction, so you must decide on it before approaching data prep or model training. In our example, we chose one second, which is reasonable for the mathematical assumptions and model updates.
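To make the time window concrete, here is a small Python sketch that groups timestamped samples into fixed, non-overlapping windows. The function name `tumbling_windows` and the `(timestamp, value)` sample layout are assumptions for illustration; in a real system the messaging layer would typically do this grouping.

```python
from collections import defaultdict


def tumbling_windows(samples, window_seconds=1.0):
    """Group (timestamp, value) samples into fixed, non-overlapping windows.

    Returns a list of (window_start, [values]) pairs sorted by start time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Floor division assigns each sample to the window that contains it.
        index = int(timestamp // window_seconds)
        buckets[index].append(value)
    return [(index * window_seconds, buckets[index]) for index in sorted(buckets)]
```

Everything downstream (feature extraction, prediction, model update) then operates on one of these windows at a time, which is why the window length has to be fixed early.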
Data preparation for machine learning
Part 1 of this series focused on time alignment and synchronization of the sensor data. Now let’s think about representing the data to train a model. First, you need failure data to predict failures. Don’t worry, there’s no need to break your equipment (repeatedly) if you don’t have enough, as failure scenarios can be simulated! In our example, we apply various faults to a physical model using Simulink. We used the generated data from many simulations, along with the experimental data, to train the model.
Figure 2: Physical model of pump simulating failure data using Simulink. © 1984–2020 The MathWorks, Inc.
Since only one second of data is passing through the stream, it’s important to represent the most information (and least noise). It’s common to use features from the frequency domain like the FFT and power spectrum, as in our case. We live in the time domain, so the frequency domain might sound uncomfortable. But this just means we’re analyzing the data with respect to frequency instead of time. We won’t get into it here, but you can learn more with examples on signal prep for machine and deep learning and a practical introduction to time-frequency analysis.
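As a rough illustration of a frequency-domain feature, the Python sketch below computes a one-sided power spectrum for a single window and picks out the strongest frequency. It uses a direct discrete Fourier transform from the standard library purely for readability; the function names are hypothetical, and a production pipeline would use an optimized FFT routine instead.

```python
import cmath
import math


def power_spectrum(signal, sample_rate):
    """Return (frequencies, power) for the one-sided spectrum of a real signal.

    Uses a direct O(n^2) discrete Fourier transform for clarity; a real
    pipeline would use an FFT routine instead.
    """
    n = len(signal)
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(signal))
        freqs.append(k * sample_rate / n)
        power.append(abs(coeff) ** 2 / n)
    return freqs, power


def dominant_frequency(signal, sample_rate):
    """Frequency in Hz of the strongest non-DC component."""
    freqs, power = power_spectrum(signal, sample_rate)
    peak = max(range(1, len(power)), key=lambda k: power[k])
    return freqs[peak]
```

For a one-second window sampled at 64 Hz that contains a pure 5 Hz sine wave, `dominant_frequency` picks out the 5 Hz bin. Features like this compress a whole window into a few numbers that a model can learn from.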
AI modeling approach
There are many resources for comparing various algorithms, so let’s focus on how streaming affects the choice of model. In general, models suited to time series and forecasting are used frequently and include:
- Traditional time-series models (curve fitting, ARIMA, GARCH)
- Machine learning models (nonlinear: trees, SVMs, Gaussian processes)
- Deep learning models (multilayer perceptron, CNNs, LSTMs, TCNs)
Any of these could work in our example, but there are several key aspects to consider first for streaming. The training data set includes only one second of data at a time, so the algorithm must be capable of learning under this condition and be robust to noise. Also, the model needs to be updated over time as new data enters the system, without retraining on historical data. The model predictions and updates must also be fast and easily distributed, which can greatly influence the choice of algorithm. Generally, keep it simple when streaming.
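What "updated over time without retraining on historical data" looks like in code can be sketched with a deliberately simple model. The Python class below is an online logistic regression that takes one gradient step per labeled sample. It is a stand-in chosen for brevity, not the model used in our example, and the class name, features, and learning rate are all assumptions.

```python
import math


class OnlineLogisticModel:
    """Hypothetical binary fault classifier trained one sample at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, features):
        """Probability that the sample is a fault (label 1)."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        z = max(-30.0, min(30.0, z))   # clamp to keep exp() well behaved
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, features, label):
        """One stochastic gradient step on a single labeled sample."""
        error = self.predict_proba(features) - label
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias -= self.learning_rate * error
```

Each call to `update` touches only the current sample and the model's own state, so the cost per window stays constant no matter how much data has already passed through the stream.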
In our example, we prioritized getting the streaming prototype running in production, so we needed to select and train a model quickly. We used the Classification Learner and Deep Network Designer apps in MATLAB to explore models, then exported the most accurate model. We used a classification tree ensemble for predicting faults and regression for estimating the remaining lifetime, both of which are fast and updateable in the stream.
Figure 3: Training multiple models using MATLAB Classification Learner app. © 1984–2020 The MathWorks, Inc.
Once the model is trained and validated, we can start integrating. The steps for data prep, model prediction, and updating the model state are performed in a single function. This function accepts the window of data and the model as inputs and returns the predictions and updated model as outputs. With this signature, the model can be easily cached in memory to facilitate rapid updates while avoiding additional network latency. Here, we used an open source data structure for caching and storing state, included with MATLAB Production Server, which made it easy to integrate and test the model caching within the streaming environment.
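The function signature described above can be sketched as follows. This Python version is only an illustration of the pattern: the names `process_window`, `handle_message`, and `MODEL_CACHE` are hypothetical, the "model" is a toy running-mean baseline rather than a trained classifier, and a plain dictionary stands in for the caching service.

```python
def process_window(window, model):
    """Hypothetical stream function: (window, model) -> (prediction, model).

    `window` is a list of sensor readings for one time window; `model` is a
    plain dict holding a running mean and a count of windows seen.
    """
    # Data prep: reduce the window to a single feature.
    feature = sum(window) / len(window)
    baseline = model["mean"]
    # Model prediction: flag a fault if the window strays far from the baseline.
    is_fault = model["count"] > 0 and abs(feature - baseline) > model["tolerance"]
    prediction = "fault" if is_fault else "normal"
    # Model update: fold the new window into the running mean.
    count = model["count"] + 1
    updated = dict(model, count=count, mean=baseline + (feature - baseline) / count)
    return prediction, updated


MODEL_CACHE = {}   # in-memory cache keyed by a hypothetical equipment id


def handle_message(equipment_id, window):
    """Load the cached model, run one step, and store the updated model."""
    default = {"mean": 0.0, "count": 0, "tolerance": 5.0}
    model = MODEL_CACHE.get(equipment_id, default)
    prediction, MODEL_CACHE[equipment_id] = process_window(window, model)
    return prediction
```

Because `process_window` takes the model in and hands the updated model back, it keeps no hidden state of its own. That makes it straightforward to swap the dictionary for an external cache and to test the function in isolation.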
Putting it all together
Planning is crucial for streaming. Capture the requirements for the time window, data types, and other expectations throughout the stream, and communicate them during the development process. In addition, using standard software practices like source control, documentation, and unit testing will help facilitate development.
It is also important to ease the code handoffs with teammates. For example, as data scientists, we may be sharing our data prep and modeling with a system architect. In our example, we used MATLAB to create a library with our code and model, and the library can be called from many programming languages. Creating the library captures dependencies and generates a readme file for the integration steps. We also used the testing environment to run our code via a local host within the live streaming architecture, which is helpful for debugging.
Figure 4: Testing code in streaming environment via local client in MATLAB Compiler SDK. © 1984–2020 The MathWorks, Inc.
Implementing AI models in streaming applications can be challenging. In this post, we discussed considerations for training and implementing models for streaming systems. It is important to consider the requirements from the different parts of the system before approaching data prep and algorithm development. Many common models for time series are appropriate, but the need for the model to be updated over time will influence the choice of algorithm. Caching the model also helps maintain the low latency needed in these systems. Tools like MATLAB and Apache Kafka can help integrate the data prep and AI modeling into the streaming architecture for an easier implementation.
To learn more about streaming and deploying AI, see the resources below or email me at firstname.lastname@example.org.