This is a mind-map style depiction of the key capabilities to look for when architecting a data-science platform or choosing off-the-shelf products.
When dealing with Predictive Maintenance of machines and systems, some capabilities are essential in the underlying data platform to effectively carry out maintenance Just-In-Time before the actual failure happens.
- Near real-time data ingestion of millions of data points per second
- Ability to apply ML models for predicting machine failure on the ingested data in real-time
- Efficient handling of time-series data, both historical and real-time
- API- and microservices-based integration with downstream and upstream systems, to support the event-driven nature of the use case
- Alerts and Visibility integrated with the downstream systems for end-to-end Automation
Let us look at the key capabilities more closely.
Imminent failure signals can be detected within a few minutes of appearing if real-time data is made available to the predictive analytics system. Failing to act on available signals as soon as they occur makes it impossible to take corrective action within a reasonable time and leads to operational outages.
Real-time situational awareness of overall operations can also only be derived from real-time ingestion of data: real-time visualisation is only possible when real-time data points are available in the system.
Real-time ingestion and processing of telemetry data from sensors can quickly become technically challenging, because it involves:
- Combining and correlating multiple event streams in real-time
- Combining fresh real-time data with large volumes of historical data for trend analysis
- Combining fresh real-time data with static reference data for additional context
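The first and third points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the event fields (`ts`, `machine`, `temp_c`, `vibration_mm_s`), the `MACHINE_REFERENCE` table and both helper functions are hypothetical, and a production system would use a stream-processing engine rather than in-memory lists.

```python
import heapq
from typing import Iterable, Iterator

# Hypothetical static reference data keyed by machine id (illustrative values).
MACHINE_REFERENCE = {
    "pump-1": {"site": "plant-a", "max_temp_c": 80.0},
    "pump-2": {"site": "plant-b", "max_temp_c": 75.0},
}


def merge_streams(*streams: Iterable[dict]) -> Iterator[dict]:
    """Merge several time-ordered event streams into one time-ordered stream."""
    return heapq.merge(*streams, key=lambda event: event["ts"])


def enrich(events: Iterable[dict], reference: dict) -> Iterator[dict]:
    """Attach static reference attributes to each event, if any are known."""
    for event in events:
        context = reference.get(event["machine"], {})
        yield {**event, **context}


# Two hypothetical sensor streams, each already sorted by timestamp.
temperature = [
    {"ts": 1, "machine": "pump-1", "temp_c": 70.0},
    {"ts": 3, "machine": "pump-1", "temp_c": 85.0},
]
vibration = [
    {"ts": 2, "machine": "pump-1", "vibration_mm_s": 4.2},
]

enriched = list(enrich(merge_streams(temperature, vibration), MACHINE_REFERENCE))

# With the reference context attached, a per-machine limit check becomes trivial.
over_limit = [
    e for e in enriched
    if e.get("temp_c", 0.0) > e.get("max_temp_c", float("inf"))
]
```

Merging by timestamp keeps events from different sensors in one ordered stream, and the reference lookup supplies the per-machine limit that the raw reading lacks.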
Now add time-series to this mix.
Why Time Series Database?
Usually time series data has the following characteristics:
- The data may have records containing 100s or even 1000s of attributes
- Data records are generated in time order, where time-intervals can be either uniform or irregular
- The data is generally immutable, e.g. a sensor data point, once recorded at a point in time, remains unaltered. New data points are generated for each new time interval.
- The raw data grows quickly and linearly over time; however, the insights needed from the data are based on various time aggregation functions, such as:
- Min/Max/Averages/Moving Averages/Standard Deviations etc. over various time windows
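As a rough illustration of such aggregations, the sketch below computes min, max, mean and standard deviation over a trailing time window for irregularly spaced samples. The function name `rolling_stats` and the sample values are assumptions made for this example. It recomputes every statistic per sample, which is simple but not how a time-series database would do it internally.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_stats(samples, window_seconds):
    """Yield (ts, min, max, mean, std) over a trailing time window.

    `samples` is an iterable of (timestamp_seconds, value) pairs in time
    order; the intervals between samples may be irregular.
    """
    window = deque()
    for ts, value in samples:
        window.append((ts, value))
        # Drop samples outside the trailing window (ts - window_seconds, ts].
        while window[0][0] <= ts - window_seconds:
            window.popleft()
        values = [v for _, v in window]
        yield ts, min(values), max(values), mean(values), pstdev(values)


# Hypothetical sensor readings: (seconds since start, value).
readings = [(0, 10.0), (5, 12.0), (12, 11.0), (20, 30.0)]
stats = list(rolling_stats(readings, window_seconds=10))
```

Each yielded tuple summarises only the samples from the last ten seconds, so the sudden jump at `ts=20` shows up immediately in the max and mean.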
General-purpose NoSQL databases such as HBase and traditional relational databases such as MySQL are not well equipped to handle time-series data, mainly for the following reasons:
- High IOPS: Time-series data requires a very high write speed (IOPS). The usual transactional databases, which are primarily concerned with consistency, are overwhelmed by millions of records per second.
- Rolling Time Window: Time-series prediction algorithms operate on rolling windows, where a window of consecutive observations is used to predict future samples. This fixed-length window moves from the beginning of the data to the end. Traditional databases are not designed for retrieving data by rolling time windows. Even in batch operations, when the rolling window straddles two files, data from both is required, which makes it hard to process the data in a distributed, and hence timely, manner.
- Data Compression: Time-series data grows quickly and linearly, and disk space concerns limit the:
- Granularity of data that can be stored for historical analysis and ML training
- Amount of historical data that can be stored and made available for ML training
The data platform for Predictive Maintenance use cases should be equipped with a time-series database with built-in compression algorithms, so that more data can be efficiently made available for computation workloads.
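The text does not name a specific compression algorithm. As one assumed example, delta-of-delta encoding is a technique commonly applied to timestamps in time-series storage: when samples arrive at near-regular intervals, most encoded values become zero or very small, and those can then be stored in very few bits. The function names below are hypothetical, and the sketch shows only the integer transform, not the bit-packing.

```python
def delta_of_delta_encode(timestamps):
    """Encode integer timestamps as [first, first_delta, dod, dod, ...]."""
    if not timestamps:
        return []
    encoded = [timestamps[0]]
    prev_ts = timestamps[0]
    prev_delta = 0
    for ts in timestamps[1:]:
        delta = ts - prev_ts
        # Store how much the interval changed, not the interval itself.
        encoded.append(delta - prev_delta)
        prev_ts, prev_delta = ts, delta
    return encoded


def delta_of_delta_decode(encoded):
    """Invert delta_of_delta_encode."""
    if not encoded:
        return []
    timestamps = [encoded[0]]
    delta = 0
    for dod in encoded[1:]:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps


# Hypothetical timestamps at a roughly 10-second interval with slight jitter.
timestamps = [1000, 1010, 1020, 1030, 1041, 1051]
encoded = delta_of_delta_encode(timestamps)  # [1000, 10, 0, 0, 1, -1]
```

The transform is lossless, which matters here because the raw data is treated as immutable and is reused for ML training.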
There are various reasons why serverless is important in this use case.
- Decoupling ML models from proprietary platforms and environment dependencies: Following microservice principles, ML models should be exposed as a RESTful API. This decouples the ML models from the underlying platform, so that they can be ported easily or used remotely by other apps.
- Function as the unit for ML models: An ML model has two distinct phases in its lifecycle. In the first, the model is developed, trained and tested. In the second, the model is deployed to evaluate fresh data points. This evaluation phase is well suited to deployment as functions.
In a serverless architecture, functions are the unit of functionality and scale, which makes it a scalable way to deploy ML models as functions during the evaluation phase of their lifecycle. Each instance of a model can be thought of as an independent function that can be versioned, deployed, invoked, updated or even deleted at any time without compromising the rest of the system.
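The idea of a model instance as an independently versioned function can be illustrated with a small in-memory stand-in. Everything here is hypothetical: `ModelRegistry` is not a real serverless API, and the two scoring functions are hand-written placeholders for trained models.

```python
from typing import Callable, Dict, Tuple

ModelFn = Callable[[dict], float]


class ModelRegistry:
    """In-memory stand-in for a serverless platform's function catalogue."""

    def __init__(self) -> None:
        self._functions: Dict[Tuple[str, str], ModelFn] = {}

    def deploy(self, name: str, version: str, fn: ModelFn) -> None:
        """Deploy a new function, or update an existing name/version."""
        self._functions[(name, version)] = fn

    def delete(self, name: str, version: str) -> None:
        """Remove one version without touching any other."""
        self._functions.pop((name, version), None)

    def invoke(self, name: str, version: str, features: dict) -> float:
        """Evaluate fresh data with one specific model version."""
        return self._functions[(name, version)](features)


def failure_score_v1(features: dict) -> float:
    # Placeholder model: flags overheating only.
    return 1.0 if features["temp_c"] > 80.0 else 0.0


def failure_score_v2(features: dict) -> float:
    # Placeholder model: averages two threshold checks.
    hot = features["temp_c"] > 80.0
    shaking = features["vibration_mm_s"] > 5.0
    return (hot + shaking) / 2


registry = ModelRegistry()
registry.deploy("pump-failure", "1", failure_score_v1)
registry.deploy("pump-failure", "2", failure_score_v2)

sample = {"temp_c": 85.0, "vibration_mm_s": 3.0}
score_v1 = registry.invoke("pump-failure", "1", sample)
score_v2 = registry.invoke("pump-failure", "2", sample)

# Retiring version 1 leaves version 2 fully usable.
registry.delete("pump-failure", "1")
```

Because each version is addressed separately, two versions can be compared on the same input before the older one is retired.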
- Event-driven: Serverless functions are triggered by events. In scenarios such as sensor data analytics for machinery, events occur in real time, and the ML models should be triggered as the events occur for the best possible results. The ML models should be housed in serverless containers; serverless functions can usually be triggered by a REST API call, an MQTT message, a file drop, a schedule and so on.
- Auto-scalability: No run-time management or administration is required for ML model functions deployed as containers. Everything is taken care of by the underlying container management platform, such as Kubernetes. For example, Kubernetes manages the availability, automatic scaling, monitoring, logging and security aspects of the containers. In this way the ML functions can be scaled and managed easily.
- Support for any language / polyglot architecture
Serverless functions usually support most common frameworks and languages capable of binding to web services, language APIs or Spark data provider interfaces. Go, Python, Java, NodeJS, .NET and shell scripts are the most common. AI and ML frameworks that use Python packages, R/CRAN, TensorFlow and so on can therefore all be deployed within a serverless environment, based on the developer's choice.
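To make the event-driven point in the list above concrete, here is a minimal sketch of a trigger-agnostic handler: it takes a JSON event, scores it and returns a JSON response, so the same function could sit behind a REST call, an MQTT message or a file drop. The event shape, the `FAILURE_THRESHOLD` value and the linear `score` function are assumptions made for illustration, and the wiring to an actual trigger is platform-specific and omitted.

```python
import json

FAILURE_THRESHOLD = 0.5  # hypothetical alerting threshold


def score(features: dict) -> float:
    """Placeholder for a trained model, mapping 60-100 degrees C onto 0.0-1.0."""
    return min(1.0, max(0.0, (features["temp_c"] - 60.0) / 40.0))


def handle_event(raw_event: str) -> str:
    """Entry point that a trigger would invoke with the raw event payload."""
    event = json.loads(raw_event)
    probability = score(event["features"])
    return json.dumps({
        "machine": event["machine"],
        "failure_probability": probability,
        "alert": probability >= FAILURE_THRESHOLD,
    })


# Simulate one incoming sensor event.
incoming = json.dumps({"machine": "pump-1", "features": {"temp_c": 90.0}})
response = json.loads(handle_event(incoming))
```

Keeping the handler free of any trigger-specific code is what lets the platform scale and route it independently of how events arrive.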