Explore fundamentals of real-time analytics
https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/
Understand batch and stream processing
https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/2-batch-stream
Batch processing, in which multiple data records are collected and stored before being processed together in a single operation
Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur
Understand batch processing
- Data elements are collected and stored
- The whole group is processed together as a batch
- Processing can be scheduled or triggered
- Good
  - Processing large volumes of data at a convenient time
  - Can be scheduled to run when the system is least loaded
- Bad
  - Time delay between ingesting the data and getting the results
  - Data quality must be high, as a single bad record can fail an entire batch
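To make the batch pattern concrete, here is a minimal Python sketch (the file name and the product/amount columns are illustrative assumptions, not part of the module): a whole day's worth of collected records is processed together in one scheduled run.

```python
import csv
from collections import defaultdict

def process_daily_batch(path):
    """Aggregate a day's worth of collected sales records in a single batch run."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # A single malformed record raises an error here and fails the whole
            # batch, which is why data quality matters so much in batch processing.
            totals[row["product"]] += float(row["amount"])
    return dict(totals)

if __name__ == "__main__":
    # In practice this script would be scheduled (for example overnight,
    # when the system is least loaded) rather than run interactively.
    print(process_daily_batch("sales_2024-01-01.csv"))
```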
Understand stream processing
- Data elements are processed as they arrive
- Good
  - Time-critical operations that require an instant result
Differences between batch and streaming data
Data scope
- Batch: can process all the data in the dataset
- Stream: only has access to the most recent data received, or data within a rolling time window
Data size
- Batch: suited to handling large datasets efficiently
- Stream: intended for individual records or micro-batches of a few records
Performance (latency)
- Batch: typically a few hours
- Stream: typically seconds or milliseconds
Analysis
- Batch: typically used for complex analytics
- Stream: typically used for simple response functions, aggregates, or rolling calculations
Combine batch and stream processing
Often, streaming is used to capture and collect the data as it arrives, and the accumulated data is then processed later in batches for deeper analysis
Explore common elements of stream processing architecture
- An event generates some data, for example a reading from an IoT device
- The data is captured in a streaming source for processing, for example a database table or a queue
- The data is processed, often by a perpetual query that filters or aggregates the events
- The results are written to an output (or sink), for example a database table, a dashboard, or another queue for downstream processing
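The sketch below maps these elements onto plain Python, using a standard-library queue as a stand-in for a real streaming source (such as an event hub or Kafka topic) and a list as a stand-in for the sink; the device name and temperature threshold are made up for illustration.

```python
import queue
import random
import threading
import time

events = queue.Queue()  # stand-in for a streaming source (event hub, Kafka topic, ...)
sink = []               # stand-in for an output/sink (database table, dashboard, ...)

def device(device_id, count=5):
    """Event generator: an IoT-style device emitting temperature readings."""
    for _ in range(count):
        events.put({"device": device_id, "temperature": round(random.uniform(18.0, 30.0), 1)})
        time.sleep(0.1)

def processor():
    """Stream processor: handles each event as it arrives and writes results to the sink."""
    while True:
        event = events.get()
        if event is None:                 # sentinel value: stream has ended
            break
        if event["temperature"] > 25.0:   # simple filtering step
            sink.append({**event, "alert": "high temperature"})

producer = threading.Thread(target=device, args=("sensor-1",))
consumer = threading.Thread(target=processor)
producer.start(); consumer.start()
producer.join()
events.put(None)                          # signal the processor to stop
consumer.join()
print(sink)
```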
Real-time analytics in Azure
Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query, and write the results to an output.
Spark Structured Streaming: An open-source library that enables you to develop complex streaming solutions on Apache Spark based services, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
Azure Data Explorer: A high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.
Sources for stream processing
Azure Event Hubs: A data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once.
Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but which is optimized for managing event data from Internet-of-things (IoT) devices.
Azure Data Lake Store Gen 2: A highly scalable storage service that is often used in batch processing scenarios, but which can also be used as a source of streaming data.
Apache Kafka: An open-source data ingestion solution that is commonly used together with Apache Spark. You can use Azure HDInsight to create a Kafka cluster.
Sinks for stream processing
Azure Event Hubs: Used to queue the processed data for further downstream processing.
Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the processed results as a file.
Azure SQL Database or Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.
Microsoft Power BI: Used to generate real-time data visualizations in reports and dashboards.
Explore Azure Stream Analytics
Azure Stream Analytics is a service for complex event processing and analysis of streaming data. Stream Analytics is used to:
Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage blob container.
Process the data by using a query to select, project, and aggregate data values.
Write the results to an output, such as Azure Data Lake Gen 2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, Azure event hub, Microsoft Power BI, or others.
Azure Stream Analytics is a great technology choice when you need to continually capture data from a streaming source, filter or aggregate it, and send the results to a data store or downstream process for analysis and reporting.
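A Stream Analytics job itself is defined in Azure with a SQL-like query rather than application code, so the hedged Python sketch below only shows the ingestion side: sending events to an Azure event hub that such a job could use as its input (the connection string, hub name, and event fields are placeholders, not values from the module).

```python
# pip install azure-eventhub
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details: substitute your own namespace and hub.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"
EVENTHUB_NAME = "device-telemetry"

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

# Send a small batch of telemetry events. A Stream Analytics job configured with this
# event hub as its input can then filter and aggregate them with a perpetual query
# and write the results to an output such as Azure SQL Database or Power BI.
with producer:
    batch = producer.create_batch()
    for reading in [{"device": "sensor-1", "temperature": 27.3},
                    {"device": "sensor-2", "temperature": 21.8}]:
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```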
Explore Apache Spark on Microsoft Azure
Apache Spark is a distributed processing framework for large scale data analytics. You can use Spark on Microsoft Azure in the following services:
Azure Synapse Analytics
Azure Databricks
Azure HDInsight
Used to run code
- Written in Python, Scala, or Java
- Run in parallel across multiple cluster nodes
- Processes very large volumes of data efficiently
- Supports both batch and stream processing
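As a minimal sketch of the batch side (the storage path and column names are assumptions for illustration), a PySpark job that aggregates a large dataset in parallel across the cluster might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

# In Azure Synapse, Databricks, or HDInsight a SparkSession is normally provided
# as `spark`; building one explicitly keeps the sketch self-contained.
spark = SparkSession.builder.appName("spark-batch-example").getOrCreate()

# Hypothetical input: a large set of collected order records in data lake storage.
orders = spark.read.csv(
    "abfss://data@<storage-account>.dfs.core.windows.net/orders/*.csv",
    header=True, inferSchema=True,
)

# The aggregation runs in parallel across the cluster nodes.
totals = orders.groupBy("product").agg(sum_("amount").alias("total_amount"))
totals.show()
```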
Spark Structured Streaming
- API for ingesting, processing, and outputting results from perpetual streams of data
- Built on a dataframe
  - encapsulates a table of data
- You read real-time data, for example from a Kafka topic or an event hub, into a "boundless" dataframe
  - meaning it is continuously populated with new data from the stream
- Define a query that selects, projects, or aggregates the data
- The results of the query generate another dataframe, which can be processed further
Great choice for real-time analytics
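A minimal Structured Streaming sketch, using Spark's built-in rate source as a stand-in for a real stream such as a Kafka topic or an event hub (the 10-second window and console sink are illustrative choices, not from the module):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming-example").getOrCreate()

# The built-in "rate" source continuously generates test rows (timestamp, value);
# a real solution would use format("kafka") or an Event Hubs connector instead.
stream_df = (spark.readStream
             .format("rate")
             .option("rowsPerSecond", 10)
             .load())                     # a "boundless" dataframe

# Perpetual query: count events in 10-second windows; the result is another dataframe.
counts = stream_df.groupBy(window("timestamp", "10 seconds")).count()

# Write the results to a sink; the console sink is used here for illustration.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()                  # runs until the streaming query is stopped
```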
Delta Lake
- An open-source storage layer for data lakes
- Supports
  - transactional consistency
  - schema enforcement
- Unifies storage for batch and streaming data
- Can be used with Spark for both batch and stream processing
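A hedged sketch of Delta Lake as a unified store for batch and streaming writes, assuming a Spark environment where Delta Lake is already available (as in Azure Databricks or an Azure Synapse Spark pool); the table paths and checkpoint location are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Batch write: persist a dataframe as a Delta table (transactional, schema-enforced).
batch_df = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-2", 27.3)], ["device", "temperature"]
)
batch_df.write.format("delta").mode("overwrite").save("/delta/telemetry")

# Streaming write: append events from a stream into Delta storage as well, so batch
# and streaming data share the same storage layer (rate source used as a stand-in).
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
(stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/checkpoints/rate_events")
 .start("/delta/rate_events"))

# Batch read of the Delta table written above.
spark.read.format("delta").load("/delta/telemetry").show()
```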
Azure Data Explorer
Azure Data Explorer is a standalone Azure service for efficiently analyzing large volumes of data, such as log and telemetry data
Kusto Query Language (KQL)
- Used to query Data Explorer tables
- Specifically optimized for fast read performance
- Particularly effective for telemetry data that includes a timestamp attribute
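As a hedged sketch (the cluster URL, database, table, and column names are placeholders), a KQL query over timestamped telemetry can be run from Python with the azure-kusto-data SDK:

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URL; authentication here uses interactive device-code sign-in.
cluster = "https://<cluster>.<region>.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# KQL: count telemetry events per device over the last hour, in 5-minute bins.
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize EventCount = count() by DeviceId, bin(Timestamp, 5m)
| order by Timestamp asc
"""

response = client.execute("TelemetryDB", query)
for row in response.primary_results[0]:
    print(row["DeviceId"], row["Timestamp"], row["EventCount"])
```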