Explore fundamentals of real-time analytics
https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/
Understand batch and stream processing
https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/2-batch-stream
Batch processing, in which multiple data records are collected and stored before being processed together in a single operation
Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur
Understand batch processing
- Data elements are collected and stored
- The whole group is processed together as a batch
- Processing can be scheduled or triggered
- Good
  - Processing large volumes of data at a convenient time
  - Can be scheduled to run when the system is least loaded
- Bad
  - Time delay between ingesting the data and getting the results
  - Data quality must be high, as a single bad record can fail an entire batch
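To make the batch pattern concrete, here is a minimal Python sketch (the file name and the product/amount columns are illustrative assumptions, not part of the module): a whole day's worth of collected records is processed together in one scheduled run.

```python
import csv
from collections import defaultdict

def process_daily_batch(path):
    """Aggregate a day's worth of collected sales records in a single batch run."""
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            # A single malformed record raises an error here and fails the whole
            # batch, which is why data quality matters so much in batch processing.
            totals[row["product"]] += float(row["amount"])
    return dict(totals)

if __name__ == "__main__":
    # In practice this script would be scheduled (for example overnight,
    # when the system is least loaded) rather than run interactively.
    print(process_daily_batch("sales_2024-01-01.csv"))
```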
Understand stream processing
- Data elements are processed as they arrive
- Good
  - Time-critical operations that require an instant result
Differences between batch and streaming data
Data scope
- Batch: can process all the data in the dataset
- Stream: only has access to the most recent data received, or data within a rolling time window
Data size
- Batch: suited to handling large datasets efficiently
- Stream: intended for individual records or micro-batches of a few records
Performance (latency)
- Batch: typically a few hours
- Stream: typically seconds or milliseconds
Analysis
- Batch: typically used for complex analytics
- Stream: typically used for simple response functions, aggregates, or rolling calculations
Combine batch and stream processing
Often, streaming is used to capture and collect the data as it arrives, and the accumulated data is then processed later in batches for deeper analysis
Explore common elements of stream processing architecture
- An event generates some data, for example a reading from an IoT device
- The data is captured in a streaming source for processing, for example a database table or a queue
- The data is processed, often by a perpetual query that filters or aggregates the events
- The results are written to an output (or sink), for example a database table, a dashboard, or another queue for downstream processing
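The sketch below maps these elements onto plain Python, using a standard-library queue as a stand-in for a real streaming source (such as an event hub or Kafka topic) and a list as a stand-in for the sink; the device name and temperature threshold are made up for illustration.

```python
import queue
import random
import threading
import time

events = queue.Queue()  # stand-in for a streaming source (event hub, Kafka topic, ...)
sink = []               # stand-in for an output/sink (database table, dashboard, ...)

def device(device_id, count=5):
    """Event generator: an IoT-style device emitting temperature readings."""
    for _ in range(count):
        events.put({"device": device_id, "temperature": round(random.uniform(18.0, 30.0), 1)})
        time.sleep(0.1)

def processor():
    """Stream processor: handles each event as it arrives and writes results to the sink."""
    while True:
        event = events.get()
        if event is None:                 # sentinel value: stream has ended
            break
        if event["temperature"] > 25.0:   # simple filtering step
            sink.append({**event, "alert": "high temperature"})

producer = threading.Thread(target=device, args=("sensor-1",))
consumer = threading.Thread(target=processor)
producer.start(); consumer.start()
producer.join()
events.put(None)                          # signal the processor to stop
consumer.join()
print(sink)
```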
Real-time analytics in Azure
Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query, and write the results to an output.
Spark Structured Streaming: An open-source library that enables you to develop complex streaming solutions on Apache Spark based services, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.
Azure Data Explorer: A high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.
Sources for stream processing
Azure Event Hubs: A data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once.
Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but which is optimized for managing event data from Internet-of-things (IoT) devices.
Azure Data Lake Store Gen 2: A highly scalable storage service that is often used in batch processing scenarios, but which can also be used as a source of streaming data.
Apache Kafka: An open-source data ingestion solution that is commonly used together with Apache Spark. You can use Azure HDInsight to create a Kafka cluster.
Sinks for stream processing
Azure Event Hubs: Used to queue the processed data for further downstream processing.
Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the processed results as a file.
Azure SQL Database or Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.
Microsoft Power BI: Used to generate real-time data visualizations in reports and dashboards.
Explore Azure Stream Analytics
Azure Stream Analytics is a service for complex event processing and analysis of streaming data. Stream Analytics is used to:
Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage blob container.
Process the data by using a query to select, project, and aggregate data values.
Write the results to an output, such as Azure Data Lake Gen 2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, Azure event hub, Microsoft Power BI, or others.
Azure Stream Analytics is a great technology choice when you need to continually capture data from a streaming source, filter or aggregate it, and send the results to a data store or downstream process for analysis and reporting.
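A Stream Analytics job itself is defined in Azure with a SQL-like query rather than application code, so the hedged Python sketch below only shows the ingestion side: sending events to an Azure event hub that such a job could use as its input (the connection string, hub name, and event fields are placeholders, not values from the module).

```python
# pip install azure-eventhub
import json
from azure.eventhub import EventHubProducerClient, EventData

# Placeholder connection details: substitute your own namespace and hub.
CONNECTION_STR = "Endpoint=sb://<namespace>.servicebus.windows.net/;SharedAccessKeyName=<key-name>;SharedAccessKey=<key>"
EVENTHUB_NAME = "device-telemetry"

producer = EventHubProducerClient.from_connection_string(
    CONNECTION_STR, eventhub_name=EVENTHUB_NAME
)

# Send a small batch of telemetry events. A Stream Analytics job configured with this
# event hub as its input can then filter and aggregate them with a perpetual query
# and write the results to an output such as Azure SQL Database or Power BI.
with producer:
    batch = producer.create_batch()
    for reading in [{"device": "sensor-1", "temperature": 27.3},
                    {"device": "sensor-2", "temperature": 21.8}]:
        batch.add(EventData(json.dumps(reading)))
    producer.send_batch(batch)
```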
Explore Apache Spark on Microsoft Azure
Apache Spark is a distributed processing framework for large scale data analytics. You can use Spark on Microsoft Azure in the following services:
Azure Synapse Analytics
Azure Databricks
Azure HDInsight
Used to run code
- Written in Python, Scala, or Java
- Run in parallel across multiple cluster nodes
- Processes very large volumes of data efficiently
- Supports both batch and stream processing
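As a minimal sketch of the batch side (the storage path and column names are assumptions for illustration), a PySpark job that aggregates a large dataset in parallel across the cluster might look like this:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as sum_

# In Azure Synapse, Databricks, or HDInsight a SparkSession is normally provided
# as `spark`; building one explicitly keeps the sketch self-contained.
spark = SparkSession.builder.appName("spark-batch-example").getOrCreate()

# Hypothetical input: a large set of collected order records in data lake storage.
orders = spark.read.csv(
    "abfss://data@<storage-account>.dfs.core.windows.net/orders/*.csv",
    header=True, inferSchema=True,
)

# The aggregation runs in parallel across the cluster nodes.
totals = orders.groupBy("product").agg(sum_("amount").alias("total_amount"))
totals.show()
```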
Spark Structured Streaming
- API for ingesting, processing, and outputting results from perpetual streams of data
- Built on a dataframe
  - encapsulates a table of data
- You read real-time data, for example from a Kafka topic or an event hub, into a "boundless" dataframe
  - meaning it is continuously populated with new data from the stream
- Define a query that selects, projects, or aggregates the data
- The results of the query generate another dataframe, which can be processed further
Great choice for real-time analytics
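A minimal Structured Streaming sketch, using Spark's built-in rate source as a stand-in for a real stream such as a Kafka topic or an event hub (the 10-second window and console sink are illustrative choices, not from the module):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("structured-streaming-example").getOrCreate()

# The built-in "rate" source continuously generates test rows (timestamp, value);
# a real solution would use format("kafka") or an Event Hubs connector instead.
stream_df = (spark.readStream
             .format("rate")
             .option("rowsPerSecond", 10)
             .load())                     # a "boundless" dataframe

# Perpetual query: count events in 10-second windows; the result is another dataframe.
counts = stream_df.groupBy(window("timestamp", "10 seconds")).count()

# Write the results to a sink; the console sink is used here for illustration.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()                  # runs until the streaming query is stopped
```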
Delta Lake
- An open-source storage layer for data lakes
- Supports
  - transactional consistency
  - schema enforcement
- Unifies storage for batch and streaming data
- Can be used with Spark for both batch and stream processing
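A hedged sketch of Delta Lake as a unified store for batch and streaming writes, assuming a Spark environment where Delta Lake is already available (as in Azure Databricks or an Azure Synapse Spark pool); the table paths and checkpoint location are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()

# Batch write: persist a dataframe as a Delta table (transactional, schema-enforced).
batch_df = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-2", 27.3)], ["device", "temperature"]
)
batch_df.write.format("delta").mode("overwrite").save("/delta/telemetry")

# Streaming write: append events from a stream into Delta storage as well, so batch
# and streaming data share the same storage layer (rate source used as a stand-in).
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
(stream_df.writeStream
 .format("delta")
 .option("checkpointLocation", "/delta/checkpoints/rate_events")
 .start("/delta/rate_events"))

# Batch read of the Delta table written above.
spark.read.format("delta").load("/delta/telemetry").show()
```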
Azure Data Explorer
Azure Data Explorer is a standalone Azure service for efficiently analyzing large volumes of data, such as log and telemetry data
Kusto Query Language (KQL)
- Used to query Data Explorer tables
- Specifically optimized for fast read performance
- Particularly effective for telemetry data that includes a timestamp attribute
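As a hedged sketch (the cluster URL, database, table, and column names are placeholders), a KQL query over timestamped telemetry can be run from Python with the azure-kusto-data SDK:

```python
# pip install azure-kusto-data
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

# Placeholder cluster URL; authentication here uses interactive device-code sign-in.
cluster = "https://<cluster>.<region>.kusto.windows.net"
kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(cluster)
client = KustoClient(kcsb)

# KQL: count telemetry events per device over the last hour, in 5-minute bins.
query = """
Telemetry
| where Timestamp > ago(1h)
| summarize EventCount = count() by DeviceId, bin(Timestamp, 5m)
| order by Timestamp asc
"""

response = client.execute("TelemetryDB", query)
for row in response.primary_results[0]:
    print(row["DeviceId"], row["Timestamp"], row["EventCount"])
```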