Explore fundamentals of real-time analytics

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/

Understand batch and stream processing

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/2-batch-stream

Batch processing, in which multiple data records are collected and stored before being processed together in a single operation

Stream processing, in which a source of data is constantly monitored and processed in real time as new data events occur

Understand batch processing

  • Data elements are collected and stored
  • The whole group is processed together as a batch
  • Processing can be scheduled or triggered
  • Good
    • Processing large volumes of data at a convenient time
    • Can be scheduled to run when the system is least loaded
  • Bad
    • There is a time delay between ingesting the data and getting the results
    • Data quality must be high, as a single bad record can fail an entire batch
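The batch model above can be sketched in a few lines of Python (record and field names are illustrative): records accumulate first, then one operation processes the whole group, and a single malformed record aborts the entire run.

```python
# Hypothetical collected records awaiting a scheduled batch run.
collected_records = [
    {"id": 1, "amount": 19.99},
    {"id": 2, "amount": 5.00},
    {"id": 3, "amount": 42.50},
]

def process_batch(records):
    """Process the whole group together in a single operation."""
    total = 0.0
    for record in records:
        # One bad record (e.g. a missing field) fails the entire batch.
        if "amount" not in record:
            raise ValueError(f"bad record {record!r}: batch aborted")
        total += record["amount"]
    return {"count": len(records), "total": total}

result = process_batch(collected_records)
```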

Understand stream processing

  • Data elements are processed as they arrive
  • Good
    • Time-critical operations that require an instant result
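By contrast, a stream handler (a minimal sketch, with hypothetical event fields) processes each element the moment it arrives and can produce an instant result after every event:

```python
def handle_event(event, state):
    """Process a data element as it arrives, keeping a running aggregate."""
    state["count"] += 1
    state["total"] += event["amount"]
    # An up-to-date result is available immediately after each event.
    return state["total"] / state["count"]

state = {"count": 0, "total": 0.0}
averages = [handle_event(e, state) for e in (
    {"amount": 10.0}, {"amount": 20.0}, {"amount": 30.0},
)]
# averages holds the running mean after each event
```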

Differences between batch and streaming data

Data scope

  • Batch: can process all the data in the dataset
  • Stream: only has access to the most recent data received

Data size

  • Batch: large datasets
  • Stream: individual records or micro-batches of a few records

Performance

  • Batch: latency of a few hours is typical
  • Stream: results produced immediately, typically within seconds or milliseconds

Analysis

  • Batch: complex analytics
  • Stream: simple aggregations and calculations (such as rolling averages)

Combine batch and stream processing

Many solutions combine the two: streaming is used to capture the data in real time, which is then stored and processed later as a batch.
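This capture-then-batch pattern can be sketched with a file standing in for a storage sink (the file and field names are hypothetical): the streaming side appends each event as it arrives, and the batch side later reads and processes everything that accumulated.

```python
import json
import os
import tempfile

# Streaming side: append each event to storage as it arrives.
storage = tempfile.NamedTemporaryFile(mode="w", suffix=".jsonl", delete=False)
for event in ({"sensor": "a", "value": 1}, {"sensor": "a", "value": 2},
              {"sensor": "b", "value": 5}):
    storage.write(json.dumps(event) + "\n")
storage.close()

# Batch side: later, read all accumulated events and process them together.
with open(storage.name) as f:
    events = [json.loads(line) for line in f]
totals = {}
for e in events:
    totals[e["sensor"]] = totals.get(e["sensor"], 0) + e["value"]
os.unlink(storage.name)
```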


Explore common elements of stream processing architecture

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/3-explore-common-elements

  1. An event generates some data, for example an IoT device emitting a sensor reading
  2. The data is captured in a streaming source for processing, for example a database table or a queue
  3. The data is processed
  4. The results are written to an output (or sink), for example a database table, a dashboard, or another queue for downstream processing

Real-time analytics in Azure

Azure Stream Analytics: A platform-as-a-service (PaaS) solution that you can use to define streaming jobs that ingest data from a streaming source, apply a perpetual query, and write the results to an output.

Spark Structured Streaming: An open-source library that enables you to develop complex streaming solutions on Apache Spark based services, including Azure Synapse Analytics, Azure Databricks, and Azure HDInsight.

Azure Data Explorer: A high-performance database and analytics service that is optimized for ingesting and querying batch or streaming data with a time-series element, and which can be used as a standalone Azure service or as an Azure Synapse Data Explorer runtime in an Azure Synapse Analytics workspace.

Sources for stream processing

Azure Event Hubs: A data ingestion service that you can use to manage queues of event data, ensuring that each event is processed in order, exactly once.

Azure IoT Hub: A data ingestion service that is similar to Azure Event Hubs, but which is optimized for managing event data from Internet-of-things (IoT) devices.

Azure Data Lake Store Gen 2: A highly scalable storage service that is often used in batch processing scenarios, but which can also be used as a source of streaming data.

Apache Kafka: An open-source data ingestion solution that is commonly used together with Apache Spark. You can use Azure HDInsight to create a Kafka cluster.

Sinks for stream processing

Azure Event Hubs: Used to queue the processed data for further downstream processing.

Azure Data Lake Store Gen 2 or Azure blob storage: Used to persist the processed results as a file.

Azure SQL Database or Azure Synapse Analytics, or Azure Databricks: Used to persist the processed results in a database table for querying and analysis.

Microsoft Power BI: Used to generate real time data visualizations in reports and dashboards.


Explore Azure Stream Analytics

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/4-stream-analytics

Azure Stream Analytics is a service for complex event processing and analysis of streaming data. Stream Analytics is used to:

Ingest data from an input, such as an Azure event hub, Azure IoT Hub, or Azure Storage blob container.

Process the data by using a query to select, project, and aggregate data values.

Write the results to an output, such as Azure Data Lake Gen 2, Azure SQL Database, Azure Synapse Analytics, Azure Functions, Azure event hub, Microsoft Power BI, or others.

Azure Stream Analytics is a great technology choice when you need to continually capture data from a streaming source, filter or aggregate it, and send the results to a data store or downstream process for analysis and reporting.
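A Stream Analytics job expresses its perpetual query in a SQL-like language; the select/filter/aggregate idea it applies can be illustrated in plain Python (not the actual Stream Analytics syntax) with a tumbling-window average over hypothetical timestamped events:

```python
from collections import defaultdict

# Hypothetical event stream with timestamps in seconds.
events = [
    {"ts": 1, "device": "d1", "temp": 20.0},
    {"ts": 4, "device": "d1", "temp": 30.0},
    {"ts": 11, "device": "d1", "temp": 25.0},
]

def tumbling_average(stream, window_seconds):
    """Aggregate per fixed, non-overlapping time window, mimicking the
    kind of perpetual aggregation a streaming query applies."""
    windows = defaultdict(list)
    for e in stream:
        windows[e["ts"] // window_seconds].append(e["temp"])
    return {w: sum(vals) / len(vals) for w, vals in sorted(windows.items())}

result = tumbling_average(events, 10)
```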


Explore Apache Spark on Microsoft Azure

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/6-spark-streaming

Apache Spark is a distributed processing framework for large scale data analytics. You can use Spark on Microsoft Azure in the following services:

  • Azure Synapse Analytics

  • Azure Databricks

  • Azure HDInsight

  • Runs code written in Python, Scala, or Java

  • Runs tasks in parallel across multiple cluster nodes

  • Processes very large volumes of data efficiently

  • Supports both batch and stream processing

Spark Structured Streaming

  • API for ingesting, processing, and outputting results from perpetual streams of data
  • Built on a dataframe
    • encapsulates a table of data
  • You read real-time data, such as from a Kafka topic or an event hub, into an “unbounded” dataframe
    • meaning it is continuously populated with new data from the stream
  • Define a query that selects, projects, or aggregates the data
  • The results of the query generate another dataframe which can be processed further

Great choice for real-time analytics
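The unbounded-dataframe idea can be sketched conceptually in plain Python (this is not the real PySpark API, which uses `readStream`/`writeStream`): each micro-batch appends rows to a growing table, and the query's aggregate is recomputed to yield a new result dataframe.

```python
# Conceptual sketch only: a table continuously populated from the stream.
unbounded_table = []
results = []  # one result "dataframe" per micro-batch

def query(rows):
    """The select/aggregate step: count rows and average a value column."""
    values = [r["value"] for r in rows]
    return {"count": len(values), "avg": sum(values) / len(values)}

micro_batches = [
    [{"value": 10.0}],
    [{"value": 20.0}, {"value": 30.0}],
]
for batch in micro_batches:
    unbounded_table.extend(batch)       # new data arrives from the stream
    results.append(query(unbounded_table))  # query re-applied to the table
```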

Delta Lake

  • Open-source
  • Storage layer
  • Supports
    • transactional consistency
    • schema enforcement
  • Unifies storage for batch and streaming data
  • Supports Spark

Azure Data Explorer

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/8-data-explorer

Azure Data Explorer is a standalone Azure service for efficiently analyzing data

Kusto Query Language (KQL)

  • Used to query Data Explorer tables
  • Specifically optimized for fast read performance
  • Particularly effective for telemetry data with a timestamp attribute
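The typical shape of such a telemetry query, restricting rows to a recent time range and then summarizing, can be sketched in Python (hypothetical data; in KQL this would use operators like `where` and `summarize`):

```python
from datetime import datetime, timedelta

# Hypothetical telemetry rows, each carrying a timestamp attribute.
now = datetime(2024, 1, 1, 12, 0, 0)
telemetry = [
    {"ts": now - timedelta(minutes=90), "device": "d1"},
    {"ts": now - timedelta(minutes=10), "device": "d1"},
    {"ts": now - timedelta(minutes=5), "device": "d2"},
]

# Filter to the last hour, then summarize a count per device.
cutoff = now - timedelta(hours=1)
recent = [row for row in telemetry if row["ts"] >= cutoff]
counts = {}
for row in recent:
    counts[row["device"]] = counts.get(row["device"], 0) + 1
```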

Explore Azure Synapse Data Explorer

https://docs.microsoft.com/en-us/learn/modules/explore-fundamentals-stream-processing/9-exercise-data-explorer

Last modified July 21, 2024: update (e2ae86c)