Explore fundamentals of modern data warehousing

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/

  • Infrastructure and processes used to support large-scale data analytics.
  • Supports business intelligence (BI), which typically involves copying data from transactional data stores into a relational database with a schema that’s optimized for querying and building multidimensional models.
  • Large volumes of data in multiple formats are batch loaded or captured in real-time streams and stored in a data lake, from which distributed processing engines like Apache Spark process the data at scale.

Describe modern data warehousing

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/2-describe-warehousing

  • Data ingestion and processing

    • data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse.
    • The load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis.
    • In ETL processes, the data is transformed before being loaded into an analytical store, while in an ELT process the data is copied to the store and then transformed.
    • Either way, the resulting data structure is optimized for analytical queries.
    • The data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters.
    • Data ingestion includes both batch processing of static data and real-time processing of streaming data.
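The ETL pattern above can be sketched in a few lines of plain Python. This is a toy illustration with made-up data, not an Azure pipeline: records are extracted from a source, transformed (cleaned, filtered, restructured), and only then loaded into the analytical store.

```python
# Toy ETL sketch (hypothetical data): transform before load, so only
# cleaned, well-typed rows reach the "analytical store" (here, a list).

raw_sales = [
    {"order_id": 1, "amount": "120.50", "region": "west"},
    {"order_id": 2, "amount": "bad",    "region": "east"},   # malformed row
    {"order_id": 3, "amount": "75.00",  "region": "west"},
]

def transform(records):
    """Clean and restructure: parse amounts, drop malformed rows."""
    cleaned = []
    for r in records:
        try:
            cleaned.append({"order_id": r["order_id"],
                            "amount": float(r["amount"]),
                            "region": r["region"].upper()})
        except ValueError:
            continue  # filter out rows that fail cleaning
    return cleaned

analytical_store = []                          # stand-in for the target store
analytical_store.extend(transform(raw_sales)) # ETL: transform, then load
print(len(analytical_store))                  # 2 rows survive cleaning
```

In an ELT variant, the raw rows would be appended to the store first and `transform` would run inside the store afterwards; either way the result is the same query-optimized structure.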
  • Analytical data store

    • data stores for large scale analytics include relational data warehouses, file-system based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases).
  • Analytical data model

    • while data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce
    • reports
    • dashboards
    • interactive visualizations.
  • Often these data models are described as cubes, in which numeric data values are aggregated across one or more dimensions (for example, to determine total sales by product and region).

  • The model encapsulates the relationships between data values and dimensional entities to support “drill-up/drill-down” analysis.
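The cube idea above can be shown with a tiny aggregation sketch (hypothetical data): numeric values are totaled across two dimensions, and "drilling up" rolls one dimension away.

```python
# Minimal "cube"-style aggregation sketch (made-up sales data):
# totals by (product, region), then drill-up to totals by region only.

from collections import defaultdict

sales = [
    ("bike",  "west", 100),
    ("bike",  "east", 150),
    ("skate", "west",  50),
]

# Drill-down level: totals by (product, region)
by_product_region = defaultdict(int)
for product, region, amount in sales:
    by_product_region[(product, region)] += amount

# Drill-up: roll the product dimension away, leaving totals by region
by_region = defaultdict(int)
for (product, region), amount in by_product_region.items():
    by_region[region] += amount

print(by_product_region[("bike", "west")])  # 100
print(by_region["west"])                    # 150
```

A real analytical model pre-computes aggregates like these so reports and dashboards don't re-scan the raw data.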

  • Data visualization

    • data analysts consume data from analytical models and directly from analytical stores to create reports, dashboards, and other visualizations

    • users in an organization who may not be technology professionals might perform self-service data analysis and reporting
    • The visualizations from the data show trends, comparisons, and key performance indicators (KPIs) for a business or other organization, and can take the form of printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive environments in which users can explore data visually.

Explore data ingestion pipelines

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/3-data-ingestion-pipelines

explore how data is ingested into an analytical data store from one or more sources.

  • large-scale data ingestion is best implemented by creating pipelines

  • pipelines orchestrate the ETL processes

  • Azure Data Factory creates and runs pipelines

  • Azure Synapse Analytics can use the same pipeline engine, meaning you can consolidate the data warehouse solution into one workspace

  • a pipeline consists of one or more activities that operate on data

  • an input dataset provides the source data

  • activities are defined that incrementally manipulate the data until an output dataset is produced

  • pipelines use linked services to load and process data, for example:

    • Azure Blob Storage to ingest data
    • Azure SQL Database to run a stored procedure that looks up data values
    • Azure Databricks or Azure HDInsight to process data
    • an Azure Function to apply custom logic
    • Azure Synapse Analytics to save the output dataset
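The pipeline concept above can be modeled in plain Python. This is NOT the Azure Data Factory SDK, just a sketch of the idea: an input dataset flows through an ordered chain of activities, each incrementally manipulating the output of the previous one until an output dataset remains.

```python
# Plain-Python model of a pipeline (hypothetical activities, not the
# real Azure Data Factory API): each activity is a function over data.

def ingest(dataset):            # stand-in for e.g. a Blob Storage load
    return dataset + [{"id": 4, "value": 40}]

def filter_invalid(dataset):    # stand-in for a cleaning activity
    return [row for row in dataset if row["value"] > 0]

def run_pipeline(input_dataset, activities):
    data = input_dataset
    for activity in activities:  # activities run in order, each on the
        data = activity(data)    # output of the previous one
    return data                  # the final output dataset

input_dataset = [{"id": 1, "value": 10}, {"id": 2, "value": -5}]
output = run_pipeline(input_dataset, [ingest, filter_invalid])
print(len(output))  # 2 valid rows remain (ids 1 and 4)
```

In Data Factory the activities would instead be linked services (Blob Storage, a stored procedure, Databricks, a Function, Synapse), but the orchestration shape is the same.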

Explore analytical data stores

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/4-analytical-data-stores

Two types of analytical data store

Data warehouse

  • Relational database
  • Data stored in a schema
  • Optimized for analytics rather than transactional workloads
  • Data is denormalized: numeric measures are stored in a fact table
    • Denormalization systematically combines related data into fewer tables so information can be retrieved quickly without complex joins
  • Values by which the data can be aggregated are stored in dimension tables
  • Known as a star schema
  • Can be extended to a snowflake schema by adding additional dimension tables at a second layer
  • GOOD FOR: transactional data organized into a structured schema of tables
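A star schema in miniature, sketched with hypothetical in-memory tables: the fact table holds numeric measures plus keys into a dimension table, and a query joins them to aggregate sales by a dimension attribute.

```python
# Star schema sketch (made-up tables): fact rows reference dimension
# rows by key; the "query" joins them and aggregates by product name.

dim_product = {1: {"name": "bike"}, 2: {"name": "skate"}}   # dimension table
fact_sales = [                                              # fact table
    {"product_key": 1, "amount": 100},
    {"product_key": 1, "amount": 150},
    {"product_key": 2, "amount": 50},
]

totals = {}
for row in fact_sales:
    name = dim_product[row["product_key"]]["name"]  # join fact -> dimension
    totals[name] = totals.get(name, 0) + row["amount"]

print(totals)  # {'bike': 250, 'skate': 50}
```

A snowflake schema would simply add a second layer, e.g. a product-category table that `dim_product` rows reference in turn.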

Data lake

  • File store
  • Spark or Hadoop engines process queries
  • Schema-on-read approach that defines tabular schemas on semi-structured data at the point of analysis rather than during storage
  • GOOD FOR: a mix of structured, semi-structured, or unstructured data that doesn't need a fixed schema when written to storage
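Schema-on-read can be illustrated with a small sketch (hypothetical data, standard library only): the raw text is stored in the "lake" exactly as it arrived, and a typed tabular schema is applied only when the data is read for analysis.

```python
# Schema-on-read sketch: raw CSV text stored as-is; a schema of typed
# columns is applied at read time, not at write time.

import csv, io

lake_file = "id,amount,region\n1,120.50,west\n2,75.00,east\n"  # stored raw

schema = {"id": int, "amount": float, "region": str}  # applied on read

def read_with_schema(raw_text, schema):
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{col: cast(row[col]) for col, cast in schema.items()}
            for row in reader]

rows = read_with_schema(lake_file, schema)
print(rows[0])  # {'id': 1, 'amount': 120.5, 'region': 'west'}
```

A different analysis could read the same stored bytes with a different schema, which is exactly the flexibility a data lake trades against the warehouse's fixed schema-on-write.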

Azure service for analytical stores

Azure Synapse Analytics

  • End-to-end solution
  • Large scale data analytics
  • Azure Synapse Studio

Azure Databricks

  • Azure implementation of Databricks
  • Built on Apache Spark
  • Native SQL capabilities

Azure HDInsight

  • Supports multiple open-source data analytics cluster types
  • Good for pure open-source frameworks and Hadoop-based solutions

Last modified July 21, 2024: update (e2ae86c)