Explore fundamentals of modern data warehousing

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/

  • Infrastructure and processes used to support large-scale data analytics.
  • Supports business intelligence (BI), which typically involves copying data from transactional data stores into a relational database with a schema that’s optimized for querying and building multidimensional models.
  • Large volumes of data in multiple formats are batch loaded or captured in real-time streams and stored in a data lake, from which distributed processing engines like Apache Spark process the data at scale.

Describe modern data warehousing

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/2-describe-warehousing

  • Data ingestion and processing

    • data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse.
    • The load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis.
    • In ETL processes, the data is transformed before being loaded into an analytical store, while in an ELT process the data is copied to the store and then transformed.
    • Either way, the resulting data structure is optimized for analytical queries.
    • The data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters.
    • Data ingestion includes both batch processing of static data and real-time processing of streaming data.
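The ETL pattern above can be sketched in a few lines of plain Python. This is a toy illustration with made-up data, not an Azure pipeline: records are extracted from a source, transformed (cleaned, filtered, restructured), and only then loaded into the analytical store.

```python
# Toy ETL sketch (hypothetical data): transform before load, so only
# cleaned, well-typed rows reach the "analytical store" (here, a list).

raw_sales = [
    {"order_id": 1, "amount": "120.50", "region": "west"},
    {"order_id": 2, "amount": "bad",    "region": "east"},   # malformed row
    {"order_id": 3, "amount": "75.00",  "region": "west"},
]

def transform(records):
    """Clean and restructure: parse amounts, drop malformed rows."""
    cleaned = []
    for r in records:
        try:
            cleaned.append({"order_id": r["order_id"],
                            "amount": float(r["amount"]),
                            "region": r["region"].upper()})
        except ValueError:
            continue  # filter out rows that fail cleaning
    return cleaned

analytical_store = []                          # stand-in for the target store
analytical_store.extend(transform(raw_sales)) # ETL: transform, then load
print(len(analytical_store))                  # 2 rows survive cleaning
```

In an ELT variant, the raw rows would be appended to the store first and `transform` would run inside the store afterwards; either way the result is the same query-optimized structure.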
  • Analytical data store

    • data stores for large scale analytics include relational data warehouses, file-system based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases).
  • Analytical data model

    • while data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce
    • reports
    • dashboards
    • interactive visualizations.
  • Often these data models are described as cubes, in which numeric data values are aggregated across one or more dimensions (for example, to determine total sales by product and region).

  • The model encapsulates the relationships between data values and dimensional entities to support “drill-up/drill-down” analysis.
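The cube idea above can be shown with a tiny aggregation sketch (hypothetical data): numeric values are totaled across two dimensions, and "drilling up" rolls one dimension away.

```python
# Minimal "cube"-style aggregation sketch (made-up sales data):
# totals by (product, region), then drill-up to totals by region only.

from collections import defaultdict

sales = [
    ("bike",  "west", 100),
    ("bike",  "east", 150),
    ("skate", "west",  50),
]

# Drill-down level: totals by (product, region)
by_product_region = defaultdict(int)
for product, region, amount in sales:
    by_product_region[(product, region)] += amount

# Drill-up: roll the product dimension away, leaving totals by region
by_region = defaultdict(int)
for (product, region), amount in by_product_region.items():
    by_region[region] += amount

print(by_product_region[("bike", "west")])  # 100
print(by_region["west"])                    # 150
```

A real analytical model pre-computes aggregates like these so reports and dashboards don't re-scan the raw data.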

  • Data visualization

    • data analysts consume data from analytical models and directly from analytical stores to create reports, dashboards, and other visualizations

    • users in an organization who may not be technology professionals might perform self-service data analysis and reporting
    • The visualizations from the data show trends, comparisons, and key performance indicators (KPIs) for a business or other organization, and can take the form of printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive environments in which users can explore data visually.

Explore data ingestion pipelines

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/3-data-ingestion-pipelines

explore how data is ingested into an analytical data store from one or more sources.

  • large-scale data ingestion is best implemented by creating pipelines

  • pipelines orchestrate the ETL processes

  • Azure Data Factory creates and runs pipelines

  • Azure Synapse Analytics can use the same pipeline engine, meaning you can consolidate the data warehouse solution into one workspace

  • a pipeline consists of one or more activities that operate on data

  • an input dataset provides the source data

  • activities are defined that incrementally manipulate the data until an output dataset is produced

  • pipelines use linked services to load and process data, for example:

    • Azure Blob Storage to ingest data
    • Azure SQL Database to run a stored procedure that looks up data values
    • Azure Databricks or Azure HDInsight to process data
    • an Azure Function to apply custom logic
    • Azure Synapse Analytics to save the output dataset
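The pipeline concept above can be modeled in plain Python. This is NOT the Azure Data Factory SDK, just a sketch of the idea: an input dataset flows through an ordered chain of activities, each incrementally manipulating the output of the previous one until an output dataset remains.

```python
# Plain-Python model of a pipeline (hypothetical activities, not the
# real Azure Data Factory API): each activity is a function over data.

def ingest(dataset):            # stand-in for e.g. a Blob Storage load
    return dataset + [{"id": 4, "value": 40}]

def filter_invalid(dataset):    # stand-in for a cleaning activity
    return [row for row in dataset if row["value"] > 0]

def run_pipeline(input_dataset, activities):
    data = input_dataset
    for activity in activities:  # activities run in order, each on the
        data = activity(data)    # output of the previous one
    return data                  # the final output dataset

input_dataset = [{"id": 1, "value": 10}, {"id": 2, "value": -5}]
output = run_pipeline(input_dataset, [ingest, filter_invalid])
print(len(output))  # 2 valid rows remain (ids 1 and 4)
```

In Data Factory the activities would instead be linked services (Blob Storage, a stored procedure, Databricks, a Function, Synapse), but the orchestration shape is the same.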

Explore analytical data stores

https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/4-analytical-data-stores

Two types of analytical data store

Data warehouse

  • Relational database
  • Data stored in a schema
  • Optimized for analytics rather than transactional workloads
  • Data is denormalized: numeric measures are stored in a fact table
    • Denormalization systematically combines related data into fewer tables so information can be retrieved quickly without complex joins
  • Values by which the data can be aggregated are stored in dimension tables
  • Known as a star schema
  • Can be extended to a snowflake schema by adding additional dimension tables at a second layer
  • GOOD FOR: transactional data organized into a structured schema of tables
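A star schema in miniature, sketched with hypothetical in-memory tables: the fact table holds numeric measures plus keys into a dimension table, and a query joins them to aggregate sales by a dimension attribute.

```python
# Star schema sketch (made-up tables): fact rows reference dimension
# rows by key; the "query" joins them and aggregates by product name.

dim_product = {1: {"name": "bike"}, 2: {"name": "skate"}}   # dimension table
fact_sales = [                                              # fact table
    {"product_key": 1, "amount": 100},
    {"product_key": 1, "amount": 150},
    {"product_key": 2, "amount": 50},
]

totals = {}
for row in fact_sales:
    name = dim_product[row["product_key"]]["name"]  # join fact -> dimension
    totals[name] = totals.get(name, 0) + row["amount"]

print(totals)  # {'bike': 250, 'skate': 50}
```

A snowflake schema would simply add a second layer, e.g. a product-category table that `dim_product` rows reference in turn.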

Data lake

  • File store
  • Spark or Hadoop engines process queries
  • Schema-on-read approach that defines tabular schemas on semi-structured data at the point of analysis rather than during storage
  • GOOD FOR: a mix of structured, semi-structured, or unstructured data that doesn't need a fixed schema when written to storage
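Schema-on-read can be illustrated with a small sketch (hypothetical data, standard library only): the raw text is stored in the "lake" exactly as it arrived, and a typed tabular schema is applied only when the data is read for analysis.

```python
# Schema-on-read sketch: raw CSV text stored as-is; a schema of typed
# columns is applied at read time, not at write time.

import csv, io

lake_file = "id,amount,region\n1,120.50,west\n2,75.00,east\n"  # stored raw

schema = {"id": int, "amount": float, "region": str}  # applied on read

def read_with_schema(raw_text, schema):
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{col: cast(row[col]) for col, cast in schema.items()}
            for row in reader]

rows = read_with_schema(lake_file, schema)
print(rows[0])  # {'id': 1, 'amount': 120.5, 'region': 'west'}
```

A different analysis could read the same stored bytes with a different schema, which is exactly the flexibility a data lake trades against the warehouse's fixed schema-on-write.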

Azure service for analytical stores

Azure Synapse Analytics

  • End-to-end solution
  • Large scale data analytics
  • Azure Synapse Studio

Azure Databricks

  • Azure implementation of Databricks
  • Built on Apache Spark
  • Native SQL capabilities

Azure HDInsight

  • Supports multiple open-source data analytics cluster types
  • Good for pure open-source frameworks and Hadoop-based solutions

Last modified July 21, 2024: update (e2ae86c)