Explore fundamentals of modern data warehousing
https://docs.microsoft.com/en-us/learn/modules/examine-components-of-modern-data-warehouse/
- infrastructure and processes used to support large-scale data analytics.
- support business intelligence (BI)
- which typically involves copying data from transactional data stores into a relational database with a schema that’s optimized for querying and building multidimensional models
- large volumes of data in multiple formats are batch loaded or captured in real-time streams
- stored in a data lake, where distributed processing engines such as Apache Spark process the data at scale.
Describe modern data warehousing
Data ingestion and processing
- data from one or more transactional data stores, files, real-time streams, or other sources is loaded into a data lake or a relational data warehouse.
- The load operation usually involves an extract, transform, and load (ETL) or extract, load, and transform (ELT) process in which the data is cleaned, filtered, and restructured for analysis.
- In ETL processes, the data is transformed before being loaded into an analytical store
- while in an ELT process the data is copied to the store and then transformed.
- Either way, the resulting data structure is optimized for analytical queries.
- The data processing is often performed by distributed systems that can process high volumes of data in parallel using multi-node clusters.
- Data ingestion includes both batch processing of static data and real-time processing of streaming data.
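The ETL-vs-ELT distinction above can be sketched in a few lines of plain Python. This is an illustrative toy, not a real pipeline: the records, the `transform` function, and the "stores" (plain lists) are all hypothetical stand-ins.

```python
# Illustrative sketch: the same cleaning step applied ETL-style
# (transform before load) vs. ELT-style (load raw, then transform in the store).

raw_records = [
    {"product": "bike", "amount": "250"},
    {"product": "bike", "amount": None},      # dirty row to be filtered out
    {"product": "helmet", "amount": "40"},
]

def transform(records):
    """Clean and restructure: drop rows with missing amounts, cast to int."""
    return [
        {"product": r["product"], "amount": int(r["amount"])}
        for r in records
        if r["amount"] is not None
    ]

# ETL: transform first, then load the cleaned data into the analytical store.
etl_store = transform(raw_records)

# ELT: load the raw data as-is, then transform it inside the store.
elt_store_raw = list(raw_records)       # raw copy lands in the store first
elt_store = transform(elt_store_raw)    # transformation happens in-store

# Either way, the resulting structure is optimized for analytical queries.
assert etl_store == elt_store
print(etl_store)
```

The difference is only *where and when* `transform` runs; the analytical result is the same.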
Analytical data store
- data stores for large scale analytics include relational data warehouses, file-system based data lakes, and hybrid architectures that combine features of data warehouses and data lakes (sometimes called data lakehouses or lake databases).
Analytical data model
- while data analysts and data scientists can work with the data directly in the analytical data store, it’s common to create one or more data models that pre-aggregate the data to make it easier to produce
- reports
- dashboards
- interactive visualizations.
Often these data models are described as cubes, in which numeric data values are aggregated across one or more dimensions (for example, to determine total sales by product and region).
The model encapsulates the relationships between data values and dimensional entities to support “drill-up/drill-down” analysis.
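The "total sales by product and region" example can be sketched as a tiny aggregation over dimensions. The sales rows and the `aggregate` helper are hypothetical; real cubes pre-compute these totals rather than scanning rows on demand.

```python
# A cube aggregates a numeric measure ("amount") across one or more dimensions.
from collections import defaultdict

sales = [
    {"product": "bike",   "region": "west", "amount": 250},
    {"product": "bike",   "region": "east", "amount": 100},
    {"product": "helmet", "region": "west", "amount": 40},
]

def aggregate(rows, *dims):
    """Sum the amount measure grouped by the given dimensions."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row["amount"]
    return dict(totals)

# Drill-down: totals by product AND region.
by_product_region = aggregate(sales, "product", "region")
# Drill-up: totals by product only.
by_product = aggregate(sales, "product")

print(by_product)  # {('bike',): 350, ('helmet',): 40}
```

Drill-up/drill-down is just re-aggregating the same measure over fewer or more dimensions.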
Data visualization
- data analysts consume data from analytical models and directly from analytical stores to create
- reports
- dashboards
- other visualizations
- users in an organization who may not be technology professionals might perform self-service data analysis and reporting
- The visualizations from the data show trends, comparisons, and key performance indicators (KPIs) for a business or other organization, and can take the form of printed reports, graphs and charts in documents or PowerPoint presentations, web-based dashboards, and interactive environments in which users can explore data visually.
Explore data ingestion pipelines
explore how data is ingested into an analytical data store from one or more sources
large-scale data ingestion is best implemented by creating pipelines
these orchestrate the ETL processes
Azure Data Factory creates and runs pipelines
Azure Synapse Analytics can use the same pipeline engine, letting you consolidate the data warehouse solution into a single workspace
pipelines consist of one or more activities that operate on data
an input dataset provides the source data
activities are defined that incrementally manipulate the data until an output dataset is produced
pipelines use linked services to load and process data, for example:
- Azure Blob Storage to ingest data
- Azure SQL Database to run a stored procedure that looks up data values
- Azure Databricks or Azure HDInsight to process data
- Azure Functions to apply custom logic
- Azure Synapse Analytics to save the output dataset
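The pipeline idea (activities incrementally manipulating an input dataset until an output dataset is produced) can be sketched as a chain of functions. The activity names below loosely mirror roles a linked service might play; none of this is the real Azure Data Factory API.

```python
# Minimal sketch of pipeline orchestration: each activity transforms the
# previous activity's output until an output dataset is produced.

def ingest():                      # stand-in for reading from blob storage
    return ["  Bike,250 ", "Helmet,40"]

def clean(rows):                   # stand-in for a transformation activity
    return [r.strip() for r in rows]

def restructure(rows):             # shape the data for the analytical store
    return [dict(zip(("product", "amount"), r.split(","))) for r in rows]

def run_pipeline(activities):
    """Run each activity on the previous activity's output."""
    data = None
    for activity in activities:
        data = activity(data) if data is not None else activity()
    return data

output_dataset = run_pipeline([ingest, clean, restructure])
print(output_dataset)
```

A real orchestrator adds scheduling, retries, and dependency graphs, but the core shape is the same: an ordered set of activities between an input and an output dataset.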
Explore analytical data stores
Two types of analytical data store
Data warehouse
- Relational database
- Data stored in a schema
- Optimized for analytics rather than transactional workloads
- Data is denormalized: numeric values are stored in a fact table
- Denormalization is the process of systematically combining data so information can be retrieved quickly
- Values by which the data can be aggregated are stored in dimension tables
- Known as a star schema
- Can be extended to a snowflake schema by adding additional dimension tables at a second level
- GOOD FOR: transactional data organized into a structured schema of tables
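A toy star schema can be built with Python's built-in `sqlite3` module: a fact table holding the numeric measure, joined to a dimension table whose attributes drive the aggregation. Table and column names here are illustrative, not a standard.

```python
# Toy star schema: fact_sales holds measures, dim_product holds attributes
# the measures can be aggregated by.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
    CREATE TABLE fact_sales  (product_key INTEGER, region TEXT, amount INTEGER);

    INSERT INTO dim_product VALUES (1, 'bike', 'vehicles'), (2, 'helmet', 'safety');
    INSERT INTO fact_sales  VALUES (1, 'west', 250), (1, 'east', 100), (2, 'west', 40);
""")

# Aggregate the numeric fact across a dimension attribute (total sales by category).
rows = con.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

print(rows)  # [('safety', 40), ('vehicles', 350)]
```

A snowflake schema would simply normalize `dim_product` further, e.g. moving `category` into its own table referenced by a key.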
Data lake
- File store
- Spark or Hadoop engines process queries
- Schema-on-read approach defines tabular schemas on semi-structured data at the point of analysis, not during storage
- GOOD FOR: a mix of structured, semi-structured, or unstructured data that doesn't need a fixed schema when written to storage
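Schema-on-read can be sketched with plain Python: the file lands in the lake as raw semi-structured text, and a tabular schema is projected onto it only when it's read. The file contents and `read_with_schema` helper are hypothetical.

```python
# Schema-on-read sketch: no schema is enforced at write time; the reader
# chooses which columns to project when the data is analyzed.
import io
import json

# Raw JSON-lines data as it sits in the lake (note the inconsistent fields).
lake_file = io.StringIO(
    '{"product": "bike", "amount": 250, "color": "red"}\n'
    '{"product": "helmet", "amount": 40}\n'
)

def read_with_schema(f, columns):
    """Project each raw record onto the schema chosen at read time."""
    return [
        tuple(json.loads(line).get(col) for col in columns)
        for line in f
    ]

# The same raw data could be read with a different column list tomorrow.
table = read_with_schema(lake_file, ["product", "amount"])
print(table)  # [('bike', 250), ('helmet', 40)]
```

Contrast this with a data warehouse, where the schema is enforced when the data is written, and ill-fitting rows are rejected or cleaned at load time.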
Azure services for analytical stores
Azure Synapse Analytics
- End-to-end solution
- Large scale data analytics
- Azure Synapse Studio
Azure Databricks
- Azure implementation of Databricks
- Built on Apache Spark
- Native SQL capabilities
Azure HDInsight
- Supports multiple open-source data analytics cluster types
- Good for pure open-source frameworks and Hadoop-based solutions