Core data concepts

6 minute read

https://docs.microsoft.com/en-us/learn/paths/azure-data-fundamentals-explore-core-data-concepts/

Identify data formats

https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-data-formats

Data structures into which data is organized are known as entities
Each entity has one or more attributes
Data can be classed as structured, semi-structured, or unstructured

Structured data

Adheres to a fixed schema
All the data has the same fields/properties
Commonly this is tabular or in table format
Row represents each instance of data entity
Column represents attributes of the entity
Data can be stored in multiple tables which can refer to each other known as relational

Semi-structure data

Data has some structure but is not as rigid as structured data
It allows for variation between entity instances
For example some customers may have more than one email address attribute and others none at all
A common format is a JavaScript Object Notation (JSON) file

Unstructured data

Data without any structure such as documents, images, video files

Data stores

There are two broad stores for data

File stores
Databases

File store

https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/3-file-storage

Delimited text files

Data stored in flat text file with field delimiters and row terminators
Most common is a CSV file; comma-separated values
- The field delimiter is a comma
- The row terminator is a carriage return
Other popular formats are TSV; tab-separated values
This is a good format for structured data that needs to be accessed by a wide range of systems and in human-readable format
Excel or a text editor is often used to read/change the data
In the example below a comma is used to separate the data

FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com

JSON

Uses a hierarchical document schema to define entities that have multiple attributes
Good for both structures and un-structured data
In the example below each customer has three attributes; firstName, lastName, and contact
The contact attribute contains a collection (multiple) objects that represent one or more contact methods
- Objects are enclosed in braces {..}
- Collections are enclosed in square brackets [..]
- Attributes are represented by name : value pairs and separated by a comma

{
  "customers":
  [
    {
      "firstName": "Joe",
      "lastName": "Jones",
      "contact":
      [
        {
          "type": "home",
          "number": "555 123-1234"
        },
        {
          "type": "email",
          "address": "joe@litware.com"
        }
      ]
    },
    {
      "firstName": "Samir",
      "lastName": "Nadoy",
      "contact":
      [
        {
          "type": "email",
          "address": "samir@northwind.com"
        }
      ]
    }
  ]
}

XML

Popular in the 1990s and 2000s
Superseded by JSON
Uses tags enclosed in angle-brackets <../> to define elements and attributes

<Customers>
  <Customer name="Joe" lastName="Jones">
    <ContactDetails>
      <Contact type="home" number="555 123-1234"/>
      <Contact type="email" address="joe@litware.com"/>
    </ContactDetails>
  </Customer>
  <Customer name="Samir" lastName="Nadoy">
    <ContactDetails>
      <Contact type="email" address="samir@northwind.com"/>
    </ContactDetails>
  </Customer>
</Customers>

BLOB

Essentially alls files are stored in binary as zeroes and ones
But the data is made human readable using the application such as a document or image viewer

Optimized file formats

Human readable formats are convenient but can grow is size over time
Optimized file formats have been developed

Avro is a row-based format developed by Apache. Each record contains a header that describes the structure of the data. The header is stored in JSON. The data is stored in binary. An application uses the information in the JSON file to parse the binary data and extract the required fields. Good format for compressing data.

ORC is optimized row columnar format. Organizes data into columns rather than rows. Developed by HortonWorks. Optimizes read and write operations in Apache Hive (a warehouse solution). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds the statistical information (count, sum, max, min, etc) for each column.

Parquet is another columnar format. Developed by Cloudera and Twitter. It contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A parquet file includes metadata that describes the set of row found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.

Databases

https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/4-databases

Relational databases

Used to store and query structured data
Data is stored in tables
Each instance of an entity is assigned a primary key
The key is unique to the entity and used as a reference from other tables
For example the primary key of a customer ID can be used as a reference in the sale order table instead of the actual customer ID
- This enables a relational database to be normalized
- Which means the elimination of duplicate data

Non-relational databases

Do not have a relational schema to the data
Often referred to as NoSQL
Four common types
- Key-value
  - each record consists of a unique key and an associated value
- Document
  - key-value databases in JSON format
- Column family
  - stores tabular data in rows and columns but you can divide the columns into groups known as column-families. Each family holds a set of columns that are logically related.
- Graph
  - Stores entities as nodes with links to define the relationship

Transactional data processing

https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/5-transactional-data-processing

Records transactions that encapsulate specific events
For example, a banking customer transferring money between accounts or tracking a customer payment for a service
A small, discrete, unit of work
Often high-volume
Typically known as Online Transactional Processing (OLTP)
OLTP relies in a database with read/write operations being optimized
- Create Retrieve Update Delete CRUD
Operations are applied transactionally to ensure the integrity of the data
To accomplish this OLTP uses ACID

Atomicity each transaction is treated as a single unit that either fails or succeeds entirely. If either transaction fails then the other transaction must also fail.

Consistency transactions can only take the data in the database from one valid state to another

Isolation concurrent transactions cannot interfere with each other.

Durability when a transaction has been completed it cannot revert

OLTP systems are typically used to support live applications that process business data which are themseslves known as LOB “Line of business”

Analytical data processing

https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/6-analytical-processing

Analytical data is typically processed as read-only

Data is stored in a data lake
An ETL process copies data from the data lake to a data warehouse

ETL = Extract Transform Load
A data warehouse is optimized for analysis
- Fact tables contain numeric values to analyze (example, sale amounts)
- Dimension tables represent the entities by which you want to measure the facts (example, customer)

Data can be aggregated into an OLAP model or cube

OLAP = Online Analytical Processing
Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions (example, sales revenue might be totaled by date, customer, and product)

The data can be queried to produce reports, visualizations and dashboards

Data lakes are common for analysis where a large volume of file-based data must be collected and analyzed

Data warehouses are used to store relational data that is optimized for read operations

Feedback

Was this page helpful?

Glad to hear it!

Sorry to hear that.

Last modified January 27, 2025: Delete cloud-adoption-framework.md (1a91b0a)