Core data concepts
6 minute read
https://docs.microsoft.com/en-us/learn/paths/azure-data-fundamentals-explore-core-data-concepts/
Identify data formats
https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/2-data-formats
- Data structures into which data is organized are known as entities
- Each entity has one or more attributes
- Data can be classed as structured, semi-structured, or unstructured
Structured data
- Adheres to a fixed schema
- All the data has the same fields/properties
- Commonly this is tabular or in table format
- Row represents each instance of data entity
- Column represents attributes of the entity
- Data can be stored in multiple tables which can refer to each other known as relational
Semi-structure data
- Data has some structure but is not as rigid as structured data
- It allows for variation between entity instances
- For example some customers may have more than one email address attribute and others none at all
- A common format is a JavaScript Object Notation (JSON) file
Unstructured data
- Data without any structure such as documents, images, video files
Data stores
There are two broad stores for data
- File stores
- Databases
File store
https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/3-file-storage
Delimited text files
- Data stored in flat text file with field delimiters and row terminators
- Most common is a CSV file; comma-separated values
- The field delimiter is a comma
- The row terminator is a carriage return
- Other popular formats are TSV; tab-separated values
- This is a good format for structured data that needs to be accessed by a wide range of systems and in human-readable format
- Excel or a text editor is often used to read/change the data
- In the example below a comma is used to separate the data
FirstName,LastName,Email
Joe,Jones,joe@litware.com
Samir,Nadoy,samir@northwind.com
JSON
- Uses a hierarchical document schema to define entities that have multiple attributes
- Good for both structures and un-structured data
- In the example below each customer has three attributes; firstName, lastName, and contact
- The contact attribute contains a collection (multiple) objects that represent one or more contact methods
- Objects are enclosed in braces {..}
- Collections are enclosed in square brackets [..]
- Attributes are represented by name : value pairs and separated by a comma
{
"customers":
[
{
"firstName": "Joe",
"lastName": "Jones",
"contact":
[
{
"type": "home",
"number": "555 123-1234"
},
{
"type": "email",
"address": "joe@litware.com"
}
]
},
{
"firstName": "Samir",
"lastName": "Nadoy",
"contact":
[
{
"type": "email",
"address": "samir@northwind.com"
}
]
}
]
}
XML
- Popular in the 1990s and 2000s
- Superseded by JSON
- Uses tags enclosed in angle-brackets <../> to define elements and attributes
<Customers>
<Customer name="Joe" lastName="Jones">
<ContactDetails>
<Contact type="home" number="555 123-1234"/>
<Contact type="email" address="joe@litware.com"/>
</ContactDetails>
</Customer>
<Customer name="Samir" lastName="Nadoy">
<ContactDetails>
<Contact type="email" address="samir@northwind.com"/>
</ContactDetails>
</Customer>
</Customers>
BLOB
- Essentially alls files are stored in binary as zeroes and ones
- But the data is made human readable using the application such as a document or image viewer
Optimized file formats
- Human readable formats are convenient but can grow is size over time
- Optimized file formats have been developed
Avro is a row-based format developed by Apache. Each record contains a header that describes the structure of the data. The header is stored in JSON. The data is stored in binary. An application uses the information in the JSON file to parse the binary data and extract the required fields. Good format for compressing data.
ORC is optimized row columnar format. Organizes data into columns rather than rows. Developed by HortonWorks. Optimizes read and write operations in Apache Hive (a warehouse solution). An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds the statistical information (count, sum, max, min, etc) for each column.
Parquet is another columnar format. Developed by Cloudera and Twitter. It contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A parquet file includes metadata that describes the set of row found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.
Databases
https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/4-databases
Relational databases
- Used to store and query structured data
- Data is stored in tables
- Each instance of an entity is assigned a primary key
- The key is unique to the entity and used as a reference from other tables
- For example the primary key of a customer ID can be used as a reference in the sale order table instead of the actual customer ID
- This enables a relational database to be normalized
- Which means the elimination of duplicate data
Non-relational databases
- Do not have a relational schema to the data
- Often referred to as NoSQL
- Four common types
- Key-value
- each record consists of a unique key and an associated value
- Document
- key-value databases in JSON format
- Column family
- stores tabular data in rows and columns but you can divide the columns into groups known as column-families. Each family holds a set of columns that are logically related.
- Graph
- Stores entities as nodes with links to define the relationship
- Key-value
Transactional data processing
- Records transactions that encapsulate specific events
- For example, a banking customer transferring money between accounts or tracking a customer payment for a service
- A small, discrete, unit of work
- Often high-volume
- Typically known as Online Transactional Processing (OLTP)
- OLTP relies in a database with read/write operations being optimized
- Create Retrieve Update Delete CRUD
- Operations are applied transactionally to ensure the integrity of the data
- To accomplish this OLTP uses ACID
Atomicity each transaction is treated as a single unit that either fails or succeeds entirely. If either transaction fails then the other transaction must also fail.
Consistency transactions can only take the data in the database from one valid state to another
Isolation concurrent transactions cannot interfere with each other.
Durability when a transaction has been completed it cannot revert
OLTP systems are typically used to support live applications that process business data which are themseslves known as LOB “Line of business”
Analytical data processing
https://docs.microsoft.com/en-us/learn/modules/explore-core-data-concepts/6-analytical-processing
- Analytical data is typically processed as read-only
- Data is stored in a data lake
- An ETL process copies data from the data lake to a data warehouse
- ETL = Extract Transform Load
- A data warehouse is optimized for analysis
- Fact tables contain numeric values to analyze (example, sale amounts)
- Dimension tables represent the entities by which you want to measure the facts (example, customer)
- Data can be aggregated into an OLAP model or cube
- OLAP = Online Analytical Processing
- Aggregated numeric values (measures) from fact tables are calculated for intersections of dimensions (example, sales revenue might be totaled by date, customer, and product)
- The data can be queried to produce reports, visualizations and dashboards
Data lakes are common for analysis where a large volume of file-based data must be collected and analyzed
Data warehouses are used to store relational data that is optimized for read operations