What is a Data Warehouse?
A Data Warehouse is a "subject-oriented, integrated, time-variant, non-volatile collection of data in support of decision making". A data warehouse is a database that contains large amounts of historical data derived from transaction data, but it can include data from other sources as well.
It separates analysis workload from transaction workload and enables an organization to consolidate data from several sources.
Why is a Data Warehouse required?
A DWH is a collection of data designed for query and analysis rather than for transaction processing. It separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources. The fundamental reason for building a data warehouse is to improve the quality of information in the organization. The main goal of a data warehouse is to report and present information in a very user-friendly form.
Benefits of data warehousing:
Some of the potential benefits of putting data into a data warehouse include:
- Improving turnaround time for data access and reporting.
- Standardizing data across the organization so there will be one view of the “truth”.
- Merging data from various source systems to create a more comprehensive information source.
- Lowering costs to create and distribute information and reports.
- Sharing data and allowing others to access and analyze the data.
- Encouraging and improving fact-based decision making.
Data Warehouse Life Cycle
Data warehouses can have many different types of life cycles with independent data marts. The following is an example of a data warehouse life cycle. In the life cycle of this example, four important steps are involved.
Life Cycle of Data Warehouse
- Extraction – As a first step, heterogeneous data from different online transaction processing systems is extracted. This data becomes the data source for the data warehouse.
- Cleansing/transformation – The source data is sent into the populating systems where the data is cleansed, integrated, consolidated, secured and stored in the corporate or central data warehouse.
- Distribution – From the central data warehouse, data is distributed to independent data marts specifically designed for the end user.
- Analysis – From these data marts, data is sent to the end users who access the data stored in the data mart depending upon their requirement.
When should a company consider implementing a data warehouse?
Data warehouses or a more focused database called a data mart should be considered when a significant number of potential users are requesting access to a large amount of related historical information for analysis and reporting purposes. So-called active or real-time data warehouses can provide advanced decision support capabilities.
What is a Data Mart?
A Data Mart is usually sponsored at the department level and developed with a specific issue or subject in mind; it is a subset of a data warehouse with a focused objective.
A data mart is a selected part of the data warehouse which supports specific decision support application requirements of a company’s department or geographical region.
It usually contains simple replicates of warehouse partitions or data that has been further summarized or derived from base warehouse data.
Instead of running ad hoc queries against a huge data warehouse, data marts allow the efficient execution of predicted queries over a significantly smaller database.
Differentiate between a data warehouse and a data mart
A data mart is a subset of a data warehouse; in terms of design, a data warehouse and a data mart are almost the same.
In general, a Data Warehouse is used at an enterprise level and a Data Mart is used at a business division/department level. A data mart only contains data specific to a particular subject area.
A data warehouse is for very large databases (VLDBs) and a data mart is for smaller databases.
The difference lies in the scope of the things with which they deal.
A data mart is an implementation of a data warehouse with a small and more tightly restricted scope of data and data warehouse functions.
A data mart serves a single department or part of an organization. In other words, the scope of a data mart is smaller than the data warehouse. It is a data warehouse for a smaller group of end users.
How is data processed in a Data Warehouse?
Data in a data warehouse is processed in two ways:
- OLAP : Online Analytical Processing
- OLTP : On-line Transaction Processing
What is OLAP?
OLAP is software for manipulating multidimensional data from a variety of sources. The data is often stored in a data warehouse. OLAP software helps a user create queries, views, representations and reports.
OLAP tools can provide a “front-end” for a data-driven DSS. On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.
OLAP functionality is characterized by dynamic multi-dimensional analysis of consolidated enterprise data supporting end user analytical and navigational activities.
Difference between OLTP and OLAP
- OLAP (Online Analytical Processing): mainly required for decision support (DSS); data is kept in a denormalized, mostly non-volatile form and is highly indexed to improve query response time.
- OLTP (Online Transaction Processing): transactional DML workload; data is highly normalized to reduce deadlocks and increase concurrency.
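To make the workload difference concrete, here is a minimal sketch; the table and column names are hypothetical. An OLTP system issues small, targeted transactions, while an OLAP query scans and aggregates large volumes of historical data.

```sql
-- OLTP: a small, targeted transaction against a normalized schema
UPDATE accounts
   SET balance = balance - 100
 WHERE account_id = 12345;
COMMIT;

-- OLAP: an analytical query aggregating historical data in a dimensional schema
SELECT d.calendar_year,
       p.product_category,
       SUM(f.sales_amount) AS total_sales
  FROM sales_fact f
  JOIN date_dim    d ON d.date_key    = f.date_key
  JOIN product_dim p ON p.product_key = f.product_key
 GROUP BY d.calendar_year, p.product_category;
```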
Data Model
The data modeling concepts covered below are entities, attributes and relationships, the ER and dimensional models, and the physical and logical data models.
Entity, Attribute and Relationship
An entity represents a chunk of information. In relational databases, an entity often maps to a table. An attribute is a component of an entity and helps define the uniqueness of the entity. In relational databases, an attribute maps to a column. The entities are linked together using relationships.
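As a small illustration of these mappings (the table and column names below are made up), an entity becomes a table, its attributes become columns, and a relationship is typically enforced with a foreign key:

```sql
-- Entity "Customer" maps to a table; its attributes map to columns
CREATE TABLE customer (
    customer_id   INTEGER PRIMARY KEY,   -- attribute that uniquely identifies the entity
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);

-- Entity "Order" is related to "Customer" through a foreign key (the relationship)
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    order_date  DATE,
    customer_id INTEGER REFERENCES customer (customer_id)
);
```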
ER model and Dimensional Model
ER Model – a relational (normalized) model.
Dimensional Model – Star Schema (a central fact table with numeric data and all dimension tables linked to it; faster, but denormalised), Snowflake Schema (one fact table with normalized dimension tables) and Fact Constellation (multiple fact tables that share dimension tables, combining data from one data mart to another).
Physical data model
During the physical design process, you convert the data gathered during the logical design phase into a description of the physical database, including tables and constraints.
Logical data model
A logical design is a conceptual and abstract design. We do not deal with the physical implementation details yet; we deal only with defining the types of information that we need. The process of logical design involves arranging data into a series of logical relationships called entities and attributes.
Metadata
Information about the structure and meaning of the data in the data warehouse (data about data).
Dimensional Data Modeling:
Different types of Dimensional Modeling
Star Schema (a central fact table with numeric data and all dimension tables linked to it; faster, but denormalised), Snowflake Schema (one fact table with normalized dimension tables) and Fact Constellation (multiple fact tables that share dimension tables, combining data from one data mart to another).
Star Schema
A star schema is a set of tables comprised of a single, central fact table surrounded by de-normalized dimensions. Each dimension is represented in a single table. Star schemas implement dimensional data structures with de-normalized dimensions.
The data is stored in a central fact table, with one or more tables holding information on each dimension. Dimensions have levels, and all levels are usually shown as columns in each dimension table. The center of the star consists of a large fact table, and the points of the star are the dimension tables. A star schema contains denormalized dimension tables and a fact table; each primary key value in a dimension table is associated with a foreign key in the fact table.
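A minimal star schema sketch (a hypothetical retail example): a central sales fact table holding foreign keys and numeric measures, surrounded by denormalized dimension tables.

```sql
-- Denormalized dimension tables (one table per dimension)
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    month_name    VARCHAR(20),
    calendar_year INTEGER
);

CREATE TABLE product_dim (
    product_key      INTEGER PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50)    -- category kept in the same table (denormalized)
);

-- Central fact table: foreign keys to the dimensions plus numeric measurements
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim (date_key),
    product_key  INTEGER REFERENCES product_dim (product_key),
    quantity     INTEGER,
    sales_amount DECIMAL(12,2)
);
```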
Snowflake Schema
A snowflake schema is a set of tables comprised of a single, central fact table surrounded by normalized dimension hierarchies. Each dimension level is represented in its own table. Snowflake schemas implement dimensional data structures with fully normalized dimensions: the dimension tables are normalized to eliminate redundancy, so the dimension data is grouped into multiple tables instead of one large table.
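Continuing the hypothetical example above, a snowflake schema would normalize the product dimension by moving the category level into its own table:

```sql
-- Normalized dimension hierarchy: the category level split out of the product dimension
CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

-- product_dim redefined for the snowflake variant: it now references its parent level
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES category_dim (category_key)
);
```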
Dimensions and Facts
- Dimensional modeling begins by dividing the world into measurements and context. Measurements are usually numeric and taken repeatedly.
- Numeric measurements are facts. Facts are always surrounded by mostly textual context that’s true at the moment the fact is recorded.
- Facts are very specific, well-defined numeric attributes.
- Dimensional modeling divides the world of data into two major types: Measurements and Descriptions of the context surrounding those measurements.
- The measurements, which are typically numeric, are stored in fact tables, and the descriptions of the context, which are typically textual, are stored in the dimension tables.
- A fact table in a pure star schema consists of multiple foreign keys, each paired with a primary key in a dimension, together with the facts containing the measurements.
- Every foreign key in the fact table has a match to a unique primary key in the respective dimension (referential integrity).
- This allows the dimension table to possess primary keys that aren't found in the fact table. Therefore, a product dimension table might be paired with a sales fact table in which some of the products are never sold.
- Dimensional models are full-fledged relational models, where the fact table is in third normal form and the dimension tables are in second normal form.
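For instance, a dimension row with no matching fact rows (a product that was never sold) can be found with an outer join from the dimension to the fact table. This is a hedged sketch using the hypothetical tables defined earlier.

```sql
-- Products present in the dimension but never referenced by any fact row
SELECT p.product_name
  FROM product_dim p
  LEFT JOIN sales_fact f ON f.product_key = p.product_key
 WHERE f.product_key IS NULL;
```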
Slowly Changing Dimensions (SCD)
Slowly changing dimensions refers to the change in dimensional attribute values over time. An example of a slowly changing dimension is a product dimension where the attributes of a given product change over time, due to changes in components, ingredients or packaging details.
Types of SCD
- SCD Type 1
- SCD Type 2
- SCD Type 3
SCD Type 1
Overwriting; no history is maintained. The new record replaces the original record, which means a Type 1 SCD overwrites a dimensional attribute and the history is definitely lost. We overwrite when we are correcting an error in the data or when we truly don't want to save history.
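A minimal Type 1 sketch against a hypothetical customer dimension: the attribute is simply overwritten in place, so the previous value does not survive.

```sql
-- SCD Type 1: overwrite the attribute; the old value is lost
UPDATE customer_dim
   SET city = 'Boston'
 WHERE customer_id = 1001;   -- natural/business key of the customer
```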
SCD Type 2
Creating another dimension record with time stamps.
A Type 2 SCD creates a new dimension record and requires a generalized or surrogate key for the dimension. We create surrogate keys when a true physical change occurs in a dimension entity at a specific point in time, such as a customer address change or a product packaging change. We often add a timestamp and a reason code in the dimension record to precisely describe the change.
The Type 2 SCD records changes of values of dimensional entity attributes over time. The technique requires adding a new row to the dimension each time there’s a change in the value of an attribute (or group of attributes) and assigning a unique surrogate key to the new row.
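A minimal Type 2 sketch for the same hypothetical dimension, using a surrogate key plus effective/expiry dates and a current-row flag (a common variant; the exact columns and the Oracle-style sequence are assumptions of this sketch):

```sql
-- Expire the currently active row for the customer
UPDATE customer_dim
   SET expiry_date  = DATE '2024-01-31',
       current_flag = 'N'
 WHERE customer_id  = 1001
   AND current_flag = 'Y';

-- Insert a new row with a new surrogate key and the changed attribute value
INSERT INTO customer_dim
       (customer_key, customer_id, city, effective_date, expiry_date, current_flag)
VALUES (customer_dim_seq.NEXTVAL, 1001, 'Boston', DATE '2024-02-01', NULL, 'Y');
```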
SCD Type 3
Creating a current value field. The original record is modified to reflect the change.
A Type 3 SCD adds a new field in the dimension record but does not create a new record. We might change the designation of the customer’s sales territory because we redraw the sales territory map, or we arbitrarily change the category of the product from confectionary to candy. In both cases, we augment the original dimension attribute with an “old” attribute so we can switch between these alternate realities.
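A minimal Type 3 sketch: add an "old" column to the dimension and keep both the previous and the current value (the column names are illustrative).

```sql
-- Add a column to hold the previous value of the attribute
ALTER TABLE customer_dim ADD old_sales_territory VARCHAR(50);

-- On change, move the current value into the "old" column and overwrite the current one
UPDATE customer_dim
   SET old_sales_territory = sales_territory,
       sales_territory     = 'North East'
 WHERE customer_id = 1001;
```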
Different Keys
Surrogate Key: A surrogate key is an artificial or synthetic key that is used as a substitute for a natural key. It is just a unique identifier or number for each row that can be used as the primary key of the table. It is useful because the natural primary key (e.g. the Customer Number in the Customer table) can change or be reused in the source system, while the surrogate key remains stable.
Advantages of using Surrogate Keys :
- We can save substantial storage space with integer valued surrogate keys
- Eliminate administrative surprises coming from production
- Potentially adapt to big surprises like a merger or an acquisition
- Have a flexible mechanism for handling slowly changing dimensions
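A minimal sketch of generating surrogate keys with an Oracle-style sequence (the names are hypothetical); identity columns or an ETL tool's sequence generator achieve the same thing:

```sql
-- Sequence that supplies surrogate key values for the dimension
CREATE SEQUENCE customer_dim_seq START WITH 1 INCREMENT BY 1;

-- The surrogate key is the primary key; the natural key is kept as an ordinary attribute
CREATE TABLE customer_dim (
    customer_key  INTEGER PRIMARY KEY,   -- surrogate key
    customer_id   INTEGER,               -- natural/business key from the source system
    customer_name VARCHAR(100),
    city          VARCHAR(50)
);

INSERT INTO customer_dim (customer_key, customer_id, customer_name, city)
VALUES (customer_dim_seq.NEXTVAL, 1001, 'Jane Doe', 'Chicago');
```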
Difference between primary key and unique key constraints
A primary key enforces uniqueness and does not allow NULL values, whereas a unique constraint enforces uniqueness but allows NULL values.
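A small sketch showing both constraints on one hypothetical table: the primary key column rejects NULLs, while the uniquely constrained column accepts them.

```sql
CREATE TABLE employee (
    employee_id INTEGER      PRIMARY KEY,   -- unique and NOT NULL
    email       VARCHAR(100) UNIQUE         -- unique, but NULLs are allowed
);

INSERT INTO employee (employee_id, email) VALUES (1, 'a@example.com');              -- ok
INSERT INTO employee (employee_id, email) VALUES (2, NULL);                          -- ok: UNIQUE allows NULL
-- INSERT INTO employee (employee_id, email) VALUES (NULL, 'b@example.com');         -- fails: PK is NOT NULL
```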
Data Mining
Data Mining is the process of automated extraction of predictive information from large databases.
It predicts future trends and finds behavior that the experts may miss as it lies beyond their expectations. Data Mining is part of a larger process called knowledge discovery; specifically, the step in which advanced statistical analysis and modeling techniques are applied to the data to find useful patterns and relationships.
Data mining can be defined as “a decision support process in which we search for patterns of information in data.” This search may be done just by the user, i.e. just by performing queries, in which case it is quite hard and in most of the cases not comprehensive enough to reveal intricate patterns.
Data mining uses sophisticated statistical analysis and modeling techniques to uncover such patterns and relationships hidden in organizational databases – patterns that ordinary methods might miss. Once found, the information needs to be presented in a suitable form, with graphs, reports, etc.
Difference between view and materialized view
A view contains only a query; whenever the view is executed, it reads from the base tables. A materialized view, on the other hand, is loaded (replicated) once and stores the result set, which gives better query performance. Materialized views can be refreshed (1) on commit or (2) on demand, using the complete, never, fast or force refresh methods.
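An Oracle-style sketch of the difference (the table and view names are made up): the plain view stores only the query, while the materialized view stores the result set and is refreshed on demand or on commit.

```sql
-- Plain view: only the query is stored; every access reads the base table
CREATE VIEW v_monthly_sales AS
SELECT date_key, SUM(sales_amount) AS total_sales
  FROM sales_fact
 GROUP BY date_key;

-- Materialized view: the result set is stored and refreshed periodically
CREATE MATERIALIZED VIEW mv_monthly_sales
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND AS
SELECT date_key, SUM(sales_amount) AS total_sales
  FROM sales_fact
 GROUP BY date_key;

-- Refresh on demand, e.g. after the nightly load ('C' = complete; SQL*Plus shorthand)
EXEC DBMS_MVIEW.REFRESH('MV_MONTHLY_SALES', 'C');
```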
Bitmap index in DWH
In a bitmap index, a bitmap for each key value replaces a list of rowids. Bitmap indexes are more efficient in a data warehouse environment because warehouse columns typically have low cardinality and low update rates, and bitmaps are very efficient for WHERE-clause filtering, which improves query response time. A bitmap join index can be used to join a dimension and a fact table instead of reading two different indexes.
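An Oracle-style sketch, reusing the hypothetical star schema tables from earlier (the bitmap join index assumes the dimension join key is its primary key):

```sql
-- Bitmap index on a low-cardinality dimension attribute
CREATE BITMAP INDEX product_category_bix
    ON product_dim (product_category);

-- Bitmap index on a fact table foreign key, typical in star schemas
CREATE BITMAP INDEX sales_product_bix
    ON sales_fact (product_key);

-- Bitmap join index: indexes a dimension attribute directly against the fact table
CREATE BITMAP INDEX sales_prod_cat_bjix
    ON sales_fact (product_dim.product_category)
  FROM sales_fact, product_dim
 WHERE product_dim.product_key = sales_fact.product_key;
```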
What is ETL?
ETL stands for extraction, transformation and loading. ETL is a process that involves the following tasks:
- Extracting: data from source operational or archive systems, which are the primary source of data for the data warehouse.
- Transforming: the data, which may involve cleaning, filtering, validating and applying business rules.
- Loading: the data into a data warehouse or any other database or application that houses data.
ETL – Basics
- The ETL process is very often referred to as a data integration process.
- An ETL tool is a data integration platform.
- The main goal of maintaining an ETL process in an organization is to migrate and transform data from the source database to the target database.
- The terms closely related to and managed by ETL processes are:
- Data migration
- Data management
- Data cleansing
- Data synchronization
- Data consolidation
Life Cycle of ETL
The typical real-life ETL cycle consists of the following execution steps:
- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates)
- Stage (load into staging tables)
- Audit reports (Are business rules met? Also in case of failure – helps to diagnose/repair).
- Publish (to target tables)
- Archive
- Clean up
Extract
Extract (extracts information from the source systems and places it in a staging table): the process of extracting the data from the source systems.
Transformation
Transformation is the process of changing attribute values based on business requirements. The source data is transformed as per the business requirements or transformation logic.
Types of Transformation
Active Transformation
An active transformation can change the number of rows that pass through it from source to target.
Passive Transformation
A passive transformation does not change the number of rows that pass through it.
Connected Transformation
A connected transformation is connected to other transformations or directly to the target table in the mapping.
Unconnected Transformation
An unconnected transformation is not connected to other transformations in the mapping. It is called within another transformation, and returns a value to that transformation.
Aggregator Transformation
Aggregator transformation is an Active and Connected transformation. This transformation is useful for performing calculations such as averages and sums, using aggregate functions such as AVG, FIRST, COUNT, PERCENTILE, MAX, SUM, etc.
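As a rough illustration (this is plain SQL, not Informatica syntax, and the table names are hypothetical), the work an Aggregator does corresponds to a grouped aggregation that returns one output row per group:

```sql
-- Roughly what an Aggregator transformation produces: one output row per group
SELECT product_key,
       COUNT(*)          AS order_count,
       SUM(sales_amount) AS total_sales,
       AVG(sales_amount) AS avg_sales
  FROM sales_fact
 GROUP BY product_key;
```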
Expression Transformation
Expression transformation is a Passive and Connected transformation. It can be used to calculate values in a single row before writing to the target. For example, to calculate the discount for each product or to concatenate first and last names.
Filter Transformation
Filter transformation is an Active and Connected transformation. It can be used to filter out rows in a mapping that do not meet the condition. For example, to find the products whose rate falls between $100 and $5000.
Joiner Transformation
Joiner transformation is an Active and Connected transformation. It can be used to join two sources coming from two different locations or from the same location. For example, to join a flat file and a relational source.
Lookup Transformation
Lookup transformation is Passive and can be either Connected or Unconnected. It is used to look up data in a relational table, view, or synonym. The lookup definition can be imported either from source or from target tables.
Rank Transformation
Rank transformation is an Active and Connected transformation. It is used to select the top or bottom rank of data. For example, to select top 10 Regions where the sales volume was very high.
Router Transformation
Router is an Active and Connected transformation. It is similar to the Filter transformation. The only difference is that the Filter transformation drops the rows that do not meet the condition, whereas the Router has an option to capture the rows that do not meet the condition.
Sequence Generator Transformation
Sequence Generator transformation is a Passive and Connected transformation. It is used to create unique primary key values or cycle through a sequential range of numbers or to replace missing keys.
Sorter Transformation
Sorter transformation is a Connected and Active transformation. It sorts data in either ascending or descending order according to a specified field. It can also be configured for case-sensitive sorting and to specify whether the output rows should be distinct.
Source Qualifier Transformation
Source Qualifier transformation is an Active and Connected transformation. When adding a relational or flat file source definition to a mapping, it must be connected to a Source Qualifier transformation. The Source Qualifier performs various tasks such as overriding the default SQL query, filtering records, and joining data from two or more tables.
Load
Load (loads the data into the DWH): the process of loading the data into the target database.
ETL Testing
As per the ETL data flow architecture, testing activities take place at two stages:
- Landing (Working Source for Testing) to Staging Area.
- Staging Area to Target Area.
The landing area gets a replicated data dump from the source, so for testing purposes the landing area data is treated as the source data.
The ETL process is responsible for moving data from the landing area to the staging area as per the business logic or transformation logic.
The ETL process is then responsible for loading the data from the staging area into the target database.
ETL Testing Vs Data Testing
Data Testing
Data testing or ETL testing means data validation at the database level to confirm the movement of data from one database to another.
The ETL process extracts the data from the source database, transforms it based on the business or transformation logic, and loads it into the target database.
The following data quality checks need to be done as part of data validation:
- Quantitative testing: data quantity validation between the inputs and outputs of the ETL process.
- Qualitative testing: data quality validation.
- Metadata and data integrity constraint checks.
Prerequisite for ETL Testing activities
- Requirement development: a proper mapping document for the ETL process, a technical process document, a business/functional requirement document, and an ETL architecture document.
Quantitative Testing
Data quantity validation is based on the total number of objects extracted by the ETL process, transformed, and loaded at the target.
As per the standard ETL data flow architecture, the following quantitative checks are required:
- Total object count considered valid source data by the ETL process vs. total object count available in the staging area after transformation by the ETL process.
- Total object count available in the staging area after transformation vs. total object count available in the target database after the ETL process loads the data.
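A minimal sketch of such a count reconciliation (the table names are hypothetical; Oracle's dual dummy table is used):

```sql
-- Record count in the staging area vs. record count loaded into the target
SELECT (SELECT COUNT(*) FROM stg_customer) AS staging_count,
       (SELECT COUNT(*) FROM dwh_customer) AS target_count
  FROM dual;   -- the two counts should match
```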
Qualitative Testing
Data quality checks cover the data of all attributes or fields of the tables.
At this level, data validation takes place for each attribute; it validates the data quality produced by the different ETL processes.
As per the standard ETL data flow architecture, the following data quality checks are required:
- The source data before the ETL process runs and the staging data after ETL transformation should be in synchronization, based on the requirements.
- The staging data and the target data after the ETL data load should be in synchronization, based on the requirements.
Example data validation method (say, in an Oracle database):
Source data (with all transformation rules applied)
MINUS
Target data
Output: the query should not return any rows; any rows returned indicate data mismatches.
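For example, a concrete hypothetical version of this check, applying the transformation rules to the source side and comparing against the target (table names and rules are assumptions of the sketch):

```sql
-- Rows present in the transformed source but missing or different in the target
SELECT customer_id, UPPER(customer_name) AS customer_name, TRIM(city) AS city
  FROM stg_customer            -- source/staging data with transformation rules applied
MINUS
SELECT customer_id, customer_name, city
  FROM dwh_customer;           -- target data; the query should return zero rows
```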
Metadata and data integrity constraints
The following standard data validations need to be performed:
- Primary key / mandatory value check
- Nullity / data type check
- Data integrity constraint checks at entity level (based on the requirement document, if any)
- Data integrity constraint checks at attribute level (based on the requirement document, if any)
- Data checks for SCD-related fields
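A few sketched examples of such checks in SQL (the table and column names are hypothetical):

```sql
-- Mandatory / primary key column must not contain NULLs
SELECT COUNT(*) AS null_keys
  FROM dwh_customer
 WHERE customer_id IS NULL;

-- Primary key column must not contain duplicates
SELECT customer_id, COUNT(*) AS dup_count
  FROM dwh_customer
 GROUP BY customer_id
HAVING COUNT(*) > 1;

-- Orphan check: every fact row must reference an existing dimension row
SELECT COUNT(*) AS orphan_rows
  FROM sales_fact f
  LEFT JOIN product_dim p ON p.product_key = f.product_key
 WHERE p.product_key IS NULL;
```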
For more about Data Warehouse
- Wiki : en.wikipedia.org/wiki/Data_warehouse
- searchsqlserver : searchsqlserver.techtarget.com/definition/data-warehouse
- tutorialspoint : www.tutorialspoint.com (DWH – Concepts)
- Guru99 : www.guru99.com/data-warehouse-architecture.html