Data Warehouse: Basic Concepts (I)



Common requirements of Data Warehouse systems

  • Must provide easy access to the organization's information
  • Show consistency in the display of information
  • Must be adaptive and resilient to change
  • Must safely keep information
  • Must serve as a foundation for improved decision making
  • Must be accepted by the business community in the enterprise


Common components of a Data Warehouse (Kimball)

Four components:
(a) Operational Source System
(b) Data Staging Area
(3) Data Presentation Area
(4) Data Access Tools

"One of the biggest threats to DWH success is confusing the components' roles and functions."
i.e. Don't give access to end users to the Staging Area. Don't build the DWH on a normalized design.




Characteristics of source operational systems

  • Main priorities of source operational system are processing performance and availability
  • Queries against source systems are narrow, one-record-at-a-time queries, and are severy restricted in their demands on the Operating System
  • Source systems are not queried in broad and unexpected ways: queries are tunned, specific, and known before hand.
  • Source systems maintain little historical data



(Q) what are +two key architectural requirements for the staging area?

  • (a) Staging area is off-limits to business users
  • (b) Does not provide query and presentation services
  • (c) Does not (should not) need to have a normalized structure to hold data


(Q) What are some common characteristics of the staging area?

  • Dominated by tasks such as sorting and sequential processing
  • Often not based on relational technology but instead on many flat files
  • Usually most of the tasks of extraction, transformation and loading (ETL) of data from operational systems into the data warehouse is concentrated in the staging area.
  • Tables in the staging area should be segregated from the "live" data warehouse, i.e., they should NOT be available for end users' queries
  • A basic implementation here is to have an identical schema to the one that exists in the source operational system(s) but with some structural changes to the tables, such as range partitioning.
  • Alternatively, in some implementations all data transformation processing is done “on the fly” as data is extracted from the source system before it is inserted directly into the Presentation area.


(Q) What are four common transformations done (usually in the staging area) to the extracted data during the ETL process?

  • (a) Cleansing (correct misspellings, resolve domain conflicts, deal with missing data, reformatting)
  • (b) Combining data from multiple sources
  • (c) Deduplicating data
  • (d) Assigning warehouse keys


(Q) What are some performance considerations when planning the loading stage of the ETL process?

  • The goal in this phase is to load the data into the warehouse in the most expedient manner.
  • Careful consideration should be given to where the data being-loaded resides and how you load it into the database.
  • For example, do not use a serial database link or a single JDBC connection to move large volumes of data.
  • Flat files are the most common and preferred mechanism of large volumes of data.

  • The overall speed of your load will be determined by
    • (A) how quickly the raw data can be read from staging area and
    • (B) how fast it can be processed and inserted into the database.
  • You should stage the raw data across as many physical disks as possible to ensure the reading it is not a bottleneck during the load.


(Q) What tasks are usually part of the loading of the Presentation area?

  • Load each data Mart
  • Index new data for query performance
  • Supply/compute appropriate aggregates (i.e. build or refresh Materialized Views)


(Q) What is a Data Mart? How it relates to the Presentation Area and to the Data Warehouse?

  • A Data Mart is a section of the overall presentation area, which usually consists of a series of integrated data marts.
  • At its simplest form, it presents the data from a single business process
  • Modern Data Marts may be updated. Changes in lables, hierarchies, status, and corporate ownership may trigger changes
  • These changes are managed-load updates, however.
  • The Data Warehouse presentation area in a large enterprise may consist of 20 or more similar-looking data marts.
  • Each Data mart has similar dimensional models.
  • Each data mart contains various fact tables, each having 5-15 dimension tables, that are often shared among the various fact tables.


(Q) What is dimensional modeling? How does it relate to 3NF models?

  • The key difference between dimensional and 3NF models is the degree of normalization.
  • Both types can be represented in E-R diagrams, but 3NF are intended to reduce data redundancy.
  • Dimensional model contains the same information as a normalized model.
  • Dimensional model's goals are understandability, query performance and resilience to change.
  • 3NF modeling boosts performance of operational systems: an update or insert transaction only needs to touch the database in one place.
  • UPDATES or INSERTS do not have to touch(read/write) many different structures in the database.
    3NF models are too complex for DWH queries.
  • Dimensional modelings is applicable to both relational and multidimensional databases.


(Q) What are some key goals of dimensional modeling?

Dimensional model's goals are (a) understandability, (b) query performance and (c) resilience to change.


(Q) What does the concept of metadata represent in a Data Warehouse context?

Metadata is all the information in the Data Warehouse environment that is NOT the actual data itself


(Q) What types of metadata are there?

  • (1) Operational source system metadata - Source schemas, copybooks that facilitate the extraction proces
  • (2) Staging area metadata - Used to guide the transformation and loading processes. Includes
    • staging file layouts
    • Target table layouts
    • transformation and cleaning rules
    • conformed dimension and fact definitions
    • aggregation definitions
    • ETL transmission schedules and run-log results
  • (3) DBMS metadata - System tables, partition settings, indexes and view definitions, security privileges and grants
  • (4) data access tools metadata - Identify (a) business names and definitions for the presentation area's tables and columns; (b) application template specifications; (c) access and usage statistics
  • (5) Security Settings metadata - From Source transactional data all the way to user desktops



(Q) Consider the vocabulary of dimensional modeling: What are Fact tables?

  • Fact tables are the primary tables in a dimensional model, and store numerical performance measurements of the business
  • Measurement data resulting from a business process should be stored in a single data mart.
  • A Fact represents a business measure.
  • A measurement is taken at the intersection of all the dimensions (day, product, store, channel), which defines the grain of the measurement
  • All measurements in a fact table should have the same grain.
  • The example below shows the SALES fact table and its associated dimension tables.
(Q) What types of facts can be stored in a fact table?
  • Additive - can be rulled up. (i.e. amount_sold, quantity_sold)
  • Semiadditive - can be added only along some of the dimensions
  • Nonadditive - Cannot be added.. Averages and counts are possible.
(Q)what are the categories in which fact table grains fall into?
  • Transaction - (i.e. amount_sold and quantity_sold
  • Periodic Snapshot
  • Accumulating Snapshot
  • In the image below, quantity and amount are transactional additive facts
(Q) What are Dimension tables?
  • Contain the textual descriptors of the business
  • Tables should have as many meaningful columns as possible
  • sometimes 50-100 attributes
  • attributes here are the primary source of query constraints, groupings, and report labels
  • Attributes often represent hierarchical relationships (i.e. product -> brand -> Category)
  • Dimension tables are typically highly denormalized
  • Check the attributes of TIMES and CUSTOMERS in the example below. Denormalization is extensive.
(Q) How dimension tables improve the quality of a Data Warehouse?
  • Dimension tables should have many and robust attributes, all relevant to the business.
  • Robust dimension attributes deliver robust analytic slicing and dicing capabilities
  • Dimensions implement the user interface to the Data Warehouse
(Q) what characteristics Dimension Attributes should have?
  • Attributes should consist of real words rather than cryptic abbreviations
(Q) When designing a Data Warehosue, how can you decide whether a data field extracted from an operational systems belongs to a Fact table or to a Dimension table?
  • Ask whether
    • (a) field is a measurement that takes on lots of values and participates in calculations (FACT)
    • (b) filed is a discretely valued description that is more or less constant and participate in constraints (DIMENSION)

No comments:

Post a Comment