Showing posts with label ETL. Show all posts
Showing posts with label ETL. Show all posts

Data Warehouse: Concepts on ETL Processes (Pivoting) (II)

Transforming presentation: What is a Pivoting operation?

  • It is a key technique in DWH: allow you to transform multiple rows of input into fewer and generally wider rows
  • Allow to transform the presentation of the data, rotating rows into columns and presenting data in a crosstabular format, or vice-versa.
  • When pivoting, an aggregation operator is applied for each item in the pivot column value list
  • Use the pivot_clause of the SELECT statement to write crosstabulation queries that rotate rows into columns, aggregating data in the process of the rotation.
  • Pivoting: Transactional format <=> Crosstabular (relational) format

Consider the example below:
  • The view SALES_VIEW displays data in a transactional format
  • It lists amount and quantity sold by product, country, quarter and channel.
  • In this case, a product may be sold through various channels in a given country and a given quarter of year.
  • If you need to see the sales by channel_id on a relational or crosstabular format, you can use a pivoting operation:
SQL> CREATE VIEW sales_view AS
   prod_name product, country_name country, channel_id channel,
   SUBSTR(calendar_quarter_desc, 6,2) quarter,
   SUM(amount_sold) amount_sold, SUM(quantity_sold) quantity_sold
 FROM sales, times, customers, countries, products
 WHERE sales.time_id = times.time_id AND
    sales.prod_id = products.prod_id AND
   sales.cust_id = customers.cust_id AND
   customers.country_id = countries.country_id
 GROUP BY prod_name, country_name, channel_id,
 SUBSTR(calendar_quarter_desc, 6, 2);
  • Using the PIVOT clause you can reconstruct the data on a cross tabular form:
 (SELECT product, channel, amount_sold
   FROM sales_view
   WHERE country = 'United States of America' 
 ) S PIVOT (SUM(amount_sold)
                   5 AS CATALOG_SALES, 9 AS TELESALES))
ORDER BY product;

PRODUCT                   DIRECT_SALES           INTERNET_SALES         CATALOG_SALES          TELESALES           
------------------------- ---------------------- ---------------------- ---------------------- ------------------ 
Home Theatre Package...   2052150.1              310431.22                                                            
Internal 6X CD-ROM        135284.16              15839.73                                                             
Internal 8X CD-ROM        158157.23              24398.22                                                             
Keyboard Wrist Rest       110554                 22477.41                                      1354.87                
 71 rows selected

Oracle Data Intergrator (ODI)

Data Integration and ODI - Concepts

Integration projects can be:
  • Data Oriented, especially those involving large data volumes.
  • Event Oriented - Based on an Event Driven Architecture (EDA) or a
  • Service Oriented - Suitable to a Service Oriented Architecture (SOA), for asynchronous or synchronous integration.

Oracle Data Integrator:
  • combines all the elements of data integration: data movement, data synchronization, data quality, data management, and data services.
  • transforms large volumes of data efficiently,
  • processes events in real time through its advanced Changed Data Capture (CDC) capability, and
  • provides data services to the Oracle SOA Suite.
  • provides robust data integrity control features, assuring the consistency and correctness of data

ODI Studio - Interface and Concepts
Declarative Design
  • Declarative Design is a design method that focuses on “What” to do (the Declarative Rules) rather than “How” to do it (the Process).
  • Declarative Design in ODI uses the well known relational paradigm to declare in the form of an Interface the declarative rules for a data integration task, which includes designation of sources, targets, and transformations.
  • Declarative rules often apply to metadata to transform data and are usually described in natural language by business users.
  • In a typical data integration project (such as a Data Warehouse project), these rules are defined during the specification phase in documents written by business analysts in conjunction with project managers.
  • They can very often be implemented using SQL expressions, provided that the metadata they refer to is known and qualified in a metadata repository.
  • Types of declarative rules: Mappings, Joins, Filters and Constraints

Integration Project

Suggested steps for the life cycle of an integration project:
1. Create a new project and import Knowledge Modules.
2. Define the project organization and practises using folders, markers and documentation.
3. Create reusable components: interfaces, procedures, variables, sequences. Perform unitary tests.
4. Assemble these components into packages. Perform integration tests.
5. Release the work in scenarios

Package is a workflow, made up of a sequence of steps organized into an execution diagram.
Each step can either succeed or fail its execution.
Packages assemble and reference other component from a project (interfaces, procedures, variables, etc)
Types of steps: Flow (Interface), Procedue, Variable, ODI tools (allow access to API commands, OS system calls), models, sub-models, and Datastores.

The Package above (Load Customers) specifies a workflow that loads a sequence of tables:
(1) Initialization step: Execute procedure Initialization (type Procedure)
(2) Execute interface Load Countries (type: Flow(interface)), linked object: TRG_COUNTRY
(3) Execute interface Load Regions (type: Flow(interface)), linked object: TRG_REGION
(4) Execute interface Load Cities (type: Flow(interface)), linked object: TRG_CITY
(5) refresh variable "CUST_FLAG". Load Customers? (type: Evaluate Variable), linked objects: CUST_FLAG
IF CUST_FLAG = 1 (true) THEN Load Customers, ELSE Final Step
(6) Final Step (type procedure)
(7) If any of the steps above fails,the Package runs the "Send Admin email" step that sends an email using an ODI tool.

It is set of declarative rules that describe the loading of a datastore or a temporary target structure from one or more source datastores.
An interface is a reusable dataflow.
Components of an Integration Interface:
  • Target Datastore - element that will be loaded by the interface.
    • May be permanent (defined in a model) or temporary (created by the interface)
  • Datasets - One target is loaded with data coming from various datasets
    • each dataset corresponds to one diagram of source datastores and the mappings used to load the target datastore from these source datastores.
  • Diagram of Source Datastores
  • Mapping - defines the transformations performed on one or various source coluns to load one target column
  • Staging area
  • Flow
  • Control

In the example above,
  • The interface defines a set of rules to load the target TRG_CUSTOMER Datastore with data coming from three source datastores (SRC_AGEGROUP, SRC_CUSTOMER, SRC_SALES_PERSON)
  • It defines a Join between SRC_CUSTOMER and SRC_SALES_PERSON
  • Define Mappings between the attributes in the sources datastores and attributes in the target datastore

Interface dataflow
  • Automatically generated by ODI from the declarative rules in the interface
  • Based on the configuration files and context (target, sources, where is the staging area, etc)
  • ODI automatically computes the flow depending on the configuration in the interface's diagram.
  • It proposes default KMs for the data flow.
  • The Flow tab enables you to view the data flow and select the KMs used to load and integrate data.

In the flow, the following items appear:
Source Sets:
  • Source Datastores that are within the same dataset, located on the same physical data server and which are joined with Joins located on the Source are grouped in a single source set in the flow diagram.
  • A source set represents a group of datastores that can be extracted at the same time.
  • Datasets appear as yellow boxes in the Staging Area.
Staging Area:
  • Appears as a box that includes the different datasets, the target (if located on the same data server), and possibly some of the sources (if located on
    the same data server)
  • Appears as a separate box if it is located in a different schema from the staging area

You use the following KMs in the flow:
  • Loading KM (LKM):
    • Define how data is moved.
    • One LKM is selected for each Source Set for moving data from the sources to the staging area.
    • It can be also selected to move data from the Staging Area - when different from the Target - to the Target, when a
      single technology IKM is selected for the Staging Area.
  • Integration KM (IKM): They define how data is integrated into the target. One IKM is typically
    selected on the Target. When the staging area is different from the target, the
    selected IKM can be a multi-technology IKM that moves and integrates data from
    the Staging Area into the Target.

A Procedure is a reusable component that groups a sequence of operations that do not fit in the interface concept.
Examples of procedures:
■ wait and unzip a file
■ send a batch of files via FTP
■ receive emails
■ purge a database

Knowledge Modules
  • Knowledge Modules (KM) implement “how” the integration processes occur
  • KMs are components of ODI Open Connector technology.
  • Contain the knowledge required by ODI to perform a specific set of tasks against a specific technology or set of technologies.
  • When starting an integration project, you must import the Knowledge Module appropriate for your project.
  • It is possible to import additional KMs after setting up the project, and it is possible to change the KMs used afterwards
  • Start with Generic KMs. The SQL KMs work with almost all technologies
  • Start with simple KMs.
  • Select KMs that match your source/target combinations to increase performance
  • Select KMs according to your infrastructure limitations
  • Select JKMs and SKMs only if you need them
  • Encrypting a Knowledge Module (KM) allows you to protect valuable code
  • Oracle Data Integrator uses a DES Encryption algorithm based on a personal encryption key.
  • This key can be saved in a file and can be reused to perform encryption or decryption operations.

ODI uses six (6) types of Knowledge Modules:
RKM (Reverse Knowledge Modules) Perform a customized reverse-engineering of data models for a specific technology.
■ RKMs are used in data models.
RKM <reversed technology> [(reverse method)]
i.e. RKM Oracle, RKM Netezza

LKM (Loading Knowledge Modules) are used to extract data from source systems (files, middleware, database, etc.).
■ LKMs are used in interfaces.
LKM <source technology> to <target technology> [(loading method)]
i.e. LKM Oracle to Oracle (DBLink), LKM File to Oracle (SQLLDR), LKM SQL to SQL, etc

JKM (Journalizing Knowledge Modules) are used to create a journal of data modifications (insert, update and delete) of the source databases to keep track of
the changes.
■ JKMs are used in data models and used for Changed Data Capture.
JKM <journalized technology> <journalizing mode> (<journalizing method>)
JKM Oracle11g Consistent (Streams)
JFM Oracle Simple
JKM DB2 400 Simple (journal)

IKM (Integration Knowledge Modules) are used to integrate (load) data to the target tables.
■ IKMs are used in interfaces.
IKM [<staging technology>] <target technology> [<integration mode>] [(<integration method>)].
i.e. IKM Oracle Incremental Update (MERGE)
IKM SQL to File Append
IKM SQL Incremental Update
IKM SQL to SQL Append

CKM (Check Knowledge Modules) are used to check that constraints on the sources and targets are not violated.
■ CKMs are used in data model’s static check and interfaces flow checks.
CKM <staging technology>
i.e. CKM SQL, CKM Oracle

SKM (Service Knowledge Modules) are used to generate the code required for creating data services.
■ SKMs are used in data models.

Data Warehouse: Concepts on ETL Processes (Transforming)

Data Warehouse: Four (4) typical transformation and loading scenarios

(a) Key Lookup Scenario
(b) Business Rule Violation Scenario
(c) Data error scenario
(d) Pivoting Scenario

(a) key Lookup Scenario
  • You need to load sales transaction data into a retail Data Warehouse
  • In the operational system, each product is identified by Uniform Price Codes (UPCs);
  • In the Data Warehouse, product is identified by product_id.
  • You need to
(1) Map UPC --> PRODUCT_ID; and
(2)As you load the source data, lookup the map table and transform the data accordingly

(b) Business Rule Violation Scenario
  • Some UPC Codes are invalid
  • In which ways can you implement the transformation and handle the invalid UPCs?
  • i.e: new sales data might not have valid UPC codes
(1) use Create Table As Select (CTAS) to identify and log invalid UPCs
(2) use an outer join (+) and convert invalid UPCs to NULL
(3) use multi-table INSERT and separate the invalid UPCs to a different table

(1) use CTAS to identify and log invalid UPCs

(2) use an outer join (+) and convert invalid UPCs to NULL

(3)use multi-table INSERT and separate the invalid UPCs to a different table

(c) Data Error Scenarios
  • If the quality of the data is unknown, you may face unexpected errors, such as a a failed datatype conversion, or a constraint violation.
  • Such errors can happen independently of whether the UPC for the particular record is or not valid.
  • Here you can use a DML error logging table to capture unexpected errors.

Data Warehouse with Oracle Data Integrator (ODI)

Building a Data Warehouse: typical steps and milestones

  • Research and specify business needs (Key Indicators)
  • Identify data sources relevant to generate key indicators
  • Define business rules to transform source information into key indicators
  • Model the data structure of the target warehouse to store the key indicators
  • Populate the indicators by implementing business rules
  • Measure the overall accuracy of the data by setting up data quality rules
  • Develop reports on key indicators
  • Make key indicators and metadata available to business users through ad-hoc query tools or predefined reports
  • Measure business users’ satisfaction and add/modify key indicators

Using Oracle Data Integrator (ODI) in a Data Warehouse project: actors involved and some assigned tasks

Business User
  • Access the final calculated key indicators
  • Use reports and ad-hoc queries
  • May need to understand the definition of an indicator
  • May need to be aware of data quality issues
    • when was the last time the table was updated?
    • How many records were added, update, removed in the table?
    • What are the rules that calculate a particular indicator?
    • Where does the data come from, and how is it transformed?

Business Analyst
  • Define the key indicators
  • Identify the source applications
    • How many different source applications need to be considered?
    • Is the data needed for key indicators available in the selected pool of source applications?
    • What data quality issues are present in the source systems?
  • Specify business rules to transform source data into meaningful target indicators
    • Projects can use the ODI Designer to directly specify the business rules
    • For each target table, specify also:
      • Target datastore - name the target datastore
      • Description of transformation - describe its purpose
      • Integration strategy - How data should be written to target (replace table, append, update, etc). Each strategy specified will correspond to an ODI Integration Knowledge Module.
      • Refresh frequency
      • Dependencies - what datastores need to be loaded or jobs executed prior to this one
      • Source datastores - source databases, applications, etc used
      • Source ODI Model -
      • Datastore name, Fiel Mappings and transformations, Links or Join criteria, Filters, Data Quality requirements, constraint names and expressions, etc
  • Maintain translation data from operational semantics to the Data Warehouse semantic

  • Implement the business rules as specified by business analysts
  • Provide executable scenarios to the production team
  • Must understand infrastructure details and have business knowledge of source applications

Metadata Administrator
  • Reverse engineer source and target applications
    • Understand content and structure of source applications
    • Connect to source applications and capture their metadata
    • Define data quality business rules (specified by Business Analyst) in ODI repository
      • What level of Data Quality is required?
      • who are the business owners of source data?
      • What should be done with rejected records?
      • There should be an error recycling strategy?
      • How would business users modify erroneous source data?
      • Should a GUI be provided for source data correction?
  • Guarantee the overall consistency of Metadata in the various environments (dev, test, prod: repository)
  • Participate in the data modeling of key indicators
  • Add comments, descriptions and integrity rules (PK, FK, Check, etc)in the metadata
  • Provide version management

Database Administrator
  • Define technical database structure supporting the various layers in the data warehouse project (and ODI structure)
  • Create database profiles needed by ODI
  • Create schemas and databases for the staging areas
    • Describe columns inside the data dictionary of database - COMMENT ON TABLE/COLUMN
    • Avoid using PKs of source systems as PKs in target tables. Use counter or identity columns instead.
    • Design Referential Integrity and reverse engineerd FKs in ODI models.
    • Do not implement RIs on target database (for performance). Data quality control should guarantee data integrity.
    • Standardize obejct naming conventions
  • Distributed and maintain the descriptions of the environments (topology)

System Admin
  • Maintain technical resources for the Data Warehouse project
  • May install and monitor schedule agents
  • May backup and restore repositories
  • Install and monitor ODI console
  • Setup the various enviroments (dev, test, prod)

Security Admin
  • Define the security policy for the ODI Repository
  • Creates ODI users and grant rights on models, projects and contexts

  • Import released and tested scenarios into production environment
  • Schedule the execution of production scenarios
  • Monitor execution logs and restart failed sessions

Conceptual Architecture of an ODI solution

  • The ODI Repository is the central component.
  • ODI Repository stores configuration information about
    • IT infrastructure and topology
    • metadata of all applications
    • projects
    • Interfaces
    • Models
    • Packages
    • Scenarios
    • Solutions - versioning control
  • You can have various repositories within an IT infrastructure, linked ti separated environments that exchange metadata and scenarios (i.e. development, test, user acceptance test, production, hot fix)

Example: two environments, components, actors and tasks

(source: Oracle 2008)

Suggested environment configuration for a Data Warehouse project with ODI:

(1) A single master repository:
  • holds all the topology and security information.
  • All the work repositories are registered here.
  • Contains all the versions of objects that are committed by the designers.

(2) “Development” work repository:
  • Shared by all ODI designers
  • Holds all the projects and models under development.

(3) “Testing” work repository
  • shared by the IT testing team.
  • Contains all the projects and models being tested for future release.

(4) “User Acceptance Tests” work repository:
  • shared by the IT testing team and business analysts.
  • Contains all the projects and models about to be released.
  • Business analysts will use the ODI Console on top of this repository to validate the scenarios and transformations before releasing them to production.

(5) “Production” work repository
  • shared by the production team, the operators and the business analysts.
  • Contains all the projects and models in read-only mode for metadata lineage, as well as all the released scenarios.

(6) “Hot fix” work repository:
  • shared by the maintenance team and the development team.
  • Usually empty, but whenever a critical error happens in production, the maintenance team restores the corresponding projects and models in this repository and performs the corrections with the help of the development team.
  • Once the problems are solved, the scenarios are released directly to the production repository and the new models and projects are versioned in the master repository.

Typical environment (I):

(source: Oracle 2010)

Typical environment (II): Separate Master Repository for Production

(source: Oracle 2010)

Using Oracle External Tables

  • External tables do not reside in the database, and can be in any format for which an access driver is provided.
  • You can select, join, or sort external table data.
  • You can also create views and synonyms for external tables.
  • You CANNOT execute DML operations (UPDATE, INSERT, or DELETE) , and CANNOT create indexes on external tables.

External tables and Data Warehouse
  • Exporting data from a database:
    • External tables provide a framework to unload the result of an arbitrary SELECT statement into a platform-independent Oracle-proprietary format that can be used by Oracle Data Pump.
  • External tables provide a valuable means for performing basic extraction, transformation, and loading (ETL) tasks that are common for data warehousing.

Creating External tables
  • This statement creates only metadata in the data dictionary.
  • External tables can be thought of as views that allows running any SQL query against external data without requiring that the external data first be loaded into the database.

Accessing External Data
  • An access driver is the actual mechanism used to read the external data in the table.
  • When you use external tables to unload data, the metadata is automatically created based on the data types in the SELECT statement.

Oracle Database provides two access drivers for external tables:
  • ORACLE_LOADER driver (default)
    • allows the reading of data from external files using the Oracle loader technology.
    • provides data mapping capabilities which are a subset of the control file syntax of SQL*Loader utility.
    • lets you unload data: read data from the database and insert it into an external table, represented by one or more external files—and then reload it into an Oracle Database.

Scenario 1:
Unload data to flat files and create External Table.

Step 1 - Export data from table sh.customers

/* Step 1 - Export data from table sh.customers 
   Procedure UNLOAD_DATA receives table_name and table_owner as parameter and 
     unloads the table data to a flat file */

set serveroutput on

  procedure unload_data( 
       p_table_name in varchar2,
       p_owner  in varchar2)
   sqltext varchar2(1000);
   v_textrecord varchar2(2000);
   cv sys_refcursor;

   fdest utl_file.file_type;
   type FileAttrRec is Record (
                      vfilexists BOOLEAN,
                      vfilelength number,
                      vblocksize  binary_integer);
   vfilerec fileattrrec;
   vfilename varchar2(30);

    procedure get_sql (
       p_sqltext in out varchar2,
       p_table_name in varchar2,
       p_owner in varchar2) 
      cursor c is
        select column_name
        from dba_tab_columns
        where table_name = p_table_name and owner = p_owner
        order by column_id;
      i integer;
      sql_stmt varchar2(2000);
      from_clause varchar2(100);
      i :=0;
      sql_stmt := 'SELECT ';
      from_clause :=  'FROM ' || p_owner ||'.' || p_table_name;
      for r in c loop
        if i = 0 then
          sql_stmt := sql_stmt || ' ' || r.column_name;
          i := 1;
          sql_stmt := sql_stmt || '|| ''|''|| ' || r.column_name;
        end if;
      end loop;
      sql_stmt := sql_stmt || ' as qrystr ' || chr(10) || from_clause;
      p_sqltext := sql_stmt;
      -- dbms_output.put_line( sql_stmt);
    end get_sql; 

  -- 1. Get Query SQL
   sqltext := '';  
   get_sql(sqltext, p_table_name, p_owner);
   --dbms_output.put_line('Select stmt:' || sqltext); 

  -- 2. Open cursor for query. Write each row in the OS file
       vfilename := 'unload_' || p_owner ||'_'|| p_table_name || '.dat';
       utl_file.fgetattr('DATA_DUMP_DIR', vfilename, vfilerec.vfilexists, 
                              vfilerec.vfilelength, vfilerec.vblocksize);
       if vfilerec.vfilexists then 
          fdest := utl_file.fopen('DATA_PUMP_DIR', vfilename, 'a', 2048);
          dbms_output.put_line('Destination file exists. Appending..'); 
          fdest := utl_file.fopen('DATA_PUMP_DIR', vfilename, 'w', 2048);
       end if;
   open cv for sqltext;
       fetch cv into v_textrecord;
       utl_file.put_line(fdest, v_textrecord, true);
   close cv; 
  END unload_data;

 unload_data('CUSTOMERS' , 'SH');

Step 2 - Split unloaded file and create external table using the resulting files

$ ls -ltr
total 482928
-rw-r--r-- 1 oracle oinstall  10661903 2012-03-20 22:01 unload_SH_CUSTOMERS.dat

-- spliting the file in two smaller files
$ split -30000 unload_SH_CUSTOMERS.dat unload_SH_CUSTOMERS.dat
$ ls -ltr
total 503760
-rw-r--r-- 1 oracle oinstall  10661903 2012-03-20 22:01 unload_SH_CUSTOMERS.dat
-rw-r--r-- 1 oracle oinstall   5768182 2012-03-20 22:15 unload_SH_CUSTOMERS.dataa
-rw-r--r-- 1 oracle oinstall   4893721 2012-03-20 22:15 unload_SH_CUSTOMERS.datab

 Step 2 - Create external table using unloaded file(s)
 -- create the external table
SQL>  Create table ext_customers1
         "CUST_ID"                NUMBER,
         "CUST_FIRST_NAME"        VARCHAR2(20 BYTE),
         "CUST_LAST_NAME"         VARCHAR2(40 BYTE),
         "CUST_GENDER"            CHAR(1 BYTE),
         "CUST_YEAR_OF_BIRTH"     NUMBER(4,0),
         "CUST_POSTAL_CODE"       VARCHAR2(10 BYTE),
         "CUST_CITY"              VARCHAR2(30 BYTE),
         "CUST_CITY_ID"           NUMBER,
         "COUNTRY_ID"             NUMBER,
         "CUST_INCOME_LEVEL"      VARCHAR2(30 BYTE),
         "CUST_EMAIL"             VARCHAR2(30 BYTE),
         "CUST_TOTAL"             VARCHAR2(14 BYTE),
         "CUST_TOTAL_ID"          NUMBER,
         "CUST_SRC_ID"            NUMBER,
         "CUST_EFF_FROM" DATE,
         "CUST_EFF_TO" DATE,
     DEFAULT DIRECTORY data_pump_dir
          records delimited by newline
          badfile data_pump_dir:'cust1ext%a_%p.bad' 
          logfile data_pump_dir:'cust1ext%a_%p.log'
          fields terminated by '|'
          missing field values are null
     LOCATION ('unload_SH_CUSTOMERS.dataa', 'unload_SH_CUSTOMERS.datab')

table EXT_CUSTOMERS1 created.

SQL> select table_name, type_name, default_directory_name, reject_limit, access_parameters, property 
 from dba_external_tables;
--------------- --------------- ------------------------ --------------- ------------------------------------------
EXT_CUSTOMERS1   ORACLE_LOADER   DATA_PUMP_DIR            UNLIMITED       records delimited by newline             
                                                                          badfile data_pump_dir:'cust1ext%a_%p.bad'
                                                                          logfile data_pump_dir:'cust1ext%a_%p.log' 
                                                                          fields terminated by '|'                  
                                                                          missing filed values are null             

SQL> select count(*) from ext_customers1;


Scenario 2: Unloading and Loading Data with the ORACLE_DATAPUMP Access Driver

Step 1
  • This will create a file with the data resulting from the specified query.
  • The file is created with a binary format that can only be read by the ORACLE_DATAPUMP access driver
  • The example below creates an external table and populates the dump file for the external table with the data from table sh.customers.

connect sh/**;

SQL> create table customers1_xt
  organization external
    type oracle_datapump
    default directory data_pump_dir
    location ('customers1_xt.dmp')
  as select * from sh.customers;

table created.


(b) Check the OS files created: 

$ ls -ltr
total 60672
-rw-r----- 1 oracle oinstall 10334208 2012-03-21 00:33 customers1_xt.dmp
-rw-r--r-- 1 oracle oinstall      123 2012-03-21 00:36 CUSTOMERS1_XT_7165.log

(c) Compare descriptions of customers and customers1_xt tables.

SQL> desc customers;

Name                   Null     Type         
---------------------- -------- ------------ 
CUST_ID                NOT NULL NUMBER       
CUST_GENDER            NOT NULL CHAR(1)      
CUST_CITY              NOT NULL VARCHAR2(30) 
COUNTRY_ID             NOT NULL NUMBER       
CUST_INCOME_LEVEL               VARCHAR2(30) 
CUST_CREDIT_LIMIT               NUMBER       
CUST_EMAIL                      VARCHAR2(30) 
CUST_SRC_ID                     NUMBER       
CUST_EFF_FROM                   DATE         
CUST_EFF_TO                     DATE         
CUST_VALID                      VARCHAR2(1)

SQL> desc customers1_xt;

Name                   Null     Type         
---------------------- -------- ------------ 
CUST_ID                NOT NULL NUMBER       
CUST_GENDER            NOT NULL CHAR(1)      
CUST_CITY              NOT NULL VARCHAR2(30) 
COUNTRY_ID             NOT NULL NUMBER       
CUST_INCOME_LEVEL               VARCHAR2(30) 
CUST_CREDIT_LIMIT               NUMBER       
CUST_EMAIL                      VARCHAR2(30) 
CUST_SRC_ID                     NUMBER       
CUST_EFF_FROM                   DATE         
CUST_EFF_TO                     DATE         
CUST_VALID                      VARCHAR2(1)  

(d) Check data on customers1_xt and compare with original table

SQL> select count(*) from customers1_xt;


SQL> select * from customers minus select * from customers1_xt;

no rows selected

(e) You CANNOT perform DML on the external table:

SQL> delete from customers1_xt where cust_id = 100055;
delete from customers1_xt where cust_id = 100055
ERROR at line 1:
ORA-30657: operation not supported on external organized table

(f) You can also Create an external table with multiple dump files and with a degree of parallelism higher than one.

SQL> create table customers1_xtp3
 organization external
   type oracle_datapump
   default directory data_pump_dir
   location('customers1_xtp31.dmp' ,'customers1_xtp32.dmp' ,'customers1_xtp33.dmp' )
 parallel 3
 as select * from sh.customers;

$ ls -ltr
total 60672
-rw-r----- 1 oracle oinstall 10334208 2012-03-21 00:33 customers1_xt.dmp
-rw-r--r-- 1 oracle oinstall      123 2012-03-21 00:36 CUSTOMERS1_XT_7165.log
-rw-r--r-- 1 oracle oinstall       41 2012-03-21 10:13 CUSTOMERS1_XTP3_2690.log
-rw-r--r-- 1 oracle oinstall       41 2012-03-21 10:13 CUSTOMERS1_XTP3_2737.log
-rw-r--r-- 1 oracle oinstall       41 2012-03-21 10:13 CUSTOMERS1_XTP3_2739.log
-rw-r--r-- 1 oracle oinstall       41 2012-03-21 10:13 CUSTOMERS1_XTP3_2741.log
-rw-r----- 1 oracle oinstall  2990080 2012-03-21 10:13 customers1_xtp32.dmp
-rw-r----- 1 oracle oinstall  3792896 2012-03-21 10:13 customers1_xtp33.dmp
-rw-r----- 1 oracle oinstall  3592192 2012-03-21 10:13 customers1_xtp31.dmp

(g) You can transfer or copy the dump file(s) and use them for another external table either in the same or in a different database. 

SQL> Create table customers2_xt
         "CUST_ID"                NUMBER,
         "CUST_FIRST_NAME"        VARCHAR2(20 BYTE),
         "CUST_LAST_NAME"         VARCHAR2(40 BYTE),
         "CUST_GENDER"            CHAR(1 BYTE),
         "CUST_YEAR_OF_BIRTH"     NUMBER(4,0),
         "CUST_POSTAL_CODE"       VARCHAR2(10 BYTE),
         "CUST_CITY"              VARCHAR2(30 BYTE),
         "CUST_CITY_ID"           NUMBER,
         "COUNTRY_ID"             NUMBER,
         "CUST_INCOME_LEVEL"      VARCHAR2(30 BYTE),
         "CUST_EMAIL"             VARCHAR2(30 BYTE),
         "CUST_TOTAL"             VARCHAR2(14 BYTE),
         "CUST_TOTAL_ID"          NUMBER,
         "CUST_SRC_ID"            NUMBER,
         "CUST_EFF_FROM" DATE,
         "CUST_EFF_TO" DATE,
     DEFAULT DIRECTORY data_pump_dir
     LOCATION ('customers1_xtp31.dmp' ,'customers1_xtp32.dmp' ,'customers1_xtp33.dmp')

SQL> select count(*) from customers2_xt;


SQL> Select * from sh.customers minus select * from customers2_xt;

no rows selected.

Data Warehouse: Basic Concepts (I)

Common requirements of Data Warehouse systems

  • Must provide easy access to the organization's information
  • Show consistency in the display of information
  • Must be adaptive and resilient to change
  • Must safely keep information
  • Must serve as a foundation for improved decision making
  • Must be accepted by the business community in the enterprise

Common components of a Data Warehouse (Kimball)

Four components:
(a) Operational Source System
(b) Data Staging Area
(3) Data Presentation Area
(4) Data Access Tools

"One of the biggest threats to DWH success is confusing the components' roles and functions."
i.e. Don't give access to end users to the Staging Area. Don't build the DWH on a normalized design.

Characteristics of source operational systems

  • Main priorities of source operational system are processing performance and availability
  • Queries against source systems are narrow, one-record-at-a-time queries, and are severy restricted in their demands on the Operating System
  • Source systems are not queried in broad and unexpected ways: queries are tunned, specific, and known before hand.
  • Source systems maintain little historical data

(Q) what are +two key architectural requirements for the staging area?

  • (a) Staging area is off-limits to business users
  • (b) Does not provide query and presentation services
  • (c) Does not (should not) need to have a normalized structure to hold data

(Q) What are some common characteristics of the staging area?

  • Dominated by tasks such as sorting and sequential processing
  • Often not based on relational technology but instead on many flat files
  • Usually most of the tasks of extraction, transformation and loading (ETL) of data from operational systems into the data warehouse is concentrated in the staging area.
  • Tables in the staging area should be segregated from the "live" data warehouse, i.e., they should NOT be available for end users' queries
  • A basic implementation here is to have an identical schema to the one that exists in the source operational system(s) but with some structural changes to the tables, such as range partitioning.
  • Alternatively, in some implementations all data transformation processing is done “on the fly” as data is extracted from the source system before it is inserted directly into the Presentation area.

(Q) What are four common transformations done (usually in the staging area) to the extracted data during the ETL process?

  • (a) Cleansing (correct misspellings, resolve domain conflicts, deal with missing data, reformatting)
  • (b) Combining data from multiple sources
  • (c) Deduplicating data
  • (d) Assigning warehouse keys

(Q) What are some performance considerations when planning the loading stage of the ETL process?

  • The goal in this phase is to load the data into the warehouse in the most expedient manner.
  • Careful consideration should be given to where the data being-loaded resides and how you load it into the database.
  • For example, do not use a serial database link or a single JDBC connection to move large volumes of data.
  • Flat files are the most common and preferred mechanism of large volumes of data.

  • The overall speed of your load will be determined by
    • (A) how quickly the raw data can be read from staging area and
    • (B) how fast it can be processed and inserted into the database.
  • You should stage the raw data across as many physical disks as possible to ensure the reading it is not a bottleneck during the load.

(Q) What tasks are usually part of the loading of the Presentation area?

  • Load each data Mart
  • Index new data for query performance
  • Supply/compute appropriate aggregates (i.e. build or refresh Materialized Views)

(Q) What is a Data Mart? How it relates to the Presentation Area and to the Data Warehouse?

  • A Data Mart is a section of the overall presentation area, which usually consists of a series of integrated data marts.
  • At its simplest form, it presents the data from a single business process
  • Modern Data Marts may be updated. Changes in lables, hierarchies, status, and corporate ownership may trigger changes
  • These changes are managed-load updates, however.
  • The Data Warehouse presentation area in a large enterprise may consist of 20 or more similar-looking data marts.
  • Each Data mart has similar dimensional models.
  • Each data mart contains various fact tables, each having 5-15 dimension tables, that are often shared among the various fact tables.

(Q) What is dimensional modeling? How does it relate to 3NF models?

  • The key difference between dimensional and 3NF models is the degree of normalization.
  • Both types can be represented in E-R diagrams, but 3NF are intended to reduce data redundancy.
  • Dimensional model contains the same information as a normalized model.
  • Dimensional model's goals are understandability, query performance and resilience to change.
  • 3NF modeling boosts performance of operational systems: an update or insert transaction only needs to touch the database in one place.
  • UPDATES or INSERTS do not have to touch(read/write) many different structures in the database.
    3NF models are too complex for DWH queries.
  • Dimensional modelings is applicable to both relational and multidimensional databases.

(Q) What are some key goals of dimensional modeling?

Dimensional model's goals are (a) understandability, (b) query performance and (c) resilience to change.

(Q) What does the concept of metadata represent in a Data Warehouse context?

Metadata is all the information in the Data Warehouse environment that is NOT the actual data itself

(Q) What types of metadata are there?

  • (1) Operational source system metadata - Source schemas, copybooks that facilitate the extraction proces
  • (2) Staging area metadata - Used to guide the transformation and loading processes. Includes
    • staging file layouts
    • Target table layouts
    • transformation and cleaning rules
    • conformed dimension and fact definitions
    • aggregation definitions
    • ETL transmission schedules and run-log results
  • (3) DBMS metadata - System tables, partition settings, indexes and view definitions, security privileges and grants
  • (4) data access tools metadata - Identify (a) business names and definitions for the presentation area's tables and columns; (b) application template specifications; (c) access and usage statistics
  • (5) Security Settings metadata - From Source transactional data all the way to user desktops

(Q) Consider the vocabulary of dimensional modeling: What are Fact tables?

  • Fact tables are the primary tables in a dimensional model, and store numerical performance measurements of the business
  • Measurement data resulting from a business process should be stored in a single data mart.
  • A Fact represents a business measure.
  • A measurement is taken at the intersection of all the dimensions (day, product, store, channel), which defines the grain of the measurement
  • All measurements in a fact table should have the same grain.
  • The example below shows the SALES fact table and its associated dimension tables.
(Q) What types of facts can be stored in a fact table?
  • Additive - can be rulled up. (i.e. amount_sold, quantity_sold)
  • Semiadditive - can be added only along some of the dimensions
  • Nonadditive - Cannot be added.. Averages and counts are possible.
(Q)what are the categories in which fact table grains fall into?
  • Transaction - (i.e. amount_sold and quantity_sold
  • Periodic Snapshot
  • Accumulating Snapshot
  • In the image below, quantity and amount are transactional additive facts
(Q) What are Dimension tables?
  • Contain the textual descriptors of the business
  • Tables should have as many meaningful columns as possible
  • sometimes 50-100 attributes
  • attributes here are the primary source of query constraints, groupings, and report labels
  • Attributes often represent hierarchical relationships (i.e. product -> brand -> Category)
  • Dimension tables are typically highly denormalized
  • Check the attributes of TIMES and CUSTOMERS in the example below. Denormalization is extensive.
(Q) How dimension tables improve the quality of a Data Warehouse?
  • Dimension tables should have many and robust attributes, all relevant to the business.
  • Robust dimension attributes deliver robust analytic slicing and dicing capabilities
  • Dimensions implement the user interface to the Data Warehouse
(Q) what characteristics Dimension Attributes should have?
  • Attributes should consist of real words rather than cryptic abbreviations
(Q) When designing a Data Warehosue, how can you decide whether a data field extracted from an operational systems belongs to a Fact table or to a Dimension table?
  • Ask whether
    • (a) field is a measurement that takes on lots of values and participates in calculations (FACT)
    • (b) filed is a discretely valued description that is more or less constant and participate in constraints (DIMENSION)