dbms-notes: writing blocks to disk: 2012

ON SQL functions

Building examples of using SQL functions supported by Oracle
Check the post here (in construction)

Data Warehouse: Concepts on ETL Processes (Pivoting) (II)

Transforming presentation: What is a Pivoting operation?

It is a key technique in DWH: allow you to transform multiple rows of input into fewer and generally wider rows

Allow to transform the presentation of the data, rotating rows into columns and presenting data in a crosstabular format, or vice-versa.

When pivoting, an aggregation operator is applied for each item in the pivot column value list

Use the pivot_clause of the SELECT statement to write crosstabulation queries that rotate rows into columns, aggregating data in the process of the rotation.

Pivoting: Transactional format <=> Crosstabular (relational) format

Consider the example below:

The view SALES_VIEW displays data in a transactional format

It lists amount and quantity sold by product, country, quarter and channel.

In this case, a product may be sold through various channels in a given country and a given quarter of year.

If you need to see the sales by channel_id on a relational or crosstabular format, you can use a pivoting operation:

SQL> CREATE VIEW sales_view AS
 SELECT
   prod_name product, country_name country, channel_id channel,
   SUBSTR(calendar_quarter_desc, 6,2) quarter,
   SUM(amount_sold) amount_sold, SUM(quantity_sold) quantity_sold
 FROM sales, times, customers, countries, products
 WHERE sales.time_id = times.time_id AND
    sales.prod_id = products.prod_id AND
   sales.cust_id = customers.cust_id AND
   customers.country_id = countries.country_id
 GROUP BY prod_name, country_name, channel_id,
 SUBSTR(calendar_quarter_desc, 6, 2);

Using the PIVOT clause you can reconstruct the data on a cross tabular form:

SQL> SELECT * FROM
 (SELECT product, channel, amount_sold
   FROM sales_view
   WHERE country = 'United States of America' 
 ) S PIVOT (SUM(amount_sold)
   FOR CHANNEL IN (3 AS DIRECT_SALES, 4 AS INTERNET_SALES,
                   5 AS CATALOG_SALES, 9 AS TELESALES))
ORDER BY product;

PRODUCT                   DIRECT_SALES           INTERNET_SALES         CATALOG_SALES          TELESALES           
------------------------- ---------------------- ---------------------- ---------------------- ------------------ 
...
Home Theatre Package...   2052150.1              310431.22                                                            
Internal 6X CD-ROM        135284.16              15839.73                                                             
Internal 8X CD-ROM        158157.23              24398.22                                                             
Keyboard Wrist Rest       110554                 22477.41                                      1354.87                
...
 71 rows selected

Oracle Data Intergrator (ODI)

Data Integration and ODI - Concepts

Integration projects can be:

Data Oriented, especially those involving large data volumes.
Event Oriented - Based on an Event Driven Architecture (EDA) or a
Service Oriented - Suitable to a Service Oriented Architecture (SOA), for asynchronous or synchronous integration.

Oracle Data Integrator:

combines all the elements of data integration: data movement, data synchronization, data quality, data management, and data services.
transforms large volumes of data efficiently,
processes events in real time through its advanced Changed Data Capture (CDC) capability, and
provides data services to the Oracle SOA Suite.
provides robust data integrity control features, assuring the consistency and correctness of data

ODI Studio - Interface and Concepts

Declarative Design

Declarative Design is a design method that focuses on “What” to do (the Declarative Rules) rather than “How” to do it (the Process).

Declarative Design in ODI uses the well known relational paradigm to declare in the form of an Interface the declarative rules for a data integration task, which includes designation of sources, targets, and transformations.

Declarative rules often apply to metadata to transform data and are usually described in natural language by business users.

In a typical data integration project (such as a Data Warehouse project), these rules are defined during the specification phase in documents written by business analysts in conjunction with project managers.

They can very often be implemented using SQL expressions, provided that the metadata they refer to is known and qualified in a metadata repository.

Types of declarative rules: Mappings, Joins, Filters and Constraints

Integration Project

Suggested steps for the life cycle of an integration project:
1. Create a new project and import Knowledge Modules.
2. Define the project organization and practises using folders, markers and documentation.
3. Create reusable components: interfaces, procedures, variables, sequences. Perform unitary tests.
4. Assemble these components into packages. Perform integration tests.
5. Release the work in scenarios

Package
Package is a workflow, made up of a sequence of steps organized into an execution diagram.
Each step can either succeed or fail its execution.
Packages assemble and reference other component from a project (interfaces, procedures, variables, etc)
Types of steps: Flow (Interface), Procedue, Variable, ODI tools (allow access to API commands, OS system calls), models, sub-models, and Datastores.

The Package above (Load Customers) specifies a workflow that loads a sequence of tables:

(1) Initialization step: Execute procedure Initialization (type Procedure)
(2) Execute interface Load Countries (type: Flow(interface)), linked object: TRG_COUNTRY
(3) Execute interface Load Regions (type: Flow(interface)), linked object: TRG_REGION
(4) Execute interface Load Cities (type: Flow(interface)), linked object: TRG_CITY
(5) refresh variable "CUST_FLAG". Load Customers? (type: Evaluate Variable), linked objects: CUST_FLAG
IF CUST_FLAG = 1 (true) THEN Load Customers, ELSE Final Step
(6) Final Step (type procedure)
(7) If any of the steps above fails,the Package runs the "Send Admin email" step that sends an email using an ODI tool.

Interface
It is set of declarative rules that describe the loading of a datastore or a temporary target structure from one or more source datastores.
An interface is a reusable dataflow.

Components of an Integration Interface:

Target Datastore - element that will be loaded by the interface.

May be permanent (defined in a model) or temporary (created by the interface)

Datasets - One target is loaded with data coming from various datasets

each dataset corresponds to one diagram of source datastores and the mappings used to load the target datastore from these source datastores.

Diagram of Source Datastores

Mapping - defines the transformations performed on one or various source coluns to load one target column

Staging area

Flow

Control

In the example above,

The interface defines a set of rules to load the target TRG_CUSTOMER Datastore with data coming from three source datastores (SRC_AGEGROUP, SRC_CUSTOMER, SRC_SALES_PERSON)

It defines a Join between SRC_CUSTOMER and SRC_SALES_PERSON

it defines a Lookup: CUST.AGE BETWEEN AGE_GROUP_AGE_MIN and AGE_GROUP.AGE_MAX

Define Mappings between the attributes in the sources datastores and attributes in the target datastore

Lookup
Join
Mapping

Interface dataflow

Automatically generated by ODI from the declarative rules in the interface

Based on the configuration files and context (target, sources, where is the staging area, etc)

ODI automatically computes the flow depending on the configuration in the interface's diagram.

It proposes default KMs for the data flow.

The Flow tab enables you to view the data flow and select the KMs used to load and integrate data.

In the flow, the following items appear:
Source Sets:

Source Datastores that are within the same dataset, located on the same physical data server and which are joined with Joins located on the Source are grouped in a single source set in the flow diagram.

A source set represents a group of datastores that can be extracted at the same time.

DataSets:

Datasets appear as yellow boxes in the Staging Area.

Staging Area:

Appears as a box that includes the different datasets, the target (if located on the same data server), and possibly some of the sources (if located on
the same data server)

Target:

Appears as a separate box if it is located in a different schema from the staging area

You use the following KMs in the flow:

Loading KM (LKM):

Define how data is moved.

One LKM is selected for each Source Set for moving data from the sources to the staging area.

It can be also selected to move data from the Staging Area - when different from the Target - to the Target, when a
single technology IKM is selected for the Staging Area.

Integration KM (IKM): They define how data is integrated into the target. One IKM is typically
selected on the Target. When the staging area is different from the target, the
selected IKM can be a multi-technology IKM that moves and integrates data from
the Staging Area into the Target.

Procedure
A Procedure is a reusable component that groups a sequence of operations that do not fit in the interface concept.
Examples of procedures:
■ wait and unzip a file
■ send a batch of files via FTP
■ receive emails
■ purge a database

Knowledge Modules

Knowledge Modules (KM) implement “how” the integration processes occur
KMs are components of ODI Open Connector technology.
Contain the knowledge required by ODI to perform a specific set of tasks against a specific technology or set of technologies.
When starting an integration project, you must import the Knowledge Module appropriate for your project.
It is possible to import additional KMs after setting up the project, and it is possible to change the KMs used afterwards
Start with Generic KMs. The SQL KMs work with almost all technologies
Start with simple KMs.
Select KMs that match your source/target combinations to increase performance
Select KMs according to your infrastructure limitations
Select JKMs and SKMs only if you need them
Encrypting a Knowledge Module (KM) allows you to protect valuable code
Oracle Data Integrator uses a DES Encryption algorithm based on a personal encryption key.
This key can be saved in a file and can be reused to perform encryption or decryption operations.

ODI uses six (6) types of Knowledge Modules:
■ RKM (Reverse Knowledge Modules) Perform a customized reverse-engineering of data models for a specific technology.
■ RKMs are used in data models.

RKM <reversed technology> [(reverse method)]
i.e. RKM Oracle, RKM Netezza

■ LKM (Loading Knowledge Modules) are used to extract data from source systems (files, middleware, database, etc.).
■ LKMs are used in interfaces.

LKM <source technology> to <target technology> [(loading method)]
i.e. LKM Oracle to Oracle (DBLink), LKM File to Oracle (SQLLDR), LKM SQL to SQL, etc

■ JKM (Journalizing Knowledge Modules) are used to create a journal of data modifications (insert, update and delete) of the source databases to keep track of
the changes.
■ JKMs are used in data models and used for Changed Data Capture.

JKM <journalized technology> <journalizing mode> (<journalizing method>)
JKM Oracle11g Consistent (Streams)
JFM Oracle Simple
JKM DB2 400 Simple (journal)

■ IKM (Integration Knowledge Modules) are used to integrate (load) data to the target tables.
■ IKMs are used in interfaces.

IKM [<staging technology>] <target technology> [<integration mode>] [(<integration method>)].
i.e. IKM Oracle Incremental Update (MERGE)
IKM SQL to File Append
IKM SQL Incremental Update
IKM SQL to SQL Append

■ CKM (Check Knowledge Modules) are used to check that constraints on the sources and targets are not violated.
■ CKMs are used in data model’s static check and interfaces flow checks.

CKM <staging technology>
i.e. CKM SQL, CKM Oracle

■ SKM (Service Knowledge Modules) are used to generate the code required for creating data services.
■ SKMs are used in data models.

California Central Valley: Dreaming images

(coffee break)

I have admired Roman Loranc's photographs for a long time.
Born in Poland, Loranc moved to California years ago and has been making incredible photographs.
Check his website and the interview with him below

Data Warehouse: Concepts on ETL Processes (Transforming)

Data Warehouse: Four (4) typical transformation and loading scenarios

(a) Key Lookup Scenario
(b) Business Rule Violation Scenario
(c) Data error scenario
(d) Pivoting Scenario

(a) key Lookup Scenario
You need to load sales transaction data into a retail Data Warehouse

In the operational system, each product is identified by Uniform Price Codes (UPCs);

In the Data Warehouse, product is identified by product_id.

You need to

(1) Map UPC --> PRODUCT_ID; and
(2)As you load the source data, lookup the map table and transform the data accordingly

(b) Business Rule Violation Scenario

Some UPC Codes are invalid

In which ways can you implement the transformation and handle the invalid UPCs?

i.e: new sales data might not have valid UPC codes

(1) use Create Table As Select (CTAS) to identify and log invalid UPCs
(2) use an outer join (+) and convert invalid UPCs to NULL
(3) use multi-table INSERT and separate the invalid UPCs to a different table

(1) use CTAS to identify and log invalid UPCs

(2) use an outer join (+) and convert invalid UPCs to NULL

(3)use multi-table INSERT and separate the invalid UPCs to a different table

(c) Data Error Scenarios

If the quality of the data is unknown, you may face unexpected errors, such as a a failed datatype conversion, or a constraint violation.

Such errors can happen independently of whether the UPC for the particular record is or not valid.

Here you can use a DML error logging table to capture unexpected errors.

Data Warehouse with Oracle Data Integrator (ODI) (II)

The need for automated integration solutions

Challenges to data integration:

Terabytes to Pentabytes of data

Fenomenal data explosion

DI/ETL market growing at 15%

Most enterprises looking to moving away from scripts to solutions

Demand to near real-time/real-time data integration

Explosion in the number of sources: from 10s to 1000s

Use of ETL tools: (Oct 2011 Forrester survey)

90% of enterprise still make use of batch data loading into DW or BI platform

50% performing real-time/near real-time integration

47% performing batch database migrations

New data integration requirements:

support for all types of data (structured, semi-structured, unstructured)

Support for Big Data (integrate with Hadoop)

Top three benefits found of using Oracle Data Integrator

improvement in project completion (resource management, ability to reuse codes, reduce manual hand-coding, reducing error and troubleshooting)

shorter reporting cycle

Deferred hardware upgrade

( source: here )

About Oracle Data Integrator

As with the Oracle Warehouse Builder (OWB), ODI uses the database (ISO-92 RDBMS) as the ETL engine
ODI supports ISO-92 RDBMS (Oracle, DB2, SQL Server, etc)
Supports batch, event-based and real-time integration
Extensible through Knowledge Modules
ODI Engine runs directly on the source and target
ODI strong on heterogeneity of application sources
ODI includes application adapters for various application sources

Benefits and improvements:

Simpliies creation of data flows through declarative methodology, reducing the number of steps
Unifies integration tooling across unstructured, semi-structured and structured data
Extends functionality via Knowledge Modules for integrating to applications out-of-the box, providing new customizations
integrates management using OEM to simplify operations across all middleware, applications, database

Using ODI: Software Architecture

ODI architecture is build around several components:

ODI Studio - Graphical client
Agents - Java modules that run in source, target, or WebLogic server
ODI Console - Web console runs on WebLogic server
Extension for the Fusion Middleware Control Console
Repositories - Master and Work repositories

Designer module	Defines declarative rules for data transformation and data integrity. All project development occurs here Database and applications metadata imported and defined here Use metadata and rules to generate data integration scenarios or load plans for production Core module for developers and metadata administrators
Operator module	Monitors data integration processes in production Display execution logs (error counts, number of processed rows, execution statistics, executed code, etc) Can be used for debugging on design time
Topology module	defines physical and logical architecture of the infrastructure Infrastructure or project administrators register servers, database schemas, database catalogs and agents in the master repository
Security module	Manages user profiles and privileges Assign access authorization to objects and features

The Agent coordinates the execution of the ODI scenarios. It

(a) retrieves the code stored in the ODI repository,
(b) connects to the various source and target systems and
(c) orchestrates the overall data integration process
(d) when execution is completed, the agent updates the execution logs in the repository and reports error messages and executions statistics. These can be reviewed in the Operator navigator at ODI Studio, or through the web interfaces.

Two types of Agents:

Standalone Agent - can be installed on the source or target systems. Requires a JVM.

Java EE Agent - Deployed on Oracle WebLogic Server.

The ODI Repository consists of a Master Repository and one or more Work repositories.

Repositories are sets of tables that can be stored in RDBMS (i.e. Oracle, SQL Server, DB2, etc).

All objects used by the various ODI modules are stored in the repositories

The Master Repository contains

Security information (user profiles and privileges)

topology information

source code for all version of all ODI objects

Several work repositories can coexist in the same installation (development, quality control, production)

Work Repositories store information for

Models (i.e. metadata) - datastores, columns, data integrity constraints, cross references, data lineage and impact analysis

Projects - interfaces, packages, procedures, folders, knowlege models, variables

Runtime information - scenarios, load plans, scheduling information and logs

Big Data, Deep Data, and who needs apples?

So, they took as input "Big Data" from a double-hammer and not much later they had "f = ma" as output.
Or how to generate new laws from context-less data?

mentioned in Bradley Voytek's "Automated science, deep data and the paradox of information"

and here is the original article ("Distilling Free-Form Natural Laws from Experimental Data")

I enjoyed this story...

Data Warehouse with Oracle Data Integrator (ODI)

Building a Data Warehouse: typical steps and milestones

Research and specify business needs (Key Indicators)

Identify data sources relevant to generate key indicators

Define business rules to transform source information into key indicators

Model the data structure of the target warehouse to store the key indicators

Populate the indicators by implementing business rules

Measure the overall accuracy of the data by setting up data quality rules

Develop reports on key indicators

Make key indicators and metadata available to business users through ad-hoc query tools or predefined reports

Measure business users’ satisfaction and add/modify key indicators

Using Oracle Data Integrator (ODI) in a Data Warehouse project: actors involved and some assigned tasks

Business User

Access the final calculated key indicators

Use reports and ad-hoc queries

May need to understand the definition of an indicator

May need to be aware of data quality issues

when was the last time the table was updated?

How many records were added, update, removed in the table?

What are the rules that calculate a particular indicator?

Where does the data come from, and how is it transformed?

Business Analyst

Define the key indicators

Identify the source applications

How many different source applications need to be considered?

Is the data needed for key indicators available in the selected pool of source applications?

What data quality issues are present in the source systems?

Specify business rules to transform source data into meaningful target indicators

Projects can use the ODI Designer to directly specify the business rules

For each target table, specify also:

Target datastore - name the target datastore

Description of transformation - describe its purpose

Integration strategy - How data should be written to target (replace table, append, update, etc). Each strategy specified will correspond to an ODI Integration Knowledge Module.

Refresh frequency

Dependencies - what datastores need to be loaded or jobs executed prior to this one

Source datastores - source databases, applications, etc used

Source ODI Model -

Datastore name, Fiel Mappings and transformations, Links or Join criteria, Filters, Data Quality requirements, constraint names and expressions, etc

Maintain translation data from operational semantics to the Data Warehouse semantic

Developer

Implement the business rules as specified by business analysts

Provide executable scenarios to the production team

Must understand infrastructure details and have business knowledge of source applications

Metadata Administrator

Reverse engineer source and target applications

Understand content and structure of source applications

Connect to source applications and capture their metadata

Define data quality business rules (specified by Business Analyst) in ODI repository

What level of Data Quality is required?

who are the business owners of source data?

What should be done with rejected records?

There should be an error recycling strategy?

How would business users modify erroneous source data?

Should a GUI be provided for source data correction?

Guarantee the overall consistency of Metadata in the various environments (dev, test, prod: repository)

Participate in the data modeling of key indicators

Add comments, descriptions and integrity rules (PK, FK, Check, etc)in the metadata

Provide version management

Database Administrator

Define technical database structure supporting the various layers in the data warehouse project (and ODI structure)

Create database profiles needed by ODI

Create schemas and databases for the staging areas

Describe columns inside the data dictionary of database - COMMENT ON TABLE/COLUMN

Avoid using PKs of source systems as PKs in target tables. Use counter or identity columns instead.

Design Referential Integrity and reverse engineerd FKs in ODI models.

Do not implement RIs on target database (for performance). Data quality control should guarantee data integrity.

Standardize obejct naming conventions

Distributed and maintain the descriptions of the environments (topology)

System Admin

Maintain technical resources for the Data Warehouse project

May install and monitor schedule agents

May backup and restore repositories

Install and monitor ODI console

Setup the various enviroments (dev, test, prod)

Security Admin

Define the security policy for the ODI Repository

Creates ODI users and grant rights on models, projects and contexts

Operator

Import released and tested scenarios into production environment

Schedule the execution of production scenarios

Monitor execution logs and restart failed sessions

Conceptual Architecture of an ODI solution

The ODI Repository is the central component.

ODI Repository stores configuration information about

IT infrastructure and topology

metadata of all applications

projects

Interfaces

Models

Packages

Scenarios

Solutions - versioning control

You can have various repositories within an IT infrastructure, linked ti separated environments that exchange metadata and scenarios (i.e. development, test, user acceptance test, production, hot fix)

Example: two environments, components, actors and tasks

(source: Oracle 2008)

Suggested environment configuration for a Data Warehouse project with ODI:

(1) A single master repository:

holds all the topology and security information.

All the work repositories are registered here.

Contains all the versions of objects that are committed by the designers.

(2) “Development” work repository:

Shared by all ODI designers

Holds all the projects and models under development.

(3) “Testing” work repository

shared by the IT testing team.

Contains all the projects and models being tested for future release.

(4) “User Acceptance Tests” work repository:

shared by the IT testing team and business analysts.

Contains all the projects and models about to be released.

Business analysts will use the ODI Console on top of this repository to validate the scenarios and transformations before releasing them to production.

(5) “Production” work repository

shared by the production team, the operators and the business analysts.

Contains all the projects and models in read-only mode for metadata lineage, as well as all the released scenarios.

(6) “Hot fix” work repository:

shared by the maintenance team and the development team.

Usually empty, but whenever a critical error happens in production, the maintenance team restores the corresponding projects and models in this repository and performs the corrections with the help of the development team.

Once the problems are solved, the scenarios are released directly to the production repository and the new models and projects are versioned in the master repository.

Typical environment (I):

(source: Oracle 2010)

Typical environment (II): Separate Master Repository for Production

(source: Oracle 2010)

Business User	Access the final calculated key indicators Use reports and ad-hoc queries May need to understand the definition of an indicator May need to be aware of data quality issues when was the last time the table was updated? How many records were added, update, removed in the table? What are the rules that calculate a particular indicator? Where does the data come from, and how is it transformed?
Business Analyst	Define the key indicators Identify the source applications How many different source applications need to be considered? Is the data needed for key indicators available in the selected pool of source applications? What data quality issues are present in the source systems? Specify business rules to transform source data into meaningful target indicators Projects can use the ODI Designer to directly specify the business rules For each target table, specify also: Target datastore - name the target datastore Description of transformation - describe its purpose Integration strategy - How data should be written to target (replace table, append, update, etc). Each strategy specified will correspond to an ODI Integration Knowledge Module. Refresh frequency Dependencies - what datastores need to be loaded or jobs executed prior to this one Source datastores - source databases, applications, etc used Source ODI Model - Datastore name, Fiel Mappings and transformations, Links or Join criteria, Filters, Data Quality requirements, constraint names and expressions, etc Maintain translation data from operational semantics to the Data Warehouse semantic
Developer	Implement the business rules as specified by business analysts Provide executable scenarios to the production team Must understand infrastructure details and have business knowledge of source applications
Metadata Administrator	Reverse engineer source and target applications Understand content and structure of source applications Connect to source applications and capture their metadata Define data quality business rules (specified by Business Analyst) in ODI repository What level of Data Quality is required? who are the business owners of source data? What should be done with rejected records? There should be an error recycling strategy? How would business users modify erroneous source data? Should a GUI be provided for source data correction? Guarantee the overall consistency of Metadata in the various environments (dev, test, prod: repository) Participate in the data modeling of key indicators Add comments, descriptions and integrity rules (PK, FK, Check, etc)in the metadata Provide version management
Database Administrator	Define technical database structure supporting the various layers in the data warehouse project (and ODI structure) Create database profiles needed by ODI Create schemas and databases for the staging areas Describe columns inside the data dictionary of database - COMMENT ON TABLE/COLUMN Avoid using PKs of source systems as PKs in target tables. Use counter or identity columns instead. Design Referential Integrity and reverse engineerd FKs in ODI models. Do not implement RIs on target database (for performance). Data quality control should guarantee data integrity. Standardize obejct naming conventions Distributed and maintain the descriptions of the environments (topology)
System Admin	Maintain technical resources for the Data Warehouse project May install and monitor schedule agents May backup and restore repositories Install and monitor ODI console Setup the various enviroments (dev, test, prod)
Security Admin	Define the security policy for the ODI Repository Creates ODI users and grant rights on models, projects and contexts
Operator	Import released and tested scenarios into production environment Schedule the execution of production scenarios Monitor execution logs and restart failed sessions

Pages