Data Warehouse with Oracle Data Integrator (ODI) (II)





The need for automated integration solutions

Challenges to data integration:
  • Terabytes to petabytes of data
  • Phenomenal data explosion
  • DI/ETL market growing at 15%
  • Most enterprises looking to move away from scripts to solutions
  • Demand for near real-time/real-time data integration
  • Explosion in the number of sources: from 10s to 1000s

Use of ETL tools: (Oct 2011 Forrester survey)
  • 90% of enterprises still make use of batch data loading into a DW or BI platform
  • 50% performing real-time/near real-time integration
  • 47% performing batch database migrations

New data integration requirements:
  • support for all types of data (structured, semi-structured, unstructured)
  • Support for Big Data (integrate with Hadoop)

Top three benefits found from using Oracle Data Integrator:
  • Improvement in project completion (resource management, ability to reuse code, reduced manual hand-coding, fewer errors and less troubleshooting)
  • Shorter reporting cycle
  • Deferred hardware upgrades



About Oracle Data Integrator

  • As with Oracle Warehouse Builder (OWB), ODI uses the database as the ETL engine
  • ODI supports any ISO-92 compliant RDBMS (Oracle, DB2, SQL Server, etc.)
  • Supports batch, event-based and real-time integration
  • Extensible through Knowledge Modules
  • ODI Engine runs directly on the source and target
  • ODI strong on heterogeneity of application sources
  • ODI includes application adapters for various application sources
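The bullets above describe ODI's E-LT approach: instead of transforming rows in a middle-tier engine, the tool generates set-based SQL and lets the target RDBMS do the work. A minimal sketch of that idea, using SQLite as a stand-in for a real target database (all table and column names are invented for illustration):

```python
import sqlite3

# Stand-in "target database"; in ODI this would be the warehouse RDBMS.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src_orders (id INTEGER, amount REAL, status TEXT)")
cur.executemany("INSERT INTO src_orders VALUES (?, ?, ?)",
                [(1, 10.0, 'OK'), (2, 5.5, 'OK'), (3, 99.0, 'CANCELLED')])
cur.execute("CREATE TABLE tgt_order_totals (status TEXT, total REAL)")

# The "transformation" is a single set-based statement executed by the
# database itself, not row-by-row logic in a separate ETL server.
cur.execute("""
    INSERT INTO tgt_order_totals (status, total)
    SELECT status, SUM(amount) FROM src_orders GROUP BY status
""")
conn.commit()
print(cur.execute(
    "SELECT status, total FROM tgt_order_totals ORDER BY status").fetchall())
# [('CANCELLED', 99.0), ('OK', 15.5)]
```

The key point is that the transformation is one INSERT ... SELECT running inside the database, which is why ODI scales with the target RDBMS rather than with a dedicated transformation server.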

Benefits and improvements:
  • Simplifies the creation of data flows through a declarative methodology, reducing the number of steps
  • Unifies integration tooling across unstructured, semi-structured and structured data
  • Extends functionality via Knowledge Modules, integrating with applications out of the box and allowing new customizations
  • Integrates management using Oracle Enterprise Manager (OEM) to simplify operations across all middleware, applications and databases



Using ODI: Software Architecture





ODI architecture is built around several components:
  • ODI Studio - Graphical client
  • Agents - Java modules that run in source, target, or WebLogic server
  • ODI Console - Web console runs on WebLogic server
  • Extension for the Fusion Middleware Control Console
  • Repositories - Master and Work repositories










Designer
module

  • Defines declarative rules for data transformation and data integrity.
  • All project development occurs here
  • Database and application metadata are imported and defined here
  • Uses metadata and rules to generate data integration scenarios and load plans for production
  • Core module for developers and metadata administrators

Operator
module

  • Monitors data integration processes in production
  • Displays execution logs (error counts, number of processed rows, execution statistics, executed code, etc)
  • Can also be used for debugging at design time

Topology
module

  • Defines the physical and logical architecture of the infrastructure
  • Infrastructure or project administrators register servers, database schemas, database catalogs and agents in the master repository
Security
module
  • Manages user profiles and privileges
  • Assigns access authorizations to objects and features





  • The Agent coordinates the execution of ODI scenarios. It
(a) retrieves the code stored in the ODI repository,
(b) connects to the various source and target systems, and
(c) orchestrates the overall data integration process.
  • When execution is complete, the agent updates the execution logs in the repository and reports error messages and execution statistics. These can be reviewed in the Operator navigator of ODI Studio, or through the web interfaces.
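The agent's run loop described above can be sketched in a few lines. This is purely illustrative: the class and function names are invented and do not correspond to the real ODI API.

```python
# Hedged sketch of the Agent cycle: retrieve code, orchestrate steps, log.
class Agent:
    def __init__(self, repository):
        self.repository = repository          # dict: scenario name -> steps

    def run_scenario(self, name):
        steps = self.repository[name]         # (a) retrieve the stored code
        log = {"scenario": name, "rows": 0, "errors": []}
        for step in steps:                    # (b)+(c) connect and orchestrate
            try:
                log["rows"] += step()         # each step returns a row count
            except Exception as exc:
                log["errors"].append(str(exc))
        # (d) report statistics back to the repository for the Operator view
        self.repository.setdefault("_logs", []).append(log)
        return log

# Two dummy steps standing in for generated load/transform code:
repo = {"LOAD_SALES": [lambda: 120, lambda: 80]}
agent = Agent(repo)
print(agent.run_scenario("LOAD_SALES"))
# {'scenario': 'LOAD_SALES', 'rows': 200, 'errors': []}
```

The execution log ends up in the repository, which mirrors how the Operator module reads statistics and errors after a run.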

Two types of Agents:
  • Standalone Agent - can be installed on the source or target systems. Requires a JVM.
  • Java EE Agent - Deployed on Oracle WebLogic Server.











  • The ODI Repository consists of a Master Repository and one or more Work repositories.
  • Repositories are sets of tables that can be stored in an RDBMS (e.g. Oracle, SQL Server, DB2).
  • All objects used by the various ODI modules are stored in the repositories

  • The Master Repository contains
    • Security information (user profiles and privileges)
    • topology information
    • source code for all versions of all ODI objects

  • Several work repositories can coexist in the same installation (development, quality control, production)
  • Work Repositories store information for
    • Models (i.e. metadata) - datastores, columns, data integrity constraints, cross references, data lineage and impact analysis
    • Projects - interfaces, packages, procedures, folders, knowledge modules, variables
    • Runtime information - scenarios, load plans, scheduling information and logs
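The split above can be pictured as one master repository referencing several work repositories. A toy sketch with plain data structures (this is not the real ODI schema, just an illustration of what lives where):

```python
# Master repository: security, topology, versioned objects,
# plus a registry of work repositories.
master = {
    "security": {"users": ["dev1", "ops1"]},          # illustrative entries
    "topology": {"servers": ["src_db", "tgt_dw"]},
    "versions": {},               # object name -> list of committed versions
    "work_repositories": {},
}

# Several work repositories coexist, one per environment:
for env in ("development", "quality_control", "production"):
    master["work_repositories"][env] = {
        "models": [],      # metadata: datastores, columns, constraints
        "projects": [],    # interfaces, packages, procedures, ...
        "runtime": [],     # scenarios, load plans, logs
    }

print(sorted(master["work_repositories"]))
# ['development', 'production', 'quality_control']
```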




Big Data, Deep Data, and who needs apples?

So, they took as input "Big Data" from a double-hammer, and not much later they had "f = ma" as output.
Or: how to generate new laws from context-less data?
I enjoyed this story...




Data Warehouse with Oracle Data Integrator (ODI)





Building a Data Warehouse: typical steps and milestones

  • Research and specify business needs (Key Indicators)
  • Identify data sources relevant to generate key indicators
  • Define business rules to transform source information into key indicators
  • Model the data structure of the target warehouse to store the key indicators
  • Populate the indicators by implementing business rules
  • Measure the overall accuracy of the data by setting up data quality rules
  • Develop reports on key indicators
  • Make key indicators and metadata available to business users through ad-hoc query tools or predefined reports
  • Measure business users’ satisfaction and add/modify key indicators
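Steps three to six above (business rules, target model, population, quality rules) can be illustrated with a toy indicator computation. All column names and the rule itself are invented for the example:

```python
from collections import defaultdict

# Source rows as extracted from an operational system (illustrative):
source_rows = [
    {"month": "2012-01", "qty": 3, "unit_price": 10.0},
    {"month": "2012-01", "qty": 1, "unit_price": 25.0},
    {"month": "2012-02", "qty": 2, "unit_price": 10.0},
]

# Business rule: the key indicator "revenue" is sum(qty * unit_price),
# grouped by month (the grain of the target datastore).
revenue_by_month = defaultdict(float)
for row in source_rows:
    revenue_by_month[row["month"]] += row["qty"] * row["unit_price"]

# Data quality rule: no indicator value may be negative.
assert all(v >= 0 for v in revenue_by_month.values())

print(dict(revenue_by_month))
# {'2012-01': 55.0, '2012-02': 20.0}
```

In a real project the business rule would be declared in ODI and executed as SQL in the target database; the point here is only the shape of the work: rule, target grain, population, quality check.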


Using Oracle Data Integrator (ODI) in a Data Warehouse project: actors involved and some assigned tasks



Business User
  • Access the final calculated key indicators
  • Use reports and ad-hoc queries
  • May need to understand the definition of an indicator
  • May need to be aware of data quality issues
    • when was the last time the table was updated?
    • How many records were added, updated, or removed in the table?
    • What are the rules that calculate a particular indicator?
    • Where does the data come from, and how is it transformed?

Business Analyst
  • Define the key indicators
  • Identify the source applications
    • How many different source applications need to be considered?
    • Is the data needed for key indicators available in the selected pool of source applications?
    • What data quality issues are present in the source systems?
  • Specify business rules to transform source data into meaningful target indicators
    • Project teams can use the ODI Designer to specify the business rules directly
    • For each target table, specify also:
      • Target datastore - name the target datastore
      • Description of transformation - describe its purpose
      • Integration strategy - How data should be written to target (replace table, append, update, etc). Each strategy specified will correspond to an ODI Integration Knowledge Module.
      • Refresh frequency
      • Dependencies - what datastores need to be loaded or jobs executed prior to this one
      • Source datastores - source databases, applications, etc used
      • Source ODI Model -
      • Datastore name, Field Mappings and transformations, Links or Join criteria, Filters, Data Quality requirements, constraint names and expressions, etc
  • Maintain translation data from operational semantics to the Data Warehouse semantics
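The per-target-table specification listed above can be captured as a simple checklist. The sketch below mirrors the bullet list as a plain data structure; all table names and expressions are invented, and the real ODI Designer stores this as repository metadata, not Python:

```python
# One target-table specification, following the fields named in the text:
spec = {
    "target_datastore": "W_SALES_F",               # illustrative table name
    "description": "Daily sales facts per product and store",
    "integration_strategy": "incremental update",  # maps to an ODI IKM
    "refresh_frequency": "daily",
    "dependencies": ["W_PRODUCT_D", "W_STORE_D"],  # load these first
    "sources": {
        "datastores": ["ORDERS", "ORDER_LINES"],
        "join": "ORDERS.ORDER_ID = ORDER_LINES.ORDER_ID",
        "filter": "ORDERS.STATUS = 'SHIPPED'",
    },
    "mappings": {"SALES_AMOUNT": "ORDER_LINES.QTY * ORDER_LINES.UNIT_PRICE"},
}

# A completeness check a team might run over such specs before development:
required = {"target_datastore", "integration_strategy", "sources", "mappings"}
missing = required - spec.keys()
assert not missing, f"incomplete spec: {missing}"
print("spec complete")
```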

Developer
  • Implement the business rules as specified by business analysts
  • Provide executable scenarios to the production team
  • Must understand infrastructure details and have business knowledge of source applications

Metadata Administrator
  • Reverse engineer source and target applications
    • Understand content and structure of source applications
    • Connect to source applications and capture their metadata
    • Define data quality business rules (specified by Business Analyst) in ODI repository
      • What level of Data Quality is required?
      • who are the business owners of source data?
      • What should be done with rejected records?
      • Should there be an error recycling strategy?
      • How would business users modify erroneous source data?
      • Should a GUI be provided for source data correction?
  • Guarantee the overall consistency of Metadata in the various environments (dev, test, prod: repository)
  • Participate in the data modeling of key indicators
  • Add comments, descriptions and integrity rules (PK, FK, Check, etc) in the metadata
  • Provide version management

Database Administrator
  • Define technical database structure supporting the various layers in the data warehouse project (and ODI structure)
  • Create database profiles needed by ODI
  • Create schemas and databases for the staging areas
    • Describe columns inside the database's data dictionary - COMMENT ON TABLE/COLUMN
    • Avoid using PKs of source systems as PKs in target tables. Use counter or identity columns instead.
    • Design referential integrity and reverse-engineer FKs into ODI models.
    • Do not implement referential integrity on the target database (for performance reasons). Data quality control should guarantee data integrity.
    • Standardize object naming conventions
  • Distribute and maintain the descriptions of the environments (topology)
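The surrogate-key advice above (counter or identity columns instead of source PKs) looks like this in practice. SQLite stands in for the warehouse database, and all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The target dimension gets its own counter key; the source system's PK
# is kept only as a plain attribute, never reused as the warehouse key.
cur.execute("""
    CREATE TABLE w_customer_d (
        customer_wid INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
        src_customer_id TEXT,                            -- source PK, attribute only
        name TEXT
    )
""")
cur.executemany(
    "INSERT INTO w_customer_d (src_customer_id, name) VALUES (?, ?)",
    [("CRM-0042", "Acme"), ("ERP-7", "Globex")],  # PKs from two source systems
)
conn.commit()

print(cur.execute(
    "SELECT customer_wid, src_customer_id FROM w_customer_d").fetchall())
# [(1, 'CRM-0042'), (2, 'ERP-7')]
```

Because the warehouse assigns its own keys, rows coming from incompatible source key schemes (here "CRM-0042" and "ERP-7") can coexist without collisions.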

System Admin
  • Maintain technical resources for the Data Warehouse project
  • May install and monitor scheduler agents
  • May backup and restore repositories
  • Install and monitor the ODI Console
  • Set up the various environments (dev, test, prod)

Security Admin
  • Define the security policy for the ODI Repository
  • Create ODI users and grant rights on models, projects and contexts

Operator
  • Import released and tested scenarios into production environment
  • Schedule the execution of production scenarios
  • Monitor execution logs and restart failed sessions




Conceptual Architecture of an ODI solution

  • The ODI Repository is the central component.
  • ODI Repository stores configuration information about
    • IT infrastructure and topology
    • metadata of all applications
    • projects
    • Interfaces
    • Models
    • Packages
    • Scenarios
    • Solutions - versioning control
  • You can have various repositories within an IT infrastructure, linked to separate environments that exchange metadata and scenarios (e.g. development, test, user acceptance test, production, hot fix)

Example: two environments, components, actors and tasks


(source: Oracle 2008)



Suggested environment configuration for a Data Warehouse project with ODI:


(1) A single master repository:
  • holds all the topology and security information.
  • All the work repositories are registered here.
  • Contains all the versions of objects that are committed by the designers.

(2) “Development” work repository:
  • Shared by all ODI designers
  • Holds all the projects and models under development.

(3) “Testing” work repository
  • shared by the IT testing team.
  • Contains all the projects and models being tested for future release.

(4) “User Acceptance Tests” work repository:
  • shared by the IT testing team and business analysts.
  • Contains all the projects and models about to be released.
  • Business analysts will use the ODI Console on top of this repository to validate the scenarios and transformations before releasing them to production.

(5) “Production” work repository
  • shared by the production team, the operators and the business analysts.
  • Contains all the projects and models in read-only mode for metadata lineage, as well as all the released scenarios.

(6) “Hot fix” work repository:
  • shared by the maintenance team and the development team.
  • Usually empty, but whenever a critical error happens in production, the maintenance team restores the corresponding projects and models in this repository and performs the corrections with the help of the development team.
  • Once the problems are solved, the scenarios are released directly to the production repository and the new models and projects are versioned in the master repository.



Typical environment (I):


(source: Oracle 2010)



Typical environment (II): Separate Master Repository for Production


(source: Oracle 2010)