Showing posts with label Data Integration. Show all posts
Showing posts with label Data Integration. Show all posts

Oracle Data Intergrator (ODI)




Data Integration and ODI - Concepts

Integration projects can be:
  • Data Oriented, especially those involving large data volumes.
  • Event Oriented - Based on an Event Driven Architecture (EDA) or a
  • Service Oriented - Suitable to a Service Oriented Architecture (SOA), for asynchronous or synchronous integration.

Oracle Data Integrator:
  • combines all the elements of data integration: data movement, data synchronization, data quality, data management, and data services.
  • transforms large volumes of data efficiently,
  • processes events in real time through its advanced Changed Data Capture (CDC) capability, and
  • provides data services to the Oracle SOA Suite.
  • provides robust data integrity control features, assuring the consistency and correctness of data

ODI Studio - Interface and Concepts
Declarative Design
  • Declarative Design is a design method that focuses on “What” to do (the Declarative Rules) rather than “How” to do it (the Process).
  • Declarative Design in ODI uses the well known relational paradigm to declare in the form of an Interface the declarative rules for a data integration task, which includes designation of sources, targets, and transformations.
  • Declarative rules often apply to metadata to transform data and are usually described in natural language by business users.
  • In a typical data integration project (such as a Data Warehouse project), these rules are defined during the specification phase in documents written by business analysts in conjunction with project managers.
  • They can very often be implemented using SQL expressions, provided that the metadata they refer to is known and qualified in a metadata repository.
  • Types of declarative rules: Mappings, Joins, Filters and Constraints



Integration Project

Suggested steps for the life cycle of an integration project:
1. Create a new project and import Knowledge Modules.
2. Define the project organization and practises using folders, markers and documentation.
3. Create reusable components: interfaces, procedures, variables, sequences. Perform unitary tests.
4. Assemble these components into packages. Perform integration tests.
5. Release the work in scenarios







Package
Package is a workflow, made up of a sequence of steps organized into an execution diagram.
Each step can either succeed or fail its execution.
Packages assemble and reference other component from a project (interfaces, procedures, variables, etc)
Types of steps: Flow (Interface), Procedue, Variable, ODI tools (allow access to API commands, OS system calls), models, sub-models, and Datastores.



The Package above (Load Customers) specifies a workflow that loads a sequence of tables:
(1) Initialization step: Execute procedure Initialization (type Procedure)
(2) Execute interface Load Countries (type: Flow(interface)), linked object: TRG_COUNTRY
(3) Execute interface Load Regions (type: Flow(interface)), linked object: TRG_REGION
(4) Execute interface Load Cities (type: Flow(interface)), linked object: TRG_CITY
(5) refresh variable "CUST_FLAG". Load Customers? (type: Evaluate Variable), linked objects: CUST_FLAG
IF CUST_FLAG = 1 (true) THEN Load Customers, ELSE Final Step
(6) Final Step (type procedure)
(7) If any of the steps above fails,the Package runs the "Send Admin email" step that sends an email using an ODI tool.




Interface
It is set of declarative rules that describe the loading of a datastore or a temporary target structure from one or more source datastores.
An interface is a reusable dataflow.
Components of an Integration Interface:
  • Target Datastore - element that will be loaded by the interface.
    • May be permanent (defined in a model) or temporary (created by the interface)
  • Datasets - One target is loaded with data coming from various datasets
    • each dataset corresponds to one diagram of source datastores and the mappings used to load the target datastore from these source datastores.
  • Diagram of Source Datastores
  • Mapping - defines the transformations performed on one or various source coluns to load one target column
  • Staging area
  • Flow
  • Control

In the example above,
  • The interface defines a set of rules to load the target TRG_CUSTOMER Datastore with data coming from three source datastores (SRC_AGEGROUP, SRC_CUSTOMER, SRC_SALES_PERSON)
  • It defines a Join between SRC_CUSTOMER and SRC_SALES_PERSON
  • it defines a Lookup: CUST.AGE BETWEEN AGE_GROUP_AGE_MIN and AGE_GROUP.AGE_MAX
  • Define Mappings between the attributes in the sources datastores and attributes in the target datastore
Lookup
Join
Mapping


Interface dataflow
  • Automatically generated by ODI from the declarative rules in the interface
  • Based on the configuration files and context (target, sources, where is the staging area, etc)
  • ODI automatically computes the flow depending on the configuration in the interface's diagram.
  • It proposes default KMs for the data flow.
  • The Flow tab enables you to view the data flow and select the KMs used to load and integrate data.

In the flow, the following items appear:
Source Sets:
  • Source Datastores that are within the same dataset, located on the same physical data server and which are joined with Joins located on the Source are grouped in a single source set in the flow diagram.
  • A source set represents a group of datastores that can be extracted at the same time.
DataSets:
  • Datasets appear as yellow boxes in the Staging Area.
Staging Area:
  • Appears as a box that includes the different datasets, the target (if located on the same data server), and possibly some of the sources (if located on
    the same data server)
Target:
  • Appears as a separate box if it is located in a different schema from the staging area



You use the following KMs in the flow:
  • Loading KM (LKM):
    • Define how data is moved.
    • One LKM is selected for each Source Set for moving data from the sources to the staging area.
    • It can be also selected to move data from the Staging Area - when different from the Target - to the Target, when a
      single technology IKM is selected for the Staging Area.
  • Integration KM (IKM): They define how data is integrated into the target. One IKM is typically
    selected on the Target. When the staging area is different from the target, the
    selected IKM can be a multi-technology IKM that moves and integrates data from
    the Staging Area into the Target.


Procedure
A Procedure is a reusable component that groups a sequence of operations that do not fit in the interface concept.
Examples of procedures:
■ wait and unzip a file
■ send a batch of files via FTP
■ receive emails
■ purge a database






Knowledge Modules
  • Knowledge Modules (KM) implement “how” the integration processes occur
  • KMs are components of ODI Open Connector technology.
  • Contain the knowledge required by ODI to perform a specific set of tasks against a specific technology or set of technologies.
  • When starting an integration project, you must import the Knowledge Module appropriate for your project.
  • It is possible to import additional KMs after setting up the project, and it is possible to change the KMs used afterwards
  • Start with Generic KMs. The SQL KMs work with almost all technologies
  • Start with simple KMs.
  • Select KMs that match your source/target combinations to increase performance
  • Select KMs according to your infrastructure limitations
  • Select JKMs and SKMs only if you need them
  • Encrypting a Knowledge Module (KM) allows you to protect valuable code
  • Oracle Data Integrator uses a DES Encryption algorithm based on a personal encryption key.
  • This key can be saved in a file and can be reused to perform encryption or decryption operations.

ODI uses six (6) types of Knowledge Modules:
RKM (Reverse Knowledge Modules) Perform a customized reverse-engineering of data models for a specific technology.
■ RKMs are used in data models.
RKM <reversed technology> [(reverse method)]
i.e. RKM Oracle, RKM Netezza

LKM (Loading Knowledge Modules) are used to extract data from source systems (files, middleware, database, etc.).
■ LKMs are used in interfaces.
LKM <source technology> to <target technology> [(loading method)]
i.e. LKM Oracle to Oracle (DBLink), LKM File to Oracle (SQLLDR), LKM SQL to SQL, etc

JKM (Journalizing Knowledge Modules) are used to create a journal of data modifications (insert, update and delete) of the source databases to keep track of
the changes.
■ JKMs are used in data models and used for Changed Data Capture.
JKM <journalized technology> <journalizing mode> (<journalizing method>)
JKM Oracle11g Consistent (Streams)
JFM Oracle Simple
JKM DB2 400 Simple (journal)


IKM (Integration Knowledge Modules) are used to integrate (load) data to the target tables.
■ IKMs are used in interfaces.
IKM [<staging technology>] <target technology> [<integration mode>] [(<integration method>)].
i.e. IKM Oracle Incremental Update (MERGE)
IKM SQL to File Append
IKM SQL Incremental Update
IKM SQL to SQL Append

CKM (Check Knowledge Modules) are used to check that constraints on the sources and targets are not violated.
■ CKMs are used in data model’s static check and interfaces flow checks.
CKM <staging technology>
i.e. CKM SQL, CKM Oracle

SKM (Service Knowledge Modules) are used to generate the code required for creating data services.
■ SKMs are used in data models.

Data Warehouse with Oracle Data Integrator (ODI) (II)





The need for automated integration solutions

Challenges to data integration:
  • Terabytes to Pentabytes of data
  • Fenomenal data explosion
  • DI/ETL market growing at 15%
  • Most enterprises looking to moving away from scripts to solutions
  • Demand to near real-time/real-time data integration
  • Explosion in the number of sources: from 10s to 1000s

Use of ETL tools: (Oct 2011 Forrester survey)
  • 90% of enterprise still make use of batch data loading into DW or BI platform
  • 50% performing real-time/near real-time integration
  • 47% performing batch database migrations

New data integration requirements:
  • support for all types of data (structured, semi-structured, unstructured)
  • Support for Big Data (integrate with Hadoop)

Top three benefits found of using Oracle Data Integrator
  • improvement in project completion (resource management, ability to reuse codes, reduce manual hand-coding, reducing error and troubleshooting)
  • shorter reporting cycle
  • Deferred hardware upgrade
( source: here )



About Oracle Data Integrator

  • As with the Oracle Warehouse Builder (OWB), ODI uses the database (ISO-92 RDBMS) as the ETL engine
  • ODI supports ISO-92 RDBMS (Oracle, DB2, SQL Server, etc)
  • Supports batch, event-based and real-time integration
  • Extensible through Knowledge Modules
  • ODI Engine runs directly on the source and target
  • ODI strong on heterogeneity of application sources
  • ODI includes application adapters for various application sources

Benefits and improvements:
  • Simpliies creation of data flows through declarative methodology, reducing the number of steps
  • Unifies integration tooling across unstructured, semi-structured and structured data
  • Extends functionality via Knowledge Modules for integrating to applications out-of-the box, providing new customizations
  • integrates management using OEM to simplify operations across all middleware, applications, database



Using ODI: Software Architecture





ODI architecture is build around several components:
  • ODI Studio - Graphical client
  • Agents - Java modules that run in source, target, or WebLogic server
  • ODI Console - Web console runs on WebLogic server
  • Extension for the Fusion Middleware Control Console
  • Repositories - Master and Work repositories










Designer
module

  • Defines declarative rules for data transformation and data integrity.
  • All project development occurs here
  • Database and applications metadata imported and defined here
  • Use metadata and rules to generate data integration scenarios or load plans for production
  • Core module for developers and metadata administrators

Operator
module

  • Monitors data integration processes in production
  • Display execution logs (error counts, number of processed rows, execution statistics, executed code, etc)
  • Can be used for debugging on design time

Topology
module

  • defines physical and logical architecture of the infrastructure
  • Infrastructure or project administrators register servers, database schemas, database catalogs and agents in the master repository
Security
module
  • Manages user profiles and privileges
  • Assign access authorization to objects and features





  • The Agent coordinates the execution of the ODI scenarios. It
(a) retrieves the code stored in the ODI repository,
(b) connects to the various source and target systems and
(c) orchestrates the overall data integration process
(d) when execution is completed, the agent updates the execution logs in the repository and reports error messages and executions statistics. These can be reviewed in the Operator navigator at ODI Studio, or through the web interfaces.

Two types of Agents:
  • Standalone Agent - can be installed on the source or target systems. Requires a JVM.
  • Java EE Agent - Deployed on Oracle WebLogic Server.











  • The ODI Repository consists of a Master Repository and one or more Work repositories.
  • Repositories are sets of tables that can be stored in RDBMS (i.e. Oracle, SQL Server, DB2, etc).
  • All objects used by the various ODI modules are stored in the repositories

  • The Master Repository contains
    • Security information (user profiles and privileges)
    • topology information
    • source code for all version of all ODI objects

  • Several work repositories can coexist in the same installation (development, quality control, production)
  • Work Repositories store information for
    • Models (i.e. metadata) - datastores, columns, data integrity constraints, cross references, data lineage and impact analysis
    • Projects - interfaces, packages, procedures, folders, knowlege models, variables
    • Runtime information - scenarios, load plans, scheduling information and logs