
Building examples of using SQL functions supported by Oracle
Check the post here (in construction)
Consider the example below:
- Pivoting is a key technique in DWH: it allows you to transform multiple rows of input into fewer and generally wider rows
- It allows you to change the presentation of the data, rotating rows into columns and presenting data in a cross-tabular format, or vice-versa (an UNPIVOT sketch follows the example below)
- When pivoting, an aggregation operator is applied for each item in the pivot column value list
- Use the pivot_clause of the SELECT statement to write cross-tabulation queries that rotate rows into columns, aggregating data in the process of the rotation.
- Pivoting: Transactional format <=> Crosstabular (relational) format
- The view SALES_VIEW displays data in a transactional format
- It lists amount and quantity sold by product, country, quarter and channel.
- In this case, a product may be sold through various channels in a given country and a given quarter of the year.
- If you need to see the sales by channel_id in a relational or cross-tabular format, you can use a pivoting operation:
SQL> CREATE VIEW sales_view AS
     SELECT prod_name product, country_name country, channel_id channel,
            SUBSTR(calendar_quarter_desc, 6, 2) quarter,
            SUM(amount_sold) amount_sold, SUM(quantity_sold) quantity_sold
     FROM sales, times, customers, countries, products
     WHERE sales.time_id = times.time_id
       AND sales.prod_id = products.prod_id
       AND sales.cust_id = customers.cust_id
       AND customers.country_id = countries.country_id
     GROUP BY prod_name, country_name, channel_id,
              SUBSTR(calendar_quarter_desc, 6, 2);
- Using the PIVOT clause you can reconstruct the data in cross-tabular form:
SQL> SELECT *
     FROM (SELECT product, channel, amount_sold
           FROM sales_view
           WHERE country = 'United States of America') S
     PIVOT (SUM(amount_sold)
            FOR channel IN (3 AS DIRECT_SALES, 4 AS INTERNET_SALES,
                            5 AS CATALOG_SALES, 9 AS TELESALES))
     ORDER BY product;

PRODUCT                   DIRECT_SALES  INTERNET_SALES  CATALOG_SALES  TELESALES
------------------------- ------------  --------------  -------------  ---------
...
Home Theatre Package...      2052150.1       310431.22
Internal 6X CD-ROM           135284.16        15839.73
Internal 8X CD-ROM           158157.23        24398.22
Keyboard Wrist Rest             110554        22477.41        1354.87
...
71 rows selected
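Going the other way (cross-tabular back to transactional) uses the UNPIVOT clause. A minimal sketch, assuming the pivoted result above has been saved into a hypothetical table PIVOTED_SALES with the same column names:

SQL> SELECT product, channel, amount_sold
     FROM pivoted_sales
     UNPIVOT (amount_sold                             -- value column
              FOR channel IN (direct_sales, internet_sales,
                              catalog_sales, telesales))  -- columns rotated back into rows
     ORDER BY product, channel;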
Declarative Design
- Declarative Design is a design method that focuses on “What” to do (the Declarative Rules) rather than “How” to do it (the Process).
- Declarative Design in ODI uses the well known relational paradigm to declare in the form of an Interface the declarative rules for a data integration task, which includes designation of sources, targets, and transformations.
- Declarative rules often apply to metadata to transform data and are usually described in natural language by business users.
- In a typical data integration project (such as a Data Warehouse project), these rules are defined during the specification phase in documents written by business analysts in conjunction with project managers.
- They can very often be implemented using SQL expressions, provided that the metadata they refer to is known and qualified in a metadata repository.
- Types of declarative rules: Mappings, Joins, Filters and Constraints
The Package above (Load Customers) specifies a workflow that loads a sequence of tables:
(1) Initialization step: Execute procedure Initialization (type: Procedure)
(2) Execute interface Load Countries (type: Flow(interface)), linked object: TRG_COUNTRY
(3) Execute interface Load Regions (type: Flow(interface)), linked object: TRG_REGION
(4) Execute interface Load Cities (type: Flow(interface)), linked object: TRG_CITY
(5) Refresh variable "CUST_FLAG" - Load Customers? (type: Evaluate Variable), linked object: CUST_FLAG
IF CUST_FLAG = 1 (true) THEN Load Customers, ELSE Final Step
(6) Final Step (type: Procedure)
(7) If any of the steps above fails, the Package runs the "Send Admin email" step that sends an email using an ODI tool.
Interface
It is a set of declarative rules that describe the loading of a datastore or a temporary target structure from one or more source datastores.
An interface is a reusable dataflow.
Components of an Integration Interface:
- Target Datastore - element that will be loaded by the interface.
- May be permanent (defined in a model) or temporary (created by the interface)
- Datasets - One target is loaded with data coming from various datasets
- each dataset corresponds to one diagram of source datastores and the mappings used to load the target datastore from these source datastores.
- Diagram of Source Datastores
- Mapping - defines the transformations performed on one or various source columns to load one target column
- Staging area
- Flow
- Control
In the example above,
- The interface defines a set of rules to load the target datastore TRG_CUSTOMER with data coming from three source datastores (SRC_AGEGROUP, SRC_CUSTOMER, SRC_SALES_PERSON)
- It defines a Join between SRC_CUSTOMER and SRC_SALES_PERSON
- It defines a Lookup: CUST.AGE BETWEEN AGE_GROUP.AGE_MIN AND AGE_GROUP.AGE_MAX
- It defines Mappings between the attributes in the source datastores and the attributes in the target datastore (a rough SQL sketch of these rules follows below)
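As a rough illustration only, the SQL equivalent of these declarative rules might look like the sketch below; the column names (CUSTID, FIRST_NAME, AGE_RANGE, ...) are assumptions for the example, not the code ODI actually generates:

SQL> INSERT INTO trg_customer (cust_id, cust_name, sales_pers, age_range)
     SELECT cust.custid,
            INITCAP(cust.first_name || ' ' || cust.last_name),           -- mapping
            sp.first_name || ' ' || sp.last_name,                        -- mapping
            age_group.age_range                                          -- column returned by the lookup
     FROM   src_customer cust
     JOIN   src_sales_person sp
            ON cust.sales_pers_id = sp.sales_pers_id                     -- join
     LEFT JOIN src_agegroup age_group
            ON cust.age BETWEEN age_group.age_min AND age_group.age_max; -- lookup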
Interface dataflow
- Automatically generated by ODI from the declarative rules in the interface
- Based on the configuration files and context (target, sources, where the staging area is located, etc.)
- ODI automatically computes the flow depending on the configuration in the interface's diagram.
- It proposes default KMs for the data flow.
- The Flow tab enables you to view the data flow and select the KMs used to load and integrate data.
In the flow, the following items appear:
Source Sets:
- Source Datastores that are within the same dataset, located on the same physical data server and which are joined with Joins located on the Source are grouped in a single source set in the flow diagram.
- A source set represents a group of datastores that can be extracted at the same time.
DataSets:
- Datasets appear as yellow boxes in the Staging Area.
Staging Area:
- Appears as a box that includes the different datasets, the target (if located on the same data server), and possibly some of the sources (if located on the same data server).
Target:
- Appears as a separate box if it is located in a different schema from the staging area.
You use the following KMs in the flow:
- Loading KM (LKM):
- Defines how data is moved.
- One LKM is selected for each Source Set for moving data from the sources to the staging area.
- It can also be selected to move data from the Staging Area - when different from the Target - to the Target, when a single-technology IKM is selected for the Staging Area.
- Integration KM (IKM):
- Defines how data is integrated into the target. One IKM is typically selected on the Target.
- When the staging area is different from the target, the selected IKM can be a multi-technology IKM that moves and integrates data from the Staging Area into the Target.
Procedure
A Procedure is a reusable component that groups a sequence of operations that do not fit in the interface concept.
Examples of procedures:
- wait and unzip a file
- send a batch of files via FTP
- receive emails
- purge a database
(a) Key Lookup Scenario
(b) Business Rule Violation Scenario
(c) Data error scenario
(d) Pivoting Scenario
(a) Key Lookup Scenario
- You need to load sales transaction data into a retail Data Warehouse
- In the operational system, each product is identified by a Uniform Price Code (UPC);
- In the Data Warehouse, products are identified by product_id.
- You need to:
(1) Map UPC --> PRODUCT_ID; and
(2) As you load the source data, look up the mapping table and transform the data accordingly (see the sketch below)
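A minimal sketch of that lookup during the load; the table and column names (SALES_SRC, UPC_TO_PRODUCT, SALES_FACT) are illustrative assumptions:

SQL> INSERT INTO sales_fact (product_id, sale_date, amount)
     SELECT m.product_id,          -- surrogate key resolved by the lookup
            s.sale_date,
            s.amount
     FROM   sales_src s
     JOIN   upc_to_product m
            ON s.upc = m.upc;      -- key lookup: UPC --> PRODUCT_ID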
(b) Business Rule Violation Scenario
- Some UPC Codes are invalid
- In which ways can you implement the transformation and handle the invalid UPCs?
- e.g., new sales data might not have valid UPC codes
(1) use Create Table As Select (CTAS) to identify and log invalid UPCs
(2) use an outer join (+) and convert invalid UPCs to NULL
(3) use multi-table INSERT and separate the invalid UPCs to a different table
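Hedged sketches of the three options, reusing the assumed names from the previous sketch (SALES_SRC, UPC_TO_PRODUCT, SALES_FACT); the log and reject table names are also assumptions:

-- (1) CTAS: identify and log source rows whose UPC has no match in the mapping table
SQL> CREATE TABLE invalid_upc_log AS
     SELECT s.*
     FROM   sales_src s
     WHERE  NOT EXISTS (SELECT 1 FROM upc_to_product m WHERE m.upc = s.upc);

-- (2) outer join: load all rows, invalid UPCs end up with a NULL product_id
SQL> INSERT INTO sales_fact (product_id, sale_date, amount)
     SELECT m.product_id, s.sale_date, s.amount
     FROM   sales_src s, upc_to_product m
     WHERE  s.upc = m.upc (+);

-- (3) multi-table INSERT: route valid and invalid rows to different tables
SQL> INSERT FIRST
       WHEN product_id IS NOT NULL THEN
         INTO sales_fact    (product_id, sale_date, amount) VALUES (product_id, sale_date, amount)
       ELSE
         INTO sales_rejects (upc, sale_date, amount)        VALUES (upc, sale_date, amount)
     SELECT m.product_id, s.upc, s.sale_date, s.amount
     FROM   sales_src s, upc_to_product m
     WHERE  s.upc = m.upc (+);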
(c) Data Error Scenarios
- If the quality of the data is unknown, you may face unexpected errors, such as a failed datatype conversion or a constraint violation.
- Such errors can happen independently of whether the UPC for the particular record is valid or not.
- Here you can use a DML error logging table to capture unexpected errors.
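A minimal sketch of DML error logging with the same assumed tables; DBMS_ERRLOG.CREATE_ERROR_LOG and the LOG ERRORS clause are standard Oracle Database features:

-- create an error logging table for the target (default name ERR$_SALES_FACT)
SQL> BEGIN
       DBMS_ERRLOG.CREATE_ERROR_LOG(dml_table_name => 'SALES_FACT');
     END;
     /

-- rows raising unexpected errors are logged instead of failing the whole statement
SQL> INSERT INTO sales_fact (product_id, sale_date, amount)
     SELECT m.product_id, s.sale_date, s.amount
     FROM   sales_src s, upc_to_product m
     WHERE  s.upc = m.upc (+)
     LOG ERRORS INTO err$_sales_fact ('load_sales') REJECT LIMIT UNLIMITED;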
Challenges to data integration:
- Terabytes to petabytes of data
- Phenomenal data explosion
- DI/ETL market growing at 15%
- Most enterprises looking to move away from scripts to solutions
- Demand for near real-time/real-time data integration
- Explosion in the number of sources: from 10s to 1000s
Use of ETL tools: (Oct 2011 Forrester survey)
- 90% of enterprises still make use of batch data loading into a DW or BI platform
- 50% performing real-time/near real-time integration
- 47% performing batch database migrations
New data integration requirements:
- support for all types of data (structured, semi-structured, unstructured)
- Support for Big Data (integrate with Hadoop)
Top three benefits found from using Oracle Data Integrator
(source: here)
- improvement in project completion (resource management, ability to reuse code, reduced manual hand-coding, fewer errors and easier troubleshooting)
- shorter reporting cycle
- Deferred hardware upgrade
ODI modules:
- Designer module
- Operator module
- Topology module
- Security module
The ODI run-time Agent:
(a) retrieves the code stored in the ODI repository,
(b) connects to the various source and target systems, and
(c) orchestrates the overall data integration process
(d) when execution is completed, the agent updates the execution logs in the repository and reports error messages and execution statistics. These can be reviewed in the Operator Navigator in ODI Studio or through the web interfaces.
Two types of Agents:
- Standalone Agent - can be installed on the source or target systems. Requires a JVM.
- Java EE Agent - Deployed on Oracle WebLogic Server.
- The ODI Repository consists of a Master Repository and one or more Work repositories.
- Repositories are sets of tables that can be stored in an RDBMS (e.g., Oracle, SQL Server, DB2, etc.).
- All objects used by the various ODI modules are stored in the repositories
- The Master Repository contains
- Security information (user profiles and privileges)
- topology information
- source code for all versions of all ODI objects
- Several work repositories can coexist in the same installation (development, quality control, production)
Work Repositories store information for
- Models (i.e. metadata) - datastores, columns, data integrity constraints, cross references, data lineage and impact analysis
- Projects - interfaces, packages, procedures, folders, knowledge models, variables
- Runtime information - scenarios, load plans, scheduling information and logs
Typical steps of a data warehouse project:
- Research and specify business needs (Key Indicators)
- Identify data sources relevant to generate key indicators
- Define business rules to transform source information into key indicators
- Model the data structure of the target warehouse to store the key indicators
- Populate the indicators by implementing business rules
- Measure the overall accuracy of the data by setting up data quality rules
- Develop reports on key indicators
- Make key indicators and metadata available to business users through ad-hoc query tools or predefined reports
- Measure business users’ satisfaction and add/modify key indicators
Business User
- Access the final calculated key indicators
- Use reports and ad-hoc queries
- May need to understand the definition of an indicator
- May need to be aware of data quality issues
- when was the last time the table was updated?
- How many records were added, updated, or removed in the table?
- What are the rules that calculate a particular indicator?
- Where does the data come from, and how is it transformed?
Business Analyst
- Define the key indicators
- Identify the source applications
- How many different source applications need to be considered?
- Is the data needed for key indicators available in the selected pool of source applications?
- What data quality issues are present in the source systems?
- Specify business rules to transform source data into meaningful target indicators
- Projects can use the ODI Designer to directly specify the business rules
- For each target table, specify also:
- Target datastore - name the target datastore
- Description of transformation - describe its purpose
- Integration strategy - How data should be written to target (replace table, append, update, etc). Each strategy specified will correspond to an ODI Integration Knowledge Module.
- Refresh frequency
- Dependencies - what datastores need to be loaded or jobs executed prior to this one
- Source datastores - source databases, applications, etc used
- Source ODI Model
- Datastore name, Field Mappings and transformations, Links or Join criteria, Filters, Data Quality requirements, constraint names and expressions, etc.
- Maintain translation data from operational semantics to the Data Warehouse semantics
Developer
- Implement the business rules as specified by business analysts
- Provide executable scenarios to the production team
- Must understand infrastructure details and have business knowledge of source applications
Metadata Administrator
- Reverse engineer source and target applications
- Understand content and structure of source applications
- Connect to source applications and capture their metadata
- Define data quality business rules (specified by Business Analyst) in ODI repository
- What level of Data Quality is required?
- who are the business owners of source data?
- What should be done with rejected records?
- Should there be an error recycling strategy?
- How would business users modify erroneous source data?
- Should a GUI be provided for source data correction?
- Guarantee the overall consistency of Metadata in the various environments (dev, test, prod repositories)
- Participate in the data modeling of key indicators
- Add comments, descriptions and integrity rules (PK, FK, Check, etc.) in the metadata
- Provide version management
Database Administrator
- Define technical database structure supporting the various layers in the data warehouse project (and ODI structure)
- Create database profiles needed by ODI
- Create schemas and databases for the staging areas
- Describe tables and columns in the data dictionary of the database using COMMENT ON TABLE / COMMENT ON COLUMN (see the sketch after this list)
- Avoid using PKs of source systems as PKs in target tables. Use counter or identity columns instead.
- Design Referential Integrity and reverse-engineer FKs in ODI models.
- Do not implement referential integrity on the target database (for performance); data quality control should guarantee data integrity.
- Standardize object naming conventions
- Distribute and maintain the descriptions of the environments (topology)
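A minimal sketch of the COMMENT ON approach mentioned above, using the illustrative TRG_CUSTOMER table:

SQL> COMMENT ON TABLE trg_customer IS 'Customer dimension loaded by the Load Customers package';
SQL> COMMENT ON COLUMN trg_customer.cust_id IS 'Surrogate key (do not reuse the source system PK)';

The descriptions then show up in the *_TAB_COMMENTS and *_COL_COMMENTS dictionary views.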
System Admin
- Maintain technical resources for the Data Warehouse project
- May install and monitor the scheduler agents
- May backup and restore repositories
- Install and monitor ODI console
- Set up the various environments (dev, test, prod)
Security Admin
- Define the security policy for the ODI Repository
- Create ODI users and grant rights on models, projects and contexts
Operator
- Import released and tested scenarios into production environment
- Schedule the execution of production scenarios
- Monitor execution logs and restart failed sessions
- The ODI Repository is the central component.
- ODI Repository stores configuration information about
- IT infrastructure and topology
- metadata of all applications
- projects
- Interfaces
- Models
- Packages
- Scenarios
- Solutions - versioning control
- You can have various repositories within an IT infrastructure, linked to separate environments that exchange metadata and scenarios (e.g., development, test, user acceptance test, production, hot fix)
(1) A single master repository:
- holds all the topology and security information.
- All the work repositories are registered here.
- Contains all the versions of objects that are committed by the designers.
(2) “Development” work repository:
- Shared by all ODI designers
- Holds all the projects and models under development.
(3) “Testing” work repository
- shared by the IT testing team.
- Contains all the projects and models being tested for future release.
(4) “User Acceptance Tests” work repository:
- shared by the IT testing team and business analysts.
- Contains all the projects and models about to be released.
- Business analysts will use the ODI Console on top of this repository to validate the scenarios and transformations before releasing them to production.
(5) “Production” work repository
- shared by the production team, the operators and the business analysts.
- Contains all the projects and models in read-only mode for metadata lineage, as well as all the released scenarios.
(6) “Hot fix” work repository:
- shared by the maintenance team and the development team.
- Usually empty, but whenever a critical error happens in production, the maintenance team restores the corresponding projects and models in this repository and performs the corrections with the help of the development team.
- Once the problems are solved, the scenarios are released directly to the production repository and the new models and projects are versioned in the master repository.
(a) Full-featured, fully integrated multidimensional server
- Applications can query all data using a dimensional query language (Oracle OLAP API or MDX)
- OLAP engine runs within the kernel of Oracle Database.
- Dimensional objects (cubes, measures, dimensions, levels, hierarchies, attributes) are stored in Oracle Database in their native multidimensional format.
- Cubes and other dimensional objects are first class data objects represented in the data dictionary.
- Data security is administered in the standard way, by granting and revoking privileges to users and roles.
- Applications can query dimensional objects using SQL.
(b) A Summary management solution
- The Oracle cube is used to manage summary data and is accessed transparently by SQL-based BI applications as a cube-organized Materialized View, using the query rewrite feature.
- Cube-organized MVs were introduced in Oracle Database 11g and play the same role as table-based MVs
- Cube-organized MVs allow any application to transparently access summary data managed by the cube.
- Cube-organized MVs provide substantial query performance enhancement with much lower maintenance overhead than table-based MV alternatives
- With Cube-organized MVs, applications query the detail tables and the database automatically rewrites the query to access summary data in the materialized view.
To expose the cube as a Materialized View, you can use the Analytic Workspace Manager:
Both detail and summary data in the cube are exposed in the Materialized View.
When the detail tables are queried, the optimizer rewrites the query to use the cube-organized materialized view:
SQL> select t.calendar_year_name, p.department_name, cu.region_name,
            sum(s.quantity) as quantity, sum(s.sales) as sales
     from   times t, customers cu, products p, sales_fact s
     where  cu.customer_key = s.customer
       and  p.item_key = s.product
       and  s.day_key = t.day_key
     group by t.calendar_year_name, p.department_name, cu.region_name;

Explain plan:

OPERATION
-----------------------------------
SELECT STATEMENT
  HASH
    CUBE SCAN  CB$SALES_CUBE
(c) A supplier of rich analytic content
- OLAP cubes are exposed as Cube views.
- Cube views include detail and aggregate data as individual columns (measures)
- Cube views can expose complex analytic content, including hierarchical aggregations, statistical forecasts, additive and non-additive aggregations, calculated measures, etc.
Since Oracle provides a complete SQL interface to the OLAP cube:
- Cubes and Dimensions can be thought of as relational objects:
- Cubes as relational objects that offer improved performance and advanced analytic content, and
- Dimensions as relational objects that include columns with information useful for creating hierarchical queries (see the sketch below).
- The cube is simply another data type in the database.
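As a rough sketch of such a hierarchical query, assuming an OLAPTRAIN-style hierarchy view named GEOGRAPHY_REGIONAL_VIEW whose columns include DIM_KEY, PARENT and LEVEL_NAME (actual names should be checked in your workspace):

-- walk the GEOGRAPHY hierarchy from its root members down, indenting by depth
SQL> SELECT LPAD(' ', 2 * (LEVEL - 1)) || dim_key AS member,
            level_name
     FROM   geography_regional_view
     START WITH parent IS NULL
     CONNECT BY PRIOR dim_key = parent;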
(a) Improved query performance for Ad-hoc query patterns
- As query patterns become less predictable, creation and maintenance of materialized views for specific queries becomes impractical.
- Consider a data model with four dimensions (time, customer, product, channel), each with six levels of summarization (e.g., in the time dimension: day, week, month, quarter, half year and year).
- In this case, there are 6**4 - 1 = 1,295 possible combinations representing summary-level data that users might query (one level chosen per dimension, excluding the all-detail combination).
(b) Improved query performance for summary data
(c) Fast incremental update
(d) Rich analytic content
(e) Metadata that describes the logical business model and the relational representations of the cube
(a) Metadata about the Cube's structure, data, and how it is calculated
(b) Metadata about how the cube is represented for query using SQL cube views
- The description of the cube's structure includes information on dimensions, hierarchies, levels, attributes and measures.
- For measures, metadata is available describing how the cube is calculated and the calculation expression of a measure.
SQL> select table_name from dict where table_name like '%CUBE%';

TABLE_NAME
------------------------------
...
DBA_CUBES
DBA_CUBE_ATTRIBUTES
DBA_CUBE_ATTR_VISIBILITY
DBA_CUBE_BUILD_PROCESSES
DBA_CUBE_CALCULATED_MEMBERS
DBA_CUBE_DIMENSIONALITY
DBA_CUBE_DIMENSIONS
DBA_CUBE_DIM_LEVELS
DBA_CUBE_DIM_MODELS
DBA_CUBE_DIM_VIEWS
DBA_CUBE_DIM_VIEW_COLUMNS
DBA_CUBE_HIERARCHIES
DBA_CUBE_HIER_LEVELS
DBA_CUBE_HIER_VIEWS
DBA_CUBE_HIER_VIEW_COLUMNS
DBA_CUBE_MEASURES
DBA_CUBE_VIEWS
DBA_CUBE_VIEW_COLUMNS
...

=> Checking the existing dimensions in the SALESTRACK analytic workspace:

SQL> select * from dba_cube_dimensions;

OWNER      DIMENSION_NA DIMENSION_TYPE  AW_NAME     DEFAULT_HIERARCHY_NAME  DESCRIPTION
---------- ------------ --------------- ----------- ----------------------- -----------
OLAPTRAIN  PRODUCT2     STANDARD        SALESTRACK  STANDARD                product2
OLAPTRAIN  CHANNEL      STANDARD        SALESTRACK  SALES_CHANNEL           Channel
OLAPTRAIN  GEOGRAPHY    STANDARD        SALESTRACK  REGIONAL                Geography
OLAPTRAIN  PRODUCT      STANDARD        SALESTRACK  STANDARD                Product
OLAPTRAIN  TIME         TIME            SALESTRACK  CALENDAR                Time
You can use SQL Developer to:
- check existing hierarchies, levels and order within hierarchies in the 'PRODUCT2' dimension.
- List existing cubes, measures within the cube, whether a measure is derived or stored and the equation that generates that measure:
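Roughly the same information can also be read from the dictionary views; a small sketch (exact column lists vary by release, so verify against your database):

-- hierarchies and levels of the PRODUCT2 dimension
SQL> SELECT * FROM dba_cube_hier_levels WHERE dimension_name = 'PRODUCT2';

-- cubes and the measures they contain
SQL> SELECT owner, cube_name, measure_name FROM dba_cube_measures ORDER BY owner, cube_name;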
Cube Views
- OLAP Cubes are exposed through views on a star schema.
- As the structure of dimensions and cubes is updated, the views are automatically maintained
- If you add new measures, the cube view will be updated
- Views are always in sync with the underlying dimensions and cubes
- A single cube view exposes (as columns) all the aggregations and all the calculations for a cube
SQL> select * from dba_cube_views;

OWNER      CUBE_NAME       VIEW_NAME
---------- --------------- -------------------
OLAPTRAIN  FORECAST        FORECAST_VIEW
OLAPTRAIN  SALES_CUBE      SALES_CUBE_VIEW
OLAPTRAIN  NEW_CUBE        NEW_CUBE_VIEW

SQL> select owner, cube_name, view_name, column_name from dba_cube_view_columns;

OWNER      CUBE_NAME       VIEW_NAME                      COLUMN_NAME
---------- --------------- ------------------------------ ------------------------------
OLAPTRAIN  FORECAST        FORECAST_VIEW                  BEST_FIT
OLAPTRAIN  FORECAST        FORECAST_VIEW                  LINEAR_REGRESSION
OLAPTRAIN  SALES_CUBE      SALES_CUBE_VIEW                SALES
OLAPTRAIN  SALES_CUBE      SALES_CUBE_VIEW                QUANTITY
OLAPTRAIN  SALES_CUBE      SALES_CUBE_VIEW                SALES_YTD
OLAPTRAIN  SALES_CUBE      SALES_CUBE_VIEW                SALES_YTD_PY
...
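To query a cube view directly, the usual pattern is to join it to the dimension hierarchy views and filter each dimension to the wanted level. A sketch assuming OLAPTRAIN-style view and column names (TIME_CALENDAR_VIEW, GEOGRAPHY_REGIONAL_VIEW, DIM_KEY, LEVEL_NAME, LONG_DESCRIPTION and the 'TOTAL' level names are all assumptions to verify in your workspace):

-- yearly sales by region, read straight from the cube view
SQL> SELECT t.long_description AS year,
            g.long_description AS region,
            f.sales
     FROM   sales_cube_view            f,
            time_calendar_view         t,
            geography_regional_view    g,
            product_standard_view      p,
            channel_sales_channel_view c
     WHERE  f.time      = t.dim_key
       AND  f.geography = g.dim_key
       AND  f.product   = p.dim_key
       AND  f.channel   = c.dim_key
       AND  t.level_name = 'CALENDAR_YEAR'
       AND  g.level_name = 'REGION'
       AND  p.level_name = 'TOTAL'      -- all products
       AND  c.level_name = 'TOTAL';     -- all channels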
=> The Cube's SQL interface is also available in SQL Developer: