
Q&As: Data Warehousing and Partitioning (II)



(Q) What is a Partition-Wise Join? How can SQL statements benefit from Full Partition-Wise Joins?

  • A Partition-Wise (P-W) Join divides a large join into smaller joins between pairs of partitions from the two joined tables.
  • To enable Full P-W Joins, either:
    • (a) equipartition BOTH tables on their join keys, or
    • (b) use reference partitioning
  • P-W Joins reduce query response time by minimizing the amount of data exchanged among parallel execution servers when joins execute in parallel (see the sketch below).
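A minimal sketch of option (a), using hypothetical table names: both tables are hash-partitioned on the join key (cust_id), so a parallel join between them can be divided into one smaller join per pair of corresponding partitions.

 SQL> create table customers_pwj (
        cust_id   number primary key,
        cust_name varchar2(100))
      PARTITION BY HASH (cust_id) PARTITIONS 16;

 SQL> create table sales_pwj (
        sale_id number,
        cust_id number not null,
        amount  number)
      PARTITION BY HASH (cust_id) PARTITIONS 16;

 -- a parallel join on the common partitioning key can now be executed as
 -- 16 independent joins between corresponding partition pairs:
 SQL> select /*+ PARALLEL(s, 4) PARALLEL(c, 4) */ c.cust_name, sum(s.amount)
        from sales_pwj s join customers_pwj c on s.cust_id = c.cust_id
       group by c.cust_name;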


(Q) What are the two most common types of indexes in Data Warehouse environments?

  • B-tree and bitmap indexes
  • Both can be created as local indexes on a partitioned table, inheriting the table's partitioning strategy
  • Local nonprefixed indexes can improve performance, because many index partitions can be scanned in parallel by range queries on the index key:
SQL> create index cust_last_name_ix ON customers(last_name) LOCAL PARALLEL NOLOGGING;


(Q) How can bitmap index improve performance in a Data Warehouse environment?

  • Bitmap indexes use a very efficient storage mechanism for low-cardinality columns.
  • Fact table foreign key columns are ideal candidates for bitmap indexes: since dimension tables should not have many rows, there are few distinct values relative to the total number of rows in the fact table (see the sketch below).
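For illustration, a sketch on the SH sample schema used throughout this post (the index name is hypothetical). Note that on a partitioned table, bitmap indexes must be created LOCAL:

 SQL> create bitmap index sales_prod_bix ON sales(prod_id) LOCAL NOLOGGING;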


(Q) What is Partition Exchange Load (PEL)? When is it used? What benefits does it provide?

  • Oracle strongly recommends that the larger tables or fact tables in a data warehouse be partitioned.
  • With a partitioned table, you can load data quickly and easily, with minimal impact on business users, by using the exchange partition command.
  • With the exchange partition command you can swap the data in a nonpartitioned table into a particular partition of your partitioned table.
  • NO physical data movement occurs with the exchange partition command.
  • Instead, the data dictionary is updated to exchange a pointer from the partition to the table and vice versa.
  • Because there is no physical movement of data, an exchange does not generate redo and undo, making it a VERY fast operation.
  • By exchanging their data segments, you can convert:
    • (a) a partition (or subpartition) into a nonpartitioned table, and
    • (b) a nonpartitioned table into a partition (or subpartition) of a partitioned table
  • In DWH environments, exchanging partitions facilitates high-speed loading of new, incremental data into an existing partitioned table.
  • OLTP and DWH environments can also benefit from exchanging old data partitions out of a partitioned table.
  • In the latter case, the data is purged from the partitioned table without being deleted, and can be archived separately afterwards.

(1) Create a separate table that looks exactly like a single partition of the existing table (including indexes, constraints, and partitioning scheme)
(2) Load the data into the new table, build indexes, and enable constraints. This will not impact current users of the existing table
(3) Exchange the existing table partition with the newly created table
i.e. (a) create a new partition on the existing table
 SQL> alter table sales add partition p_sales_jun_2007
        values less than (to_date('01-JUL-2007', 'DD-MON-YYYY'));

(b) create a staging table that looks like the existing partition
 SQL> create table sales_jun_2007 COMPRESS FOR OLTP as select * from sales where 1=0;

(c) Populate table and create indexes and constraints 
 SQL> create bitmap index time_id_jun_2007_bix ON sales_jun_2007(time_id) nologging;
 SQL> create bitmap index cust_id_jun_2007_bix ON sales_jun_2007(cust_id) nologging;
 SQL> create bitmap index prod_id_jun_2007_bix ON sales_jun_2007(prod_id) nologging;
 SQL> create bitmap index promo_id_jun_2007_bix ON sales_jun_2007(promo_id) nologging;
 SQL> create bitmap index channel_id_jun_2007_bix ON sales_jun_2007(channel_id) nologging;
 
 SQL> ALTER TABLE sales_jun_2007 ADD CONSTRAINT prod_id_fk 
                                            FOREIGN KEY (prod_id) REFERENCES products(prod_id);
 SQL> ALTER TABLE sales_jun_2007 ADD CONSTRAINT cust_id_fk 
                                            FOREIGN KEY (cust_id) REFERENCES customers(cust_id);
 SQL> ALTER TABLE sales_jun_2007 ADD CONSTRAINT promo_id_fk 
                                            FOREIGN KEY (promo_id) REFERENCES promotions(promo_id);
 SQL> ALTER TABLE sales_jun_2007 ADD CONSTRAINT time_id_fk 
                                            FOREIGN KEY (time_id) REFERENCES times(time_id);
 SQL> ALTER TABLE sales_jun_2007 ADD CONSTRAINT channel_id_fk 
                                            FOREIGN KEY (channel_id) REFERENCES channels(channel_id);

(d) Exchange partition 
 SQL> Alter table sales EXCHANGE PARTITION p_sales_jun_2007 WITH TABLE sales_jun_2007
   including indexes;

[Figure: Partition Exchange Load (PEL)]



Q&As: Data Warehousing and Partitioning (I)




(Q) In which ways do partitions facilitate administrative operations?

  • Partitioning enables administrative operations to work on subsets of data.
  • You can add, reorganize, or drop a partition with minimal or zero interruption to read-only applications
  • Adding a partition: you can defer segment creation, which is useful when the segment is very large
  • Changing partition granularity: you can split or merge partitions
  • Swapping partitions: you can add, remove, or swap a large amount of data quickly
  • Partitioning also enhances data access and improves overall application performance:
    • More users can query the system, since each SQL statement is likely to access smaller amounts of data
    • Complex queries run faster, because they require less I/O


(Q) List three advantages of using partitioned objects

(1) Backup and recovery can be performed on a lower level of granularity (at the partition level)
(2) Part of the database objects can be placed in compressed storage, while others remain uncompressed
(3) You can transparently store data on different storage tiers to lower costs, as sketched below
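A hedged sketch of (2) and (3) combined, assuming the SH sales table has a partition named sales_1995 and that a hypothetical low-cost tablespace low_cost_ts exists:

 SQL> alter table sales MOVE PARTITION sales_1995
        TABLESPACE low_cost_ts COMPRESS UPDATE INDEXES;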


(Q) How can the database take advantage of partitioning when executing a SQL statement?

  • By using parallel execution to speed up queries, DML, and DDL.
  • Individual parallel execution servers can work on their own data sets, identified by partition boundaries (see the example below).
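For example, a sketch of a parallel query on the partitioned sales table (the degree of 8 is arbitrary); each parallel execution server scans its own subset of partitions:

 SQL> select /*+ PARALLEL(s, 8) */ prod_id, sum(amount_sold)
        from sales s
       group by prod_id;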


(Q) What is partition pruning?

  • When executing a SQL statement, the optimizer analyzes the FROM and WHERE clauses to eliminate unneeded partitions from the execution plan's access list.
  • In the pruning examples below, the database scans only one partition or a handful of partitions (see the Pstart/Pstop columns) when executing the query. In a non-partitioned table, all data would have to be scanned.


(Q) What is the difference between static pruning and dynamic pruning?

  • Static pruning - occurs at compile time.
    • Example: a SQL statement whose WHERE clause compares the partition key column with a constant literal
  • Dynamic pruning - occurs at run time, when the exact partitions to be accessed are not known beforehand
    • Example: a SQL statement with an operator, function, or bind variable in the WHERE condition (see the sketch below)
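A sketch of dynamic pruning with a bind variable, a common case: the partition cannot be determined at parse time, so the plan shows KEY in the Pstart/Pstop columns instead of partition numbers:

 SQL> variable d varchar2(20)
 SQL> exec :d := '01-JAN-2001'
 SQL> select sum(amount_sold) from sales
       where time_id = to_date(:d, 'DD-MON-YYYY');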


(Q) How can you identify when static pruning is used in an execution plan?

  • In the execution plan, the OPERATION column shows PARTITION RANGE SINGLE when only a single partition is accessed, or PARTITION RANGE ITERATOR when a contiguous range of partitions is accessed; the Pstart/Pstop columns show which ones.
  • If OPERATION = PARTITION RANGE ALL, then ALL partitions are being accessed
select * from sales where time_id = to_date('01-jan-2001', 'dd-mon-yyyy');
... Returns 766 rows

explain plan for select * from sales where time_id = to_date('01-jan-2001', 'dd-mon-yyyy');
plan FOR succeeded

select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------------------------------------------
Plan hash value: 3733446669

---------------------------------------------------------------------------------------------------------------------  
| Id | Operation                          | Name           | Rows | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
--------------------------------------------------------------------------------------------------------------------- 
|  0 | SELECT STATEMENT                   |                |  766 | 22980 |    27   (0)| 00:00:01 |       |       |
|  1 |  PARTITION RANGE SINGLE            |                |  766 | 22980 |    27   (0)| 00:00:01 |    17 |    17 |
|  2 |   TABLE ACCESS BY LOCAL INDEX ROWID| SALES          |  766 | 22980 |    27   (0)| 00:00:01 |    17 |    17 |
|  3 |    BITMAP CONVERSION TO ROWIDS     |                |      |       |            |          |       |       |
|* 4 |     BITMAP INDEX SINGLE VALUE      | SALES_TIME_BIX |      |       |            |          |    17 |    17 |
--------------------------------------------------------------------------------------------------------------------- 

Predicate Information (identified by operation id):                
----------------------------------------
   4 - access("TIME_ID"=TO_DATE(' 2001-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))      

 16 rows selected 


(Q) What types of SQL statements are candidates for partition pruning?

  • On range- or list-partitioned columns: when you use range, LIKE, equality, and IN-list predicates
  • On hash-partitioned columns: when you use equality and IN-list predicates (see the sketches below)
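Hedged sketches of predicate shapes that make a query a pruning candidate on the range-partitioned key sales.time_id:

 SQL> select count(*) from sales where time_id = DATE '2001-01-01';                        -- equality
 SQL> select count(*) from sales where time_id >= DATE '2001-01-01';                       -- range
 SQL> select count(*) from sales where time_id IN (DATE '2001-01-01', DATE '2001-06-01');  -- IN-list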


(Q) How can datatype conversions affect the type of partition pruning that may be used?

  • Datatype conversions typically lead to dynamic pruning when static pruning would otherwise have been possible
  • Static pruning provides better performance than dynamic pruning
  • Example: proper use of the TO_DATE function guarantees that the database can uniquely determine the date value, and thus the partition that should be accessed
Compare the following queries:
SELECT SUM(amount_sold) total_revenue FROM sh.sales
(clause 1)   WHERE time_id between '01-JAN-00' and '31-DEC-00' ;

(clause 2)   WHERE time_id between '01-JAN-2000' and '31-DEC-2000' ;

(clause 3)   WHERE time_id between to_date('01-JAN-2000', 'dd-MON-yyyy') and 
                                   to_date('31-DEC-2000', 'dd-MON-yyyy');

(Clause 1:) 
Leads to DYNAMIC pruning, because the date string must be implicitly converted and does not match NLS_DATE_FORMAT

SQL> explain plan for 
 SELECT SUM(amount_sold) total_revenue 
 FROM sales
 WHERE time_id between '01-JAN-00' and '31-DEC-00';

plan FOR succeeded.

SQL> select * from table(dbms_xplan.display);

PLAN_TABLE_OUTPUT
-----------------------------------------------------------------------------------------------------
Plan hash value: 3772476237   

---------------------------------------------------------------------------------------------------- 
| Id  | Operation                  | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop | 
---------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT           |       |     1 |    13 |   592  (19)| 00:00:08 |       |       | 
|   1 |  SORT AGGREGATE            |       |     1 |    13 |            |          |       |       | 
|*  2 |   FILTER                   |       |       |       |            |          |       |       | 
|   3 |    PARTITION RANGE ITERATOR|       |   230K|  2932K|   592  (19)| 00:00:08 |   KEY |   KEY | 
|*  4 |     TABLE ACCESS FULL      | SALES |   230K|  2932K|   592  (19)| 00:00:08 |   KEY |   KEY | 
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
                                                  
   2 - filter(TO_DATE('01-JAN-00')<=TO_DATE('31-DEC-00'))
   4 - filter("TIME_ID">='01-JAN-00' AND "TIME_ID"<='31-DEC-00')

 17 rows selected 


(Clause 2:) 
Uses STATIC pruning: although the date string must still be implicitly converted, it DOES match NLS_DATE_FORMAT

SQL> explain plan for 
 SELECT SUM(amount_sold) total_revenue 
 FROM sales
 WHERE time_id between '01-JAN-2000' and '31-DEC-2000';

plan FOR succeeded.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT 
-------------------------------------------------------------------------------------------------
Plan hash value: 1500327972 

--------------------------------------------------------------------------------------------------- 
| Id  | Operation                 | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop | 
--------------------------------------------------------------------------------------------------- 
|   0 | SELECT STATEMENT          |       |     1 |    13 |   136  (11)| 00:00:02 |       |       | 
|   1 |  SORT AGGREGATE           |       |     1 |    13 |            |          |       |       | 
|   2 |   PARTITION RANGE ITERATOR|       |   230K|  2932K|   136  (11)| 00:00:02 |    13 |    16 | 
|*  3 |    TABLE ACCESS FULL      | SALES |   230K|  2932K|   136  (11)| 00:00:02 |    13 |    16 | 
--------------------------------------------------------------------------------------------------- 

Predicate Information (identified by operation id):  
---------------------------------------------------  

   3 - filter("TIME_ID"<=TO_DATE(' 2000-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

 15 rows selected 

(Clause 3:) 
Uses STATIC pruning, because the dates are explicitly converted with TO_DATE, so the optimizer can uniquely determine the partitions at compile time

SQL> Explain plan for
SELECT SUM(amount_sold) total_revenue 
FROM sales
   WHERE time_id between to_date('01-JAN-2000', 'dd-MON-yyyy') and 
                         to_date('31-DEC-2000', 'dd-MON-yyyy');

plan FOR succeeded.

SQL> select * from table(dbms_xplan.display);
PLAN_TABLE_OUTPUT 
-------------------------------------------
Plan hash value: 1500327972

---------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name  | Rows  | Bytes | Cost (%CPU)| Time     | Pstart| Pstop |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |       |     1 |    13 |   136  (11)| 00:00:02 |       |       |
|   1 |  SORT AGGREGATE           |       |     1 |    13 |            |          |       |       |
|   2 |   PARTITION RANGE ITERATOR|       |   230K|  2932K|   136  (11)| 00:00:02 |    13 |    16 |
|*  3 |    TABLE ACCESS FULL      | SALES |   230K|  2932K|   136  (11)| 00:00:02 |    13 |    16 |
---------------------------------------------------------------------------------------------------
                                                                                                  
Predicate Information (identified by operation id): 
--------------------------------------------------- 

   3 - filter("TIME_ID"<=TO_DATE(' 2000-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

 15 rows selected 

Data Warehouse: Concepts on ETL Processes (Extraction)







(Q) Why may defining and creating the extraction process be a very time-consuming stage of the ETL process?

  • Sources are often multiple and independent, may be poorly documented, and 3rd-party applications may not be modifiable.
  • Business requirements on the source system may not accept the extra overhead of the extraction process, nor accept changes in its availability.
  • Often you cannot add additional logic to the source systems to support incremental extraction of data.


(Q) What are the two Logical extraction methods?

  • Full Extraction and Incremental Extraction
  • Full Extraction:
    • ALL data is extracted from the source system. Here you don't need to keep track of what has changed since the last extraction.
    • However, it is more costly in space, since the staging area might need to be much larger
  • Incremental Extraction:
    • Here you need to keep track of all data that has changed since the last extraction, or since some complex business event like the last booking day of a fiscal period.
    • This information may be provided by the source system (i.e. an application column, or a change table). In this case, the source system would likely have to be modified to incorporate the change tracking functionality.
    • Alternatively, Oracle provides a Change Data Capture (CDC) mechanism (with Oracle GoldenGate)


(Q) What two methods of physical extraction are there?

  • Online extraction:
    • Data is extracted directly from the source itself; the extraction process connects directly to the source.
  • Offline extraction:
    • Data is staged explicitly outside the original source system.
    • In this case, the data either has an existing structure (i.e. redo logs, archive logs, or transportable tablespaces) or was created by an extraction routine (i.e. stored in flat files).
    • The alternatives are: flat files, dump files (imp/exp), redo and archived logs (how to read these?), or transportable tablespaces.


(Q) What is the physical extraction technique recommended by Oracle most of the time? Why is it recommended?

  • Oracle recommends the use of transportable tablespaces.
  • (For this, either the source system is an Oracle DB, or an extraction routine was used to copy the data to a "staging" transportable tablespace; see the sketch below.)
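A hedged sketch of the transportable tablespace flow (tablespace, directory, and file names are assumptions):

 SQL> alter tablespace ts_temp_sales READ ONLY;

 $ expdp system DIRECTORY=dpump_dir DUMPFILE=ts_sales.dmp \
        TRANSPORT_TABLESPACES=ts_temp_sales
 -- copy the tablespace datafile(s) and the dump file to the target system, then:
 $ impdp system DIRECTORY=dpump_dir DUMPFILE=ts_sales.dmp \
        TRANSPORT_DATAFILES='/u01/oradata/dwh/ts_temp_sales01.dbf'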


(Q) What is the key-enabling technology for providing near real-time, or on-time, data warehousing?

  • Change Data Capture mechanism
  • Used for incremental extractions (e.g. nightly extractions, which need to extract only the data modified since the last extraction).


(Q) What usually is the most challenging technical issue in data extraction?

  • Devising a mechanism of Change Data Capture.


(Q) What are some techniques for implementing a change capture mechanism on Oracle source databases?

  • You can use timestamps, partitioning, or triggers. Each can be used whether the data is accessed directly through distributed queries or exported to flat files.
  • Timestamp columns are not always available in an operational source system.
  • In these cases, how could a timestamp solution for change capture be implemented?
    • You would have to add a new column to every system table and write triggers to update the new column each time an operation modifies the row data (see the sketch below).
    • Note that trigger-based CDC mechanisms can affect the performance of the source system.
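A minimal, hypothetical sketch of such a timestamp column plus trigger (table and trigger names are illustrative):

 SQL> alter table orders add (last_modified TIMESTAMP);

 SQL> create or replace trigger orders_cdc_trg
      before insert or update on orders
      for each row
      begin
        :new.last_modified := SYSTIMESTAMP;
      end;
      /

 -- the nightly incremental extraction then selects only the changed rows:
 SQL> select * from orders where last_modified > :last_extraction_time;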


(Q) What Oracle feature uses an internalized trigger-based mechanism for change capture?

  • Materialized view logs use an internalized trigger-based mechanism.
  • These logs are used by materialized views to identify changed data (example below).
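For illustration, a sketch of a materialized view log on the sales fact table (the column list is an assumption):

 SQL> create materialized view log ON sales
      WITH ROWID, SEQUENCE (prod_id, cust_id, amount_sold)
      INCLUDING NEW VALUES;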


(Q) What alternatives are available for extracting data into files if the source system is an Oracle database?

  • You can use SQL*Plus,
  • use OCI or a Pro*C program,
  • use the Export utility, or
  • use external tables.


(Q) You want to extract data using SQL*Plus from a single partitioned table. How can you parallelize this process?

  • If the table is range partitioned, you can initiate one SQL*Plus session for each partition:
SPOOL order_jan.dat
Select * from orders PARTITION (orders_jan1998);
SPOOL OFF


(Q) You want to extract data using SQL*Plus from a single non-partitioned table. What two criteria can you use to parallelize this process?

  • Logical or physical criteria.
  • Logical method: based on logical ranges of column values:
Select … where order_date
  between to_date('01-JAN-99', 'DD-MON-YY') and to_date('31-JAN-99', 'DD-MON-YY');
  • Physical method: based on ranges of data blocks:
    • Check the data dictionary and find the set of data blocks that make up the table.
    • From there, derive a set of ROWID-range queries (see the sketch below):
Select … where rowid between :rowid_lo and :rowid_hi;
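A hedged sketch of how such ROWID ranges might be derived from the dictionary, one range per extent (the owner and segment names are illustrative):

 SQL> select dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                                     e.block_id, 0)                    as rowid_lo,
             dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                                     e.block_id + e.blocks - 1, 32767) as rowid_hi
        from dba_extents e
        join dba_objects o on o.owner = e.owner and o.object_name = e.segment_name
       where e.owner = 'SH' and e.segment_name = 'ORDERS';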


(Q) What are the differences between extracting data with the Data Pump Export utility and with the Data Pump external table mechanism?

  • Export utility:
    • Metadata + data are exported. The output must be processed with the Import utility.
  • External table Data Pump mechanism:
    • The result of a SELECT can be stored in an Oracle-internal Data Pump format and processed as a regular external table (see the sketch below).
    • Metadata is NOT exported. Results can be stored in multiple extraction files.
    • If metadata is required, use the DBMS_METADATA package:
SET LONG 2000
Select dbms_metadata.get_ddl('TABLE', 'EXTRACT_CUST') from dual;
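A sketch of the external table unload itself, producing the extract_cust table referenced above (the directory and file names are assumptions):

 SQL> create table extract_cust
      ORGANIZATION EXTERNAL (
        TYPE ORACLE_DATAPUMP
        DEFAULT DIRECTORY ext_dir
        LOCATION ('cust1.dmp', 'cust2.dmp'))
      PARALLEL 2
      AS SELECT * FROM customers;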
    


(Q) What is an important scalability limitation of extraction through distributed operations?

  • The data is transported through a single Oracle Net connection.
  • For large amounts of data, file-based extraction and transportation techniques are often more scalable.