## Pages

### Data Warehouse: Concepts on ETL Processes (Extraction)

(Q) Why defining and creating the extraction process may be a very time-consuming stage in the ETL process?

• Often multiple and independent sources. May be poorly documented. 3rd party applications may not be modifiable.
• Business requirements on source may not accept an extra overhead of the extraction process, nor accept changes in its availability.
• Often you cannot add aditional logic to the source systems to support incremental extraction of data.

(Q) What are the two Logical extraction methods?

• Full Extraction and Incremental Extraction
• Full Extraction:
• ALL data is extracted from the source system. Here you don’t need to keep track of what has changed sinc e the last extraction.
• However, it is more costly in space, since the staging area might need to be much larger
• Incremental Extraction:
• Here you need to keep track of all data that has changed since the last extraction or some complex business event like the last booking day of a fiscal period.
• This information may be provided by the source system (i.e. an application column, or a change table). In this case the source system would likely have to be modified to incorporate the change tracking functionality).
• Alternatively, Oracle provides a Change Data Capture (CDC) mechanism (with Oracle Golden Gate)

(Q) What two methods of physical extraction are there?

• Online extraction:
• Data is extracted directly from the source itself. Extraction process connects directly to the source.
• Offline extraction:
• Data is staged explicitly outside the original source system.
• In this case, the data either has an exisitng structure (i.e. redo logs, archive logs, or transportable tablespace) or was created by an extraction routine. (i.e stored in flat files)
• The alternatives are: Flat files, dump files (imp/exp), redo and archived logs (how to read these?) or transportable tablespaces.

(Q) What is the physical extraction technique recommended by Oracle most of the time? Why is it recommended?

• Oracle recommends the use of transportable tablespaces.
• (For this, either the source system is in an Oracle DB or an extraction routine was used to copy the data to a "staging" transportable tablespace.

(Q) What is the key-enabling technology for providing near real-time, or on-time, data warehousing?

• Change Data Capture mechanism
• Used for incremental extractions (i.e, nightly extractions, which need to extract only data modified since last extraction).

(Q) What usually is the most challenging technical issue in data extraction?

• Devising a mechanism of Change Data Capture.
(Q) What are some techniques for implementing a change capture mechanism on Oracle source databases?
• You can use timestamps, partitioning or triggers. Each can be used either if the data is directly accessed through distributed queries or exported to flat file.
• Timestamps columns are not always available in an operational source system.
• In these cases, how could a timestamp solution for change capture be implemented?
• You would have to add a new column in every sytem table and write triggers to update the new column each time an operation modifies the row data.
• Note that trigger-based CDC mechanisms can affect performance of the source system.
(Q) What Oracle feature uses an internalized trigger-based mechanism to keep change capture?
• Materialized view logs use an internalized trigger-based mechanism.
• These logs are used by Materialized views to identify changed data.
(Q) What are the alternatives available for extracting data into files if the source system is an Oracle database?
• You can use SQL*Plus,
• Use OCI or a Pro*C program,
• use the Export utility,
• use External tables.
(Q) You want to extract data using SQL*Plus from a single partitioned table. How can you parallelize this process?
If the table is range partitioned, you can initiate one SQL*Plus session for each partition.
SPOOL order_jan.dat
Select * from orders PARTITION (orders_jan1998);
SPOOL OFF; 
(Q) You want to extract data using SQL*Plus from a single non-partitionied table. What two criteria can you use to parallelize this process?
Logical or physical criteria.
• Logical method: based on logical ranges of column values.
Select … where order_date
between to_date('01-JAN-99') and to_date('31-JAN-99');

• Physical method: based on a range of values.
• Check the data dictionary and find the set of data blocks that make up the table.
• From here, derive a set of rowid-range query.
Select … where order_date where rowid between v1 and v2;
(Q) What are the differences between extracting data using data pump export utility and data pump external table mechanism?
• Export utility:
• Metadata + data are exported. Output must be processed with the Import utility
• External table data pump utility:
• Results of a SELECT can be stored in an oracle-internal data pump format and processed as a regular external table.
• Metadata is NOT exported. Results can be stored in multiple extraction files.
Set LONG 2000