Data Warehouse: Concepts on ETL Processes (Extraction)

(Q) Why can defining and creating the extraction process be a very time-consuming stage of the ETL process?

  • Sources are often multiple and independent, may be poorly documented, and third-party applications may not be modifiable.
  • Business requirements on the source may not tolerate the extra overhead of the extraction process, nor accept changes in its availability.
  • Often you cannot add additional logic to the source systems to support incremental extraction of data.

(Q) What are the two Logical extraction methods?

  • Full Extraction and Incremental Extraction
  • Full Extraction:
    • ALL data is extracted from the source system. Here you don’t need to keep track of what has changed since the last extraction.
    • However, it is more costly in space, since the staging area might need to be much larger.
  • Incremental Extraction:
    • Here you need to keep track of all data that has changed since the last extraction, or since some well-defined business event such as the last booking day of a fiscal period.
    • This information may be provided by the source system (e.g., an application timestamp column or a change table); in this case the source system would likely have to be modified to incorporate the change-tracking functionality.
    • Alternatively, Oracle provides a Change Data Capture (CDC) mechanism (with Oracle GoldenGate).

(Q) What two methods of physical extraction are there?

  • Online extraction:
    • Data is extracted directly from the source itself. Extraction process connects directly to the source.
  • Offline extraction:
    • Data is staged explicitly outside the original source system.
    • In this case, the data either has an existing structure (e.g., redo logs, archived logs, or transportable tablespaces) or was created by an extraction routine (e.g., stored in flat files).
    • The alternatives are: Flat files, dump files (imp/exp), redo and archived logs (how to read these?) or transportable tablespaces.

(Q) What is the physical extraction technique recommended by Oracle most of the time? Why is it recommended?

  • Oracle recommends the use of transportable tablespaces.
  • For this, either the source system is an Oracle DB, or an extraction routine was used to copy the data into a "staging" transportable tablespace.
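As a rough sketch of the transportable-tablespace flow (the tablespace name, directory object, and file names here are illustrative assumptions, not from the source): make the tablespace read-only, export only its metadata with Data Pump, and copy the datafiles to the target:

```sql
-- Illustrative sketch; ts_temp_sales, dpump_dir, and file names are assumed.
-- 1. Freeze the tablespace on the source:
ALTER TABLESPACE ts_temp_sales READ ONLY;

-- 2. Export only its metadata with Data Pump (run from the shell):
--    expdp system DIRECTORY=dpump_dir DUMPFILE=ts_sales.dmp
--          TRANSPORT_TABLESPACES=ts_temp_sales

-- 3. Copy the datafiles and the dump file to the target system, plug the
--    tablespace in with impdp TRANSPORT_DATAFILES=..., then reopen it:
ALTER TABLESPACE ts_temp_sales READ WRITE;
```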

(Q) What is the key-enabling technology for providing near real-time, or on-time, data warehousing?

  • Change Data Capture mechanism
  • Used for incremental extractions (e.g., nightly extractions, which need to extract only data modified since the last extraction).

(Q) What usually is the most challenging technical issue in data extraction?

  • Devising a mechanism of Change Data Capture.
(Q) What are some techniques for implementing a change capture mechanism on Oracle source databases?
  • You can use timestamps, partitioning, or triggers. Each can be used whether the data is accessed directly through distributed queries or exported to flat files.
  • Timestamps columns are not always available in an operational source system.
  • In these cases, how could a timestamp solution for change capture be implemented?
    • You would have to add a new column in every system table and write triggers to update the new column each time an operation modifies the row data.
    • Note that trigger-based CDC mechanisms can affect performance of the source system.
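As a minimal sketch of that approach (the table and column names are illustrative assumptions), the trigger simply stamps each inserted or updated row, and the extraction then filters on the stamp:

```sql
-- Illustrative only: ORDERS and LAST_MODIFIED are assumed names.
ALTER TABLE orders ADD (last_modified DATE);

CREATE OR REPLACE TRIGGER orders_track_changes
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;  -- stamp every modified row
END;
/

-- An incremental extraction then reads only rows changed since the last run:
SELECT * FROM orders WHERE last_modified > :last_extraction_time;
```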
(Q) What Oracle feature uses an internalized trigger-based mechanism to keep change capture?
  • Materialized view logs use an internalized trigger-based mechanism.
  • These logs are used by Materialized views to identify changed data.
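A minimal example of creating such a log (the table name is an assumption); once created, Oracle maintains it automatically for every DML operation on the base table:

```sql
-- Illustrative: record the primary key of every changed row in ORDERS
CREATE MATERIALIZED VIEW LOG ON orders
  WITH PRIMARY KEY
  INCLUDING NEW VALUES;
```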
(Q) What are the alternatives available for extracting data into files if the source system is an Oracle database?
  • You can use SQL*Plus,
  • Use OCI or a Pro*C program,
  • use the Export utility,
  • use External tables.
(Q) You want to extract data using SQL*Plus from a single partitioned table. How can you parallelize this process?
If the table is range partitioned, you can initiate one SQL*Plus session for each partition:

SPOOL order_jan.dat
Select * from orders PARTITION (orders_jan1998);
SPOOL OFF
(Q) You want to extract data using SQL*Plus from a single non-partitioned table. What two criteria can you use to parallelize this process?
Logical or physical criteria.
  • Logical method: based on logical ranges of column values.
Select … where order_date 
between to_date('01-JAN-99') and to_date('31-JAN-99'); 
  • Physical method: based on ranges of rowids derived from the table's physical storage.
  • Check the data dictionary to find the set of data blocks that make up the table.
  • From here, derive a set of rowid-range queries.
Select … where rowid between :v1 and :v2;
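One way to derive such rowid ranges (a sketch assuming access to the DBA_EXTENTS and DBA_OBJECTS views; the owner and table name are illustrative) is to build a start and end rowid per extent with DBMS_ROWID:

```sql
-- Illustrative sketch: one rowid range per extent of the ORDERS table.
SELECT DBMS_ROWID.ROWID_CREATE(1, o.data_object_id, e.relative_fno,
                               e.block_id, 0)                    AS start_rowid,
       DBMS_ROWID.ROWID_CREATE(1, o.data_object_id, e.relative_fno,
                               e.block_id + e.blocks - 1, 32767) AS end_rowid
  FROM dba_extents e
  JOIN dba_objects o
    ON o.owner = e.owner
   AND o.object_name = e.segment_name
 WHERE e.owner = 'SH'
   AND e.segment_name = 'ORDERS';
```

Each resulting (start_rowid, end_rowid) pair can then drive one of the parallel rowid-range extraction sessions.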
(Q) What are the differences between extracting data using data pump export utility and data pump external table mechanism?
  • Export utility:
    • Metadata + data are exported. Output must be processed with the Import utility
  • External table data pump utility:
    • Results of a SELECT can be stored in an Oracle-internal data pump format and processed as a regular external table.
    • Metadata is NOT exported. Results can be stored in multiple extraction files.
    • If metadata is required, you need to use the DBMS_METADATA package:

SET LONG 2000
SELECT DBMS_METADATA.GET_DDL('TABLE', 'EXTRACT_CUST') FROM DUAL;
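The unload step itself can be sketched as follows (the directory object, file names, and source table are assumptions): a CREATE TABLE ... ORGANIZATION EXTERNAL statement with the ORACLE_DATAPUMP driver stores the result of a SELECT in data pump files:

```sql
-- Illustrative: DEF_DIR and the source table CUSTOMERS are assumed to exist.
CREATE TABLE extract_cust
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY def_dir
   LOCATION ('extract_cust1.exp', 'extract_cust2.exp'))
  PARALLEL 2
  AS SELECT * FROM customers;
```

With two location files and PARALLEL 2, the results are spread across multiple extraction files, as noted above.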
(Q) What is an important scalability limitation of extraction through distributed operations?
  • The data is transported through a single Oracle Net connection.
  • For large amounts of data, file-based data extraction and transportation techniques are often more scalable.
