Data Warehouse: Concepts on ETL Processes (Extraction)

(Q) Why can defining and creating the extraction process be a very time-consuming stage of the ETL process?

  • Sources are often multiple and independent, may be poorly documented, and third-party applications may not be modifiable.
  • Business requirements on the source may not tolerate the extra overhead of the extraction process, nor accept changes in its availability.
  • Often you cannot add additional logic to the source systems to support incremental extraction of data.

(Q) What are the two Logical extraction methods?

  • Full Extraction and Incremental Extraction
  • Full Extraction:
    • ALL data is extracted from the source system. Here you don’t need to keep track of what has changed since the last extraction.
    • However, it is more costly in space, since the staging area might need to be much larger.
  • Incremental Extraction:
    • Here you need to keep track of all data that has changed since the last extraction, or since some well-defined business event such as the last booking day of a fiscal period.
    • This information may be provided by the source system (e.g., an application timestamp column or a change table); in this case the source system would likely have to be modified to incorporate the change-tracking functionality.
    • Alternatively, Oracle provides a Change Data Capture (CDC) mechanism (with Oracle GoldenGate).

(Q) What two methods of physical extraction are there?

  • Online extraction:
    • Data is extracted directly from the source itself. Extraction process connects directly to the source.
  • Offline extraction:
    • Data is staged explicitly outside the original source system.
    • In this case, the data either has an existing structure (e.g., redo logs, archived logs, or transportable tablespaces) or was created by an extraction routine (e.g., stored in flat files).
    • The alternatives are: Flat files, dump files (imp/exp), redo and archived logs (how to read these?) or transportable tablespaces.

(Q) What is the physical extraction technique recommended by Oracle most of the time? Why is it recommended?

  • Oracle recommends the use of transportable tablespaces.
  • For this, either the source system is an Oracle DB, or an extraction routine was used to copy the data into a "staging" transportable tablespace.
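As a rough sketch of the transportable-tablespace flow (the tablespace name, directory object, and file names here are illustrative assumptions, not from the source): make the tablespace read-only, export only its metadata with Data Pump, and copy the datafiles to the target:

```sql
-- Illustrative sketch; ts_temp_sales, dpump_dir, and file names are assumed.
-- 1. Freeze the tablespace on the source:
ALTER TABLESPACE ts_temp_sales READ ONLY;

-- 2. Export only its metadata with Data Pump (run from the shell):
--    expdp system DIRECTORY=dpump_dir DUMPFILE=ts_sales.dmp
--          TRANSPORT_TABLESPACES=ts_temp_sales

-- 3. Copy the datafiles and the dump file to the target system, plug the
--    tablespace in with impdp TRANSPORT_DATAFILES=..., then reopen it:
ALTER TABLESPACE ts_temp_sales READ WRITE;
```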

(Q) What is the key-enabling technology for providing near real-time, or on-time, data warehousing?

  • Change Data Capture mechanism
  • Used for incremental extractions (e.g., nightly extractions, which need to extract only data modified since the last extraction).

(Q) What usually is the most challenging technical issue in data extraction?

  • Devising a mechanism of Change Data Capture.
(Q) What are some techniques for implementing a change capture mechanism on Oracle source databases?
  • You can use timestamps, partitioning, or triggers. Each can be used whether the data is accessed directly through distributed queries or exported to flat files.
  • Timestamps columns are not always available in an operational source system.
  • In these cases, how could a timestamp solution for change capture be implemented?
    • You would have to add a new column in every system table and write triggers to update the new column each time an operation modifies the row data.
    • Note that trigger-based CDC mechanisms can affect performance of the source system.
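As a minimal sketch of that approach (the table and column names are illustrative assumptions), the trigger simply stamps each inserted or updated row, and the extraction then filters on the stamp:

```sql
-- Illustrative only: ORDERS and LAST_MODIFIED are assumed names.
ALTER TABLE orders ADD (last_modified DATE);

CREATE OR REPLACE TRIGGER orders_track_changes
  BEFORE INSERT OR UPDATE ON orders
  FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;  -- stamp every modified row
END;
/

-- An incremental extraction then reads only rows changed since the last run:
SELECT * FROM orders WHERE last_modified > :last_extraction_time;
```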
(Q) What Oracle feature uses an internalized trigger-based mechanism to keep change capture?
  • Materialized view logs use an internalized trigger-based mechanism.
  • These logs are used by Materialized views to identify changed data.
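A minimal example of creating such a log (the table name is an assumption); once created, Oracle maintains it automatically for every DML operation on the base table:

```sql
-- Illustrative: record the primary key of every changed row in ORDERS
CREATE MATERIALIZED VIEW LOG ON orders
  WITH PRIMARY KEY
  INCLUDING NEW VALUES;
```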
(Q) What are the alternatives available for extracting data into files if the source system is an Oracle database?
  • You can use SQL*Plus,
  • Use OCI or a Pro*C program,
  • use the Export utility,
  • use External tables.
(Q) You want to extract data using SQL*Plus from a single partitioned table. How can you parallelize this process?
If the table is range partitioned, you can initiate one SQL*Plus session for each partition:

SPOOL order_jan.dat
Select * from orders PARTITION (orders_jan1998);
SPOOL OFF
(Q) You want to extract data using SQL*Plus from a single non-partitioned table. What two criteria can you use to parallelize this process?
Logical or physical criteria.
  • Logical method: based on logical ranges of column values.
Select … where order_date 
between to_date('01-JAN-99') and to_date('31-JAN-99'); 
  • Physical method: based on ranges of rowids derived from the table's physical storage.
  • Check the data dictionary to find the set of data blocks that make up the table.
  • From here, derive a set of rowid-range queries.
Select … where rowid between :v1 and :v2;
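One way to derive such rowid ranges (a sketch assuming access to the DBA_EXTENTS and DBA_OBJECTS views; the owner and table name are illustrative) is to build a start and end rowid per extent with DBMS_ROWID:

```sql
-- Illustrative sketch: one rowid range per extent of the ORDERS table.
SELECT DBMS_ROWID.ROWID_CREATE(1, o.data_object_id, e.relative_fno,
                               e.block_id, 0)                    AS start_rowid,
       DBMS_ROWID.ROWID_CREATE(1, o.data_object_id, e.relative_fno,
                               e.block_id + e.blocks - 1, 32767) AS end_rowid
  FROM dba_extents e
  JOIN dba_objects o
    ON o.owner = e.owner
   AND o.object_name = e.segment_name
 WHERE e.owner = 'SH'
   AND e.segment_name = 'ORDERS';
```

Each resulting (start_rowid, end_rowid) pair can then drive one of the parallel rowid-range extraction sessions.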
(Q) What are the differences between extracting data using data pump export utility and data pump external table mechanism?
  • Export utility:
    • Metadata + data are exported. Output must be processed with the Import utility
  • External table data pump utility:
    • Results of a SELECT can be stored in an Oracle-internal data pump format and processed as a regular external table.
    • Metadata is NOT exported. Results can be stored in multiple extraction files.
    • If metadata is required, you need to use the DBMS_METADATA package:

SET LONG 2000
SELECT DBMS_METADATA.GET_DDL('TABLE', 'EXTRACT_CUST') FROM DUAL;
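The unload step itself can be sketched as follows (the directory object, file names, and source table are assumptions): a CREATE TABLE ... ORGANIZATION EXTERNAL statement with the ORACLE_DATAPUMP driver stores the result of a SELECT in data pump files:

```sql
-- Illustrative: DEF_DIR and the source table CUSTOMERS are assumed to exist.
CREATE TABLE extract_cust
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY def_dir
   LOCATION ('extract_cust1.exp', 'extract_cust2.exp'))
  PARALLEL 2
  AS SELECT * FROM customers;
```

With two location files and PARALLEL 2, the results are spread across multiple extraction files, as noted above.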
(Q) What is an important scalability limitation of extraction through distributed operations?
  • The data is transported through a single Oracle Net connection.
  • For large amounts of data, file-based data extraction and transportation techniques are often more scalable.
