Data Warehouse: Concepts on ETL Processes (Extraction)







(Q) Why can defining and creating the extraction process be a very time-consuming stage of the ETL process?

  • Sources are often multiple and independent. They may be poorly documented, and 3rd-party applications may not be modifiable.
  • Business requirements on the source system may not tolerate the extra overhead of the extraction process, nor changes in its availability.
  • Often you cannot add additional logic to the source systems to support incremental extraction of data.


(Q) What are the two logical extraction methods?

  • Full Extraction and Incremental Extraction
  • Full Extraction:
    • ALL data is extracted from the source system; you don't need to keep track of what has changed since the last extraction.
    • However, it is more costly in space, since the staging area might need to be much larger.
  • Incremental Extraction:
    • Here you need to keep track of all data that has changed since the last extraction, or since some well-defined business event such as the last booking day of a fiscal period (see the sketch after this list).
    • This information may be provided by the source system (e.g., an application timestamp column or a change table). In this case the source system would likely have to be modified to incorporate the change-tracking functionality.
    • Alternatively, Oracle provides a Change Data Capture (CDC) mechanism (with Oracle GoldenGate).
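
A minimal sketch of a timestamp-driven incremental extraction (the LAST_MODIFIED column and the ETL_EXTRACT_LOG bookkeeping table are hypothetical, not part of any standard schema):

SELECT *
FROM   orders
WHERE  last_modified >
       (SELECT MAX(extract_time) FROM etl_extract_log);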


(Q) What two methods of physical extraction are there?

  • Online extraction:
    • Data is extracted directly from the source itself. Extraction process connects directly to the source.
  • Offline extraction:
    • Data is staged explicitly outside the original source system.
    • In this case, the data either has an existing structure (e.g., redo logs, archive logs, or transportable tablespaces) or was created by an extraction routine (e.g., stored in flat files).
    • The alternatives are: flat files, dump files (imp/exp), redo and archived logs (how to read these?), or transportable tablespaces.


(Q) What is the physical extraction technique recommended by Oracle most of the time? Why is it recommended?

  • Oracle recommends the use of transportable tablespaces.
  • (For this, either the source system is an Oracle DB, or an extraction routine was used to copy the data to a "staging" transportable tablespace.) A rough sketch of the flow follows.
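
A rough sketch of the transportable tablespace flow (tablespace ts_extract, directory dw_dir, and the file names are assumptions): make the tablespace read-only, export its metadata with Data Pump, then move the datafiles and dump file to the target system:

SQL> ALTER TABLESPACE ts_extract READ ONLY;

$ expdp system DIRECTORY=dw_dir DUMPFILE=ts_extract.dmp TRANSPORT_TABLESPACES=ts_extract

(On the target, after copying the datafiles and the dump file:)
$ impdp system DIRECTORY=dw_dir DUMPFILE=ts_extract.dmp TRANSPORT_DATAFILES='/u01/app/oracle/oradata/dw/ts_extract01.dbf'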


(Q) What is the key-enabling technology for providing near real-time, or on-time, data warehousing?

  • Change Data Capture mechanism
  • Used for incremental extractions (e.g., nightly extractions, which need to extract only data modified since the last extraction).


(Q) What usually is the most challenging technical issue in data extraction?

  • Devising a mechanism of Change Data Capture.
(Q) What are some techniques for implementing a change capture mechanism on Oracle source databases?
  • You can use timestamps, partitioning, or triggers. Each can be used whether the data is accessed directly through distributed queries or exported to flat files.
  • Timestamp columns are not always available in an operational source system.
  • In these cases, how could a timestamp solution for change capture be implemented?
    • You would have to add a new column to every system table and write triggers to update the new column each time an operation modifies the row data (see the sketch after this list).
    • Note that trigger-based CDC mechanisms can affect performance of the source system.
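
A minimal sketch of such a trigger, reusing the hypothetical LAST_MODIFIED column on a source table ORDERS:

ALTER TABLE orders ADD (last_modified DATE);

CREATE OR REPLACE TRIGGER orders_set_last_modified
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW
BEGIN
  :NEW.last_modified := SYSDATE;  -- stamp every inserted/updated row
END;
/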
(Q) What Oracle feature uses an internalized trigger-based mechanism for change capture?
  • Materialized view logs use an internalized trigger-based mechanism.
  • These logs are used by Materialized views to identify changed data.
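
For example, a materialized view log on a hypothetical ORDERS table is created with a single statement, after which Oracle maintains the change records automatically:

CREATE MATERIALIZED VIEW LOG ON orders WITH PRIMARY KEY, ROWID;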
(Q) What are the alternatives available for extracting data into files if the source system is an Oracle database?
  • You can use SQL*Plus,
  • use OCI or a Pro*C program,
  • use the Export utility, or
  • use external tables.
(Q) You want to extract data using SQL*Plus from a single partitioned table. How can you parallelize this process?
If the table is range partitioned, you can initiate one SQL*Plus session for each partition.
SPOOL order_jan.dat
SELECT * FROM orders PARTITION (orders_jan1998);
SPOOL OFF;
(Q) You want to extract data using SQL*Plus from a single non-partitioned table. What two criteria can you use to parallelize this process?
Logical or physical criteria.
  • Logical method: based on logical ranges of column values.
SELECT … WHERE order_date
BETWEEN TO_DATE('01-JAN-99') AND TO_DATE('31-JAN-99');
  • Physical method: based on ranges of ROWIDs (physical data blocks).
  • Check the data dictionary and find the set of data blocks that make up the table.
  • From here, derive a set of ROWID-range queries (see the sketch below).
SELECT … WHERE rowid BETWEEN v1 AND v2;
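
A hedged sketch of deriving the ROWID bounds from the data dictionary with DBMS_ROWID (the table name ORDERS is an assumption):

-- One (lo, hi) ROWID pair per extent of the table; each pair can feed
-- one parallel SQL*Plus session.
SELECT dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                               e.block_id, 0)                    AS rowid_lo,
       dbms_rowid.rowid_create(1, o.data_object_id, e.relative_fno,
                               e.block_id + e.blocks - 1, 32767) AS rowid_hi
FROM   dba_extents e
JOIN   dba_objects o
  ON   o.object_name = e.segment_name AND o.owner = e.owner
WHERE  e.segment_name = 'ORDERS';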
(Q) What are the differences between extracting data using data pump export utility and data pump external table mechanism?
  • Export utility:
    • Metadata + data are exported. Output must be processed with the Import utility
  • External table data pump utility:
    • The results of a SELECT can be stored in an Oracle-internal data pump format and processed as a regular external table.
    • Metadata is NOT exported. Results can be stored in multiple extraction files.
    • If metadata is required, use the DBMS_METADATA package:
    SET LONG 2000
    SELECT dbms_metadata.get_ddl('TABLE', 'EXTRACT_CUST') FROM dual;
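
For reference, a minimal sketch of the external table data pump mechanism itself (the directory name def_dir and the file names are assumptions):

CREATE TABLE extract_cust
  ORGANIZATION EXTERNAL
  (TYPE ORACLE_DATAPUMP
   DEFAULT DIRECTORY def_dir
   LOCATION ('extract_cust1.exp', 'extract_cust2.exp'))
  PARALLEL 2
  AS SELECT * FROM customers;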
    
(Q) What is an important scalability limitation to extraction through distributed operations?
  • The data is transported through a single Oracle Net connection.
  • For large amounts of data, file-based data extraction and transportation techniques are often more scalable.

Snapshot Standby: Role Transition cycle

Create, test, resynchronize with production, create again




Q&As: Data Guard, Role Transitions (I)



(Q) Which 6 steps must be performed to prepare for a role transition?


(1) Verify that each database is properly configured for the role that it is about to assume

• Check that the necessary initialization parameters are properly configured in each database
  • FAL_SERVER, DB_FILE_NAME_CONVERT, LOG_FILE_NAME_CONVERT
  • LOG_ARCHIVE_DEST_n and LOG_ARCHIVE_DEST_STATE_n
• Check that ONLINE REDO LOGs and STANDBY REDO LOGs exist and are properly configured on all databases
SQL> select status, type, member, group#
  2  from v$logfile;

STATUS  TYPE    MEMBER                                             GROUP#
------- ------- -------------------------------------------------- ------
        ONLINE  /u01/app/oracle/oradata/fresno/redo03.log               3
        ONLINE  /u01/app/oracle/oradata/fresno/redo02.log               2
        ONLINE  /u01/app/oracle/oradata/fresno/redo01.log               1
        STANDBY /u01/app/oracle/oradata/fresno/standby_redo01.log       4
        STANDBY /u01/app/oracle/oradata/fresno/standby_redo02.log       5
        STANDBY /u01/app/oracle/oradata/fresno/standby_redo03.log       6
        STANDBY /u01/app/oracle/oradata/fresno/standby_redo04.log       7

7 rows selected.


(2) Verify that there are no Redo Transport errors or Redo Gaps at the standby database

• For LOG_ARCHIVE_DEST_2, STATUS should be VALID and GAP_STATUS should be NO GAP

(On the physical standby:)
SQL> select dest_name, status, type, gap_status
  2  from v$archive_dest_status;

DEST_NAME            STATUS    TYPE            GAP_STATUS
-------------------- --------- --------------- ----------
LOG_ARCHIVE_DEST_1   VALID     LOCAL
LOG_ARCHIVE_DEST_2   VALID     UNKNOWN         NO GAP
...
STANDBY_ARCHIVE_DEST VALID     UNKNOWN         NO GAP

(On the primary database:)
SQL> select dest_name, status, type, gap_status
  2  from v$archive_dest_status;

DEST_NAME            STATUS    TYPE           GAP_STATUS
-------------------- --------- -------------- ----------
LOG_ARCHIVE_DEST_1   VALID     LOCAL
LOG_ARCHIVE_DEST_2   VALID     PHYSICAL       NO GAP
LOG_ARCHIVE_DEST_3   INACTIVE  LOCAL
    


(3) Verify that temporary files exist on the standby that match the temporary files on the primary database
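
A quick way to compare them is to run the same query on both the primary and the standby (a minimal check):

SQL> select name, bytes from v$tempfile;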



(4) Remove any delay in applying redo that may be in effect on the standby database that will become the new primary database

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE NODELAY DISCONNECT FROM SESSION;


(5) Before performing a switchover from an Oracle RAC primary database to a physical standby, shut down all but one primary DB instance
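
For example, with srvctl (the database and instance names are placeholders):

$ srvctl stop instance -d fresno -i fresno2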



(6) If the physical standby that will become primary on the switchover is in real-time query mode, consider bringing all instances of the standby database to the mounted but not open state. This will allow:

• (a) the fastest possible role transition, and
• (b) cleanly terminating any user sessions connected to the physical standby prior to the role transition
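
On each standby instance, a minimal sketch of bringing the database to the mounted state:

SQL> SHUTDOWN IMMEDIATE
SQL> STARTUP MOUNT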