Wednesday 20 March 2013

Oracle SQL Queries


  1. To fetch ALTERNATE records from a table (EVEN numbered).
    select * from emp where rowid in (select decode(mod(rownum,2),0,rowid,null) from emp);
  2. To select ALTERNATE records from a table (ODD numbered).
    select * from emp where rowid in (select decode(mod(rownum,2),0,null,rowid) from emp);
  3. Find the 3rd MAX salary in the emp table.
    select distinct sal from emp e1 where 3 = (select count(distinct sal) from emp e2 where e1.sal <= e2.sal);
  4. Find the 3rd MIN salary in the emp table.
    select distinct sal from emp e1 where 3 = (select count(distinct sal) from emp e2 where e1.sal >= e2.sal);
  5. Select FIRST n records from a table.
    select * from emp where rownum <= &n;
  6. Select LAST n records from a table.
    select * from emp minus select * from emp where rownum <= (select count(*) - &n from emp);
  7. List dept no. and dept name for all the departments in which there are no employees.
    select * from dept where deptno not in (select deptno from emp);
    alternate solution: select * from dept a where not exists (select * from emp b where a.deptno = b.deptno);
    alternate solution: select empno, ename, b.deptno, dname from emp a, dept b where a.deptno(+) = b.deptno and empno is null;
  8. How to get the 3 MAX salaries?
    select distinct sal from emp a where 3 >= (select count(distinct sal) from emp b where a.sal <= b.sal) order by a.sal desc;
  9. How to get the 3 MIN salaries?
    select distinct sal from emp a where 3 >= (select count(distinct sal) from emp b where a.sal >= b.sal);
  10. How to get the nth MAX salary?
    select distinct sal from emp a where &n = (select count(distinct sal) from emp b where a.sal <= b.sal);
  11. Select DISTINCT records from the emp table.
    select * from emp a where rowid = (select max(rowid) from emp b where a.empno = b.empno);
  12. How to delete duplicate rows in a table?
    delete from emp a where rowid != (select max(rowid) from emp b where a.empno = b.empno);
  13. Count of employees department-wise.
    select count(empno), b.deptno, dname from emp a, dept b where a.deptno(+) = b.deptno group by b.deptno, dname;
  14.  Suppose there is annual salary information provided by emp table. How to fetch monthly salary of each and every employee?
    select ename,sal/12 as monthlysal from emp;
  15. Select all records from emp table where deptno = 10 or 40.
    select * from emp where deptno=10 or deptno=40;
  16. Select all record from emp table where deptno=30 and sal>1500.
    select * from emp where deptno=30 and sal>1500;
  17. Select all records from emp where job is not SALESMAN or CLERK.
    select * from emp where job not in ('SALESMAN','CLERK');
  18. Select all records from emp where ename is in 'BLAKE', 'SCOTT', 'KING' and 'FORD'.
    select * from emp where ename in ('BLAKE','SCOTT','KING','FORD');
  19. Select all records where ename starts with 'S' and its length is 6 characters.
    select * from emp where ename like 'S_____';
  20. Select all records where ename may be any number of characters but must end with 'R'.
    select * from emp where ename like '%R';
  21. Count the (non-null) MGR and SAL values in the emp table.
    select count(MGR), count(sal) from emp;
  22. In emp table, add comm and sal as total salary.
    select ename, (sal + nvl(comm,0)) as totalsal from emp;
  23. Select salaries greater than ANY salary below 3000 from the emp table (ANY operator).
    select * from emp where sal > any (select sal from emp where sal < 3000);
  24. Select salaries greater than ALL salaries below 3000 from the emp table (ALL operator).
    select * from emp where sal > all (select sal from emp where sal < 3000);
  25. Select all employees ordered by deptno and by sal in descending order.
    select ename, deptno, sal from emp order by deptno, sal desc;
  26. How can I create an empty table emp1 with same structure as emp?
    Create table emp1 as select * from emp where 1=2;
  27. How to retrieve records where sal is between 1000 and 2000?
    select * from emp where sal between 1000 and 2000;
  28. Select all records where the deptno of both the emp and dept tables matches.
    select * from emp where exists (select * from dept where emp.deptno = dept.deptno);
  29. If there are two tables emp and emp1, and both have common records, how can I fetch all the records but the common records only once?
    (select * from emp) union (select * from emp1);
  30. How to fetch only the common records from the two tables emp and emp1?
    (select * from emp) intersect (select * from emp1);
  31. How can I retrieve all records of emp that are not present in emp1?
    (select * from emp) minus (select * from emp1);
  32. Count the total salary deptno-wise where more than 2 employees exist.
    select deptno, sum(sal) as totalsal
    from emp
    group by deptno
    having count(empno) > 2;
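
Many of the top-N and duplicate-handling queries above can also be written with Oracle analytic functions, which are usually easier to read and cheaper on large tables. The following is a minimal sketch against the standard EMP table, offered as an alternative formulation rather than a replacement for the answers above.

    -- nth highest salary (here n = 3) using DENSE_RANK instead of a correlated subquery
    select distinct sal
    from (select sal, dense_rank() over (order by sal desc) as rnk from emp)
    where rnk = 3;

    -- Duplicate removal keeping one row per EMPNO, equivalent to query 12 above
    delete from emp
    where rowid in (select rid
                    from (select rowid as rid,
                                 row_number() over (partition by empno order by rowid) as rn
                          from emp)
                    where rn > 1);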

Friday 12 October 2012

DataStage Data Integration with Salesforce


Salesforce is a leading on-demand customer relationship management (CRM) system. It offers a complete hosted solution for companies to manage their customer contacts, track sales orders, and streamline their sales processes. It also allows customers to tailor their software as a service, therefore gaining a competitive advantage.
IBM Information Server is a unified and comprehensive information integration platform. It profiles, cleanses, and transforms data from heterogeneous data sources to deliver consistent and accurate business data. IBM Information Server is an ideal solution to integrate and synchronize Salesforce CRM data with the business data stored within back end enterprise resource planning (ERP) systems and other data analytical applications.
This article illustrates the main features of the new certified IBM Salesforce solution for Information Server. The article also contains examples to demonstrate the daily business integration scenarios. These examples provide step-by-step guides that consultants can use to solve their Salesforce data integration cases.
IBM Information Server includes many software products for data integration and analysis tasks. Those products include Information Server DataStage®, Information Server QualityStage™, Information Server Information Analyzer, Information Server Federation Server, and other companion products. Depending on the specific project requirements, users can choose to install a subset of the products in the IBM Information Server.

Figure 1. Software products needed for designing ETL jobs for Salesforce.com
Figure 1 shows the minimum set of IBM Information Server products needed to design ETL jobs for the Salesforce.com CRM.
  • IBM Information Server DataStage, which includes:
    • DataStage Client
    • DataStage Server
    • DataStage Metadata Repository
    • DataStage Domain Server
    Information Server DataStage products can be installed separately on different hosts or installed on the same host.
  • Information Server Pack for Salesforce.com (IS Salesforce Pack)
    The IS Salesforce Pack is a companion product of the IBM Information Server. The pack has been developed to connect to Salesforce.com through Web services. The GUIs of the IS Salesforce Pack are installed on the DataStage Client. The runtime part of the pack is installed on the DataStage Server.
  • Apache Axis
    Apache Axis is an external component that is used to manage the transport layer of the IS Salesforce Pack. It contains libraries for the HTTP/HTTPS protocols, as shown in Figure 1a.

    Figure 1a. Apache Axis as external component

Each software component in Figure 1 plays a different role in designing and executing the ETL jobs for Salesforce.com:
  • The Information Server DataStage Client and IS Salesforce Pack GUI components provide a smart UI that allows users to design ETL jobs, import and export metadata, and set up data operations to be performed on Salesforce.com.
  • The Information Server DataStage Server and IS Salesforce Pack server components enable users to schedule and run the ETL jobs.
  • The Information Server Domain Server manages Information Server user accounts and authorization.
  • The Information Server Metadata Repository is a single repository database for storing and sharing metadata definitions.
The Information Server Salesforce Pack includes three major components:
  • Salesforce metadata browser for browsing and importing the definitions of the Salesforce objects from Salesforce.com to DataStage.
  • Salesforce Load Stage for loading data from non-Salesforce data sources to Salesforce.com. It is a Salesforce certified loading solution.
  • Salesforce Extract Stage for extracting data from Salesforce.com to non-Salesforce data sources. It is a Salesforce certified extraction solution. Both Salesforce load and extract stages depend on the Salesforce.com Web service APIs.
The pack has a smart metadata browser. It supports browsing both Salesforce objects and custom objects created by Salesforce users. It can import the selected object definitions from Salesforce back to IBM Information Server for future use. The metadata browser displays only the objects and fields that are applicable to the user-selected data operations. Figure 2 shows the metadata browser for the query operation. The metadata browser shows only the Salesforce objects on which users can perform query operations.

Figure 2. Metadata browser and query table objects in end-user’s Salesforce account
In Figure 3, the metadata browser displays the Salesforce objects that allow users to update their data or insert new data.

Figure 3. Metadata browser and Upsertable objects in end-user’s Salesforce account
As shown in Figure 2 and Figure 3, different sets of objects, including custom objects, are displayed based on the user-selected operation. This feature eliminates the guessing game for the end user about which operations are associated with which objects, and vice versa.
The metadata browser shows not only the Salesforce objects but also the field attributes related to those objects. It further simplifies the end user’s task at the DataStage job design phase. In Figure 4, the Opportunity object is selected with UPSERT operation. Only the fields in the object that are qualified for the UPSERT operation are displayed.

Figure 4. Opportunity fields with Upsert operation
In Figure 5, the same Opportunity object is selected with the Query operation. Compared to Figure 4, a different set of fields is shown for the Opportunity object. Only the fields in the object that are qualified for the Query operation are visible.

Figure 5. Opportunity fields with Query operation
The metadata browser imports the Salesforce object and field definitions into DataStage. You first select the Salesforce object of interest and its data fields, as shown in Figure 6. Then you click Import to convert the selected Salesforce object definitions into DataStage table definitions. The created table definition is shown in Figure 7. You can also save the table definitions to the DataStage repository for future use.

Figure 6. Importing Salesforce metadata

Figure 7. Saving the metadata definitions in DataStage

The Salesforce Extract Stage extracts data from Salesforce.com based on the user’s Salesforce.com credentials. The extracted data can be loaded into other data sources, such as Oracle®, DB2, SAP®, or just flat files. The Extract Stage depends on the Salesforce.com Web service API and Salesforce Object Query Language (SOQL).
In this section, a simple ETL job is designed to illustrate the design steps and the extract functions. Figure 8 shows the sample job. This sample job replicates the Salesforce Account object into the Oracle database. You can manually create the Oracle Account table or it can be automatically created by the DataStage Open Database Connectivity (ODBC) connector at the job run time.
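
If you choose to create the Oracle Account table manually rather than letting the ODBC connector create it at run time, a minimal sketch might look like the following. The column list is illustrative only and assumes a handful of standard Salesforce Account fields; match it to the fields you actually import.

    -- Hypothetical target table for the replicated Salesforce Account object
    create table account (
        id               varchar2(18) primary key,   -- Salesforce record Id
        name             varchar2(255) not null,
        industry         varchar2(80),
        billingcity      varchar2(40),
        lastmodifieddate timestamp
    );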

Figure 8. DataStage job for extracting data from Salesforce.com to Oracle 9g with reject

Figure 9. Table Account in Oracle database
To set up the Salesforce extraction job, you open up the stage GUI first and select the operation type. Four operation types are defined: Query, QueryAll, Get Deleted Delta, and Get Updated Delta. Figure 10 shows the selection of the Query operation.

Figure 10. Select the Query operation for extraction
After the operation is selected, you can either launch the metadata browser to select the business object that you want to query on, or manually enter the object name and SOQL statement. The metadata browser generates the SOQL statement as a reference. You can use the generated SOQL statement as is, or manually modify it. Figures 11 and 12 show the generated SOQL statements.
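
For reference, a generated SOQL statement is typically just a field list over the selected object, optionally restricted by a WHERE clause. The statement below is only an illustrative example of that shape, assuming a few standard Account fields; the actual statement produced by the metadata browser depends on the object and fields you select.

    SELECT Id, Name, Industry, BillingCity
    FROM Account
    WHERE LastModifiedDate > 2012-01-01T00:00:00Z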

Figure 11. Select the object and fields using the metadata browser

Figure 12. SOQL statement is automatically generated
Figure 13 shows the setup of the ODBC connector to insert the extracted Salesforce data into the Oracle database.

Figure 13. Set up a connection to the Oracle database
Figure 14 shows the job run results.

Figure 14. Schedule and run the extraction job

This section uses a simple ETL job to illustrate the steps necessary for loading data into Salesforce.com. Figure 15 shows the sample job. The job extracts customer data from an Oracle database table using an ODBC Stage. It passes the extracted data to the Salesforce Load Stage (Salesforce_Contact), which loads the processed data into the contact object in the user’s Salesforce account. Figure 16 shows the sample data in the Oracle database table.

Figure 15. Loading data from the Oracle database to the Salesforce contact

Figure 16. Sample data in an Oracle database table
You need to map the source from the Oracle database to the Salesforce contact object. You can do so after you have imported all the metadata from both the Oracle database and Salesforce.com.
To set up the Salesforce Load Stage Salesforce_Contact, you choose the Upsert operation, as shown in Figure 17.

Figure 17. Choose the operation
Then, you import the fields from the Salesforce object contact through the metadata browser. Figure 18 shows the imported result.

Figure 18. Imported Salesforce contact fields
Figure 19 shows the setup of the ODBC connector to extract the contact information from the Oracle database.

Figure 19. Extracting data from Oracle
In Figure 20, the data fields from the Oracle database are mapped to the fields in the Salesforce contact object.

Figure 20. Mapping between an Oracle database and Salesforce contact
In Figure 20, the REPID field from the Oracle database is mapped to the Salesforce OwnerId field. REPID from Oracle has a length of only four, while the Salesforce OwnerId requires a length of 18. This mapping generates errors that cause the Salesforce Load Stage to reject the records. Figure 21 shows that all the records to be loaded into Salesforce are rejected and sent to the reject link.

Figure 21. Loading failed with rejection
As shown in Figure 22, the output file for the rejection link records the rejection cause for each rejected record. Users can use this rejection file to correct the errors and then reload the records into Salesforce.

Figure 22. Rejection cause
Parallel loading into Salesforce enables the end user to speed up the loading process using Web service calls over the Internet. Typically, loading half a million rows of data into Salesforce through sequential loading over a single connection can take as long as two to four hours. This kind of performance is rarely acceptable in the real world of data warehousing. With the built-in DataStage parallel capability, you can set up multiple parallel load node configuration files, depending on end users' needs, and then use the proper configuration file at run time without changing any job designs. This process not only simplifies the operator's task of designing jobs, but also speeds up the loading process. With multiple nodes, the data is automatically partitioned by the Salesforce Pack and loaded through multiple connections to Salesforce. A typical two-hour loading job can be done in 30 minutes. This kind of performance can improve the customer's data warehousing process and increase their productivity.
The quality stages within DataStage can cleanse the data and remove erroneous records. In the sample job shown in Figure 23, a quality stage is used to remove duplicate records with the same external IDs before the records are updated or inserted into Salesforce.

Figure 23. Remove duplicated records
Figure 24 defines the column to be used as the key for checking duplicate records. In this case, the ExternalID__C column is selected.

Figure 24. ExternalID__C as the key defines duplicated records
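
If you need to verify or reproduce this de-duplication outside of QualityStage, the same rule can be expressed in SQL against the staging data. The sketch below is an assumption-laden illustration: it presumes a staging table named CONTACT_STG with an EXTERNALID__C column and keeps one arbitrary row per external ID.

    -- Keep a single row per ExternalID__C before loading to Salesforce
    select *
    from (select s.*,
                 row_number() over (partition by externalid__c order by rowid) as rn
          from contact_stg s)
    where rn = 1;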

Thursday 11 October 2012

Data Warehouse Testing Best Practices


Abstract
One of the greatest risks any company implementing a business intelligence system can take is rushing a data warehouse into service without testing it effectively. Even wise IT managers, who follow the old Russian proverb "trust, but verify," need to maintain their vigilance, because there are pitfalls in the testing process, too.
Many organizations create test plans and assume that testing is over when every single test condition passes with its expected results. In reality, this is a very difficult bar to meet. Often, some requirements turn out to be unattainable when tested against production information, and some business rules turn out to be false or far more complex than originally thought.
Data warehousing applications keep changing as requirements change. This white paper shares some best practices drawn from our data warehouse testing experience.
“If you torture data sufficiently, it will confess to almost anything.”
1 About Data Warehousing Testing
We know how critical the data in a data warehouse is, since it integrates data from different sources. In the healthcare industry, for example, it helps users answer business questions about physicians, plan performance, and understand market share and geographic variations in clinical practice, health outcomes, and so on. If the data is this sensitive, critical, and vast, we can appreciate how challenging testing it is. This paper is a modest effort to share some of the best practices we learned while doing that work on the ground.
How confident can a company be implementing its data warehouse in the market without actually testing it thoroughly? Organizations gain real confidence only once the data warehouse is verified and validated by an independent group of experts known as data warehouse testers.
Many organizations don't follow the right kind of testing methodology and are never sure how much testing is enough before implementing their data warehouse in the market. In reality, this is a very difficult bar to meet. Often, some requirements are difficult to test, and some business rules turn out to be erroneous, misunderstood, or far more complex than originally thought.
Data warehousing applications keep changing as requirements change. This white paper explains the need to have data warehouse application testing in place and walks through the steps of the testing process, covering the best practices.
2 Need for Data Warehouse testing: Best Practices
As we all know, a data warehouse is the main repository of any organization's historical data. It contains the material and information for management's decision support system.
Most organizations run their businesses on the basis of data collected for strategic decision-making. To gain a competitive edge, an organization should have the ability to review historical trends and monitor real-time functional data.
Most importantly, a bug detected at a later stage of the testing cycle is never welcome, because it can easily lead to very high financial losses for the project. So data warehouse testing that follows best practices is unavoidable if a business is to remain at the top.
3 Data warehousing testing phases
While implementing best practices in our testing, we follow these phases in our data warehouse testing:
1) Business understanding
   a. High-level test approach
   b. Test estimation
   c. Review of the business specification
   d. Attending business specification and technical specification walkthroughs
2) Test plan creation, review, and walkthrough
3) Test case creation, review, and walkthrough
4) Test bed and environment setup
5) Receiving test data files from the developers
6) Test prediction creation and review (setting up the expected results)
7) Test case execution (and regression testing if required)
   a. Comparing the predictions with the actual results by testing the business rules in the test environment
   b. Displaying the comparison result in a separate worksheet
8) Deployment
   a. Validating the business rules in the production environment
When we test, we take sample data from the designed architecture; the test data files are usually provided to the testers by the developers. The test data should cover all the possible scenarios with respect to the requirements when we create the predictions that define the expected test case results. However, if the data provided to us does not cover all the business rules, we resort to data mocking, which is explained in a later section of the paper.
Before we move further, let me share a very interesting example we came across recently while performing our testing.
In May 2008, our data warehouse implemented logic to use a field, Calendar Year/Plan Year, passed by the source. The source represented policy year as 'Y' and calendar year as 'N'; however, the ETL used the opposite notation to populate the Calendar Year and Policy Year details in the target table, because the logic change was not properly communicated. After the project was implemented, it was discovered that, due to the miscommunication between the source system and the business analysts, the change was not recognized and millions of rows were populated with incorrect data:
The expected data was Policy year = Y and Calendar year = N.
The actual data was Policy year = N and Calendar year = Y.
Because of the wrong notation, the data behaved oddly in reporting, which led to a huge loss for the project. Effective communication, as a best practice, could have helped prevent this situation.
4 What is ETL?
ETL stands for extract, transform, and load. It can consolidate the scattered data of any organization that works with different departments, and it handles data coming from those different departments well.
For example, a health insurance organization might have information on a customer in
several departments and each department might have that customer's information listed in
a different way. The membership department might list the customer by name, whereas
the claims department might list the customer by number. ETL can bundle all this data
and consolidate it into a uniform presentation, such as for storing in a database or data
warehouse.
ETL can transform not only data from different departments but also data from entirely different sources. For example, an organization may run its business on different environments, such as SAP and Oracle Apps. If higher management wants to make decisions about the business, they need the data integrated and available for reporting. ETL can take the data from these two source systems, integrate it into a single format, and load it into the target tables, as the sketch below illustrates.
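As a concrete illustration of this consolidation idea, the following SQL sketch merges customer records from two hypothetical departmental tables into one uniform layout. The table and column names (MEMBERSHIP_CUSTOMER, CLAIMS_CUSTOMER, CUSTOMER_DW) are assumptions made for the example, not part of any real schema.

    -- Consolidate customers listed by name and by number into one target layout
    insert into customer_dw (source_system, customer_key, customer_name)
    select 'MEMBERSHIP', to_char(member_id), member_name
    from   membership_customer
    union all
    select 'CLAIMS', to_char(claim_cust_no), cust_name
    from   claims_customer;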
5 How is data warehouse testing different from normal testing?
Generally the normal testing steps are:
• Requirements Analysis
• Testing Methodologies
• Test Plans and approach
• Test Cases
• Test Execution
• Verification and Validation
• Reviews and Walkthroughs
The main difference in testing a data warehouse (DW) is that we essentially embed SQL queries in our test case documents. It is vital to test both the initial loads of the data warehouse from the source, i.e., the extraction step, and the subsequent updates to the target tables, i.e., the loading step. In specific cases where troubleshooting is required, we verify intermediate steps as well.
Detecting a defect or bug pays off only if it is detected early and fixed at the right time, before it leads to high cost. To achieve this, it is very important to set some basic testing rules. They are:
• No Data losses
• Correct transformation rules
• Data validation
• Regression Testing
• One-shot/retrospective testing
• Prospective testing
• View testing
• Sampling
• Post implementation
We now discuss each of these with reference to the practices and strategies we implement in our current project.
6 No Data losses
• We verify that all expected data gets loaded into the data warehouse. This includes validating that all records, all fields, and the full contents of each field are loaded without any truncation occurring at any step in the process (a reconciliation sketch follows below).
• As and when required, negative scenarios are also validated; some examples are validating special characters, etc.
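A minimal reconciliation sketch for the no-data-loss check is shown below. It assumes a source table SRC_ORDERS and a warehouse table DW_ORDERS with a comparable ORDER_DESC column; the names are illustrative only and would be replaced by your own source and target objects.

    -- Row-count reconciliation between source and target
    select (select count(*) from src_orders) as src_rows,
           (select count(*) from dw_orders)  as dw_rows
    from dual;

    -- Truncation check on a text field that should survive the load intact
    select count(*) as truncated_rows
    from   src_orders s
    join   dw_orders  d on d.order_id = s.order_id
    where  length(d.order_desc) < length(s.order_desc);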
7 Correct transformation Rules
• We ensure that all data is transformed correctly according to the business rules, whether that is a straight move, a simple transformation, or a complex transformation.
• A practical method is to pick some sample records, use the EXACT formula in an Excel worksheet, and compare the results to validate the data transformations manually. Ideally, this should be done once the testers are absolutely clear about the transformation rules in the business specification. A set-based alternative is sketched after this list.
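Besides the Excel-based spot check, a set-based comparison catches transformation defects across the full data set. The sketch below assumes a rule of TOTAL_SAL = SAL + NVL(COMM, 0) applied when loading EMP into a hypothetical target table EMP_DW; substitute your own rule and table names.

    -- Rows where the loaded value differs from the expected transformation
    select e.empno,
           e.sal + nvl(e.comm, 0) as expected_total_sal,
           d.total_sal            as actual_total_sal
    from   emp    e
    join   emp_dw d on d.empno = e.empno
    where  d.total_sal <> e.sal + nvl(e.comm, 0);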
8 Data Validation
• We say we have achieved quality when we successfully fulfill the customer's requirements; in other words, we deliver value for our customer. Since test execution in data warehouse testing revolves around the data, it is important to achieve a degree of excellence for the data. For that, we validate both the data extracted from the source and the data loaded into the target tables (see the column-level checks sketched below).
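Column-level validation can be scripted in the same way. The checks below are a hedged sketch against a hypothetical target table DW_MEMBER; they illustrate typical null, domain, and range checks rather than any project-specific rules.

    -- Mandatory field should never be null after the load
    select count(*) as null_member_ids from dw_member where member_id is null;

    -- Domain check: gender restricted to an agreed code set
    select distinct gender from dw_member where gender not in ('M', 'F', 'U');

    -- Range check: enrollment dates must not be in the future
    select count(*) as future_enrollments from dw_member where enroll_date > sysdate;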

ETL Process Definitions and Deliverables




  • 1.0 Define Requirements – In this process you should understand the business needs by gathering information from the user.  You should understand the data needed and whether it is available.  Resources should be identified for information or help with the process.
    • Deliverables
      • A logical description of how you will extract, transform, and load the data.
      • Sign-off of the customer(s).
    • Standards
      • Document ETL business requirements specification using either the ETL Business Requirements Specification Template, your own team-specific business requirements template or system, or Oracle Designer.
    • Templates
      • ETL Business Requirements Specification Template
  • 2.0 Create Physical Design – In this process you should define your inputs and outputs by documenting record layouts.  You should also identify and define your location of source and target, file/table sizing information, volume information, and how the data will be transformed. 
    • Deliverables
      • Input and output record layouts
      • Location of source and target
      • File/table sizing information
      • File/table volume information
      • Documentation on how the data will be transformed, if at all
    • Standards
      • Complete ETL Business Requirements Specification using one of the methods documented in the previous steps.
      • Start ETL Mapping Specification
    • Templates
      • ETL Business Requirements Specification Template
      • ETL Mapping Specification Template
  • 3.0 Design Test Plan – Understand what the data combinations are and define what results are expected.  Remember to include error checks.  Decide how many test cases need to be built.  Look at technical risk and include security.  Test business requirements.
    • Deliverables
      • ETL Test Plan
      • ETL Performance Test Plan
    • Standards
      • Document ETL test plan and performance plan using either the standard templates listed below or your own team-specific template(s).
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan Template
  • 4.0 Create ETL Process – Start creating the actual Informatica ETL process.  The developer is actually doing some testing in this process.
    • Deliverables
      • Mapping Specification
      • Mapping
      • Workflow
      • Session
    • Standards
      • Start the ETL Object Migration Form
      • Start Database Object Migration Form (if applicable)
      • Complete ETL Mapping Specification
      • Complete cleanup process for log and bad files – Refer to Standard_ETL_File_Cleanup.doc
      • Follow Informatica Naming Standards
    • Templates
      • ETL Object Migration Form
      • ETL Mapping Specification Template
      • Database Object Migration Form (if applicable)
  • 5.0 Test Process – The developer does the following types of tests: unit, volume, and performance.
    • Deliverables
      • ETL Test Plan
      • ETL Performance Test Plan
    • Standards
      • Complete ETL Test Plan
      • Complete ETL Performance Test Plan
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan
  • 6.0 Walkthrough ETL Process – Within the walkthrough the following factors should be addressed:  Identify common modules (reusable objects), efficiency of the ETL code, the business logic, accuracy, and standardization.
    • Deliverables
      • ETL process that has been reviewed
    • Standards
      • Conduct ETL Process Walkthrough
    • Templates
      • ETL Mapping Walkthrough Checklist Template
  • 7.0 Coordinate Move to QA – The developer works with the ETL Administrator to organize ETL Process move to QA.
    • Deliverables
      • ETL process moved to QA
    • Standards
      • Complete ETL Object Migration Form
      • Complete Unix Job Setup Request Form
      • Complete Database Object Migration Form (if applicable)
    • Templates
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form
  • 8.0 Test Process – At this point, the developer once again tests the process after it has been moved to QA.
    • Deliverables
      • Tested ETL process
    • Standards
      • Developer validates ETL Test Plan and ETL Performance Test Plan
    • Templates
      • ETL Test Plan Template
      • ETL Performance Test Plan Template
  • 9.0 User Validates Data – The user validates the data and makes sure it satisfies the business requirements.
    • Deliverables
      • Validated ETL process
    • Standards
      • Validate Business Requirement Specifications with the data
    • Templates
      • ETL Business Requirement Specifications Template
  • 10.0 Coordinate Move to Production - The developer works with the ETL Administrator to organize ETL Process move to Production.
    • Deliverables
      • Accurate and efficient ETL process moved to production
    • Standards
      • Complete ETL Object Migration Form
      • Complete Unix Job Setup Request Form
      • Complete Database Object Migration Form (if applicable)
    • Templates
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form (if applicable)
  • 11.0 Maintain ETL Process – There are a couple of situations to consider when maintaining an ETL process.  There is maintenance when an ETL process breaks, and there is maintenance when an ETL process needs to be updated.
    • Deliverables
      • Accurate and efficient ETL process in production
    • Standards
      • Updated Business Requirements Specification (if needed)
      • Updated Mapping Specification (if needed)
      • Revised mapping in appropriate folder
      • Updated ETL Object Migration Form
      • Developer checks final results in production
      • All monitoring (finding problems) of the ETL process is the responsibility of the project team
    • Templates
      • Business Requirements Specification Template
      • Mapping Specification Template
      • ETL Object Migration Form
      • Unix Job Setup Request Form
      • Database Object Migration Form (if applicable)

Informatica CHANGE DATA CAPTURE


Informatica CDC for Real Time Data Capture

Introduction

Traditional Data Warehousing and BI systems rely on batch data extraction and load routines (ETL or ELT) to acquire their data from sources.  Depending on data timeliness needs, this may work for intervals down to perhaps one hour for small data sets.  However as data timeliness needs go below what is reasonable with a batch process, a complete shift in technology is required.  There are tools out there that acquire data in a “real-time” manner directly from the database log files and send individual transactions out over the wire to a target database.
This post is about Informatica’s CDC product, but the lessons and the manner in which it works are similar for another popular product, Oracle GoldenGate.  Note that the name Change Data Capture is not ideal; this is really more of a real-time solution, as any DW has to implement change data capture as part of its data acquisition approach.

Architecture

CDC works by installing essentially a log reader application called the PowerExchange Listener on each source database server.  As each database/OS is different (Oracle, DB2 Unix, DB2 AS/400, SQL Server, etc.), the particulars of each software installation are slightly different.
The PowerExchange Listener software will continually scan database log files and pull data records out in a sequential manner.  The PowerExchange Listener then links directly to the PowerExchange Client that is part of the PowerCenter main ETL hub.
The log files that it reads capture not the SQL statement, but rather the resulting record.  This is an important distinction; an update statement that operates on 1 million records will generate 1 million transactions over the CDC wire.  It’s also important to note that CDC keeps track of where it is in the sequential, time-based log files; this is ultimately used for startup/shutdown and re-load operations.  Sequential also means that if a parent record is created before a child record, you are guaranteed to receive and process the parent before the child record, helping you mirror whatever referential integrity (RI) the source may or may not have.
Configuration of CDC involves registering which tables you wish to “listen” to in PowerExchange Navigator client tool and publishing them to the PowerExchange Listener on the source database.  This is a simple operation.
After data records flow through PowerExchange Listener on the Source database, they are sent across the network into PowerCenter.  From there, data is sent to its target in a similar manner as regular Informatica ETL.  This is where traditional Informatica development picks up.  You develop workflows and mappings using the traditional INFA PowerCenter client tools (i.e. designer, workflow manager).  When workflows are run, they are expected to run continually – remember this is not batch.
Note that using Oracle database as a source is technically much more difficult than using SQL Server or DB2; there are extra pieces of software needed to make it work.  For SQL Server, CDC uses its replication server and not the actual real source.  For DB2, CDC uses journals with concurrent processes for each journal.  Note there is a configuration option to send a record after a certain time period or number of UOWs (i.e., commits).  This might be useful, for example, if a system happens to have many small updates to a single record.  The coalesce option will send that record, say, every minute as opposed to the 25 individual transactions on it during that period.

Development

One of the main benefits of using CDC over a competing technology is the familiarity Informatica developers have with it.  To a developer, the development tools are identical; it’s the infrastructure side where the differences come into play.  Mappings are built in the regular PowerCenter Designer; tables are imported from a PowerExchange data source instead of a regular database source.
For a given source, typically you’ll have one mapping and one workflow.  As each source has multiple tables in it that you’ll be listening to, you’ll need parallel “pipelines” in the mapping.  If you need 20 tables, you have 20 pipelines in the same mapping, each with a different source and its associated target.  Informatica will process each record in its associated pipeline, and remember that the records will arrive in the same order as they occurred in the source.
As you are using regular PowerCenter Designer, the same kinds of functionality are available to you for your transformations.   There are three main differences in development approach:
1. You are always operating on 1 record at a time
2. Lookup caching doesn’t make sense anymore (you can’t cache real-time data)
3. There is no debugging facility
CDC will tag each record coming in with some metadata about the transaction:
• Log record time (i.e., the time when the log entry was created)
• Journal/LogFile
• Operation: Insert/Update/Delete
With this information you can determine what to do with each transaction type.  A large portion of your design will focus on what to do for an Update or a Delete operation.  For example, we elected to soft-delete our records by updating a DELETE_FLG to 'Y' for each Delete operation, as sketched below.
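As an illustration of that soft-delete handling, the statement below marks a record deleted instead of physically removing it. It assumes a target table CONTACT_DIM keyed by CONTACT_ID with a DELETE_FLG column, and a bind variable carrying the key from the incoming CDC Delete record; your own target and key columns will differ.

    -- Soft-delete: flag the row rather than deleting it when CDC reports a Delete
    update contact_dim
    set    delete_flg  = 'Y',
           last_upd_ts = systimestamp
    where  contact_id  = :cdc_contact_id;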
One thing to note on the initial deployment:  If you wish to bring in historical data, CDC will not be able to help you due to the limited lifetime of log entries in the database.  You will have to develop traditional, bulk/batch ETL jobs for the historical/initial load. This is important to note from a development and QA perspective.  In a section below I’ll discuss how you use Historical and CDC together.

Infrastructure, Deployments and Source Upgrades

Being a completely different technology from traditional Informatica, there is a fair amount of additional infrastructure to setup and tune.  There are numerous options which you will need to play around with to ensure optimal configuration.  Additionally, you will have to prepare for various infrastructure events that you should expect to see.  Some of the items which you’ll need to consider:
• TimeOut window
• Commit points
• Records per transmit
• Automatic restarts
• Various ports
• Stops and restarts
• Recovery
• Source database failures and failovers
• Source database upgrades
• Source database batch operations
It is very important to understand that once you have this table replication set-up, you are now tightly coupled to anything the source system does.  This is particularly true during source upgrades.  Some things to consider:
• Table changes require updated CDC registrations and probably code changes (some changes may be ignored in code)
• Primary key changes are very important and will require target changes and code changes
• Data change volume – for tables with a large number of records changed, you will have to rely on your bulk load instead of CDC
• Are source changes even captured in the logs/journals – if the source team pulls a table out of a logged area, renames the table, or does a Create-Table-As operation, CDC will not know about any of these changes.  When this is done, you will need to rely on your historical load to reload the table.

Our Implementation

I highly recommend the approach that we took when setting up CDC.  Early on in the development cycle we brought in an Informatica CDC consultant for a week to help us with setup and scenario solutions.  Aside from setting the software up with the sysadmins and DBAs, he was able to work on the basic configuration.  However, the bulk of his time was spent helping us determine how we would support our various scenarios on both DB2 and SQL Server.
An example of what we had to figure out was the Initial Load Scenario.  We needed to know how we would actually turn on CDC in conjunction with the historical load.  How do we ensure we don’t miss any transactions?  How do we ensure we don’t overwrite a new record with an old record?  What happens when CDC gets an update, but you haven’t inserted the base historical record yet?
Ultimately, we turned CDC on first then ran the historical load on top of it.  With both of these running at the same time, this is the logic we employed to ensure an accurate version of the table:
Code Module   Incoming TXN   Existing Target Record?   Action
CDC           INSERT         NO                        Insert
CDC           INSERT         YES                       Ignore
CDC           UPDATE         NO                        Insert
CDC           UPDATE         YES                       Update
CDC           DELETE         NO                        Insert, Delete=Y
CDC           DELETE         YES                       Update, Delete=Y
Historical    INSERT         NO                        Insert
Historical    INSERT         YES                       Ignore
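
The CDC branch of this decision table maps naturally onto a MERGE statement. The sketch below is an assumed SQL rendering for a single incoming record, using a hypothetical CONTACT_DIM target and bind variables for the incoming CDC fields; the Historical branch differs only in ignoring, rather than updating, rows that already exist.

    -- One incoming CDC record applied to the target (Insert/Update/Delete handling)
    merge into contact_dim t
    using (select :cdc_contact_id as contact_id,
                  :cdc_name       as contact_name,
                  :cdc_operation  as op            -- 'INSERT', 'UPDATE', or 'DELETE'
           from dual) s
    on (t.contact_id = s.contact_id)
    when matched then
        update set t.contact_name = s.contact_name,
                   t.delete_flg   = decode(s.op, 'DELETE', 'Y', 'N')
        where s.op in ('UPDATE', 'DELETE')          -- an INSERT for an existing row is ignored
    when not matched then
        insert (contact_id, contact_name, delete_flg)
        values (s.contact_id, s.contact_name, decode(s.op, 'DELETE', 'Y', 'N'));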
As the project progressed, we elected to deploy our CDC solution to Production a few months before the Data Warehouse and OBI layers.  This allowed us to use the Prod environment in a non-critical state to shake things out and tweak performance.  When we did this, we brought back the Informatica consultant to help us tweak settings and get around any deployment issues we might have had.  All in all, this approach worked out very well for us.

Performance

Performance for us involved several measurements.  Of primary concern was the impact on the source systems, which in some cases were already under a very heavy load.  Our findings showed that for DB2 the impact reached a peak of 10% during the initial startup state, when it was playing catch-up going through the log file, but settled down to a 3-5% impact in a steady state.
Another attribute was lag time: the delay we could expect between a change to a record on a source and when it is committed in our database.  This typically is only a few seconds.  However, at night the source systems run several batch updates and deletes on their systems, all of which are consumed by CDC.  Our largest table consists of nearly 500 million rows and typically has 1 million or so deletes each night as part of a batch job on the source application.  We are able to completely consume 1 million deletes in only half an hour while consuming from all other systems simultaneously.