1)
How can you do incremental load in DataStage?
You can create a table to store the last successful refresh time for each table/dimension. Then, in the source query, take the delta between the last successful refresh time and SYSDATE; that should give you the incremental load.
2)
How can we improve the performance of DataStage jobs?
Performance tuning of DataStage jobs: 1. Establish baselines. 2. Avoid using only one flow for tuning/performance testing. 3. Work in increments. 4. Evaluate data skew. 5. Isolate and solve. 6. Distribute file systems to eliminate bottlenecks. 7. Do not involve the RDBMS in initial testing. 8. Understand and evaluate the available tuning knobs.
3)
There are three types of views in DataStage Director:
a) Job View - dates when jobs were compiled.
b) Status View - status of the job's last run.
c) Log View - warning messages, event messages, program-generated messages.
4)
A hashed file is broadly classified into two types:
a) Static - subdivided into 17 types based on the primary key pattern.
b) Dynamic - subdivided into 2 types:
i) Generic
ii) Specific
The default hashed file is "Dynamic - Type Random 30 D".
5)
Orchestrate vs. DataStage Parallel Extender?
Orchestrate is itself an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate, integrated it with DataStage XE and released a new version, DataStage 6.0, i.e. Parallel Extender.
6)
What are the different types of Type 2 dimension mapping?
Type 2 Dimension/Version Data mapping: in this mapping, an updated dimension row from the source is inserted into the target along with a new version number, and a newly added dimension row in the source is inserted into the target with a new primary key.
Type 2 Dimension/Flag Current mapping: this mapping is also used for slowly changing dimensions. In addition, it creates a flag value for changed or new dimension rows. The flag indicates whether the row is new or newly updated: current rows are saved with a flag value of 1, and updated (historical) rows are saved with a flag value of 0.
Type 2 Dimension/Effective Date Range mapping: this is another flavour of Type 2 mapping used for slowly changing dimensions. It also inserts both new and changed dimension rows into the target, and changes are tracked by an effective date range for each version of each dimension row.
7)
What is the aggregate cache in the Aggregator transformation?
When you run a session that uses an Aggregator transformation, the aggregator stores data in the aggregate cache until it completes the aggregate calculations.
8)
What are Sequencers?
Sequencers are job control programs that execute other jobs with preset Job parameters.
9)
What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run a job?
A. Under Windows: use the 'Wait For File' activity in a sequencer and then run the job. You could also schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer that depends on the file (see the sketch below).
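A minimal shell sketch of the UNIX approach (the file path, project and job names below are placeholders; dsjob's -run and -wait options are standard, but adapt the script to your environment):

#!/bin/sh
# Poll for the arrival of the trigger file, then kick off the job.
FILE=/data/incoming/customer.dat
until [ -f "$FILE" ]; do
    sleep 60        # check once a minute
done
# File has arrived: run the DataStage job and wait for it to finish.
dsjob -run -wait dwh_project LoadCustomers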
10)
How do you rename all of the jobs to support your new File-naming conventions?
Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx file. Write a Perl program which does a simple rename of the strings, looking them up in the Excel file. Then import the new .dsx file, preferably into a new project for testing, and recompile all jobs. Be cautious: the job names may also be referenced in your job control jobs or sequencer jobs, so you have to make the necessary changes to those sequencers.
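A simplified sketch of the rename step, assuming the old/new names have been saved from the spreadsheet into a plain whitespace-separated file (names.txt) and using sed in place of Perl purely for brevity; file and project names are placeholders:

#!/bin/sh
# names.txt holds one "old_name new_name" pair per line (exported from the spreadsheet).
cp project_export.dsx project_renamed.dsx
while read OLD NEW; do
    # Replace every occurrence of the old job name in the export with the new one.
    sed "s/$OLD/$NEW/g" project_renamed.dsx > tmp.dsx && mv tmp.dsx project_renamed.dsx
done < names.txt
# Import project_renamed.dsx into a test project and recompile all jobs.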
11)
What other performance tuning have you done in your last project to increase the performance of slowly running jobs?
1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using hash/sequential files, for optimum performance and also for data recovery in case the job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in the Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the database and reduced the use of DS sorts for better job performance.
6. Removed the data not used from the source as early as possible in the job.
7. Worked with the DBA to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the SDK or DS Utilities categories.
11. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
12. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records before joins are made.
13. Tuning should occur on a job-by-job basis.
14. Use the power of the DBMS.
15. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
16. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
Minimise the usage of the Transformer (use Copy, Modify, Filter or Row Generator instead).
Use SQL code while extracting the data.
Handle the nulls.
Minimise the warnings.
Reduce the number of lookups in a job design.
Use not more than 20 stages in a job.
Use an IPC stage between two passive stages; this reduces processing time.
Drop indexes before loading data and recreate them after loading data into tables.
Generally we cannot avoid lookups if our requirements make them compulsory.
There is no hard limit on the number of stages (like 20 or 30), but we can break a big job into smaller jobs and use Dataset stages to store the intermediate data.
The IPC stage is provided in server jobs, not in parallel jobs.
Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option.
If the hash file is used only for lookup then enable 'Preload to memory'. This will improve the performance. Also, check the order of execution of the routines.
Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
Use the 'Preload to memory' option on the hash file output.
Use 'Write to cache' on the hash file input.
Write into the error tables only after all the Transformer stages.
Reduce the width of the input record - remove the columns that you will not use.
Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files. This will also minimize overflow on the hash file.
If possible, break the input into multiple streams and run multiple instances of the job (see the sketch below).
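A rough sketch of the split-and-run-in-parallel idea, assuming the job (LoadSales in project dwh_project, both placeholder names) has been compiled as a multi-instance job that takes the chunk file name as a parameter; dsjob accepts an invocation ID appended to the job name for multi-instance jobs:

#!/bin/sh
# Split the large input file into chunks of 250,000 lines each.
split -l 250000 sales.dat sales.part.
N=0
for CHUNK in sales.part.*; do
    N=`expr $N + 1`
    # Run one instance of the job per chunk, passing the chunk as a job parameter.
    dsjob -run -param "SourceFile=$CHUNK" dwh_project LoadSales.inst$N &
done
wait    # wait for all instances to finish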
12)
How did you handle an 'Aborted' sequencer?
13)
How do you merge two files in DS?
Either use a copy/concatenate command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
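For the same-metadata case, the before-job subroutine (for example ExecSH) can simply concatenate the files at the operating-system level; a minimal sketch with placeholder file names:

# Concatenate two files with identical layouts into a single input file.
cat file1.dat file2.dat > merged.dat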
14)
How can you do incremental load in DataStage?
Incremental load means the daily (delta) load. Whenever you are selecting data from the source, select the records which were loaded or updated between the timestamp of the last successful load and today's load start date and time. For this you have to pass parameters for those two dates: store the last run date and time in a file, read it through a job parameter, and set the second parameter to the current date and time.
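A minimal sketch of that approach, assuming the last successful run timestamp is kept in a one-line file (last_run.txt) and the job expects two parameters, FromDate and ToDate (all of these names are placeholders):

#!/bin/sh
LAST_RUN=`cat last_run.txt`                 # e.g. 2011-06-30 23:59:59
NOW=`date '+%Y-%m-%d %H:%M:%S'`
# Load the window between the last successful run and now.
dsjob -run -wait -param "FromDate=$LAST_RUN" -param "ToDate=$NOW" dwh_project IncrementalLoad
# Only on success, record the new high-water mark for the next run.
if [ $? -eq 0 ]; then
    echo "$NOW" > last_run.txt
fi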
15)
What is Full load & Incremental or Refresh load?
Full load is the entire data dump that takes place the very first time.
After that, to keep the target data synchronized with the source data, there are two further techniques:
Refresh load - the existing data is truncated and reloaded completely.
Incremental load - only the delta, or difference between target and source data, is loaded at regular intervals. The timestamp of the previous delta load has to be maintained.
16)
What are the Oconv() and Iconv() functions and where are they used?
IConv() - Converts a string to an internal storage format.
OConv() - Converts an expression to an output format.
17)
What is the difference between server jobs and parallel jobs?
Server jobs: these are available if you have installed DataStage Server. They run on the DataStage server, connecting to other data sources as necessary.
Parallel jobs: these are only available if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.
What is overloading of procedures?
--------------------------------------------------------------------------------
Overloaded procedures are two or more procedures with the same name but different arguments. The arguments need to differ by class itself; e.g. CHAR and VARCHAR2 are from the same class, so they do not count as different.
Packages - the main advantages of packages are:
1. Since a package has its specification and body separate, whenever any DDL is run and a procedure/function inside the package depends on the affected object, only the body gets invalidated and not the spec. So any other procedure/function dependent on the package does not get invalidated.
2. Whenever any function/procedure from the package is called, the whole package is loaded into memory, so all objects of the package are available in memory, which means faster execution if any of them is called. Since we put all related procedures/functions in one package, this feature is useful as we may need to run most of the objects.
UNIX FILE OPERATIONS:
ls list files
cp copy files: cp /path/name newname
mv move or rename files: mv name newname
rm remove (i.e. delete) files: rm name
chmod change mode of file permissions: chmod xxx name
cat scroll file contents: cat name
more page file contents (spacebar to continue): more name
less better pager than more? (q to quit): less name
view view file contents (:q to quit): view name
vi visual text editor (:wq to save and quit): vi name
pico pico text editor (Ctrl-X to quit): pico name
DIRECTORY OPERATIONS:
mkdir make (create) new directory: mkdir Name
cd change directory: cd /path/name
cd change to your home directory: cd
rmdir remove directory (if empty): rmdir Name
pwd print working directory (show directory name)
quota check disk space quota: quota -v
SYSTEM OPERATIONS:
df show free disk space
du show disk usage
ps list your processes
kill kill a process: kill ###
passwd change your password
date show date and time
w who is doing what on the system
who who is connected to the system
cal display a calendar
ping ping another computer (is it alive?)
finger get information on users
exit exit, or logout, from the system
COMMUNICATIONS:
write write messages to another user's screen
talk talk split-screen with another user: talk username
pine send or read E-mail with pine mail system
mail UNIX mail system
rtin read news with tin newsreader
telnet connect to another computer via the network
ftp file transfer over the network
Most commands have options, which are described in the online man pages.
- What is the difference between server jobs and parallel jobs?
--- Server jobs generate DataStage BASIC, parallel jobs generate Orchestrate shell script (osh) and C++, and mainframe jobs generate COBOL and JCL.
--- In server and mainframe you tend to do most of the work in Transformer stage. In parallel you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many more stage types for parallel than server or mainframe, and parallel stages correspond to Orchestrate operators.
--- Also there's the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.
- What is a Parallel Processing Environment? What are the types of Parallel Processing Environments?
--- It is the environment in which you run your DataStage jobs. The following are the types of Parallel Processing Environments:
- SMP (Symmetric Multi-Processing): in SMP, multiple processors share the same memory; i.e. the processors communicate via shared memory and have a single operating system.
- MPP (Massively Parallel Processing): in MPP, each processor has exclusive access to its own memory (also known as shared-nothing) and has its own operating system. MPP systems are physically housed in the same box, whereas cluster systems are physically dispersed and communicate via a high-speed network.
- What is Orchestrate?
--- Orchestrate is the old name of the underlying parallel execution engine. Ascential renamed the technology "Parallel Extender". The DataStage PX GUI generates OSH (Orchestrate Shell) scripts for the jobs you run.
An OSH script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form, it is: op < in.ds > out.ds,
where op is an Orchestrate operator, in.ds is the input dataset and out.ds is the output dataset.
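As a concrete illustration (a sketch only; the operator and dataset names are placeholders), such a step is run through the osh command, for example:

osh "copy < in.ds > out.ds"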
- Orchestrate vs. DataStage Parallel Extender?
- What are the types of parallelism?
--- There are 2 types of parallel processing:
- Pipeline parallelism - the ability for a downstream stage to begin processing a row as soon as an upstream stage has finished processing that row (rather than processing one row completely through the job before beginning the next row). In parallel jobs, it is managed automatically.
For example, consider a job (src → Transformer → Tgt) running on a system having three processors:
--- The source stage starts running on one processor, reads the data from the source and starts filling a pipeline with the read data.
--- Simultaneously, the Transformer stage starts running on another processor, processes the data in the pipeline and starts filling another pipeline.
--- At the same time, the target stage starts running on another processor and writes data to the target as soon as the data is available.
- Partitioning parallelism - the entire record set is partitioned into smaller sets and processed on different nodes; that is, several processors can run the same job simultaneously, each handling a separate subset of the total data. For example, if there are 100 records and 4 logical nodes, each node processes 25 records. This enhances the speed at which loading takes place.
- Is pipeline parallelism in PX the same as what the Inter-process (IPC) stage does in Server?
--- YES. The IPC stage is a stage which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage; as soon as data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read.
Note: the Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism.
- What are the partitioning methods available in PX?
The partitioning methods available in PX are:
1. Auto:
--- It chooses the best partitioning method depending on:
- The mode of execution of the current stage and the preceding stage.
- The number of nodes available in the configuration file.
2. Round robin:
--- Here, the first record goes to the first processing node, the second to the second processing node, and so on. This method is useful for resizing partitions of an input dataset that are not equal in size into approximately equal-sized partitions. DataStage uses 'Round robin' when it partitions the data initially.
3. Same:
--- It uses the same partitioning method as the preceding stage. The records stay on the same processing node; that is, data is not redistributed or repartitioned. Same is considered the fastest partitioning method. DataStage uses 'Same' when passing data between stages in a job.
4. Random:
--- It distributes the records randomly across all processing nodes and guarantees that each processing node receives approximately equal-sized partitions.
5. Entire:
--- It distributes the complete dataset as input to every instance of a stage on every processing node. It is mostly used with stages that create lookup tables for their input.
6. Hash:
--- It distributes all the records with identical key values to the same processing node so as to ensure that related records are in the same partition. This does not necessarily mean that the partitions will be equal in size.
--- When hash partitioning, hashing keys that create a large number of partitions should be selected. Reason: for example, if you hash partition a dataset based on a zip code field where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.
7. Modulus:
--- Partitioning is based on a key column modulo the number of partitions. The modulus partitioner assigns each record of an input dataset to a partition of its output dataset as determined by a specified key field in the input dataset.
8. Range:
--- It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specific range. It guarantees that all records with the same partitioning key values are assigned to the same partition. Note: in order to use a Range partitioner, a range map has to be made using the 'Write Range Map' stage.
9. DB2:
--- Partitions an input dataset in the same way that DB2 would partition it. For example, if this method is used to partition an input dataset containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.
4. What is re-partitioning? When will re-partitioning actually occur?
--- Re-partitioning is the rearranging of data among the partitions. In a job, a parallel-to-parallel flow results in re-partitioning.
For example, consider EMP data that is initially processed based on SAL, but now you want to process the data grouped by DEPTNO. Then you will need to repartition to ensure that all the employees falling under the same DEPTNO are in the same group.
5. What are the Iconv() and Oconv() functions and where are they used? Can we use these functions in PX?
--- 'Iconv()' converts a string to an internal storage format.
Syntax: Iconv(string, code[@VM code]...)
- string evaluates to the string
to be converted.
- Code indicates the conversion code which specifies how the data needs
to be formatted for output or internal storage. Like,
MCA – Extracts
alphabetic characters from a field.
MCN – Extracts numeric
characters from a field.
MCL - Converts
Uppercase Letters to Lowercase….
--- 'Oconv()' converts an expression to an output format.
Syntax: Oconv(expression, conversion[@VM conversion]...)
- Expression is a string stored in internal format that needs to be converted
to output format.
- Conversion indicates the conversion code which specifies how the string needs to be formatted.
MCA – Extracts
alphabetic characters from a field.
MCN – Extracts numeric
characters from a field.
MCL - Converts
Uppercase Letters to Lowercase….
These functions can't be used directly in PX. The only stage which allows the usage of Iconv() and Oconv() in PX is the 'BASIC Transformer' stage. It gives access to the functions supported by the DataStage Server engine.
Note: the BASIC Transformer can be used only on SMP systems, not on MPP or cluster systems.
1.
What does a Configuration
File in parallel extender consist of?
- The Configuration File consists of all the processing nodes of a
parallel system.
- It can be defined and edited using the DataStage Manager. It
describes every processing node that DataStage uses to run an
application. When you run a job, DataStage first reads the Configuration
File to determine the available nodes.
- When a system is modified by adding or removing processing nodes
or by reconfiguring nodes, the DataStage jobs need not be altered or
even recompiled. Instead, editing the Configuration file alone will
suffice.
- The configuration file also gives control over parallelization of
a job during development cycle. For example, by editing the Configuration
file, first a job can be run on a single processing node, then on two
nodes, then four, and so on.
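A minimal sketch of what a one-node configuration file might look like and how a job can be pointed at it; the node name, fastname and directory paths are placeholders, and APT_CONFIG_FILE is the environment variable DataStage reads to locate the file:

cat > one_node.apt <<'EOF'
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/datasets" { pools "" }
        resource scratchdisk "/data/ds/scratch" { pools "" }
    }
}
EOF
# Point the jobs at this configuration file.
APT_CONFIG_FILE=/data/ds/config/one_node.apt
export APT_CONFIG_FILE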
- What is the difference between a fileset and a dataset?
Dataset:- Datasets are
operating system files, each referred to by a control file, which has the
suffix .ds. PX jobs use datasets to manage data within a job. The data in the
Datasets is stored in internal format.
A Dataset consists of two parts:
- Descriptor file: It contains metadata and data location.
- Data file: It contains the data.
--The Dataset
Stage is used to read data from or write data to a dataset. It allows you
to store data in persistent form, which can then be used by other jobs.
Fileset:- DataStage can
generate and name exported files, write them to their destination, and list the
files it has generated in a file with extension, .fs. The data files and the
file that lists them is called a ‘Fileset’.
A fileset consists of two parts:
- Descriptor file: It contains location of raw data files and the
meta data.
- Individual raw Data files.
-- The Fileset
Stage is used to read data from or write data to a fileset.
3. Lookup stage: is it persistent or non-persistent? (What is happening behind the scenes?)
--- The Lookup stage is non-persistent.
- How can we maintain the partitioning in Sort stage?
-- Partitioning in
Sort stage can be maintained using the Partitioning method, ‘Same’.
For example, assume you sort a dataset on a system
with four processing nodes and store the results to a Dataset stage. The
dataset will therefore have four partitions. You then use that dataset as input
to a stage executing on a different number of nodes. DataStage automatically
repartitions the dataset to spread it out to all the processing nodes. This
destroys the sort order of the data.
This can be avoided by specifying the Same
Partitioning method so that the original partitions are preserved.
- Where do we need partitioning (in processing or somewhere else)?
--- Partitioning is needed in processing; that is, we need partitioning wherever we have huge volumes of data to process.
- If we use SAME partitioning in the first stage, which partitioning method will it take?
--- DataStage uses 'Round robin' when it partitions the data initially.
- What is the symbol we will get when we are using the round robin partitioning method?
-- BOW TIE.
Given below is the list of icons that appear on the link, based on the mode of execution (parallel or sequential) of the current stage and the preceding stage, and the type of partitioning method (preceding stage → current stage):
Sequential mode → (FAN OUT) → Parallel mode (indicates partitioning)
Parallel mode → (FAN IN) → Sequential mode (indicates collecting)
Parallel mode → (BOX) → Parallel mode (indicates the Auto method)
Parallel mode → (BOW TIE) → Parallel mode (indicates repartitioning)
Parallel mode → (PARALLEL LINES) → Parallel mode (indicates SAME partitioning)
- If we check 'Preserve partitioning' in one stage and don't give any partitioning method (Auto) in the next stage, which partitioning method will it use?
-- In this case, the partitioning method used by the preceding stage is used.
-- Preserve Partitioning indicates whether
the stage wants to preserve the
partitioning at the next stage of the job. Options
in this tab are:
- Set – Sets the Preserve partitioning flag.
- Clear – Clears the preserve partitioning flag.
- Propagate – Sets the flag to Set or Clear depending on the
option selected in the previous stage.
- Can we give node allocations, i.e. 4 nodes for one stage and 3 nodes for the next stage?
--- YES. It can be done using the option 'Node map constraint', available in the Stage page → Advanced tab.
--- Checking this option allows you to constrain
Parallel execution to the nodes defined in a node map. A ‘Node map’ can
be defined by either of the below 2 options:
- By clicking the browse button to open the available nodes dialog
box and selecting nodes from there. (or)
- By typing node numbers into the text box.
- What is combinability, non-combinability?
--- Using Combinability, DataStage combines the operators that underlie parallel
stages so that they run in the same process. It lets the DataStage compiler
potentially 'optimize' the number of processes used at runtime by combining
operators. This saves significant amount of data copying and preparation in
passing data between operators. It has three options to set:
- Auto: Use the default combination method.
- Combinable: Combine all
possible operators.
- Don’t combine: Never
combine operators.
Usually this setting is left to its default so that
DataStage can tune jobs for performance automatically.
- What are schema files?
-- ‘Schema file’ is a
plain text file in which meta data for a stage is specified.
--- It is stored outside the DataStage Repository
in a document management system or a source code control system. In contrast,
Table definitions are stored in DataStage Repository and can be loaded into
stages as and when required.
--- Schema is an alternative way to specify column
definitions for the data used by the parallel jobs. By default most parallel
job stages take their meta data from the columns tab. For some stages you can
specify a property that causes the stages to take its meta data from the
specified schema file.
--- A Schema consists of a record definition. The
following is an example for record schema:
record(
name:string[];
address:nullable string[];
date:date[];
)
--- Import any table definition, load it into a
parallel job, then choose "Show Schema". The schema
defines the columns and their characteristics in a data set, and may also
contain information about key columns.
21. Why do we need datasets rather than sequential files?
--- A Sequential file as the source or target needs
to be repartitioned as it is(as name suggests) a single sequential stream of
data. A dataset can be saved across nodes using partitioning method selected,
so it is always faster when we used as a source or target. The Data Set
stage allows you to store data being operated on in a persistent form, which
can then be used by other
DataStage jobs. Data sets are operating system
files, each referred to by a control file, which by convention has the suffix
.ds. Using datasets wisely can be key to good performance in a set of linked
jobs.
22. Does the Lookup stage return multiple rows or a single row?
--- The Lookup stage returns the rows related to the key values. It can return multiple rows, depending on the keys you specify for the lookup in the stage.
23. Why do we need the Sort stage when there is the sort-merge collecting method and the 'Perform sort' option in a stage's advanced properties?
--- The Sort stage is used to perform more complex sort operations which are not possible using a stage's Advanced tab properties.
Many stages have an optional sort function via the
partition tab. This means if you are partitioning your data in a stage you can
define the sort at the same time. The sort stage is for use when you don't have
any stage doing partitioning in your job but you still want to sort your data,
or if you want to sort your data in descending order, or if you want to use one
of the sort stage options such as "Allow Duplicates" or "Stable
Sort". If you are processing very large volumes and need to sort you will
find the sort stage is more flexible than the partition tab sort.
24. For surrogate key generator stage where will be
the next value stored?
25. In the Surrogate Key Generator stage, how does it generate the numbers (based on nodes or based on rows)?
--- Key values are generated based on the nodes, and the input data partitions should be perfectly balanced across the nodes. This can be achieved using the round robin partitioning method when your starting point is sequential.
26. What is the preserve partitioning flag in the Advanced tab?
--- It indicates whether the stage wants to preserve partitioning at the next stage of the job. There are three options: 1. Set 2. Clear 3. Propagate.
Set. Sets the
preserve partitioning flag, this indicates to the next stage in the job that it
should preserve existing partitioning if possible.
Clear. Clears the
preserve partitioning flag, this indicates that this stage doesn’t care which
partitioning method the next stage uses.
Propagate. Sets the flag to Set or Clear depending on what the previous stage of the job has set (or, if that is set to Propagate, the stage before that, and so on until a preserve partitioning flag setting is encountered).
27. What is the difference between stages and
operators?
--- Stages are the generic user interface through which we can read from and write to files and databases, troubleshoot and develop jobs, and also process data. The different types of stages are:
Database. These are
stages that read or write data contained in a database. Examples of database
stages are the Oracle Enterprise and DB2/UDB Enterprise stages.
Development/Debug. These
are stages that help you when you are developing and troubleshooting parallel
jobs. Examples are the Peek and Row Generator stages.
File. These are
stages that read or write data contained in a file or set of files. Examples of
file stages are the Sequential File and Data Set stages.
Processing. These are
stages that perform some processing on the data that is passing through them.
Examples of processing stages are the Aggregator and Transformer stages.
Real Time. These are the
stages that allow Parallel jobs to be made available as RTI services. They
comprise the RTI Source and RTI Target stages. These are part of the optional
Web Services package.
Restructure. These are
stages that deal with and manipulate data containing columns of complex data
type. Examples are Make Sub record and Make Vector stages.
--- Operators are the basic functional units of an Orchestrate application. In the Orchestrate framework, each DataStage stage generates an Orchestrate operator directly.
28. Why do we need the Filter, Copy and Column Export stages instead of the Transformer stage?
--- In parallel jobs we have specific stage types
for performing specialized tasks. Filter, copy, column export stages are
operator stages. These operators are the basic functional units of an
orchestrate application. The operators in your Orchestrate application pass
data records from one operator to the next, in pipeline fashion. For example,
the operators in an application step might start with an import operator, which
reads data from a file and converts it to an Orchestrate data set. Subsequent
operators in the sequence could perform various processing and analysis tasks.
The processing power of Orchestrate derives largely from its ability to execute
operators in parallel on multiple processing nodes. By default, Orchestrate
operators execute on all processing nodes in your system. Orchestrate
dynamically scales your application up or down in response to system
configuration changes, without requiring you to modify your application. Thus
using operator stages will increase the speed of data processing applications
rather than using transformer stages.
29. Describe the types of Transformers used in
DataStage PX for processing and uses?
Transformer
Basic Transformer
Transformer-: The
Transformer stage is a processing stage. Transformer stages allow you to create
transformations to apply to your data. These transformations can be simple or
complex and can be applied to individual columns in your data. Transformations
are specified using a set of functions. Transformer stages can have a single
input and any number of outputs. It can also have a reject link that takes any
rows, which have not been written to any of the outputs links by reason of a
write failure or expression evaluation failure.
Basic Transformer: The BASIC Transformer stage is also a processing stage. It is similar
in appearance and function to the Transformer stage in parallel jobs. It gives
access to BASIC transforms and functions (BASIC is the language supported by
the DataStage server engine and available in server jobs). BASIC Transformer
stage can have a single input and any number of outputs.
30. What is aggregate cache in aggregator
transformation?
--- Aggregate cache is the memory used for grouping
operations by the aggregator stage.
31. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run a job?
--- This situation can be handled using ‘Wait-for-file’
activity between the job- activities in a Job sequence.
Under the ‘Wait For File’ tab set the
following options:
- File name.
- Check any of the options:
- Wait for file to appear.
- Wait for file to disappear.
- Timeout length - (hh:mm:ss).
Under the ‘Triggers’ tab set the:
- Expression Type and
- Expression
For the corresponding Input Links.
The available 'Expression Types' (Expression Type → Expression) are:
Unconditional → N/A
Otherwise → N/A
OK (Conditional) → "Executed OK"
Failed (Conditional) → "Execution Failed"
Return Value (Conditional) → =
Custom (Conditional) → <LHS> = <RHS>
32. How do you rename all of the jobs to support
your new File-naming conventions?
--- Create a file with new and old names. Export
the whole project as a dsx. Write a script, which can do a simple rename of the
strings looking up the file. Then import the new dsx file and recompile all
jobs. Be cautious that the name of the jobs has also been changed in your job
control jobs or Sequencer jobs. So you have to make the necessary changes to
these Sequencers.
33. How do you merge two files in DS?
--- We can merge two files in 3 different ways: either go for the Merge stage, the Join stage or the Lookup stage. All these merges or joins occur based on the key values. The three stages differ mainly in the memory
they use, the treatment of rows with unmatched keys, and their requirements for
data being input of key columns.
34. How did you handle an 'Aborted' sequencer?
--- By using checkpoint information we can restart the sequence from the point of failure. If you have enabled checkpoint information, reset the aborted job and run the sequence again.
35. What are Performance tunings you have done in
your last project to increase the performance of slowly running jobs?
1. Using the Dataset stage instead of sequential files wherever necessary.
2. Using the Join stage instead of the Lookup stage when the data is huge.
3. Using operator stages like Remove Duplicates, Filter, Copy etc. instead of the Transformer stage.
4. Sorting the data before sending it to the Change Capture stage or Remove Duplicates stage.
5. Hash partitioning and sorting on the key column before aggregate operations.
6. Filtering unwanted records at the beginning of the job flow itself.
36. If data is partitioned in your job on key 1 and
then you aggregate on key 2, what issues could arise?
--- It will result in false output even though job
runs successfully. In aggregator key value should be hash partitioned so that
identical key values will be in the same node, which gives the desired result
and also eases grouping operation.
37. What is Full load & Incremental or Refresh
load?
38. Describe cleanup resource and clear status
file?
The Cleanup Resources command is to:
• View and end job processes
• View and release the associated locks
Clear Status file command is for resetting the
status records associated with all stages in that job.
The Cleanup Resources command and the Clear Status File command are available in the Director under the Job menu.
39. What is lookup stage? Can you define
derivations in the lookup stage output?
Lookup stage is used to perform lookup operations
on a data set read into memory from any other Parallel job stage that can
output data. It can also perform lookups directly in a DB2 or Oracle database
or in a lookup table contained in a Lookup File Set stage.
The most common use for a lookup is to map short
codes in the input data set onto expanded information from a lookup table which
is then joined to the incoming data and output.
Lookups can also be used for validation of a row.
If there is no corresponding entry in a lookup table to the key’s values, the
row is rejected.
The Lookup stage can have a reference link, a
single input link, a single output link, and a single rejects link. Depending
upon the type and setting of the stage(s) providing the look up information, it
can have multiple reference links (where it is directly looking up a DB2 table
or Oracle table, it can only have a single reference link).
The input link carries the data from the source
data set and is known as the primary link. For each record of the source data
set from the primary link, the Lookup stage performs a table lookup on each of
the lookup tables attached by reference links. The table lookup is based on the
values of a set of lookup key columns, one set for each table. The keys are
defined on the Lookup stage.
Lookup stages do not require data on the input link
or reference links to be sorted. Each record of the output data set contains
columns from a source record plus columns from all the corresponding lookup
records where corresponding source and lookup records have the same value for
the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links. The optional reject link carries
source records that do not have a corresponding entry in the input lookup
tables.
There are some special partitioning considerations
for lookup stages. You need to ensure that the data being looked up in the
lookup table is in the same partition as the input data referencing it. One way
of doing this is to partition the lookup tables using the Entire method.
Another way is to partition it in the same way as the input data (although this
implies sorting of the data).
Yes, we can define
derivations in the lookup stage output.
40. Parallel DataStage jobs can have many
sources of reference data for lookups including database tables, sequential
files or native datasets. Which is the most efficient?
--- In DataStage server jobs the answer is
quite simple, local hash files are the fastest method of a key based lookup, as
long as the time taken to build the hash file does not wipe out your benefits
from using it.
In DataStage parallel jobs there is a very large number of stages that can be used as a lookup, a much wider variety than in server jobs; this includes most
data sources and the parallel staging format of
datasets and lookup filesets. Database lookups can be discounted as the
overhead of the database connectivity and any network passage makes them slower
than most local storage.
So, choosing a source of reference data for lookups depends on the size of the lookup source:
-- If your lookup size is low enough to fit into
memory then the source is irrelevant, they all load up very quickly, even
database lookups are fast.
-- If you have very large lookup files spilling
into lookup table resources then the Lookup fileset outstrips the other
options.
-- A Join becomes a viable option. They are
a bit harder to design as you can only join one source at a time whereas a
lookup can join multiple sources.
41. What is copy stage? When do you use that?
The Copy stage copies a single input data set to a
number of output data sets. Each record of the input data set is copied to every
output data set. Records can be copied without modification or you can drop or
change the order of columns.
Copy stage is useful when we want to make a backup
copy of a data set on disk while performing an operation on another copy.
A Copy stage with a single input and a single output needs 'Force' set to TRUE. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
42. What is Change Capture stage? Which execution
mode would you use when you used for comparison of data?
The Change Capture stage takes two input data sets,
denoted before and after, and outputs a single data set whose records represent
the changes made to the before data set to obtain the after data set.
The stage produces a change data set, whose table
definition is transferred from the after data set’s table definition with the
addition of one column: a change code with values encoding the four actions:
insert, delete, copy, and edit. The preserve-partitioning flag is set on the change
data set.
The compare is based on a set of key columns;
rows from the two data sets are assumed to be copies of one another if they
have the same values in these key columns. You can also optionally specify
change values. If two rows have identical key columns, you can compare the
value columns in the rows to see if one is an edited copy of the other.
The stage assumes that the incoming data is
key-partitioned and sorted in ascending order. The columns the data is hashed
on should be the key columns used for the data compare. You can achieve the
sorting and partitioning using the Sort stage or by using the built-in sorting
and partitioning abilities of the Change Capture stage.
We can use both Sequential as well as parallel
modes of execution for change capture stage.
43. What is Dataset Stage?
The Data Set stage is a file stage. It allows you
to read data from or write data to a data set. It can be configured to execute
in parallel or sequential mode. DataStage parallel extender jobs use data sets
to manage data within a job.
The Data Set stage allows you to store data being
operated on in a persistent form, which can then be used by other DataStage
jobs. Data sets are operating system files, each referred to by a control file,
which by convention has the suffix .ds.
Using data sets wisely can be key to good
performance in a set of linked jobs. You can also manage data sets
independently of a job using the Data Set Management utility, available from
the DataStage Designer, Manager, or Director.
44. How do you drop dataset?
There are two ways of dropping a data set: the first is by using the Data Set Management utility (GUI) located in the Manager, Director, or Designer, and the second is by using the UNIX command-line utility orchadmin.
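A sketch of the command-line route; the dataset path is a placeholder, orchadmin needs a valid APT_CONFIG_FILE in the environment so it can locate the data files on every node, and the exact subcommand (rm or delete) can vary by version:

# Deletes the dataset descriptor file and its data files on all nodes.
orchadmin rm /data/ds/datasets/customers.ds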
45. How do you eliminate duplicate in a Dataset?
The simplest way to remove duplicates is by using the Remove Duplicates stage. The Remove Duplicates stage takes a single sorted
data set as input, removes all duplicate rows, and writes the results to an
output data set.
46. What is Peek Stage? When do you use it?
The Peek stage is a Development/Debug stage. It can
have a single input link and any number of output links.
The Peek stage lets you print record column values
either to the job log or to a separate output link as the stage copies records
from its input data set to one or more output data sets, like the Head stage
and the Tail stage.
The Peek stage can be helpful for monitoring the
progress of your application or to diagnose a bug in your application.
47. What are the different join options available
in Join Stage?
There are four join option available in the join
stage.
Inner: Transfers
records from input data sets whose key columns contain equal values to the
output data set. Records whose key columns do not contain equal values are
dropped.
Left Outer: Transfers all
values from the left data set but transfers values from the right data set and
intermediate data sets only where key columns match. The stage drops the key
column from the right and intermediate data sets.
Right Outer: Transfers all
values from the right data set and transfers values from the left data set and
intermediate data sets only where key columns match. The stage drops the key
column from the left and intermediate data sets.
Full Outer: Transfers records
in which the contents of the key columns are equal from the left and right
input data sets to the output data set. It also transfers records whose key
columns contain unequal values from both input data sets to the output data
set. (Full outer joins do not support more than two input links.)
The default is inner.
48. What is the difference between Lookup, Merge
& Join Stage?
These "three Stages" combine two or more
input links according to values of user-designated "key" column(s).
They differ mainly in:
Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)
The main difference between joiner and lookup is in
the way they handle the data and the reject links. In joiner, no reject links
are possible. So we cannot get the rejected records directly. Lookup provides a
reject link.
Also lookup is used if the data being looked up can
fit in the available temporary memory. If the volume of data is quite huge, then it is safer to go for the Join stage.
Join requires the input datasets to be key partitioned and sorted. Lookup does not have this requirement.
Lookup allows reject links. Join does not allow
reject links. If the volume of data is huge to be fit into memory you go for
join and avoid lookup as paging can occur when lookup is used.
The Merge stage allows us to capture failed lookups from each reference input separately. It also requires identically sorted and partitioned inputs and, if there is more than one reference input, de-duplicated reference inputs.
In the case of the Merge stage, as a pre-processing step, duplicates should be removed from the master dataset. If there is more than one update dataset, then duplicates should be removed from the update datasets as well.
The above-mentioned step is not required for the Join and Lookup stages.
49. What is RCP? How it is implemented?
DataStage is flexible about metadata. It can cope with the situation where metadata isn't fully defined. You can define part of your schema and specify that,
if your job encounters extra columns that are not defined in the meta data when
it actually runs, it will adopt these extra columns and propagate them through
the rest of the job. This is known as runtime column propagation (RCP).
This can be enabled for a project via the DataStage
Administrator, and set for individual links via the Outputs Page Columns tab
for most stages, or in the Outputs page General tab for Transformer stages. You
should always ensure that runtime column propagation is turned on.
RCP is implemented through Schema File. The schema
file is a plain text file contains a record (or row) definition.
50. What is row generator? When do you use it?
The Row Generator stage is a Development/Debug
stage. It has no input links, and a single output link.
The Row Generator stage produces a set of mock data
fitting the specified meta data.
This is useful where we want to test our job but
have no real data available to process. Row
Generator is also useful when we want processing
stages to execute at least once in absence of data from the source.
51. How do you extract data from more than one heterogeneous source?
We can extract data from different sources (e.g. Oracle, DB2, sequential files, etc.) in a job. After getting the data we can use a Join, Merge, Aggregator or Lookup stage to unify the incoming data.
52. How can we pass parameters to job by using
file?
This can be done through a shell script that reads the parameters from the file and calls the dsjob command to run the job with those parameter values, as in the sketch below.
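A minimal sketch, assuming the parameter file holds one name=value pair per line (with no spaces in the values) and placeholder project/job names:

#!/bin/sh
# params.txt contains lines such as:  LoadDate=2011-06-30
PARAMS=""
while read LINE; do
    PARAMS="$PARAMS -param $LINE"
done < params.txt
dsjob -run -wait $PARAMS dwh_project LoadCustomers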