DataStage PX questions
- What is the difference between server jobs and parallel jobs?
--- Server jobs generate DataStage BASIC, parallel jobs generate Orchestrate shell script (osh) and C++, and mainframe jobs generate COBOL and JCL.
--- In server and mainframe you tend to do most of the work in Transformer stage. In parallel you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many more stage types for parallel than server or mainframe, and parallel stages correspond to Orchestrate operators.
--- Also there's the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.
- What is Orchestrate?
--- Orchestrate is the old name of the underlying parallel execution engine. Ascential renamed the technology "Parallel Extender". The DataStage PX GUI generates OSH (Orchestrate Shell) scripts for the jobs you run.
An OSH script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form, it is:
op < in.ds > out.ds
where:
op – an Orchestrate operator.
in.ds – the input dataset.
out.ds – the output dataset.
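As a rough sketch only (the operator options and dataset paths below are made-up placeholders, not output of an actual compile), a generated osh step that sorts a persistent dataset on a key and copies the result might look like:

osh "tsort -key CUSTID < /data/in.ds | copy > /data/out.ds"

Each stage on the Designer canvas maps to one or more such operators, and the pipes between them correspond to the job's links (virtual datasets).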
- Orchestrate vs. DataStage Parallel Extender?
- What are the types of parallelism?
--- There are 2 types of parallel processing. They are:
a. Pipeline Parallelism – It is the ability for a downstream stage to begin processing a row as soon as an upstream stage has finished processing that row (rather than processing one row completely through the job before beginning the next row). In parallel jobs, it is managed automatically.
For example, consider a job (src → Transformer → Tgt) running on a system having three processors:
--- The source stage starts running on one
processor, reads the data from the source and starts filling a pipeline with
the read data.
---
Simultaneously, the Transformer stage starts running on another processor,
processes the data in the pipeline and starts filling another pipeline.
---
At the same time, the target stage starts running on another processor, writes
data to the target as soon as the data is available.
b. Partitioning Parallelism – Partitioning parallelism means that the entire record set is partitioned into small sets and processed on different nodes. That is, several processors can run the same job simultaneously, each handling a separate subset of the total data.
For example, if there are 100 records and 4 logical nodes, then each node would process 25 records. This enhances the speed at which loading takes place.
- Is pipeline parallelism in PX the same as what the Inter-Process (IPC) stage does in Server?
YES. The IPC stage is a stage which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage.
It means that as soon as the data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read.
Note:-
Link partitioner and Link collector stages can be used to achieve
a certain degree of partitioning parallelism.
- What are the partitioning methods available in PX?
The partitioning methods available in PX are:
1. Auto:
--- It chooses the best partitioning method depending on:
· The mode of execution of the current stage and the preceding stage.
· The number of nodes available in the configuration file.
2. Round robin:
--- Here, the first record goes to the first processing node, the second to the second processing node, and so on. This method is useful for redistributing the partitions of an input dataset that are unequal in size into approximately equal-sized partitions.
DataStage uses ‘Round robin’ when it partitions the data initially.
3. Same:
--- It uses the same partitioning method as the one used by the preceding stage. The records stay on the same processing node; that is, data is not redistributed or repartitioned. Same is considered the fastest partitioning method.
DataStage uses ‘Same’ when passing data between stages in a job.
4. Random:
---
It distributes the records randomly across all processing nodes and guarantees
that each processing node receives approximately equal-sized partitions.
5. Entire:
---
It distributes the complete dataset as input to every instance of a stage on
every processing node. It is mostly used with stages that create lookup tables
for their input.
6. Hash:
---
It distributes all the records with identical key values to the same processing
node so as to ensure that related records are in the same partition. This does
not necessarily mean that the partitions will be equal in size.
---
When Hash Partitioning, hashing keys that create a large number of partitions
should be selected.
Reason:
For example, if you hash partition a dataset based on a zip code field, where a
large percentage of records are from one or two zip codes, it can lead to
bottlenecks because some nodes are required to process more records than other
nodes.
7. Modulus:
---
Partitioning is based on a key column modulo the number of partitions. The
modulus partitioner assigns each record of an input dataset to a partition of
its output dataset as determined by a specified key field in the input dataset.
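As a small worked illustration (the key value and node count here are made up): with a 4-node configuration and a key column value of 27, the record is assigned to partition 27 mod 4 = 3.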
8. Range:
---
It divides a dataset into approximately equal-sized partitions, each of which
contains records with key columns within a specific range. It guarantees that
all records with same partitioning key values are assigned to the same
partition.
Note:
In order to use a Range partitioner, a range map has to be made using the
‘Write range map’ stage.
9. DB2:
---
Partitions an input dataset in the same way that DB2 would partition it.
For
example, if this method is used to partition an input dataset containing update
information for an existing DB2 table, records are assigned to the processing
node containing the corresponding DB2 record. Then during the execution of the
parallel operator, both the input record and the DB2 table record are local to
the processing node.
- What is Re-Partitioning? When will re-partitioning actually occur?
--- Re-Partitioning is the rearranging of data among the partitions. In a job, a Parallel-to-Parallel flow results in Re-Partitioning.
For example, consider the EMP data that is
initially processed based on SAL, but now you want to process the data grouped
by DEPTNO. Then you will need to Repartition to ensure that all the employees
falling under the same DEPTNO are in the same group.
- What are the Iconv() and Oconv() functions and where are they used? Can we use these functions in PX?
--- ‘Iconv()’ converts a string to an internal storage format.
Syntax:- Iconv(string, code[@VM code]...)
· string evaluates to the string to be converted.
· code is the conversion code, which specifies how the data needs to be formatted for output or internal storage. For example:
MCA – extracts alphabetic characters from a field.
MCN – extracts numeric characters from a field.
MCL – converts uppercase letters to lowercase.
--- ‘Oconv()’ converts an expression to an output format.
Syntax:- Oconv(expression, conversion[@VM conversion]...)
· expression is a string stored in internal format that needs to be converted to output format.
· conversion is the conversion code, which specifies how the string needs to be formatted. For example:
MCA – extracts alphabetic characters from a field.
MCN – extracts numeric characters from a field.
MCL – converts uppercase letters to lowercase.
These functions can’t be used directly in PX. The only stage which allows the use of Iconv() and Oconv() in PX is the ‘BASIC Transformer’ stage. It gives access to the functions supported by the DataStage Server engine.
Note:- The BASIC Transformer can be used only on SMP systems, not on MPP or Cluster systems.
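As a rough sketch of the kind of expressions these functions support (written here as BASIC assignments; the variable names and literal values are made up for illustration):

InternalDate = Iconv("31-12-2024", "D-DMY[2,2,4]")  ;* external date string to internal day number
DisplayDate  = Oconv(InternalDate, "D4/")           ;* internal day number back to a display date with a 4-digit year
DigitsOnly   = Oconv("AB12CD34", "MCN")             ;* keep only the numeric characters -> "1234"
LowerCase    = Oconv("HELLO", "MCL")                ;* convert uppercase letters to lowercase -> "hello"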
- What does a Configuration File in Parallel Extender consist of?
· The Configuration File consists of all the processing nodes of a parallel system.
· It can be defined and edited using the DataStage Manager. It describes every processing node that DataStage uses to run an application. When you run a job, DataStage first reads the Configuration File to determine the available nodes.
· When a system is modified by adding or removing processing nodes, or by reconfiguring nodes, the DataStage jobs need not be altered or even recompiled. Instead, editing the Configuration File alone will suffice.
· The Configuration File also gives control over the parallelization of a job during the development cycle. For example, by editing the Configuration File, a job can first be run on a single processing node, then on two nodes, then four, and so on.
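A minimal sketch of what such a configuration file can look like for two logical nodes on one server (the host name and resource paths here are made-up placeholders):

{
    node "node1"
    {
        fastname "etl_host"
        pools ""
        resource disk "/dsdata/node1" {pools ""}
        resource scratchdisk "/dsscratch/node1" {pools ""}
    }
    node "node2"
    {
        fastname "etl_host"
        pools ""
        resource disk "/dsdata/node2" {pools ""}
        resource scratchdisk "/dsscratch/node2" {pools ""}
    }
}

Adding or removing node entries in a file like this changes the degree of partitioning parallelism without recompiling the jobs.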
- What is the difference between a fileset and a dataset?
Dataset:- Datasets
are operating system files, each referred to by a control file, which has the
suffix .ds. PX jobs use datasets to manage data within a job. The data in the
Datasets is stored in internal format.
A Dataset consists of two parts:
· Descriptor file: It contains metadata and the data location.
· Data file: It contains the data.
--The
Dataset Stage is used to read data from or write data to a dataset. It
allows you to store data in persistent form, which can then be used by other
jobs.
Fileset:- DataStage
can generate and name exported files, write them to their destination, and list
the files it has generated in a file with extension, .fs. The data files and
the file that lists them is called a ‘Fileset’.
A fileset consists of two parts:
· Descriptor file: It contains the location of the raw data files and the metadata.
· Individual raw data files.
--
The Fileset Stage is used to read data from or write data to a fileset.
11. Lookup stage: Is it persistent or non-persistent? (What is happening behind the scenes?)
--- Lookup
stage is non-persistent.
- How can we maintain the partitioning in
Sort stage?
-- Partitioning
in Sort stage can be maintained using the Partitioning method, ‘Same’.
For example, assume you sort a dataset on a system
with four processing nodes and store the results to a Dataset stage. The
dataset will therefore have four partitions. You then use that dataset as input
to a stage executing on a different number of nodes. DataStage automatically
repartitions the dataset to spread it out to all the processing nodes. This
destroys the sort order of the data.
This can be
avoided by specifying the Same Partitioning method so that the original
partitions are preserved.
- Where do we need partitioning (in processing or somewhere else)?
--- Partitioning is needed in processing. That is, we need partitioning wherever we have huge volumes of data to process.
- If we use Same partitioning in the first stage, which partitioning method will it take?
--- DataStage uses ‘Round robin’ when it partitions the data initially.
- What symbol will we get when we are using the round robin partitioning method?
-- BOW TIE.
Given below is the list of icons that appear on the link, based on the mode of execution (parallel or sequential) of the current stage and the preceding stage, and on the type of partitioning method (preceding stage → icon → current stage):
Sequential mode → (FAN OUT) → Parallel mode (indicates partitioning)
Parallel mode → (FAN IN) → Sequential mode (indicates collecting)
Parallel mode → (BOX) → Parallel mode (indicates the Auto method)
Parallel mode → (BOW TIE) → Parallel mode (indicates repartitioning)
Parallel mode → (PARALLEL LINES) → Parallel mode (indicates Same partitioning)
- If we check Preserve Partitioning in one stage and don’t give any partitioning method (Auto) in the next stage, which partitioning method will it use?
-- In this case, the partitioning method used by the preceding stage is used.
-- Preserve Partitioning indicates whether the stage wants to preserve the partitioning at the next stage of the job. The options in this tab are:
· Set – Sets the preserve partitioning flag.
· Clear – Clears the preserve partitioning flag.
· Propagate – Sets the flag to Set or Clear depending on the option selected in the previous stage.
- Can we give node allocations, i.e., 4 nodes for one stage and 3 nodes for the next stage?
--- YES. It can be done using the ‘Node map constraint’ option available in the Stage page → Advanced tab.
--- Checking this option allows you to constrain parallel execution to the nodes defined in a node map. A ‘Node map’ can be defined by either of the two options below:
· By clicking the browse button to open the available nodes dialog box and selecting nodes from there, or
· By typing node numbers into the text box.
- What is combinability / non-combinability?
--- Using combinability, DataStage combines the operators that underlie parallel stages so that they run in the same process. It lets the DataStage compiler potentially 'optimize' the number of processes used at runtime by combining operators. This saves a significant amount of data copying and preparation in passing data between operators. It has three options:
· Auto: Use the default combination method.
· Combinable: Combine all possible operators.
· Don’t combine: Never combine operators.
Usually this setting is left at its default so that DataStage can tune jobs for performance automatically.
- What are schema files?
-- A ‘Schema file’ is a plain text file in which the metadata for a stage is specified.
---
It is stored outside the DataStage Repository in a document management system
or a source code control system. In contrast, Table definitions are stored in
DataStage Repository and can be loaded into stages as and when required.
--- A schema is an alternative way to specify column definitions for the data used by parallel jobs. By default, most parallel job stages take their metadata from the Columns tab. For some stages, you can specify a property that causes the stage to take its metadata from the specified schema file.
--- A schema consists of a record definition. The following is an example record schema:
record(
  name:string[];
  address:nullable string[];
  date:date[];
)
--- Import any table
definition, load it into a parallel job, then choose "Show Schema".
The schema defines the columns and their characteristics in a data set,
and may also contain information about key columns.
20. Why do we need datasets rather than sequential files?
--- A sequential file as the source or target needs to be repartitioned because it is (as the name suggests) a single sequential stream of data. A dataset is saved across nodes using the partitioning method selected, so it is usually faster when used as a source or target. The Data Set stage allows you to store the data being operated on in a persistent form, which can then be used by other DataStage jobs.
Datasets are operating system files, each referred to by a control file, which by convention has the suffix .ds. Using datasets wisely can be key to good performance in a set of linked jobs.
21. Does the Lookup stage return multiple rows or single rows?
--- The Lookup stage returns the rows related to the key values. It can return multiple rows, depending on the keys you specify for the lookup in the stage.
22. Why do we need the Sort stage, other than the sort-merge collecting method and the ‘perform sort’ option in a stage's advanced properties?
--- The Sort stage is used to perform more complex sort operations, which are not possible using a stage's Advanced tab properties.
Many stages have
an optional sort function via the partition tab. This means if you are
partitioning your data in a stage you can define the sort at the same time. The
sort stage is for use when you don't have any stage doing partitioning in your
job but you still want to sort your data, or if you want to sort your data in descending
order, or if you want to use one of the sort stage options such as "Allow
Duplicates" or "Stable Sort". If you are processing very large
volumes and need to sort, you will find the Sort stage more flexible than the partition tab sort.
23. For the Surrogate Key Generator stage, where will the next value be stored?
24. In the Surrogate Key Generator stage, how does it generate the numbers (based on nodes or based on rows)?
--- Key values are generated based on the nodes, and the input data partitions should be perfectly balanced across the nodes. This can be achieved using the round robin partitioning method when your starting point is sequential.
25. What is the preserve partitioning flag in the Advanced tab?
--- It indicates whether the stage wants to preserve partitioning at the next stage of the job. There are three options: Set, Clear, and Propagate.
Set: Sets the preserve partitioning flag; this indicates to the next stage in the job that it should preserve the existing partitioning if possible.
Clear: Clears the preserve partitioning flag; this indicates that this stage doesn’t care which partitioning method the next stage uses.
Propagate: Sets the flag to Set or Clear depending on what the previous stage of the job has set (or, if that is set to Propagate, the stage before that, and so on until a preserve partitioning flag setting is encountered).
26. What is the difference between stages and operators?
--- Stages are the generic user interface from which we can read from and write to files and databases, troubleshoot and develop jobs; they are also capable of processing data. The different types of stages are:
Database.
These are stages that read or write data contained in a database. Examples of
database stages are the Oracle Enterprise and DB2/UDB Enterprise stages.
Development/Debug.
These are stages that help you when you are developing and troubleshooting
parallel jobs. Examples are the Peek and Row Generator stages.
File.
These are stages that read or write data contained in a file or set of files.
Examples of file stages are the Sequential File and Data Set stages.
Processing.
These are stages that perform some processing on the data that is passing through them. Examples of
processing stages are the Aggregator and Transformer stages.
Real Time.
These are the stages that allow Parallel jobs to be made available as RTI
services. They comprise the RTI Source and RTI Target stages. These are part of
the optional Web Services package.
Restructure.
These are stages that deal with and manipulate data containing columns of
complex data type. Examples are Make Sub record and Make Vector stages.
--- Operators are the basic functional units of an Orchestrate application. In the Orchestrate framework, DataStage stages generate Orchestrate operators directly.
27. Why do we need the Filter, Copy and Column Export stages instead of the Transformer stage?
--- In parallel
jobs we have specific stage types for performing specialized tasks. Filter,
copy, column export stages are operator stages. These operators are the basic
functional units of an orchestrate application. The operators in your
Orchestrate application pass data records from one operator to the next, in
pipeline fashion. For example, the operators in an application step might start
with an import operator, which reads data from a file and converts it to an
Orchestrate data set. Subsequent operators in the sequence could perform
various processing and analysis tasks. The processing power of Orchestrate
derives largely from its ability to execute operators in parallel on multiple
processing nodes. By default, Orchestrate operators execute on all processing
nodes in your system. Orchestrate dynamically scales your application up or
down in response to system configuration changes, without requiring you to
modify your application. Thus using operator stages will increase the speed of
data processing applications rather than using transformer stages.
28. Describe the types of Transformers used in DataStage PX for processing and their uses.
Transformer
Basic Transformer
Transformer-:
The Transformer stage is a processing stage.
Transformer stages allow you to create transformations to apply to your
data. These transformations can be simple or complex and can be applied to
individual columns in your data. Transformations are specified using a set of
functions. Transformer stages can have a single input and any number of
outputs. It can also have a reject link that takes any rows which have not been written to any of the output links because of a write failure or expression evaluation failure.
Basic Transformer:- The BASIC Transformer stage is also a processing stage. It is similar in
appearance and function to the Transformer stage in parallel jobs. It gives
access to BASIC transforms and functions (BASIC is the language supported by
the DataStage server engine and available in server jobs). BASIC Transformer
stage can have a single input and any number of outputs.
29. What is aggregate cache in aggregator
transformation?
---
Aggregate cache is the memory used for grouping operations by the
aggregator stage.
30. What will you do in a situation where
somebody wants to send you a file and use that file as an input or reference
and then run job?
--- Use wait for file activity stage
between job activity stages in job sequencer.
31. How do you rename all of the jobs to support your new file-naming conventions?
--- Create a file with the new and old names. Export the whole project as a dsx. Write a script which does a simple rename of the strings, looking up the file. Then import the new dsx file and recompile all jobs. Be aware that the job names are also referenced in your job control jobs or sequencer jobs, so you have to make the necessary changes to those sequencers as well.
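A minimal sketch of such a rename script, assuming a mapping file names.txt with one "OLD_NAME NEW_NAME" pair per line and an export file project.dsx (both file names are made up for this example):

#!/bin/sh
# For each old/new pair, substitute every occurrence in the exported project
while read old new
do
    sed "s/$old/$new/g" project.dsx > project.tmp && mv project.tmp project.dsx
done < names.txt

A whole-string substitution like this also touches job names inside sequencers and job control code, which is why those jobs must be reviewed after re-import.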
32. How do you merge two files in DS?
--- We can merge two files in 3 different ways: using the Merge stage, the Join stage, or the Lookup stage. All of these merge or join based on key values. The three stages differ mainly in the memory they use, the treatment of rows with unmatched keys, and their requirements on the input data (sorting and de-duplication of the key columns).
33. How did you handle an 'aborted' sequencer?
--- By using checkpoint information, we can restart the sequence from the point of failure. If you enabled checkpoint information, reset the aborted job and run the sequence again.
34. What performance tunings have you done in your last project to increase the performance of slowly running jobs?
1. Use the Dataset stage instead of sequential files wherever necessary.
2. Use the Join stage instead of the Lookup stage when the data is huge.
3. Use operator stages like Remove Duplicates, Filter, Copy, etc. instead of the Transformer stage.
4. Sort the data before sending it to the Change Capture stage or Remove Duplicates stage.
5. The key column should be hash partitioned and sorted before an aggregate operation.
6. Filter unwanted records at the beginning of the job flow itself.
35. If data is partitioned in your job on key 1 and then you aggregate on key 2, what issues could arise?
--- It will produce incorrect output even though the job runs successfully. In the Aggregator, the key values should be hash partitioned so that identical key values end up on the same node, which gives the desired result and also eases the grouping operation.
36. What is Full load & Incremental or
Refresh load?
37. Describe Cleanup Resources and Clear Status File.
The Cleanup Resources command is used to:
• View and end job processes
• View and release the associated locks
The Clear Status File command resets the status records associated with all stages in that job.
The Cleanup Resources command and the Clear Status File command are available in the Director, under the Job menu.
38. What is lookup stage? Can you define
derivations in the lookup stage output?
Lookup
stage is used to perform lookup operations on a data set read into memory from
any other Parallel job stage that can output data. It can also perform lookups
directly in a DB2 or Oracle database or in a lookup table contained in a Lookup
File Set stage.
The most common
use for a lookup is to map short codes in the input data set onto expanded
information from a lookup table which is then joined to the incoming data and
output.
Lookups can also
be used for validation of a row. If there is no corresponding entry in a lookup
table to the key’s values, the row is rejected.
The Lookup stage
can have a reference link, a single input link, a single output link, and a
single rejects link. Depending upon the type and setting of the stage(s)
providing the look up information, it can have multiple reference links (where
it is directly looking up a DB2 table or Oracle table, it can only have a
single reference link).
The input link
carries the data from the source data set and is known as the primary link. For
each record of the source data set from the primary link, the Lookup stage
performs a table lookup on each of the lookup tables attached by reference
links. The table lookup is based on the values of a set of lookup key columns,
one set for each table. The keys are defined on the Lookup stage.
Lookup stages do
not require data on the input link or reference links to be sorted. Each record of the output data set contains
columns from a source record plus columns from all the corresponding lookup
records where corresponding source and lookup records have the same value for
the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links. The optional reject link carries
source records that do not have a corresponding entry in the input lookup
tables.
There are some
special partitioning considerations for lookup stages. You need to ensure that
the data being looked up in the lookup table is in the same partition as the
input data referencing it. One way of doing this is to partition the lookup
tables using the Entire method. Another way is to partition it in the same way
as the input data (although this implies sorting of the data).
Yes, you can define derivations in the Lookup stage output.
39. What is copy stage? When do you use
that?
The
Copy stage copies a single input data set to a number of output data sets. Each
record of the input data set is copied to every output data set. Records can be
copied without modification or you can drop or change the order of columns.
Copy stage is
useful when we want to make a backup copy of a data set on disk while
performing an operation on another copy.
A Copy stage with a single input and a single output needs Force set to True. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
40. What is Change Capture stage? Which
execution mode would you use when you used for comparison of data?
The
Change Capture stage takes two input data sets, denoted before and after, and
outputs a single data set whose records represent the changes made to the
before data set to obtain the after data set.
The stage
produces a change data set, whose table definition is transferred from the
after data set’s table definition with the addition of one column: a change
code with values encoding the four actions: insert, delete, copy, and edit. The
preserve-partitioning flag is set on the change data set.
The compare is based on a set of key columns; rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other.
The stage assumes
that the incoming data is key-partitioned and sorted in ascending order. The
columns the data is hashed on should be the key columns used for the data
compare. You can achieve the sorting and partitioning using the Sort stage or
by using the built-in sorting and partitioning abilities of the Change Capture
stage.
We can use both
Sequential as well as parallel modes of execution for change capture stage.
41. What is Dataset Stage?
The Data Set
stage is a file stage. It allows you to read data from or write data to a data
set. It can be configured to execute in parallel or sequential mode. DataStage
parallel extender jobs use data sets to manage data within a job.
The Data Set
stage allows you to store data being operated on in a persistent form, which
can then be used by other DataStage jobs. Data sets are operating system files,
each referred to by a control file, which by convention has the suffix .ds.
Using data sets
wisely can be key to good performance in a set of linked jobs. You can also
manage data sets independently of a job using the Data Set Management utility,
available from the DataStage Designer, Manager, or Director.
42. How do you drop dataset?
There are two ways of dropping a data set: first, by using the Data Set Management utility (GUI) available from the Manager, Director or Designer; and second, by using the Unix command-line utility orchadmin.
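As a rough command-line sketch (the dataset path is a made-up placeholder, APT_CONFIG_FILE must point to the configuration file the dataset was created with, and the exact subcommand spelling can vary by version):

orchadmin rm /dsdata/customers.ds

Deleting only the .ds control file with a plain rm would leave the underlying data files behind on the nodes, which is why orchadmin (or the GUI utility) should be used instead.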
43. How do you eliminate duplicates in a Dataset?
The simplest way
to remove duplicate is by the use of Remove Duplicate Stage. The Remove
Duplicates stage takes a single sorted data set as input, removes all duplicate
rows, and writes the results to an output data set.
44. What is Peek Stage? When do you use it?
The Peek stage is
a Development/Debug stage. It can have a single input link and any number of
output links.
The Peek stage
lets you print record column values either to the job log or to a separate
output link as the stage copies records from its input data set to one or more
output data sets, like the Head stage and the Tail stage.
The Peek stage
can be helpful for monitoring the progress of your application or to diagnose a
bug in your application.
45. What are the different join options
available in Join Stage?
There are four join options available in the Join stage.
Inner: Transfers records from the input data sets whose key columns contain equal values to the output data set. Records whose key columns do not contain equal values are dropped.
Left Outer: Transfers all
values from the left data set but transfers values from the right data set and
intermediate data sets only where key columns match. The stage drops the key
column from the right and intermediate data sets.
Right Outer: Transfers all
values from the right data set and transfers values from the left data set and
intermediate data sets only where key columns match. The stage drops the key
column from the left and intermediate data sets.
Full Outer: Transfers
records in which the contents of the key columns are equal from the left and
right input data sets to the output data set. It also transfers records whose
key columns contain unequal values from both input data sets to the output data
set. (Full outer joins do not support more than two input links.)
The default is inner.
46. What is the difference between Lookup,
Merge & Join Stage?
These "three Stages"
combine two or more input links according to values of user-designated
"key" column(s).
They differ mainly in:
Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted,
de-duplicated)
The
main difference between joiner and lookup is in the way they handle the data
and the reject links. In joiner, no reject links are possible. So we cannot get
the rejected records directly. Lookup provides a reject link.
Also, Lookup is used if the data being looked up can fit in the available temporary memory. If the volume of data is quite huge, then it is safer to go for Join.
Join requires the input datasets to be key partitioned and sorted; Lookup does not have this requirement.
Lookup allows reject links; Join does not allow reject links. If the volume of data is too huge to fit into memory, go for Join and avoid Lookup, as paging can occur when Lookup is used.
The Merge stage allows us to capture failed lookups from each reference input separately. It also requires identically sorted and partitioned inputs and, if there is more than one reference input, de-duplicated reference inputs.
In the case of the Merge stage, as part of a pre-processing step, duplicates should be removed from the master dataset. If there is more than one update dataset, then duplicates should be removed from the update datasets as well.
The above-mentioned step is not required
for join and lookup stages.
47. What is RCP? How is it implemented?
DataStage
is flexible about Meta data. It can cope with
the situation where Meta data isn’t fully
defined. You can define part of your schema and specify that, if your job
encounters extra columns that are not defined in the meta data when it actually
runs, it will adopt these extra columns and propagate them through the rest of
the job. This is known as runtime column propagation (RCP).
This can be enabled for a project via the DataStage
Administrator, and set for individual links via the Outputs Page Columns tab
for most stages, or in the Outputs page General tab for Transformer stages. You
should always ensure that runtime column propagation is turned on.
RCP is implemented through a schema file. The schema file is a plain text file that contains a record (or row) definition.
48. What is row generator? When do you use
it?
The
Row Generator stage is a Development/Debug stage. It has no input links, and a
single output link.
The Row Generator
stage produces a set of mock data fitting the specified meta data.
This is useful
where we want to test our job but have no real data available to process. Row
Generator is also
useful when we want processing stages to execute at least once in the absence of data from the source.