1)
How can you do incremental load in DataStage?
You can create a table to store the last successful refresh time for each table/dimension. Then, in the source query, take the delta between the last successful refresh time and SYSDATE; that should give you the incremental load.
2)
How can we improve the performance of DataStage jobs?
Performance tuning of DataStage jobs: 1. Establish baselines. 2. Avoid using only one flow for tuning/performance testing. 3. Work in increments. 4. Evaluate data skew. 5. Isolate and solve. 6. Distribute file systems to eliminate bottlenecks. 7. Do not involve the RDBMS in initial testing. 8. Understand and evaluate the available tuning knobs.
3)
There are three types of views in DataStage Director:
a) Job View - dates when jobs were compiled.
b) Status View - status of the job's last run.
c) Log View - warning messages, event messages, program-generated messages.
4)
A hashed file is broadly classified into two types:
a) Static - subdivided into 17 types based on the primary key pattern.
b) Dynamic - subdivided into 2 types:
i) Generic
ii) Specific
The default hashed file is "Dynamic - Type Random 30 D".
5)
Orchestrate vs. DataStage Parallel Extender?
Orchestrate is itself an ETL tool with extensive parallel processing capabilities, running on UNIX platforms. DataStage used Orchestrate with DataStage XE (beta version of 6.0) to incorporate the parallel processing capabilities. Ascential then purchased Orchestrate, integrated it with DataStage XE and released a new version, DataStage 6.0, i.e. Parallel Extender.
6)
What are the different types of Type 2 dimension mapping?
Type 2 Dimension/Version Data mapping: in this mapping, an updated dimension row from the source is inserted into the target along with a new version number, and a newly added dimension row in the source is inserted into the target with a new primary key.
Type 2 Dimension/Flag Current mapping: this mapping is also used for slowly changing dimensions. In addition, it creates a flag value for changed or new dimension rows. The flag indicates whether the row is new or newly updated: current rows are saved with a flag value of 1, and updated (historical) rows are saved with a flag value of 0.
Type 2 Dimension/Effective Date Range mapping: this is another flavour of Type 2 mapping used for slowly changing dimensions. It also inserts both new and changed dimension rows into the target, and changes are tracked by an effective date range for each version of each dimension row.
7)
What is the aggregate cache in the Aggregator transformation?
When you run a session that uses an Aggregator transformation, the aggregator stores data in the aggregate cache until it completes the aggregate calculations.
8)
What are Sequencers?
Sequencers are job control programs that execute other jobs with preset Job parameters.
9)
What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run a job?
A. Under Windows: use the 'Wait For File' activity in a sequencer and then run the job. You could also schedule the sequencer around the time the file is expected to arrive.
B. Under UNIX: poll for the file; once the file has arrived, start the job or sequencer that depends on the file (see the sketch below).
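A minimal shell sketch of the UNIX approach (the file path, project and job names below are placeholders; dsjob's -run and -wait options are standard, but adapt the script to your environment):

#!/bin/sh
# Poll for the arrival of the trigger file, then kick off the job.
FILE=/data/incoming/customer.dat
until [ -f "$FILE" ]; do
    sleep 60        # check once a minute
done
# File has arrived: run the DataStage job and wait for it to finish.
dsjob -run -wait dwh_project LoadCustomers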
10)
How do you rename all of the jobs to support your new File-naming conventions?
Create an Excel spreadsheet with the new and old names. Export the whole project as a .dsx file. Write a Perl program which does a simple rename of the strings, looking them up in the Excel file. Then import the new .dsx file, preferably into a new project for testing, and recompile all jobs. Be cautious: the job names may also be referenced in your job control jobs or sequencer jobs, so you have to make the necessary changes to those sequencers.
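A simplified sketch of the rename step, assuming the old/new names have been saved from the spreadsheet into a plain whitespace-separated file (names.txt) and using sed in place of Perl purely for brevity; file and project names are placeholders:

#!/bin/sh
# names.txt holds one "old_name new_name" pair per line (exported from the spreadsheet).
cp project_export.dsx project_renamed.dsx
while read OLD NEW; do
    # Replace every occurrence of the old job name in the export with the new one.
    sed "s/$OLD/$NEW/g" project_renamed.dsx > tmp.dsx && mv tmp.dsx project_renamed.dsx
done < names.txt
# Import project_renamed.dsx into a test project and recompile all jobs.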
11)
What other performance tuning have you done in your last project to increase the performance of slowly running jobs?
1. Staged the data coming from ODBC/OCI/DB2UDB stages or any database on the server using hash/sequential files, for optimum performance and also for data recovery in case the job aborts.
2. Tuned the OCI stage 'Array Size' and 'Rows per Transaction' numerical values for faster inserts, updates and selects.
3. Tuned the 'Project Tunables' in the Administrator for better performance.
4. Used sorted data for the Aggregator.
5. Sorted the data as much as possible in the database and reduced the use of DS sorts for better job performance.
6. Removed the data not used from the source as early as possible in the job.
7. Worked with the DBA to create appropriate indexes on tables for better performance of DS queries.
8. Converted some of the complex joins/business logic in DS to stored procedures for faster execution of the jobs.
9. If an input file has an excessive number of rows and can be split up, then use standard logic to run jobs in parallel.
10. Before writing a routine or a transform, make sure that the required functionality is not already available in one of the standard routines supplied in the SDK or DS Utilities categories.
11. Constraints are generally CPU intensive and take a significant amount of time to process. This may be the case if the constraint calls routines or external macros, but if it is inline code then the overhead will be minimal.
12. Try to have the constraints in the 'Selection' criteria of the jobs itself. This will eliminate the unnecessary records before joins are made.
13. Tuning should occur on a job-by-job basis.
14. Use the power of the DBMS.
15. Try not to use a Sort stage when you can use an ORDER BY clause in the database.
16. Using a constraint to filter a record set is much slower than performing a SELECT … WHERE….
Make every attempt to use the bulk loader for your particular database. Bulk loaders are generally faster than using ODBC or OLE.
Minimise the usage of the Transformer (use Copy, Modify, Filter or Row Generator instead).
Use SQL code while extracting the data.
Handle the nulls.
Minimise the warnings.
Reduce the number of lookups in a job design.
Use not more than 20 stages in a job.
Use an IPC stage between two passive stages; this reduces processing time.
Drop indexes before loading data and recreate them after loading data into tables.
Generally we cannot avoid lookups if our requirements make them compulsory.
There is no hard limit on the number of stages (like 20 or 30), but we can break a big job into smaller jobs and use Dataset stages to store the intermediate data.
The IPC stage is provided in server jobs, not in parallel jobs.
Check the write cache of the hash file. If the same hash file is used for lookup as well as target, disable this option.
If the hash file is used only for lookup then enable 'Preload to memory'. This will improve the performance. Also, check the order of execution of the routines.
Don't use more than 7 lookups in the same Transformer; introduce new Transformers if it exceeds 7 lookups.
Use the 'Preload to memory' option on the hash file output.
Use 'Write to cache' on the hash file input.
Write into the error tables only after all the Transformer stages.
Reduce the width of the input record - remove the columns that you will not use.
Cache the hash files you are reading from and writing into. Make sure your cache is big enough to hold the hash files.
Use ANALYZE.FILE or HASH.HELP to determine the optimal settings for your hash files. This will also minimize overflow on the hash file.
If possible, break the input into multiple streams and run multiple instances of the job (see the sketch below).
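A rough sketch of the split-and-run-in-parallel idea, assuming the job (LoadSales in project dwh_project, both placeholder names) has been compiled as a multi-instance job that takes the chunk file name as a parameter; dsjob accepts an invocation ID appended to the job name for multi-instance jobs:

#!/bin/sh
# Split the large input file into chunks of 250,000 lines each.
split -l 250000 sales.dat sales.part.
N=0
for CHUNK in sales.part.*; do
    N=`expr $N + 1`
    # Run one instance of the job per chunk, passing the chunk as a job parameter.
    dsjob -run -param "SourceFile=$CHUNK" dwh_project LoadSales.inst$N &
done
wait    # wait for all instances to finish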
12)
How did you handle an 'Aborted' sequencer?
13)
How do you merge two files in DS?
Either use a copy/concatenate command as a before-job subroutine if the metadata of the two files is the same, or create a job to concatenate the two files into one if the metadata is different.
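For the same-metadata case, the before-job subroutine (for example ExecSH) can simply concatenate the files at the operating-system level; a minimal sketch with placeholder file names:

# Concatenate two files with identical layouts into a single input file.
cat file1.dat file2.dat > merged.dat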
14)
How can you do incremental load in DataStage?
Incremental load means the daily (delta) load. Whenever you are selecting data from the source, select the records which were loaded or updated between the timestamp of the last successful load and today's load start date and time. For this you have to pass parameters for those two dates: store the last run date and time in a file, read it through a job parameter, and set the second parameter to the current date and time.
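A minimal sketch of that approach, assuming the last successful run timestamp is kept in a one-line file (last_run.txt) and the job expects two parameters, FromDate and ToDate (all of these names are placeholders):

#!/bin/sh
LAST_RUN=`cat last_run.txt`                 # e.g. 2011-06-30 23:59:59
NOW=`date '+%Y-%m-%d %H:%M:%S'`
# Load the window between the last successful run and now.
dsjob -run -wait -param "FromDate=$LAST_RUN" -param "ToDate=$NOW" dwh_project IncrementalLoad
# Only on success, record the new high-water mark for the next run.
if [ $? -eq 0 ]; then
    echo "$NOW" > last_run.txt
fi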
15)
What is Full load & Incremental or Refresh load?
Full load is the entire data dump that takes place the very first time.
After that, to keep the target data synchronized with the source data, there are two further techniques:
Refresh load - the existing data is truncated and reloaded completely.
Incremental load - only the delta, or difference between target and source data, is loaded at regular intervals. The timestamp of the previous delta load has to be maintained.
16)
What are the Oconv() and Iconv() functions and where are they used?
IConv() - Converts a string to an internal storage format.
OConv() - Converts an expression to an output format.
17)
What is the difference between server jobs and parallel jobs?
Server jobs: these are available if you have installed DataStage Server. They run on the DataStage server, connecting to other data sources as necessary.
Parallel jobs: these are only available if you have installed Enterprise Edition. They run on DataStage servers that are SMP, MPP, or cluster systems. They can also run on a separate z/OS (USS) machine if required.
What is overloading of procedures?
--------------------------------------------------------------------------------
Overloaded procedures are two or more procedures with the same name but different arguments. The arguments need to differ by class itself; e.g. CHAR and VARCHAR2 are from the same class, so they do not count as different.
Packages - the main advantages of packages are:
1. Since a package has its specification and body separate, whenever any DDL is run and a procedure/function inside the package depends on the affected object, only the body gets invalidated and not the spec. So any other procedure/function dependent on the package does not get invalidated.
2. Whenever any function/procedure from the package is called, the whole package is loaded into memory, so all objects of the package are available in memory, which means faster execution if any of them is called. Since we put all related procedures/functions in one package, this feature is useful as we may need to run most of the objects.
UNIX FILE OPERATIONS:
ls list files
cp copy files: cp /path/name newname
mv move or rename files: mv name newname
rm remove (i.e. delete) files: rm name
chmod change mode of file permissions: chmod xxx name
cat scroll file contents: cat name
more page file contents (spacebar to continue): more name
less better pager than more? (q to quit): less name
view view file contents (:q to quit): view name
vi visual text editor (:wq to save and quit): vi name
pico pico text editor (Ctrl-X to quit): pico name
DIRECTORY OPERATIONS:
mkdir make (create) new directory: mkdir Name
cd change directory: cd /path/name
cd change to your home directory: cd
rmdir remove directory (if empty): rmdir Name
pwd print working directory (show directory name)
quota check disk space quota: quota -v
SYSTEM OPERATIONS:
df show free disk space
du show disk usage
ps list your processes
kill kill a process: kill ###
passwd change your password
date show date and time
w who is doing what on the system
who who is connected to the system
cal display a calendar
ping ping another computer (is it alive?)
finger get information on users
exit exit, or logout, from the system
COMMUNICATIONS:
write write messages to another user's screen
talk talk split-screen with another user: talk username
pine send or read E-mail with pine mail system
mail UNIX mail system
rtin read news with tin newsreader
telnet connect to another computer via the network
ftp file transfer over the network
Most commands have options, which are described in the online man pages.
- What is the difference between server jobs and parallel jobs?
--- Server jobs generate DataStage BASIC, parallel jobs generate Orchestrate shell script (osh) and C++, and mainframe jobs generate COBOL and JCL.
--- In server and mainframe you tend to do most of the work in Transformer stage. In parallel you tend to use specific stage types for specific tasks (and the Transformer stage doesn't do lookups). There are many more stage types for parallel than server or mainframe, and parallel stages correspond to Orchestrate operators.
--- Also there's the automatic partitioning and collection of data in the parallel environment, which would have to be managed manually (if at all) in the server environment.
- What is a Parallel Processing Environment? What are the types of Parallel Processing Environments?
--- It is the environment in which you run your DataStage jobs. The following are the types of Parallel Processing Environments:
- SMP (Symmetric Multi-Processing): in SMP, multiple processors share the same memory; i.e. the processors communicate via shared memory and have a single operating system.
- MPP (Massively Parallel Processing): in MPP, each processor has exclusive access to its own memory (also known as shared-nothing) and has its own operating system. MPP systems are physically housed in the same box, whereas cluster systems are physically dispersed and communicate via a high-speed network.
- What is Orchestrate?
--- Orchestrate is the old name of the underlying parallel execution engine. Ascential renamed the technology "Parallel Extender". The DataStage PX GUI generates OSH (Orchestrate Shell) scripts for the jobs you run.
An OSH script is a quoted string which specifies the operators and connections of a single Orchestrate step. In its simplest form, it is: op < in.ds > out.ds,
where op is an Orchestrate operator, in.ds is the input dataset and out.ds is the output dataset.
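As a concrete illustration (a sketch only; the operator and dataset names are placeholders), such a step is run through the osh command, for example:

osh "copy < in.ds > out.ds"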
- Orchestrate vs. DataStage Parallel Extender?
- What are the types of parallelism?
--- There are 2 types of parallel processing:
- Pipeline parallelism - the ability for a downstream stage to begin processing a row as soon as an upstream stage has finished processing that row (rather than processing one row completely through the job before beginning the next row). In parallel jobs, it is managed automatically.
For example, consider a job (src → Transformer → Tgt) running on a system having three processors:
--- The source stage starts running on one processor, reads the data from the source and starts filling a pipeline with the read data.
--- Simultaneously, the Transformer stage starts running on another processor, processes the data in the pipeline and starts filling another pipeline.
--- At the same time, the target stage starts running on another processor and writes data to the target as soon as the data is available.
- Partitioning parallelism - the entire record set is partitioned into smaller sets and processed on different nodes; that is, several processors can run the same job simultaneously, each handling a separate subset of the total data. For example, if there are 100 records and 4 logical nodes, each node processes 25 records. This enhances the speed at which loading takes place.
- Is pipeline parallelism in PX the same as what the Inter-process (IPC) stage does in Server?
--- YES. The IPC stage is a stage which helps one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage; as soon as data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read.
Note: the Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism.
- What are the partitioning methods available in PX?
The partitioning methods available in PX are:
1. Auto:
--- It chooses the best partitioning method depending on:
- The mode of execution of the current stage and the preceding stage.
- The number of nodes available in the configuration file.
2. Round robin:
--- Here, the first record goes to the first processing node, the second to the second processing node, and so on. This method is useful for resizing partitions of an input dataset that are not equal in size into approximately equal-sized partitions. DataStage uses 'Round robin' when it partitions the data initially.
3. Same:
--- It uses the same partitioning method as the preceding stage. The records stay on the same processing node; that is, data is not redistributed or repartitioned. Same is considered the fastest partitioning method. DataStage uses 'Same' when passing data between stages in a job.
4. Random:
--- It distributes the records randomly across all processing nodes and guarantees that each processing node receives approximately equal-sized partitions.
5. Entire:
--- It distributes the complete dataset as input to every instance of a stage on every processing node. It is mostly used with stages that create lookup tables for their input.
6. Hash:
--- It distributes all the records with identical key values to the same processing node so as to ensure that related records are in the same partition. This does not necessarily mean that the partitions will be equal in size.
--- When hash partitioning, hashing keys that create a large number of partitions should be selected. Reason: for example, if you hash partition a dataset based on a zip code field where a large percentage of records are from one or two zip codes, it can lead to bottlenecks because some nodes are required to process more records than other nodes.
7. Modulus:
--- Partitioning is based on a key column modulo the number of partitions. The modulus partitioner assigns each record of an input dataset to a partition of its output dataset as determined by a specified key field in the input dataset.
8. Range:
--- It divides a dataset into approximately equal-sized partitions, each of which contains records with key columns within a specific range. It guarantees that all records with the same partitioning key values are assigned to the same partition. Note: in order to use a Range partitioner, a range map has to be made using the 'Write Range Map' stage.
9. DB2:
--- Partitions an input dataset in the same way that DB2 would partition it. For example, if this method is used to partition an input dataset containing update information for an existing DB2 table, records are assigned to the processing node containing the corresponding DB2 record. Then, during the execution of the parallel operator, both the input record and the DB2 table record are local to the processing node.
4. What is re-partitioning? When will re-partitioning actually occur?
--- Re-partitioning is the rearranging of data among the partitions. In a job, a parallel-to-parallel flow results in re-partitioning.
For example, consider EMP data that is initially processed based on SAL, but now you want to process the data grouped by DEPTNO. Then you will need to repartition to ensure that all the employees falling under the same DEPTNO are in the same group.
5. What are the Iconv() and Oconv() functions and where are they used? Can we use these functions in PX?
--- 'Iconv()' converts a string to an internal storage format.
Syntax: Iconv(string, code[@VM code]...)
- string evaluates to the string
to be converted.
- Code indicates the conversion code which specifies how the data needs
to be formatted for output or internal storage. Like,
MCA – Extracts
alphabetic characters from a field.
MCN – Extracts numeric
characters from a field.
MCL - Converts
Uppercase Letters to Lowercase….
--- 'Oconv()' converts an expression to an output format.
Syntax: Oconv(expression, conversion[@VM conversion]...)
- Expression is a string stored in internal format that needs to be converted
to output format.
- Conversion indicates the conversion code which specifies how the string needs to be formatted.
MCA – Extracts
alphabetic characters from a field.
MCN – Extracts numeric
characters from a field.
MCL - Converts
Uppercase Letters to Lowercase….
These functions can't be used directly in PX. The only stage which allows the usage of Iconv() and Oconv() in PX is the 'BASIC Transformer' stage. It gives access to the functions supported by the DataStage Server engine.
Note: the BASIC Transformer can be used only on SMP systems, not on MPP or cluster systems.
1.
What does a Configuration
File in parallel extender consist of?
- The Configuration File consists of all the processing nodes of a
parallel system.
- It can be defined and edited using the DataStage Manager. It
describes every processing node that DataStage uses to run an
application. When you run a job, DataStage first reads the Configuration
File to determine the available nodes.
- When a system is modified by adding or removing processing nodes
or by reconfiguring nodes, the DataStage jobs need not be altered or
even recompiled. Instead, editing the Configuration file alone will
suffice.
- The configuration file also gives control over parallelization of
a job during development cycle. For example, by editing the Configuration
file, first a job can be run on a single processing node, then on two
nodes, then four, and so on.
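A minimal sketch of what a one-node configuration file might look like and how a job can be pointed at it; the node name, fastname and directory paths are placeholders, and APT_CONFIG_FILE is the environment variable DataStage reads to locate the file:

cat > one_node.apt <<'EOF'
{
    node "node1"
    {
        fastname "etlserver"
        pools ""
        resource disk "/data/ds/datasets" { pools "" }
        resource scratchdisk "/data/ds/scratch" { pools "" }
    }
}
EOF
# Point the jobs at this configuration file.
APT_CONFIG_FILE=/data/ds/config/one_node.apt
export APT_CONFIG_FILE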
- What is the difference between a fileset and a dataset?
Dataset:- Datasets are
operating system files, each referred to by a control file, which has the
suffix .ds. PX jobs use datasets to manage data within a job. The data in the
Datasets is stored in internal format.
A Dataset consists of two parts:
- Descriptor file: It contains metadata and data location.
- Data file: It contains the data.
--The Dataset
Stage is used to read data from or write data to a dataset. It allows you
to store data in persistent form, which can then be used by other jobs.
Fileset:- DataStage can
generate and name exported files, write them to their destination, and list the
files it has generated in a file with extension, .fs. The data files and the
file that lists them is called a ‘Fileset’.
A fileset consists of two parts:
- Descriptor file: It contains location of raw data files and the
meta data.
- Individual raw Data files.
-- The Fileset
Stage is used to read data from or write data to a fileset.
3. Lookup stage: is it persistent or non-persistent? (What is happening behind the scenes?)
--- The Lookup stage is non-persistent.
- How can we maintain the partitioning in Sort stage?
-- Partitioning in
Sort stage can be maintained using the Partitioning method, ‘Same’.
For example, assume you sort a dataset on a system
with four processing nodes and store the results to a Dataset stage. The
dataset will therefore have four partitions. You then use that dataset as input
to a stage executing on a different number of nodes. DataStage automatically
repartitions the dataset to spread it out to all the processing nodes. This
destroys the sort order of the data.
This can be avoided by specifying the Same
Partitioning method so that the original partitions are preserved.
- Where do we need partitioning (in processing or somewhere else)?
--- Partitioning is needed in processing; that is, we need partitioning wherever we have huge volumes of data to process.
- If we use SAME partitioning in the first stage, which partitioning method will it take?
--- DataStage uses 'Round robin' when it partitions the data initially.
- What is the symbol we will get when we are using the round robin partitioning method?
-- BOW TIE.
Given below is the list of icons that appear on the link, based on the mode of execution (parallel or sequential) of the current stage and the preceding stage, and the type of partitioning method (preceding stage → current stage):
Sequential mode → (FAN OUT) → Parallel mode (indicates partitioning)
Parallel mode → (FAN IN) → Sequential mode (indicates collecting)
Parallel mode → (BOX) → Parallel mode (indicates the Auto method)
Parallel mode → (BOW TIE) → Parallel mode (indicates repartitioning)
Parallel mode → (PARALLEL LINES) → Parallel mode (indicates SAME partitioning)
- If we check 'Preserve partitioning' in one stage and don't give any partitioning method (Auto) in the next stage, which partitioning method will it use?
-- In this case, the partitioning method used by the preceding stage is used.
-- Preserve Partitioning indicates whether
the stage wants to preserve the
partitioning at the next stage of the job. Options
in this tab are:
- Set – Sets the Preserve partitioning flag.
- Clear – Clears the preserve partitioning flag.
- Propagate – Sets the flag to Set or Clear depending on the
option selected in the previous stage.
- Can we give node allocations, i.e. 4 nodes for one stage and 3 nodes for the next stage?
--- YES. It can be done using the option 'Node map constraint', available in the Stage page → Advanced tab.
--- Checking this option allows you to constrain
Parallel execution to the nodes defined in a node map. A ‘Node map’ can
be defined by either of the below 2 options:
- By clicking the browse button to open the available nodes dialog
box and selecting nodes from there. (or)
- By typing node numbers into the text box.
- What is combinability, non-combinability?
--- Using Combinability, DataStage combines the operators that underlie parallel
stages so that they run in the same process. It lets the DataStage compiler
potentially 'optimize' the number of processes used at runtime by combining
operators. This saves significant amount of data copying and preparation in
passing data between operators. It has three options to set:
- Auto: Use the default combination method.
- Combinable: Combine all
possible operators.
- Don’t combine: Never
combine operators.
Usually this setting is left to its default so that
DataStage can tune jobs for performance automatically.
- What are schema files?
-- ‘Schema file’ is a
plain text file in which meta data for a stage is specified.
--- It is stored outside the DataStage Repository
in a document management system or a source code control system. In contrast,
Table definitions are stored in DataStage Repository and can be loaded into
stages as and when required.
--- Schema is an alternative way to specify column
definitions for the data used by the parallel jobs. By default most parallel
job stages take their meta data from the columns tab. For some stages you can
specify a property that causes the stages to take its meta data from the
specified schema file.
--- A Schema consists of a record definition. The
following is an example for record schema:
record(
name:string[];
address:nullable string[];
date:date[];
)
--- Import any table definition, load it into a
parallel job, then choose "Show Schema". The schema
defines the columns and their characteristics in a data set, and may also
contain information about key columns.
21. Why do we need datasets rather than sequential files?
--- A Sequential file as the source or target needs
to be repartitioned as it is(as name suggests) a single sequential stream of
data. A dataset can be saved across nodes using partitioning method selected,
so it is always faster when we used as a source or target. The Data Set
stage allows you to store data being operated on in a persistent form, which
can then be used by other
DataStage jobs. Data sets are operating system
files, each referred to by a control file, which by convention has the suffix
.ds. Using datasets wisely can be key to good performance in a set of linked
jobs.
22. Does the Lookup stage return multiple rows or a single row?
--- The Lookup stage returns the rows related to the key values. It can return multiple rows, depending on the keys you specify for the lookup in the stage.
23. Why do we need the Sort stage when there is the sort-merge collecting method and the 'Perform sort' option in a stage's advanced properties?
--- The Sort stage is used to perform more complex sort operations which are not possible using a stage's Advanced tab properties.
Many stages have an optional sort function via the
partition tab. This means if you are partitioning your data in a stage you can
define the sort at the same time. The sort stage is for use when you don't have
any stage doing partitioning in your job but you still want to sort your data,
or if you want to sort your data in descending order, or if you want to use one
of the sort stage options such as "Allow Duplicates" or "Stable
Sort". If you are processing very large volumes and need to sort you will
find the sort stage is more flexible than the partition tab sort.
24. For surrogate key generator stage where will be
the next value stored?
25. In the Surrogate Key Generator stage, how does it generate the numbers (based on nodes or based on rows)?
--- Key values are generated based on the nodes, and the input data partitions should be perfectly balanced across the nodes. This can be achieved using the round robin partitioning method when your starting point is sequential.
26. What is the preserve partitioning flag in the Advanced tab?
--- It indicates whether the stage wants to preserve partitioning at the next stage of the job. There are three options: 1. Set 2. Clear 3. Propagate.
Set. Sets the
preserve partitioning flag, this indicates to the next stage in the job that it
should preserve existing partitioning if possible.
Clear. Clears the
preserve partitioning flag, this indicates that this stage doesn’t care which
partitioning method the next stage uses.
Propagate. Sets the flag to Set or Clear depending on what the previous stage of the job has set (or, if that is set to Propagate, the stage before that, and so on until a preserve partitioning flag setting is encountered).
27. What is the difference between stages and
operators?
--- Stages are the generic user interface through which we can read from and write to files and databases, troubleshoot and develop jobs, and also process data. The different types of stages are:
Database. These are
stages that read or write data contained in a database. Examples of database
stages are the Oracle Enterprise and DB2/UDB Enterprise stages.
Development/Debug. These
are stages that help you when you are developing and troubleshooting parallel
jobs. Examples are the Peek and Row Generator stages.
File. These are
stages that read or write data contained in a file or set of files. Examples of
file stages are the Sequential File and Data Set stages.
Processing. These are
stages that perform some processing on the data that is passing through them.
Examples of processing stages are the Aggregator and Transformer stages.
Real Time. These are the
stages that allow Parallel jobs to be made available as RTI services. They
comprise the RTI Source and RTI Target stages. These are part of the optional
Web Services package.
Restructure. These are
stages that deal with and manipulate data containing columns of complex data
type. Examples are Make Sub record and Make Vector stages.
--- Operators are the basic functional units of an Orchestrate application. In the Orchestrate framework, each DataStage stage generates an Orchestrate operator directly.
28. Why do we need the Filter, Copy and Column Export stages instead of the Transformer stage?
--- In parallel jobs we have specific stage types
for performing specialized tasks. Filter, copy, column export stages are
operator stages. These operators are the basic functional units of an
orchestrate application. The operators in your Orchestrate application pass
data records from one operator to the next, in pipeline fashion. For example,
the operators in an application step might start with an import operator, which
reads data from a file and converts it to an Orchestrate data set. Subsequent
operators in the sequence could perform various processing and analysis tasks.
The processing power of Orchestrate derives largely from its ability to execute
operators in parallel on multiple processing nodes. By default, Orchestrate
operators execute on all processing nodes in your system. Orchestrate
dynamically scales your application up or down in response to system
configuration changes, without requiring you to modify your application. Thus
using operator stages will increase the speed of data processing applications
rather than using transformer stages.
29. Describe the types of Transformers used in
DataStage PX for processing and uses?
Transformer
Basic Transformer
Transformer-: The
Transformer stage is a processing stage. Transformer stages allow you to create
transformations to apply to your data. These transformations can be simple or
complex and can be applied to individual columns in your data. Transformations
are specified using a set of functions. Transformer stages can have a single
input and any number of outputs. It can also have a reject link that takes any
rows, which have not been written to any of the outputs links by reason of a
write failure or expression evaluation failure.
Basic Transformer: The BASIC Transformer stage is also a processing stage. It is similar
in appearance and function to the Transformer stage in parallel jobs. It gives
access to BASIC transforms and functions (BASIC is the language supported by
the DataStage server engine and available in server jobs). BASIC Transformer
stage can have a single input and any number of outputs.
30. What is aggregate cache in aggregator
transformation?
--- Aggregate cache is the memory used for grouping
operations by the aggregator stage.
31. What will you do in a situation where somebody wants to send you a file, use that file as an input or reference, and then run a job?
--- This situation can be handled using ‘Wait-for-file’
activity between the job- activities in a Job sequence.
Under the ‘Wait For File’ tab set the
following options:
- File name.
- Check any of the options:
- Wait for file to appear.
- Wait for file to disappear.
- Timeout length - (hh:mm:ss).
Under the ‘Triggers’ tab set the:
- Expression Type and
- Expression
For the corresponding Input Links.
The available 'Expression Types' (Expression Type → Expression) are:
Unconditional → N/A
Otherwise → N/A
OK (Conditional) → "Executed OK"
Failed (Conditional) → "Execution Failed"
Return Value (Conditional) → =
Custom (Conditional) → <LHS> = <RHS>
32. How do you rename all of the jobs to support
your new File-naming conventions?
--- Create a file with new and old names. Export
the whole project as a dsx. Write a script, which can do a simple rename of the
strings looking up the file. Then import the new dsx file and recompile all
jobs. Be cautious that the name of the jobs has also been changed in your job
control jobs or Sequencer jobs. So you have to make the necessary changes to
these Sequencers.
33. How do you merge two files in DS?
--- We can merge two files in 3 different ways: either go for the Merge stage, the Join stage or the Lookup stage. All these merges or joins occur based on the key values. The three stages differ mainly in the memory
they use, the treatment of rows with unmatched keys, and their requirements for
data being input of key columns.
34. How did you handle an 'Aborted' sequencer?
--- By using checkpoint information we can restart the sequence from the point of failure. If you have enabled checkpoint information, reset the aborted job and run the sequence again.
35. What are Performance tunings you have done in
your last project to increase the performance of slowly running jobs?
1. Using the Dataset stage instead of sequential files wherever necessary.
2. Using the Join stage instead of the Lookup stage when the data is huge.
3. Using operator stages like Remove Duplicates, Filter, Copy etc. instead of the Transformer stage.
4. Sorting the data before sending it to the Change Capture stage or Remove Duplicates stage.
5. Hash partitioning and sorting on the key column before aggregate operations.
6. Filtering unwanted records at the beginning of the job flow itself.
36. If data is partitioned in your job on key 1 and
then you aggregate on key 2, what issues could arise?
--- It will result in false output even though job
runs successfully. In aggregator key value should be hash partitioned so that
identical key values will be in the same node, which gives the desired result
and also eases grouping operation.
37. What is Full load & Incremental or Refresh
load?
38. Describe cleanup resource and clear status
file?
The Cleanup Resources command is to:
• View and end job processes
• View and release the associated locks
Clear Status file command is for resetting the
status records associated with all stages in that job.
The Cleanup Resources command and the Clear Status File command are available in the Director under the Job menu.
39. What is lookup stage? Can you define
derivations in the lookup stage output?
Lookup stage is used to perform lookup operations
on a data set read into memory from any other Parallel job stage that can
output data. It can also perform lookups directly in a DB2 or Oracle database
or in a lookup table contained in a Lookup File Set stage.
The most common use for a lookup is to map short
codes in the input data set onto expanded information from a lookup table which
is then joined to the incoming data and output.
Lookups can also be used for validation of a row.
If there is no corresponding entry in a lookup table to the key’s values, the
row is rejected.
The Lookup stage can have a reference link, a
single input link, a single output link, and a single rejects link. Depending
upon the type and setting of the stage(s) providing the look up information, it
can have multiple reference links (where it is directly looking up a DB2 table
or Oracle table, it can only have a single reference link).
The input link carries the data from the source
data set and is known as the primary link. For each record of the source data
set from the primary link, the Lookup stage performs a table lookup on each of
the lookup tables attached by reference links. The table lookup is based on the
values of a set of lookup key columns, one set for each table. The keys are
defined on the Lookup stage.
Lookup stages do not require data on the input link
or reference links to be sorted. Each record of the output data set contains
columns from a source record plus columns from all the corresponding lookup
records where corresponding source and lookup records have the same value for
the lookup key columns. The lookup key columns do not have to have the same
names in the primary and the reference links. The optional reject link carries
source records that do not have a corresponding entry in the input lookup
tables.
There are some special partitioning considerations
for lookup stages. You need to ensure that the data being looked up in the
lookup table is in the same partition as the input data referencing it. One way
of doing this is to partition the lookup tables using the Entire method.
Another way is to partition it in the same way as the input data (although this
implies sorting of the data).
Yes, we can define
derivations in the lookup stage output.
40. Parallel DataStage jobs can have many
sources of reference data for lookups including database tables, sequential
files or native datasets. Which is the most efficient?
--- In DataStage server jobs the answer is
quite simple, local hash files are the fastest method of a key based lookup, as
long as the time taken to build the hash file does not wipe out your benefits
from using it.
In DataStage parallel jobs there is a very large number of stages that can be used as a lookup, a much wider variety than in server jobs; this includes most
data sources and the parallel staging format of
datasets and lookup filesets. Database lookups can be discounted as the
overhead of the database connectivity and any network passage makes them slower
than most local storage.
So, choosing a source of reference data for lookups depends on the size of the lookup source:
-- If your lookup size is low enough to fit into
memory then the source is irrelevant, they all load up very quickly, even
database lookups are fast.
-- If you have very large lookup files spilling
into lookup table resources then the Lookup fileset outstrips the other
options.
-- A Join becomes a viable option. They are
a bit harder to design as you can only join one source at a time whereas a
lookup can join multiple sources.
41. What is copy stage? When do you use that?
The Copy stage copies a single input data set to a
number of output data sets. Each record of the input data set is copied to every
output data set. Records can be copied without modification or you can drop or
change the order of columns.
Copy stage is useful when we want to make a backup
copy of a data set on disk while performing an operation on another copy.
A Copy stage with a single input and a single output needs 'Force' set to TRUE. This prevents DataStage from deciding that the Copy operation is superfluous and optimizing it out of the job.
42. What is Change Capture stage? Which execution
mode would you use when you used for comparison of data?
The Change Capture stage takes two input data sets,
denoted before and after, and outputs a single data set whose records represent
the changes made to the before data set to obtain the after data set.
The stage produces a change data set, whose table
definition is transferred from the after data set’s table definition with the
addition of one column: a change code with values encoding the four actions:
insert, delete, copy, and edit. The preserve-partitioning flag is set on the change
data set.
The compare is based on a set of key columns;
rows from the two data sets are assumed to be copies of one another if they
have the same values in these key columns. You can also optionally specify
change values. If two rows have identical key columns, you can compare the
value columns in the rows to see if one is an edited copy of the other.
The stage assumes that the incoming data is
key-partitioned and sorted in ascending order. The columns the data is hashed
on should be the key columns used for the data compare. You can achieve the
sorting and partitioning using the Sort stage or by using the built-in sorting
and partitioning abilities of the Change Capture stage.
We can use both Sequential as well as parallel
modes of execution for change capture stage.
43. What is Dataset Stage?
The Data Set stage is a file stage. It allows you
to read data from or write data to a data set. It can be configured to execute
in parallel or sequential mode. DataStage parallel extender jobs use data sets
to manage data within a job.
The Data Set stage allows you to store data being
operated on in a persistent form, which can then be used by other DataStage
jobs. Data sets are operating system files, each referred to by a control file,
which by convention has the suffix .ds.
Using data sets wisely can be key to good
performance in a set of linked jobs. You can also manage data sets
independently of a job using the Data Set Management utility, available from
the DataStage Designer, Manager, or Director.
44. How do you drop dataset?
There are two ways of dropping a data set: the first is by using the Data Set Management utility (GUI) located in the Manager, Director, or Designer, and the second is by using the UNIX command-line utility orchadmin.
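A sketch of the command-line route; the dataset path is a placeholder, orchadmin needs a valid APT_CONFIG_FILE in the environment so it can locate the data files on every node, and the exact subcommand (rm or delete) can vary by version:

# Deletes the dataset descriptor file and its data files on all nodes.
orchadmin rm /data/ds/datasets/customers.ds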
45. How do you eliminate duplicate in a Dataset?
The simplest way to remove duplicates is by using the Remove Duplicates stage. The Remove Duplicates stage takes a single sorted
data set as input, removes all duplicate rows, and writes the results to an
output data set.
46. What is Peek Stage? When do you use it?
The Peek stage is a Development/Debug stage. It can
have a single input link and any number of output links.
The Peek stage lets you print record column values
either to the job log or to a separate output link as the stage copies records
from its input data set to one or more output data sets, like the Head stage
and the Tail stage.
The Peek stage can be helpful for monitoring the
progress of your application or to diagnose a bug in your application.
47. What are the different join options available
in Join Stage?
There are four join option available in the join
stage.
Inner: Transfers
records from input data sets whose key columns contain equal values to the
output data set. Records whose key columns do not contain equal values are
dropped.
Left Outer: Transfers all
values from the left data set but transfers values from the right data set and
intermediate data sets only where key columns match. The stage drops the key
column from the right and intermediate data sets.
Right Outer: Transfers all
values from the right data set and transfers values from the left data set and
intermediate data sets only where key columns match. The stage drops the key
column from the left and intermediate data sets.
Full Outer: Transfers records
in which the contents of the key columns are equal from the left and right
input data sets to the output data set. It also transfers records whose key
columns contain unequal values from both input data sets to the output data
set. (Full outer joins do not support more than two input links.)
The default is inner.
48. What is the difference between Lookup, Merge
& Join Stage?
These "three Stages" combine two or more
input links according to values of user-designated "key" column(s).
They differ mainly in:
Memory usage
Treatment of rows with unmatched key values
Input requirements (sorted, de-duplicated)
The main difference between joiner and lookup is in
the way they handle the data and the reject links. In joiner, no reject links
are possible. So we cannot get the rejected records directly. Lookup provides a
reject link.
Also lookup is used if the data being looked up can
fit in the available temporary memory. If the volume of data is quite huge, then it is safer to go for the Join stage.
Join requires the input datasets to be key partitioned and sorted. Lookup does not have this requirement.
Lookup allows reject links. Join does not allow
reject links. If the volume of data is huge to be fit into memory you go for
join and avoid lookup as paging can occur when lookup is used.
The Merge stage allows us to capture failed lookups from each reference input separately. It also requires identically sorted and partitioned inputs and, if there is more than one reference input, de-duplicated reference inputs.
In the case of the Merge stage, as a pre-processing step, duplicates should be removed from the master dataset. If there is more than one update dataset, then duplicates should be removed from the update datasets as well.
The above-mentioned step is not required for the Join and Lookup stages.
49. What is RCP? How it is implemented?
DataStage is flexible about metadata. It can cope with the situation where metadata isn't fully defined. You can define part of your schema and specify that,
if your job encounters extra columns that are not defined in the meta data when
it actually runs, it will adopt these extra columns and propagate them through
the rest of the job. This is known as runtime column propagation (RCP).
This can be enabled for a project via the DataStage
Administrator, and set for individual links via the Outputs Page Columns tab
for most stages, or in the Outputs page General tab for Transformer stages. You
should always ensure that runtime column propagation is turned on.
RCP is implemented through Schema File. The schema
file is a plain text file contains a record (or row) definition.
50. What is row generator? When do you use it?
The Row Generator stage is a Development/Debug
stage. It has no input links, and a single output link.
The Row Generator stage produces a set of mock data
fitting the specified meta data.
This is useful where we want to test our job but
have no real data available to process. Row
Generator is also useful when we want processing
stages to execute at least once in absence of data from the source.
51. How do you extract data from more than one heterogeneous source?
We can extract data from different sources (e.g. Oracle, DB2, sequential files, etc.) in a job. After getting the data we can use a Join, Merge, Aggregator or Lookup stage to unify the incoming data.
52. How can we pass parameters to job by using
file?
This can be done through a shell script that reads the parameters from the file and calls the dsjob command to run the job with those parameter values, as in the sketch below.
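A minimal sketch, assuming the parameter file holds one name=value pair per line (with no spaces in the values) and placeholder project/job names:

#!/bin/sh
# params.txt contains lines such as:  LoadDate=2011-06-30
PARAMS=""
while read LINE; do
    PARAMS="$PARAMS -param $LINE"
done < params.txt
dsjob -run -wait $PARAMS dwh_project LoadCustomers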