The DataStage configuration file is a master
control file (a text file that sits on the server side) for a job. It
describes the parallel system resources and architecture. The
configuration file provides the hardware configuration for supporting
architectures such as SMP (a single machine with multiple CPUs, shared memory
and disk), Grid, Cluster or MPP (multiple CPUs, multiple nodes and
dedicated memory per node). DataStage understands the architecture of the
system through this file.
This is one of the biggest strengths of DataStage. If you change your
processing configuration, your servers or your platform, you do not have to
worry about the change affecting your jobs, since all the jobs depend on this
configuration file for execution. DataStage jobs determine which node to run a
process on, where to store temporary data and where to store dataset data
based on the entries provided in the configuration file. A default
configuration file is available whenever the server is installed.
The configuration files have the extension
".apt". The main benefit of the configuration file is that it
separates software and hardware configuration from job design. It allows
hardware and software resources to change without changing a job design.
DataStage jobs can point to different configuration files by using job
parameters, which means that a job can utilize different hardware
architectures without being recompiled.
The configuration file lists the processing nodes and specifies the disk
space provided for each of them. The nodes specified in the configuration file
are logical processing nodes, so having more than one CPU does not mean that
the nodes in your configuration file correspond to these CPUs. It is possible
to have more than one logical node on a single physical node. However, you
should be careful when choosing the number of logical nodes on a single
physical node. Increasing the number of nodes increases the degree of
parallelism, but it does not necessarily mean better performance, because it
results in more processes. If your underlying system does not have the
capacity to handle these loads, you will end up with a very inefficient
configuration.
1. APT_CONFIG_FILE is the environment variable through which DataStage
determines the configuration file to be used (a project can have many
configuration files). In fact, this is what is generally used in production.
However, if this environment variable is not defined, how does DataStage
determine which file to use?
   1. If the APT_CONFIG_FILE environment variable is not defined, DataStage
   looks for the default configuration file (config.apt) in the following
   locations, in order:
      1. The current working directory.
      2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level
      directory of the DataStage installation.
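The lookup order can be sketched in Python. This only illustrates the order described here and is not how the parallel engine is implemented. The function name `resolve_config_file` and its parameters are hypothetical.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_config_file(env: Mapping[str, str], cwd: Path) -> Optional[Path]:
    """Return the configuration file that would be picked, or None if none is found.

    Hypothetical helper: it mirrors the documented lookup order only.
    """
    # 1. An explicit APT_CONFIG_FILE setting always wins.
    explicit = env.get("APT_CONFIG_FILE")
    if explicit:
        return Path(explicit)

    # 2. Otherwise look for config.apt in the current working directory.
    candidate = cwd / "config.apt"
    if candidate.is_file():
        return candidate

    # 3. Finally fall back to $APT_ORCHHOME/etc/config.apt.
    install_dir = env.get("APT_ORCHHOME")
    if install_dir:
        candidate = Path(install_dir) / "etc" / "config.apt"
        if candidate.is_file():
            return candidate

    return None
```

Calling `resolve_config_file(os.environ, Path.cwd())` (after `import os`) would report which file the current shell session resolves to under these assumptions.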
2. Define a node in the configuration file
A node is a logical processing unit. Each node
in a configuration file is distinguished by a virtual name and defines the number
and speed of CPUs, memory availability, page and swap space, network
connectivity details, etc.
3. What are the
different options a logical node can have in the configuration file?
   1. fastname – The fastname is the physical node name that stages use to
   open connections for high-volume data transfers. The value of this option
   is often the network name. Typically, you can get this name by using the
   UNIX command ‘uname -n’.
   2. Pools – The names of the pools to which the node is assigned. Based on
   the characteristics of the processing nodes, you can group nodes into a set
   of pools.
      1. A pool can be associated with many nodes and a node can be part of
      many pools.
      2. A node belongs to the default pool unless you explicitly specify a
      pools list for it and omit the default pool name (“”) from the list.
      3. A parallel job, or a specific stage in the parallel job, can be
      constrained to run on a pool (a set of processing nodes).
         1. If the job and a stage within the job are both constrained to run
         on specific processing nodes, the stage runs on the nodes that are
         common to the stage and the job.
   3. Resource – The syntax is
   resource resource_type "location" [{pools "disk_pool_name"}] | resource resource_type "value".
   resource_type can be one of the following:
      1. canonical hostname: takes the quoted Ethernet name of a node in a
      cluster that is not connected to the conductor node by the high-speed
      network.
      2. disk: a directory to which persistent data is read and written.
      3. scratch disk: the quoted absolute path name of a directory on a file
      system where intermediate data is temporarily stored. It is local to the
      processing node.
      4. RDBMS-specific resources (e.g. DB2, INFORMIX, ORACLE, etc.).
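To make the node options concrete, here is a small Python sketch that renders node definitions as configuration-file text. It is an illustration under assumptions, not an official tool: the `Node` class and the `render_node` and `render_config` functions are hypothetical, the host name and directory paths are placeholders, and the block layout and the `scratchdisk` keyword spelling follow the default configuration file as commonly shipped, so check them against the file on your own installation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One logical node (hypothetical model of the options discussed above)."""
    name: str                                   # virtual node name
    fastname: str                               # physical host name, e.g. from `uname -n`
    pools: List[str] = field(default_factory=lambda: [""])  # "" is the default pool
    disks: List[str] = field(default_factory=list)          # persistent data directories
    scratchdisks: List[str] = field(default_factory=list)   # temporary data directories


def render_node(node: Node) -> str:
    """Render a single node block."""
    pools = " ".join(f'"{p}"' for p in node.pools)
    lines = [
        f'  node "{node.name}"',
        "  {",
        f'    fastname "{node.fastname}"',
        f"    pools {pools}",
    ]
    # Every resource here is placed in the default disk pool (assumption).
    lines += [f'    resource disk "{d}" {{pools ""}}' for d in node.disks]
    lines += [f'    resource scratchdisk "{s}" {{pools ""}}' for s in node.scratchdisks]
    lines.append("  }")
    return "\n".join(lines)


def render_config(nodes: List[Node]) -> str:
    """Wrap all node blocks in the outer braces of the file."""
    body = "\n".join(render_node(n) for n in nodes)
    return "{\n" + body + "\n}\n"


if __name__ == "__main__":
    # Two logical nodes on the same physical host (placeholder names and paths).
    print(render_config([
        Node("node1", "etlhost", disks=["/data/ds/d1"], scratchdisks=["/scratch/ds/s1"]),
        Node("node2", "etlhost", pools=["", "sort"],
             disks=["/data/ds/d2"], scratchdisks=["/scratch/ds/s2"]),
    ]))
```

Both nodes share one fastname, which is how more than one logical node is placed on a single physical machine.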
4. How does DataStage decide on which processing node a stage should run?
   1. If a job or stage is not constrained to run on specific nodes, the
   parallel engine executes a parallel stage on all nodes defined in the
   default node pool. (Default behavior)
   2. If the job or stage is constrained, the constrained processing nodes
   are chosen while executing the parallel stage.
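The selection rules above, together with the rule that a stage constrained inside a constrained job runs on the nodes common to both, can be sketched in Python as follows. This is a simplified model for illustration only. The `select_nodes` function, the dictionary layout and the pool names are hypothetical and do not reflect the engine's actual implementation.

```python
from typing import Dict, List, Optional, Set

DEFAULT_POOL = ""  # the default pool is named by the empty string


def nodes_in_pool(config: Dict[str, Set[str]], pool: str) -> Set[str]:
    """Return the names of all nodes that list the given pool."""
    return {name for name, pools in config.items() if pool in pools}


def select_nodes(
    config: Dict[str, Set[str]],
    job_pool: Optional[str] = None,
    stage_pool: Optional[str] = None,
) -> List[str]:
    """Pick the nodes a stage would run on under the rules described above."""
    if job_pool is None and stage_pool is None:
        # No constraint: run on every node in the default pool.
        selected = nodes_in_pool(config, DEFAULT_POOL)
    else:
        # Constrained: keep only nodes that satisfy every constraint given,
        # i.e. the nodes common to the job pool and the stage pool.
        selected = set(config)
        for pool in (job_pool, stage_pool):
            if pool is not None:
                selected &= nodes_in_pool(config, pool)
    return sorted(selected)


if __name__ == "__main__":
    # node3 omits "" from its pools list, so it is not in the default pool.
    config = {
        "node1": {"", "sort"},
        "node2": {"", "db"},
        "node3": {"sort", "db"},
    }
    print(select_nodes(config))                                    # default pool
    print(select_nodes(config, stage_pool="sort"))                 # stage constraint
    print(select_nodes(config, job_pool="db", stage_pool="sort"))  # common nodes
```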
-Courtesy : Atul.Singh