Tuesday, June 18, 2013

A Quick look at Routines in Datastage

DataStage has two types of routines:

1. Before/After subroutines.
2. Transformer routines/functions.

Before/After Subroutines:

These are built-in routines which can be called as before-job or after-job subroutines. Below is the list.

1. DSSendMail: Sends mail using a local send-mail program.
2. DSWaitForFile: Suspends a job until a named file either exists or does not exist.
3. DSReport: Generates a job execution report.
4. ExecDOS: Executes a command via an MS-DOS shell. The command is specified in the routine's input argument.
5. ExecDOSSilent: As ExecDOS, but does not write the command line to the job log.
6. ExecTCL: Executes a command via an InfoSphere Information Server engine shell. The command is specified in the routine's input argument.
7. ExecSH: Executes a command via a UNIX Korn shell.
8. ExecSHSilent: As ExecSH, but does not write the command line to the job log.

Transformer Routines:

Transformer routines are custom-developed functions. As you all know, even DataStage has some limitations in its built-in functions (Trim, PadString, etc.); for example, in DataStage 8.1 there is no function to return the ASCII value of a character. From 8.5 onwards the Seq() function has been introduced for exactly this scenario.

            These custom routines are developed in C++. Writing a routine in C++ and linking it to our DataStage project is a simple task:

1. Write the C++ code.
2. Compile it with the required flags.
3. Put the output (.o) file in a shared directory.
4. Link it in DataStage as a parallel routine.
5. Use it in a Transformer like any other function.

Below is the sample C++ code:

#include <string>
#include <iostream>

using namespace std;

// Adds two integers; this is the function that will be linked in as a parallel routine.
int addNumber(int a, int b)
{
    return a + b;
}

Note: Make sure the code does not contain a main() function; it is not required.

Compiling with the required flags:

Get the values of the two environment variables below.

1. APT_COMPILER
2. APT_COMPILEOPT

Use them as below to compile the code from the UNIX prompt:

APT_COMPILER<space>APT_COMPILEOPT<space>file_name (the file with the C++ code)
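For example, a minimal compile sketch on a Linux engine (the file name addNumber.cpp is hypothetical, and the compiler and options shown are only typical values; substitute whatever your APT_COMPILER and APT_COMPILEOPT actually contain):

g++ -O -fPIC -Wno-deprecated -c addNumber.cpp    # produces addNumber.o in the same directory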

1. Once you run the above command it will create an object file at the same path with a .o extension. Now log in to DataStage Designer, right-click on the Routines folder and select New Parallel Routine.
2. In the window that opens, enter the required details: give the routine a name, select the type as External Function, and set the external subroutine name to the C++ function name you want to call. Select the proper return type and provide the complete path to the .o file.
3. Now select the Arguments tab, add the required arguments, and choose the proper data types for them.
4. You are all done. Go to your job, open any Transformer, click the ellipsis button [...] in any expression, and select Routine from the list; you will find the new routine listed there.

But remember a few points while writing the function (the C++ code); a sketch illustrating them follows.

1. DataStage cannot accept a return type of string, so we need to design our function to return char* instead.

2. The same applies to the input arguments: the function can accept only char*, not string. Inside the C++ code we can convert it to a string if needed.
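A minimal sketch of such a routine (the function name concatWithDash and the fixed-size buffer are purely illustrative, not part of the original post):

#include <cstring>
#include <string>

// Takes and returns char* because the parallel Transformer cannot bind std::string directly;
// inside the function we are free to work with std::string.
char* concatWithDash(char* first, char* second)
{
    static char result[256];                       // simple fixed buffer, for illustration only
    std::string tmp = std::string(first) + "-" + std::string(second);
    std::strncpy(result, tmp.c_str(), sizeof(result) - 1);
    result[sizeof(result) - 1] = '\0';
    return result;
}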

Saturday, June 15, 2013

Useful Options of sed command

          This post shows how to use sed, a special editor for modifying files automatically. If you want to write a program to make changes in a file, sed is the tool to use. sed is the ultimate stream editor; if that sounds strange, picture a stream flowing through a pipe. sed is a marvelous utility.

The essential command: s for substitution:
            sed has several commands, but most people only use the substitute command, s. The substitute command changes text matching a regular expression into a new value. A simple example is changing "day" in the file File_1 to "night" in File_2:
sed 's/day/night/' File_1 > File_2
I must emphasize that the sed editor changes exactly what you tell it to. So if you executed
echo Sunday | sed 's/day/rise/'
it would output the word "Sunrise", because sed found the string "day" in the input.
Another important concept is that sed is line oriented. Suppose you have the input file:
one two three, one two three
four three two one
one hundred
and you used the command
sed 's/one/ONE/' <file
The output would be :
ONE two three, one two three
four three two ONE
ONE hundred
Note: You can also change the delimiter from "/" to another character such as ":", e.g. sed 's:one:ONE:'
s/pattern/replacement/flags
Substitutes the replacement string for the first occurrence of the pattern parameter in the pattern space. Any character that appears after the s subcommand can substitute for the / (slash) separator, except the space or newline character. See the "Pattern Matching" section of the ed command.
The value of the flags variable must be zero or more of the following (examples are shown after the list):
g : Substitutes all non-overlapping instances of the pattern parameter rather than just the first one.
n : Substitutes only the n-th occurrence of the pattern parameter.
p : Writes the pattern space to standard output if a replacement was made.
w WFile : Appends the pattern space to the WFile file if a replacement was made. If WFile was not already created by a previous write in this sed script, the sed command creates it.
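For example (the file names used here are hypothetical):

sed 's/one/ONE/g' file.txt               # g: replace every occurrence on each line
sed 's/one/ONE/2' file.txt               # n (here 2): replace only the second occurrence on each line
sed -n 's/one/ONE/p' file.txt            # p with -n: print only the lines where a substitution was made
sed 's/one/ONE/w changed.txt' file.txt   # w: also write the substituted lines to changed.txt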
Flags:
       -e Script : Uses the Script variable as the editing script. If you are using just one -e flag and no -f flag, the -e flag can be omitted.
       -f ScriptFile : Uses the ScriptFile variable as the source of the editing script. The ScriptFile variable is a prepared set of editing commands applied to the File parameter.
       -n : Suppresses all information normally written to standard output.
       -u : Displays the output in unbuffered mode. When this flag is set, the sed command displays the output instantaneously instead of buffering it. The default is buffered mode.

Note: You can specify multiple -e and -f flags. All subcommands are added to the script in the order specified, regardless of their origin. A few examples:
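(The file names old.txt and edits.sed below are hypothetical.)

sed -n 's/day/night/p' old.txt                   # -n suppresses the default output; only substituted lines are printed
sed -e 's/day/night/' -e 's/one/ONE/' old.txt    # multiple -e scripts, applied in the order given
sed -f edits.sed old.txt                         # read the editing commands from the file edits.sed
sed -u 's/day/night/' old.txt                    # write the output unbuffered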



Thursday, June 13, 2013

Datastage Parallelism Explained


               The simultaneous use of more than one CPU or processor core to execute a program or multiple computational threads is called parallel processing, or parallelism. Ideally, parallel processing makes programs run faster because there are more engines (CPUs or cores) running them. As you all know, DataStage supports two types of parallelism:

1.Pipeline parallelism.
2.Partition parallelism.


Pipeline Parallelism :

     As soon as a row (or a set of rows) is processed at a particular stage, that record is sent on to the next stage for further processing or storing, rather than waiting for the whole data set to finish at the current stage. The following example illustrates this.


While rows are still being read from the source, the rows already processed by the TRANSFORM stage are sent on to the ENRICH stage and from there to the LOAD stage. In this way we keep the processors busy and reduce the disk usage needed for staging.


Partition Parallelism :

              Partition parallelism depends on dividing large data into smaller subsets (partitions) across resources. The goal is to distribute the data evenly. Some transforms require all data within the same group to be in the same partition, and the same transform runs on every partition.

            Using partition parallelism, the same job is effectively run simultaneously by several processors, each handling a separate subset of the total data. Each partition is independent of the others; there is no concept of "global" state.


DataStage combines partition and pipeline parallelism to implement ETL solutions.


Importance of Configuration File in Datastage.

            The Datastage configuration file is a master control file (a text file which sits on the server side) for a job which describes the parallel system resources and architecture. The configuration file provides hardware configuration for supporting such architectures as SMP (Single machine with multiple CPU, shared memory and disk), Grid, Cluster or MPP (multiple CPU, multiple nodes and dedicated memory per node). DataStage understands the architecture of the system through this file.
           This is one of the biggest strengths of DataStage. If you change your processing configuration, or change servers or platform, you never have to worry about it affecting your jobs, since all the jobs depend on this configuration file for execution. DataStage jobs determine which node to run a process on, where to store temporary data and where to store data set data based on the entries provided in the configuration file. A default configuration file is created when the server is installed.
The configuration files have extension ".apt". The main outcome from having the configuration file is to separate software and hardware configuration from job design. It allows changing hardware and software resources without changing a job design. Datastage jobs can point to different configuration files by using job parameters, which means that a job can utilize different hardware architectures without being recompiled.
The configuration file contains the different processing nodes and also specifies the disk space provided for each processing node; these are logical processing nodes specified in the configuration file. So if you have more than one CPU, this does not mean the nodes in your configuration file correspond to those CPUs: it is possible to have more than one logical node on a single physical node. However, you should be wise in configuring the number of logical nodes on a single physical node. Increasing nodes increases the degree of parallelism, but it does not necessarily mean better performance, because it results in a larger number of processes. If your underlying system does not have the capability to handle these loads, you will have a very inefficient configuration on your hands.

1.    APT_CONFIG_FILE is the environment variable through which DataStage determines the configuration file to use (a project can have many configuration files). In fact, this is what is generally used in production. However, if this environment variable is not defined, how does DataStage determine which file to use?
1.    If the APT_CONFIG_FILE environment variable is not defined, DataStage looks for the default configuration file (config.apt) in the following locations:
1.    The current working directory.
2.    INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
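A quick sketch of pointing a session at a specific configuration file (the path is hypothetical; in a job you would normally set APT_CONFIG_FILE as a job parameter instead):

export APT_CONFIG_FILE=/ibmiscode/Configurations/ETL/four_node.apt   # hypothetical path
echo $APT_CONFIG_FILE                                                # confirm which file the engine will pick up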

2.    Define Node in configuration file
A Node is a logical processing unit. Each node in a configuration file is distinguished by a virtual name and defines a number and speed of CPUs, memory availability, page and swap space, network connectivity details, etc.

3.    What are the different options a logical node can have in the configuration file?
1.    fastname – The fastname is the physical node name that stages use to open connections for high-volume data transfers. The value of this option is usually the network name. Typically, you can get this name with the UNIX command 'uname -n'.
2.    pools – Names of the pools to which the node is assigned. Based on the characteristics of the processing nodes you can group nodes into sets of pools.
1.    A pool can be associated with many nodes, and a node can be part of many pools.
2.    A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from the list.
3.    A parallel job, or a specific stage in the parallel job, can be constrained to run on a pool (a set of processing nodes).
1.    If both the job and a stage within the job are constrained to run on specific processing nodes, the stage runs on the nodes common to the stage and the job.
3.    resource – The syntax is resource resource_type "location" [{pools "disk_pool_name"}] or resource resource_type "value". The resource_type can be canonical hostname (the quoted Ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (a directory to read/write persistent data to), scratchdisk (the quoted absolute path of a directory on a file system where intermediate data will be temporarily stored; it is local to the processing node), or an RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE). A sample node definition is shown below.
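For instance, a minimal node definition putting these options together (the node name, fastname and paths are hypothetical; compare the full sample configuration file in the SMP/MPP post further down):

{
  node "node1" {
    fastname "etlserver1"
    pools ""
    resource disk "/data/ds/resource1" {pools ""}
    resource scratchdisk "/data/ds/scratch1" {pools ""}
  }
}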

4.    How does DataStage decide on which processing nodes a stage should run?
1.    If a job or stage is not constrained to run on specific nodes, the parallel engine executes a parallel stage on all nodes defined in the default node pool (the default behavior).
2.    If the node pool is constrained, the constrained processing nodes are chosen when executing the parallel stage.


                                                                                          -Courtesy : Atul.Singh


Wednesday, June 12, 2013

How to Run a DataStage Job from the Unix Command Line?

                    Most data warehousing projects require that your jobs run in batches at specified time slots. In such cases the DataStage jobs are usually scheduled using an external scheduling tool like ESP Scheduler, Control-M, Autosys, etc. This is made possible by writing scripts that run your jobs through the command line. I would consider the command line a very powerful interface to DataStage which lets us do more than just run the normal job. The guides in the DataStage documentation are very helpful for exploring the various things that can be done through the command line. However, I plan on giving you the basics you will need to carry out your executions.
In UNIX, the DataStage home directory location will always be specified in the ".dshome" file, which is present in the root directory. Before you can run your DataStage commands you will have to run the following commands:

cd `cat /.dshome`

This will change the location to the home directory. By default this will be /opt/IBM/InformationServer/Server/DSEngine.

. ./dsenv > /dev/null 2>&1

This will run the dsenv file, which contains all the environment variables. Without doing this, your DataStage commands won't run on the command prompt.

After you have done this you can use any DataStage command for interacting with the server. The main command you can use is the 'dsjob' command, which is used not only to run jobs but for a wide variety of purposes. Let's look at the various ways you can use the dsjob command.
To run a job:
Using the dsjob command you can start, stop, reset or run the job in validation mode.
dsjob -run -mode VALIDATE/RESET/RESTART project_name job_name
This command will run the job in validation mode. Similarly you can use RESET or RESTART instead of VALIDATE, depending on what type of run you want. If you want a normal run then you will not need to specify the -mode keyword, as shown below:
dsjob -run project_name job_name|job_name.invocationid
Running with the invocation id means the job will be run with that specific invocation id.
Now if you have parameter values or parameter-set values to set, this can also be done as shown below:
dsjob -run -param variable_name="VALUE" -param psParameterSet="vsValueSet" project_name job_name
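Putting these pieces together, a minimal wrapper script sketch (the project name dwh_project, job name load_customers and the RUN_DATE parameter are hypothetical; -wait simply makes dsjob wait until the job finishes):

#!/bin/ksh
# Set up the DataStage environment, then run the job and report the exit status.
cd `cat /.dshome`
. ./dsenv > /dev/null 2>&1

./bin/dsjob -run -wait -param RUN_DATE="2013-06-12" dwh_project load_customers
echo "dsjob finished with exit status: $?"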
To stop a job:
Stopping a job is fairly simple. You might not actually require it, but it's still worth a look. It acts the same way as stopping a running job from the DataStage Director.
dsjob -stop project_name job_name|job_name.invocationid
To list projects, jobs, stages in jobs, links in jobs, parameters in jobs and invocations of jobs
dsjob can very easily give you all of the above based on different keywords. It will be useful if you want to get a report of what's being used in which project and things like that.
The various commands are shown below:
'dsjob -lprojects' will give you a list of all the projects on the server
'dsjob -ljobs project_name' will give you a list of jobs in a particular project
'dsjob -lstages project_name job_name' will give you a list of all the stages used in your job. Replacing -lstages with -llinks will give you a list of all the links in your job. Using -lparams will give you a list of all the parameters used in your job. Using -linvocations will give you a list of all the invocations of your multiple-instance job.
To generate reports of a job
You can get the basic information of a job by using the 'jobinfo' option as shown below:
dsjob -jobinfo project_name job_name
Running this command will give you a short report of your job which includes the current status of the job, the name of any controlling job, the date and time when the job started, the wave number of the last or current run (an internal InfoSphere DataStage reference number) and the user status.
You can get a more detailed report using the below command:
dsjob -report project_name job_name BASIC|DETAIL|XML
BASIC means that your report will contain very basic information like the start/end time of the job, the time elapsed and the current status of the job. DETAIL, as the name indicates, will give you a very detailed report on the job, down to the stage and link level. XML would give you an XML report, which is also a detailed report, in XML format.
To access logs:
You can use the below command to get the latest 5 fatal errors from the log of the job that was just run:
dsjob -logsum -type FATAL -max 5 project_name job_name
You can get different types of information based on the keyword you specify for -type. The full list of allowable types is available in the help guide for reference.
There are a number of other options available to get different log information; you can explore these in more detail in the developer guide. With the DataStage commands you can administer jobs, run jobs, maintain jobs, handle errors, prepare meaningful job logs and even prepare reports. The possibilities are endless. If you like to code then you won't mind spending your time exploring the command-line options available.

Tuesday, June 11, 2013

Dataset Handling in Unix with Orchadmin


                DataStage stores data in a persistent internal format (specific to DataStage) in the form of data sets. Orchestrate data sets aid in the parallel processing of data and are much faster performance-wise. They help in achieving end-to-end parallelism by writing data in partitioned form and maintaining the sort order. An Orchestrate data set consists of one or more data files stored on multiple processing nodes. A parallel data set is represented on disk by:

• A single descriptor file - defines the record schema of the data set and the location of all the data files in the set. It does not contain the actual data.

• Data files (which contain the actual data) located on one or more processing nodes.

Orchadmin Utility 

This is the Orchestrate administrator utility. It can perform operations on data sets which cannot be performed by normal UNIX file commands. The basic syntax is:

orchadmin [command] [options] [descriptor_files]

Commands

The various commands that are available with orchadmin are dump, delete, truncate, copy and describe.

Dump Command 

This command can be used to write records from a given data set onto standard output, or the output can be redirected to a sequential file. The syntax is:

Syntax :

orchadmin dump [options] descriptor_file

If no option is specified, all the records will be returned to the standard output.

ex 1) orchadmin dump test.ds
   2) orchadmin dump test.ds > temp.txt

In the second example the temp.txt file will contain the data present in test.ds.

The dump command with the -field <name> option prints the data for a single column of the data set, for example:
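(CUSTOMER_ID is a hypothetical column name.)

orchadmin dump -field CUSTOMER_ID test.ds
orchadmin dump -field CUSTOMER_ID test.ds > customer_id.txt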

Note: There are still more options available with dump, such as -x, -n, -name and -skip.


Delete Command  

The UNIX rm command deletes only the descriptor file; the actual data is not deleted, as it is present in the data files which reside on the processing nodes. To remove the persistent data from data sets, the conventional approach is to use the Data Set Management tool in DataStage Designer.

The orchadmin utility simplifies the whole process by providing the delete command. The syntax is:

Syntax :

orchadmin delete | del | rm [-option] ds_1 ... ds_N 

ex :
orchadmin delete test.ds

Describe Command  

This command outputs a report about the datasets specified. The syntax is: 

orchadmin describe [-options] descriptor_file 

ex:
orchadmin describe test.ds

Copy Command

This command can be used to create an identical data set with the same column definitions and number of records. The orchadmin copy command can be used to take backups of existing data sets.
  
Syntax :
orchadmin copy | cp  source-ds  target-ds 

ex:
orchadmin copy temp.ds temp_target.ds

Note: 1) If one uses the UNIX cp command then only the descriptor file is copied, and the copied descriptor file points to the same data files residing on the processing nodes.

2) Type orchadmin at the command prompt to get help information about this command.

Monday, June 10, 2013

Know whether your DS Server architecture is SMP or MPP

                   To know whether you are working on SMP or MPP, just have a look at your configuration file. If "fastname" has the same value for all the nodes in your configuration file, then you are running on SMP.

If the nodes have different fastnames, then it is MPP.


Sample config File for SMP:


main_program: APT configuration file: /ibmiscode/Configurations/ETL/my_config.apt
{
  node "Node1" { 
    fastname "FastName1" 
    pools "" 
    resource disk "/is_resource1/isent" {pools ""} 
    resource scratchdisk "/scratch1/sent" {pools ""} 
  } 

 node "Node2" { 

    fastname "FastName1" 
    pools "" 
    resource disk "/is_resource2/isent" {pools ""} 
    resource scratchdisk "/scratch2/sent" {pools ""} 
  } 

 node "Node3" { 

    fastname "FastName1" 
    pools "" 
    resource disk "/is_resource3/isent" {pools ""} 
    resource scratchdisk "/scratch3/sent" {pools ""} 
  } 

  node "Node4" { 

    fastname "FastName1" 
    pools "" 
    resource disk "/is_resource4/isent" {pools ""} 
    resource scratchdisk "/scratch4/sent" {pools ""} 
  } 
}

Note: For server jobs it won't matter whether it is SMP or MPP, because server jobs execute only on the machine where DataStage is installed.


Hope This will help :)
 
