- When you need to run
the same sequence of jobs again and again, better create a sequencer with
all the jobs that you need to run. Running this sequencer will run all the
jobs. You can provide the sequence as per your requirement.
- If you are using a copy or a filter stage
either immediately after or immediately before a transformer stage, you
are reducing the efficiency by using more stages because a transformer
does the job of both copy stage as well as a filter stage
- Use Sort stages
instead of Remove duplicate stages. Sort stage has got more grouping
options and sort indicator options.
- Turn off Runtime
Column propagation wherever it’s not required.
- Make use of Modify, Filter, and
Aggregation, Col. Generator etc stages instead of Transformer stage only
if the anticipated volumes are high and performance becomes a problem.
Otherwise use Transformer. It is very easy to code a transformer than a
modify stage.
- Avoid propagation of
unnecessary metadata between the stages. Use Modify stage and drop the
metadata. Modify stage will drop the metadata only when explicitly
specified using DROP clause.
- Add reject files
wherever you need reprocessing of rejected records or you think
considerable data loss may happen. Try to keep reject file at least at
Sequential file stages and writing to Database stages.
- Make use of Order By
clause when a DB stage is being used in join. The intention is to make use
of Database power for sorting instead of Data Stage resources. Keep the
join partitioning as Auto. Indicate don’t sort option between DB stage and
join stage using sort stage when using order by clause.
- While doing Outer
joins, you can make use of Dummy variables for just Null checking instead
of fetching an explicit column from table.
- Data Partitioning is
very important part of Parallel job design. It’s always advisable to have
the data partitioning as ‘Auto’ unless you are comfortable with
partitioning, since all Data Stage stages are designed to perform in the
required way with Auto partitioning.
- Do remember that Modify drops the
Metadata only when it is explicitly asked to do so using KEEP/DROP
clauses.
- Range Look-up: Range Look-up is
equivalent to the operator between. Lookup against a range of values was
difficult to implement in previous Data Stage versions. By having this
functionality in the lookup stage, comparing a source column to a range of
two lookup columns or a lookup column to a range of two source columns can
be easily implemented.
- Use a Copy stage to dump out data to
intermediate peek stages or sequential debug files. Copy stages get
removed during compile time so they do not increase overhead
- Where you are using a
Copy stage with a single input and a single output, you should ensure that
you set the Force property in the stage editor TRUE. This prevents
DataStage from deciding that the Copy operation is superfluous and
optimizing it out of the job
Saturday, December 21, 2013
Things to be considered while desinging a DS job
SQL Query Order of Operations:
SQL Query Order of Operations:
Tags: SQL
Lately, I have been looking into SQL query optimization.
We recently installed SeeFusion on our server and I can see where my long
running tasks are causing the server to slow down. Turns out, not surprisingly,
that the slow pages are very query-intense. Granted, a lot of these pages were
pages years ago before I knew what nice code looked like, but the good news it,
lots of room for optimization and clean up.
To start out, I thought it would be good to look up the
order in which SQL directives get executed as this will change the way I can
optimize:
1.
FROM clause
2.
WHERE clause
3.
GROUP BY clause
4.
HAVING clause
5.
SELECT clause
6.
ORDER BY clause
This order holds some very interesting pros/cons:
FROM Clause
Since this clause executes first, it is our first
opportunity to narrow down possible record set sizes. This is why I put as many
of my ON rules (for joins) as possible in this area as opposed to in the WHERE
clause:
·
FROM
·
contact c
·
INNER JOIN
·
display_status d
·
ON
·
(
·
c.display_status_id = d.id
·
AND
·
d.is_active = 1
·
AND
·
d.is_viewable = 1
·
)
This way, by the time we get to the WHERE clause, we
will have already excluded rows where is_active and is_viewable do not equal 1.
WHERE Clause
With the WHERE clause coming second, it becomes obvious
why so many people get confused as to why their SELECT columns are not
referencable in the WHERE clause. If you create a column in the SELECT
directive:
·
SELECT
·
( 'foo' ) AS bar
It will not be available in the WHERE clause because the
SELECT clause has not even been executed at the time the WHERE clause is being
run.
ORDER BY Clause
It might confuse people that their calculated SELECT
columns (see above) are not available in the WHERE clause, but they ARE
available in the ORDER BY clause, but this makes perfect sense. Because the
SELECT clause executed right before hand, everything from the SELECT should be
available at the time of ORDER BY execution.
I am sure there are other implications based on the SQL
clause order of operations, but these are the most obvious to me and can help
people really figure out where to tweak their code.
Courtesy: Ben Nadel
Subscribe to:
Posts (Atom)