Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Pentaho Data Integration Cookbook - Second Edition

You're reading from   Pentaho Data Integration Cookbook - Second Edition The premier open source ETL tool is at your command with this recipe-packed cookbook. Learn to use data sources in Kettle, avoid pitfalls, and dig out the advanced features of Pentaho Data Integration the easy way.

Arrow left icon
Product type Paperback
Published in Dec 2013
Publisher Packt
ISBN-13 9781783280674
Length 462 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Toc

Table of Contents (21) Chapters Close

Pentaho Data Integration Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Working with Databases FREE CHAPTER 2. Reading and Writing Files 3. Working with Big Data and Cloud Sources 4. Manipulating XML Structures 5. File Management 6. Looking for Data 7. Understanding and Optimizing Data Flows 8. Executing and Re-using Jobs and Transformations 9. Integrating Kettle and the Pentaho Suite 10. Getting the Most Out of Kettle 11. Utilizing Visualization Tools in Kettle 12. Data Analytics Data Structures References Index

Getting data from a database by running a query built at runtime


When you work with databases, most of the time you start by writing an SQL statement that gets the data you need. However, there are situations in which you don't know that statement exactly. Maybe the name of the columns to query are in a file, or the name of the columns by which you will sort will come as a parameter from outside the transformation, or the name of the main table to query changes depending on the data stored in it (for example sales2010). PDI allows you to have any part of the SQL statement as a variable, so you don't need to know the literal SQL statement text at design time.

Assume the following situation: you have a database with data about books and their authors, and you want to generate a file with a list of titles. Whether to retrieve the data ordered by title or by genre is a choice that you want to postpone until the moment you execute the transformation.

Getting ready

You will need a book database with the structure as explained in Appendix A, Data Structures.

How to do it...

  1. Create a transformation.

  2. The column that will define the order of the rows will be a named parameter. So, define a named parameter named ORDER_COLUMN, and put title as its default value.

    Note

    Remember that named parameters are defined in the Transformation setting window and their role is the same as the role of any Kettle variable. If you prefer, you can skip this step and define a standard variable for this purpose.

  3. Now drag a Table Input step to the canvas. Then create and select the connection to the book's database.

  4. In the SQL frame, type the following statement:

    SELECT * FROM books ORDER BY ${ORDER_COLUMN}
  5. Check the option Replace variables in script? and close the window.

  6. Use an Output step such as a Text file output step to send the results to a file, save the transformation, and run it.

  7. Open the generated file and you will see the books ordered by title.

  8. Now try again. Press the F9 key to run the transformation one more time.

  9. This time, change the value of the ORDER_COLUMN parameter typing genre as the new value.

  10. Click on the Launch button.

  11. Open the generated file. This time you will see the titles ordered by genre.

How it works...

You can use Kettle variables in any part of the SELECT statement inside a Table Input step. When the transformation is initialized, PDI replaces the variables by their values provided that the Replace variables in script? option is checked.

In the recipe, the first time you ran the transformation, Kettle replaced the variable ORDER_COLUMN with the word title and the statement executed was as follows:

SELECT * FROM books ORDER BY title

The second time, the variable was replaced by genre and the executed statement was as follows:

SELECT * FROM books ORDER BY genre

Note

As mentioned in the recipe, any predefined Kettle variable can be used instead of a named parameter.

There's more...

You may use variables not only for the ORDER BY clause, but in any part of the statement: table names, columns, and so on. You could even hold the full statement in a variable. Note however that you need to be cautious when implementing this.

Note

A wrong assumption about the metadata generated by those predefined statements can make your transformation crash.

You can also use the same variable more than once in the same statement. This is an advantage of using variables as an alternative to question marks when you need to execute parameterized SELECT statements.

Named parameters are another option to store parts of statements. They are part of the job or transformation and allow for default values and clear definitions for what the parameter is. To add or edit named parameters, right-click on the transformation or job, go into its settings, and switch to the Parameters tab.

See also

  • Getting data from a database by providing parameters

You have been reading a chapter from
Pentaho Data Integration Cookbook - Second Edition - Second Edition
Published in: Dec 2013
Publisher: Packt
ISBN-13: 9781783280674
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime