Pentaho Data Integration Beginner's Guide - Second Edition

Product type Paperback

Published in Oct 2013

Publisher Packt

ISBN-13 9781782165040

Length 502 pages

Edition 2nd Edition

Languages

Java

Tools

Pentaho

Concepts

Data Visualization

Author (1):

María Carina Roldán

Table of Contents (21) Chapters

Preface

1. Getting Started with Pentaho Data Integration

2. Getting Started with Transformations

3. Manipulating Real-world Data

4. Filtering, Searching, and Performing Other Useful Operations with Data

5. Controlling the Flow of Data

6. Transforming Your Data by Coding

7. Transforming the Rowset

8. Working with Databases

9. Performing Advanced Operations with Databases

10. Creating Basic Task Flows

11. Creating Advanced Transformations and Jobs

12. Developing and Implementing a Simple Datamart

A. Working with Repositories

B. Pan and Kitchen – Launching Transformations and Jobs from the Command Line

C. Quick Reference – Steps and Job Entries

D. Spoon Shortcuts

E. Introducing PDI 5 Features

F. Best Practices

Summary

G. Pop Quiz Answers

Index

Time for action – creating a hello world transformation

How about starting by saying hello to the world? It's not really new, but good enough for our first practical example; here are the steps to follow:

Create a folder named pdi_labs under a folder of your choice.
Open Spoon.
From the main menu, navigate to File | New | Transformation.
On the left of the screen, under the Design tab, you’ll see a tree of Steps. Expand the Input branch by double-clicking on it.
Note
Note that if you work in Mac OS a single click is enough.
Then, left-click on the Generate Rows icon and, without releasing the button, drag and drop the icon onto the main canvas.
Note
Note that we changed the preferred language back to English.
Double-click on the Generate Rows step you just put on the canvas, and fill in the textboxes, including Step name and Limit, and the grid, so that the step generates 10 rows containing the message Hello World!
From the Steps tree, double-click on the Flow branch.
Click on the Dummy (do nothing) icon and drag-and-drop it to the main canvas.
Put the mouse cursor over the Generate Rows step and wait until a tiny toolbar shows up below the entry icon.
Click on the output connector (the last icon in the toolbar), and drag towards the Dummy (do nothing) step. A grayed hop is displayed.
When the mouse cursor is over the Dummy (do nothing) step, release the button. A link (a hop from now on) is created from the Generate Rows step to the Dummy (do nothing) step.
Right-click anywhere on the canvas to bring up a contextual menu.
In the menu, select the New note option. A note editor appears.
Type some description such as Hello, World! Select the Font style tab and choose some nice font and colors for your note, and then click on OK.
From the main menu, navigate to Edit | Settings.... A window appears in which you specify the transformation properties. Fill the Transformation name textbox with a simple name, such as hello world. Fill the Description textbox with a short description such as My first transformation. Finally, provide a clearer explanation in the Extended description textbox, and then click on OK.
From the main menu, navigate to File | Save.
Save the transformation in the folder pdi_labs with the name hello_world.
Select the Dummy (do nothing) step by left-clicking on it.
Click on the Preview icon in the bar menu above the main canvas.
The Transformation debug dialog window appears. Click on the Quick Launch button.
A window appears to preview the data generated by the transformation.
Close the preview window and click on the Run icon.
A window named Execute a transformation appears. Click on Launch.
The execution results are shown at the bottom of the screen, in the Logging tab.

What just happened?

You have just created your first transformation.

First, you created a new transformation, dragged and dropped two steps into the work area, Generate Rows and Dummy (do nothing), and connected them.

With the Generate Rows step you created 10 rows of data with the message Hello World! The Dummy (do nothing) step simply served as a destination of those rows.

After creating the transformation, you did a preview. The preview allowed you to see the content of the created data, that is, the 10 rows with the message Hello World!

Finally, you ran the transformation. You could then see the Execution Results window at the bottom of the screen, where the Logging tab shows the complete detail of what happened. There are other tabs in this window, which you will learn about later in the book.

Directing Kettle engine with transformations

A transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows—the data enters or is created in a step, the step applies some kind of transformation to it, and finally the data leaves that step. Therefore, it’s said that a transformation is data flow oriented.

A transformation itself is neither a program nor an executable file. It is just plain XML. The transformation contains metadata which tells the Kettle engine what to do.
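To make the idea of a transformation as plain XML metadata more concrete, here is a minimal sketch in Python. The XML below is a hand-written, heavily simplified stand-in for a saved transformation; its element names are assumptions chosen for illustration, and a real file saved by Spoon contains many more settings. The point is only that the steps and hops can be read back as data, without executing anything.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for a saved transformation.
# The element names are assumptions for illustration only.
KTR_SKETCH = """
<transformation>
  <info>
    <name>hello world</name>
    <description>My first transformation</description>
  </info>
  <order>
    <hop><from>Generate Rows</from><to>Dummy (do nothing)</to></hop>
  </order>
  <step><name>Generate Rows</name><type>RowGenerator</type></step>
  <step><name>Dummy (do nothing)</name><type>Dummy</type></step>
</transformation>
"""

def summarize(xml_text):
    """Read the transformation name, step names, and hops from the XML."""
    root = ET.fromstring(xml_text)
    name = root.findtext("info/name")
    steps = [step.findtext("name") for step in root.findall("step")]
    hops = [(hop.findtext("from"), hop.findtext("to"))
            for hop in root.findall("order/hop")]
    return name, steps, hops

name, steps, hops = summarize(KTR_SKETCH)
print(name)
print(steps)
print(hops)
```

Nothing in this sketch runs the transformation; it only inspects the description that an engine would interpret.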

A step is the minimal unit inside a transformation. A big set of steps is available. These steps are grouped in categories such as the Input and Flow categories that you saw in the example.

Each step is conceived to accomplish a specific function, going from reading a parameter to normalizing a dataset.

Each step has a configuration window. These windows vary according to the functionality of the steps and the category to which they belong. What all steps have in common are the name and description:

Step property	Description
Name	A representative name inside the transformation.
Description	A brief explanation that allows you to clarify the purpose of the step. It’s not mandatory but it is useful.

A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.
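The relationship between steps and hops can be pictured with a small conceptual model in Python. This is not how the Kettle engine is implemented; it is only a sketch in which each step is a generator function and the hop is the act of passing the rows of one step to the next. The function names and the field name message are hypothetical.

```python
def generate_rows(limit, message):
    """Origin step: creates `limit` rows, each with a single field."""
    for _ in range(limit):
        yield {"message": message}  # "message" is a hypothetical field name

def dummy(rows):
    """Destination step: passes every incoming row through unchanged."""
    for row in rows:
        yield row

# The "hop": the output rows of the origin step become
# the input rows of the destination step.
output = list(dummy(generate_rows(10, "Hello World!")))

print(len(output))
print(output[0])
```

As in the hello world example, the origin step creates the data and the destination step simply receives it.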

Exploring the Spoon interface

As you just saw, Spoon is the tool with which you create, preview, and run transformations. Its basic work areas are the Main menu, the Design view, the Transformation toolbar, and the Canvas (work area).

Note

The words canvas and work area will be used interchangeably throughout the book.

There is also an area named View that shows the structure of the transformation currently being edited. You can see that area by clicking on the View tab at the upper-left corner of the screen.

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Designing a transformation

In the earlier section, you designed a very simple transformation, with just two steps and one explanatory note. You learned to link steps by using the mouseover assistance toolbar. There are alternative ways to do the same thing, and you can use the one that you feel most comfortable with. Appendix D, Spoon Shortcuts, explains all of the different options. It also explains many shortcuts for zooming in and out, aligning the steps, and other tasks. These shortcuts are very useful as your transformations become more complex.

Note

Appendix F, Best Practices, explains the benefit of using shortcuts as well as other best practices that are invaluable when you work with Spoon, especially when you have to design and develop big ETL projects.

Running and previewing the transformation

The Preview functionality allows you to see a sample of the data produced for selected steps. In the previous example, you previewed the output of the Dummy (do nothing) step.

The Run icon effectively runs the whole transformation.

Whether you preview or run a transformation, you’ll get an Execution Results window showing what happened. You will learn more about this in the next chapter.
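The difference between previewing and running can be sketched as follows. This is a conceptual illustration in Python, not Spoon's actual implementation: preview collects only a sample of the rows coming out of a step, while run consumes all of them. The helper names and the row layout are hypothetical.

```python
from itertools import islice

def hello_rows(limit=10):
    """Hypothetical stand-in for the rows leaving a step."""
    for n in range(1, limit + 1):
        yield {"row": n, "message": "Hello World!"}

def preview(rows, sample_size):
    """Return only the first `sample_size` rows produced by a step."""
    return list(islice(rows, sample_size))

def run(rows):
    """Consume every row and report how many were processed."""
    return sum(1 for _ in rows)

sample = preview(hello_rows(), 3)
total = run(hello_rows())

print(sample)
print(total)
```

In the hello world example the sample and the full output happen to be the same 10 rows; with larger datasets, a preview lets you inspect the data without processing all of it.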