Pentaho Data Integration Quick Start Guide: Create ETL processes using Pentaho

By María Carina Roldán
Book Aug 2018 178 pages 1st Edition
Chapter 1. Getting Started with PDI

Pentaho Data Integration (PDI) is a popular business intelligence tool, used for exploring, transforming, validating, and migrating data, along with other useful operations. PDI allows you to perform all of the preceding tasks thanks to its friendly user interface, modern architecture, and rich functionality. This book will introduce you to the tool, giving you a quick understanding of the daily tasks that you can perform with it.

We will cover the following topics in this chapter:

  • Introducing PDI
  • Installing PDI
  • Configuring the graphical designer tool
  • Creating a simple transformation
  • Understanding the Kettle home directory

Introducing PDI


PDI, also known as Kettle, is a very powerful tool. It can be used for performing typical Extract, Transform, and Load (ETL) processes. PDI gets data from different sources and manipulates it in many ways (deduplicating, filtering, cleaning, and formatting, among others), saving the data in different formats and destinations. The following diagram illustrates a very simple example of an ETL process designed with PDI:

ETL process

Aside from the preceding processes, PDI serves to migrate data between applications, access and manipulate real-time data, access data in the cloud, orchestrate administrative tasks, and more.
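To make the extract, transform, and load stages concrete, here is a minimal sketch written in plain Python rather than in PDI. The sample records, the field names (name, city), and the function names are all hypothetical, chosen only to show deduplicating, filtering, cleaning, and formatting in one small pipeline.

```python
import csv
import io

# Hypothetical source data: a duplicate row, a row without a name, untidy spacing.
RAW_CSV = """name,city
 ana ,Buenos Aires
 ana ,Buenos Aires
,Lima
LUIS,  Quito
"""


def extract(text):
    """Extract: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean, format, filter, and deduplicate the rows."""
    seen = set()
    result = []
    for row in rows:
        name = row["name"].strip().title()  # cleaning and formatting
        city = row["city"].strip()
        if not name:  # filtering out rows without a name
            continue
        key = (name, city)
        if key in seen:  # deduplicating
            continue
        seen.add(key)
        result.append({"name": name, "city": city})
    return result


def load(rows):
    """Load: write the rows to a destination (here, an in-memory CSV)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "city"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    print(load(transform(extract(RAW_CSV))))
```

In PDI, each of these functions would roughly correspond to one or more steps placed on the canvas and connected to each other, instead of code.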

Installing PDI


The following are the instructions to install the PDI Community Edition (CE), irrespective of the operating system that you may be using:

  • Make sure that you have JRE 8.0 installed.

Note

If you don't have JRE 8.0 installed, download it from http://www.java.com and install it before proceeding. Make sure that the JAVA_HOME system variable is set.

PDI on SourceForge.net

  • Download the available ZIP file, which will serve you for all platforms.
  • Unzip the downloaded file in a folder of your choice (for example, c:/software/pdi or /home/pdi_user/pdi).
  • Browse your disk and look for the PDI folder that was just created. You will see a folder named data-integration, with several subfolders (lib, plugins, samples, and more) and a bunch of scripts (spoon.bat, pan.bat, and others), which we will soon learn how to use.
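The checks described in these steps can be sketched as a small Python script. This is only an illustration, not part of PDI: the function names are hypothetical, and the list of expected items is limited to the ones named above. Names are compared case-insensitively, on the assumption that the casing of the script names on disk may differ from the casing used in the text.

```python
import os
from pathlib import Path

# Items the text says to expect inside the data-integration folder.
EXPECTED = ("lib", "plugins", "samples", "spoon.bat", "pan.bat")


def missing_items(pdi_folder):
    """Return the expected items that are absent from <pdi_folder>/data-integration."""
    root = Path(pdi_folder) / "data-integration"
    if not root.is_dir():
        return list(EXPECTED)
    # Compare lowercased names so that Spoon.bat and spoon.bat both match.
    present = {entry.name.lower() for entry in root.iterdir()}
    return [name for name in EXPECTED if name not in present]


def java_home_is_set(env=None):
    """Return True when JAVA_HOME is defined and not empty."""
    env = os.environ if env is None else env
    return bool(env.get("JAVA_HOME", "").strip())


if __name__ == "__main__":
    folder = "/home/pdi_user/pdi"  # example path from the text; adjust to your install
    print("JAVA_HOME set:", java_home_is_set())
    print("Missing items:", missing_items(folder))
```

An empty list of missing items suggests that the ZIP file was unpacked where you expected it to be.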

Configuring the graphical designer tool


Spoon is PDI's desktop designer tool. With Spoon, you can design, preview, and test all of your work (that is, transformations and jobs).

Before starting to work with PDI, it's advisable to take a look at the Spoon interface and do some minimal configuration. The instructions are as follows:

  • Start Spoon: If your system is Windows, run Spoon.bat from within the PDI installation directory. On other platforms, such as Unix, Linux, and so on, open a Terminal window and type spoon.sh.
  • The main window will show up, with a Welcome! window already open, as shown in the following screenshot:

Welcome page

Note

The Welcome! page includes some links to web resources, forums, and more, as well as some shortcuts for working with PDI. You can reach that window at any time by navigating to the Help | Welcome Screen option.
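If you want to start Spoon from a script, the choice of launcher per platform can be sketched as follows. The function name spoon_launcher is hypothetical, and the script names are simply the ones mentioned in the instructions above; this is a sketch, not an official launcher.

```python
import platform
from pathlib import Path


def spoon_launcher(pdi_dir, system=None):
    """Return the path of the Spoon launcher script for the given OS name.

    `system` follows the values returned by platform.system(),
    for example "Windows", "Linux", or "Darwin" (macOS).
    """
    system = platform.system() if system is None else system
    script = "Spoon.bat" if system == "Windows" else "spoon.sh"
    return Path(pdi_dir) / script


if __name__ == "__main__":
    # Example installation path from the text; adjust it to your own.
    print(spoon_launcher("/home/pdi_user/pdi/data-integration"))
```

The returned path can then be passed to whatever mechanism you use to run programs, such as a desktop shortcut or a shell alias.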

In order to customize Spoon, do the following:

  • Click on Options... in the Tools menu. A window appears, where you can change various general characteristics, as follows:

Options

 

  • Many of the options in this tab will not make sense to you yet. Instead of doing anything here, select the tab Look & Feel:

Look & Feel options

  • Feel free to change any of the options in this tab (for example, the font color or size). Click on the OK button.
  • Restart Spoon to apply the changes.

Creating a simple transformation


Transformations and jobs are the main PDI artifacts. Transformations are data-flow oriented entities, while jobs are task-oriented. In this book, we will start by learning all about transformations, focusing on jobs later. To get a quick idea of what, exactly, a transformation is, we will start by creating a simple one. This will also allow you to see what it's like to work with Spoon.

Our first transformation will find out the current version of PDI (Kettle), and will print the value to the log. Proceed as follows:

  • On the Welcome page, click on the New transformation link, located under the WORK link group. Alternatively, press Ctrl + N.
  • A new tab will appear, with the title Transformation 1. It's in this tab that you will create your work.
  • To the left of the screen, under the Design tab, you'll see a tree of folders. Expand the Input folder by double-clicking on it.

Note

Note that if you work in macOS, a single click is enough.

  • Then, left-click on the Get System Info icon, and, without releasing the button, drag and drop the selected icon to the work area (that is, the blank area that occupies almost all of the screen). You should see something like this:

Dragging and dropping a step

  • Double-click on the Get System Info icon. A configuration window will show up. Fill in the first row in the grid, as shown in the following screenshot. Note that you don't have to type the Kettle version. Instead, you can choose it from a list of available options:

Configuring the Get System Info step

  • In the Design tab, double-click on the Utility folder, click on the Write to log icon, and drag and drop it to the work area.
  • Put the mouse cursor over the Get System Info icon and wait until a tiny toolbar shows up, as shown in the following screenshot:

Mouseover assistance toolbar

  • Click on the output connector (the icon highlighted in the preceding image) and drag it towards the Write to log icon. A greyed hop is displayed.
  • When the mouse cursor is over the Write to log step, release the button. A link (a hop, from now on) is created, from the first step to the second one. The screen should look as follows:

Connecting steps with a hop

Let's add a colored note to our work, as follows:

  • Right-click anywhere in the work area to bring up a contextual menu.
  • In the menu, select the New Note... option. A note editor will appear.
  • Type a description, such as My first transformation. Select the Font style tab and choose a nice font and some colors for your note, and then click on OK. The following should be the final result:

My first transformation

  • Save the transformation by pressing Ctrl + S. PDI will ask for a destination folder. Select the folder of your choice, and give the transformation a name. PDI will save the transformation as a file with a ktr extension (for example, sample_transformation.ktr).
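A saved .ktr file is an XML document, so it can be inspected outside of Spoon. The following sketch lists the steps and hops of a transformation with Python's standard library. The sample document is hand-written and heavily trimmed; its element names and step type values are assumptions for illustration, so compare them with a file that you saved yourself before relying on them.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for a saved transformation.
# Element names and step type values are assumptions for illustration.
SAMPLE_KTR = """<transformation>
  <info><name>sample_transformation</name></info>
  <order>
    <hop><from>Get System Info</from><to>Write to log</to><enabled>Y</enabled></hop>
  </order>
  <step><name>Get System Info</name><type>SystemInfo</type></step>
  <step><name>Write to log</name><type>WriteToLog</type></step>
</transformation>
"""


def describe(ktr_xml):
    """Return the transformation name, its steps, and its hops."""
    root = ET.fromstring(ktr_xml)
    name = root.findtext("info/name")
    steps = [(s.findtext("name"), s.findtext("type")) for s in root.findall("step")]
    hops = [(h.findtext("from"), h.findtext("to")) for h in root.findall("order/hop")]
    return name, steps, hops


if __name__ == "__main__":
    name, steps, hops = describe(SAMPLE_KTR)
    print("Transformation:", name)
    for step_name, step_type in steps:
        print("  step:", step_name, "/", step_type)
    for source, target in hops:
        print("  hop:", source, "->", target)
```

To try it on a real file, read the file's text and pass it to describe instead of SAMPLE_KTR.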

Finally, let's run the transformation to see what happens:

  • Click on the Run icon, located in the transformation toolbar:

Run icon in the transformation toolbar

  • A window named Run Options will appear. Click on Run.
  • At the bottom of the screen, you should see a log with the results of the execution:

Execution Results
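Conceptually, the transformation that you just ran is a small data flow: one step produces a row, a hop carries it to the next step, and that step writes it to the log. The following Python sketch mimics that flow with generators. It is an analogy, not how PDI is implemented; the field name kettle_version and the version value passed in are placeholders.

```python
import logging


def get_system_info(version):
    """Source step: emit a single row holding the version value."""
    yield {"kettle_version": version}


def write_to_log(rows, logger):
    """Sink step: log every incoming row and pass it along unchanged."""
    for row in rows:
        logger.info("row: %s", row)
        yield row


def run_transformation(version, logger=None):
    """Connect the two steps and run the flow to completion."""
    logger = logger or logging.getLogger("transformation")
    # The hop: the output of the first step feeds the second one.
    return list(write_to_log(get_system_info(version), logger))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    run_transformation("x.y.z")  # placeholder, not a real version number
```

Rows move from step to step one at a time, which is what makes a transformation data-flow oriented rather than task-oriented.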

Understanding the Kettle home directory


When you run Spoon for the first time, a folder named .kettle is created in your home directory by default. This folder is referred to as the Kettle home directory.

The folder contains several configuration files, mainly created and updated by the different PDI tools. Among these files, there is the kettle.properties file.

The purpose of the kettle.properties file, which is created along with the .kettle folder the first time you run Spoon, is to hold variable definitions with a broad scope: the whole Java Virtual Machine. Therefore, it's the perfect place to define general settings; some examples are as follows:

  • Database connection settings: host, database name, and so on
  • SMTP settings: SMTP server, port, and so on
  • Common input and output folders
  • Directory to send log files to

Before continuing, let's add some variables to the file. Suppose that you have two folders, named C:/PDI/INPUT and C:/PDI/OUTPUT, which you will use for storing files. The objective will be to add two variables, named INPUT_FOLDER and OUTPUT_FOLDER, containing those values:

  1. Locate the Kettle home directory, that is, the .kettle folder inside your home directory. If you work in Windows, your home directory could be C:\Documents and Settings\<your_name> or C:\Users\<your_name>, depending on which Windows version you have. If you work in Linux (or similar), it will most likely be /home/<your_name>/; in macOS, it will most likely be /Users/<your_name>/.
  2. Edit the kettle.properties file. You will see that it only contains commented sample lines.
  3. You can safely remove the contents of the file and define your own variables by typing the following lines:
       INPUT_FOLDER=C:/PDI/INPUT
       OUTPUT_FOLDER=C:/PDI/OUTPUT

Save the file and restart Spoon, so that it can recognize the variables defined in the file. We will learn how to use these variables in Chapter 2, Getting Familiar with Spoon.
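To double-check the result of the edit, the file can be read back with a few lines of Python. This is a simplified sketch with hypothetical function names: it only handles plain NAME=value lines and lines commented with #, which is enough for the two variables above but does not cover everything that the Java properties format allows.

```python
from pathlib import Path


def kettle_properties_path(home=None):
    """Return the default location of kettle.properties under a home directory."""
    home = Path.home() if home is None else Path(home)
    return home / ".kettle" / "kettle.properties"


def parse_properties(text):
    """Parse simple NAME=value lines, skipping blank and commented lines."""
    variables = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if separator:
            variables[key.strip()] = value.strip()
    return variables


# The two variables defined in the text, preceded by a commented line.
SAMPLE = """# commented sample line
INPUT_FOLDER=C:/PDI/INPUT
OUTPUT_FOLDER=C:/PDI/OUTPUT
"""

if __name__ == "__main__":
    path = kettle_properties_path()
    text = path.read_text() if path.is_file() else SAMPLE
    for name, value in parse_properties(text).items():
        print(name, "=", value)
```

If the file does not exist yet, the script falls back to the sample text, so it can be run before Spoon has been started for the first time.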

 

 

Summary


In this chapter, you were introduced to Pentaho Data Integration. Specifically, you learned what PDI is, and you installed the tool. You were introduced to Spoon, PDI's graphical designer tool, and you created your first transformation. You were also introduced to the Kettle home directory and the kettle.properties file, which will be used throughout the rest of the book.

In Chapter 2, Getting Familiar with Spoon, you will learn much more about the process of creating, testing, and running transformations in Spoon.


Key benefits

  • Take away the pain of starting with a complex and powerful system
  • Simplify your data transformation and integration work
  • Explore, transform, and validate your data with Pentaho Data Integration

Description

Pentaho Data Integration (PDI) is an intuitive, graphical environment packed with drag-and-drop design and powerful Extract-Transform-Load (ETL) capabilities. Given its power and flexibility, initial attempts to use the tool can be difficult or confusing. This book reduces your learning curve with PDI. It provides the guidance needed to make you productive, covering the main features of Pentaho Data Integration. It demonstrates the interactive features of the graphical designer, and takes you through the main ETL capabilities that the tool offers. By the end of the book, you will be able to use PDI for extracting, transforming, and loading the types of data you encounter on a daily basis.

What you will learn

  • Design, preview, and run transformations in Spoon
  • Run transformations using the Pan utility
  • Understand how to obtain data from different types of files
  • Connect to a database and explore it using the database explorer
  • Understand how to transform data in a variety of ways
  • Understand how to insert data into database tables
  • Design and run jobs for sequencing tasks and sending emails
  • Combine the execution of jobs and transformations

Product Details

Publication date : Aug 30, 2018
Length : 178 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781789343328
Vendor : Pentaho

Table of Contents

15 Chapters
Title Page
Copyright and Credits
Dedication
Packt Upsell
Foreword
Contributors
Preface
1. Getting Started with PDI
2. Getting Familiar with Spoon
3. Extracting Data
4. Transforming Data
5. Loading Data
6. Orchestrating Your Work
Other Books You May Enjoy
Index
