Pentaho Data Integration Beginner's Guide - Second Edition

Chapter 1. Getting Started with Pentaho Data Integration

Pentaho Data Integration (PDI) is an engine, along with a suite of tools, responsible for the processes of extracting, transforming, and loading data, also known as ETL processes. This book teaches you how to use PDI.

In this chapter, you will:

  • Learn what Pentaho Data Integration is
  • Install the software and start working with the PDI graphical designer
  • Install MySQL, a database engine that you will use when you start working with databases

Pentaho Data Integration and Pentaho BI Suite

Before introducing PDI, let’s talk about Pentaho BI Suite. The Pentaho Business Intelligence Suite is a collection of software applications intended to create and deliver solutions for decision making. The main functional areas covered by the suite are:

  • Analysis: The analysis engine serves multidimensional analysis. It’s provided by the Mondrian OLAP server.
  • Reporting: The reporting engine allows designing, creating, and distributing reports in various known formats (HTML, PDF, and so on), from different kinds of sources.
  • Data Mining: Data mining is used for running data through algorithms in order to understand the business and do predictive analysis. Data mining is possible thanks to the Weka Project.
  • Dashboards: Dashboards are used to monitor and analyze Key Performance Indicators (KPIs). The Community Dashboard Framework (CDF), a plugin developed by the community and integrated in the Pentaho BI Suite, allows the creation of interesting dashboards including charts, reports, analysis views, and other Pentaho content, without much effort.
  • Data Integration: Data integration is used to integrate scattered information from different sources (applications, databases, files, and so on), and make the integrated information available to the final user. Pentaho Data Integration—the tool that we will learn to use throughout the book—is the engine that provides this functionality.

All of this functionality can be used standalone but also integrated. In order to run analysis, reports, and so on, integrated as a suite, you have to use the Pentaho BI Platform. The platform has a solution engine, and offers critical services, for example, authentication, scheduling, security, and web services.

This set of software and services forms a complete BI Platform, which makes Pentaho Suite the world’s leading open source Business Intelligence Suite.

Exploring the Pentaho Demo

Although it is outside the scope of this book, it’s worth briefly introducing the Pentaho Demo. The Pentaho BI Platform Demo is a pre-configured installation that allows you to explore several capabilities of the Pentaho platform. It includes sample reports, cubes, and dashboards for Steel Wheels. Steel Wheels is a fictional store that sells all kinds of scale replicas of vehicles. The following screenshot is a sample dashboard available in the demo:

The Pentaho BI Platform Demo is free and can be downloaded from http://sourceforge.net/projects/pentaho/files/. Under the Business Intelligence Server folder, look for the latest stable version. By the time you read the book, Pentaho 5.0 may already have arrived. At the time of writing this book, the latest stable version is 4.8.0, so the file you have to download is biserver-ce-4.8.0-stable.zip for Windows and biserver-ce-4.8.0-stable.tar.gz for other systems.

Note

You can find out more about Pentaho BI Suite Community Edition at http://community.pentaho.com/projects/bi_platform. There is also an Enterprise Edition of the platform with additional features and support. You can find more on this at www.pentaho.org.

Pentaho Data Integration

Most of the Pentaho engines, including the engines mentioned earlier, were created as community projects and later adopted by Pentaho. The PDI engine is not an exception—Pentaho Data Integration is the new denomination for the business intelligence tool born as Kettle.

Note

The name Kettle didn’t come from the recursive acronym Kettle Extraction, Transportation, Transformation, and Loading Environment it has now. It came from KDE Extraction, Transportation, Transformation, and Loading Environment, since the tool was planned to be written on top of KDE, a Linux desktop environment, as mentioned in the introduction of the book.

In April 2006, the Kettle project was acquired by the Pentaho Corporation and Matt Casters, the Kettle founder, also joined the Pentaho team as a Data Integration Architect.

When Pentaho announced the acquisition, James Dixon, Chief Technology Officer said:

We reviewed many alternatives for open source data integration, and Kettle clearly had the best architecture, richest functionality, and most mature user interface. The open architecture and superior technology of the Pentaho BI Platform and Kettle allowed us to deliver integration in only a few days, and make that integration available to the community.

By joining forces with Pentaho, Kettle benefited from a huge developer community, as well as from a company that would support the future of the project.

Since then, the tool has grown without pause. Every few months a new release is available, bringing users improvements in performance and existing functionality, new functionality, ease of use, and great changes in look and feel. The following is a timeline of the major events related to PDI since its acquisition by Pentaho:

  • June 2006: PDI 2.3 is released. Numerous developers had joined the project and there were bug fixes provided by people in various regions of the world. The version included among other changes, enhancements for large-scale environments and multilingual capabilities.
  • February 2007: Almost seven months after the last major revision, PDI 2.4 is released, including remote execution and clustering support, enhanced database support, and a single designer for jobs and transformations, the two main kinds of elements you design in Kettle.
  • May 2007: PDI 2.5 is released including many new features; the most relevant being the advanced error handling.
  • November 2007: PDI 3.0 emerges totally redesigned. Its major library changed to gain massive performance. The look and feel had also changed completely.
  • October 2008: PDI 3.1 arrives, bringing a tool which was easier to use, and with a lot of new functionality as well.
  • April 2009: PDI 3.2 is released with a really large amount of changes for a minor version: new functionality, visualization and performance improvements, and a huge amount of bug fixes. The main change in this version was the incorporation of dynamic clustering.
  • June 2010: PDI 4.0 is released, delivering mostly improvements with regard to enterprise features, for example, version control. In the community version, the focus is on several visual improvements such as the mouseover assistance that you will experiment with soon.
  • November 2010: PDI 4.1 is released with many bug fixes.
  • August 2011: PDI 4.2 comes to light not only with a large amount of bug fixes, but also with a lot of improvements and new features. In particular, several of them were related to the work with repositories (see Appendix A, Working with Repositories for details).
  • April 2012: PDI 4.3 is released also with a lot of fixes, and a bunch of improvements and new features.
  • November 2012: PDI 4.4 is released. This version incorporates a lot of enhancements and new features. In this version there is a special emphasis on Big Data—the ability of reading, searching, and in general transforming large and complex collections of datasets.
  • 2013: PDI 5.0 will be released, delivering interesting low-level features such as step load balancing, job transactions, and restartability.

Using PDI in real-world scenarios

Paying attention to its name, Pentaho Data Integration, you could think of PDI as a tool to integrate data.

In fact, PDI does not serve only as a data integrator or an ETL tool. It is such a powerful tool that it is commonly used for these and many other purposes. Here are some examples.

Loading data warehouses or datamarts

The loading of a data warehouse or a datamart involves many steps, and there are many variants depending on business area, or business rules.

But in every case, without exception, the process involves the following steps:

  • Extracting information from one or different databases, text files, XML files and other sources. The extract process may include the task of validating and discarding data that doesn’t match expected patterns or rules.
  • Transforming the obtained data to meet the business and technical needs required on the target. Transformation implies tasks such as converting data types, doing some calculations, filtering irrelevant data, and summarizing.
  • Loading the transformed data into the target database. Depending on the requirements, the loading may overwrite the existing information, or may add new information each time it is executed.
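The three stages above can be sketched in a few lines of plain Python. This is not how Kettle works internally, and it is no substitute for a transformation designed in Spoon; it is only a minimal illustration of extract, transform, and load. The sample data, the column names, and the sales_summary table are hypothetical, and an in-memory SQLite database stands in for the target.

```python
import csv
import io
import sqlite3

# Hypothetical source data standing in for a text file.
SOURCE_CSV = """product,quantity,unit_price
car,2,10.50
truck,abc,7.00
car,1,10.50
plane,3,20.00
"""


def extract(text):
    """Read rows from the source and discard those that fail validation."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        # Validation rule (an assumption): quantity must be a whole number.
        if row["quantity"].isdigit():
            rows.append(row)
    return rows


def transform(rows):
    """Convert data types, calculate an amount, and summarize by product."""
    totals = {}
    for row in rows:
        amount = int(row["quantity"]) * float(row["unit_price"])
        totals[row["product"]] = totals.get(row["product"], 0.0) + amount
    return sorted(totals.items())


def load(summary, connection):
    """Overwrite the target table with the transformed data."""
    connection.execute("DROP TABLE IF EXISTS sales_summary")
    connection.execute("CREATE TABLE sales_summary (product TEXT, total REAL)")
    connection.executemany("INSERT INTO sales_summary VALUES (?, ?)", summary)
    connection.commit()


def run_etl(text, connection):
    """Chain the three stages: extract, then transform, then load."""
    load(transform(extract(text)), connection)


if __name__ == "__main__":
    target = sqlite3.connect(":memory:")
    run_etl(SOURCE_CSV, target)
    query = "SELECT product, total FROM sales_summary ORDER BY product"
    for product, total in target.execute(query):
        print(product, total)
```

Here the load stage overwrites the existing information on every run. Appending instead would mean dropping the DROP TABLE statement and creating the table only if it does not exist.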

Kettle comes ready to do every stage of this loading process. The following screenshot shows a simple ETL designed with Kettle:

Integrating data

Imagine two similar companies that need to merge their databases in order to have a unified view of the data, or a single company that has to combine information from a main ERP (Enterprise Resource Planning) application and a CRM (Customer Relationship Management) application, though they’re not connected. These are just two of hundreds of examples where data integration is needed. The integration is not just a matter of gathering and mixing data. Some conversions, validation, and transport of data have to be done. Kettle is meant to do all of those tasks.
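To make the idea concrete, here is a minimal Python sketch of the ERP and CRM case. It is only an illustration of the conversion and merging involved, not something Kettle generates. The record layouts, the field names, and the sample values are all hypothetical.

```python
# Hypothetical records from two applications that are not connected.
erp_customers = [
    {"customer_id": "0042", "name": "ACME Corp ", "balance": "1500.00"},
    {"customer_id": "0077", "name": "Globex", "balance": "320.50"},
]
crm_contacts = [
    {"id": 42, "email": "sales@acme.example"},
    {"id": 99, "email": "info@initech.example"},
]


def integrate(erp_rows, crm_rows):
    """Merge both sources into one view keyed by a normalized customer id."""
    unified = {}
    for row in erp_rows:
        key = int(row["customer_id"])  # conversion: the text "0042" becomes 42
        unified[key] = {
            "customer_id": key,
            "name": row["name"].strip(),  # remove stray whitespace
            "balance": float(row["balance"]),
            "email": None,
        }
    for row in crm_rows:
        key = row["id"]
        # Customers known only to the CRM still get a record in the result.
        record = unified.setdefault(
            key,
            {"customer_id": key, "name": None, "balance": None, "email": None},
        )
        record["email"] = row["email"]
    return [unified[key] for key in sorted(unified)]


if __name__ == "__main__":
    for customer in integrate(erp_customers, crm_contacts):
        print(customer)
```

Note that the two sources do not agree on the type of the key: one stores it as zero-padded text and the other as a number. Gathering the data is not enough; the conversion has to happen before the records can be matched.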

Data cleansing

It’s important, and even critical, that data be correct and accurate: for the efficiency of the business, to reach trustworthy conclusions in data mining or statistical studies, and to succeed when integrating data. Data cleansing is about ensuring that the data is correct and precise. This can be achieved by verifying whether the data meets certain rules, discarding or correcting data that doesn’t follow the expected pattern, setting default values for missing data, eliminating duplicated information, normalizing data to conform to minimum and maximum values, and so on. These are tasks that Kettle makes possible thanks to its vast set of transformation and validation capabilities.
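The following Python sketch shows what a few of these cleansing rules can look like in code. It is an illustration only: the field names, the rule for a valid e-mail, the default country, and the age limits are assumptions made for the example, and normalizing to minimum and maximum values is interpreted here as clamping.

```python
def cleanse(rows, min_age=0, max_age=120, default_country="Unknown"):
    """Apply a few simple cleansing rules to a list of dictionaries."""
    seen = set()
    clean = []
    for row in rows:
        email = (row.get("email") or "").strip().lower()
        # Rule check: discard rows that do not follow the expected pattern.
        if "@" not in email:
            continue
        # Eliminate duplicated information, using the e-mail as the key.
        if email in seen:
            continue
        seen.add(email)
        # Set a default value for missing data.
        country = row.get("country") or default_country
        # Force the value into the allowed minimum and maximum range.
        age = min(max(row["age"], min_age), max_age)
        clean.append({"email": email, "country": country, "age": age})
    return clean


if __name__ == "__main__":
    dirty = [
        {"email": "Ann@Example.com ", "country": "AR", "age": 34},
        {"email": "ann@example.com", "country": "AR", "age": 34},
        {"email": "not-an-email", "country": "US", "age": 20},
        {"email": "bob@example.com", "country": None, "age": 150},
    ]
    for row in cleanse(dirty):
        print(row)
```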

Migrating information

Think of a company of any size that uses a commercial ERP application. One day the owners realize that the licenses are consuming a significant share of the budget, so they decide to migrate to an open source ERP. The company will no longer have to pay for licenses, but in order to change, it will have to migrate the information. Obviously, starting from scratch or typing the information by hand is not an option. Kettle makes the migration possible thanks to its ability to interact with most kinds of sources and destinations such as plain files, commercial and free databases, and spreadsheets, among others.

Exporting data

Data may need to be exported for numerous reasons:

  • To create detailed business reports
  • To allow communication between different departments within the same company
  • To deliver data from your legacy systems to obey government regulations, and so on

Kettle has the power to take raw data from the source and generate these kinds of ad hoc reports.

Integrating PDI along with other Pentaho tools

The previous examples show typical uses of PDI as a standalone application. However, Kettle may be used embedded as part of a process or a dataflow. Some examples are pre-processing data for an online report, sending mails in a scheduled fashion, generating spreadsheet reports, feeding a dashboard with data coming from web services, and so on.

Note

The use of PDI integrated with other tools is beyond the scope of this book. If you are interested, you can find more information on this subject in the Pentaho Data Integration 4 Cookbook by Packt Publishing at http://www.packtpub.com/pentaho-data-integration-4-cookbook/book.

Pop quiz – PDI data sources

Q1. Which of the following are not valid sources in Kettle?

  1. Spreadsheets.
  2. Free database engines.
  3. Commercial database engines.
  4. Flat files.
  5. None.

Installing PDI

In order to work with PDI, you need to install the software. It’s a simple task, so let’s do it now.

Time for action – installing PDI

These are the instructions to install PDI, for whatever operating system you may be using.

The only prerequisite to install the tool is to have JRE 6.0 installed. If you don’t have it, please download it from www.javasoft.com and install it before proceeding. Once you have checked the prerequisite, follow these steps:

  1. Go to the download page at http://sourceforge.net/projects/pentaho/files/Data Integration.
  2. Choose the newest stable release. At this time, it is 4.4.0, as shown in the following screenshot:
  3. Download the file that matches your platform. The preceding screenshot should help you.
  4. Unzip the downloaded file in a folder of your choice, for example, c:/util/kettle or /home/pdi_user/kettle.
  5. If your system is Windows, you are done. Under Unix-like environments, you have to make the scripts executable. Assuming that you chose /home/pdi_user/kettle as the installation folder, execute:
    cd /home/pdi_user/kettle
    chmod +x *.sh
  6. In Mac OS you have to give execute permissions to the JavaApplicationStub file. Look for this file; it is located in Data Integration 32-bit.app/Contents/MacOS/ or Data Integration 64-bit.app/Contents/MacOS/, depending on your system.

What just happened?

You have installed the tool in just a few minutes. Now, you have all you need to start working.

Pop quiz – PDI prerequisites

Q1. Which of the following are mandatory to run PDI? You may choose more than one option.

  1. Windows operating system.
  2. Pentaho BI platform.
  3. JRE 6.
  4. A database engine.

Launching the PDI graphical designer – Spoon

Now that you’ve installed PDI, you must be eager to do some stuff with data. That will be possible only inside a graphical environment. PDI has a desktop designer tool named Spoon. Let’s launch Spoon and see what it looks like.

Time for action – starting and customizing Spoon

In this section, you are going to launch the PDI graphical designer and become familiar with its main features.

  1. Start Spoon.
    • If your system is Windows, run Spoon.bat

      Tip

      You can just double-click on the Spoon.bat icon, or Spoon if your Windows system doesn’t show extensions for known file types. Alternatively, open a command window by selecting Run in the Windows start menu and executing cmd, and then run Spoon.bat in the terminal.

    • On other platforms such as Unix, Linux, and so on, open a terminal window and type spoon.sh
    • If you didn’t make spoon.sh executable, you may type sh spoon.sh
    • Alternatively, if you work on Mac OS, you can execute the JavaApplicationStub file, or click on the Data Integration 32-bit.app, or Data Integration 64-bit.app icon
  2. As soon as Spoon starts, a dialog window appears asking for the repository connection data. Click on the Cancel button.

    Note

    Repositories are explained in Appendix A, Working with Repositories. If you want to know what a repository connection is about, you will find the information in that appendix.

  3. A small window labeled Spoon tips... appears. You may want to navigate through various tips before starting. Eventually, close the window and proceed.
  4. Finally, the main window shows up. A Welcome! window appears with some useful links for you to see. Close the window. You can open it later from the main menu.
  5. Click on Options... from the menu Tools. A window appears where you can change various general and visual characteristics. Uncheck the highlighted checkboxes, as shown in the following screenshot:
  6. Select the tab window Look & Feel.
  7. Change the Grid size and Preferred Language settings as shown in the following screenshot:
  8. Click on the OK button.
  9. Restart Spoon in order to apply the changes. You should not see the repository dialog, or the Welcome! window. You should see the following screenshot full of French words instead:

What just happened?

You ran Spoon, the graphical designer of PDI, for the first time. Then you applied some custom configuration.

In the Options window, you chose not to show the repository dialog or the Welcome! window at startup. On the Look & Feel tab, you changed the size of the dotted grid that appears in the canvas area while you are working. You also changed the preferred language. These changes were applied when you restarted the tool, not before.

The second time you launched the tool, the repository dialog didn’t show up. When the main window appeared, all of the visible text was shown in French, the selected language, and instead of the Welcome! window, there was a blank screen.

You didn’t see the effect of the change in the Grid option. You will see it only after creating or opening a transformation or job, which will occur very soon!

Spoon

Spoon, the tool you’re exploring in this section, is PDI’s desktop design tool. With Spoon, you design, preview, and test all your work, that is, transformations and jobs. When you see PDI screenshots, what you are really seeing are Spoon screenshots. The other PDI components, which you will learn about in the following chapters, are executed from terminal windows.

Setting preferences in the Options window

In the earlier section, you changed some preferences in the Options window. There are several look and feel characteristics you can modify beyond those you changed. Feel free to experiment with these settings.

Note

Remember to restart Spoon in order to see the changes applied.

In particular, please take note of the following suggestion about the configuration of the preferred language.

Tip

If you choose a preferred language other than English, you should select a different language as an alternative. If you do so, every name or description not translated to your preferred language, will be shown in the alternative language.

One of the settings that you changed was the appearance of the Welcome! window at startup. The Welcome! window has many useful links, which are all related with the tool: wiki pages, news, forum access, and more. It’s worth exploring them.

Tip

You don’t have to change the settings again to see the Welcome! window. You can open it by navigating to Help | Welcome Screen.

Storing transformations and jobs in a repository

The first time you launched Spoon, you chose not to work with repositories. After that, you configured Spoon to stop asking you for the Repository option. You must be curious about what the repository is and why we decided not to use it. Let’s explain it.

As we said, the results of working with PDI are transformations and jobs. In order to save the transformations and jobs, PDI offers two main methods:

  • Database repository: When you use the database repository method, you save jobs and transformations in a relational database specially designed for this purpose.
  • Files: The files method consists of saving jobs and transformations as regular XML files in the filesystem, with extension KJB and KTR respectively.

The two methods cannot be mixed in the same project. That is, it makes no sense to combine jobs and transformations in a database repository with jobs and transformations stored in files. Therefore, you must choose the method when you start the tool.

Note

By clicking on Cancel in the repository window, you are implicitly saying that you will work with the files method.

Why did we choose not to work with repositories? Or, in other words, to work with the files method? Mainly for two reasons:

  • Working with files is more natural and practical for most users.
  • Working with a database repository requires at least minimal database knowledge, and access to a database engine from your computer. Although it would be an advantage for you to meet both preconditions, you may not meet both of them.

There is a third method called File repository, which is a mix of the two above: it’s a repository of jobs and transformations stored in the filesystem. Between the File repository and the files method, the latter is the more widely used. Therefore, throughout this book we will use the files method. For details of working with repositories, please refer to Appendix A, Working with Repositories.

Creating your first transformation

Until now, you’ve seen the very basic elements of Spoon. You must be waiting to do some interesting task beyond looking around. It’s time to create your first transformation.

Time for action – creating a hello world transformation

How about starting by saying hello to the world? It's not really new, but good enough for our first practical example; here are the steps to follow:

  1. Create a folder named pdi_labs under a folder of your choice.
  2. Open Spoon.
  3. From the main menu, navigate to File | New | Transformation.
  4. On the left of the screen, under the Design tab, you’ll see a tree of Steps. Expand the Input branch by double-clicking on it.

    Note

    Note that if you work in Mac OS a single click is enough.

  5. Then, left-click on the Generate Rows icon and without releasing the button, drag-and-drop the selected icon to the main canvas. The screen will look like the following screenshot:

    Note

    Note that we changed the preferred language back to English.

  6. Double-click on the Generate Rows step you just put in the canvas, and fill in the textboxes, including Step name and Limit, and the grid as follows:
  7. From the Steps tree, double-click on the Flow branch.
  8. Click on the Dummy (do nothing) icon and drag-and-drop it to the main canvas.
  9. Put the mouse cursor over the Generate Rows step and wait until a tiny toolbar shows up below the entry icon, as shown in the following screenshot:
  10. Click on the output connector (the last icon in the toolbar), and drag towards the Dummy (do nothing) step. A grayed hop is displayed.
  11. When the mouse cursor is over the Dummy (do nothing) step, release the button. A link—a hop from now on—is created from the Generate Rows step to the Dummy (do nothing) step. The screen should look like the following screenshot:
  12. Right-click anywhere on the canvas to bring up a contextual menu.
  13. In the menu, select the New note option. A note editor appears.
  14. Type some description such as Hello, World! Select the Font style tab and choose some nice font and colors for your note, and then click on OK.
  15. From the main menu, navigate to Edit | Settings.... A window appears to specify transformation properties. Fill the Transformation name textbox with a simple name, such as hello world. Fill the Description textbox with a short description such as My first transformation. Finally, provide a clearer explanation in the Extended description textbox, and then click on OK.
  16. From the main menu, navigate to File | Save.
  17. Save the transformation in the folder pdi_labs with the name hello_world.
  18. Select the Dummy (do nothing) step by left-clicking on it.
  19. Click on the Preview icon in the bar menu above the main canvas. The screen should look like the following screenshot:
  20. The Transformation debug dialog window appears. Click on the Quick Launch button.
  21. A window appears to preview the data generated by the transformation as shown in the following screenshot:
  22. Close the preview window and click on the Run icon. The screen should look like the following screenshot:
  23. A window named Execute a transformation appears. Click on Launch.
  24. The execution results are shown at the bottom of the screen. The Logging tab should look as follows:

What just happened?

You have just created your first transformation.

First, you created a new transformation, dragged-and-dropped into the work area two steps: Generate Rows and Dummy (do nothing), and connected them.

With the Generate Rows step you created 10 rows of data with the message Hello World! The Dummy (do nothing) step simply served as a destination of those rows.

After creating the transformation, you did a preview. The preview allowed you to see the content of the created data, that is, the 10 rows with the message Hello World!

Finally, you ran the transformation. You could then see the Execution Results window at the bottom of the screen, where the Logging tab shows the complete detail of what happened. There are other tabs in this window, which you will learn about later in the book.
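If you are used to programming, it may help to see what this transformation does expressed as code. The following Python sketch imitates the behavior of the two steps with generators. It is not PDI code and says nothing about how the Kettle engine is implemented. The function names and the field name message are hypothetical, since the actual field name is whatever you typed in the grid of the Generate Rows step.

```python
from itertools import islice, repeat


def generate_rows(limit, fields):
    """Emit `limit` rows, each one a copy of the constant `fields`."""
    for row in islice(repeat(fields), limit):
        yield dict(row)


def dummy(rows):
    """Pass every row through unchanged, like a do-nothing step."""
    for row in rows:
        yield row


def preview(rows, sample_size):
    """Return only the first `sample_size` rows, as a preview would."""
    return list(islice(rows, sample_size))


if __name__ == "__main__":
    # Build the flow: Generate Rows feeding Dummy (do nothing).
    flow = dummy(generate_rows(10, {"message": "Hello World!"}))
    for row in flow:
        print(row)
```

The rows travel from one function to the next one at a time, which mirrors the idea of data flowing from a step to the following one.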

Directing Kettle engine with transformations

A transformation is an entity made of steps linked by hops. These steps and hops build paths through which data flows—the data enters or is created in a step, the step applies some kind of transformation to it, and finally the data leaves that step. Therefore, it’s said that a transformation is data flow oriented.

A transformation itself is neither a program nor an executable file. It is just plain XML. The transformation contains metadata which tells the Kettle engine what to do.
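Because a transformation is plain XML, any XML-aware tool can read it. The Python sketch below parses a heavily simplified document modeled on a saved transformation and lists its steps and hops. The element names used here are an assumption made for illustration, and a real KTR file contains far more detail, so open your own hello_world.ktr in a text editor to see the actual structure before relying on any of these names.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the content of a KTR file.
SIMPLIFIED_KTR = """<transformation>
  <info>
    <name>hello world</name>
    <description>My first transformation</description>
  </info>
  <order>
    <hop>
      <from>Generate Rows</from>
      <to>Dummy (do nothing)</to>
      <enabled>Y</enabled>
    </hop>
  </order>
  <step><name>Generate Rows</name><type>RowGenerator</type></step>
  <step><name>Dummy (do nothing)</name><type>Dummy</type></step>
</transformation>"""


def describe(xml_text):
    """Extract the transformation name, its steps, and its hops."""
    root = ET.fromstring(xml_text)
    return {
        "name": root.findtext("info/name"),
        "steps": [
            (step.findtext("name"), step.findtext("type"))
            for step in root.findall("step")
        ],
        "hops": [
            (hop.findtext("from"), hop.findtext("to"))
            for hop in root.findall("order/hop")
        ],
    }


if __name__ == "__main__":
    summary = describe(SIMPLIFIED_KTR)
    print("Transformation:", summary["name"])
    for name, step_type in summary["steps"]:
        print("Step:", name, "of type", step_type)
    for origin, destination in summary["hops"]:
        print("Hop:", origin, "->", destination)
```

Nothing in the file is executable. It only describes which steps exist and how they are connected, and the engine does the rest.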

A step is the minimal unit inside a transformation. A big set of steps is available. These steps are grouped in categories such as the Input and Flow categories that you saw in the example.

Each step is conceived to accomplish a specific function, going from reading a parameter to normalizing a dataset.

Each step has a configuration window. These windows vary according to the functionality of the steps and the category to which they belong. What all steps have in common are the name and description:

  • Name: A representative name inside the transformation.
  • Description: A brief explanation that allows you to clarify the purpose of the step. It’s not mandatory but it is useful.

A hop is a graphical representation of data flowing between two steps: an origin and a destination. The data that flows through that hop constitutes the output data of the origin step and the input data of the destination step.
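One way to picture these concepts is as a small data model: a transformation holds steps, and hops connect an origin step to a destination step. The Python sketch below is such a model. The class and method names are hypothetical and are not part of any PDI API, and the rule that step names must be unique is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Step:
    name: str              # a representative name inside the transformation
    description: str = ""  # optional, but useful


@dataclass(frozen=True)
class Hop:
    origin: str       # name of the step whose output flows through the hop
    destination: str  # name of the step that receives that data as input


@dataclass
class Transformation:
    name: str
    steps: dict = field(default_factory=dict)
    hops: list = field(default_factory=list)

    def add_step(self, step):
        """Register a step, refusing a second step with the same name."""
        if step.name in self.steps:
            raise ValueError("duplicate step name: " + step.name)
        self.steps[step.name] = step

    def add_hop(self, origin, destination):
        """Link two existing steps; unknown names raise KeyError."""
        for name in (origin, destination):
            if name not in self.steps:
                raise KeyError(name)
        self.hops.append(Hop(origin, destination))

    def destinations_of(self, step_name):
        """Return the names of the steps fed by the given step."""
        return [hop.destination for hop in self.hops if hop.origin == step_name]


if __name__ == "__main__":
    hello = Transformation("hello world")
    hello.add_step(Step("Generate Rows", "Creates the rows"))
    hello.add_step(Step("Dummy (do nothing)"))
    hello.add_hop("Generate Rows", "Dummy (do nothing)")
    print(hello.destinations_of("Generate Rows"))
```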

Exploring the Spoon interface

As you just saw, Spoon is the tool with which you create, preview, and run transformations. The following screenshot shows you the basic work areas: Main menu, Design view, Transformation toolbar, and Canvas (work area):

Note

The words canvas and work area will be used interchangeably throughout the book.

There is also an area named View that shows the structure of the transformation currently being edited. You can see that area by clicking on the View tab at the upper-left corner of the screen:

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com . If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Designing a transformation

In the earlier section, you designed a very simple transformation, with just two steps and one explanatory note. You learned to link steps by using the mouseover assistance toolbar. There are alternative ways to do the same thing. You can use the one that you feel most comfortable with. Appendix D, Spoon Shortcuts, explains all of the different options. It also describes many shortcuts, for example, to zoom in and out and to align the steps. These shortcuts are very useful as your transformations become more complex.

Note

Appendix F, Best Practices, explains the benefit of using shortcuts as well as other best practices that are invaluable when you work with Spoon, especially when you have to design and develop big ETL projects.

Running and previewing the transformation

The Preview functionality allows you to see a sample of the data produced for selected steps. In the previous example, you previewed the output of the Dummy (do nothing) step.

The Run icon effectively runs the whole transformation.

Whether you preview or run a transformation, you’ll get an Execution Results window showing what happened. You will learn more about this in the next chapter.

Pop quiz – PDI basics

Q1. There are several graphical tools in PDI, but Spoon is the most used.

  1. True.
  2. False.

Q2. You can choose to save transformations either in files or in a database.

  1. True.
  2. False.

Q3. To run a transformation, an executable file has to be generated from Spoon.

  1. True.
  2. False.

Q4. The grid size option in the Look & Feel window allows you to resize the work area.

  1. True.
  2. False.

Q5. To create a transformation you have to provide external data (for example, a text file, spreadsheet, database, and so on).

  1. True.
  2. False.

Installing MySQL

Before skipping to the next chapter, let’s devote some time to the installation of MySQL.

In Chapter 8, Working with Databases, you will begin working with databases from PDI. In order to do that, you will need access to a database engine. As MySQL is the world’s most popular open source database, it was the database engine chosen for the database-related tutorials in this book.

In this section, you will learn how to install the MySQL database engine both on Windows and on Ubuntu, the most popular distribution of Linux these days. As the procedures for installing the software are different, a separate explanation is given for each system.

Note

Mac users may refer to the Ubuntu section, as the installation procedure is similar for both systems.

Time for action – installing MySQL on Windows

In order to install MySQL on your Windows system, please follow these instructions:

  1. Open an Internet browser and type http://dev.mysql.com/downloads/installer.
  2. You will be directed to a page with the downloadable installer. Click on Download and the download process begins.
  3. Double-click on the downloaded file, whose name should be mysql-installer-community-5.5.29.0.msi or similar, depending on the version available when you download it.
  4. In the window that shows up, select Install MySQL Products. A wizard will guide you through the process.
  5. When asked to choose a setup type, select Server only.
  6. Several screens follow. In all cases, leave the proposed default values. If you are prompted for the installation of missing components (for example, Microsoft .NET Framework 4 Client Profile), accept it, or you will not be able to continue.
  7. When the installation is complete, you will have to configure the server. You will have to supply a password for the root user.

    Note

    MySQL will not allow remote connections by default, so a simple password such as 123456 or passwd will suffice. Stronger passwords are necessary only if you plan to open up the MySQL server to external connections.

  8. Optionally, you will have the choice of creating additional users. The following screenshot shows this step of the installation. In this case, we are telling the installer to create a user named pdi_user with the role of a DB Designer:
    [Screenshot: MySQL Installer screen for creating the pdi_user account]
  9. When the configuration process is complete, click on Finish.
  10. MySQL server is now installed as a service. To verify that the installation has been successful, navigate to Control Panel | Administrative Tools | Services, and look for MySQL. This is what you should see:
    [Screenshot: the MySQL service listed in the Services window]
  11. At any moment you can start or stop the service using the buttons in the menu bar at the top of the Services window, or the contextual menu that appears when you right-click on the service.

What just happened?

You downloaded and installed MySQL on your Windows system, using the MySQL Installer software. MySQL Installer simplifies the installation and upgrading of MySQL server and all related products. However, using this software is not the only option you have.

Note

For custom installations of MySQL or for troubleshooting you can visit http://dev.mysql.com/doc/refman/5.5/en/windows-installation.html.

Time for action – installing MySQL on Ubuntu

This section shows you the procedure to install MySQL on Ubuntu. Before starting, please note that some Ubuntu installations already include MySQL. If that’s the case, you’re done. If not, please follow these instructions:

Note

In order to follow the tutorial, you need to be connected to the Internet.

  1. Open Ubuntu Software Center.
  2. In the search textbox, type mysql. A list of results will be displayed as shown in the following screenshot:
    [Screenshot: Ubuntu Software Center search results for mysql]
  3. Among the results, look for MySQL Server and click on it. In the window that shows up, click on Install. The installation begins.

    Note

    Note that if MySQL is already installed, this button will not be available.

  4. At a particular moment, you will be prompted for a password for the root user—the administrator of the database engine. Enter a password of your choice. You will have to enter it twice.
  5. When the installation ends, the MySQL server should start automatically. To check if the server is running, open a terminal and run this:
    sudo netstat -tap | grep mysql
    
  6. You should see the following line or similar:
    tcp    0   0 localhost:mysql       *:*      LISTEN  -
    
  7. At any moment, you can start the service using this command:
    sudo service mysql start
    
  8. Or stop it using this:
    sudo service mysql stop
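
The netstat check from steps 5 and 6 can also be scripted. The following is only a minimal sketch: the function name mysql_is_listening is hypothetical, and it assumes that the server shows up in the netstat output either under the service name mysql or under port 3306, the default MySQL port. Here it is fed the sample line from step 6; on a real system you would pipe the output of sudo netstat -tap into it instead.

```shell
#!/bin/sh
# Hypothetical helper: reads netstat-style output on stdin and succeeds
# if any line shows a listening socket for MySQL, identified either by
# the service name "mysql" or by the default port 3306.
mysql_is_listening() {
    grep -E '[.:](mysql|3306)[[:space:]]' | grep -q 'LISTEN'
}

# Sample line copied from step 6. On a real system, use instead:
#   sudo netstat -tap | mysql_is_listening
sample='tcp    0   0 localhost:mysql       *:*      LISTEN  -'

if printf '%s\n' "$sample" | mysql_is_listening; then
    echo "MySQL is listening"
else
    echo "MySQL is not listening"
fi
```

With the sample line, the script prints MySQL is listening. A line for another service, or a MySQL connection that is not in the LISTEN state, makes the function fail.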
    

What just happened?

You installed MySQL server on your Ubuntu system. The screens shown in this section belong to Version 12 of the operating system.
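
If you prefer the terminal, you can check whether the server is already present before opening the Software Center. The following is a rough sketch rather than an official procedure: the helper name has_command is hypothetical, and it assumes that the server binary is called mysqld and can be found on your PATH, which may not hold on every system. The apt-get command it prints is the terminal equivalent of installing the MySQL Server package.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named program is found on the PATH.
has_command() {
    command -v "$1" >/dev/null 2>&1
}

# Assumption: the MySQL server binary is named mysqld and is on the PATH.
if has_command mysqld; then
    echo "MySQL server found: $(command -v mysqld)"
else
    echo "MySQL server not found."
    echo "To install it from a terminal: sudo apt-get install mysql-server"
fi
```

The script only reports what it finds; it does not install anything by itself.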

Note

The previous directions are for a standard installation. For custom installations, you can visit https://help.ubuntu.com/12.04/serverguide/mysql.html. For instructions related to other operating systems or for troubleshooting information, you can check the MySQL documentation at http://dev.mysql.com/doc/refman/5.5/en/installing.html.

Have a go hero – installing visual software for administering and querying MySQL

Besides the MySQL server, it’s recommended that you install some visual software that will allow you to administer and query MySQL. Now it’s your turn to look for a tool of your choice and install it.

One option would be installing the official GUI tool: MySQL Workbench. On Windows, you can install it with the MySQL Installer. In Ubuntu, the installation process is similar to that of the MySQL server.

Another option would be to install a generic open source tool, for example, SQuirreL SQL Client, a graphical program that will allow you to work with MySQL as well as with other database engines. For more information about this software, visit http://squirrel-sql.sourceforge.net/.

Summary

In this chapter, you were introduced to Pentaho Data Integration. Specifically, you learned what Pentaho Data Integration is and you installed the tool. You were also introduced to Spoon, the graphical designer tool of PDI, and created your first transformation.

As an additional exercise, you installed a MySQL server. You will need this software when you start working with databases in Chapter 8, Working with Databases.

Now that you have learned the basics, you are ready to begin experimenting with transformations. That is the topic of the next chapter.


Key benefits

  • Manipulate your data by exploring, transforming, validating, and integrating it
  • Learn to migrate data between applications
  • Explore several features of Pentaho Data Integration 5.0
  • Connect to any database engine, explore the databases, and perform all kinds of operations on databases

Description

Capturing, manipulating, cleansing, transferring, and loading data effectively are the prime requirements in every IT organization. Achieving these tasks requires people devoted to developing extensive software programs, or investing in ETL or data integration tools that can simplify this work. Pentaho Data Integration is a full-featured open source ETL solution that allows you to meet these requirements. Pentaho Data Integration has an intuitive, graphical, drag-and-drop design environment and its ETL capabilities are powerful. However, getting started with Pentaho Data Integration can be difficult or confusing. "Pentaho Data Integration Beginner's Guide - Second Edition" provides the guidance needed to overcome that difficulty, covering the key features of Pentaho Data Integration.

"Pentaho Data Integration Beginner's Guide - Second Edition" starts with the installation of the Pentaho Data Integration software and then moves on to cover all the key Pentaho Data Integration concepts. Each chapter introduces new features, allowing you to gradually get involved with the tool. First, you will learn to do all kinds of data manipulation and work with plain files. Then, the book gives you a primer on databases and teaches you how to work with databases inside Pentaho Data Integration. Moreover, you will be introduced to data warehouse concepts and you will learn how to load data into a data warehouse. After that, you will learn to implement simple and complex processes. Finally, you will have the opportunity to apply and reinforce all the learned concepts through the implementation of a simple datamart.

With "Pentaho Data Integration Beginner's Guide - Second Edition", you will learn everything you need to know in order to meet your data manipulation requirements.

Who is this book for?

This book is a must-have for software developers, database administrators, IT students, and everyone involved or interested in developing ETL solutions, or, more generally, doing any kind of data manipulation. Those who have never used Pentaho Data Integration will benefit most from the book, but those who have will also find it useful. This book is also a good starting point for database administrators, data warehouse designers, architects, or anyone who is responsible for data warehouse projects and needs to load data into them.

What you will learn

  • Install and get started with Pentaho Data Integration
  • Get started with MySQL
  • Learn the ins and outs of Spoon, the graphical designer tool
  • Transform data in several ways such as performing simple and complex calculations, cleaning, counting, de-duplicating, filtering, and ordering
  • Learn to get data from all kinds of data sources, such as plain files, Excel spreadsheets, databases, XML files, and more, then preview it, and send it back to the same or different destinations
  • Discover how to read and parse unstructured files
  • Embed Java and JavaScript code in your Pentaho Data Integration transformations to enrich the treatment of data
  • Use Pentaho Data Integration to perform CRUD (create, read, update, and delete) operations on databases
  • Learn the basic concepts of data warehousing
  • Populate a data warehouse with Pentaho Data Integration including loading slowly changing dimensions, junk dimensions, time dimensions and more
  • Implement business processes by scheduling tasks, checking conditions, organizing files and folders, running daily processes, treating errors, and so on in a way that meets your requirements

Product Details

Publication date: Oct 24, 2013
Length: 502 pages
Edition: 2nd
Language: English
ISBN-13: 9781782165057
Vendor: Pentaho



Table of Contents

20 Chapters
1. Getting Started with Pentaho Data Integration
2. Getting Started with Transformations
3. Manipulating Real-world Data
4. Filtering, Searching, and Performing Other Useful Operations with Data
5. Controlling the Flow of Data
6. Transforming Your Data by Coding
7. Transforming the Rowset
8. Working with Databases
9. Performing Advanced Operations with Databases
10. Creating Basic Task Flows
11. Creating Advanced Transformations and Jobs
12. Developing and Implementing a Simple Datamart
A. Working with Repositories
B. Pan and Kitchen – Launching Transformations and Jobs from the Command Line
C. Quick Reference – Steps and Job Entries
D. Spoon Shortcuts
E. Introducing PDI 5 Features
F. Best Practices
G. Pop Quiz Answers
Index

Customer reviews

Rating: 4.1 out of 5 (10 ratings)
5 star 70%
4 star 10%
3 star 0%
2 star 0%
1 star 20%

Elkin A. Cantillo Garcia, Sep 08, 2015 (5 stars, Amazon verified review)
Excellent book

Profound Reviewer, Jun 13, 2017 (5 stars, Amazon verified review)
This is really a wonderful resource. I understand both the basics of Pentaho and much more beyond that. The book is filled with example after example. All are helpful and useful. I can't imagine a better introduction to Pentaho. Highly recommended!

Jaime A Solares Silva, Jul 09, 2015 (5 stars, Amazon verified review)
Excellent book

Antonio Tostes Jr., Sep 19, 2014 (5 stars, Amazon verified review)
Extremely helpful!

Hugh Powers, May 19, 2016 (5 stars, Amazon verified review)
Good introduction book if you're looking to get into using Kettle. The author does a good job of presenting the subject using good examples, and doesn't get overly techie with the reader. Good book also for those who know SSIS but are looking for other ETL tools to explore.
