Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Pentaho Data Integration Cookbook - Second Edition

You're reading from   Pentaho Data Integration Cookbook - Second Edition The premier open source ETL tool is at your command with this recipe-packed cookbook. Learn to use data sources in Kettle, avoid pitfalls, and dig out the advanced features of Pentaho Data Integration the easy way.

Arrow left icon
Product type Paperback
Published in Dec 2013
Publisher Packt
ISBN-13 9781783280674
Length 462 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Toc

Table of Contents (21) Chapters Close

Pentaho Data Integration Cookbook Second Edition
Credits
About the Author
About the Reviewers
www.PacktPub.com
Preface
1. Working with Databases FREE CHAPTER 2. Reading and Writing Files 3. Working with Big Data and Cloud Sources 4. Manipulating XML Structures 5. File Management 6. Looking for Data 7. Understanding and Optimizing Data Flows 8. Executing and Re-using Jobs and Transformations 9. Integrating Kettle and the Pentaho Suite 10. Getting the Most Out of Kettle 11. Utilizing Visualization Tools in Kettle 12. Data Analytics Data Structures References Index

Inserting, deleting, or updating a table depending on a field


PDI allows you to perform basic operations that modify the data in your tables, that is: insert, update, and delete records. For each of those operations you have at least one step that allows you to do the task. It may happen that you have to do one or another operation depending on the value of a field. That is possible with a rather unknown step named Synchronize after merge.

Suppose you have a database with books. You received a file with a list of books. In that list there are books you already have and there are books you don't have.

For the books you already have, you intend to update the prices.

Among the other books, you will insert in your database only those which have been published recently. You will recognize them because they have the text NEW in the Comment field.

Getting ready

For this recipe, you will need the database which can be created and filled by running the script books_2.sql. You also will need the file books_news.txt that accompanies the material for this chapter.

As the recipe will modify the data in the database, before proceeding, explore the database to see what is inside. In particular, run the following statements and pay attention to the results:

SELECT count(*)
FROM   books;
+----------+
| count(*) |
+----------+
|       34 |
+----------+
1 row in set (0.00 sec)

SELECT id_title, title, price
FROM   books
WHERE  author_id='A00001';
+----------+----------------------------------------+-------+
| id_title | title                                  | price |
+----------+----------------------------------------+-------+
| 123-400  | The Girl with the Dragon Tattoo        |    37 |
| 123-401  | The Girl who Played with Fire          |  35.9 |
| 123-402  | The Girl who Kicked the Hornett's Nest |    39 |
+----------+----------------------------------------+-------+
3 rows in set (0.00 sec)

SELECT *
FROM   books
WHERE  title="Mockingjay";

Empty set (0.00 sec)

How to do it...

  1. Create a new transformation and create a connection to the books database.

  2. Drop a Text file input step to the canvas and use the step to read the books_news.txt file. As separator, type |. Read all fields as String except the price that has to be read as a Number with 0.00 as the Format.

  3. Do a preview to verify you have read the file properly. You should see the following:

  4. Use a Split Fields step to split the name field into two: firstname and lastname.

  5. Use a Database lookup step to look up in the authors table for an author that matches the firstname and lastname fields. As the value for Values to return from the lookup table: add id_author.

  6. Check the option Do not pass the row if the lookup fails and close the window.

  7. From the Output category of steps drag-and-drop a Synchronize after merge step to the canvas and create a hop from the last step toward this one. Your transformation looks like the following:

  8. Double-click on the step. For the Connection field, select the books connection. As Target table, type books.

  9. Fill the grids as shown:

    Tip

    Remember that you can avoid typing by clicking on the Get Fields and Get update fields buttons to the right-hand side.

  10. Select the Advanced tab.

  11. As Operation fieldname, select comment. As Insert when value equal, type NEW. As Update when value equal, type In Stock. Leave the other fields blank.

  12. Close the window and save the transformation.

  13. Then run the transformation.

  14. Explore the database again. In particular, run for the second time the same statements you ran before doing the recipe. Now you will get the following:

    SELECT count(*)
    FROM   books;
    +----------+
    | count(*) |
    +----------+
    |       38 |
    +----------+
    1 row in set (0.00 sec)
    
    SELECT id_title, title, price
    FROM   books
    WHERE  author_id='A00001';
    +----------+----------------------------------------+-------+
    | id_title | title                                  | price |
    +----------+----------------------------------------+-------+
    | 123-400  | The Girl with the Dragon Tattoo        | 34.98 |
    | 123-401  | The Girl who Played with Fire          | 35.99 |
    | 123-402  | The Girl who Kicked the Hornett's Nest | 37.99 |
    +----------+----------------------------------------+-------+
    3 rows in set (0.00 sec)
    
    SELECT *
    FROM   books
    WHERE  title="Mockingjay";
    +----------+------------+-----------+-------+-------+
    | id_title | title      | id_author | price | genre |
    +----------+------------+-----------+-------+-------+
    | 523-110  | Mockingjay | A00012    | 37.99 | Teens |
    +----------+------------+-----------+-------+-------+
    1 row in set (0.00 sec)
    

How it works...

The Synchronize after merge step allows you to insert, update, or delete rows in a table based on the value of a field in the stream. In the recipe, you used the Synchronize after merge step both for inserting the new books (for example, Mockingjay) and for updating the prices for the books you already had (for example, The Girl with the Dragon Tattoo).

In order to tell PDI whether to execute an insert or an update, you used the field comment. Under the Advanced tab, you told PDI that it should insert the records where the comment was equal to NEW, and update those where the comment was In Stock.

Note that, because you didn't intend to delete rows, you left the Delete when value equal option blank. However, you could also have configured this option in the same way you configured the others. An example of that, could be deleting the books that will stop being published. If there are books that match the out of market criteria, you could type out of market in the Delete when value equal option and those books would be deleted.

The inserts and updates were made based on the fields you entered in the grids under the General tab, which work exactly as the grids in an Insert/Update or an Update step.

There's more...

Let's see a little more about the step you used in this recipe.

Insert, update, and delete all-in-one

The Synchronize after merge step is like an all-in-one step. It allows you to insert, update, and delete rows from a table all in a single step, based on a field present in the dataset. For each row, Kettle uses the value of that column to decide which of the three basic operations to execute. This happens as follows.

Suppose that the Operation fieldname is called op and the values that should cause an insert, update, or delete are NEW, In Stock, and Discontinued respectively:

Operation

How it works

Insert

The insert is made for all rows where the field op is equal to NEW. The insert is made based on the key fields just like in an Insert/Update step.

Update

The update is made for all rows where the field op is equal to the value In Stock. The update is made based on the key fields just like in an Insert/Update or an Update step.

Delete

The delete is made for all rows where the field op is equal to the value Discontinued. The delete is made based on the key fields just like in a Delete step. For delete operations, the content of the lower grid is ignored.

Synchronizing after merge

You may wonder what the name Synchronize after merge has to do with this, if you neither merged nor synchronized anything. The fact is that the step was named after the Merge Rows (diff) step, as those steps can perfectly be used together. The Merge Rows (diff) step has the ability to find differences between two streams, and those differences are used later to update a table by using a Synchronize after merge step.

See also

  • Deleting data from a table

  • The Comparing two streams and generating differences recipe in Chapter 7, Understanding and Optimizing Data Flows

You have been reading a chapter from
Pentaho Data Integration Cookbook - Second Edition - Second Edition
Published in: Dec 2013
Publisher: Packt
ISBN-13: 9781783280674
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image