Learning Pentaho Data Integration 8 CE

An end-to-end guide to exploring, transforming, and integrating your data across multiple sources

Product type Paperback
Published in Dec 2017
Publisher Packt
ISBN-13 9781788292436
Length 500 pages
Edition 3rd Edition
Author (1):
María Carina Roldán

Parsing unstructured files with JavaScript

Ideally, input files are well-formed: the number of columns and their data types are fixed, all rows follow the same pattern, and so on. However, it is very common to find input files where the information has little or no structure, or where the structure doesn't follow the matrix (n rows by m columns) you expect. This is one of the situations where JavaScript can help.

Suppose that you have a file describing houses, which looks like the following:

... 
Property Code: MCX-011
Status: Active
5 bedrooms
5 baths
Style: Contemporary
Basement
Laundry room
Fireplace
2 car garage
Central air conditioning
More Features: Attic, Clothes dryer, Clothes washer, Dishwasher

Property Code: MCX-012
4 bedrooms
3 baths
Fireplace
Attached parking
More Features: Alarm System, Eat-in Kitchen, Powder Room

Property Code: MCX-013
3 bedrooms
...

You want to compare the properties with one another, which would be easier if the file had a precise structure. The JavaScript step can help you with this.

The first attempt to give structure to the data will be to add, to every row, the code of the house to which that row belongs. The goal is to have the following:

[Figure: Previewing some data]
  1. Create a new Transformation.
  2. Get the sample file from the book site and read it with a Text file input step. Uncheck the Header checkbox and create a single field named text.
  3. Run a preview. You should see the content of the file under a single column named text.
  4. After the input step, add a JavaScript step and double-click on it to edit it.
  5. In the editing area, type the following JavaScript code to create a field with the code of the property:
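     // Declared without a value, so it keeps the value set on a previous row
     var prop_code;
     // Position of the label in the current row; negative if it is absent
     var posCod = indexOf(text, 'Property Code:');
     if (posCod >= 0) {
       // Header row: skip the 14-character label plus the space after it
       prop_code = trim(substr(text, posCod + 15));
     }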
     The indexOf function returns the position where the Property Code: label starts in the text. The substr function cuts off the Property Code: label, keeping only the code itself.
  6. Click on Get variables to add the prop_code variable to the grid under the code. For every row, the variable will contain the code of the house to which the row belongs.
  7. Click on OK and, with the JavaScript step selected, run a preview. You should see the data transformed as expected.

The code you wrote may seem a little strange at first, but it is not really that complex. The general idea is to simulate a loop over the dataset rows.
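
The loop that the step simulates can be written out explicitly. The following is a minimal standalone sketch in plain JavaScript, not the PDI step itself: the function name tagRowsWithPropertyCode and the sample array are hypothetical, and the standard string methods indexOf, substring, and trim stand in for the step's own indexOf, substr, and trim functions.

```javascript
// Standalone sketch (plain JavaScript, not the PDI step itself).
// Each array element stands for one row of the single "text" field.
const LABEL = 'Property Code:';

// Hypothetical helper: returns one output row per input line.
function tagRowsWithPropertyCode(lines) {
  // Declared outside the per-row logic so its value survives from row to row.
  let prop_code;
  return lines.map((text) => {
    const posCod = text.indexOf(LABEL);
    if (posCod >= 0) {
      // Header row: keep everything after the label as the new code.
      prop_code = text.substring(posCod + LABEL.length).trim();
    }
    // Non-header rows fall through and reuse the last code seen.
    return { text, prop_code };
  });
}

// A few lines taken from the sample file shown earlier.
const sample = [
  'Property Code: MCX-011',
  'Status: Active',
  '5 bedrooms',
  'Property Code: MCX-012',
  '4 bedrooms',
];

const tagged = tagRowsWithPropertyCode(sample);
for (const row of tagged) {
  console.log(`${row.prop_code}\t${row.text}`);
}
```

In this sketch, the first three rows are tagged with MCX-011 and the last two with MCX-012, which is the shape of data the preview is meant to show.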

The code in the step creates a variable named prop_code, which will be used to create a new field to identify the houses. When the JavaScript code detects a property header row, for example:

Property Code: MCX-002

it sets the prop_code variable to the code it finds in that line, in this case, MCX-002.

Here comes the trick: until a new header row appears, the prop_code variable keeps that value. Thus, all the rows following a row like the one shown previously will have the same value for the prop_code variable.

This is an example of keeping values from previous rows in the dataset so that they can be used in the current row.

Note that here you use JavaScript to see and use values from previous rows, but you can't modify them! JavaScript always works on the current row.
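
To see why the trick depends on the variable keeping its value between rows, the following plain JavaScript sketch contrasts two variants. It is only a model under stated assumptions: runPerRow, extractCode, and the rows array are hypothetical names, and calling a function once per row is a simplification of how the step processes the dataset.

```javascript
// Model of "the script runs once per row": perRow is called for every row.
function runPerRow(lines, perRow) {
  return lines.map((text) => perRow(text));
}

// Hypothetical helper: returns the code on a header row, undefined otherwise.
function extractCode(text) {
  const label = 'Property Code:';
  const pos = text.indexOf(label);
  return pos >= 0 ? text.substring(pos + label.length).trim() : undefined;
}

const rows = ['Property Code: MCX-002', '3 bedrooms', '2 baths'];

// Variant 1: the variable lives outside the per-row function, so it persists.
let keptCode;
const persistent = runPerRow(rows, (text) => {
  const found = extractCode(text);
  if (found !== undefined) keptCode = found;
  return keptCode;
});

// Variant 2: the variable is re-initialized on every row, so the value is lost.
const resetEachRow = runPerRow(rows, (text) => {
  let code = null;
  const found = extractCode(text);
  if (found !== undefined) code = found;
  return code;
});

console.log(persistent);
console.log(resetEachRow);
```

In the first variant every row carries MCX-002; in the second, only the header row does and the remaining rows get null. In both variants each call returns a value for the current row only, and rows that were already returned are never touched again.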