Dealing with variable length records
In this section, we will explore a way of dealing with variable length records. Our approach essentially converts each row to a fixed length record whose length equals that of the longest record. In our example, each row represents a portfolio and there is no unique identifier, so this method is useful for manipulating the data into the familiar fixed length records case. We will generate as many fields as there are stocks in the largest portfolio. This leads to empty fields in every portfolio that holds fewer stocks than the maximum. Another way to deal with variable length records is to use the explode() function to create a new row for each stock in a given portfolio (for an example of using the explode() function, refer to Chapter 9, Developing Applications with Spark SQL).
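The two strategies described above can be sketched in a few lines of plain Python. This is a minimal illustration of the idea only, not the Spark code used in this chapter: the sample portfolios, the helper names pad_to_fixed_length and explode_rows, and the use of the row position as a stand-in portfolio identifier are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical sample data: each row is one portfolio holding a
# variable number of stock symbols (not taken from the book's dataset).
portfolios: List[List[str]] = [
    ["AAPL", "MSFT", "GOOG"],
    ["IBM"],
    ["ORCL", "INTC"],
]


def pad_to_fixed_length(
    rows: List[List[str]], fill: Optional[str] = None
) -> List[List[Optional[str]]]:
    """Pad every row with `fill` so all rows match the longest row."""
    max_len = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (max_len - len(row)) for row in rows]


def explode_rows(rows: List[List[str]]) -> List[Tuple[int, str]]:
    """Emit one (row_index, stock) pair per stock, similar in spirit to explode().

    The row index is an assumed stand-in identifier, since the
    portfolios themselves carry no unique key.
    """
    return [(index, stock) for index, row in enumerate(rows) for stock in row]


# Fixed length records: shorter portfolios receive empty (None) fields.
fixed = pad_to_fixed_length(portfolios)

# One row per stock: no empty fields, but more rows.
exploded = explode_rows(portfolios)

for row in fixed:
    print(row)
for pair in exploded:
    print(pair)
```

Padding keeps one row per portfolio at the cost of empty fields, whereas the exploded form avoids empty fields but needs some identifier to tie each stock back to its portfolio.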
To avoid repeating all the steps from previous examples to read in all the files, we have combined the data into a single input file...