Accumulating text data from a file path
One of the easiest ways to get started with processing input is by reading raw text from a local file. In this recipe, we will be extracting all the text from a specific file path. Furthermore, to do something interesting with the data, we will count the number of words per line.
Tip
Haskell is a purely functional programming language, right? Sure, but obtaining input from outside the code introduces impurity. For elegance and reusability, we must carefully separate pure from impure code.
Getting ready
We will first create an input.txt text file with a couple of lines of text to be read by the program. We keep this file in an easy-to-access directory because it will be referenced later. For example, the text file we're dealing with contains a six-line quote by Plato. Here's what our terminal prints when we issue the following command:
    $ cat input.txt
    And how will you inquire, Socrates,
    into that which you know not?
    What will you put forth as the subject of inquiry?
    And if you find what you want,
    how will you ever know that
    this is what you did not know?
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you. The code will also be hosted on GitHub at https://github.com/BinRoot/Haskell-Data-Analysis-Cookbook.
How to do it...
Create a new file to start coding. We call our file Main.hs.
- As with all executable Haskell programs, start by defining and implementing the main function, as follows:

      main :: IO ()
      main = do
- Use Haskell's readFile :: FilePath -> IO String function to extract data from the input.txt file path. Note that FilePath is just a type synonym for String. With the string in memory, pass it into a countWords function to count the number of words in each line, as shown in the following lines, which form the body of main:

        input <- readFile "input.txt"
        print $ countWords input
- Lastly, define our pure function, countWords, as follows:

      countWords :: String -> [Int]
      countWords input = map (length . words) (lines input)
- The program will print out the number of words per line, represented as a list of numbers, as follows:

      $ runhaskell Main.hs
      [6,6,10,7,6,7]
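Because countWords is pure, the result above can be reproduced without reading a file at all. The following is a minimal sketch, assuming the quote is split into lines the way the printed word counts imply; the name sample is hypothetical and stands in for the contents of input.txt.

```haskell
-- Pure function from the recipe: one word count per line.
countWords :: String -> [Int]
countWords input = map (length . words) (lines input)

-- Hypothetical in-memory copy of input.txt (line breaks are an assumption
-- inferred from the word counts printed by the recipe).
sample :: String
sample = unlines
  [ "And how will you inquire, Socrates,"
  , "into that which you know not?"
  , "What will you put forth as the subject of inquiry?"
  , "And if you find what you want,"
  , "how will you ever know that"
  , "this is what you did not know?"
  ]

main :: IO ()
main = print (countWords sample)  -- prints [6,6,10,7,6,7]
```

Checking the pure part in isolation like this is one practical payoff of keeping it apart from the I/O code.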
How it works...
Haskell provides useful input and output (I/O) capabilities for reading input and writing output in different ways. In our case, we use readFile to specify the path of a file to be read. Using the do keyword in main indicates that we are sequencing several IO actions together. The result of readFile has the type IO String, which means it is an I/O action that returns a String when it is run.
Now we're about to get a bit technical. Pay close attention. Alternatively, smile and nod. In Haskell, the IO type is an instance of something called a Monad. This allows us to use the <- notation to draw the string out of this I/O action. We then make use of the string by feeding it into our countWords function, which counts the number of words in each line. Notice how we kept the pure countWords function separate from the impure main function.
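To make the <- notation a little less mysterious, here is a small sketch showing that a do block is shorthand for the monadic bind operator (>>=). The names fakeRead, withDo, and withBind are hypothetical; fakeRead is a stand-in for readFile "input.txt" so that the example runs without a file on disk.

```haskell
-- Pure function from the recipe.
countWords :: String -> [Int]
countWords input = map (length . words) (lines input)

-- Hypothetical stand-in for readFile "input.txt": an IO action that
-- yields a fixed two-line string.
fakeRead :: IO String
fakeRead = return "one two\nthree"

-- Written with do-notation, in the style of the recipe's main.
withDo :: IO [Int]
withDo = do
  input <- fakeRead
  return (countWords input)

-- The same action written with (>>=) explicitly.
withBind :: IO [Int]
withBind = fakeRead >>= \input -> return (countWords input)

main :: IO ()
main = do
  a <- withDo
  b <- withBind
  print (a, b)  -- prints ([2,1],[2,1])
```

Both versions behave identically; do-notation is simply the more readable spelling once several actions are chained.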
Finally, we print the output of countWords. The $ operator is function application with very low precedence, which lets us avoid excessive parentheses in our code. Without it, the last line of main would look like print (countWords input).
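As a quick illustration, the three spellings below are interchangeable. This is a small sketch using a hypothetical two-line input string rather than the contents of input.txt.

```haskell
-- Pure function from the recipe.
countWords :: String -> [Int]
countWords input = map (length . words) (lines input)

main :: IO ()
main = do
  let input = "a b c\nd e"     -- hypothetical sample input
  print $ countWords input     -- ($) applies print to the whole expression on its right
  print (countWords input)     -- the same call with explicit parentheses
  (print . countWords) input   -- the same call via function composition
```

Each of the three lines prints [3,2].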
See also
To keep things simple, this code is easy to read but very fragile. If the input.txt file does not exist, running the code will immediately crash the program. For example, the following command will generate the error message:
    $ runhaskell Main.hs
    Main.hs: input.txt: openFile: does not exist…
To make this code fault tolerant, refer to the Catching I/O code faults recipe.
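As a rough preview of what fault tolerance can look like, the sketch below wraps readFile in try from Control.Exception. This is only one possible approach and is not necessarily the one taken in the Catching I/O code faults recipe; the helper name safeCountWords is hypothetical.

```haskell
import Control.Exception (IOException, try)

-- Pure function from the recipe.
countWords :: String -> [Int]
countWords input = map (length . words) (lines input)

-- Hypothetical helper: Left carries the I/O error raised while opening
-- the file, Right carries the word counts.
safeCountWords :: FilePath -> IO (Either IOException [Int])
safeCountWords path = do
  result <- try (readFile path)
  return (fmap countWords result)

main :: IO ()
main = do
  outcome <- safeCountWords "input.txt"
  case outcome of
    Left err     -> putStrLn ("Could not read file: " ++ show err)
    Right counts -> print counts
```

With this version, a missing input.txt produces a message instead of an uncaught exception, while countWords itself is unchanged.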