Reading an XML file using the HXT package
Extensible Markup Language (XML) is an encoding of plain text to provide machine-readable annotations on a document. The standard is specified by W3C (http://www.w3.org/TR/2008/REC-xml-20081126/).
In this recipe, we will parse an XML document representing an e-mail conversation and extract all the dates.
Getting ready
We will first set up an XML file called input.xml with the following values, representing an e-mail thread between Databender and Princess on December 18 and 19, 2014:
$ cat input.xml
<thread>
  <email>
    <to>Databender</to>
    <from>Princess</from>
    <date>Thu Dec 18 15:03:23 EST 2014</date>
    <subject>Joke</subject>
    <body>Why did you divide sin by tan?</body>
  </email>
  <email>
    <to>Princess</to>
    <from>Databender</from>
    <date>Fri Dec 19 3:12:00 EST 2014</date>
    <subject>RE: Joke</subject>
    <body>Just cos.</body>
  </email>
</thread>
Using Cabal, install the HXT library which we use for manipulating XML documents:
$ cabal install hxt
How to do it...
- We only need one import, which will be for parsing XML, using the following line of code:
import Text.XML.HXT.Core
- Define and implement main and specify the XML location. For this recipe, the file is retrieved from input.xml. Refer to the following code:
main :: IO ()
main = do
  input <- readFile "input.xml"
- Apply the readString function to the input and extract all the date elements. We filter items with a specific name using the hasName :: String -> a XmlTree XmlTree function. Also, we extract the text using the getText :: a XmlTree String function, as shown in the following code snippet:
  dates <- runX $ readString [withValidate no] input //> hasName "date" //> getText
- We can now use the list of extracted dates as follows:
  print dates
- By running the code, we print the following output:
$ runhaskell Main.hs
["Thu Dec 18 15:03:23 EST 2014","Fri Dec 19 3:12:00 EST 2014"]
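The steps above can be assembled into one file. The following is a minimal sketch rather than the definitive layout: extractDates is a hypothetical helper name introduced here (it is not part of HXT) so that the extraction can be reused on any XML string, and main assumes input.xml sits in the current directory.

```haskell
import Text.XML.HXT.Core

-- Hypothetical helper (not an HXT function): parse an XML string and
-- return the text inside every <date> element.
extractDates :: String -> IO [String]
extractDates xml =
  runX $ readString [withValidate no] xml //> hasName "date" //> getText

main :: IO ()
main = do
  -- Assumes input.xml from the "Getting ready" section is in the
  -- current directory.
  input <- readFile "input.xml"
  dates <- extractDates input
  print dates
```

Keeping the extraction in its own function makes it easy to try out in GHCi on a short string before pointing it at a file.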
How it works...
The library function, runX, takes in an Arrow. Think of an Arrow as a generalization of a function that, like a Monad, lets us chain processing steps together. HXT's arrows allow for stateful, global XML processing. Specifically, the runX function in this recipe takes in an IOSArrow XmlTree String and returns an IO action that produces a list of String values (IO [String]), one for each result of the arrow. We build this IOSArrow using the readString function, which parses the XML data, and then chain the filtering operations onto it.
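To make the types concrete, here is a small sketch under stated assumptions: sampleXml is a hypothetical one-message document shortened from input.xml, and dateTexts is an illustrative name. It shows that the arrow is an ordinary value that only describes the work, while runX is what executes it and gathers the results into a list.

```haskell
import Text.XML.HXT.Core

-- Hypothetical one-message sample, shortened from the recipe's input.xml.
sampleXml :: String
sampleXml =
  "<thread><email><date>Thu Dec 18 15:03:23 EST 2014</date></email></thread>"

-- The arrow is just a value; nothing is parsed until runX executes it.
dateTexts :: IOSArrow XmlTree String
dateTexts =
  readString [withValidate no] sampleXml //> hasName "date" //> getText

main :: IO ()
main = do
  dates <- runX dateTexts   -- dates :: [String], one entry per match
  print (length dates)
  mapM_ putStrLn dates
```

Because the result is a list, an arrow that matches nothing simply yields an empty list instead of failing.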
To search the whole subtree below a node, //> should be used, whereas /> only looks at the direct children of the current node. We use the //> function to look up the date elements and collect all the associated text.
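The difference between the two operators is easiest to see on a tiny document. The sketch below uses a made-up document, nestedXml, that is not part of the recipe: one c element sits directly under a, and another is nested inside b. The names directOnly and anyDepth are illustrative as well.

```haskell
import Text.XML.HXT.Core

-- Hypothetical document: one <c> nested inside <b>, one directly under <a>.
nestedXml :: String
nestedXml = "<a><b><c>deep</c></b><c>shallow</c></a>"

-- /> steps down exactly one level, so only the <c> directly under <a> matches.
directOnly :: IOSArrow XmlTree String
directOnly =
  readString [withValidate no] nestedXml
    /> hasName "a" /> hasName "c" /> getText

-- //> searches the subtree below <a>, so the nested <c> is found too.
anyDepth :: IOSArrow XmlTree String
anyDepth =
  readString [withValidate no] nestedXml
    /> hasName "a" //> hasName "c" /> getText

main :: IO ()
main = do
  runX directOnly >>= print
  runX anyDepth >>= print
```

In the recipe, //> is the convenient choice because the date elements sit two levels below the document root.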
As defined in the documentation, the hasName function tests whether a node has a specific name, and the getText function selects the text of a text node. Some other functions include the following:
- isText: This is used to test for text nodes
- isAttr: This is used to test for an attribute tree
- hasAttr: This is used to test whether an element node has an attribute node with a specific name
- getElemName: This is used to select the name of an element node
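As a rough illustration of some of these functions, the following sketch runs hasAttr, getElemName, and isText over a made-up document. The urgent attribute and the note element are invented for this example and do not appear in input.xml; deep and qualifiedName are additional HXT functions used here to walk the tree and to turn the QName returned by getElemName into a String.

```haskell
import Text.XML.HXT.Core
import Text.XML.HXT.DOM.QualifiedName (qualifiedName)

-- Hypothetical sample: the "urgent" attribute and <note> element are
-- invented for this sketch.
sampleXml :: String
sampleXml =
  "<thread><email urgent=\"yes\"><subject>Joke</subject></email>"
    ++ "<note>hi</note></thread>"

-- Names of the elements that carry an "urgent" attribute.
urgentElems :: IOSArrow XmlTree String
urgentElems =
  readString [withValidate no] sampleXml
    >>> deep (hasAttr "urgent")
    >>> getElemName
    >>> arr qualifiedName

-- The contents of every text node in the document.
allText :: IOSArrow XmlTree String
allText =
  readString [withValidate no] sampleXml >>> deep isText >>> getText

main :: IO ()
main = do
  runX urgentElems >>= print
  runX allText >>= print
```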
All the Arrow functions can be found in the Text.XML.HXT.Arrow.XmlArrow documentation at http://hackage.haskell.org/package/hxt/docs/Text-XML-HXT-Arrow-XmlArrow.html.