An XML-encoded document has a tree-like structure. To read the contents, we start at the root of the tree (normally, the name of the element that follows the XML declaration <?xml version="1.0" encoding="UTF-8"?>
; every XML-encoded document needs to begin with such declaration). In our case, the root of our XML-encoded document is <records>
. A single <record>...</record>
contains a list of <var var_name=...>...</var>
.
Let's analyze the preceding code step by step. First, we import all the modules that we need. The xml.etree.ElementTree
module is a lightweight XML parser of the XML tree and we will use it to parse the XML structure of our file. As before, we define names of the files to read and write in separate variables (r_filenameXML
, w_filenameXML
).
To read the data from the XML-encoded file, we use the
read_xml(...)
method:
The method takes the name of the file as its only parameter. First, the file is opened. Using the .parse(...)
method, we create a tree-like structure from our XML-encoded file and store it in the tree
object. We then extract the root using the .getroot()
method on the tree
object: this is the starting point to process the data further. The return statement calls the iter_records
method passing the reference to the root of the tree and then converts the returned information to a DataFrame:
The iter_records
method is a generator: a method that, as the name suggests, generates the values. Unlike regular methods that have to return all the values when the function finishes (a return
statement), generators hand over the data back to the calling method one at a time (hence the yield
keyword) until done.
Our iter_records
method, for each record read, emits a temp_dict
dictionary object back to the read_xml
method. Each element of the dictionary has a key equal to the var_name
attribute of the <var>
XML element. (Our <var>
has the following format: <var var_name=...>
.)
Tip
The <var>
tag could have more attributes with other names—these would be stored in the .attrib
dictionary (a property of the XML tree node) and would be accessible by their names—see the highlighted line in the previous source code.
The value of <var>
(contained within <var>...</var>
) is accessible through the .text
property of the XML node, while the .tag
property stores its name (in our case, var
).
The return
statement of the read_xml
method creates a list from all the dictionaries passed, which is then turned into a DataFrame.
To write the data in an XML format, we use the write_xml(...)
method:
The method opens the file specified by the xmlFileName
parameter. Every XML file needs to start with the XML declaration (see the introduction to this recipe) in the first line. Then, we write out the root of our XML schema, <records>
.
Next, it is time to write out the data. We use the .apply(...)
method of the DataFrame object to iterate through the records contained within. Its first parameter specifies the method to be applied to each record. By default, the axis
parameter is set to 0
. This means that the method specified in the first parameter would be applied to each column
of the DataFrame. By setting the parameter to 1
, we instruct the .apply(...)
method that we want to apply the xml_encode(...)
method specified in the first parameter to each row
. We use the
xml_encode(...)
method to process each record from the data
DataFrame:
The code creates a list of strings, xmlItem
. The first element of the list is the <record>
indicator and the last one will be </record>
. Then, for each field in the row, we append values of each column for that record encapsulated within <var var_name=
<column_name>>
<value></var>
. The variables in bold indicate specific column names from the record (<column_name>
) and corresponding value (<value>
). Once all the fields of the record have been parsed, we create a long string by concatenating all the items of the xmlItem
list using the '\n'.join(...)
method. Each <var>...</var>
tag is then separated by \n
. The string is returned to the caller (write_xml
). Each record is further concatenated in the write_xml(...)
method and then output to the file. We finish with the closing tag, </records>
.