Data comes in many forms. There are two main file categories for storing/retrieving data: human-readable files and binary files. Examples of human-readable files are XML, RSS/Atom, YAML, and JSON. Examples of binary files are the .sq3 file format used by SQLite and the .mp3 audio file format used to listen to music.
In this example, we will focus on two popular human-readable formats—XML and JSON. Although human-readable files are generally slower to parse than binary files, they make data exchange, inspection, and modification much easier. For this reason, it is advised that you work with human-readable files, unless there are other restrictions that do not allow it (mainly unacceptable performance and proprietary binary formats).
In this case, we have some input data stored in an XML and a JSON file, and we want to parse them and retrieve some information. At the same time, we want to centralize the client's connection to those (and all future) external services. We will use the factory method to solve this problem. The example focuses only on XML and JSON, but adding support for more services should be straightforward.
First, let's take a look at the data files.
The JSON file, movies.json, is an example (found on GitHub) of a dataset containing information about American movies (title, year, director name, genre, and so on). This is actually a big file but here is an extract, simplified for better readability, to show how its content is organized:
[
{"title":"After Dark in Central Park",
"year":1900,
"director":null, "cast":null, "genre":null},
{"title":"Boarding School Girls' Pajama Parade",
"year":1900,
"director":null, "cast":null, "genre":null},
{"title":"Buffalo Bill's Wild West Parad",
"year":1900,
"director":null, "cast":null, "genre":null},
{"title":"Caught",
"year":1900,
"director":null, "cast":null, "genre":null},
{"title":"Clowns Spinning Hats",
"year":1900,
"director":null, "cast":null, "genre":null},
{"title":"Capture of Boer Battery by British",
"year":1900,
"director":"James H. White", "cast":null, "genre":"Short documentary"},
{"title":"The Enchanted Drawing",
"year":1900,
"director":"J. Stuart Blackton", "cast":null,"genre":null},
{"title":"Family Troubles",
"year":1900,
"director":null, "cast":null, "genre":null},
{"title":"Feeding Sea Lions",
"year":1900,
"director":null, "cast":"Paul Boyton", "genre":null}
]
The XML file, person.xml, is based on a Wikipedia example (j.mp/wikijson), and contains information about individuals (firstName, lastName, gender, and so on) as follows:
- We start with the enclosing tag of the persons XML container:
<persons>
- Then, an XML element representing a person's data code is presented as follows:
<person>
<firstName>John</firstName>
<lastName>Smith</lastName>
<age>25</age>
<address>
<streetAddress>21 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumbers>
<phoneNumber type="home">212 555-1234</phoneNumber>
<phoneNumber type="fax">646 555-4567</phoneNumber>
</phoneNumbers>
<gender>
<type>male</type>
</gender>
</person>
- An XML element representing another person's data is shown by the following code:
<person>
<firstName>Jimy</firstName>
<lastName>Liar</lastName>
<age>19</age>
<address>
<streetAddress>18 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumbers>
<phoneNumber type="home">212 555-1234</phoneNumber>
</phoneNumbers>
<gender>
<type>male</type>
</gender>
</person>
- An XML element representing a third person's data is shown by the code:
<person>
<firstName>Patty</firstName>
<lastName>Liar</lastName>
<age>20</age>
<address>
<streetAddress>18 2nd Street</streetAddress>
<city>New York</city>
<state>NY</state>
<postalCode>10021</postalCode>
</address>
<phoneNumbers>
<phoneNumber type="home">212 555-1234</phoneNumber>
<phoneNumber type="mobile">001 452-8819</phoneNumber>
</phoneNumbers>
<gender>
<type>female</type>
</gender>
</person>
- Finally, we close the XML container:
</persons>
We will use two libraries that are part of the Python distribution for working with JSON and XML, json and xml.etree.ElementTree, as follows:
import json
import xml.etree.ElementTree as etree
The JSONDataExtractor class parses the JSON file and has a parsed_data() method that returns all data as a dictionary (dict). The property decorator is used to make parsed_data() appear as a normal attribute instead of a method, as follows:
class JSONDataExtractor:
def __init__(self, filepath):
self.data = dict()
with open(filepath, mode='r', encoding='utf-8') as
f:self.data = json.load(f)
@property
def parsed_data(self):
return self.data
The XMLDataExtractor class parses the XML file and has a parsed_data() method that returns all data as a list of xml.etree.Element as follows:
class XMLDataExtractor:
def __init__(self, filepath):
self.tree = etree.parse(filepath)
@property
def parsed_data(self):
return self.tree
The dataextraction_factory() function is a factory method. It returns an instance of JSONDataExtractor or XMLDataExtractor depending on the extension of the input file path as follows:
def dataextraction_factory(filepath):
if filepath.endswith('json'):
extractor = JSONDataExtractor
elif filepath.endswith('xml'):
extractor = XMLDataExtractor
else:
raise ValueError('Cannot extract data from {}'.format(filepath))
return extractor(filepath)
The extract_data_from() function is a wrapper of dataextraction_factory(). It adds exception handling as follows:
def extract_data_from(filepath):
factory_obj = None
try:
factory_obj = dataextraction_factory(filepath)
except ValueError as e:
print(e)
return factory_obj
The main() function demonstrates how the factory method design pattern can be used. The first part makes sure that exception handling is effective, as follows:
def main():
sqlite_factory = extract_data_from('data/person.sq3')
print()
The next part shows how to work with the JSON files using the factory method. Based on the parsing, the title, year, director name, and genre of the movie can be shown (when the value is not empty), as follows:
json_factory = extract_data_from('data/movies.json')
json_data = json_factory.parsed_data
print(f'Found: {len(json_data)} movies')
for movie in json_data:
print(f"Title: {movie['title']}")
year = movie['year']
if year:
print(f"Year: {year}")
director = movie['director']
if director:
print(f"Director: {director}")
genre = movie['genre']
if genre:
print(f"Genre: {genre}")
print()
The final part shows you how to work with the XML files using the factory method. XPath is used to find all person elements that have the last name Liar (using liars = xml_data.findall(f".//person[lastName='Liar']")). For each matched person, the basic name and phone number information are shown, as follows:
xml_factory = extract_data_from('data/person.xml')
xml_data = xml_factory.parsed_data
liars = xml_data.findall(f".//person[lastName='Liar']")
print(f'found: {len(liars)} persons')
for liar in liars:
firstname = liar.find('firstName').text
print(f'first name: {firstname}')
lastname = liar.find('lastName').text
print(f'last name: {lastname}')
[print(f"phone number ({p.attrib['type']}):", p.text)
for p in liar.find('phoneNumbers')]
print()
Here is the summary of the implementation (you can find the code in the factory_method.py file):
- We start by importing the modules we need (json and ElementTree).
- We define the JSON data extractor class (JSONDataExtractor).
- We define the XML data extractor class (XMLDataExtractor).
- We add the factory function, dataextraction_factory(), for getting the right data extractor class.
- We also add our wrapper for handling exceptions, the extract_data_from() function.
- Finally, we have the main() function, followed by Python's conventional trick for calling it when invoking this file from the command line. The following are the aspects of the main function:
- We try to extract data from an SQL file (data/person.sq3), to show how the exception is handled
- We extract data from a JSON file and parse the result
- We extract data from an XML file and parse the result
The following is the type of output (for the different cases) you will get by calling the python factory_method.py command.
First, there is an exception message when trying to access an SQLite (.sq3) file:
Then, we get the following result from processing the movies file (JSON):
Finally, we get this result from processing the person XML file to find the people whose last name is Liar:
Notice that although JSONDataExtractor and XMLDataExtractor have the same interfaces, what is returned by parsed_data() is not handled in a uniform way. Different Python code must be used to work with each data extractor. Although it would be nice to be able to use the same code for all extractors, this is not realistic for the most part, unless we use some kind of common mapping for the data, which is very often provided by external data providers. Assuming that you can use exactly the same code for handling the XML and JSON files, what changes are required to support a third format, for example, SQLite? Find an SQLite file, or create your own and try it.