Python Automation Cookbook

Explore the world of automation using Python recipes that will enhance your skills

Product type Paperback
Published in Sep 2018
Publisher Packt
ISBN-13 9781789133806
Length 398 pages
Edition 1st Edition
Author: Jaime Buelta
Table of Contents

Preface
1. Let Us Begin Our Automation Journey
2. Automating Tasks Made Easy
3. Building Your First Web Scraping Application
4. Searching and Reading Local Files
5. Generating Fantastic Reports
6. Fun with Spreadsheets
7. Developing Stunning Graphs
8. Dealing with Communication Channels
9. Why Not Automate Your Marketing Campaign?
10. Debugging Techniques
11. Other Books You May Enjoy

Extracting data from structured strings

In a lot of automated tasks, we'll need to treat input text that's in a particular format and extract the relevant information. For example, a spreadsheet may define a percentage in text (such as 37.4%) that we want to retrieve in numerical format to apply it later (0.374, as a float).
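
The percentage case mentioned above can be sketched in a few lines. This is a minimal illustration, not part of the recipe: the helper name `parse_percentage` is hypothetical, and it assumes the input is always a number followed by a percent sign.

```python
from decimal import Decimal


def parse_percentage(text):
    """Turn a string such as '37.4%' into a fraction (Decimal('0.374'))."""
    # Drop surrounding whitespace and the trailing percent sign
    number = text.strip().rstrip('%')
    # Parse straight from the string to avoid float rounding
    return Decimal(number) / 100


fraction = parse_percentage('37.4%')
print(fraction)         # 0.374
print(float(fraction))  # 0.374
```

The value is kept as a Decimal until the last moment; converting to a float with `float(fraction)` is only needed if the consuming code requires one.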

In this recipe, we'll see how to process sales logs that contain inline information about a sale, such as the product sold, its price, and other details.

Getting ready

Imagine that we need to parse information stored in sales logs. We'll use a sales log with the following structure:

[<Timestamp in iso format>] - SALE - PRODUCT: <product id> - PRICE: $<price of the sale>

For example, a specific log may look like this:

[2018-05-05T10:58:41.504054] - SALE - PRODUCT: 1345 - PRICE: $09.99

Note that the price has a leading zero. All prices will have two digits for the dollars, and two for the cents.
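
To make the format concrete, the following sketch builds a log line with that structure. The `LOG_TEMPLATE` constant and the `format_sale_log` function are hypothetical names introduced for illustration; the recipe itself only consumes such lines.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical template mirroring the structure described above.
# The 05.2f spec pads the price to two dollar digits and two cent digits.
LOG_TEMPLATE = ('[{timestamp}] - SALE - PRODUCT: {product_id} '
                '- PRICE: ${price:05.2f}')


def format_sale_log(timestamp, product_id, price):
    """Render one sale in the log format used in this recipe."""
    return LOG_TEMPLATE.format(timestamp=timestamp.isoformat(),
                               product_id=product_id,
                               price=price)


line = format_sale_log(datetime(2018, 5, 5, 10, 58, 41, 504054),
                       1345, Decimal('9.99'))
print(line)
# [2018-05-05T10:58:41.504054] - SALE - PRODUCT: 1345 - PRICE: $09.99
```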

We need to activate our virtual environment before we start:

$ source .venv/bin/activate

How to do it...

  1. In the Python interpreter, make the following imports. Remember to activate your virtualenv, as described in the Creating a virtual environment recipe:
>>> import delorean
>>> from decimal import Decimal
  2. Enter the log to parse:
>>> log = '[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99'
  3. Split the log into its parts, which are divided by ' - ' (note the space before and after the dash). We ignore the SALE part as it doesn't add any relevant information:
>>> divide_it = log.split(' - ')
>>> timestamp_string, _, product_string, price_string = divide_it
  4. Parse the timestamp into a Delorean object, which wraps a datetime:
>>> timestamp = delorean.parse(timestamp_string.strip('[]'))
  5. Parse the product_id into an integer:
>>> product_id = int(product_string.split(':')[-1])
  6. Parse the price into a Decimal type:
>>> price = Decimal(price_string.split('$')[-1])
  7. Now, you have all the values in native Python formats:
>>> timestamp, product_id, price
(Delorean(datetime=datetime.datetime(2018, 5, 5, 11, 7, 12, 267897), timezone='UTC'), 1345, Decimal('9.99'))
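
The steps above can also be gathered into a single function. The sketch below is an assumption-laden variant rather than the recipe's own code: it swaps delorean for the standard library's `datetime.strptime`, so it returns a naive datetime (no timezone) instead of a Delorean object, and the name `parse_sale_log` is hypothetical.

```python
from datetime import datetime
from decimal import Decimal


def parse_sale_log(log):
    """Return (timestamp, product_id, price) from a single sale log line."""
    # The four parts are separated by ' - '; the SALE marker is ignored
    timestamp_string, _, product_string, price_string = log.split(' - ')
    # Assumption: the timestamp always includes microseconds, as in the examples
    timestamp = datetime.strptime(timestamp_string.strip('[]'),
                                  '%Y-%m-%dT%H:%M:%S.%f')
    product_id = int(product_string.split(':')[-1])
    price = Decimal(price_string.split('$')[-1])
    return timestamp, product_id, price


log = '[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99'
print(parse_sale_log(log))
# (datetime.datetime(2018, 5, 5, 11, 7, 12, 267897), 1345, Decimal('9.99'))
```

If timezone handling matters, delorean (as used in the recipe) attaches UTC by default, which this standard-library version does not.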

How it works...

The basic approach is to isolate each of the elements and then parse them into the proper type. The first step is to split the full log into smaller parts. The ' - ' string is a good divider, as it splits the log into four parts: the timestamp, the word SALE, the product, and the price.
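
As a small illustration of the splitting step, the sketch below shows the four parts and adds a length check so that a malformed line fails with a clear message. The `split_sale_log` helper and its error message are hypothetical additions, not part of the recipe.

```python
def split_sale_log(log):
    """Split a log line into its four ' - ' separated parts."""
    parts = log.split(' - ')
    # Guard against lines that don't follow the expected structure
    if len(parts) != 4:
        raise ValueError('Expected 4 parts, got {}: {!r}'.format(len(parts), log))
    return parts


parts = split_sale_log(
    '[2018-05-05T11:07:12.267897] - SALE - PRODUCT: 1345 - PRICE: $09.99')
print(parts)
# ['[2018-05-05T11:07:12.267897]', 'SALE', 'PRODUCT: 1345', 'PRICE: $09.99']
```

The dashes inside the timestamp are not affected, because the separator includes the surrounding spaces.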

In the case of the timestamp, we need to isolate the ISO format, which is in brackets in the log. That's why we strip the brackets from it. We use the delorean module (introduced earlier) to parse it into a Delorean object, which wraps a datetime.

The word SALE is ignored. There's no relevant information there.

To isolate the product ID, we split the product part at the colon. Then, we parse the last element as an integer:

>>> product_string.split(':')
['PRODUCT', ' 1345']
>>> int(' 1345')
1345

To isolate the price, we use the dollar sign as a separator, and parse the last element as a Decimal object:

>>> price_string.split('$')
['PRICE: ', '09.99']
>>> Decimal('09.99')
Decimal('9.99')

As described in the next section, do not parse this value into a float type.

There's more...

These log elements can be combined into a single object, which helps with parsing and aggregating them. For example, we could define a class in Python code in the following way:

class PriceLog(object):

    def __init__(self, timestamp, product_id, price):
        self.timestamp = timestamp
        self.product_id = product_id
        self.price = price

    def __repr__(self):
        return '<PriceLog ({}, {}, {})>'.format(self.timestamp,
                                                self.product_id,
                                                self.price)

    @classmethod
    def parse(cls, text_log):
        '''
        Parse from a text log with the format
        [<Timestamp>] - SALE - PRODUCT: <product id> - PRICE: $<price>
        to a PriceLog object
        '''
        divide_it = text_log.split(' - ')
        tmp_string, _, product_string, price_string = divide_it
        timestamp = delorean.parse(tmp_string.strip('[]'))
        product_id = int(product_string.split(':')[-1])
        price = Decimal(price_string.split('$')[-1])
        return cls(timestamp=timestamp, product_id=product_id, price=price)

So, the parsing can be done as follows:

>>> log = '[2018-05-05T12:58:59.998903] - SALE - PRODUCT: 897 - PRICE: $17.99'
>>> PriceLog.parse(log)
<PriceLog (Delorean(datetime=datetime.datetime(2018, 5, 5, 12, 58, 59, 998903), timezone='UTC'), 897, 17.99)>
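
Once the values are parsed, aggregating them is straightforward. The following sketch sums prices per product; the `total_by_product` function and the sample data are hypothetical, and it works on plain (product_id, price) pairs so that it runs without delorean.

```python
from collections import defaultdict
from decimal import Decimal


def total_by_product(sales):
    """Sum prices per product ID from (product_id, price) pairs."""
    totals = defaultdict(Decimal)  # Decimal() is Decimal('0')
    for product_id, price in sales:
        totals[product_id] += price
    return dict(totals)


# Hypothetical sample data; with parsed log objects, the pairs could come from
# ((entry.product_id, entry.price) for entry in entries)
sales = [(1345, Decimal('9.99')),
         (897, Decimal('17.99')),
         (1345, Decimal('9.99'))]
print(total_by_product(sales))
# {1345: Decimal('19.98'), 897: Decimal('17.99')}
```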

Avoid using float types for prices. Floating-point numbers have precision problems that may produce strange errors when aggregating multiple prices, for example:

>>> 0.1 + 0.1 + 0.1 
0.30000000000000004

Try these two options to avoid problems:

  • Use integer cents as the base unit: This means multiplying currency inputs by 100 and transforming them into integers (or whatever fractional unit is correct for the currency used). You may still want to change the base when displaying them.
  • Parse into the Decimal type: The Decimal type keeps the fixed precision and works as you'd expect. You can find further information about the Decimal type in the Python docs at https://docs.python.org/3.6/library/decimal.html.

If you use the Decimal type, parse the results directly into Decimal from the string. If you transform the value into a float first, the precision errors carry over into the new type.
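
The two options, and the pitfall of going through a float, can be checked with a short sketch. The variable names are illustrative only.

```python
from decimal import Decimal

# Floats accumulate binary rounding errors
float_sum_ok = (0.1 + 0.1 + 0.1 == 0.3)
print(float_sum_ok)        # False

# Decimal values parsed from strings add up exactly
decimal_sum_ok = (Decimal('0.1') * 3 == Decimal('0.3'))
print(decimal_sum_ok)      # True

# Going through a float first carries the error into the Decimal
via_float_ok = (Decimal(0.1) == Decimal('0.1'))
print(via_float_ok)        # False

# Integer cents: parse as Decimal, scale by 100, then convert to int
cents = int(Decimal('09.99') * 100)
print(cents)               # 999
```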

See also

  • The Creating a virtual environment recipe
  • The Using a third-party tool—parse recipe
  • The Introducing regular expressions recipe
  • The Going deeper into regular expressions recipe