You're reading from Modern Python Cookbook The latest in modern Python recipes for the busy modern programmer

Product type Paperback

Published in Nov 2016

Publisher Packt

ISBN-13 9781786469250

Length 692 pages

Edition 1st Edition

Languages

Python

Concepts

Programming Language

Encoding strings – creating ASCII and UTF-8 bytes

Our computer files are bytes. When we upload or download from the Internet, the communication works in bytes. A byte only has 256 distinct values. Our Python characters are Unicode. There are a lot more than 256 Unicode characters.

How do we map Unicode characters to bytes for writing to a file or transmitting?

Getting ready

Historically, a character occupied 1 byte. Python still uses the old ASCII encoding scheme when it displays bytes; this sometimes leads to confusion between bytes and proper strings of Unicode characters.

Unicode characters are encoded into sequences of bytes. We have a number of standardized encodings and a number of non-standard encodings.

Plus, we also have some encodings that only work for a small subset of Unicode characters. We try to avoid this, but there are some situations where we'll need to use a subset encoding scheme.
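As a small illustration of a subset encoding, the sketch below checks whether a string fits in Latin-1, which covers only the first 256 Unicode code points. The helper name can_encode is a hypothetical one made up for this example, not part of Python.

```python
def can_encode(text, encoding):
    """Return True if every character of text fits in the given encoding.

    Hypothetical helper for illustration only.
    """
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


latin_ok = can_encode('caf\u00e9', 'latin-1')  # e-acute is U+00E9, inside Latin-1
euro_ok = can_encode('\u20ac5', 'latin-1')     # euro sign is U+20AC, outside Latin-1
utf8_ok = can_encode('\u20ac5', 'utf-8')       # UTF-8 covers all of Unicode
print(latin_ok, euro_ok, utf8_ok)
```

Trying the encoding and catching UnicodeEncodeError is usually simpler than inspecting code points by hand.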

Unless we have a really good reason, we almost always use the UTF-8 encoding for Unicode characters. Its main advantage is that it's a compact representation for the Latin alphabet used for English and a number of European languages.

Sometimes, an Internet protocol requires ASCII characters. This is a special case that requires some care because the ASCII encoding can only handle a small subset of Unicode characters.

How to do it...

Python will generally use our OS's default encoding for files and Internet traffic. The details are unique to each OS.

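We can make a general setting using the PYTHONIOENCODING environment variable, which controls the encoding of standard input, standard output, and standard error. We set this outside of Python to ensure that a particular encoding is used every time the interpreter runs. Set the environment variable as: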

      export PYTHONIOENCODING=UTF-8

Run Python:

      python3.5
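Once Python is running, we can check which encodings are actually in effect. This is a minimal sketch; the values printed depend on the OS, the locale, and any environment settings, so no particular output is assumed here.

```python
import locale
import sys

# Encoding of the standard output stream; PYTHONIOENCODING overrides this.
stdout_encoding = sys.stdout.encoding

# Encoding that open() falls back to in text mode when no encoding= is given.
file_default = locale.getpreferredencoding(False)

print(stdout_encoding, file_default)
```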

We sometimes need to make specific settings when we open a file inside our script. We'll return to this in Chapter 8, Input/Output, Physical Format, Logical Layout. Open the file with a given encoding. Read or write Unicode characters to the file:

      >>> with open('some_file.txt', 'w', encoding='utf-8') as output:
      ...     print('You drew \U0001F000', file=output)
      ...
      >>> with open('some_file.txt', 'r', encoding='utf-8') as source:
      ...     text = source.read()
      ...
      >>> text
      'You drew 🀀\n'

We can also manually encode characters, in the rare case that we need to open a file in bytes mode; if we use a mode of wb, we'll need to use manual encoding:

>>> string_bytes = 'You drew \U0001F000'.encode('utf-8')
>>> string_bytes
b'You drew \xf0\x9f\x80\x80'

We can see that a sequence of bytes (\xf0\x9f\x80\x80) was used to encode a single Unicode character, U+1F000 (🀀).
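Here is a sketch of the bytes-mode case: we encode by hand, write with a mode of wb, read back with rb, and decode by hand. The temporary directory and the file name some_file.bin are assumptions made only so the example cleans up after itself.

```python
import os
import tempfile

text = 'You drew \U0001F000'

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'some_file.bin')  # hypothetical file name

    # Bytes mode: the file accepts bytes only, so we encode explicitly.
    with open(path, 'wb') as output:
        output.write(text.encode('utf-8'))

    # Reading in bytes mode gives back the raw bytes, not characters.
    with open(path, 'rb') as source:
        raw = source.read()

decoded = raw.decode('utf-8')
print(len(text), len(raw), decoded == text)  # 10 13 True
```

The string has 10 characters but the file holds 13 bytes, because the last character needs four bytes in UTF-8.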

How it works...

Unicode defines a number of encoding schemes. While UTF-8 is the most popular, there are also UTF-16 and UTF-32. The number is the size, in bits, of the unit each scheme works in; a character occupies one or more of these units. A file with 1000 characters encoded in UTF-32 would be 4000 8-bit bytes. A file with 1000 characters encoded in UTF-8 could be as few as 1000 bytes, depending on the exact mix of characters. In the UTF-8 encoding, characters with Unicode numbers above U+007F require multiple bytes.
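The size arithmetic can be checked directly. This sketch uses the explicit little-endian codec names (utf-16-le, utf-32-le) so that no byte order mark is added to the counts; that choice is an assumption made to keep the numbers clean.

```python
sample = 'a' * 1000  # 1000 characters, all below U+007F

sizes = {
    name: len(sample.encode(name))
    for name in ('utf-8', 'utf-16-le', 'utf-32-le')
}
print(sizes)  # {'utf-8': 1000, 'utf-16-le': 2000, 'utf-32-le': 4000}

# UTF-8 bytes needed for single characters at increasing code points.
utf8_widths = [
    len(char.encode('utf-8'))
    for char in ('A', '\u00e9', '\u20ac', '\U0001F000')
]
print(utf8_widths)  # [1, 2, 3, 4]
```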

Various operating systems have their own coding schemes. Mac OS X files are often encoded in Mac Roman or Latin-1. Windows files might use CP1252 encoding.
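To see why the choice of scheme matters, the sketch below decodes the same four bytes with two of these OS-specific codecs. The sample bytes are made up for the illustration.

```python
raw = b'caf\x8e'  # four bytes; the last one is outside the ASCII range

as_mac = raw.decode('mac_roman')  # 0x8E is e-acute in Mac Roman
as_cp1252 = raw.decode('cp1252')  # 0x8E is Z-caron in CP1252

print(as_mac, as_cp1252)

# Going the other way, the same character maps to different bytes.
mac_bytes = '\u00e9'.encode('mac_roman')
win_bytes = '\u00e9'.encode('cp1252')
print(mac_bytes, win_bytes)
```

The bytes alone do not say which scheme produced them, so the reader of a file has to know the encoding from some other source.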

The point with all of these schemes is to have a sequence of bytes that can be mapped to a Unicode character. Going the other way, we also need a way to map each Unicode character to one or more bytes. Ideally, all of the Unicode characters are accounted for. Pragmatically, some of these coding schemes are incomplete. The tricky part is to avoid writing any more bytes than is necessary.

The historical ASCII encoding can only represent 128 of the Unicode characters as bytes. It's easy to create a string which cannot be encoded using the ASCII scheme.

Here's what the error looks like:

>>> 'You drew \U0001F000'.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f000' in position 9: ordinal not in range(128)

We may see this kind of error when we accidentally open a file with a poorly chosen encoding. When we see this, we'll need to change our processing to select a more useful encoding; ideally, UTF-8.
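When a protocol really does require ASCII, the errors parameter of str.encode() offers alternatives to raising an exception. This is a sketch of three standard handlers; which one is appropriate, if any, depends on what the receiving side expects.

```python
text = 'You drew \U0001F000'

# Substitute a question mark for each character ASCII cannot hold.
replaced = text.encode('ascii', errors='replace')

# Keep the information as a Python-style escape sequence.
escaped = text.encode('ascii', errors='backslashreplace')

# Silently drop the character; this loses data.
ignored = text.encode('ascii', errors='ignore')

print(replaced)  # b'You drew ?'
print(escaped)   # b'You drew \\U0001f000'
print(ignored)   # b'You drew '
```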

Bytes vs Strings

Bytes are often displayed using printable characters. We'll see b'hello' as a short-hand for a five-byte value. The letters are chosen using the old ASCII encoding scheme. Byte values from 0x20 to 0x7E will generally be shown as characters; other values are shown as escapes such as \xff. This can be confusing. The prefix of b' is our hint that we're looking at bytes, not proper Unicode characters.
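A short sketch of how this display works: the bytes object is a sequence of small integers, and only the display borrows ASCII letters. The sample values are chosen for the illustration.

```python
# Five integers; Python displays them with ASCII letters.
data = bytes([104, 101, 108, 108, 111])
same = data == b'hello'

# Indexing a bytes object gives an integer, not a one-character string.
first = data[0]

# Values outside the printable ASCII range are displayed as \x escapes.
mixed = bytes([0x00, 0x41, 0x7E, 0x7F, 0xFF])
shown = repr(mixed)

print(same, first, shown)
```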
