Modern Python Cookbook

130+ updated recipes for modern Python 3.12 with new techniques and tools

Product type Paperback
Published in Jul 2024
Publisher Packt
ISBN-13 9781835466384
Length 818 pages
Edition 3rd Edition
Author Steven F. Lott

1.7 Encoding strings – creating ASCII and UTF-8 bytes

Our computer files are bytes. When we upload or download from the internet, the communication works in bytes. A byte only has 256 distinct values. Our Python characters are Unicode. There are a lot more than 256 Unicode characters.

How do we map Unicode characters to bytes to write to a file or for transmission?

1.7.1 Getting ready

Historically, a character occupied 1 byte. Python leverages the old ASCII encoding scheme when it displays bytes; this sometimes leads to confusion between bytes and text strings of Unicode characters.

Unicode characters are encoded into sequences of bytes. There are a number of standardized encodings and a number of non-standard encodings.

There are also some encodings that only work for a small subset of Unicode characters. We try to avoid these, but there are some situations where we’ll need to use a subset encoding scheme.

Unless we have a really good reason not to, we almost always use UTF-8 encoding for Unicode characters. Its main advantage is that it’s a compact representation of the Latin alphabet, which is used for English and a number of European languages.

Sometimes, an internet protocol requires ASCII characters. This is a special case that requires some care because the ASCII encoding can only handle a small subset of Unicode characters.
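As a rough illustration of the care involved, the sketch below checks whether a string fits in the ASCII subset before encoding it, and falls back to an explicit error policy when it does not. The helper name `to_ascii_bytes` and the choice of the `backslashreplace` handler are assumptions made for this example; a real protocol may call for a different policy.

```python
def to_ascii_bytes(text: str) -> bytes:
    """Encode text as ASCII, escaping anything outside the ASCII subset.

    Hypothetical helper, for illustration only.
    """
    if text.isascii():
        # Every character is in the range U+0000 to U+007F.
        return text.encode("ascii")
    # Substitute backslash escapes for non-ASCII characters
    # instead of raising UnicodeEncodeError.
    return text.encode("ascii", errors="backslashreplace")


plain = to_ascii_bytes("You drew a tile")
escaped = to_ascii_bytes("You drew \U0001F000")
print(plain)
print(escaped)
```

The escaped form is lossy from the protocol's point of view: the receiver sees the escape text, not the original character, so both ends need to agree on the convention.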

1.7.2 How to do it...

Python will generally use our OS’s default encoding for files and internet traffic. The details are unique to each OS:

  1. We can make a general setting using the PYTHONIOENCODING environment variable. We set this outside of Python so that a particular encoding is used for the standard input, output, and error streams. It does not change the default encoding of files we open with open(); step 3 covers that case. When using Linux or macOS, use the shell’s export statement to set the environment variable. For Windows, use the set command, or the PowerShell Set-Item cmdlet. For Linux, it looks like this:

    (cookbook3) % export PYTHONIOENCODING=UTF-8
  2. Run Python:

    (cookbook3) % python
  3. We sometimes need to make specific settings when we open a file inside our script. We’ll return to this topic in Chapter 11. Open the file with a given encoding. Read or write Unicode characters to the file:

    >>> with open('some_file.txt', 'w', encoding='utf-8') as output:
    ...     print('You drew \U0001F000', file=output)
    >>> with open('some_file.txt', 'r', encoding='utf-8') as input:
    ...     text = input.read()
    >>> text
    'You drew 🀀\n'
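To see which encodings are in effect for a particular interpreter, a short check like the following may help. This is a sketch using standard library calls only; the variable names are chosen for this example, and the values printed depend on the OS, the locale, and any environment variables that have been set.

```python
import locale
import sys

# Encoding of the standard output stream; PYTHONIOENCODING can override it.
# This may be None if stdout has been replaced by an object without one.
stdout_encoding = sys.stdout.encoding

# Encoding that open() typically uses for text files when no
# encoding= argument is given.
file_default = locale.getpreferredencoding(False)

# Encoding that str.encode() and bytes.decode() use when none is named.
codec_default = sys.getdefaultencoding()

print(f"stdout: {stdout_encoding}")
print(f"open() default: {file_default}")
print(f"str.encode() default: {codec_default}")
```

Because the first two values can differ from machine to machine, passing encoding='utf-8' explicitly to open() is the more predictable choice.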

We can also manually encode characters, in the rare case that we need to open a file in bytes mode; if we use a mode of wb, we’ll also need to use manual encoding of each string:

>>> string_bytes = 'You drew \U0001F000'.encode('utf-8')
>>> string_bytes
b'You drew \xf0\x9f\x80\x80'

We can see that a sequence of bytes (\xf0\x9f\x80\x80) was used to encode a single Unicode character, U+1F000, 🀀.
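The bytes-mode case mentioned above can be sketched as a round trip: encode before writing with a mode of wb, and decode after reading with a mode of rb. The file name `some_file.bin` and the use of a temporary directory are assumptions made to keep the example self-contained.

```python
import tempfile
from pathlib import Path

message = "You drew \U0001F000"

with tempfile.TemporaryDirectory() as work_dir:
    target = Path(work_dir) / "some_file.bin"  # hypothetical file name

    # Bytes mode accepts only bytes, so the string is encoded first.
    with target.open("wb") as output:
        output.write(message.encode("utf-8"))

    # Reading in bytes mode returns raw bytes, not text.
    with target.open("rb") as source:
        raw = source.read()

# Decoding with the same encoding recovers the original characters.
recovered = raw.decode("utf-8")
print(len(message), len(raw))
```

The two lengths differ because the string is counted in characters and the file content in bytes.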

1.7.3 How it works...

Unicode defines a number of encoding schemes. While UTF-8 is the most popular, there are also UTF-16 and UTF-32. The number is the typical number of bits per character. A file with 1,000 characters encoded in UTF-32 would be 4,000 8-bit bytes. A file with 1,000 characters encoded in UTF-8 could be as few as 1,000 bytes, depending on the exact mix of characters. In UTF-8 encoding, characters with Unicode numbers above U+007F require multiple bytes.
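These size differences can be checked directly. The following sketch encodes four sample characters (picked for this example) in each scheme and counts the bytes. It uses the little-endian codec names `utf-16-le` and `utf-32-le` on the assumption that we want to measure the characters alone; the plain `utf-16` and `utf-32` codecs in Python also prepend a byte order mark.

```python
samples = {
    "A": "U+0041, Latin letter",
    "\u00e9": "U+00E9, accented Latin letter",
    "\u20ac": "U+20AC, euro sign",
    "\U0001F000": "U+1F000, mahjong tile",
}

# Bytes needed per character in each encoding, in the order of samples.
utf8_sizes = [len(char.encode("utf-8")) for char in samples]
utf16_sizes = [len(char.encode("utf-16-le")) for char in samples]
utf32_sizes = [len(char.encode("utf-32-le")) for char in samples]

for (char, description), n8, n16, n32 in zip(
    samples.items(), utf8_sizes, utf16_sizes, utf32_sizes
):
    print(f"{description}: UTF-8={n8} UTF-16={n16} UTF-32={n32}")

# A 1,000-character file of plain Latin letters.
text = "A" * 1000
print(len(text.encode("utf-8")), len(text.encode("utf-32-le")))
```

Only UTF-32 uses a fixed number of bytes per character; UTF-8 and UTF-16 both grow for characters outside their compact range.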

Various OSes have their own coding schemes. macOS files can be encoded in Mac Roman or Latin-1. Windows files might use CP1252 encoding.

The point with all of these schemes is to have a sequence of bytes that can be mapped to a Unicode character and—going the other way—a way to map each Unicode character to one or more bytes. Ideally, all of the Unicode characters are accounted for. Pragmatically, some of these coding schemes are incomplete.
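A small sketch can make the differences between these schemes concrete. It encodes the same character with several of Python's standard codecs and then tries a character that one of them lacks. The sample characters and variable names are assumptions chosen for this example.

```python
e_acute = "\u00e9"  # é
euro = "\u20ac"     # €

# The same character maps to different bytes in different schemes.
encoded = {
    name: e_acute.encode(name)
    for name in ("utf-8", "latin-1", "cp1252", "mac_roman")
}
for name, data in encoded.items():
    print(f"{name:10} {data!r}")

# CP1252 has a byte for the euro sign; Latin-1 has none.
euro_cp1252 = euro.encode("cp1252")
try:
    euro.encode("latin-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False

print(euro_cp1252, euro_fits_latin1)
```

Latin-1 is one of the incomplete schemes: it covers only the first 256 Unicode code points.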

The historical form of ASCII encoding can only represent 128 of the Unicode characters as bytes, and only about 100 of those are printable. It’s easy to create a string that cannot be encoded using the ASCII scheme.

Here’s what the error looks like:

>>> 'You drew \U0001F000'.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f000' in position 9: ordinal not in range(128)

We may see this kind of error when we accidentally open a file with an encoding that’s not the widely used standard of UTF-8. When we see this kind of error, we’ll need to change our processing to select the encoding actually used to create the file. It’s almost impossible to guess what encoding was used, so some research may be required to locate metadata about the file that states the encoding.
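The reading side of this problem can be sketched without a file, by decoding bytes with an encoding other than the one that produced them. The sample word and the variable names below are assumptions for illustration.

```python
# Bytes as a Windows tool using CP1252 might have written them.
data = "caf\u00e9".encode("cp1252")

failure = None
try:
    # Treating the bytes as UTF-8 fails on the accented character.
    data.decode("utf-8")
except UnicodeDecodeError as error:
    failure = f"{error.encoding} failed at byte {error.start}: {error.reason}"

# Decoding with the encoding actually used recovers the text.
text = data.decode("cp1252")

# A wrong guess does not always raise an exception; it can
# silently produce different characters instead.
wrong = data.decode("mac_roman")

print(failure)
print(text, wrong)
```

The silent case is the more dangerous one, which is another reason to look for metadata about the file instead of trying encodings until one happens to work.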

Bytes are often displayed using printable characters. We’ll see b'hello' as shorthand for a five-byte value. The letters are chosen using the old ASCII encoding scheme, where byte values from 0x20 to 0x7E will be shown as characters, and outside this range, more complex-looking escapes will be used.

This use of characters to represent byte values can be confusing. The prefix of b' is our hint that we’re looking at bytes, not proper Unicode characters.
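To see that a bytes value is really a sequence of small integers, a sketch like this one builds values from numbers and looks at how Python displays them. The specific byte values are chosen for this example.

```python
# Five integers that happen to be the ASCII codes for "hello".
data = bytes([104, 101, 108, 108, 111])
print(data)        # b'hello'

# Indexing or iterating yields integers, not characters.
first = data[0]
values = list(data)
print(first, values)

# Values outside the printable ASCII range are shown as escapes.
mixed = bytes([0x00, 0x41, 0x7E, 0x7F, 0xFF])
shown = repr(mixed)
print(shown)
```

Only 0x41 and 0x7E appear as characters in the last value; the other three bytes are displayed as \x escapes.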

1.7.4 See also

  • There are a number of ways to build strings of data. See the Building complicated strings with f-strings and the Building complicated strings from lists of strings recipes for examples of creating complex strings. The idea is that we might have an application that builds a complex string, and then we encode it into bytes.

  • For more information on UTF-8 encoding, see https://en.wikipedia.org/wiki/UTF-8.

  • For general information on Unicode encodings, see http://unicode.org/faq/utf_bom.html.
