Unicode defines a number of encoding schemes. While UTF-8 is the most popular, there are also UTF-16 and UTF-32. The number gives the size, in bits, of the encoding's basic unit; a single character may occupy one or more of these units. A file with 1000 characters encoded in UTF-32 would be 4000 8-bit bytes. A file with 1000 characters encoded in UTF-8 could be as few as 1000 bytes, depending on the exact mix of characters. In the UTF-8 encoding, characters with Unicode numbers above U+007F require multiple bytes.
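We can see the size differences directly by encoding a short string with each scheme and counting the bytes. This sketch uses the explicit little-endian codec names, utf-16-le and utf-32-le, so that Python doesn't prepend a byte order mark:
>>> len('hello'.encode('utf-8'))
5
>>> len('hello'.encode('utf-16-le'))
10
>>> len('hello'.encode('utf-32-le'))
20
>>> len('You drew \U0001F000'.encode('utf-8'))
13
The last example shows the multi-byte behavior of UTF-8: the nine ASCII characters take nine bytes, while the single character U+1F000 takes four.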
Various operating systems have their own coding schemes. Mac OS X files are often encoded in Mac Roman or Latin-1. Windows files might use CP1252 encoding.
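As a quick sketch of how these schemes disagree, the same character can be represented by different byte values under different codecs. The byte values shown here assume Python's latin-1, cp1252, and mac_roman codecs:
>>> 'é'.encode('latin-1')
b'\xe9'
>>> 'é'.encode('cp1252')
b'\xe9'
>>> 'é'.encode('mac_roman')
b'\x8e'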
The point of all these schemes is to map a sequence of bytes to a Unicode character and, going the other way, to map each Unicode character to one or more bytes. Ideally, all of the Unicode characters are accounted for. Pragmatically, some of these coding schemes are incomplete. The tricky part is to avoid writing any more bytes than necessary.
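Going from bytes to characters, the same byte value can decode to different characters depending on which scheme we assume:
>>> b'\x80'.decode('cp1252')
'€'
>>> b'\x80'.decode('latin-1')
'\x80'
In CP1252, the byte 0x80 is the euro sign; in Latin-1, it maps to the rarely useful control character U+0080.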
The historical ASCII encoding can only represent the first 128 Unicode characters as bytes. It's easy to create a string that cannot be encoded using the ASCII scheme.
Here's what the error looks like:
>>> 'You drew \U0001F000'.encode('ascii')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character '\U0001f000' in position 9: ordinal not in range(128)
We may see this kind of error when we accidentally open a file with a poorly chosen encoding. When we see this, we'll need to change our processing to select a more useful encoding; ideally, UTF-8.
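Here's a minimal sketch of making that choice explicit when reading a file; the name some_file.txt is just a placeholder:
>>> with open('some_file.txt', encoding='utf-8') as source:
...     text = source.read()
The encoding parameter to open() overrides the locale-dependent default, so the same code behaves the same way on every platform.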
Bytes vs Strings

Bytes are often displayed using printable characters. We'll see b'hello' as a short-hand for a five-byte value. The letters are chosen using the old ASCII encoding scheme. Byte values from 0x20 to 0x7E are shown as their ASCII characters; values outside that range are shown as hexadecimal escapes. This can be confusing. The prefix of b' is our hint that we're looking at bytes, not proper Unicode characters.
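We can see both behaviors by encoding a string that mixes ASCII characters with a character outside the ASCII range:
>>> 'You drew \U0001F000'.encode('utf-8')
b'You drew \xf0\x9f\x80\x80'
The first nine bytes are shown as familiar letters; the four bytes that encode U+1F000 are shown as \xf0\x9f\x80\x80 hexadecimal escapes.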