Let's say we're interested in offshore marine weather forecasts. Perhaps because we own a large sailboat. Or perhaps because good friends of ours have a large sailboat and are departing the Chesapeake Bay for the Caribbean.
Are there any special warnings coming from the National Weather Service office in Wakefield, Virginia?
Here's where we can get the warnings: http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ.
We can download this with Python's urllib module:
>>> import urllib.request
>>> warnings_uri = 'http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ'
>>> with urllib.request.urlopen(warnings_uri) as source:
...     warnings_text = source.read()
Or, we can use programs like curl or wget to get this. We might do:
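curl -O 'http://www.nws.noaa.gov/view/national.php?prod=SMW&sid=AKQ'
mv 'national.php?prod=SMW&sid=AKQ' AKQ.html
The quotes around the URL matter: without them, the shell treats the & as a background operator and curl never sees the sid=AKQ parameter. Since curl left us with an awkward file name, we needed to rename the file.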
The warnings_text value is a stream of bytes. It's not a proper string. We can tell because it starts like this:
>>> warnings_text[:80]
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or'
It goes on for a while, providing details. Because it starts with b', it's bytes, not proper Unicode characters. It was probably encoded with UTF-8, which means any non-ASCII characters will show up as weird-looking \xnn escape sequences instead of proper characters. We want to have the proper characters.
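Here is a minimal sketch of getting proper characters out of downloaded bytes. It assumes the page is UTF-8 unless the server says otherwise. The decode_page helper and the sample bytes are made up for illustration; they are not part of the NWS page or of urllib.

```python
def decode_page(raw, charset=None):
    """Decode downloaded bytes into a str.

    charset is whatever encoding the server declared, if any.
    Falling back to UTF-8 is an assumption, not a guarantee.
    """
    encoding = charset or "utf-8"
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError.
    return raw.decode(encoding, errors="replace")

# Hypothetical sample: the degree sign is two bytes in UTF-8.
sample = b"Position 37\xc2\xb0N"
text = decode_page(sample)

print(type(sample).__name__, len(sample))  # bytes 14
print(type(text).__name__, len(text))      # str 13
```

With a live response, source.headers.get_content_charset() reports the character set the server declared, or None if it declared nothing. That value could be passed as the charset argument.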
Bytes vs Strings
Bytes are often displayed using printable characters. We'll see b'hello' as a short-hand for a five-byte value. The letters are chosen using the old ASCII encoding scheme. Byte values from 0x20 to 0x7E will be shown as characters; most other values are shown as \xnn escapes. This can be confusing. The prefix of b' is our hint that we're looking at bytes, not proper Unicode characters.
Generally, bytes behave somewhat like strings. Sometimes we can work with bytes directly. Most of the time, we'll want to decode the bytes and create proper Unicode characters.
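A short sketch of these points, using a made-up value rather than the downloaded page. It shows how bytes are displayed, where they act like strings, and where they don't.

```python
# Five bytes in the printable ASCII range display as letters.
greeting = bytes([0x68, 0x65, 0x6C, 0x6C, 0x6F])
print(greeting)                    # b'hello'

# A non-ASCII character takes two bytes in UTF-8; those bytes
# are displayed as \xnn escapes.
word = "caf\u00e9"
encoded = word.encode("utf-8")
print(encoded)                     # b'caf\xc3\xa9'
print(len(word), len(encoded))     # 4 5

# Many string-like operations work directly on bytes.
print(encoded.startswith(b"caf"))  # True
print(encoded[:3])                 # b'caf'

# Unlike a string, indexing a single position gives an integer.
print(encoded[0])                  # 99

# Decoding recovers the original Unicode string.
print(encoded.decode("utf-8") == word)  # True
```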