Disk cache
To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists the limitations for some popular filesystems:
Operating system | File system | Invalid filename characters | Maximum filename length |
---|---|---|---|
Linux | Ext3/Ext4 | `/` and `\0` | 255 bytes |
OS X | HFS Plus | `:` and `\0` | 255 UTF-16 code units |
Windows | NTFS | `\`, `/`, `?`, `:`, `*`, `"`, `>`, `<`, and `\|` | 255 characters |
To keep our file path safe across these filesystems, we restrict it to numbers, letters, and basic punctuation, and replace every other character with an underscore, as shown in the following code:
    >>> import re
    >>> url = 'http://example.webscraping.com/default/view/Australia-1'
    >>> re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
    'http_//example.webscraping.com/default/view/Australia-1'
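The substitution above can be wrapped in a small reusable function. The following is a minimal sketch rather than a finished implementation: the name `sanitize_url` and the second sample URL are assumptions made for illustration, and the character whitelist is the one from the interactive example.

```python
import re

# Same whitelist as the interactive example: letters, digits, the forward
# slash, and a little basic punctuation. Anything else is unsafe.
_UNSAFE_CHARS = re.compile(r'[^/0-9a-zA-Z\-.,;_ ]')


def sanitize_url(url):
    """Replace every character outside the whitelist with an underscore.

    Hypothetical helper name, used here only for illustration.
    """
    return _UNSAFE_CHARS.sub('_', url)


if __name__ == '__main__':
    for url in ['http://example.webscraping.com/default/view/Australia-1',
                'http://example.webscraping.com/search?term=a*b']:
        print(sanitize_url(url))
```

The forward slash is deliberately left in the whitelist so that the URL path can later be mapped onto directories. One consequence of this kind of replacement is that distinct URLs can collapse to the same sanitized string (for example, `a?b` and `a*b` both become `a_b`), which is a trade-off to keep in mind.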
Additionally, the filename and the parent directories...