To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename. The following table lists limitations for some popular filesystems:
Operating system | Filesystem | Invalid filename characters | Maximum filename length |
Linux | Ext3/Ext4 | / and \0 | 255 bytes |
OS X | HFS Plus | : and \0 | 255 UTF-16 code units |
Windows | NTFS | \, /, ?, :, *, ", >, <, and the vertical bar (|) | 255 characters |
To keep our file path safe across these filesystems, we restrict it to digits, letters, and basic punctuation, and replace every other character with an underscore, as shown in the following code:
>>> import re
>>> url = 'http://example.webscraping.com/default/view/Australia-1'
>>> re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)
'http_//example.webscraping...
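The substitution above can be wrapped into a small helper that also respects the length limits from the table. The following is only a minimal sketch under a few assumptions: the function name `url_to_path`, the constant `MAX_SEGMENT_LENGTH`, the choice to drop the scheme, and the `index.html` default for directory-like URLs are all hypothetical and not part of the original example.

```python
import re
from urllib.parse import urlsplit

# Assumed per-component limit, based on the 255 figure in the table above.
MAX_SEGMENT_LENGTH = 255


def url_to_path(url):
    """Map a URL to a relative file path using only whitelisted characters."""
    parts = urlsplit(url)
    path = parts.path
    # Hypothetical convention: give directory-like URLs a file name,
    # so that a path never ends in a bare directory separator.
    if not path:
        path = '/index.html'
    elif path.endswith('/'):
        path += 'index.html'
    filename = parts.netloc + path
    if parts.query:
        filename += '?' + parts.query
    # Replace every character outside the whitelist with an underscore.
    filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
    # Truncate each path component to the assumed maximum length.
    return '/'.join(segment[:MAX_SEGMENT_LENGTH]
                    for segment in filename.split('/'))
```

Because only ASCII characters survive the substitution, 255 characters here also means 255 bytes. Note that both the substitution and the truncation can map different URLs to the same path, so a cache built on this sketch may see collisions.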