Solving problems – recovering a lost password
We'll apply many of our techniques to writing a program to help us poke around inside a locked ZIP file. It's important to note that any competent encryption scheme doesn't encrypt a password. Passwords are, at worst, reduced to a hash value. When someone enters a password, the hash values are compared. The original password remains essentially unrecoverable except by guessing.
We'll look at a kind of brute-force password recovery scheme. It will simply try all of the words in a dictionary. More elaborate guessing schemes will use dictionary words and punctuation to form longer and longer candidate passwords. Even more elaborate guessing will include leet speak replacements of characters. For example, using 1337 sp3@k instead of leet speak.
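To make the leet speak idea concrete, here is a minimal sketch of how candidate spellings might be generated from a single word. The LEET substitution table and the leet_variants() name are our own assumptions for illustration; real guessing tools use much larger tables.

```python
from itertools import product

# Hypothetical substitution table: each letter maps to the characters
# that may stand in for it (the letter itself comes first).
LEET = {"a": "a@4", "e": "e3", "l": "l1", "o": "o0", "s": "s5", "t": "t7"}

def leet_variants(word):
    """Yield every spelling of word allowed by the LEET table."""
    # Characters missing from the table can only be themselves.
    choices = [LEET.get(ch, ch) for ch in word]
    for combo in product(*choices):
        yield "".join(combo)

variants = set(leet_variants("leet"))
print(len(variants), "1337" in variants)
```

The number of variants grows multiplicatively with word length, which is why this kind of guessing gets expensive quickly.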
Before we look into how ZIP files work, we'll have to find a usable word corpus. A common stand-in for a corpus is a spell-check dictionary. For GNU/Linux or Mac OS X computers, there are several places a dictionary can be found. Three common places are: /usr/dict/words, /usr/share/dict/words, or possibly /usr/share/myspell/dicts.
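A quick way to check these locations is to ask Python which of them exist. This is only a sketch: the CANDIDATES list holds the three paths named above, and find_corpora() is a name we made up for the example.

```python
from pathlib import Path

# The three common locations mentioned in the text.
CANDIDATES = [
    "/usr/dict/words",
    "/usr/share/dict/words",
    "/usr/share/myspell/dicts",
]

def find_corpora(candidates=CANDIDATES):
    """Return the candidate paths that actually exist on this machine."""
    return [Path(p) for p in candidates if Path(p).exists()]

for path in find_corpora():
    # Some locations are single files; others are directories of dictionaries.
    kind = "directory" if path.is_dir() else "file"
    print(path, kind)
```

If nothing is printed, none of the three locations exist on this computer and we'll have to look elsewhere.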
Windows agents may have to search around a bit for similar dictionary resources. Look in %AppData%\Microsoft\Spelling\EN as a possible location. The dictionaries are often a .dic file. There may also be an associated .aff (affix rules) file, with additional rules for building words from the stem words (or lemmas) in the .dic file.
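If all we have is a .dic file, we can still pull the stem words out of it. The following sketch assumes the Hunspell-style layout, where the first line is a word count and any affix flags follow a slash. The stems_from_dic() name and the sample lines (including their flags) are made up for illustration, and the affix rules in the .aff file are not applied.

```python
def stems_from_dic(lines):
    """Yield stem words from the lines of a Hunspell-style .dic file.

    Assumes the first line is a word count and that affix flags,
    if present, follow a slash. Affix rules are NOT expanded.
    """
    lines = iter(lines)
    next(lines, None)  # skip the leading count line
    for line in lines:
        entry = line.strip()
        if entry:
            # Keep only the part before the first slash.
            yield entry.split("/", 1)[0]

# Made-up sample data in the assumed layout.
sample = ["3\n", "hello\n", "work/DGS\n", "try/B\n"]
print(list(stems_from_dic(sample)))
```

The stems alone make a smaller corpus than the fully expanded word list, so a password built from an inflected form could be missed.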
If we can't track down a usable word corpus, it may be best to install a standalone spell checking program, along with its dictionaries. Programs such as aspell, ispell, Hunspell, Open Office, and LibreOffice contain extensive collections of spelling dictionaries for a variety of languages.
There are other ways to get various word corpora. One way is to search all of the text files for all of the words in all of those files. The words we used to create a password may be reflected in words we actually use in other files.
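Here is one way such a harvest might look. It's a sketch under simple assumptions: words are runs of ASCII letters, only files matching *.txt are read, and undecodable bytes are ignored. The function names are ours.

```python
import re
from pathlib import Path

# Assumption: a "word" is a run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")

def words_in_text(text):
    """Return the set of distinct alphabetic words in a string."""
    return set(WORD.findall(text))

def words_in_tree(root, pattern="*.txt"):
    """Collect distinct words from every matching file under root."""
    found = set()
    for path in Path(root).rglob(pattern):
        found |= words_in_text(path.read_text(errors="ignore"))
    return found

print(sorted(words_in_text("The launch code is Swordfish, not swordfish2.")))
```

Case is preserved on purpose, since a password may depend on it.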
Another good approach is to use the Python Natural Language Toolkit (NLTK), which has a number of resources for handling natural language processing. As this manual was going to press, a version was released that works with Python 3. See https://pypi.python.org/pypi/nltk. This library provides lexicons, several wordlist corpora, and word stemming tools that are far better than simplistic spell-checking dictionaries.
Your mission is to locate a dictionary on your computer. If you can't find one, then download a good spell-check program and use its dictionary. A web search for web2 (Webster's Second International) may turn up a usable corpus.
Reading a word corpus
The first thing we need to do is read our spell-check corpus. We'll call it a corpus (a body of words) rather than a dictionary. The examples will be based on web2 (Webster's Second International), all 234,936 words of it. This is generally available in BSD Unix and Mac OS X.
Here's a typical script that will examine a corpus:
    count = 0
    corpus_file = "/usr/share/dict/words"
    with open(corpus_file) as corpus:
        for line in corpus:
            word = line.strip()  # drop the trailing newline
            if len(word) == 10:
                print(word)
                count += 1
    print(count)
We've opened the corpus file and read all of the lines. The word was located by stripping whitespace from the line; this removes the trailing \n character. An if statement was used to filter the 10-letter words. There are 30,878 of those, from abalienate to Zyzzogeton.
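The same loop generalizes easily from one word length to all of them. The sketch below counts how many words of each length appear; length_histogram() is a name we invented, and the tiny sample list stands in for a real corpus file.

```python
from collections import Counter

def length_histogram(lines):
    """Count how many words of each length appear in an iterable of lines."""
    # Blank lines are skipped so they don't show up as zero-length words.
    return Counter(len(line.strip()) for line in lines if line.strip())

# Stand-in data; an open corpus file can be passed in the same way.
sample = ["abalienate\n", "Zyzzogeton\n", "cat\n", "\n"]
print(length_histogram(sample))
```

Passing the open corpus file instead of the sample list gives a picture of how the candidate passwords are distributed by length.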
This little script isn't really part of any larger application. It's a kind of technology spike—something we're using to nail down a detail. When writing little scripts like this, we'll often skip careful design of classes or functions and just slap some Python statements into a file.
In POSIX-compliant OSes, we can do two more things to make a script easy to work with. First, we can add a special comment on the very first line of the file to help the OS figure out what to do with it. The line looks like this:
#!/usr/bin/env python3
This tells the OS how to handle the script. Specifically, it tells the OS to use the env program. The env program will then locate our installation of Python 3. Responsibility will be handed off to the python3 program.
The second step is to mark the script as executable. We use the OS command, chmod +x some_file.py, to mark a Python file as an executable script.
If we've done these two steps, we can execute a script by simply typing its name at the command prompt.
In Windows, the file extension (.py) is associated with the Python program. There is an Advanced Settings panel that defines these file associations. When you installed Python, the association was built by the installer. This means that you can enter the name of a Python script and Windows will search through the directories named in your PATH value and execute that script properly.
Reading a ZIP archive
We'll use Python's zipfile module to work with a ZIP archive. This means we'll need to use import zipfile before we can do anything else. Since a ZIP archive contains multiple files, we'll often want to get a listing of the available files in the archive. Here's how we can survey an archive:
    import zipfile

    with zipfile.ZipFile("demo.zip", "r") as archive:
        archive.printdir()
We've opened the archive, creating a file processing context. We then used the archive's printdir() method to dump the members of the archive.
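The listing can also be built by hand from the archive's infolist(), which lets us show whether each member is flagged as encrypted. The sketch below relies on bit 0 of each member's flag_bits, which the ZIP format uses to mark encryption. The describe_members() name is ours, and the small archive is built in memory (unencrypted) only so the example runs anywhere.

```python
import io
import zipfile

def describe_members(archive):
    """Return (filename, size, encrypted) for each member of an open ZipFile."""
    return [
        # Bit 0 of the general-purpose flags marks an encrypted member.
        (info.filename, info.file_size, bool(info.flag_bits & 0x1))
        for info in archive.infolist()
    ]

# Build a small unencrypted archive in memory as stand-in data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("hello.txt", "hello world")

with zipfile.ZipFile(buffer, "r") as archive:
    for row in describe_members(archive):
        print(row)
```

Run against a locked archive, the third item of each row would be True for the members that need a password.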
We can't, however, extract any of the files because the ZIP archive was encrypted and we lost the password. Here's a script that will try to read the first member:
    import zipfile

    with zipfile.ZipFile("demo.zip", "r") as archive:
        archive.printdir()
        first = archive.infolist()[0]  # the first member of the archive
        with archive.open(first) as member:
            text = member.read()
            print(text)
We've created a file processing context using the open archive. We used the infolist() method to get information on each member. The archive.infolist()[0] expression will pick item zero from the list, that is, the first item.
We tried to create a file processing context for this specific member. Instead of seeing the content of the member, we get an exception. The details will vary, but your exception message will look like this:
RuntimeError: File <zipfile.ZipInfo object at 0x1007e78e8> is encrypted, password required for extraction
The hexadecimal number (0x1007e78e8) may not match your output, but you'll still get an error trying to read an encrypted ZIP file.
Using brute-force search
To recover the files, we'll need to resort to a brute-force search for a workable password. This means inserting our corpus-reading loop into our archive processing context. It's a bit of flashy copy-and-paste that leads to a script like the following:
    import zipfile
    import zlib

    corpus_file = "/usr/share/dict/words"

    with zipfile.ZipFile("demo.zip", "r") as archive:
        first = archive.infolist()[0]
        print("Reading", first.filename)
        with open(corpus_file) as corpus:
            for line in corpus:
                # Passwords must be bytes, not str.
                word = line.strip().encode("ASCII")
                try:
                    with archive.open(first, 'r', pwd=word) as member:
                        text = member.read()
                    print("Password", word)
                    print(text)
                    break  # stop at the first password that works
                except (RuntimeError, zlib.error, zipfile.BadZipFile):
                    pass  # wrong guess; try the next word
We've imported two libraries: zipfile as well as zlib. We added zlib because it turns out that we'll sometimes see zlib.error exceptions when guessing passwords. We created a context for our open archive file. We used the infolist() method to get information on the members and fetched just the first member from that list. If we can read one file, we can read them all, provided the whole archive was locked with a single password, which is the usual case.
Then we opened our corpus file, and created a file processing context for that file. For each line in the corpus, we used two methods of the line: the strip() method will remove the trailing "\n", and the encode("ASCII") method will transform the line from Unicode characters to ASCII bytes. We need this because the zipfile module expects passwords as bytes, not proper Unicode character strings.
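One wrinkle: encode("ASCII") raises a UnicodeEncodeError if a corpus word contains a non-ASCII character. A small helper can skip such words instead of stopping the search. This is a sketch; candidate_bytes() is a name we made up, and skipping (rather than trying another encoding) is a design choice, not a requirement.

```python
def candidate_bytes(line):
    """Return the stripped line as ASCII bytes, or None if it can't be encoded."""
    try:
        return line.strip().encode("ASCII")
    except UnicodeEncodeError:
        # Non-ASCII word: signal the caller to skip it.
        return None

print(candidate_bytes("swordfish\n"))
print(candidate_bytes("caf\u00e9\n"))
```

A search loop would then continue to the next line whenever the helper returns None.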
The try: block attempts to open and read the first member. We created a file processing context for this member within the archive. We tried to read the member. If anything goes wrong while we are trying to read the encrypted member, an exception will be raised. The usual culprit, of course, is attempting to read the member with the wrong password.
If everything works well, then we guessed the correct password. We can print the recovered password, as well as the text of the member as a confirmation.
Note that we've used a break statement to end the corpus-processing for loop. This changes the for loop's semantics from "for all words" to "there exists a word". The break statement means the loop ends as soon as a valid password is found. No further words in the corpus need to be processed.
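The same "there exists" idea can be written without an explicit break by asking for the first item of a filtered generator. This is a sketch of the pattern only: first_match(), check(), and the word list are hypothetical stand-ins, and check() merely compares against a known value instead of opening an archive.

```python
def first_match(candidates, is_valid):
    """Return the first candidate accepted by is_valid, or None if none is."""
    # next() stops pulling from the generator as soon as one item is produced.
    return next((c for c in candidates if is_valid(c)), None)

tried = []

def check(word):
    # Stand-in for a real password test; records each word it is given.
    tried.append(word)
    return word == b"swordfish"

words = [b"alpha", b"swordfish", b"zulu"]
print(first_match(words, check), tried)
```

The tried list shows that the word after the match was never examined, just as with the break statement.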
We've listed three kinds of exceptions that might be raised from attempting to use a bad password. It's not obvious why different kinds of exceptions may be raised by wrong passwords. But it's easy to run some experiments to confirm that a variety of different exceptions really are raised by a common underlying problem.
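One such experiment might tally the outcome of each guess by exception type. The sketch below is deliberately generic: tally_outcomes() accepts any callable, and fake_attempt() is a made-up stand-in that does not reproduce the real behavior of zipfile.

```python
from collections import Counter

def tally_outcomes(attempt, candidates):
    """Call attempt(candidate) for each candidate and count what happens."""
    outcomes = Counter()
    for candidate in candidates:
        try:
            attempt(candidate)
            outcomes["success"] += 1
        except Exception as exc:  # deliberately broad: this is an experiment
            outcomes[type(exc).__name__] += 1
    return outcomes

def fake_attempt(word):
    # Hypothetical stand-in for reading an encrypted member.
    if word == b"swordfish":
        return b"secret"
    if len(word) % 2:
        raise RuntimeError("bad password")
    raise ValueError("corrupt data")

print(tally_outcomes(fake_attempt, [b"abc", b"abcd", b"swordfish"]))
```

With a real archive, the attempt could be something like lambda pwd: archive.read(first, pwd=pwd), run over a few thousand corpus words to see which exception names turn up.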