We will now discuss encoding ASCII data as bytes and base64 encoding these bytes. We will also cover base64 encoding for binary data and decoding to get back to the original input.
base64 encoding
ASCII data
In ASCII, each character turns into one byte:
- A is 65 in base 10, and in binary, it is 0b01000001. The most significant bit is 0 because there's no 128, the next bit is 1 for 64, and the last bit is 1, so you have 64 + 1 = 65.
- Next come B, which is 66 in base 10, and C, which is 67. The binary for B is 0b01000010, and for C, it is 0b01000011.
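These values can be checked directly in Python. The following is a minimal sketch written for Python 3 (an assumption, since the session in this section uses Python 2); the helper name char_to_bits is made up for illustration:

```python
def char_to_bits(ch):
    """Return the 8-bit binary string for a single ASCII character."""
    # ord() gives the code point (65 for "A"); "08b" zero-pads to 8 binary digits.
    return format(ord(ch), "08b")

for ch in "ABC":
    print(ch, ord(ch), "0b" + char_to_bits(ch))
# A 65 0b01000001
# B 66 0b01000010
# C 67 0b01000011
```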
The three-letter string ABC can be interpreted as a 24-bit string that looks like this:
01000001 01000010 01000011
The spaces are there just to show where the bytes are broken out. To interpret that as base64, you need to break it into groups of 6 bits. 6 bits have a total of 64 combinations, so you need 64 characters to encode it.
The characters used are as follows:
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/
We use the capital letters for the first 26 values, lowercase letters for the next 26, and the digits for another 10, which gets you up to 62 characters. In the most common form of base64, you use + and / for the last two characters.
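As a quick check, the 64-character alphabet can be assembled from constants in Python's string module. This is only a sketch; the name ALPHABET is chosen here for illustration:

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

print(len(ALPHABET))         # 64
print(ALPHABET[0])           # A (the value 0, six zero bits)
print(ALPHABET.index("Q"))   # 16
```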
If you have an ASCII string of three characters, it turns into 24 bits interpreted as 3 groups of 8. If you instead break them up into 4 groups of 6, you have 4 numbers between 0 and 63. For ABC, the groups are 010000, 010100, 001001, and 000011 (16, 20, 9, and 3), and they turn into Q, U, J, and D. In Python 2, you just follow the string with the encode method:
>>> "ABC".encode("base64")
'QUJD\n'
This does the encoding. Python adds a newline (\n) at the end, which does not affect the decoding.
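The regrouping of 24 bits into four 6-bit values can be sketched by hand and compared with the standard library. The example below assumes Python 3, where the str.encode("base64") shortcut is not available and the base64 module is used instead. The function encode_three_bytes is a hypothetical helper that handles only exactly three bytes and does no padding:

```python
import base64
import string

ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

def encode_three_bytes(data):
    """Encode exactly three bytes as four base64 characters."""
    if len(data) != 3:
        raise ValueError("this sketch only handles exactly three bytes")
    # Join the three bytes into one 24-bit string.
    bits = "".join(format(byte, "08b") for byte in data)
    # Split into four 6-bit groups and look each one up in the alphabet.
    groups = [bits[i:i + 6] for i in range(0, 24, 6)]
    return "".join(ALPHABET[int(group, 2)] for group in groups)

print(encode_three_bytes(b"ABC"))     # QUJD
print(base64.b64encode(b"ABC"))       # b'QUJD'
print(base64.encodebytes(b"ABC"))     # b'QUJD\n'
```

Note that base64.b64encode returns the encoded bytes without a trailing newline, while base64.encodebytes appends one, matching the Python 2 output shown above.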
What if you have something other than a group of 3 bytes?
If you have four bytes of input, the base64 encoding ends with two equals signs, indicating that two bytes of padding had to be added. If you have five bytes, you have one equals sign, and if you have six bytes, there are no equals signs, indicating that the input fit neatly into base64 with no need for padding. The padding bytes are null (zero) bytes.
You can take ABCD and encode it, and then take ABCD with an explicit byte of zero. \x00 means a single byte with eight bits of zero. You get the same result, except with an extra A and only one equals sign, and if you fill it out all the way with two bytes of zero, you get capital A all the way to the end. Remember: a capital A is the very first character in base64. It stands for six bits of zero.
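The relationship between input length and the number of equals signs can be sketched as follows. This assumes Python 3 and the base64 module; padding_count is a hypothetical helper name:

```python
import base64

def padding_count(data):
    """Number of '=' characters base64 appends for this input length."""
    # Input is processed in 3-byte groups; a short final group is padded.
    return (3 - len(data) % 3) % 3

for text in (b"ABC", b"ABCD", b"ABCDE", b"ABCDEF"):
    encoded = base64.b64encode(text).decode("ascii")
    print(len(text), encoded, padding_count(text))
# 3 QUJD 0
# 4 QUJDRA== 2
# 5 QUJDREU= 1
# 6 QUJDREVG 0
```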
Let's take a look at base64 encoding in Python:
- We will start Python and make a string. If you just type a string in quotes and press Enter, the interactive interpreter prints it back:
>>> "ABC"
'ABC'
- Python will print the result of each calculation automatically. If we encode that with base64, we will get this:
>>> "ABC".encode("base64")
'QUJD\n'
- It turns into QUJD with an extra newline at the end. If we make it longer, we get this:
>>> "ABCD".encode("base64")
'QUJDRA==\n'
- This has two equals signs because we started with four bytes, and it had to add two more to make it a multiple of three:
>>> "ABCDE".encode("base64")
'QUJDREU=\n'
>>> "ABCDEF".encode("base64")
'QUJDREVG\n'
- With a five-byte input, we have one equals sign, and with six bytes of input, we have no equals signs; instead, we have a total of eight base64 characters.
- Let's go back to ABCD with the two equals signs:
>>> "ABCD".encode("base64")
'QUJDRA==\n'
- You can see how the padding was done by putting it in explicitly here:
>>> "ABCD\x00".encode("base64")
'QUJDRAA=\n'
With one explicit byte of zero, we now get a single equals sign.
- Let's put in a second byte of zero:
>>> "ABCD\x00\x00".encode("base64")
'QUJDRAAA\n'
We have no padding here, and we see that the last characters are all A, indicating that there's been a filling of binary zeros.
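The same comparison can be sketched in Python 3 (an assumption; the session above is Python 2) with the base64 module. It also shows why the equals signs matter: they tell the decoder how many trailing bytes were padding, not data.

```python
import base64

padded = base64.b64encode(b"ABCD")              # b'QUJDRA=='
one_zero = base64.b64encode(b"ABCD\x00")        # b'QUJDRAA='
two_zeros = base64.b64encode(b"ABCD\x00\x00")   # b'QUJDRAAA'

# All three outputs share the prefix QUJDRA; only the tail differs.
for encoded in (padded, one_zero, two_zeros):
    decoded = base64.b64decode(encoded)
    print(encoded, len(decoded))
# b'QUJDRA==' 4
# b'QUJDRAA=' 5
# b'QUJDRAAA' 6
```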
Binary data
The next issue is handling binary data. Executable files are binary and not ASCII. Also, images, movies, and many other files have binary data. Every byte of ASCII data has a zero as its most significant bit, which is not true of binary data, but base64 works fine with binary data. Here is a common executable file, a forensic utility; it starts with MZê and contains bytes that are not printable ASCII characters:
As this is a hex viewer, you see the raw data in hexadecimal, and on the right, it attempts to print it as ASCII. Windows programs have the string This program cannot be run in DOS mode near the start, but they also have a lot of unprintable bytes, such as FF and 00, which doesn't matter to Python at all. An easy way to encode data like that is to read it directly from the file. You can use the with statement. It opens a file by filename in read binary mode with the handle f, and then you can read it. The with statement is there to make sure that Python closes the handle when the block ends, even if an error occurs. You then encode the data exactly the same way as before. To decode data you've encoded in this fashion, you just take the output string and use .decode instead of .encode.
Now let's take a look at how to handle binary data:
- We will first exit Python so that we can see the filesystem, and then we'll look for the file whose name starts with Ac using the command shown here:
>>> exit()
$ ls Ac*
AccessData Registry Viewer_1.8.3.exe
There's the filename. Since it's rather long, we are just going to copy and paste it.
- Now we clear the screen using the following command:
$ clear
- We will start Python again:
$ python
- Now we use the following command:
>>> with open("AccessData Registry Viewer_1.8.3.exe", "rb") as f:
...     data = f.read()
...     print data.encode("base64")
Here we enter the filename first and then the mode, which is read binary. We give it the file handle f. We take all the data and put it in a single variable, data. Then we encode the data in base64 and print it. If you have an indented block in Python, you have to press Enter twice so it knows the block is done; it then runs the block and base64 encodes the data.
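For readers on Python 3, where the print statement and the "base64" string codec shown above are not available, a roughly equivalent sketch follows. To stay self-contained it does not read the real executable; instead it writes a small, made-up byte sequence (MZ followed by a few non-ASCII bytes) to a temporary file. The sample bytes and the temporary file are assumptions for illustration only:

```python
import base64
import os
import tempfile

# Hypothetical stand-in for an executable: "MZ" plus some non-ASCII bytes.
sample = b"MZ\x90\x00\x03\x00\x00\x00\xff\xff"

# Write the sample to a temporary file so the example runs anywhere.
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Read the file in binary mode; the with block closes the handle for us.
    with open(path, "rb") as f:
        data = f.read()
    encoded = base64.b64encode(data)       # bytes containing only base64 characters
    print(encoded.decode("ascii"))
    # Decoding returns the original bytes unchanged.
    restored = base64.b64decode(encoded)
    print(restored == data)
finally:
    os.remove(path)
```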
- You get a long block of base64 that is not very readable, but this is a handy way to handle data like that; say, if you want to email it or put it in some other text format. So, to do the decoding, let's encode something simpler so that we can easily see the result:
>>> "ABC".encode("base64")
'QUJD\n'
- If we want to play with it, we can put that in a variable c using the following commands:
>>> c = "ABC".encode("base64")
>>> print c
QUJD
- Printing c confirms that we have QUJD, which is what we expected. So, now we can decode it using the following command:
>>> c.decode("base64")
'ABC'
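The same round trip can be sketched in Python 3 (an assumption; the session above is Python 2) with the base64 module, which works on bytes rather than text strings. It also illustrates that the trailing newline does not affect decoding, because b64decode discards characters outside the base64 alphabet by default:

```python
import base64

c = base64.b64encode(b"ABC")
print(c)                                  # b'QUJD'

original = base64.b64decode(c)
print(original)                           # b'ABC'

# A trailing newline is ignored by the decoder.
print(base64.b64decode(b"QUJD\n"))        # b'ABC'
```

No key or secret is involved at any point, so anyone who sees the encoded text can recover the input.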
base64 is not encryption. It does not hide anything; it is just another way to represent the same data. In the next section, we'll cover XOR.