Encoding details
Up until now, we talked a lot about “algorithms”; however, we did not get too much into the specifics. In this section, we are going to see the major algorithms that are behind the serialization and deserialization processes in Protobuf. We are first going to see all the types that we can use for our fields, then with that, we are going to divide them into three categories, and finally, we are going to explain which algorithm is used for each category.
In Protobuf, types that are considered simple and that are provided by Protobuf out of the box are called scalar types. We can use 15 of such types, as listed here:
int32
int64
uint32
uint64
sint32
sint64
fixed32
fixed64
sfixed32
sfixed64
double
float
string
bytes
bool
And out of these 15 types, 10 are for integers (the 10 first ones). These types might be intimidating at first, but do not worry too much about how to choose between them right now; we are going to discuss that throughout this section. The most important thing to understand right now is that two-thirds of the types are for integers, and this shows what Protobuf is good at—encoding integers.
Now that we know the scalar types, let us separate these types into three categories. However, we are not here to make simple categories such as numbers, arrays, and so on. We want to make categories that are related to the Protobuf serialization algorithms. In total, we have three: fixed-size numbers, variable-size integers (varints), and length-delimited types. Here is a table with each category populated:
Fixed-size numbers |
Varints |
Length-delimited types |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Let’s go through each now.
Fixed-size numbers
The easiest one to understand for developers who are used to typed languages is fixed-size numbers. If you worked with lower-level languages in which you tried to optimize storage space, you know that we can, on most hardware, store an integer in 32 bits (4 bytes) or in 64 bits (8 bytes). fixed32
and fixed64
are just binary representations of a normal number that you would have in languages that give you control over the storage size of your integers (for example, Go, C++, Rust, and so on). If we serialize the number 42 into a fixed32
type, we will have the following:
$ cat fixed.txt | protoc --encode=Fixed32Value wrappers.proto | hexdump -C 00000000 0d 2a 00 00 00 |.*...| 00000005
Here, 2a
is 42
, and 0d
is a combination of the field tag and the type of the field (more about that later in this section). In the same manner, if we serialize 42
in a fixed64
type, we will have the following:
$ cat fixed.txt | protoc --encode=Fixed64Value wrappers.proto | hexdump -C 00000000 09 2a 00 00 00 00 00 00 00 |.*.......| 00000009
And the only thing that changed is the combination between the type of the field and the field tag (09
). This is mostly because we changed the type to 64-bit numbers.
Two other scalar types that are easy to understand are float
and double
. Once again, Protobuf produces the binary representation of these types. If we encode 42.42
as float
, we will get the following output:
$ cat floating_point.txt | protoc --encode=FloatValue wrappers.proto | hexdump -C 00000000 0d 14 ae 29 42 |...)B| 00000005
In this case, this is a little bit more complicated to decode, but this is simply because float numbers are encoded differently. If you are interested in this kind of data storage, you can look at the IEEE Standard for Floating-Point Arithmetic (IEEE 754), which explains how a float is formed in memory. What is important to note here is that floats are encoded in 4 bytes, and in front, we have our tag + type. And for a double
type with a value of 42.42
, we will get the following:
$ cat floating_point.txt | protoc --encode=DoubleValue wrappers.proto | hexdump -C 00000000 09 f6 28 5c 8f c2 35 45 40 |..(\..5E@| 00000009
This is encoded in 8 bytes and the tag + type. Note that the tag + type also changed here because we are in the realm of 64-bit numbers.
Finally, we are left with sfixed32
and sfixed64
. We did not mention it earlier, but fixed32
and fixed64
are unsigned numbers. This means that we cannot store negative numbers in fields with these types. sfixed32
and sfixed64
solve that. So, if we encode –42 in a sfixed32
type, we will have the following:
$ cat sfixed.txt | protoc --encode=SFixed32Value wrappers.proto | hexdump -C 00000000 0d d6 ff ff ff |.....| 00000005
This is obtained by taking the binary for 42, flipping all the bits (1’s complement), and adding one (2’s complement). Otherwise, if you serialize a positive number, you will have the same binary as the fixed32
type. Then, if we encode –42 in a field with type sfixed64
, we will get the following:
$ cat sfixed.txt | protoc --encode=SFixed64Value wrappers.proto | hexdump -C 00000000 09 d6 ff ff ff ff ff ff ff |.........| 00000009
This is like the sfixed32
type, only the tag + type was changed.
To summarize, fixed integers are simple binary representations of integers that resemble how they are stored in most computers’ memory. As their name suggests, their serialized data will always be serialized into the same number of bytes. For some use cases, this is fine to use such representations; however, in most cases, we would like to reduce the number of bits that are just here for padding. And in these use cases, we will use something called varints.
Varints
Now that we have seen fixed integers, let us move to another type of serialization for numbers: variable-length integers. As its name suggests, we will not get a fixed number of bytes when serializing an integer.
To be more precise, the smaller the integer, the smaller the number of bytes it will be serialized into, and the bigger the integer, the larger the number of bytes. Let us look at how the algorithm works.
In this example, let us serialize the number 300. To start, we are going to take the binary representation of that number:
100101100
With this binary, we can now split it into groups of 7 bits and pad with zeros if needed:
0000010 0101100
Now, since we lack 2 more bits to create 2 bytes, we are going to add 1 as the most significant bit (MSB) for all the groups except the first one, and we are going to add 0 as the MSB for the first group:
00000010 10101100
These MSBs are continuation bits. This means that, when we have 1, we still have 7 bits to read after, and if we have 0, this is the last group to be read. Finally, we put this number into little-endian order, and we have the following:
10101100 00000010
Or, we would have AC 02
in hexadecimal. Now that we have serialized 300 into AC 02
, and keeping in mind that deserialization is the opposite of serialization, we can deserialize that data. We take our binary representation for AC 02
, drop the continuation bits (MSBs), and we reverse the order of bytes. In the end, we have the following binary:
100101100
This is the same binary we started with. It equals 300.
Now, in the real world, you might have larger numbers. For a quick reference on positive numbers, here is a list of the thresholds at which the number of bytes will increase:
Threshold value |
Byte size |
0 |
0 |
1 |
1 |
128 |
2 |
16,384 |
3 |
2,097,152 |
4 |
268,435,456 |
5 |
34,359,738,368 |
6 |
4,398,046,511,104 |
7 |
562,949,953,421,312 |
8 |
72,057,594,037,927,936 |
9 |
9,223,372,036,854,775,807 |
9 |
An astute reader might have noticed that having a varint is often beneficial, but in some cases, we might encode our values into more bytes than needed. For example, if we encode 72,057,594,037,927,936 into an int64
type, it will be serialized into 9 bytes, while with a fixed64
type, it will be encoded into 8. Furthermore, a problem coming from the encoding that we just saw is that negative numbers will be encoded into a large positive number and thus will be encoded into 9 bytes. That begs the following question: How can we efficiently choose between the different integer types?
How to choose?
The answer is, as always, it depends. However, we can be systematic in our choices to avoid many errors. We mostly have three choices that we need to make depending on the data we want to serialize:
- The range of numbers needed
- The need for negative numbers
- The data distribution
The range
By now, you might have noticed that the 32 and 64 suffixes on our types are not always about the number of bits into which our data will be serialized. For varints, this is more about the range of numbers that can be serialized. These ranges are dependent on the algorithm used for serialization.
For fixed, signed, and variable-length integers, the range of numbers is the same as the one developers are used to with 32 and 64 bits. This means that we get the following:
[-2^(NUMBER_OF_BITS – 1), 2^(NUMBER_OF_BITS – 1) – 1]
Here, NUMBER_OF_BITS
is either 32 or 64 depending on the type you want to use.
For unsigned numbers (uint
)—this is again like what developers are expecting—we will get the following range:
[0, 2 * 2^(NUMBER_OF_BITS – 1) - 1]
The need for negative numbers
In the case where you simply do not need negative numbers (for example, for IDs), the ideal type to use is an unsigned integer (uint32
, uint64
). This will prevent you from encoding negative numbers, it will have twice the range in positive numbers compared to signed integers, and it will serialize using the varint algorithm.
And another type that you will potentially work with is the one for signed integers (sint32
, sint64
). We won’t go into details about how to serialize them, but the algorithm transforms any negative number into a positive number (ZigZag encoding) and serializes the positive number with the varint algorithm. This is more efficient for serializing negative numbers because instead of being serialized as a large positive number (9 bytes), we take advantage of the varint encoding. However, this is less efficient for serializing positive numbers because now we interleave the previously negative numbers and the positive numbers. This means that for the same positive number, we might have different amounts of encoding bytes.
The data distribution
Finally, one thing that is worth mentioning is that encoding efficiency is highly dependent on your data distribution. You might have chosen some types depending on some assumptions, but your actual data might be different. Two common examples are choosing an int32
or int64
type because we expect to have few negative values and choosing an int64
type because we expect to have few very big numbers. Both situations might result in significant inefficiencies because, in both cases, we might get a lot of values serialized into 9 bytes.
Unfortunately, there is no way of deciding the type that will always perfectly fit the data. In this kind of situation, there is nothing better than doing experiments on real data that is representative of your whole dataset. This will give you an idea of what you are doing correctly and what you are doing wrong.
Length-delimited types
Now that we’ve seen all the types for numbers, we are left with the length-delimited types. These are the types, such as string and bytes, from which we cannot know the length at compile time. Think about these as dynamic arrays.
To serialize such a dynamic structure, we simply prefix the raw data with the length of that data that is following. This means that if we have a string of length 10 and content “0123456789”, we will have the following sequence of bytes:
$ cat length-delimited.txt | protoc --encode=StringValue wrappers.proto | hexdump -C 00000000 0a 0a 30 31 32 33 34 35 36 37 38 39 |..0123456789| 0000000c
Here, the first 0a
instance is the field tag + type, the second 0a
instance is the hexadecimal representation of 10, and then we have the ASCII values for each character. To see why 0 turns into 30, you can check the ASCII manual by typing man ascii
in your terminal and looking for the hexadecimal set. You should have a similar output to the following:
30 0 31 1 32 2 33 3 34 4 35 5 36 6 37 7 38 8 39 9
Here, the first number of each pair is the hexadecimal value for the second one.
Another kind of message field that will be serialized into a length-delimited type is a repeated field. A repeated field is the equivalent of a list. To write such a field, we simply add the repeated
keyword before the field type. If we wanted to serialize a list of IDs, we could write the following:
repeated uint64 ids = 1;
And with this, we could store 0 or more IDs.
Similarly, these fields will be serialized with the length as a prefix. If we take the ids
field and serialize the numbers from 1 to 9, we will have the following:
$ cat repeated.txt | protoc --encode=RepeatedUInt64Values wrappers.proto | hexdump -C 00000000 0a 09 01 02 03 04 05 06 07 08 09 |...........| 0000000b
This is a list of 9 elements followed by 1, 2, … and so on.
Important note
Repeated fields are only serialized as length-delimited types when they are storing scalar types except for strings and bytes. These repeated fields are considered packed. For complex types or user-defined types (messages), the values will be encoded in a less optimal way. Each value will be encoded separately and prefixed by the type + tag byte(s) instead of having the type + tag serialized only once.
Field tags and wire types
Up until now, you read “tag + type” multiple times and we did not really see what this means. As mentioned, the first byte(s) of every serialized field will be a combination of the field type and the field tag. Let us start by seeing what a field tag is. You surely noticed something different about the syntax of a field. Each time we define a field, we add an equals sign and then an incrementing number. Here’s an example:
uint64 id = 1;
While they look like an assignment of value to the field, they are only here to give a unique identifier to the field. These identifiers, called tags, might look insignificant but they are the most important bit of information for serialization. They are used to tell Protobuf into which field to deserialize which data. As we saw earlier during the presentation of the different serialization algorithms, the field name is not serialized—only the type and the tag are. And thus, when deserialization kicks in, it will see a number and it will know where to redirect the following datum.
Now that we know that these tags are simply identifiers, let us see how these values are encoded. Tags are simply serialized as varints but they are serialized with a wire type. A wire type is a number that is given to a group of types in Protobuf. Here is the list of wire types:
Type |
Meaning |
Used for |
0 |
Varint |
|
1 |
64-bit |
|
2 |
Length-delimited |
string, bytes, packed repeated fields |
5 |
32-bit |
|
Here, 0 is the type for varints, 1 is for 64-bit, and so on.
To combine the tag and the wire type, Protobuf uses a concept called bit packing. This is a technique that is designed to reduce the number of bits into which the data will be serialized. In our case here, the data is the field metadata (the famous tag + type). So, here is how it works. The last 3 bits of the serialized metadata are reserved for the wire type, and the rest is for the tag. If we take the first example that we mentioned in the Fixed-size numbers section, where we serialized 42 in a fixed32
field with tag 1, we had the following:
0d 2a 00 00 00
This time we are only interested in the 0d
part. This is the metadata of the field. To see how this was serialized, let us turn 0d
into binary (with 0 padding):
00001101
Here, we have 101 (5) for the wire type—this is the wire type for 32 bits—and we have 00001 (1) for tag 1. Now, since the tag is serialized as a varint, it means that we could have more than 1 byte for that metadata. Here’s a reference for knowing the thresholds at which the number of bytes will increase:
Field tag |
Size (in bits) |
1 |
5 |
16 |
13 |
2,048 |
21 |
262,144 |
29 |
33,554,432 |
37 |
536,870,911 |
37 |
This means that, as fields without values set to them will not be serialized, we need to keep the lowest tags to the fields that are the most often populated. This will lower the overhead needed to store the metadata. In general, 15 tags are enough, but if you come up with a situation where you need more tags, you might consider moving a group of data into a new message with lower tags.